Jovanovic, P.; Romero, O.; Simitsis, A.; Abello, A. IEEE Transactions on Knowledge and Data Engineering Vol. 28, num. 5, p. 1203-1216 DOI: 10.1109/TKDE.2016.2515609 Date of publication: 2016-01-07 Journal article
Business intelligence (BI) systems depend on efficient integration of disparate and often heterogeneous data. The integration of data is governed by data-intensive flows and is driven by a set of information requirements. Designing such flows is in general a complex process, which, due to the complexity of business environments, is hard to perform manually. In this paper, we deal with the challenge of efficient design and maintenance of data-intensive flows and propose an incremental approach, namely CoAl, for semi-automatically consolidating data-intensive flows satisfying a given set of information requirements. CoAl works at the logical level and consolidates data flows from either high-level information requirements or platform-specific programs. As CoAl integrates a new data flow, it opts for maximal reuse of existing flows and applies a customizable cost model tuned for minimizing the overall cost of a unified solution. We demonstrate the efficiency and effectiveness of our approach through an experimental evaluation using our implemented prototype.
Varga, J.; Etcheverry, L.; Vaisman, A.; Romero, O.; Pedersen, T.; Thomsen, C. IEEE International Conference on Data Engineering p. 1346-1349 DOI: 10.1109/ICDE.2016.7498341 Presentation's date: 2016 Presentation of work at congresses
Publication and sharing of multidimensional (MD) data on the Semantic Web (SW) opens new opportunities for the use of On-Line Analytical Processing (OLAP). The RDF Data Cube (QB) vocabulary, the current standard for statistical data publishing, however, lacks key MD concepts such as dimension hierarchies and aggregate functions. QB4OLAP was proposed to remedy this. However, QB4OLAP requires extensive manual annotation and users must still write queries in SPARQL, the standard query language for RDF, which typical OLAP users are not familiar with. In this demo, we present QB2OLAP, a tool for enabling OLAP on existing QB data. Without requiring any RDF, QB(4OLAP), or SPARQL skills, it allows semi-automatic transformation of a QB data set into a QB4OLAP one via enrichment with QB4OLAP semantics, exploration of the enriched schema, and querying with the high-level OLAP language QL that exploits the QB4OLAP semantics and is automatically translated to SPARQL.
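To give a flavor of the SPARQL that QL shields users from, here is a minimal sketch using the SPARQLWrapper Python library; the endpoint URL and the ex: properties are invented for illustration, while qb: is the standard QB vocabulary and skos:broader is how QB4OLAP relates a member to its parent in a dimension hierarchy:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical endpoint; ex: properties are invented for illustration.
    endpoint = SPARQLWrapper("http://example.org/sparql")
    endpoint.setQuery("""
        PREFIX qb:   <http://purl.org/linked-data/cube#>
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        PREFIX ex:   <http://example.org/>
        SELECT ?month (SUM(?amount) AS ?total)
        WHERE {
          ?obs a qb:Observation ;
               ex:amount  ?amount ;    # measure
               ex:refTime ?day .       # dimension property
          ?day skos:broader ?month .   # roll up one level in the hierarchy
        }
        GROUP BY ?month
    """)
    endpoint.setReturnFormat(JSON)
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["month"]["value"], row["total"]["value"])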
In recent years, the problems of using generic storage techniques (i.e., relational) for very specific applications have been detected and outlined, and, as a consequence, some alternatives to relational DBMSs (e.g., HBase) have bloomed. Most of these alternatives sit on the cloud and benefit from cloud computing, which is nowadays a reality that helps save money by eliminating fixed hardware and software costs in favor of a pay-per-use model. On top of this, specific querying frameworks to exploit the brute force of the cloud (e.g., MapReduce) have also been devised. The question that arises next is whether this (rather naive) exploitation of the cloud is an alternative to tuning DBMSs, or whether it still makes sense to consider other options when retrieving data from these settings. In this paper, we study the feasibility of solving OLAP queries with Hadoop (the Apache project implementing MapReduce) while benefiting from secondary indexes and partitioning in HBase. Our main contribution is the comparison of different access plans and the definition of criteria (i.e., cost estimation) to choose among them in terms of consumed resources (namely CPU, bandwidth, and I/O).
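As a rough illustration of this kind of cost-based plan choice (a generic sketch with invented constants, not the paper's actual cost model), the following compares the estimated I/O of a full table scan against a secondary-index access for a given predicate selectivity:

    # Hedged sketch: choose between a full HBase table scan and a
    # secondary-index access plan by comparing estimated I/O.
    # All constants and formulas are illustrative, not the paper's model.

    def full_scan_cost(num_rows, row_size, block_size=64 * 1024):
        """Sequential I/O: read every block of the table."""
        return num_rows * row_size / block_size

    def index_access_cost(num_rows, selectivity, index_probe_cost=1.0):
        """Random I/O: one index probe plus roughly one read per matching row."""
        return index_probe_cost + num_rows * selectivity

    def choose_plan(num_rows, row_size, selectivity):
        scan = full_scan_cost(num_rows, row_size)
        index = index_access_cost(num_rows, selectivity)
        return ("index", index) if index < scan else ("scan", scan)

    # A highly selective predicate makes the index plan the cheaper option.
    print(choose_plan(10_000_000, 200, 0.0001))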
Data integration aims to facilitate the exploitation of heterogeneous data by providing the user with a unified view of data residing in different sources. Currently, ontologies are commonly used to represent this unified view in terms of a global target schema due to their flexibility and expressiveness. However, most approaches still assume a predefined target schema and focus on generating the mappings between this schema and the sources.
In this paper, we propose a solution that supports data integration tasks by employing semi-automatic ontology construction to create a target schema on the fly. To that end, we revisit existing ontology extraction, matching and merging techniques and integrate them into a single end-to-end system. Moreover, we extend the used techniques with the automatic generation of mappings between the extracted ontologies and the underlying data sources. Finally, to demonstrate the usefulness of our solution, we integrate it with an independent data integration system.
Ghrab, A.; Romero, O.; Skhiri, S.; Vaisman, A.; Zimányi, E. East-European Conference on Advances in Databases and Information Systems p. 92-105 DOI: 10.1007/978-3-319-23135-8_7 Presentation's date: 2015-09 Presentation of work at congresses
Graphs are widespread structures providing a powerful abstraction for modeling networked data. Large and complex graphs have emerged in various domains such as social networks, bioinformatics, and chemical data. However, current warehousing frameworks are not equipped to efficiently handle the multidimensional modeling and analysis of complex graph data. In this paper, we propose a novel framework for building OLAP cubes from graph data and analyzing the graph topological properties. The framework supports the extraction and design of the candidate multidimensional spaces in property graphs. In addition to property graphs, we introduce a new database model tailored for multidimensional modeling that enables the exploration of additional candidate multidimensional spaces. We present novel techniques for OLAP aggregation of the graph and discuss the case of dimension hierarchies in graphs. Furthermore, we present the architecture and implementation of our graph warehousing framework and show the effectiveness of our approach.
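For intuition, a graph roll-up can be pictured as grouping nodes by a dimension attribute and aggregating the edges between the resulting groups. The sketch below is a minimal illustration with invented data; it is not the framework's implementation:

    # Hedged sketch of a graph roll-up: collapse nodes by a dimension attribute
    # (here, city -> country) and aggregate the weights of the edges between
    # the resulting groups. Data and attribute names are invented.
    from collections import defaultdict

    country = {"bcn": "ES", "mad": "ES", "par": "FR"}   # node -> country level
    edges = [("bcn", "par", 3), ("mad", "par", 5), ("bcn", "mad", 2)]

    rolled = defaultdict(int)
    for src, dst, weight in edges:
        rolled[(country[src], country[dst])] += weight  # SUM over grouped edges

    print(dict(rolled))  # {('ES', 'FR'): 8, ('ES', 'ES'): 2}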
Jovanovic, P.; Romero, O.; Simitsis, A.; Abello, A.; Candón, H.; Nadal, S. International Conference on Extending Database Technology p. 549-552 DOI: 10.5441/002/edbt.2015.55 Presentation's date: 2015-03-25 Presentation of work at congresses
The design lifecycle of a data warehousing (DW) system is primarily led by requirements of its end-users and the complexity of underlying data sources. The process of designing a multidimensional (MD) schema and back-end extract-transform-load (ETL) processes is a long-term and mostly manual task. As enterprises shift to more real-time and 'on-the-fly' decision making, business intelligence (BI) systems require automated means for efficiently adapting a physical DW design to frequent changes of business needs. To address this problem, we present Quarry, an end-to-end system for assisting users of various technical skills in managing the incremental design and deployment of MD schemata and ETL processes. Quarry automates the physical design of a DW system from high-level information requirements. Moreover, Quarry provides tools for efficiently accommodating MD schema and ETL process designs to new or changed information needs of its end-users. Finally, Quarry facilitates the deployment of the generated DW design over an extensible list of execution engines. On-site, we will use a variety of examples to show how Quarry helps manage the complexity of the DW design lifecycle.
Abello, A.; Romero, O.; Pedersen, T.; Berlanga, R.; Nebot, V.; Aramburu, M.; Simitsis, A. IEEE Transactions on Knowledge and Data Engineering Vol. 27, num. 2, p. 571-588 DOI: 10.1109/TKDE.2014.2330822 Date of publication: 2015-02-01 Journal article
This paper describes the convergence of some of the most influential technologies in the last few years, namely data warehousing (DW), on-line analytical processing (OLAP), and the Semantic Web (SW). OLAP is used by enterprises to derive important business-critical knowledge from data inside the company. However, the most interesting OLAP queries can no longer be answered on internal data alone; external data must also be discovered (most often on the web), acquired, integrated, and (analytically) queried, resulting in a new type of OLAP: exploratory OLAP. When using external data, an important issue is knowing the precise semantics of the data. Here, SW technologies come to the rescue, as they allow semantics (ranging from very simple to very complex) to be specified for web-available resources. SW technologies do not only support capturing the "passive" semantics, but also support active inference and reasoning on the data. The paper first presents a characterization of DW/OLAP environments, followed by an introduction to the relevant SW foundation concepts. Then, it describes the relationship of multidimensional (MD) models and SW technologies, including the relationship between MD models and SW formalisms. Next, the paper goes on to survey the use of SW technologies for data modeling and data provisioning, including semantic data annotation and semantic-aware extract, transform, and load (ETL) processes. Finally, all the findings are discussed and a number of directions for future research are outlined, including SW support for intelligent MD querying, using SW technologies for providing context to data warehouses, and scalability issues.
Varga, J.; Romero, O.; Pedersen, T.; Thomsen, C. International Workshop On Data Warehousing and OLAP p. 57-66 DOI: 10.1145/2666158.2666182 Presentation's date: 2014-11-07 Presentation of work at congresses
Next generation BI systems emerge as platforms where traditional BI tools meet semi-structured and unstructured data coming from the Web. In these settings, the user-centric orientation represents a key characteristic for the acceptance and wide usage by numerous and diverse end users in their data analysis tasks. System and user related metadata are the base for enabling user assistance features. However, current approaches typically store these metadata in ad-hoc manners. In this paper, we propose a generic and extensible approach for the definition and modeling of the relevant metadata artifacts. We present SM4AM, a Semantic Metamodel for Analytical Metadata created as an RDF formalization of the Analytical Metadata artifacts needed for user assistance exploitation purposes in next generation BI systems. We consider the Linked Data initiative and its relevance for user assistance functionalities. We discuss the metamodel benefits and present directions for future work.
Varga, J.; Romero, O.; Pedersen, T.; Thomsen, C. International Conference on Data Warehousing and Knowledge Discovery p. 89-101 DOI: 10.1007/978-3-319-10160-6_9 Presentation's date: 2014-09 Presentation of work at congresses
Next generation Business Intelligence (BI) systems require integration of heterogeneous data sources and a strong user-centric orientation. Both needs entail machine-processable metadata to enable automation and allow end users to gain access to relevant data for their decision making processes. Although evidently needed, there is no clear picture about the necessary metadata artifacts, especially considering user support requirements. Therefore, we propose a comprehensive metadata framework to support the user assistance activities and their automation in the context of next generation BI systems. This framework is based on the findings of a survey of current user-centric approaches mainly focusing on query recommendation assistance. Finally, we discuss the benefits of the framework and present the plans for future work.
Designing data warehouse (DW) systems in highly dynamic enterprise environments is not an easy task. At each moment, the multidimensional (MD) schema needs to satisfy the set of information requirements posed by the business users. At the same time, the diversity and heterogeneity of the data sources need to be considered in order to properly retrieve the needed data. Frequent arrival of new business needs requires the system to be adaptable to changes. To cope with such an inevitable complexity (both at the beginning of the design process and when potential evolution events occur), in this paper we present a semi-automatic method called ORE for creating DW designs in an iterative fashion based on a given set of information requirements. Requirements are first considered separately. For each requirement, ORE expects the set of possible MD interpretations of the source data needed for that requirement (in a form similar to an MD schema). Incrementally, ORE builds the unified MD schema that satisfies the entire set of requirements and meets some predefined quality objectives. We have implemented ORE and performed a number of experiments to study our approach. We have also conducted a limited-scale case study to investigate its usefulness to designers.
Martin, C.; Urpi, T.; Burgues, X.; Romero, O.; Abello, A.; Casany, M.J.; Quer, C.; Rodríguez, M. Elena Congrés Internacional de Docència Universitària i Innovació p. 1-18 Presentation's date: 2014-07-02 Presentation of work at congresses
The shift to the new European Higher Education Area led the Facultat d'Informàtica de Barcelona of the Universitat Politècnica de Catalunya to incorporate transversal generic competences into its study plans. This article presents how the competence "proper attitude towards work" has been integrated into the database courses of the Bachelor's Degree in Informatics Engineering, specialization in Software Engineering, describes the assessment method used, and discusses the results obtained over the last three years.
Raventos, R.; García, S.; Romero, O.; Abello, A.; Viñas, J. European Business Intelligence Summer School p. 1-38 DOI: 10.1007/978-3-319-17551-5_1 Presentation's date: 2014-07 Presentation of work at congresses
Chagas disease is classified as a life-threatening disease by the World Health Organization (WHO) and currently causes the death of 534,000 people every year. In order to advance with disease control, the WHO presented a strategy that included the development of the Chagas Information Database (CID) for surveillance, to raise awareness about Chagas. CID is defined as a decision-support system to support national and international authorities in both their day-to-day and long-term decision making. The requirements engineering for this project was particularly complex, and Pohl's framework was followed. This paper describes the results of applying the framework in this project; thus, it focuses on the requirements engineering stage. The difficulties found motivated the further study and analysis of the complexity of requirements engineering in decision-support systems and of the feasibility of using said framework.
The vision of an interconnected and open Web of data is still a chimera, far from being accomplished. Fortunately, though, one can find several signs of progress in this direction, and despite the technical challenges behind such an approach, recent advances have shown its feasibility. Semantic-aware formalisms (such as RDF and ontology languages) have been successfully put in practice in approaches such as Linked Data, whereas movements like Open Data have stressed the need for a new open access paradigm to guarantee free access to Web data.
Given such a promising scenario, traditional business intelligence (BI) techniques and methods have been shown not to be appropriate. BI was born to support decision making within organizations, and the data warehouse, the most popular IT construct to support BI, has typically been nurtured with data either owned by or accessible within the organization. With the new linked open data paradigm, BI systems must meet new requirements, such as providing on-demand analysis tasks over any relevant (either internal or external) data source in right-time. In this paper we discuss the technical challenges behind such requirements, which we refer to as exploratory BI, and envision a new kind of BI system to support this scenario.
The traditional way to manage Information Technologies (IT) in companies is to have a data center and to license monolithic applications based on the number of CPUs, allowed connections, etc. This also holds for Business Intelligence environments. Nevertheless, technologies have evolved and today other approaches are possible. Specifically, the service paradigm allows outsourcing hardware as well as software in a pay-as-you-go model.
Jovanovic, P.; Romero, O.; Simitsis, A.; Abello, A. International Conference on Conceptual Modeling p. 391-395 DOI: 10.1007/978-3-642-33999-8_47 Presentation's date: 2012 Presentation of work at congresses
We present our tool for assisting designers in the error-prone and time-consuming tasks carried out at the early stages of a data warehousing project. Our tool semi-automatically produces multidimensional (MD) and ETL conceptual designs from a given set of business requirements (like SLAs) and data source descriptions. Subsequently, our tool translates both the MD and ETL conceptual designs produced into physical designs, so they can be further deployed on a DBMS and an ETL engine. In this paper, we describe the system architecture and present our demonstration proposal by means of an example.
Romero, O.; Jovanovic, P.; Simitsis, A.; Abello, A. International Conference on Data Warehousing and Knowledge Discovery p. 65-80 DOI: 10.1007/978-3-642-32584-7_6 Presentation's date: 2012 Presentation of work at congresses
Data warehouse (DW) design is based on a set of requirements expressed as service level agreements (SLAs) and business level objects (BLOs). Populating a DW system from a set of information sources is realized with extract-transform-load (ETL) processes based on SLAs and BLOs. The entire task is complex, time consuming, and hard to perform manually. This paper presents our approach to the requirement-driven creation of ETL designs. Each requirement is considered separately and a respective ETL design is produced. We propose an incremental method for consolidating these individual designs and creating an ETL design that satisfies all given requirements. Finally, the design produced is sent to an ETL engine for execution. We illustrate our approach through an example based on TPC-H and report on our experimental findings, which show the effectiveness and quality of our approach.
Romero, O.; Marcel, P.; Abello, A.; Peralta, V.; Bellatreche, L. International Conference on Data Warehousing and Knowledge Discovery p. 224-239 DOI: 10.1007/978-3-642-23544-3_17 Presentation's date: 2011-09-02 Presentation of work at congresses
Romero, O.; Simitsis, A.; Abello, A. International Conference on Data Warehousing and Knowledge Discovery p. 80-95 DOI: 10.1007/978-3-642-23544-3_7 Presentation's date: 2011-08-29 Presentation of work at congresses
Romero, O.; Abello, A. International Conference on Scientific and Statistical Database Management p. 594-595 DOI: 10.1007/978-3-642-22351-8_51 Presentation's date: 2011-07-21 Presentation of work at congresses
This chapter describes the convergence of two of the most influential technologies in the last decade, namely business intelligence (BI) and the Semantic Web (SW). Business intelligence is used by almost any enterprise to derive important business-critical knowledge from both internal and (increasingly) external data. When using external data, most often found on the Web, the most important issue is knowing the precise semantics of the data. Without this, the results cannot be trusted. Here, Semantic Web technologies come to the rescue, as they allow semantics ranging from very simple to very complex to be specified for any web-available resource. SW technologies do not only support capturing the “passive” semantics, but also support active inference and reasoning on the data. The chapter first presents a motivating running example, followed by an introduction to the relevant SW foundation concepts. The chapter then goes on to survey the use of SW technologies for data integration, including semantic data annotation and semantics-aware extract, transform, and load processes (ETL). Next, the chapter describes the relationship of multidimensional (MD) models and SW technologies, including the relationship between MD models and SW formalisms, and the use of advanced SW reasoning functionality on MD models. Finally, the chapter describes in detail a number of directions for future research, including SW support for intelligent BI querying, using SW technologies for providing context to data warehouses, and scalability issues. The overall conclusion is that SW technologies are very relevant for the future of BI, but that several new developments are needed to reach the full potential.
At the early stages of a data warehouse design project, the main objective is to collect the business requirements and needs, and translate them into an appropriate conceptual, multidimensional design. Typically, this task is performed manually, through a series of interviews involving two different parties: the business analysts and technical designers. Producing an appropriate conceptual design is an error-prone task that undergoes several rounds of reconciliation and redesigning, until the business needs are satisfied. It is of great importance for the business of an enterprise to facilitate and automate such a process. The goal of our research is to provide designers with a semi-automatic means for producing conceptual multidimensional designs and also conceptual representations of the extract-transform-load (ETL) processes that orchestrate the data flow from the operational sources to the data warehouse constructs. In particular, we describe a method that combines information about the data sources with the business requirements, for validating and completing (if necessary) these requirements, producing a multidimensional design, and identifying the ETL operations needed. We present our method in terms of the TPC-DS benchmark and show its applicability and usefulness.
Previous experiences in the data warehouse field have shown that the data warehouse multidimensional conceptual schema must be derived from a hybrid approach, i.e., by considering both the end-user requirements and the data sources as first-class citizens. As in any other system, requirements guarantee that the system devised meets the end-user needs. In addition, since the data warehouse design task is a reengineering process, it must consider the underlying data sources of the organization: (i) to guarantee that the data warehouse can be populated with data available within the organization, and (ii) to allow the end-user to discover unknown additional analysis capabilities. Currently, several methods for supporting the data warehouse modeling task have been proposed. However, they suffer from some significant drawbacks. In short, requirement-driven approaches assume that requirements are exhaustive (and therefore do not consider the data sources to contain alternative interesting evidence for analysis), whereas data-driven approaches (i.e., those leading the design task from a thorough analysis of the data sources) rely on discovering as much multidimensional knowledge as possible from the data sources. As a consequence, data-driven approaches generate too many results, which mislead the user. Furthermore, automating the design task is essential in this scenario, as it removes the dependency on an expert's ability to properly apply the chosen method, as well as the need to analyze the data sources, which is a tedious and time-consuming task (and can be unfeasible when working with large databases). In this sense, current automatable methods follow a data-driven approach, whereas current requirement-driven approaches overlook process automation, since they tend to work with requirements at a high level of abstraction. Indeed, this scenario is repeated in the data-driven and requirement-driven stages within current hybrid approaches, which suffer from the same drawbacks as pure data-driven or requirement-driven approaches.
In this thesis we introduce two different approaches for automating the multidimensional design of the data warehouse: MDBE (Multidimensional Design Based on Examples) and AMDO (Automating the Multidimensional Design from Ontologies). Both approaches were devised to overcome the limitations from which current approaches suffer. Importantly, our approaches start from opposite initial assumptions, but both consider the end-user requirements and the data sources as first-class citizens. (1) MDBE follows a classical approach, in which the end-user requirements are well known beforehand. This approach benefits from the knowledge captured in the data sources, but guides the design task according to the requirements; consequently, it is able to work with semantically poorer data sources. In other words, given high-quality end-user requirements, we can guide the process from the knowledge they contain and overcome data sources of poor semantic quality. (2) AMDO, as its counterpart, assumes a scenario in which the available data sources are semantically richer. Thus, the proposed approach is guided by a thorough analysis of the data sources, which is properly adapted to shape the output result according to the end-user requirements. In this context, having high-quality data sources lets us overcome the lack of expressive end-user requirements. Importantly, our methods establish a combined and comprehensive framework that can be used to decide, according to the inputs provided in each scenario, which approach is best to follow. For example, we cannot follow the same approach in a scenario where the end-user requirements are clear and well known as in a scenario where the end-user requirements are not evident or cannot be easily elicited (e.g., when the users are not aware of the analysis capabilities of their own sources). Interestingly, the need for requirements beforehand is smoothed by having semantically rich data sources; in their absence, requirements gain relevance for extracting the multidimensional knowledge from the sources. Thus, we provide two approaches whose combination turns out to be exhaustive with regard to the scenarios discussed in the literature.
Romero, O.; Calvanese, D.; Abello, A.; Rodríguez , M. International Workshop On Data Warehousing and OLAP p. 1-8 DOI: 10.1145/1651291.1651293 Presentation's date: 2009-11-06 Presentation of work at congresses
Nowadays, it is widely accepted that the data warehouse design task should be largely automated. Furthermore, the data warehouse conceptual schema must be structured according to the multidimensional model and, as a consequence, the most common way to automatically look for subjects and dimensions of analysis is by discovering functional dependencies (as dimensions functionally depend on the fact) over the data sources. Most advanced methods for automating the design of the data warehouse carry out this process from relational OLTP systems, assuming that an RDBMS is the most common kind of data source we may find, and taking a relational schema as starting point. In contrast, in our approach we propose to rely instead on a conceptual representation of the domain of interest, formalized through a domain ontology expressed in the DL-Lite Description Logic. We propose an algorithm to discover functional dependencies from the domain ontology that exploits the inference capabilities of DL-Lite, thus fully taking into account the semantics of the domain. We also provide an evaluation of our approach in a real-world scenario.
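For intuition, the dependencies at play can be stated in standard terms (a generic formulation, not a quotation from the paper): a role R is functional, written (funct R), when

    \forall x, y, z \,.\; R(x,y) \wedge R(x,z) \rightarrow y = z

so a candidate dimension level D functionally depends on a fact concept F whenever F reaches D through a chain of roles each asserted, or inferred by the DL-Lite reasoner, to be functional, i.e., each fact instance rolls up to exactly one member of D.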
Some research efforts have proposed automating the data warehouse design in order to free this task from being (completely) performed by an expert and to facilitate the whole process. Most advanced approaches work exclusively over relational sources and perform a detailed analysis of the data sources to identify the multidimensional concepts in a reengineering process. Starting from a logical schema, however, may present some inconveniences. A logical schema is tied to the design decisions made when devising the system, and these decisions, whether made to fulfill the system requirements (for instance, to improve query answering, avoid insertion/deletion anomalies, or preserve features inherited from legacy systems) or naively made by non-expert users, have a big impact on the quality of the multidimensional schemas obtained by current automatable approaches. In this paper, we introduce our approach for automatically deriving the multidimensional schema from a domain ontology. Our goals are mainly two: (i) to improve the quality of the output obtained (by working over a conceptual formalization of the domain instead of a logical one), and (ii) to automate the process. This second goal is the main reason for choosing ontologies over other conceptual formalizations, as ontology languages provide reasoning services that facilitate the automation of our task.
Object identification is a crucial step in most information systems. Nowadays, we have many different ways to identify entities, such as surrogates, keys, and object identifiers. However, not all of them guarantee entity identity. Many works in the literature address the discovery of meaningful keys, but all of them work at the logical or data level and share some inherent constraints. Addressing the problem at the logical level, we may miss some important data dependencies, while the cost of identifying data dependencies at the data level may not be affordable. In this paper we propose an approach for discovering meaningful keys from domain ontologies. In our approach, we guide the process at the conceptual level and introduce a set of pruning rules that improve performance by reducing the number of key hypotheses to be generated and verified against the data. Finally, we also present a simulation over a real-world case study to show the feasibility of our method.
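To make the hypothesis space concrete, here is a minimal levelwise sketch (a generic illustration, not the paper's conceptual-level pruning rules) that generates key hypotheses of growing size, prunes supersets of already-confirmed keys, and verifies each remaining hypothesis against the data:

    # Hedged sketch: levelwise key discovery with superset pruning.
    # Verifying on data is the expensive step the pruning rules aim to reduce.
    from itertools import combinations

    def is_key(rows, attrs):
        """A set of attributes is a key if its projection has no duplicates."""
        seen = set()
        for row in rows:
            proj = tuple(row[a] for a in attrs)
            if proj in seen:
                return False
            seen.add(proj)
        return True

    def discover_keys(rows, attributes):
        keys = []
        for size in range(1, len(attributes) + 1):
            for combo in combinations(attributes, size):
                if any(set(k) <= set(combo) for k in keys):
                    continue  # pruning: supersets of a key are redundant
                if is_key(rows, combo):
                    keys.append(combo)
        return keys

    rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "a"}]
    print(discover_keys(rows, ["id", "name"]))  # [('id',)]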
The ideal scenario to derive the multidimensional conceptual schema of a data warehouse would entail a hybrid approach (i.e., a combined data-driven and requirement-driven approach). Thus, the resulting multidimensional schema would satisfy the end-user requirements and would have been conciliated with the data sources. Currently, most methodologies follow either a data-driven or a requirement-driven paradigm, and only a few of them follow a hybrid approach. Furthermore, current hybrid methodologies are unbalanced and do not benefit from all the advantages brought by each paradigm. In this paper we present a novel methodology that derives conceptual multidimensional schemas from relational sources, bearing in mind the end-user requirements. The most relevant step within our methodology is the MDBE method, which introduces three main benefits with regard to previous approaches: (i) MDBE is a fully automatic approach and therefore also handles requirements automatically; (ii) unlike data-driven methods, we focus on data of interest to the end-user, yet, since the user may not know all the potential analyses contained in the data sources, MDBE, unlike requirement-driven approaches, is able to propose new interesting multidimensional knowledge related to concepts already queried by the user; and (iii) MDBE proposes meaningful multidimensional schemas derived from a validation process, so the schemas proposed are sound and meaningful.
Discovering functional dependencies is a fundamental step in the design of relational databases and in most system reengineering processes, such as system maintainability and redesign. Typically, this task has been performed over relational databases, at the logical or physical level. Those works addressing it at the logical level often make some unrealistic assumptions (such as completeness of the data structures, or semantically related attributes having similar names), while those addressing it at the physical level propose solutions that are computationally expensive, whose performance deteriorates with a large number of attributes or instances, and which cannot tolerate erroneous data. To overcome these limitations, together with the fact that data representations at the logical or physical level may miss some important data dependencies, we propose to rely instead on a conceptual representation of the domain of interest, which is readily available for many systems built according to current software reengineering practices. Specifically, we rely on conceptual schemas specified in ER or as UML class diagrams, and formalized through a domain ontology expressed in the DL-Lite Description Logic (DL). We propose an algorithm to discover functional dependencies from the domain ontology that exploits the inference capabilities of the DL, thus fully taking into account the semantics of the domain. We also provide an evaluation of our approach in a real-world scenario.
The goal of this demonstration is to present MDBE, a tool implementing our methodology for automatically deriving multidimensional schemas from relational sources, bearing in mind the end-user requirements. Our approach starts by gathering the end-user information requirements, which are mapped over the data sources as SQL queries. Based on the constraints that a query must preserve to make multidimensional sense, MDBE automatically derives multidimensional schemas that agree with both the input requirements and the data sources.
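For instance, a requirement such as "revenue per customer nation and order year" could be posed over hypothetical relational sources as the query below (the schema is invented for illustration); its SELECT/GROUP BY shape exposes the measure and dimension roles that MDBE validates:

    SELECT   c.nation,                    -- dimension level
             o.order_year,                -- dimension level
             SUM(l.price) AS revenue      -- measure (aggregated data)
    FROM     lineitem l
    JOIN     orders   o ON l.order_id = o.order_id
    JOIN     customer c ON o.cust_id  = c.cust_id
    GROUP BY c.nation, o.order_year;      -- the multidimensional space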