Jovanovic, P.; Romero, O.; Simitsis, A.; Abello, A. IEEE transactions on knowledge and data engineering Vol. 28, num. 5, p. 1203-1216 DOI: 10.1109/TKDE.2016.2515609 Data de publicació: 2016-01-07 Article en revista
Business intelligence (BI) systems depend on efficient integration of disparate and often heterogeneous data. The integration of data is governed by data-intensive flows and is driven by a set of information requirements. Designing such flows is in general a complex process, which due to the complexity of business environments is hard to be done manually. In this paper, we deal with the challenge of efficient design and maintenance of data-intensive flows and propose an incremental approach, namely CoAl, for semi-automatically consolidating data-intensive flows satisfying a given set of information requirements. CoAl works at the logical level and consolidates data flows from either high-level information requirements or platform-specific programs. As CoAl integrates a new data flow, it opts for maximal reuse of existing flows and applies a customizable cost model tuned for minimizing the overall cost of a unified solution. We demonstrate the efficiency and effectiveness of our approach through an experimental evaluation using our implemented prototype.
Abello, A.; Romero, O.; Bach, T.; Berlanga, R.; Nebot, V.; Aramburu, M.; Simitsis, A. IEEE transactions on knowledge and data engineering Vol. 27, num. 2, p. 571-588 DOI: 10.1109/TKDE.2014.2330822 Data de publicació: 2015-02-01 Article en revista
This paper describes the convergence of some of the most influential technologies in the last few years, namely data warehousing (DW), on-line analytical processing (OLAP), and the Semantic Web (SW). OLAP is used by enterprises to derive important business-critical knowledge from data inside the company. However, the most interesting OLAP queries can no longer be answered on internal data alone, external data must also be discovered (most often on the web), acquired, integrated, and (analytically) queried, resulting in a new type of OLAP, exploratory OLAP. When using external data, an important issue is knowing the precise semantics of the data. Here, SW technologies come to the rescue, as they allow semantics (ranging from very simple to very complex) to be specified for web-available resources. SW technologies do not only support capturing the "passive" semantics, but also support active inference and reasoning on the data. The paper first presents a characterization of DW/OLAP environments, followed by an introduction to the relevant SW foundation concepts. Then, it describes the relationship of multidimensional (MD) models and SW technologies, including the relationship between MD models and SW formalisms. Next, the paper goes on to survey the use of SW technologies for data modeling and data provisioning, including semantic data annotation and semantic-aware extract, transform, and load (ETL) processes. Finally, all the findings are discussed and a number of directions for future research are outlined, including SW support for intelligent MD querying, using SW technologies for providing context to data warehouses, and scalability issues.
The discovery of process models from event logs has emerged as one of the crucial problems for enabling the continuous support in the life-cycle of an information system. However, in a decade of process discovery research, the algorithms and tools that have appeared are known to have strong limitations in several dimensions. The size of the logs and the formal properties of the model discovered are the two main challenges nowadays. In this paper we propose the use of numerical abstract domains for tackling these two problems, for the particular case of the discovery of Petri nets. First, numerical abstract domains enable the discovery of general process models, requiring no knowledge (e.g., the bound of the Petri net to derive) for the discovery algorithm. Second, by using divide and conquer techniques we are able to control the size of the process discovery problems. The methods proposed in this paper have been implemented in a prototype tool and experiments are reported illustrating the significance of this fresh view of the process discovery problem.
Parra-Arnau, J.; Perego, A.; Ferrari, E.; Forne, J.; Rebollo-Monedero, D. IEEE transactions on knowledge and data engineering Vol. 26, num. 1, p. 180-193 DOI: 10.1109/TKDE.2012.248 Data de publicació: 2014-01-01 Article en revista
Collaborative tagging is one of the most popular services available online, and it allows end user to loosely classify either online or offline resources based on their feedback, expressed in the form of free-text labels (i.e., tags). Although tags are not per se sensitive information, the wide use of collaborative tagging services increases the risk of cross referencing, thereby seriously compromising user privacy. In this paper, we make a first contribution in this direction by showing how a specific privacy-enhancing technology, namely tag suppression, can be used to protect end-user privacy. Moreover, we analyze how our approach can affect the effectiveness of a policy-based collaborative tagging system which supports enhanced Web access functionalities, like content filtering and discovery, based on preferences specified by end users.
Collaborative tagging is one of the most popular services available online, and it allows end user to loosely classify either online or offline resources based on their feedback, expressed in the form of free-text labels (i.e., tags). Although tags may not be per se sensitive information, the wide use of collaborative tagging services increases the risk of cross referencing, thereby seriously compromising user privacy. In this paper, we make a first contribution toward the development of a privacy-preserving collaborative tagging service, by showing how a specific privacy-enhancing technology, namely tag suppression, can be used to protect end-user privacy. Moreover, we analyze how our approach can affect the effectiveness of a policy-based collaborative tagging system that supports enhanced web access functionalities, like content filtering and discovery, based on preferences specified by end users.
A central problem in the area of Process Mining is to obtain a formal model that represents the processes that are conducted in a system. If realized, this simple motivation allows for powerful techniques that can be used to formally analyze and optimize a system, without the need to resort to its semiformal and sometimes inaccurate specification. The problem addressed in this paper is known as Process Discovery: to obtain a formal model from a set of system executions. The theory of regions is a valuable tool in process discovery: it aims at learning a formal model (Petri nets) from a set of traces. On its genuine form, the theory is applied on an automaton and therefore one should convert the traces into an acyclic automaton in order to apply these techniques. Given that the complexity of the region-based techniques depends on the size of the input automata, revealing the underlying cycles and folding the initial automaton can incur in a significant complexity alleviation of the region-based techniques. In this paper, we follow this idea by incorporating region information in the cycle detection algorithm, enabling the identification of complex cycles that cannot be obtained efficiently with state-of-the-art techniques. The experimental results obtained by the devised tool suggest that the techniques
presented in this paper are a big step into widening the application of the theory of regions in Process Mining for industrial scenarios.
Comparison functions for sequences (of symbols) are important components of many applications, for example clustering, data cleansing and integration. For years, many efforts have been made to improve the performance of such comparison functions. Improvements have been done either at the cost of reducing the accuracy of the comparison, or by compromising certain basic characteristics of the functions, such as the triangular inequality. In this paper, we propose a new distance for sequences of symbols (or strings) called Optimal Symbol Alignment distance (OSA distance, for short). This distance has a very low cost in practice, which makes it a suitable candidate for computing distances in applications with large amounts of (very long) sequences. After providing a mathematical proof that the OSA distance is a real distance, we present some experiments for different scenarios (DNA sequences, record linkage, ...), showing that the proposed distance outperforms, in terms of execution time and/or accuracy, other well-known comparison functions such as the Edit or Jaro-Winkler distances.
t-Closeness is a privacy model recently defined for data anonymization. A data set is said to satisfy t-closeness if, for each
group of records sharing a combination of key attributes, the distance between the distribution of a confidential attribute in the group
and the distribution of the attribute in the entire data set is no more than a threshold t. Here, we define a privacy measure in terms of
information theory, similar to t-closeness. Then, we use the tools of that theory to show that our privacy measure can be achieved by
the postrandomization method (PRAM) for masking in the discrete case, and by a form of noise addition in the general case.
This article introduces a new method for supervised discretization based on interval distances by using a novel concept of neighbourhood in the target's space. The method proposed takes into consideration the order of the class attribute, when this exists, so that it can be used with ordinal discrete classes as well as continuous classes, in the case of regression problems. The method has proved to be very efficient in terms of accuracy and faster than the most commonly supervised discretization methods used in the literature. It is illustrated through several examples and a comparison with other standard discretization methods is performed for three public data sets by using two different learning tasks: a decision tree algorithm and SVM for regression.