To support better decision making in business analytics, organizations increasingly use external structured, semi-structured, and unstructured data in addition to their (mostly structured) internal data. Current Extract-Transform-Load (ETL) tools are not suitable for this "open world scenario" because they do not consider semantic issues in the integration process. Current ETL tools neither support processing semantic data nor creating a semantic Data Warehouse (DW), a repository of semantically integrated data. This paper describes our programmable Semantic ETL (SETL) framework. SETL builds on Semantic Web (SW) standards and tools and supports developers by offering a number of powerful modules, classes, and methods for (dimensional and semantic) DW constructs and tasks. Thus, it supports semantic data sources in addition to traditional ones, semantic integration, and the creation and publication of a semantic (multidimensional) DW in the form of a knowledge base. A comprehensive experimental evaluation comparing SETL to a solution built with traditional tools (requiring much more hand-coding) on a concrete use case shows that SETL provides better programmer productivity, knowledge base quality, and performance.
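To illustrate the general idea of loading traditional source data into a semantic DW, here is a minimal Python sketch (assuming the rdflib library; the vocabulary EX, the fact/dimension names, and the row data are hypothetical, not SETL's actual API):

    # Minimal semantic-ETL sketch: extract rows from a traditional source
    # and load them as RDF triples into a knowledge base.
    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/dw#")  # hypothetical DW vocabulary

    def load_sales(rows):
        g = Graph()
        g.bind("ex", EX)
        for i, row in enumerate(rows):
            fact = EX[f"sale/{i}"]                 # one fact instance per row
            g.add((fact, RDF.type, EX.Sale))
            g.add((fact, EX.product, EX[row["product"]]))    # dimension link
            g.add((fact, EX.amount, Literal(row["amount"]))) # measure
        return g

    rows = [{"product": "P1", "amount": 10}, {"product": "P2", "amount": 7}]
    print(load_sales(rows).serialize(format="turtle"))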
Goal-oriented requirements engineering promotes the use of goals to elicit, elaborate, structure, specify, analyze, negotiate, document, and modify requirements. Goal-oriented specifications are therefore essential for capturing the objectives that the system to be developed should achieve. However, the application of goal-oriented specifications in model-driven development (MDD) processes is still handcrafted and not aligned with the automated flow from models to code. In other words, the experience of analysts and designers is needed to manually transform the input goal-oriented models into the system models used for code generation (model compilation). Some authors have proposed guidelines to facilitate and partially automate this translation, but there is a lack of techniques to assess the adequacy of goal-oriented models as the starting point of MDD processes. In this paper, we present and evaluate a verification approach that guarantees the automatic, correct, and complete transformation of goal-oriented models into the design models used by specific MDD solutions. In particular, the approach has been put into practice by adopting a well-known goal-oriented modeling approach, the i* framework, and an industrial MDD solution called Integranova.
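A toy sketch of one such adequacy check in Python (the data structures and names are illustrative, not the paper's formalism): before model compilation is attempted, every element of the goal model must be covered by some transformation rule.

    # Completeness check: report goal-model elements that no transformation
    # rule maps to the design model.
    def unmapped_elements(goal_elements, rules):
        """Return goal-model elements not covered by any rule."""
        covered = {src for rule in rules for src in rule["sources"]}
        return [e for e in goal_elements if e not in covered]

    goal_elements = ["goal:SellOnline", "task:RegisterOrder", "resource:Catalog"]
    rules = [{"sources": ["goal:SellOnline", "task:RegisterOrder"],
              "target": "service:OrderManagement"}]
    print(unmapped_elements(goal_elements, rules))  # -> ['resource:Catalog']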
Obtaining the right set of data for evaluating the fulfillment of different quality factors in extract-transform-load (ETL) process design is rather challenging. First, the real data might be out of reach due to privacy constraints, while manually providing a synthetic dataset is known to be a labor-intensive task that needs to take various combinations of process parameters into account. More importantly, a single dataset usually does not represent the evolution of data throughout the complete process lifespan, hence missing many possible test cases. To facilitate this demanding task, in this paper we propose an automatic data generator (i.e., Bijoux). Starting from a given ETL process model, Bijoux extracts the semantics of its data transformations, analyzes the constraints they imply over the input data, and automatically generates testing datasets. Bijoux is highly modular and configurable, enabling end-users to generate datasets for a variety of interesting test scenarios (e.g., evaluating specific parts of an input ETL process design, with different input dataset sizes, different data distributions, and different operation selectivities). We have developed a running prototype that implements the functionality of our data generation framework, and we report experimental findings showing the effectiveness and scalability of our approach.
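A minimal Python sketch of constraint-aware test-data generation in this spirit (the function names, the filter predicate, and the selectivity target are hypothetical, not Bijoux's actual interface): given a filter operation and a desired selectivity, emit rows that satisfy the predicate at the requested rate.

    import random

    def generate(n, selectivity, passing, failing):
        """Generate n rows; about n * selectivity satisfy the filter."""
        return [passing() if random.random() < selectivity else failing()
                for _ in range(n)]

    # Filter under test: amount > 100, with a 30% selectivity target.
    rows = generate(1000, 0.30,
                    passing=lambda: {"amount": random.randint(101, 500)},
                    failing=lambda: {"amount": random.randint(0, 100)})
    observed = sum(r["amount"] > 100 for r in rows) / len(rows)
    print(f"observed selectivity: {observed:.2f}")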
We present GeoSRS, a hybrid recommender system for a popular location-based social network (LBSN) in which users write short reviews on the places of interest they visit. Using state-of-the-art text mining techniques, our system recommends locations to users based on the complete set of text reviews in addition to their geographical location. To evaluate our system, we collected our own datasets by crawling the social network Foursquare. To do this efficiently, we propose a parallel version of the Quadtree technique, which may also be applicable to crawling or exploring other spatially distributed sources. Finally, we study the performance of GeoSRS on our collected dataset and conclude that, by combining sentiment analysis and text modeling, GeoSRS generates more accurate recommendations. The performance of the system improves as more reviews become available, which further motivates the use of large-scale crawling techniques such as the Quadtree.
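A minimal Python sketch of Quadtree-based crawling (the fetch function stands in for an LBSN venue-search call and is hypothetical): if a bounding-box query hits the API's result cap, the box is split into four quadrants that are crawled recursively. For brevity the recursion below is sequential; the paper's version processes quadrants in parallel.

    def crawl(box, fetch, cap=50):
        """box = (min_lat, min_lon, max_lat, max_lon)."""
        venues = fetch(box)
        if len(venues) < cap:          # box is sparse enough: done
            return venues
        lat0, lon0, lat1, lon1 = box
        mid_lat, mid_lon = (lat0 + lat1) / 2, (lon0 + lon1) / 2
        quads = [(lat0, lon0, mid_lat, mid_lon), (lat0, mid_lon, mid_lat, lon1),
                 (mid_lat, lon0, lat1, mid_lon), (mid_lat, mid_lon, lat1, lon1)]
        result = []
        for q in quads:                # could be dispatched to parallel workers
            result.extend(crawl(q, fetch, cap))
        return result

    def fake_fetch(box):
        # Stand-in for the real venues-search API: density capped at 60 hits.
        lat0, lon0, lat1, lon1 = box
        area = (lat1 - lat0) * (lon1 - lon0)
        return ["venue"] * min(int(area * 1000), 60)

    print(len(crawl((0.0, 0.0, 1.0, 1.0), fake_fetch)))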
We formalise a specialised database management system model for time series based on a multiresolution approach. These special-purpose database systems store lossily compressed time series in space-bounded storage. Time series can be stored at multiple resolutions, using distinct attribute aggregations while keeping the temporal attribute managed in a consistent way.
The model takes a generic approach that facilitates customisation to better suit the actual application requirements in a given context: elements whose meaning depends on the concrete application are kept generic.
Furthermore, we consider some specific time series properties that pose a challenge to the multiresolution approach. We also describe a reference implementation of the model and introduce a use case based on real data.
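A minimal Python sketch of multiresolution, space-bounded storage (illustrative only, not the paper's formal model): each resolution keeps a bounded number of aggregated buckets, so older data survives only at coarser resolutions.

    from collections import deque

    class Resolution:
        def __init__(self, step, capacity, agg=max):
            self.step, self.agg = step, agg        # bucket width, aggregation
            self.buckets = deque(maxlen=capacity)  # space-bounded storage
            self.current, self.current_start = [], 0

        def append(self, t, value):
            start = t - t % self.step              # bucket this point falls in
            if start != self.current_start and self.current:
                self.buckets.append((self.current_start, self.agg(self.current)))
                self.current = []
            self.current_start = start
            self.current.append(value)

    series = [Resolution(step=5, capacity=4), Resolution(step=60, capacity=24)]
    for t, v in enumerate(range(100)):             # one measurement per second
        for r in series:
            r.append(t, v)
    print(list(series[0].buckets))  # only the 4 most recent 5-second maxima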
In recent years, the problems of using generic (i.e., relational) storage techniques for very specific applications have been identified and outlined, and, as a consequence, alternatives to relational DBMSs (e.g., HBase) have bloomed. Most of these alternatives sit in the cloud and benefit from cloud computing, which helps to save money by eliminating fixed hardware and software costs in favor of pay-per-use. On top of this, specific querying frameworks to exploit the brute force of the cloud (e.g., MapReduce) have also been devised. The question that arises next is whether this (rather naive) exploitation of the cloud is an alternative to tuning DBMSs, or whether it still makes sense to consider other options when retrieving data from these settings. In this paper, we study the feasibility of solving OLAP queries with Hadoop (the Apache project implementing MapReduce) while benefiting from secondary indexes and partitioning in HBase. Our main contribution is the comparison of different access plans and the definition of criteria (i.e., cost estimation) to choose among them in terms of consumed resources (namely CPU, bandwidth, and I/O).
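A toy Python cost model in the spirit of such plan selection (the coefficients and row counts are illustrative, not the paper's calibrated values): choose between a full scan and a secondary-index access depending on selectivity.

    def scan_cost(total_rows, row_size, io_per_mb=1.0):
        # sequential read of the whole table
        return total_rows * row_size / 2**20 * io_per_mb

    def index_cost(matching_rows, row_size, lookup_penalty=4.0, io_per_mb=1.0):
        # random reads through a secondary index are penalized vs sequential I/O
        return matching_rows * row_size / 2**20 * io_per_mb * lookup_penalty

    def choose_plan(total_rows, selectivity, row_size=200):
        full = scan_cost(total_rows, row_size)
        idx = index_cost(total_rows * selectivity, row_size)
        return ("index access", idx) if idx < full else ("full scan", full)

    print(choose_plan(10_000_000, selectivity=0.01))  # highly selective -> index
    print(choose_plan(10_000_000, selectivity=0.60))  # low selectivity -> scan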
Context: Organisational reengineering, continuous process improvement, alignment among complementary analysis perspectives, and information traceability are some of the current motivations for investing scientific effort in integrating the goal and business process perspectives. Providing support for integrated information systems analysis is a challenge in this complex setting.
Objective: The GoBIS framework integrates two modelling approaches: i* (a goal-oriented modelling method) and Communication Analysis (a communication-oriented business process modelling method).
Method: In this paper, we describe the methodological integration of both methods with the aim of fulfilling several criteria: i) to rely on appropriate theories; ii) to provide abstract and concrete syntaxes; iii) to provide scenarios of application; iv) to develop tool support; v) to provide demonstrable benefits to potential adopters.
Results: We provide guidelines for using the two modelling methods in a top-down analysis scenario. The guidelines are validated by means of a comparative experiment and a focus-group session with students.
Conclusions: From a practitioner's viewpoint (modeller and/or analyst), the guidelines facilitate traceability between goal and business process models. The experimental results highlight the benefits of GoBIS in terms of perceived performance and usability, and demonstrate an improvement in the completeness of the resulting business process models, with a corresponding impact on efficiency. From a researcher's perspective, the validation has produced useful feedback for future research.
An exponential growth of event data can be witnessed across all industries. Devices connected to the internet (the Internet of Things), social interaction, mobile computing, and cloud computing provide new sources of event data, and this trend will continue. The omnipresence of large amounts of event data is an important enabler for process mining. Process mining techniques can be used to discover, monitor, and improve real processes by extracting knowledge from observed behavior. However, unprecedented volumes of event data also pose new challenges that state-of-the-art process mining techniques often cannot cope with. This paper focuses on "conformance checking in the large" and presents a novel decomposition technique that partitions large process models and event logs into smaller parts that can be analyzed independently. The so-called Single-Entry Single-Exit (SESE) decomposition not only helps to speed up conformance checking, but also provides improved diagnostics: the analyst can zoom in on the problematic parts of the process. Importantly, we describe the conditions under which the conformance of the whole can be assessed by verifying the conformance of the SESE parts, which enables the decomposition and distribution of large conformance checking problems. All techniques have been implemented in ProM, and experimental results are provided.
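A minimal Python sketch of decomposed conformance checking (illustrative; the real SESE decomposition operates on the process model's workflow graph, and the fragments and allowed runs below are hypothetical): each trace is projected onto a fragment's activities and checked independently, so deviations can be localized.

    def project(trace, activities):
        return [a for a in trace if a in activities]

    def fits(fragment_traces, allowed):
        """A fragment conforms if every projected trace is allowed behaviour."""
        return all(tuple(t) in allowed for t in fragment_traces)

    log = [["a", "b", "c", "d"], ["a", "c", "b", "d"], ["a", "d", "c"]]
    fragments = {
        "F1": ({"a", "b"}, {("a", "b"), ("a",)}),   # activities, allowed runs
        "F2": ({"c", "d"}, {("c", "d"), ("d",)}),
    }
    for name, (acts, allowed) in fragments.items():
        ok = fits([project(t, acts) for t in log], allowed)
        print(name, "conforms" if ok else "deviates")  # zoom in on bad parts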
Designing data warehouse (DW) systems in highly dynamic enterprise environments is not an easy task. At each moment, the multidimensional (MD) schema needs to satisfy the set of information requirements posed by the business users. At the same time, the diversity and heterogeneity of the data sources need to be considered in order to properly retrieve the needed data. The frequent arrival of new business needs requires that the system be adaptable to changes. To cope with this inevitable complexity (both at the beginning of the design process and when potential evolution events occur), in this paper we present a semi-automatic method called ORE for creating DW designs in an iterative fashion based on a given set of information requirements. Requirements are first considered separately. For each requirement, ORE expects the set of possible MD interpretations of the source data needed for that requirement (in a form similar to an MD schema). Incrementally, ORE builds a unified MD schema that satisfies the entire set of requirements and meets predefined quality objectives. We have implemented ORE and performed a number of experiments to study our approach. We have also conducted a limited-scale case study to investigate its usefulness to designers.
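A toy Python sketch of one incremental integration step in this spirit (the data structures are simplified and hypothetical; the real method also tracks source mappings and quality objectives): each requirement's MD interpretation is folded into the unified schema, reusing matching facts and extending their dimensions and measures.

    def integrate(unified, interpretation):
        fact = interpretation["fact"]
        entry = unified.setdefault(fact, {"dimensions": set(), "measures": set()})
        entry["dimensions"] |= set(interpretation["dimensions"])
        entry["measures"] |= set(interpretation["measures"])
        return unified

    unified = {}
    requirements = [
        {"fact": "Sales", "dimensions": ["Date", "Store"], "measures": ["amount"]},
        {"fact": "Sales", "dimensions": ["Date", "Product"], "measures": ["qty"]},
    ]
    for r in requirements:
        integrate(unified, r)
    print(unified)  # Sales now covers both requirements' dimensions and measures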
This paper presents a multidimensional conceptual Object-Oriented model for Data Warehousing and OLAP tools, including its structures, integrity constraints, and query operations. It has been developed as an extension of the UML core metaclasses to facilitate its usage and to fill the absence of a standard model. Being a UML extension allows reusing modeling constructs and techniques, and integrating multidimensional modeling into more general modeling processes. Moreover, while existing multidimensional models are restricted to modeling isolated stars, this paper investigates the representation of several semantically related star schemas. Summarizability and identification constraints can also be represented in the model, and a closed and complete set of algebraic operations has been defined in terms of functions (so that the mathematical properties of functions can be smoothly applied).
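A minimal Python sketch of the "operations as functions" idea (illustrative only; the level mapping and measure values are hypothetical): a roll-up is the composition of a level mapping (day -> month) with an aggregation over the resulting groups, so standard properties of functions apply directly.

    from collections import defaultdict

    def roll_up(cells, level_map, agg=sum):
        """cells: {day: measure}; level_map: day -> month."""
        groups = defaultdict(list)
        for day, measure in cells.items():
            groups[level_map(day)].append(measure)
        return {month: agg(values) for month, values in groups.items()}

    cells = {"2024-01-03": 5, "2024-01-17": 7, "2024-02-02": 4}
    print(roll_up(cells, level_map=lambda d: d[:7]))
    # -> {'2024-01': 12, '2024-02': 4}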