Scientific and technological production

  • Learning probabilistic automata: a study in state distinguishability

     De Balle Pigem, Borja; Castro Rabal, Jorge; Gavaldà Mestre, Ricard
    Theoretical computer science
    Date of publication: 2013-02-18
    Journal article

    Known algorithms for learning PDFA can only be shown to run in time polynomial in the so-called distinguishability μ of the target machine, besides the number of states and the usual accuracy and confidence parameters. We show that the dependence on μ is necessary in the worst case for every algorithm whose structure resembles existing ones. As a technical tool, a new variant of Statistical Queries termed L∞-queries is defined. We show how to simulate L∞-queries using classical Statistical Queries and show that known PAC algorithms for learning PDFA are in fact statistical query algorithms. Our results include a lower bound: every algorithm to learn PDFA with queries using a reasonable tolerance must make Ω(1/μ^(1−c)) queries for every c>0. Finally, an adaptive algorithm that PAC-learns w.r.t. another measure of complexity is described. This yields better efficiency in many cases, while retaining the same inevitable worst-case behavior. Our algorithm requires fewer input parameters than previously existing ones, and has a better sample bound.
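
    To make the L∞-query notion concrete, here is a minimal Python sketch that answers such a query from finite samples by comparing the largest pointwise gap between two empirical distributions against a tolerance; the samples, tolerance, and function names are hypothetical, not the paper's construction.

        from collections import Counter

        def empirical_distribution(sample):
            """Empirical probability of each string in the sample."""
            n = len(sample)
            return {s: c / n for s, c in Counter(sample).items()}

        def linf_query(sample_p, sample_q, tolerance):
            """Estimate max_x |P(x) - Q(x)| from samples and compare it
            with the tolerance, as an L-infinity-style query would."""
            p = empirical_distribution(sample_p)
            q = empirical_distribution(sample_q)
            support = set(p) | set(q)
            gap = max(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)
            return gap > tolerance, gap

        # Hypothetical samples of strings drawn from two candidate states.
        sample_a = ["ab", "ab", "b", "aab", "ab", "b"]
        sample_b = ["ab", "b", "b", "b", "aab", "b"]
        print(linf_query(sample_a, sample_b, tolerance=0.2))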

  • Power-aware multi-data center management using machine learning

     Berral Garcia, Josep Lluis; Gavaldà Mestre, Ricard; Torres Viñals, Jordi
    International Workshop on Power-aware Algorithms, Systems, and Architectures
    Presentation's date: 2013-10-01
    Presentation of work at congresses

    The cloud relies upon multi-data center (multi-DC) infrastructures distributed around the world, where people and enterprises pay for resources to offer their web-services to worldwide clients. Intelligent management is required to automate and manage these infrastructures, as the amount of resources and data to manage exceeds the capacities of human operators. Also, it must take into account the cost of running the resources (energy) and the quality of service towards web-services and clients. (De-)consolidation and prioritizing proximity to clients become two main strategies to allocate resources and properly place these web-services in the multi-DC network. Here we present a mathematical model to describe the scheduling problem given web-services and hosts across a multi-DC system, equipping the decision makers with models of system behavior obtained using machine learning. After running the system on real DC infrastructures we see that the model drives web-services to the best locations given quality of service, energy consumption, and client proximity, also (de-)consolidating according to the resources required for each web-service given its load.

  • Empowering automatic data-center management with machine learning

     Berral Garcia, Josep Lluis; Gavaldà Mestre, Ricard; Torres Viñals, Jordi
    ACM Symposium on Applied Computing
    Presentation's date: 2013-03-21
    Presentation of work at congresses

    The Cloud as a computing paradigm has become crucial for most Internet business models. Managing and optimizing its performance on a moment-by-moment basis is not easy given the amount and diversity of elements involved (hardware, applications, workloads, customer needs...). Here we show how a combination of scheduling algorithms and data mining techniques helps improve the performance and profitability of a data-center running virtualized web-services. We model the data-center's main resources (CPU, memory, IO), quality of service (viewed as response time), and workloads (incoming streams of requests) from past executions. We show how these models help scheduling algorithms make better decisions about job and resource allocation, aiming for a balance between throughput, quality of service, and power consumption.
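
    As a rough illustration of the modeling step (not the paper's code), the sketch below fits a regression model on hypothetical past executions relating resources and load to response time, and lets a scheduler pick the cheapest allocation predicted to meet an SLA; all feature names and numbers are invented.

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        # Hypothetical past executions: [cpu_share, memory_gb, requests_per_sec].
        X = np.array([[0.5, 2, 100], [1.0, 4, 100], [0.5, 2, 300],
                      [2.0, 8, 300], [1.0, 4, 500], [2.0, 8, 500]])
        y = np.array([120, 60, 400, 90, 350, 150])  # response time in ms

        model = DecisionTreeRegressor(min_samples_leaf=2).fit(X, y)

        # Scheduler side: cheapest (cpu, memory) predicted to meet the SLA.
        candidates = [(0.5, 2), (1.0, 4), (2.0, 8)]
        load, sla_ms = 400, 200
        feasible = [c for c in candidates
                    if model.predict([[c[0], c[1], load]])[0] <= sla_ms]
        print(min(feasible) if feasible else "scale out")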

  • The Architecture of a Churn Prediction System Based on Stream Mining

     De Balle Pigem, Borja; Casas Fernandez, Bernardino; Catarineu, Alex; Gavaldà Mestre, Ricard; Manzano Macho, David
    Congrés Internacional de l'Associació Catalana d'Intel·ligència Artificial
    Presentation's date: 2013-10-24
    Presentation of work at congresses

    Churning is the movement of customers from one company to another. For any company, being able to predict some time in advance which of its customers will churn is essential to take actions to retain them, and for this reason most sectors invest substantial effort in techniques for (semi-)automatically predicting churning; data mining and machine learning are among the techniques successfully used to this effect. In this paper we describe a prototype for churn prediction using stream mining methods, which offer the additional promise of detecting new patterns of churn in real-time streams of high-speed data, and of adapting quickly to a changing reality. The prototype is implemented on top of the MOA (Massive Online Analysis) framework for stream mining. The prototype targets the telecommunications (mobile phone) operator sector.
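
    The prototype itself is built on MOA (Java); as a loose Python analogue of prequential churn prediction on a stream, the sketch below uses a Hoeffding tree from the river library on invented customer records.

        from river import tree

        model = tree.HoeffdingTreeClassifier()

        # Hypothetical stream of (features, churned?) pairs, one at a time.
        stream = [
            ({"calls_per_week": 12, "complaints": 0, "months_active": 24}, False),
            ({"calls_per_week": 2, "complaints": 3, "months_active": 3}, True),
            ({"calls_per_week": 9, "complaints": 1, "months_active": 18}, False),
            ({"calls_per_week": 1, "complaints": 4, "months_active": 2}, True),
        ]

        for x, churned in stream:
            print(model.predict_one(x))  # predict first (may be None early on)...
            model.learn_one(x, churned)  # ...then update the model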

  • An efficient closed frequent itemset miner for the MOA stream mining system

     Quadrana, Massimo; Bifet Figuerol, Albert Carles; Gavaldà Mestre, Ricard
    International Conference of the Catalan Association for Artificial Intelligence
    Presentation's date: 2013-10-24
    Presentation of work at congresses

    We describe and evaluate an implementation of the IncMine algorithm due to Cheng, Ke, and Ng (2008) for mining frequent closed itemsets from data streams, working on the MOA platform. The goal was to produce a robust, efficient, and usable tool for that task that can both be used by practitioners and used for evaluation of research in the area. We experimentally confirm the excellent performance of the algorithm and its ability to handle concept drift.
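
    IncMine maintains semi-frequent closed itemsets incrementally; the brute-force sketch below only illustrates the definition it exploits, recomputing the closed frequent itemsets of a small sliding window (window contents and threshold are made up).

        from itertools import combinations

        def closed_frequent_itemsets(window, min_support):
            """Frequent itemsets of the window that are closed, i.e. no
            strict superset has the same support."""
            support = {}
            for txn in window:
                for k in range(1, len(txn) + 1):
                    for items in combinations(sorted(txn), k):
                        support[items] = support.get(items, 0) + 1
            frequent = {i: s for i, s in support.items() if s >= min_support}
            return {i: s for i, s in frequent.items()
                    if not any(set(i) < set(j) and s == t
                               for j, t in frequent.items())}

        window = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"a", "b"}]
        print(closed_frequent_itemsets(window, min_support=2))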

  • Adaptively learning probabilistic deterministic automata from data streams

     De Balle Pigem, Borja; Castro Rabal, Jorge; Gavaldà Mestre, Ricard
    Machine learning
    Date of publication: 2013-10
    Journal article

    Markovian models with hidden state are widely-used formalisms for modeling sequential phenomena. Learnability of these models has been well studied when the sample is given in batch mode, and algorithms with PAC-like learning guarantees exist for specific classes of models such as Probabilistic Deterministic Finite Automata (PDFA). Here we focus on PDFA and give an algorithm for inferring models in this class in the restrictive data stream scenario: unlike existing methods, our algorithm works incrementally and in one pass, uses memory sublinear in the stream length, and processes input items in amortized constant time. We also present extensions of the algorithm that (1) reduce to a minimum the need for guessing parameters of the target distribution and (2) are able to adapt to changes in the input distribution, relearning new models when needed. We provide rigorous PAC-like bounds for all of the above. Our algorithm makes key use of stream sketching techniques for reducing memory and processing time, and is modular in that it can use different tests for state equivalence and for change detection in the stream.
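
    One sketching ingredient of this kind of algorithm can be illustrated with a tiny Count-Min sketch, which keeps approximate frequencies in memory independent of the number of distinct items; the width, depth, and hashing below are illustrative choices, not the paper's exact construction.

        import random

        class CountMinSketch:
            """Approximate frequency counts in O(width * depth) memory."""
            def __init__(self, width=64, depth=4, seed=0):
                rng = random.Random(seed)
                self.width = width
                self.salts = [rng.random() for _ in range(depth)]
                self.table = [[0] * width for _ in range(depth)]

            def _cells(self, item):
                for row, salt in enumerate(self.salts):
                    yield row, hash((salt, item)) % self.width

            def add(self, item):
                for row, col in self._cells(item):
                    self.table[row][col] += 1

            def estimate(self, item):
                # Never underestimates: collisions can only inflate counts.
                return min(self.table[row][col] for row, col in self._cells(item))

        cms = CountMinSketch()
        for s in ["ab", "ab", "ba", "ab", "aab"]:
            cms.add(s)
        print(cms.estimate("ab"), cms.estimate("ba"))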

  • Improved self-management of datacenter systems applying machine learning

     Berral Garcia, Josep Lluis
    Defense's date: 2013-11-22
    Universitat Politècnica de Catalunya
    Theses

  • Application of clustering analysis and sequence analysis on the performance analysis of parallel applications  Open access

     González García, Juan
    Defense's date: 2013-06-07
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    High Performance Computing and Supercomputing is the high-end area of computer science that studies and develops the most powerful computers available. Current supercomputers are extremely complex, and so are the applications that run on them. To take advantage of the huge amount of computing power available, it is strictly necessary to maximize the knowledge we have about how these applications behave and perform. This is the mission of (parallel) performance analysis. In general, performance analysis toolkits offer only very simplistic manipulations of the performance data. First-order statistics such as average or standard deviation are used to summarize the values of a given performance metric, hiding in some cases interesting facts present in the raw performance data. For this reason, we require Performance Analytics, i.e. the application of Data Analytics techniques in the performance analysis area. This thesis contributes two new techniques to the Performance Analytics field. The first contribution is the application of cluster analysis to detect the parallel application computation structure. Cluster analysis is the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters). In this thesis we use cluster analysis to group the CPU bursts of a parallel application, the regions on each process in between communication calls or calls to the parallel runtime. The resulting clusters are the different computational trends or phases that appear in the application. These clusters are useful to understand the behaviour of the computation part of the application and to focus the analyses on those that present performance issues. We demonstrate that our approach requires different clustering algorithms than those previously used in the area. The second contribution of the thesis is the application of multiple sequence alignment algorithms to evaluate the computation structure detected. Multiple sequence alignment (MSA) is a technique commonly used in bioinformatics to determine the similarities across two or more biological sequences: DNA or proteins. The Cluster Sequence Score we introduce applies an MSA algorithm to evaluate the SPMDiness of an application, i.e. how well its computation structure represents the Single Program Multiple Data (SPMD) paradigm structure. We also use this score in the Aggregative Cluster Refinement, a new clustering algorithm we designed, able to detect the SPMD phases of an application at fine grain, surpassing the clustering algorithms we used initially. We demonstrate the usefulness of these techniques with three practical uses. The first one is an extrapolation methodology able to maximize the performance metrics that characterize the application phases detected using a single application execution. The second one is the use of the computation structure detected to speed up a multi-level simulation infrastructure. Finally, we analyse four production-class applications, using the computation characterization to study the impact of possible application improvements and portings of the applications to different hardware configurations. In summary, this thesis proposes the use of cluster analysis and sequence analysis to automatically detect and characterize the different computation trends of a parallel application. These techniques provide the developer/analyst useful insight into the application performance and ease the understanding of the application's behaviour.
    The contributions of the thesis are not reduced to proposals and publications of the techniques themselves, but also include practical uses demonstrating their usefulness in the analysis task. In addition, the research carried out during these years has produced a production tool for analysing applications' structure, part of the BSC Tools suite.
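
    As a toy illustration of the first contribution, the sketch below clusters hypothetical per-burst metrics (instructions completed, IPC) with a density-based algorithm, scikit-learn's DBSCAN; the data and parameters are invented.

        import numpy as np
        from sklearn.cluster import DBSCAN
        from sklearn.preprocessing import StandardScaler

        # Hypothetical per-burst metrics: [instructions completed, IPC].
        bursts = np.array([[1.0e9, 1.8], [1.1e9, 1.7], [9.5e8, 1.9],  # phase A
                           [4.0e8, 0.6], [4.2e8, 0.5], [3.9e8, 0.6],  # phase B
                           [7.0e9, 2.4]])                             # outlier

        X = StandardScaler().fit_transform(bursts)  # put metrics on one scale
        labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
        print(labels)  # cluster id per burst; -1 marks noise/outliers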

  • On the Complexity of Resolution-based Proof Systems  Open access

     Oliva Valls, Sergi
    Defense's date: 2013-05-02
    Department of Software, Universitat Politècnica de Catalunya
    Theses

    Propositional Proof Complexity is the area of Computational Complexity that studies the length of proofs in propositional logic. One of its main questions is to determine which particular propositional formulas have short proofs in a given propositional proof system. In this thesis we present several results related to this question, all on proof systems that are extensions of the well-known resolution proof system. The first result of this thesis is that TQBF, the problem of determining if a fully-quantified propositional CNF-formula is true, is PSPACE-complete even when restricted to instances of bounded tree-width, i.e. a parameter of structures that measures their similarity to a tree. Instances of bounded tree-width of many NP-complete problems are tractable, e.g. SAT, the boolean satisfiability problem. We show that this does not scale up to TQBF. We also consider Q-resolution, a quantifier-aware version of resolution. On the negative side, our first result implies that, unless NP = PSPACE, the class of fully-quantified CNF-formulas of bounded tree-width does not have short proofs in any proof system (and in particular in Q-resolution). On the positive side, we show that instances with bounded respectful tree-width, a more restrictive condition, do have short proofs in Q-resolution. We also give a natural family of formulas with this property that have real-world applications. The second result concerns interpretability. Informally, we say that a first-order formula can be interpreted in another if the first one can be expressed using the vocabulary of the second, plus some extra features. We show that first-order formulas whose propositional translations have short R(const)-proofs, i.e. a generalized version of resolution with DNF-formulas of constant-size terms, are closed under a weaker form of interpretability (that with no extra features), called definability. Our main result is a similar result on interpretability. Also, we show some examples of interpretations and show a systematic technique to transform some Sigma_1-definitions into quantifier-free interpretations. The third and final result is about a relativized weak pigeonhole principle. This says that if at least 2n out of n^2 pigeons decide to fly into n holes, then some hole must be doubly occupied. We prove that the CNF encoding of this principle does not have polynomial-size DNF-refutations, i.e. refutations in the generalized version of resolution with unbounded DNF-formulas. For this proof we discuss the existence of unbalanced low-degree bipartite expanders satisfying a certain robustness condition.

  • Learning Finite-State Machines: Statistical and Algorithmic Aspects  Open access

     De Balle Pigem, Borja
    Defense's date: 2013-07-12
    Department of Software, Universitat Politècnica de Catalunya
    Theses

    The present thesis addresses several machine learning problems on generative and predictive models for sequential data. All the models considered have in common that they can be defined in terms of finite-state machines. In one line of work we study algorithms for learning the probabilistic analog of Deterministic Finite Automata (DFA). This provides a fairly expressive generative model for sequences with very interesting algorithmic properties. State-merging algorithms for learning these models can be interpreted as a divisive clustering scheme where the "dependency graph" between clusters is not necessarily a tree. We characterize these algorithms in terms of statistical queries and use this characterization to prove a lower bound with an explicit dependency on the distinguishability of the target machine. In a more realistic setting, we give an adaptive state-merging algorithm satisfying the stringent algorithmic constraints of the data streams computing paradigm. Our algorithms come with strict PAC learning guarantees. At the heart of state-merging algorithms lies a statistical test for distribution similarity. In the streaming version this is replaced with a bootstrap-based test which yields faster convergence in many situations. We also studied a wider class of models for which the state-merging paradigm also yields PAC learning algorithms. Applications of this method are given to continuous-time Markovian models and stochastic transducers on pairs of aligned sequences. The main tools used for obtaining these results include a variety of concentration inequalities and sketching algorithms. In another line of work we contribute to the rapidly growing body of spectral learning algorithms. The main virtues of this type of algorithm include the possibility of proving finite-sample error bounds in the realizable case and enormous savings in computing time over iterative methods like Expectation-Maximization. In this thesis we give the first application of this method to learning conditional distributions over pairs of aligned sequences defined by probabilistic finite-state transducers. We also prove that the method can learn the whole class of probabilistic automata, thus extending the class of models previously known to be learnable with this approach. In the last two chapters we present works combining spectral learning with methods from convex optimization and matrix completion. Respectively, these yield an alternative interpretation of spectral learning and an extension to cases with missing data. In the latter case we used a novel joint stability analysis of matrix completion and spectral learning to prove the first generalization bound for this type of algorithm that holds in the non-realizable case. Work in this area has been motivated by connections between spectral learning, classic automata theory, and statistical learning; tools from these three areas have been used.
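
    The core step of the spectral method can be sketched in a few lines: build an empirical Hankel matrix of string probabilities and take a truncated SVD to obtain a low-rank factorization; the tiny matrix and the rank below are purely illustrative.

        import numpy as np

        # Hypothetical empirical Hankel matrix H[prefix, suffix] ~ P(prefix+suffix)
        # over prefixes/suffixes {"", "a", "b"}.
        H = np.array([[0.20, 0.15, 0.10],
                      [0.15, 0.12, 0.07],
                      [0.10, 0.07, 0.05]])

        rank = 2                        # guessed number of states
        U, s, Vt = np.linalg.svd(H)
        P = U[:, :rank] * s[:rank]      # forward (prefix) factor
        S = Vt[:rank, :]                # backward (suffix) factor

        # H is approximately P @ S; the full method then solves for one
        # operator matrix per symbol from shifted Hankel blocks.
        print(np.round(P @ S, 3))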

  • Addressing Practical Challenges for Anomaly Detection in Backbone Networks  Open access

     Paredes Oliva, Ignasi
    Defense's date: 2013-07-29
    Universitat Politècnica de Catalunya
    Theses

    Network monitoring has always been a topic of foremost importance for both network operators and researchers, for multiple reasons ranging from anomaly detection to traffic classification or capacity planning. Nowadays, as networks become more and more complex, traffic increases and security threats multiply, achieving a deeper understanding of what is happening in the network has become an essential necessity. In particular, due to the considerable growth of cybercrime, research in the field of anomaly detection has drawn significant attention in recent years and tons of proposals have been made. All the same, when it comes to deploying solutions in real environments, some of them fail to meet some crucial requirements. Taking this into account, this thesis focuses on filling this gap between the research and the non-research world. Prior to the start of this work, we identified several problems. First, there is a clear lack of detailed and updated information on the most common anomalies and their characteristics. Second, unawareness of sampled data is still common, although the performance of anomaly detection algorithms is severely affected. Third, operators currently need to invest many work-hours to manually inspect and classify detected anomalies in order to act accordingly and take the appropriate mitigation measures. This is further exacerbated by the high number of false positives and false negatives, and because anomaly detection systems are often perceived as extremely complex black boxes. Analysing an issue is essential to fully comprehend the problem space and to be able to tackle it properly. Accordingly, the first block of this thesis seeks to obtain detailed and updated real-world information on the most frequent anomalies occurring in backbone networks. It first reports on the performance of different commercial systems for anomaly detection and analyses the types of network anomalies detected. Afterwards, it focuses on further investigating the characteristics of the anomalies found in a backbone network using one of the tools for more than half a year. Among other results, this block confirms the need for applying sampling in an operational environment, as well as the unacceptably high number of false positives and false negatives still reported by current commercial tools. On the whole, the presence of sampling in large networks for monitoring purposes has become almost mandatory and, therefore, all anomaly detection algorithms that do not take that into account might report incorrect results. In the second block of this thesis, the dramatic impact of sampling on the performance of well-known anomaly detection techniques is analysed and confirmed. However, we show that the results change significantly depending on the sampling technique used and also on the common metric selected to perform the comparison. In particular, we show that Packet Sampling outperforms Flow Sampling, unlike previously reported. Furthermore, we observe that Selective Sampling (SES), a sampling technique that focuses on small flows, obtains much better results than traditional sampling techniques for scan detection. Consequently, we propose Online Selective Sampling, a sampling technique that obtains the same good performance for scan detection as SES but works on a per-packet basis instead of keeping all flows in memory. We validate and evaluate our proposal and show that it can operate online and uses far fewer resources than SES.
    Although the literature is full of techniques for detecting anomalous events, research on anomaly classification and extraction (e.g., to further investigate what happened or to share evidence with third parties involved) is rather marginal. This makes it harder for network operators to analyse reported anomalies, because they depend solely on their experience to do the job. Furthermore, this task is an extremely time-consuming and error-prone process. The third block of this thesis targets this issue and brings it together with the knowledge acquired in the previous blocks. In particular, it presents a system for automatic anomaly detection, extraction and classification with high accuracy and very low false positives. We deploy the system in an operational environment and show its usefulness in practice. The fourth and last block of this thesis presents a generalisation of our system that focuses on analysing all the traffic, not only network anomalies. This new system seeks to further help network operators by summarising the most significant traffic patterns in their network. In particular, we generalise our system to deal with big network traffic data: it handles src/dst IPs, src/dst ports, protocol, src/dst Autonomous Systems, layer-7 application and src/dst geolocation. We first deploy a prototype in the European backbone network of GÉANT and show that it can process large amounts of data quickly and build highly informative and compact reports that are very useful to help comprehend what is happening in the network. Second, we deploy it in a completely different scenario and show how it can also be successfully used in a real-world use case where we analyse the behaviour of highly distributed devices related to a critical infrastructure sector.
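
    A rough per-packet sketch of the idea behind Online Selective Sampling: keep a bounded table of per-flow packet counts and sample packets from small flows with higher probability. The thresholds, probabilities, and eviction rule are invented for illustration.

        import random

        def selective_sampler(packets, small_flow_threshold=5,
                              p_small=0.9, p_large=0.05, max_flows=10_000):
            """Yield sampled packets, favouring packets from small flows."""
            seen = {}                           # flow id -> packets seen so far
            for flow_id, payload in packets:
                if flow_id not in seen and len(seen) >= max_flows:
                    seen.pop(next(iter(seen)))  # crude eviction, bounds memory
                seen[flow_id] = seen.get(flow_id, 0) + 1
                keep = p_small if seen[flow_id] <= small_flow_threshold else p_large
                if random.random() < keep:
                    yield flow_id, payload

        stream = [("f1", i) for i in range(100)] + [("f2", 0), ("f3", 0)]
        random.shuffle(stream)
        print(sum(1 for _ in selective_sampler(stream)), "packets sampled")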

  • Toward energy-aware scheduling using machine learning

     Berral Garcia, Josep Lluis; Goiri Presa, Iñigo; Nou Castell, Ramon; Julià Massó, Ferran; Fitó Comellas, Josep Oriol; Guitart Fernández, Jordi; Gavaldà Mestre, Ricard; Torres Viñals, Jordi
    Date of publication: 2012-07-30
    Book chapter

  • Applying trust metrics based on user interactions to recommendation in social networks

     Lumbreras, Alberto; Gavaldà Mestre, Ricard
    IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
    Presentation's date: 2012-08
    Presentation of work at congresses

  • Learning probability distributions generated by finite-state machines

     Castro Rabal, Jorge; Gavaldà Mestre, Ricard
    International Conference on Grammatical Inference
    Presentation's date: 2012-09-05
    Presentation of work at congresses

  • Online techniques for dealing with concept drift in process mining

     Carmona Vargas, Jose; Gavaldà Mestre, Ricard
    International Symposium on Intelligent Data Analysis
    Presentation's date: 2012-12-27
    Presentation of work at congresses

  • Bootstrapping and learning PDFA in data streams  Open access

     De Balle Pigem, Borja; Castro Rabal, Jorge; Gavaldà Mestre, Ricard
    International Conference on Grammatical Inference
    Presentation's date: 2012-09-07
    Presentation of work at congresses

    Markovian models with hidden state are widely-used formalisms for modeling sequential phenomena. Learnability of these models has been well studied when the sample is given in batch mode, and algorithms with PAC-like learning guarantees exist for specific classes of models such as Probabilistic Deterministic Finite Automata (PDFA). Here we focus on PDFA and give an algorithm for inferring models in this class under the stringent data stream scenario: unlike existing methods, our algorithm works incrementally and in one pass, uses memory sublinear in the stream length, and processes input items in amortized constant time. We provide rigorous PAC-like bounds for all of the above, as well as an evaluation on synthetic data showing that the algorithm performs well in practice. Our algorithm makes key use of several old and new sketching techniques. In particular, we develop a new sketch for implementing bootstrapping in a streaming setting which may be of independent interest. In experiments we have observed that this sketch yields important reductions in the examples required for performing some crucial statistical tests in our algorithm.
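
    A common one-pass device for bootstrapping a stream, which may differ in detail from the sketch developed in the paper, is to give each arriving item a Poisson(1) weight per bootstrap replicate, approximating sampling with replacement:

        import math
        import random

        def poisson1(rng):
            """Knuth's method for a Poisson variate with mean 1."""
            threshold = math.exp(-1.0)
            k, p = 0, 1.0
            while True:
                p *= rng.random()
                if p <= threshold:
                    return k
                k += 1

        def streaming_bootstrap_means(stream, replicates=200, seed=0):
            """One-pass bootstrap of the mean's sampling distribution."""
            rng = random.Random(seed)
            sums = [0.0] * replicates
            counts = [0] * replicates
            for x in stream:
                for r in range(replicates):
                    w = poisson1(rng)  # times x "appears" in replicate r
                    sums[r] += w * x
                    counts[r] += w
            return [s / c for s, c in zip(sums, counts) if c]

        means = streaming_bootstrap_means(iter([1, 2, 2, 3, 5, 8, 13]))
        print(min(means), max(means))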

  • A methodology for the evaluation of high response time on E-commerce users and sales

     Poggi, Nicolas; Carrera Perez, David; Gavaldà Mestre, Ricard; Ayguade Parra, Eduard; Torres Viñals, Jordi
    Information systems frontiers
    Date of publication: 2012-10-06
    Journal article

  • Best Student Paper ICGI 2012

     De Balle Pigem, Borja; Castro Rabal, Jorge; Gavaldà Mestre, Ricard
    Award or recognition

  • MINERIA EN DATOS BIOLOGICOS Y SOCIALES: ALGORITMOS, TEORIA E IMPLEMENTACION

     Morrill, Glyn Verden; Quattoni, Ariadna Julieta; Arratia Quesada, Argimiro Alejandro; De Balle Pigem, Borja; Arias Vicente, Marta; Casas Fernandez, Bernardino; Bifet Figuerol, Albert Carles; Berral Garcia, Josep Lluis; Lopez Herrera, Josefina; Baixeries i Juvillà, Jaume; Delgado Pin, Jordi; Belanche Muñoz, Luis Antonio; Castro Rabal, Jorge; Lozano Bojados, Antoni; Ferrer Cancho, Ramon; Sierra Santibañez, Maria Josefina; Gavaldà Mestre, Ricard
    Participation in a competitive project

  • Energy-efficient and multifaceted resource management for profit-driven virtualized data centers

     Goiri, Iñigo; Berral Garcia, Josep Lluis; Fitó Comellas, Josep Oriol; Julià Massó, Ferran; Nou Castell, Ramon; Guitart Fernández, Jordi; Gavaldà Mestre, Ricard; Torres Viñals, Jordi
    Future generation computer systems
    Date of publication: 2012-05
    Journal article

    Since virtualization was introduced in data centers, it has been opening new opportunities for resource management. Nowadays, it is not just used as a tool for consolidating underused nodes and saving power; it also allows new solutions to well-known challenges, such as heterogeneity management. Virtualization helps to encapsulate Web-based applications or HPC jobs in virtual machines (VMs) and see them as a single entity which can be managed in an easier and more efficient way. We propose a new scheduling policy that models and manages a virtualized data center. It focuses on the allocation of VMs in data center nodes according to multiple facets to optimize the provider's profit. In particular, it considers energy efficiency, virtualization overheads, and SLA violation penalties, and supports outsourcing to external providers. The proposed approach is compared to other common scheduling policies, demonstrating that a provider can improve its benefit by 30% and save power while handling other challenges, such as resource outsourcing, in a better and more intuitive way than other typical approaches do.

  • Adaptive scheduling on power-aware managed data-centers using machine learning

     Berral Garcia, Josep Lluis; Gavaldà Mestre, Ricard; Torres Viñals, Jordi
    ACM/IEEE International Conference on Grid Computing
    Presentation's date: 2011-09-22
    Presentation of work at congresses

    Energy-related costs have become one of the major economic factors in IT data-centers, and companies and the research community are currently working on new efficient power-aware resource management strategies, also known as "Green IT". Here we propose an autonomic scheduler of tasks and web-services over cloud environments, focused on optimizing profit: the revenue from executing a set of tasks according to service-level agreements minus costs such as power consumption. The principal contribution is the use of machine learning techniques to predict a priori resource usage, such as CPU consumption, and to estimate task response times based on the monitored data traffic characteristics. Further, in order to optimize the scheduling, an exact solver based on mixed integer linear programming is used as a proof of concept, and also compared to some approximate algorithm solvers to find valid alternatives to the NP-hard problem of exact schedule solving. Experiments show that machine learning algorithms can predict system behaviors with acceptable accuracy; the ILP solver obtains the optimal solution, adjusting the schedule appropriately as profits and power costs change and reducing migrations when their cost is taken into consideration. Finally, we demonstrate that one of the approximate algorithm solvers is much faster than, but close in terms of the optimization goal to, the exact solver.

  • Detecting sentiment change in twitter streaming data  Open access

     Bifet Figuerol, Albert Carles; Holmes, Geoffrey; Pfahringer, Bernhard; Gavaldà Mestre, Ricard
    Workshop on Applications of Pattern Analysis
    Presentation's date: 2011-10-19
    Presentation of work at congresses

    MOA-TweetReader is a system to read tweets in real time, detect changes, and find the terms whose frequency changed. Twitter is a micro-blogging service built to discover what is happening at any moment in time, anywhere in the world. Twitter messages are short, generated constantly, and well suited for knowledge discovery using data stream mining. MOA-TweetReader is a software extension to the MOA framework. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams.
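
    As a much simpler stand-in for the frequency-change component, the sketch below flags terms whose smoothed relative frequency in a recent window differs sharply from a reference window; the window contents, minimum count, and ratio are arbitrary.

        from collections import Counter

        def changed_terms(reference, recent, min_count=5, ratio=2.0):
            """Terms whose relative frequency changed by `ratio` or more."""
            ref, rec = Counter(reference), Counter(recent)
            n_ref = max(sum(ref.values()), 1)
            n_rec = max(sum(rec.values()), 1)
            flagged = {}
            for term in set(ref) | set(rec):
                if ref[term] + rec[term] < min_count:
                    continue                     # too rare to judge
                f_ref = (ref[term] + 1) / n_ref  # add-one smoothing
                f_rec = (rec[term] + 1) / n_rec
                if max(f_rec / f_ref, f_ref / f_rec) >= ratio:
                    flagged[term] = round(f_rec / f_ref, 2)
            return flagged

        reference = "the match was fine fine fine".split() * 3
        recent = "goal goal goal the match exploded goal".split() * 2
        print(changed_terms(reference, recent))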

  • Learning read-constant polynomials of constant degree modulo composites

     Chattopadhyay, Arkadev; Gavaldà Mestre, Ricard; Hansen, Kristoffer Arnsfelt; Thérien, Denis
    International Computer Science Symposium in Russia
    Presentation's date: 2011-06-14
    Presentation of work at congresses

  • Non-intrusive estimation of QoS degradation impact on E-commerce user satisfaction

     Poggi, Nicolas; Carrera Perez, David; Gavaldà Mestre, Ricard; Ayguade Parra, Eduard
    IEEE International Symposium on Network Computing and Applications
    Presentation's date: 2011-08-26
    Presentation of work at congresses

  • SalaboMiner: a biomedical literature mining tool for inferring the genetics of complex diseases

     Rib, Leonor; Gavaldà Mestre, Ricard; Soria, José Manuel; Buil, Alfonso
    International Conference on Bioinformatics Models, Methods and Algorithms
    Presentation's date: 2011-01-26
    Presentation of work at congresses

  • Optimal resource allocation in a virtualized software aging platform with software rejuvenation

     Alonso López, Javier; Goiri Presa, Iñigo; Guitart Fernández, Jordi; Gavaldà Mestre, Ricard; Torres Viñals, Jordi
    IEEE International Symposium on Software Reliability Engineering
    Presentation's date: 2011-11-29
    Presentation of work at congresses

  • Mining frequent closed graphs on evolving data streams.

     Bifet Figuerol, Albert Carles; Holmes, Geoff; Pfahringer, Bernhard; Gavaldà Mestre, Ricard
    ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
    Presentation's date: 2011-08
    Presentation of work at congresses

    Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this paper we present a framework for studying graph pattern mining on time-varying streams. Three new methods for mining frequent closed subgraphs are presented. All methods work on coresets of closed subgraphs, compressed representations of graph sets, and maintain these sets in a batch-incremental manner, but use different approaches to address potential concept drift. An evaluation study on datasets comprising up to four million graphs explores the strength and limitations of the proposed methods. To the best of our knowledge this is the first work on mining frequent closed subgraphs in non-stationary data streams.

  • Mining frequent closed trees in evolving data streams

     Bifet Figuerol, Albert Carles; Gavaldà Mestre, Ricard
    Intelligent data analysis
    Date of publication: 2011
    Journal article

  • XML Tree classification on evolving data streams

     Bifet Figuerol, Albert Carles; Gavaldà Mestre, Ricard
    Date of publication: 2011-11
    Book chapter

  • Proactive Software Rejuvenation solution for web environments on virtualized platforms

     Alonso López, Javier
    Defense's date: 2011-02-21
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

  • Organization of the International Workshop on Social Web Mining

     Gavaldà Mestre, Ricard
    Participation in a competitive project

  • Characterization of workload and resource consumption for an online travel and booking site

     Poggi Mastrokalo, Nicolas; Carrera Perez, David; Gavaldà Mestre, Ricard; Torres Viñals, Jordi; Ayguade Parra, Eduard
    IEEE International Symposium on Workload Characterization
    Presentation's date: 2010-12-02
    Presentation of work at congresses

  • Adaptive on-line software aging prediction based on machine learning  Open access

     Alonso López, Javier; Torres Viñals, Jordi; Berral Garcia, Josep Lluis; Gavaldà Mestre, Ricard
    IEEE/IFIP International Conference on Dependable Systems and Networks
    Presentation's date: 2010-07-28
    Presentation of work at congresses

    The growing complexity of software systems is resulting in an increasing number of software faults. According to the literature, software faults are becoming one of the main sources of unplanned system outages, and have an important impact on company benefits and image. For this reason, many techniques (such as clustering, fail-over techniques, or server redundancy) have been proposed to avoid software failures, and yet they still happen. Many software failures are due to the software aging phenomenon. In this work, we present a detailed evaluation of our chosen machine learning prediction algorithm (M5P) against dynamic and non-deterministic software aging. We have tested our prediction model on a three-tier J2EE web application, achieving acceptable prediction accuracy in complex scenarios with small training data sets. Furthermore, we have found an interesting approach to help determine the root-cause failure: the model generated by the machine learning algorithm.

  • Learning PDFA with asynchronous transitions  Open access

     De Balle Pigem, Borja; Castro Rabal, Jorge; Gavaldà Mestre, Ricard
    International Conference on Grammatical Inference
    Presentation's date: 2010-09-14
    Presentation of work at congresses

    In this paper we extend the PAC learning algorithm due to Clark and Thollard for learning distributions generated by PDFA to automata whose transitions may take varying time lengths, governed by exponential distributions.

  • A lower bound for learning distributions generated by probabilistic automata  Open access

     De Balle Pigem, Borja; Castro Rabal, Jorge; Gavaldà Mestre, Ricard
    International Conference on Algorithmic Learning Theory
    Presentation's date: 2010-10-07
    Presentation of work at congresses

    Known algorithms for learning PDFA can only be shown to run in time polynomial in the so-called distinguishability μ of the target machine, besides the number of states and the usual accuracy and confidence parameters. We show that the dependence on μ is necessary for every algorithm whose structure resembles existing ones. As a technical tool, a new variant of Statistical Queries termed L∞-queries is defined. We show how these queries can be simulated from samples and observe that known PAC algorithms for learning PDFA can be rewritten to access their target using L∞-queries and standard Statistical Queries. Finally, we show a lower bound: every algorithm to learn PDFA using queries with a reasonable tolerance needs a number of queries larger than (1/μ)^c, for every c < 1.

  • Algorithms for Process Conformance and Process Refinement

     Muñoz Gama, Jorge
    Defense's date: 2010-09-09
    Department of Software, Universitat Politècnica de Catalunya
    Theses

  • Geometry of grammar: exercises in Lambek style

     Fadda, Mario
    Defense's date: 2010-07-27
    Department of Software, Universitat Politècnica de Catalunya
    Theses

  • Organization of the ECML/PKDD'10 conferences

     Balcazar Navarro, Jose Luis; Carreras Perez, Xavier; Casas Fernandez, Bernardino; Gavaldà Mestre, Ricard; Berral Garcia, Josep Lluis; De Balle Pigem, Borja; Quattoni, Ariadna Julieta; Bifet Figuerol, Albert Carles; Arias Vicente, Marta
    Participation in a competitive project

  • Towards energy-aware scheduling in data centers using machine learning  Open access

     Berral Garcia, Josep Lluis; Goiri Presa, Iñigo; Nou Castell, Ramon; Julià, Ferran; Guitart Fernández, Jordi; Gavaldà Mestre, Ricard; Torres Viñals, Jordi
    1st International Conference on Energy-Efficient Computing and Networking
    Presentation's date: 2010-04-15
    Presentation of work at congresses

    As energy-related costs have become a major economic factor for IT infrastructures and data-centers, companies and the research community are being challenged to find better and more efficient power-aware resource management strategies. There is a growing interest in "Green" IT and there is still a big gap in this area to be covered. In order to obtain an energy-efficient data center, we propose a framework that provides an intelligent consolidation methodology using different techniques such as turning machines on/off, power-aware consolidation algorithms, and machine learning techniques to deal with uncertain information while maximizing performance. For the machine learning approach, we use models learned from previous system behaviors in order to predict power consumption levels, CPU loads, and SLA timings, and improve scheduling decisions. Our framework is vertical, because it considers everything from watt consumption to workload features, and cross-disciplinary, as it uses a wide variety of techniques. We evaluate these techniques with a framework that covers the whole control cycle of a real scenario, using a simulation with representative heterogeneous workloads, and we measure the quality of the results according to a set of metrics focused toward our goals, besides traditional policies. The results obtained indicate that our approach is close to the optimal placement and behaves better when the level of uncertainty increases.
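
    A toy version of the consolidation step, greedy first-fit-decreasing packing with invented capacities and demands, shows how load can be packed onto few machines so the rest can be turned off; the real framework additionally weighs the model-predicted SLA and power costs.

        def consolidate(jobs_cpu, machine_capacity=1.0):
            """First-fit decreasing bin packing; returns per-machine loads."""
            machines = []                   # each entry: [free_capacity, jobs]
            for job in sorted(jobs_cpu, reverse=True):
                for m in machines:
                    if m[0] >= job:         # fits on an already-on machine
                        m[0] -= job
                        m[1].append(job)
                        break
                else:                       # otherwise power on a new one
                    machines.append([machine_capacity - job, [job]])
            return machines

        jobs = [0.6, 0.3, 0.3, 0.2, 0.5, 0.1]   # hypothetical CPU demands
        placement = consolidate(jobs)
        print(len(placement), "machines on:", [m[1] for m in placement])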

  • J2EE instrumentation for software aging root cause application component determination with AspectJ  Open access

     Alonso López, Javier; Torres Viñals, Jordi; Berral Garcia, Josep Lluis; Gavaldà Mestre, Ricard
    IEEE Workshop on Dependable Parallel, Distributed and Network-Centric System
    Presentation's date: 2010-04-23
    Presentation of work at congresses

    Unplanned system outages have a negative impact on company revenues and image. While the last decades have seen a lot of effort from industry and academia to avoid them, they still happen and their impact is increasing. According to many studies, one of the most important causes of these outages is software aging. The software aging phenomenon refers to the accumulation of errors, usually provoking resource contention, during long-running application executions, like web applications, which normally causes applications or systems to hang or crash. Determining the software aging root-cause failure, not just the resource or resources involved, is a huge task due to the ever-growing complexity of the systems. In this paper we present a monitoring framework based on aspect-oriented programming to monitor, at runtime, the resources used by every application component. Knowing the resources used by every component of the application, we can determine which components are related to the software aging. Furthermore, we present a case study where we evaluate our approach in a web application scenario, determining which components are involved in the software aging, with promising results.
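
    The paper instruments J2EE components with AspectJ; as a loose Python analogy (not the paper's implementation), a decorator can account wall time and peak memory per component call.

        import time
        import tracemalloc
        from collections import defaultdict
        from functools import wraps

        usage = defaultdict(lambda: {"calls": 0, "seconds": 0.0, "peak_bytes": 0})

        def monitored(component):
            """Record wall time and peak memory for each call."""
            def decorator(fn):
                @wraps(fn)
                def wrapper(*args, **kwargs):
                    tracemalloc.start()
                    t0 = time.perf_counter()
                    try:
                        return fn(*args, **kwargs)
                    finally:
                        stats = usage[component]
                        stats["calls"] += 1
                        stats["seconds"] += time.perf_counter() - t0
                        _, peak = tracemalloc.get_traced_memory()
                        stats["peak_bytes"] = max(stats["peak_bytes"], peak)
                        tracemalloc.stop()
                return wrapper
            return decorator

        @monitored("session-cache")
        def build_session(n):
            return list(range(n))

        build_session(100_000)
        print(dict(usage))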

  • Introduction to programming

     Cortadella Fortuny, Jordi; Gavaldà Mestre, Ricard; Orejas Valdes, Fernando
    Date: 2010-09-01
    Report

  • European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2010

     Arias Vicente, Marta; Berral Garcia, Josep Lluis; Quattoni, Ariadna Julieta; De Balle Pigem, Borja; Casas Fernandez, Bernardino; Bifet Figuerol, Albert Carles; Balcazar Navarro, Jose Luis; Gavaldà Mestre, Ricard
    Participation in a competitive project

  • An integer linear programming representation for data-center power-aware management  Open access

     Berral Garcia, Josep Lluis; Gavaldà Mestre, Ricard; Torres Viñals, Jordi
    Date: 2010-11-12
    Report

    This work shows how to represent a grid data-center scheduling problem, taking advantage of virtualization and consolidation techniques, as an integer linear programming (ILP) problem that includes all the factors mentioned. Although ILP is computationally hard, specifying its constraints and optimization function correctly can help find optimal integer solutions in relatively short time. ILP solutions can thus help designers and system managers not only to drive schedulers but also to create new heuristics and holistic functions that quickly approximate the optimal solutions.
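
    A minimal sketch of this kind of ILP formulation, written with the PuLP modelling library (hosts, jobs, power figures, and the default solver are illustrative): binary variables place each job on exactly one host, capacity binds only powered-on hosts, and the objective charges idle power for every host that is on.

        from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

        jobs = {"j1": 0.6, "j2": 0.5, "j3": 0.3}   # CPU demand per job
        hosts = {"h1": 1.0, "h2": 1.0}             # CPU capacity per host
        idle_power = 200                           # watts per powered-on host

        prob = LpProblem("power_aware_placement", LpMinimize)
        x = LpVariable.dicts("place", (jobs, hosts), cat=LpBinary)
        on = LpVariable.dicts("on", hosts, cat=LpBinary)

        for j in jobs:                     # every job placed exactly once
            prob += lpSum(x[j][h] for h in hosts) == 1
        for h in hosts:                    # capacity only on powered hosts
            prob += lpSum(jobs[j] * x[j][h] for j in jobs) <= hosts[h] * on[h]

        prob += lpSum(idle_power * on[h] for h in hosts)   # minimize power

        prob.solve()
        print({h: [j for j in jobs if x[j][h].value() == 1] for h in hosts})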

  • Self-adaptive utility-based web session management

     Poggi, N; Moreno, T; Berral Garcia, Josep Lluis; Gavaldà Mestre, Ricard; Torres Viñals, Jordi
    Computer networks
    Date of publication: 2009-07
    Journal article

  • An algebraic perspective on boolean function learning

     Gavaldà Mestre, Ricard; Therien Campo, Denis
    Lecture notes in artificial intelligence
    Date of publication: 2009
    Journal article

  • Adaptive learning from evolving data streams

     Bifet Figuerol, Albert Carles; Gavaldà Mestre, Ricard
    International Symposium on Intelligent Data Analysis
    Presentation's date: 2009-09-01
    Presentation of work at congresses

  • Improving adaptive bagging methods for evolving data streams

     Bifet Figuerol, Albert Carles; Holmes, Geoffrey; Pfahringer, Bernhard; Gavaldà Mestre, Ricard
    Asian Conference on Machine Learning
    Presentation's date: 2009-11-02
    Presentation of work at congresses

  • Predicting web server crashes: a case study in comparing prediction algorithms

     Alonso López, Javier; Torres Viñals, Jordi; Gavaldà Mestre, Ricard
    International Conference on Autonomic and Autonomous Systems
    Presentation's date: 2009-04-20
    Presentation of work at congresses
