Scientific and technological production

1 to 50 of 198 results
  • Programmable and scalable reductions on clusters

     Ciesko, Jan; Bueno Hedo, Javier; Puzovic, Nikola; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    IEEE International Parallel and Distributed Processing Symposium
    Presentation's date: 2013-05
    Presentation of work at congresses

    Reductions matter and they are here to stay. The wide adoption of parallel processing hardware across a broad range of computer applications has encouraged recent research efforts on their efficient parallelization. Furthermore, the trend towards high-productivity languages in mainstream computing increases the demand for efficient programming support. In this paper we present a new approach to parallel reductions for distributed memory systems that provides both scalability and programmability. Using OmpSs, a task-based parallel programming model, the developer can express scalable reductions through a single pragma annotation. This annotation is applicable to tasks as well as to work-sharing constructs (with implicit tasking) and instructs the compiler to generate the required runtime calls. The supporting runtime handles data and task distribution, parallel execution and data reduction. Scalability is achieved through a software cache that maximizes local and temporal data reuse and allows overlapped computation and communication. Results confirm scalability for up to 32 12-core cluster nodes.
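    The single-pragma reduction style this abstract describes can be sketched roughly as below. The directive spelling is our assumption based on OmpSs/OpenMP conventions, not taken from the paper; a plain C compiler ignores the pragmas and runs the loop sequentially, producing the same result.

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Sketch: chunks of the array are accumulated by tasks into a single
       reduction variable; the runtime combines the partial results.
       Without an OmpSs/OpenMP compiler the pragmas are ignored and the
       code simply runs sequentially. */
    long sum_array(const long *a, int n) {
        long sum = 0;
        for (int i = 0; i < n; i += 4) {
            #pragma omp task reduction(+: sum)
            for (int j = i; j < i + 4 && j < n; ++j)
                sum += a[j];
        }
        #pragma omp taskwait
        return sum;
    }

    int main(void) {
        long a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        printf("%ld\n", sum_array(a, 10));
        return 0;
    }
    ```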

  • Analysis of the Task Superscalar architecture hardware design

     Yazdanpanah Ahmadabadi, Fahimeh; Jimenez Gonzalez, Daniel; Alvarez Martinez, Carlos; Etsion, Yoav; Badia Sala, Rosa Maria
    International Conference on Computational Science
    Presentation's date: 2013-06
    Presentation of work at congresses

    In this paper, we analyze the operational flow of two hardware implementations of the Task Superscalar architecture. The Task Superscalar is an experimental task-based dataflow scheduler that dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks out of order. We present a base implementation of the Task Superscalar architecture, as well as a new design with improved performance. We study the behavior of processing both dependent and non-dependent tasks with the base and improved hardware designs, and compare the simulation results with those of the runtime implementation.

  • Implementing OmpSs support for regions of data in architectures with multiple address spaces

     Bueno Hedo, Javier; Martorell Bofill, Xavier; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    ACM/IEEE International Conference on Supercomputing
    Presentation's date: 2013-06
    Presentation of work at congresses

    The need for features to manage complex data accesses in modern programming models has increased with emerging hardware architectures. HPC hardware has moved towards clusters of accelerators and/or multicores, architectures with a complex memory hierarchy exposed to the programmer. We present the implementation of data regions in the OmpSs programming model, a high-productivity annotation-based programming model derived from OpenMP. It enables the programmer to specify regions of strided and/or overlapped data used by the parallel tasks of the application. The data is automatically managed by the underlying run-time environment, which can transparently apply optimization techniques to improve performance. This high-productivity approach contrasts with more direct approaches such as MPI, where the programmer has to deal with data management explicitly. Since the latter are generally believed to achieve the best possible performance, we also compare the performance of several OmpSs applications against well-known MPI counterpart implementations, obtaining comparable or better results.
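    The region annotations described here can be sketched as follows. The `array[start;size]` region syntax is our assumption about OmpSs dependence notation, not quoted from the paper; with a plain C compiler the pragmas are ignored and the code runs sequentially with the same result.

    ```c
    #include <assert.h>

    #define N  4
    #define BS 2

    /* Sketch: each task declares the BSxBS sub-block of the N x N matrix
       it reads and writes, so the runtime can track dependencies and
       move only that region between address spaces.  A plain C compiler
       ignores the pragma. */
    void scale_block(float (*m)[N], int bi, int bj, float f) {
        #pragma omp task inout(m[bi;BS][bj;BS])
        for (int i = bi; i < bi + BS; ++i)
            for (int j = bj; j < bj + BS; ++j)
                m[i][j] *= f;
    }

    int main(void) {
        float m[N][N];
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                m[i][j] = 1.0f;
        for (int bi = 0; bi < N; bi += BS)
            for (int bj = 0; bj < N; bj += BS)
                scale_block(m, bi, bj, 2.0f);
        #pragma omp taskwait
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                assert(m[i][j] == 2.0f);
        return 0;
    }
    ```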

  • FPGA-based prototype of the task superscalar architecture

     Yazdanpanah Ahmadabadi, Fahimeh; Jimenez Gonzalez, Daniel; Alvarez Martinez, Carlos; Etsion, Yoav; Badia Sala, Rosa Maria
    HiPEAC Workshop on Reconfigurable Computing
    Presentation's date: 2013-01-21
    Presentation of work at congresses

    In this paper, we present the first hardware implementation of a prototype of the Task Superscalar architecture: an experimental task-based dataflow scheduler that dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks out of order. The implemented hardware is based on a distributed design that can operate in parallel and is easily scalable to manage hundreds of cores, in the same way that out-of-order architectures manage functional units. Our prototype operates at nearly 150 MHz, fits in a current commercial FPGA board, and can maintain up to 1024 in-flight tasks, managing the data dependencies in a few cycles.

  • Programmability and portability for exascale: top down programming methodology and tools with StarSs

     Subotic, Vladimir; Brinkmann, Steffen; Marjanovic, Vladimir; Badia Sala, Rosa Maria; Gracia, Jose; Niethammer, Christoph; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    Journal of computational science
    Date of publication: 2013-11
    Journal article

    StarSs is a task-based programming model that allows sequential applications to be parallelized by annotating the code with compiler directives. The model further supports transparent execution of designated tasks on heterogeneous platforms, including clusters of GPUs. This paper focuses on the methodology and tools that complement the programming model, forming a consistent development environment with the objective of simplifying the life of application developers. The programming environment includes the tools TAREADOR and TEMANEJO, which have been designed specifically for StarSs. TAREADOR, a Valgrind-based tool, allows a top-down development approach by assisting the programmer in identifying tasks and their data dependencies across all concurrency levels of an application. TEMANEJO is a graphical debugger that supports the programmer by visualizing the task dependency tree, while also allowing task scheduling and dependencies to be manipulated. These tools are complemented by a set of performance analysis tools (Scalasca, Cube and Paraver) that enable fine-tuning of StarSs applications.

  • Programming and Parallelising Applications for Distributed Infrastructures  Open access

     Tejedor Saavedra, Enric
    Defense's date: 2013-07-15
    Universitat Politècnica de Catalunya
    Theses

    The last decade has witnessed unprecedented changes in parallel and distributed infrastructures. Due to the diminished gains in processor performance from increasing clock frequency, manufacturers have moved from uniprocessor architectures to multicores; as a result, clusters of computers have incorporated such new CPU designs. Furthermore, the ever-growing need of scientific applications for computing and storage capabilities has motivated the appearance of grids: geographically-distributed, multi-domain infrastructures based on sharing of resources to accomplish large and complex tasks. More recently, clouds have emerged by combining virtualisation technologies, service-orientation and business models to deliver IT resources on demand over the Internet. The size and complexity of these new infrastructures poses a challenge for programmers to exploit them. On the one hand, some of the difficulties are inherent to concurrent and distributed programming themselves, e.g. dealing with thread creation and synchronisation, messaging, data partitioning and transfer, etc. On the other hand, other issues are related to the singularities of each scenario, like the heterogeneity of Grid middleware and resources or the risk of vendor lock-in when writing an application for a particular Cloud provider. In the face of such a challenge, programming productivity - understood as a trade-off between programmability and performance - has become crucial for software developers. There is a strong need for high-productivity programming models and languages, which should provide simple means for writing parallel and distributed applications that can run on current infrastructures without sacrificing performance. In that sense, this thesis contributes Java StarSs, a programming model and runtime system for developing and parallelising Java applications on distributed infrastructures.
The model has two key features: first, the user programs in a fully-sequential standard-Java fashion - no parallel construct, API call or pragma must be included in the application code; second, it is completely infrastructure-unaware, i.e. programs do not contain any details about deployment or resource management, so that the same application can run in different infrastructures with no changes. The only requirement for the user is to select the application tasks, which are the model's unit of parallelism. Tasks can be either regular Java methods or web service operations, and they can handle any data type supported by the Java language, namely files, objects, arrays and primitives. For the sake of simplicity of the model, Java StarSs shifts the burden of parallelisation from the programmer to the runtime system. The runtime is responsible for modifying the original application to make it create asynchronous tasks and synchronise data accesses from the main program. Moreover, the implicit inter-task concurrency is automatically found as the application executes, thanks to a data dependency detection mechanism that integrates all the Java data types. This thesis provides a fairly comprehensive evaluation of Java StarSs on three different distributed scenarios: Grid, Cluster and Cloud. For each of them, a runtime system was designed and implemented to exploit their particular characteristics as well as to address their issues, while keeping the infrastructure unawareness of the programming model. The evaluation compares Java StarSs against state-of-the-art solutions, both in terms of programmability and performance, and demonstrates how the model can bring remarkable productivity to programmers of parallel distributed applications.

  • Productive programming of GPU clusters with OmpSs

     Bueno Hedo, Javier; Planas, Judit; Duran Gonzalez, Alejandro; Badia Sala, Rosa Maria; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    IEEE International Parallel and Distributed Processing Symposium
    Presentation's date: 2012
    Presentation of work at congresses

    Clusters of GPUs are emerging as a new computational scenario. Programming them requires the use of hybrid models that increase the complexity of the applications, reducing the productivity of programmers. We present the implementation of OmpSs for clusters of GPUs, which supports asynchrony and heterogeneity for task parallelism. It is based on annotating a serial application with directives that are translated by the compiler. With it, the same program that runs sequentially in a node with a single GPU can run in parallel on multiple GPUs, either local (single node) or remote (cluster of GPUs). Besides performing a task-based parallelization, the runtime system moves the data as needed between the different nodes and GPUs, minimizing the impact of communication by using affinity scheduling, caching, and by overlapping communication with the computational task. We show several applications programmed with OmpSs and their performance with multiple GPUs in a local node and in remote nodes. The results show a good trade-off between performance and programmer effort.
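    The annotation style described in this abstract can be sketched roughly as below. The `target device(cuda)` and dependence-clause spellings are our assumption based on OmpSs conventions, not quoted from the paper; a plain C compiler ignores the pragmas and runs the task sequentially on the host.

    ```c
    #include <assert.h>

    /* Sketch: the same serial function, annotated so an OmpSs runtime
       could offload it as a task to a (possibly remote) GPU and manage
       the data transfers implied by the in/out clauses.  Without an
       OmpSs compiler the pragmas are ignored. */
    #pragma omp target device(cuda) copy_deps
    #pragma omp task in(a[0;n], b[0;n]) out(c[0;n])
    void vadd(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }

    int main(void) {
        float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, c[4];
        vadd(a, b, c, 4);
        #pragma omp taskwait
        for (int i = 0; i < 4; ++i)
            assert(c[i] == 5.0f);
        return 0;
    }
    ```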

  • Transactional access to shared memory in StarSs, a task based programming model

     Gayatri, Rahulkumar; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Lujan, M; Watson, I.
    International Conference on Parallel and Distributed Computing
    Presentation's date: 2012-08-27
    Presentation of work at congresses

    With the increase in the number of processors on a single chip, programming environments that facilitate the exploitation of parallelism on multicore architectures have become a necessity. StarSs is a task-based programming model that enables flexible, high-level programming. Although task synchronization in StarSs is based on data flow and dependency analysis, some applications (e.g. reductions) require locks to access shared data. Transactional Memory is an alternative to lock-based synchronization for controlling access to shared data. In this paper we explore the idea of integrating a lightweight Software Transactional Memory (STM) library, TinySTM, into an implementation of StarSs (SMPSs). The SMPSs runtime and the compiler have been modified to include and use calls to the STM library. We evaluated this approach on four applications and observed better performance in applications with high lock contention.
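    The high-lock-contention pattern the paper targets looks roughly like the portable pthreads sketch below: many tasks funnel updates to shared data through one lock. In the paper's approach the lock/unlock pair around the update would become an STM transaction (TinySTM's actual API is not shown here, since the source does not spell it out).

    ```c
    #include <assert.h>
    #include <pthread.h>

    /* Lock-based baseline: four workers contend on a single mutex to
       update one shared counter.  This is the critical-section shape
       that an STM would replace with a transaction. */
    static long total = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        long n = *(long *)arg;
        for (long i = 0; i < n; ++i) {
            pthread_mutex_lock(&lock);   /* high-contention hot spot */
            total += 1;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        long per = 1000;
        for (int i = 0; i < 4; ++i)
            pthread_create(&t[i], NULL, worker, &per);
        for (int i = 0; i < 4; ++i)
            pthread_join(t[i], NULL);
        assert(total == 4 * per);
        return 0;
    }
    ```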

  • Extracting the optimal sampling frequency of applications using spectral analysis

     Casas Guix, Marc; Servat Gelabert, Harald; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    Concurrency and computation. Practice and experience
    Date of publication: 2012-03-10
    Journal article

  • A high-productivity task-based programming model for clusters

     Tejedor, Enric; Farreras Esclusa, Montserrat; Grove, David; Badia Sala, Rosa Maria; Almási, George; Labarta Mancho, Jesus Jose
    Concurrency and computation. Practice and experience
    Date of publication: 2012-12-15
    Journal article

    Programming for large-scale, multicore-based architectures requires adequate tools that offer ease of programming and do not hinder application performance. StarSs is a family of parallel programming models based on automatic function-level parallelism that targets productivity. StarSs deploys a data-flow model: it analyzes dependencies between tasks and manages their execution, exploiting their concurrency as much as possible. This paper introduces Cluster Superscalar (ClusterSs), a new StarSs member designed to execute on clusters of SMPs (Symmetric Multiprocessors). ClusterSs tasks are asynchronously created and assigned to the available resources with the support of the IBM APGAS runtime, which provides an efficient and portable communication layer based on one-sided communication. We present the design of ClusterSs on top of APGAS, as well as the programming model and execution runtime for Java applications. Finally, we evaluate the productivity of ClusterSs, both in terms of programmability and performance, and compare it to that of the IBM X10 language.

  • Desarrollo de un workflow genérico para el modelado de problemas de barrido paramétrico en sistemas distribuidos  Open access

     Reyes Avila, Sebastian
    Defense's date: 2012-11-23
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    This work presents the development and experimental validation of a generic workflow model applicable to any parameter sweep problem: the Parameter Sweep Scientific Workflow (PSWF) model. As part of it, a model for the monitoring and management of scientific workflows on distributed systems is developed. This model, Star Superscalar Status (SsTAT), is applicable to the StarSs programming model family. PSWF and SsTAT can be used by the scientific community as a reference for solving problems using the parameter sweep strategy. As an integral part of the work, the treatment of the parameter sweep problem is formalized. This is achieved by developing a general solution based on the PSNSS (Parameter Sweep Nested Summation Symbol) algorithm, in both the original sequential version and a concurrent one. Both versions are implemented and validated, showing their applicability to all automatable PSWF lifecycle phases. Load testing shows that large-scale parameter sweep problems can be addressed efficiently with the proposed approach. In addition, the SsTAT monitoring and management generic model is instantiated for a Grid environment. Thus, an operational implementation of SsTAT based on GRIDSs, GSTAT (GRID Superscalar Status), is developed. A series of tests performed on a real heterogeneous Grid of computers shows that GSTAT performs its functions appropriately even in such a demanding environment. As a practical case, the model proposed here is applied to determining molecular potential energy hypersurfaces. For this purpose, a specific instance of the workflow, called PSHYP (Parameter Sweep Hypersurfaces), is created.

  • Programming Model and Run-Time Optimizations for the Cell/B.E.

     Bellens, Pieter
    Defense's date: 2012-09-27
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

  • OPTIMIS: A holistic approach to cloud service provisioning

     Juan, Ana; Hernández, Francisco; Tordsson, Johan; Elmroth, Erik; Ali-Eldin, Ahmed; Zsigri, Csilla; Sirvent Pardell, Raül; Guitart Fernández, Jordi; Badia Sala, Rosa Maria; Djemame, Karim; Ziegler, Wolfgang; Dimitrakos, Theo; Nair, Srijith K.; Kousiouris, George; Konstanteli, Kleopatra; Varvarigou, Theodora; Hudzia, Benoit; Kipp, Alexander; Wesner, Stefan; Corrales, Marcelo; Forgó, Nikolaus; Sharif, Tabassum; Sheridan, Craig
    Future generation computer systems
    Date of publication: 2012-01
    Journal article

    We present fundamental challenges for scalable and dependable service platforms and architectures that enable flexible and dynamic provisioning of cloud services. Our findings are incorporated in a toolkit targeting cloud service and infrastructure providers. The innovations behind the toolkit are aimed at optimizing the whole service life cycle, including service construction, deployment, and operation, on the basis of aspects such as trust, risk, eco-efficiency and cost. Notably, adaptive self-preservation is crucial to meet predicted and unforeseen changes in resource requirements. By addressing the whole service life cycle, taking into account several cloud architectures, and by taking a holistic approach to sustainable service provisioning, the toolkit aims to provide a foundation for a reliable, sustainable, and trustful cloud computing industry.

  • Demonstration of the OPTIMIS toolkit for cloud service provisioning

     Badia Sala, Rosa Maria; Corrales, Marcelo; Dimitrakos, Theo; Djemame, Karim; Elmroth, Erik; Juan Ferrer, Ana; Forgó, Nikolaus; Guitart Fernández, Jordi; Hernández, Francisco; Hudzia, Benoit; Kipp, Alexander; Konstanteli, Kleopatra; Kousiouris, George; Nair, Srijith K.; Sharif, Tabassum; Sheridan, Craig; Sirvent Pardell, Raül; Tordsson, Johan; Varvarigou, Theodora; Wesner, Stefan; Ziegler, Wolfgang; Zsigri, Csilla
    ServiceWave Conference Series
    Presentation's date: 2011
    Presentation of work at congresses

  • ClusterSs: a task-based programming model for clusters

     Tejedor Saavedra, Enric; Farreras Esclusa, Montserrat; Badia Sala, Rosa Maria; Grove, David; Almási, George; Labarta Mancho, Jesus Jose
    International Symposium on High Performance Distributed Computing
    Presentation's date: 2011
    Presentation of work at congresses

  • Productive cluster programming with OmpSs

     Bueno Hedo, Javier; Martinell, Lluis; Duran Gonzalez, Alejandro; Farreras Esclusa, Montserrat; Martorell Bofill, Xavier; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    International European Conference on Parallel and Distributed Computing
    Presentation's date: 2011-09-01
    Presentation of work at congresses

  • Symmetric rank-k update on clusters of multicore processors with SMPSs  Open access

     Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose; Marjanovic, Vladimir; Martín Huertas, Alberto Francisco; Mayo, Rafael; Quintana Ortí, Enrique Salvador; Reyes, Ruymán
    International Conference on Parallel Computing
    Presentation's date: 2011-09
    Presentation of work at congresses

    We investigate the use of the SMPSs programming model to leverage task parallelism in the execution of a message-passing implementation of the symmetric rank-k update on clusters equipped with multicore processors. Our experience shows that the major difficulties in adapting the code to the MPI/SMPSs instance of this programming model are due to the conventional column-major layout of matrices in numerical libraries. On the other hand, the experimental results show a considerable increase in the performance and scalability of our solution compared with the standard options based on a pure MPI approach or a hybrid one that combines MPI with multi-threaded BLAS.

  • G-means improved for Cell BE environment

     Foina, Aislan G.; Badia Sala, Rosa Maria; Ramirez Fernandez, Javier
    Lecture notes in computer science
    Date of publication: 2011-10-01
    Journal article

    We describe the performance gain obtained by adapting the G-means algorithm to a Cell BE environment using the CellSs framework. G-means is a clustering algorithm based on k-means, used to find the number of Gaussian distributions and their centers inside a multi-dimensional dataset. It is normally used for data mining applications, and its execution can be divided into six steps. This paper analyzes each step to select which of them could be improved. In the implementation, the algorithm was modified to use the specific SIMD instructions of the Cell processor and to introduce parallel computing using the CellSs framework to handle the SPU tasks. The hardware used was an IBM BladeCenter QS22 containing two PowerXCell processors. The results show that the algorithm executes 60% faster than the non-improved code.

  • Parallel implementation of the integral histogram

     Bellens, Pieter; Palaniappan, Kannappan; Badia Sala, Rosa Maria; Seetharaman, Guna; Labarta Mancho, Jesus Jose
    Lecture notes in computer science
    Date of publication: 2011-08-01
    Journal article

    The integral histogram is a recently proposed preprocessing technique to compute histograms of arbitrary rectangular gridded (i.e. image or volume) regions in constant time. We formulate a general parallel version of the integral histogram and analyse its implementation in Star Superscalar (StarSs). StarSs provides a uniform programming and runtime environment and facilitates the development of portable code for heterogeneous parallel architectures. In particular, we discuss the implementation for the multi-core IBM Cell Broadband Engine (Cell/B.E.) and provide extensive performance measurements and trade-offs using two different scan orders or histogram propagation methods. For 640 × 480 images, a tile or block size of 28×28 and 16 histogram bins, the parallel algorithm reaches greater than real-time performance of more than 200 frames per second.
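    The integral histogram itself is a well-defined recurrence: I(x,y) = h(x,y) + I(x-1,y) + I(x,y-1) - I(x-1,y-1), after which any rectangle's histogram comes from four corner lookups in constant time. A minimal scalar sketch (the paper's contribution, not shown here, is parallelising this propagation with StarSs):

    ```c
    #include <assert.h>
    #include <string.h>

    #define W    4
    #define H    4
    #define BINS 2

    /* I[y][x][b] = count of bin b over the rectangle (0,0)..(x,y). */
    static int I[H][W][BINS];

    void build(const int img[H][W]) {
        memset(I, 0, sizeof I);
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                for (int b = 0; b < BINS; ++b) {
                    int v    = (img[y][x] == b);            /* h(x,y,b) */
                    int left = x ? I[y][x-1][b] : 0;
                    int up   = y ? I[y-1][x][b] : 0;
                    int diag = (x && y) ? I[y-1][x-1][b] : 0;
                    I[y][x][b] = v + left + up - diag;      /* cross-weave */
                }
    }

    /* Constant-time histogram count of bin b over (x0,y0)..(x1,y1). */
    int region(int x0, int y0, int x1, int y1, int b) {
        int a = I[y1][x1][b];
        int l = x0 ? I[y1][x0-1][b] : 0;
        int u = y0 ? I[y0-1][x1][b] : 0;
        int d = (x0 && y0) ? I[y0-1][x0-1][b] : 0;
        return a - l - u + d;
    }

    int main(void) {
        const int img[H][W] = {{0,1,0,1},{1,1,0,0},{0,0,1,1},{1,0,1,0}};
        build(img);
        assert(region(0, 0, W-1, H-1, 0) == 8);  /* whole image: 8 zeros */
        assert(region(0, 0, 1, 1, 1) == 3);      /* top-left 2x2: 3 ones */
        return 0;
    }
    ```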

  • Efficient OpenMP over sequentially consistent distributed shared memory systems  Open access

     Costa Prats, Juan Jose
    Defense's date: 2011-07-20
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Nowadays clusters are one of the most used platforms in High Performance Computing, and most programmers use the Message Passing Interface (MPI) library to program their applications on these distributed platforms to obtain maximum performance, although it is a complex task. On the other hand, OpenMP has been established as the de facto standard for programming applications on shared memory platforms, because it is easy to use and obtains good performance without too much effort. So, could it be possible to join both worlds? Could programmers use the ease of OpenMP on distributed platforms? Many researchers think so, and one of the ideas developed is distributed shared memory (DSM): a software layer on top of a distributed platform that gives an abstract shared-memory view to the applications. Even though it seems a good solution, it also has drawbacks: the memory coherence between the nodes in the platform is difficult to maintain (complex management, scalability issues, high overhead, among others), and the latency of remote-memory accesses can be orders of magnitude greater than on a shared bus due to the interconnection network. This research therefore improves the performance of OpenMP applications executed on distributed memory platforms using a DSM with sequential consistency, evaluating the results thoroughly with the NAS parallel benchmarks. The vast majority of DSM designs use a relaxed consistency model because it avoids some major problems in the area. In contrast, we use a sequential consistency model because we think that exposing potential problems that would otherwise remain hidden may lead to solutions applicable to both models. The main idea behind this work is that both runtimes, OpenMP and the DSM layer, should cooperate to achieve good performance; otherwise they interfere with each other, trashing the final performance of applications.
We develop three different contributions to improve the performance of these applications: (a) a technique to avoid false sharing at runtime, (b) a technique to mimic the MPI behaviour, where produced data is forwarded to its consumers, and (c) a mechanism to avoid network congestion due to the DSM coherence messages. The NAS Parallel Benchmarks are used to test the contributions. The results of this work show that false sharing is a relative problem that depends on each application. Another result is the importance of moving the data flow outside the critical path; using techniques that forward data as early as possible, similar to MPI, benefits the final application performance. Additionally, this data movement is usually concentrated at single points and affects application performance due to the limited bandwidth of the network, so it is necessary to provide mechanisms that allow this data to be distributed throughout the computation time, using an otherwise idle network. Finally, the results show that the proposed contributions improve the performance of OpenMP applications in this kind of environment.

  • Poster: Programming clusters of GPUs with OMPSs

     Bueno Hedo, Javier; Duran Gonzalez, Alejandro; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Badia Sala, Rosa Maria
    International Conference for High Performance Computing, Networking, Storage and Analysis
    Presentation's date: 2011-11-18
    Presentation of work at congresses


  • BSC contributions in energy-aware resource management for large scale distributed systems  Open access

     Torres Viñals, Jordi; Ayguade Parra, Eduard; Carrera Perez, David; Guitart Fernández, Jordi; Beltran Querol, Vicenç; Becerra Fontal, Yolanda; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    Workshop of the COST Action IC0804 on Energy Efficiency in Large Scale Distributed Systems
    Presentation's date: 2010-04-15
    Presentation of work at congresses


    This paper introduces the work being carried out at the Barcelona Supercomputing Center in the area of Green Computing. We have been working on resource management for a long time, and recently we included energy as a parameter in the decision process, considering that, for a more sustainable science, the paradigm will shift from "time to solution" to "kWh to the solution". We present our proposals organized in four points that follow the cloud computing stack. For each point we enumerate the latest achievements, to be published during 2010, which form the basis of our future research. To conclude, we review our ongoing and future research work and give an overview of the projects in which BSC participates.

  • Handling task dependencies under strided and aliased references

     Pérez Cáncer, Josep Maria; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    International Conference on Supercomputing
    Presentation's date: 2010-06
    Presentation of work at congresses


    The emergence of multicore processors has increased the need for simple parallel programming models usable by nonexperts. The ability to specify subparts of a bigger data structure is an important trait of High Productivity Programming Languages. Such a concept can also be applied to dependency-aware task-parallel programming models. In those paradigms, tasks may have data dependencies, which are used to schedule them in parallel. However, calculating dependencies between subparts of bigger data structures is challenging. Accessed data may be strided, and can fully or partially overlap the accesses of other tasks. Techniques that are too approximate may produce too many extra dependencies and limit parallelism. Techniques that are too precise may be impractical in terms of time and space. We present the abstractions, data structures and algorithms to calculate dependencies between tasks with strided and possibly different memory access patterns. Our technique is performed at run time from a description of the inputs and outputs of each task and is not affected by pointer arithmetic or reshaping. We demonstrate how it can be applied to increase programming productivity. We also demonstrate that scalability is comparable to other solutions and in some cases higher due to better parallelism extraction.

  • Task superscalar: an out-of-order task pipeline  Open access

     Etsion, Yoav; Cabarcas Jaramillo, Felipe; Rico Carro, Alejandro; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    IEEE/ACM International Symposium on Microarchitecture
    Presentation's date: 2010-12-07
    Presentation of work at congresses


    We present Task Superscalar, an abstraction of the instruction-level out-of-order pipeline that operates at the task level. Like ILP pipelines, which uncover parallelism in a sequential instruction stream, task superscalar uncovers task-level parallelism among tasks generated by a sequential thread. Utilizing intuitive programmer annotations of task inputs and outputs, the task superscalar pipeline dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks out of order. Furthermore, we propose a design for a distributed task superscalar pipeline front-end that can be embedded into any many-core fabric and manages cores as functional units. We show that our proposed mechanism is capable of driving hundreds of cores simultaneously with non-speculative tasks, which allows our pipeline to sustain work windows consisting of tens of thousands of tasks. We further show that our pipeline can maintain a decode rate faster than 60 ns per task and dynamically uncover data dependencies among as many as ~50,000 in-flight tasks, using 7 MB of on-chip eDRAM storage. This configuration achieves speedups of 95-255x (average 183x) over sequential execution for nine scientific benchmarks, running on a simulated CMP with 256 cores. Task superscalar thus enables programmers to exploit many-core systems effectively, while simultaneously simplifying their programming model.

  • Parallel programming models for heterogeneous multicore architectures

     Ferrer, Roger; Bellens, Pieter; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Yeom, Jae-Seung; Schneider, Scott; Koukos, Konstantinos; Alvanos, Michail; Nikolopoulos, Dimitrios S.; Bilas, Angelos
    IEEE micro
    Date of publication: 2010-09-01
    Journal article


  • Spectral analysis of executions of computer programs and its applications on performance analysis  Open access

     Casas Guix, Marc
    Defense's date: 2010-03-09
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


    This work is motivated by the growing intricacy of high performance computing infrastructures. For example, the MareNostrum supercomputer (installed at BSC in 2005) has 10,240 processors, and machines with more than 100,000 processors already exist. The complexity of these systems increases the complexity of manually analyzing the performance of parallel applications, making automatic tools and methodologies mandatory. The performance analysis group of BSC and UPC has long experience in analyzing parallel applications. Its approach consists mainly in analyzing tracefiles (obtained from executions of parallel applications) with performance analysis and visualization tools such as Paraver. Given the general characteristics of current systems, this method can be very expensive in terms of time, and inefficient. To overcome these problems, this thesis makes several contributions. The first is an automatic system able to detect the internal structure of executions of high performance computing applications. It rules out non-significant regions of executions, detects redundancies and, finally, selects small but significant execution regions. This automatic detection is based on spectral analysis (wavelet transform, Fourier transform, etc.) and works by detecting the most important frequencies of the application's execution. These main frequencies are strongly related to the internal loops of the application's source code, and automatically detecting small but significant execution regions remarkably reduces the complexity of the performance analysis process.
    The second contribution is an automatic methodology able to show general but nontrivial performance trends, which can be very useful for the analyst when carrying out a performance analysis of the application. The methodology is based on an analytical model consisting of several performance factors that modify the value of the linear speedup in order to fit the real speedup: if the real speedup is far from the linear one, we immediately detect which performance factor is undermining the scalability of the application. The model can also be used to predict the performance of high performance computing applications: from several executions on a few processors, we extract the model's performance factors and extrapolate their values to executions on a higher number of processors, obtaining a speedup prediction. The third contribution is the automatic detection of the optimal sampling frequency of applications. We show that this frequency can be extracted using spectral analysis. For sequential applications, using this frequency improves existing results of recognized techniques focused on reducing a serial application's instruction execution stream (SimPoint, SMARTS, etc.). For parallel benchmarks, the optimal frequency is very useful for extracting significant performance information efficiently and accurately. In summary, this thesis proposes a set of techniques based on signal processing whose main focus is the automatic analysis of applications, reporting an initial diagnostic of their performance and showing their internal iterative structure. These methods also provide a reduced tracefile from which it is easy to start manual fine-grain performance analysis.
    The contributions of the thesis are not limited to proposals and publications: the research carried out over these last years has produced a tool for analyzing the structure of applications. Moreover, the methodology is general and can be adapted to many performance analysis methods, remarkably improving their efficiency, flexibility and generality.

  • HiPEAC Paper Award

     Etsion, Yoav; Cabarcas Jaramillo, Felipe; Rico Carro, Alejandro; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    Award or recognition


  • Exploiting semantics and virtualization for SLA-driven resource allocation in service providers

     Ejarque, Jorge; de Palol, Marc; Goiri Presa, Iñigo; Julià, Ferran; Guitart Fernández, Jordi; Badia Sala, Rosa Maria; Torres Viñals, Jordi
    Concurrency and computation. Practice and experience
    Date of publication: 2010-04-01
    Journal article


    Resource management is a key challenge that service providers must adequately face in order to accomplish their business goals. This paper introduces a framework, the semantically enhanced resource allocator (SERA), aimed at facilitating service provider management, reducing costs while fulfilling the QoS agreed with the customers. The SERA assigns resources depending on the information given by the service providers according to their business goals and on the resource requirements of the tasks. Tasks and resources are semantically described, and these descriptions are used to infer the resource assignments. Virtualization is used to provide an application-specific and isolated virtual environment for each task. In addition, the system supports fine-grain dynamic resource distribution among these virtual environments based on Service-Level Agreements. The required adaptation is implemented using agents, guaranteeing enough resources for each task to meet the agreed performance goals.

  • Extending OpenMP to survive the heterogeneous multi-core era

     Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Bellens, Pieter; Cabrera, Daniel; Duran González, Alejandro; Ferrer, Roger; Gonzalez Tallada, Marc; Igual, Francisco D.; Jimenez Gonzalez, Daniel; Labarta Mancho, Jesus Jose; Martinell, Lluis; Martorell Bofill, Xavier; Mayo, Rafael; Pérez Cáncer, Josep Maria; Planas, Judit; Quintana Ortí, Enrique Salvador
    International journal of parallel programming
    Date of publication: 2010-10
    Journal article


    This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to easily write portable code for a number of different platforms, relieving him/her from developing the specific code to off-load tasks to the accelerators and synchronize them. Our results, obtained from the StarSs instantiations for SMPs, the Cell, and GPUs, report reasonable parallel performance. However, the real impact of our approach is the productivity gain it yields for the programmer.

  • An extension of the starSs programming model for platforms with multiple GPUs

     Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Igual, Francisco D.; Labarta Mancho, Jesus Jose; Mayo, Rafael; Quintana Ortí, Enrique Salvador
    International European Conference on Parallel and Distributed Computing
    Presentation's date: 2009-08
    Presentation of work at congresses


  • A proposal to extend the OpenMP tasking model for heterogeneous architectures

     Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Cabrera, Daniel; Duran Gonzalez, Alejandro; Igual, Francisco D.; Jimenez Gonzalez, Daniel; Labarta Mancho, Jesus Jose; Mayo, Rafael; Pérez, Josep M.; Quintana Ortí, Enrique Salvador; Martorell Bofill, Xavier; Gonzalez Tallada, Marc
    International Workshop on OpenMP
    Presentation's date: 2009-06-03
    Presentation of work at congresses


  • Introducing virtual execution environment for application lifecycle management and SLA-driven resource distribution within service providers  Open access

     Goiri, Iñigo; Julià, Ferran; Ejarque, Jorge; de Palol, Marc; Badia Sala, Rosa Maria; Guitart Fernández, Jordi; Torres Viñals, Jordi
    IEEE International Symposium on Network Computing and Applications
    Presentation's date: 2009-07-10
    Presentation of work at congresses


    Resource management is a key challenge that service providers must adequately face in order to ensure their profitability. This paper describes a proof-of-concept framework for facilitating resource management in service providers, which allows reducing costs while fulfilling the quality of service agreed with the customers. This is accomplished by means of virtualization. Our approach provides application-specific virtual environments and consolidates them in order to achieve a better utilization of the provider's resources. In addition, it implements self-adaptive capabilities for dynamically distributing the provider's resources among these virtual environments based on Service Level Agreements. The proposed solution has been implemented as part of the Semantically-Enhanced Resource Allocator prototype developed within the BREIN European project. The evaluation shows that our prototype is able to react in a very short time under changing conditions and to avoid SLA violations by efficiently rescheduling resources.

  • Just-in-Time Renaming and Lazy Write-Back on the Cell/B.E.

     Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    International Conference on Parallel Processing
    Presentation of work at congresses


  • CellSs: Scheduling techniques to better exploit memory hierarchy

     Bellens, Pieter; Perez, Josep M.; Cabarcas Jaramillo, Felipe; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    Scientific programming
    Date of publication: 2009-01
    Journal article


  • GRID superscalar: a programming model for the Grid.

     Sirvent Pardell, Raül
    Defense's date: 2009-02-03
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


  • Performance Monitoring of Grid Superscalar: Summing UP

     Badia Sala, Rosa Maria; Sirvent Pardell, Raül; Funika, Wlodzimierz; Machner, Piotr; Bubak, Marian
    Date of publication: 2009-02
    Book chapter


  • MPEXPAR: Parallel Programming Models and Execution Environments

     Gonzalez Tallada, Marc; Alonso López, Javier; Sirvent Pardell, Raül; Guitart Fernández, Jordi; Carrera Perez, David; Martorell Bofill, Xavier; Torres Viñals, Jordi; Badia Sala, Rosa Maria; Cortes Rossello, Antonio; Corbalan Gonzalez, Julita; Costa Prats, Juan Jose; Farreras Esclusa, Montserrat; Gil Gómez, Maria Luisa; Navarro Mas, Nacho; Herrero Zaragoza, José Ramón; Tejedor Saavedra, Enric; Becerra Fontal, Yolanda; Nou Castell, Ramon; Labarta Mancho, Jesus Jose; Ayguade Parra, Eduard
    Participation in a competitive project


  • Impact of the memory hierarchy on shared memory architectures in multicore programming models

     Badia Sala, Rosa Maria
    Euromicro International Conference on Parallel, Distributed, and Network-Based Processing
    Presentation's date: 2009-02-18
    Presentation of work at congresses


  • Impact of the memory hierarchy on shared memory architectures in multicore programming models  Open access

     Badia Sala, Rosa Maria; Pérez Cáncer, Josep Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    Euromicro International Conference on Parallel, Distributed, and Network-Based Processing
    Presentation's date: 2009-01
    Presentation of work at congresses


    Many-core and multicore architectures put great pressure on parallel programming, but they also give a unique opportunity to propose new programming models that automatically exploit the parallelism of these architectures. OpenMP is a well-known standard that exploits parallelism in shared memory architectures. SMPSs has recently been proposed as a task-based programming model that exploits parallelism at the task level and takes into account data dependencies between tasks. However, besides parallelism in the programming model, the impact of the memory hierarchy in many-core/multicore architectures is of great importance. This paper presents an evaluation of these two programming models with regard to the impact of the different levels of the memory hierarchy on application execution time. The evaluation is based on tracefiles with hardware counters from executions of a memory-intensive benchmark in both programming models.

  • Parallelizing dense and banded linear algebra libraries using SMPSs

     Badia Sala, Rosa Maria; Herrero Zaragoza, José Ramón; Labarta Mancho, Jesus Jose; Perez, Josep M.; Quintana Ortí, Enrique Salvador; Quintana-Ortí, Gregorio
    Concurrency and Computation: Practice and Experience
    Date of publication: 2009-12-25
    Journal article


    The promise of future many-core processors, with hundreds of threads running concurrently, has led the developers of linear algebra libraries to rethink their design in order to extract more parallelism, further exploit data locality, attain better load balance, and pay careful attention to the critical path of computation. In this paper we describe how existing serial libraries such as (C)LAPACK and FLAME can be easily parallelized using the SMPSs tools, consisting of a few OpenMP-like pragmas and a runtime system. In the LAPACK case, this usually requires the development of blocked algorithms for simple BLAS-level operations, which expose concurrency at a finer grain. For better performance, our experimental results indicate that column-major order, as employed by this library, needs to be abandoned in benefit of a block data layout. This will require a deeper rewrite of LAPACK or, alternatively, a dynamic conversion of the storage pattern at run-time. The parallelization of FLAME routines using SMPSs is simpler as this library includes blocked algorithms (or algorithms-by-blocks in the FLAME argot) for most operations and storage-by-blocks (or block data layout) is already in place.

  • A Proposal to Extend the OpenMP Tasking Model with Dependent Tasks

     Duran Gonzalez, Alejandro; Ferrer, Roger; Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    International journal of parallel programming
    Date of publication: 2009-06
    Journal article


  • Hierarchical Task-Based Programming With StarSs

     Planas, Judit; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    International journal of high performance computing applications
    Date of publication: 2009-08
    Journal article


  • EMOTIVE: the BSC's engine for cloud solutions  Open access

     Goiri Presa, Iñigo; Guitart Fernández, Jordi; Macias Lloret, Mario; Torres Viñals, Jordi; Ayguade Parra, Eduard; Ejarque, Jorge; Sirvent Pardell, Raül; Lezzi, Daniele; Badia Sala, Rosa Maria
    Zero-In eMagazine: Building Insights, Breaking Boundaries
    Date of publication: 2009-10-01
    Journal article


    Cloud computing is strongly based on virtualization, allowing applications to be multiplexed onto a physical resource while isolated from other applications sharing that physical resource. This technology simplifies the management of e-Infrastructures, but also requires additional effort if users are to benefit from it. Cloud computing must hide its underlying complexity from users: the key is to provide users with a simple but functional interface for accessing IT resources "as a service", while allowing providers to build cost-effective self-managed systems for transparently managing these resources. System developers should also be supported with simple tools that allow them to exploit the facilities of cloud infrastructures.

  • COMP Superscalar: Bringing GRID superscalar and GCM together

     Tejedor, Enric; Badia Sala, Rosa Maria
    Eighth IEEE International Symposium on Cluster Computing and the Grid
    Presentation of work at congresses


  • Extending the OpenMP Tasking Model to Allow Dependent Tasks

     Duran González, Alejandro; Pérez, Josep M.; Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    4th International Workshop on OpenMP (IWOMP 2008). OpenMP in a New Era of Parallelism.
    Presentation of work at congresses


  • Special section: Selected papers from the 7th IEEE/ACM international conference on grid computing (Grid2006)

     Badia Sala, Rosa Maria; Gannon, Dennis; Lee, Craig
    Future generation computer systems
    Date of publication: 2008-05
    Journal article


  • Grid Superscalar and Job Mapping on the Reliable Grid Resources

     Badia Sala, Rosa Maria; Anciaux, Ani; Sirvent, Raul; Pérez, Josep M.
    Date of publication: 2008-08
    Book chapter


  • A Component-Based Integrated Toolkit

     Badia Sala, Rosa Maria; Tejedor, Enric; Kielmann, Thilo; Getov, Vladimir
    Date of publication: 2008-08
    Book chapter


  • Orchestrating a Safe Functional Suspension of GCM Components

     Tejedor, Enric; Badia Sala, Rosa Maria
    Integrated Research in Grid Computing
    Presentation of work at congresses


  • SLA-Driven Semantically-Enhanced Dynamic Resource Allocator for Virtualized Service Providers

     Ejarque, Jorge; de Palol, Marc; Goiri, Iñigo; Julià, Ferran; Guitart Fernández, Jordi; Badia Sala, Rosa Maria; Torres Viñals, Jordi
    IEEE International Conference on eScience
    Presentation's date: 2008-12-07
    Presentation of work at congresses
