Scientific and technological production

1 to 50 of 218 results
  • Leveraging task-parallelism in message-passing dense matrix factorizations using SMPSs  Open access

     Martín Huertas, Alberto Francisco; Reyes, Ruyman; Badia Sala, Rosa Maria; Quintana Ortí, Enrique Salvador
    Parallel computing
    Date of publication: 2014-05
    Journal article

    In this paper, we investigate how to exploit task-parallelism during the execution of the Cholesky factorization on clusters of multicore processors with the SMPSs programming model. Our analysis reveals that the major difficulties in adapting the code for this operation in ScaLAPACK to SMPSs lie in algorithmic restrictions and the semantics of the SMPSs programming model, but also that they both can be overcome with a limited programming effort. The experimental results report considerable gains in performance and scalability of the routine parallelized with SMPSs when compared with conventional approaches to execute the original ScaLAPACK implementation in parallel as well as two recent message-passing routines for this operation. In summary, our study opens the door to the possibility of reusing message-passing legacy codes/libraries for linear algebra, by introducing up-to-date techniques like dynamic out-of-order scheduling that significantly upgrade their performance, while avoiding a costly rewrite/reimplementation.

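    The dependency-driven tasking described above is easiest to picture on a shared-memory tiled Cholesky. The sketch below is illustrative only, assuming SMPSs-style "css task" directive spellings and naive kernel bodies; it is not the paper's ScaLAPACK-based code. Each annotated call becomes a task, and the input/inout clauses let the runtime build the dependency graph and schedule tasks out of order.

      #include <math.h>

      #define NB 64                         /* tile size, illustrative */
      typedef float tile_t[NB][NB];

      /* Each annotated kernel becomes a task; input/inout clauses let the
       * runtime infer inter-task dependencies. */
      #pragma css task inout(A)
      void potrf_tile(tile_t A) {           /* unblocked Cholesky of one tile */
        for (int j = 0; j < NB; j++) {
          for (int k = 0; k < j; k++)
            for (int i = j; i < NB; i++) A[i][j] -= A[i][k] * A[j][k];
          A[j][j] = sqrtf(A[j][j]);
          for (int i = j + 1; i < NB; i++) A[i][j] /= A[j][j];
        }
      }

      #pragma css task input(L) inout(B)
      void trsm_tile(tile_t L, tile_t B) {  /* B := B * L^-T */
        for (int i = 0; i < NB; i++)
          for (int j = 0; j < NB; j++) {
            for (int k = 0; k < j; k++) B[i][j] -= B[i][k] * L[j][k];
            B[i][j] /= L[j][j];
          }
      }

      #pragma css task input(A, B) inout(C)
      void gemm_tile(tile_t A, tile_t B, tile_t C) {  /* C := C - A * B^T */
        for (int i = 0; i < NB; i++)
          for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++) C[i][j] -= A[i][k] * B[j][k];
      }

      /* The driver stays sequential-looking; each call spawns a task and
       * the runtime executes the resulting graph out of order as the
       * operands become ready. */
      void cholesky(int nt, tile_t *T) {    /* T: nt*nt tiles, row-major */
        for (int k = 0; k < nt; k++) {
          potrf_tile(T[k * nt + k]);
          for (int i = k + 1; i < nt; i++)
            trsm_tile(T[k * nt + k], T[i * nt + k]);
          for (int i = k + 1; i < nt; i++)
            for (int j = k + 1; j <= i; j++)   /* j == i is the SYRK case */
              gemm_tile(T[i * nt + k], T[j * nt + k], T[i * nt + j]);
        }
      }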

  • Programmability and portability for exascale: top down programming methodology and tools with StarSs

     Subotic, Vladimir; Brinkmann, Steffen; Marjanovic, Vladimir; Badia Sala, Rosa Maria; Gracia, Jose; Niethammer, Christoph; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    Journal of computational science
    Date of publication: 2013-11
    Journal article

    StarSs is a task-based programming model that allows sequential applications to be parallelized by annotating the code with compiler directives. The model further supports transparent execution of designated tasks on heterogeneous platforms, including clusters of GPUs. This paper focuses on the methodology and tools that complement the programming model, forming a consistent development environment with the objective of simplifying the life of application developers. The programming environment includes the tools TAREADOR and TEMANEJO, which have been designed specifically for StarSs. TAREADOR, a Valgrind-based tool, allows a top-down development approach by assisting the programmer in identifying tasks and their data dependencies across all concurrency levels of an application. TEMANEJO is a graphical debugger that supports the programmer by visualizing the task dependency tree and by allowing task scheduling or dependencies to be manipulated. These tools are complemented with a set of performance analysis tools (Scalasca, Cube and Paraver) that enable fine-tuning of StarSs applications.

  • Programming and Parallelising Applications for Distributed Infrastructures  Open access

     Tejedor Saavedra, Enric
    Defense's date: 2013-07-15
    Universitat Politècnica de Catalunya
    Theses

    The last decade has witnessed unprecedented changes in parallel and distributed infrastructures. Due to the diminished gains in processor performance from increasing clock frequency, manufacturers have moved from uniprocessor architectures to multicores; as a result, clusters of computers have incorporated such new CPU designs. Furthermore, the ever-growing need of scientific applications for computing and storage capabilities has motivated the appearance of grids: geographically-distributed, multi-domain infrastructures based on sharing of resources to accomplish large and complex tasks. More recently, clouds have emerged by combining virtualisation technologies, service-orientation and business models to deliver IT resources on demand over the Internet. The size and complexity of these new infrastructures pose a challenge for programmers to exploit them. On the one hand, some of the difficulties are inherent to concurrent and distributed programming themselves, e.g. dealing with thread creation and synchronisation, messaging, data partitioning and transfer, etc. On the other hand, other issues are related to the singularities of each scenario, like the heterogeneity of Grid middleware and resources or the risk of vendor lock-in when writing an application for a particular Cloud provider. In the face of such a challenge, programming productivity - understood as a tradeoff between programmability and performance - has become crucial for software developers. There is a strong need for high-productivity programming models and languages, which should provide simple means for writing parallel and distributed applications that can run on current infrastructures without sacrificing performance. In that sense, this thesis contributes with Java StarSs, a programming model and runtime system for developing and parallelising Java applications on distributed infrastructures. The model has two key features: first, the user programs in a fully-sequential standard-Java fashion - no parallel construct, API call or pragma must be included in the application code; second, it is completely infrastructure-unaware, i.e. programs do not contain any details about deployment or resource management, so that the same application can run in different infrastructures with no changes. The only requirement for the user is to select the application tasks, which are the model's unit of parallelism. Tasks can be either regular Java methods or web service operations, and they can handle any data type supported by the Java language, namely files, objects, arrays and primitives. For the sake of simplicity of the model, Java StarSs shifts the burden of parallelisation from the programmer to the runtime system. The runtime is responsible for modifying the original application to make it create asynchronous tasks and synchronise data accesses from the main program. Moreover, the implicit inter-task concurrency is automatically found as the application executes, thanks to a data dependency detection mechanism that integrates all the Java data types. This thesis provides a fairly comprehensive evaluation of Java StarSs on three different distributed scenarios: Grid, Cluster and Cloud. For each of them, a runtime system was designed and implemented to exploit their particular characteristics as well as to address their issues, while keeping the infrastructure unawareness of the programming model. The evaluation compares Java StarSs against state-of-the-art solutions, both in terms of programmability and performance, and demonstrates how the model can bring remarkable productivity to programmers of parallel distributed applications.

  • Self-adaptive OmpSs tasks in heterogeneous environments  Open access

     Planas Carbonell, Judit; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    IEEE International Parallel and Distributed Processing Symposium
    Presentation's date: 2013-05
    Presentation of work at congresses

    As new heterogeneous systems and hardware accelerators appear, high performance computers can reach a higher level of computational power. Nevertheless, this does not come for free: the more heterogeneity the system presents, the more complex the programming task becomes in terms of resource management. OmpSs is a task-based programming model and framework focused on the runtime exploitation of parallelism from annotated sequential applications. This paper presents a set of extensions to this framework: we show how the application programmer can expose different specialized versions of tasks (i.e. pieces of specific code targeted and optimized for a particular architecture) and how the system can choose between these versions at runtime to obtain the best achievable performance for the given application. From the results obtained in a multi-GPU system, we show that our proposal gives flexibility to the application's source code and can potentially increase the application's performance.

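    A minimal sketch of the versioning idea follows, assuming OmpSs-style "target device" and "implements" clause spellings (all identifiers are illustrative): two implementations of the same task are declared, and the scheduler is free to pick, per invocation, the version expected to run fastest on an available device.

      /* Host (SMP) version of the task. */
      #pragma omp target device(smp)
      #pragma omp task in(a[0;n]) inout(c[0;n])
      void scale_smp(const float *a, float *c, int n) {
        for (int i = 0; i < n; i++) c[i] *= a[i];
      }

      /* Alternative version of the SAME task for a GPU; "implements"
       * tells the runtime the two are interchangeable, so the scheduler
       * can choose (and learn) the fastest device per invocation. */
      #pragma omp target device(cuda) implements(scale_smp)
      #pragma omp task in(a[0;n]) inout(c[0;n])
      void scale_gpu(const float *a, float *c, int n);  /* wraps a CUDA kernel */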

  • FPGA-based prototype of the task superscalar architecture

     Yazdanpanah Ahmadabadi, Fahimeh; Jimenez Gonzalez, Daniel; Alvarez Martinez, Carlos; Etsion, Yoav; Badia Sala, Rosa Maria
    HiPEAC Workshop on Reconfigurable Computing
    Presentation's date: 2013-01-21
    Presentation of work at congresses

    In this paper, we present the first hardware implementation of a prototype of the Task Superscalar architecture, an experimental task-based dataflow scheduler that dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks out of order. The implemented hardware is based on a distributed design that can operate in parallel and is easily scalable to manage hundreds of cores in the same way that out-of-order architectures manage functional units. Our prototype operates at nearly 150 MHz, fits in a current commercial FPGA board, and can maintain up to 1024 in-flight tasks, managing the data dependencies in a few cycles.

  • Loop level speculation in a task based programming model  Open access

     Gayatri, Rahulkumar; Badia Sala, Rosa Maria; Ayguade Parra, Eduard
    International Conference on High Performance Computing
    Presentation's date: 2013-12
    Presentation of work at congresses

    Uncountable loops (such as while-loops in C) and if-conditions are some of the most common constructs in programming. While-loops are widely used to determine convergence in linear algebra algorithms or goal-finding problems from graph algorithms, to name a few. In general, while-loops are used whenever the loop iteration space (the number of iterations a loop executes) is unknown. Usually in while-loops, the execution of the next iteration is decided inside the current loop iteration (i.e. the execution of iteration i depends on the values computed in iteration i-1). This precludes their parallel execution on today's ubiquitous multi-core architectures. In this paper, a technique to speculatively create parallel tasks from the next iterations before the current one completes is proposed. If consecutive loop iterations are only control-dependent, then multiple iterations can be executed simultaneously; later in the execution path, the runtime system will decide either to commit the results of such speculatively executed iterations or to undo the changes made by them. Data dependences within or between non-speculative and speculative work are honored to guarantee correctness. The proposed technique is implemented in SMPSs, a task-based dataflow programming model for shared-memory multiprocessor architectures. The approach is evaluated on a set of applications from graph algorithms and linear algebra. Results are promising, with an average increase in speedup of 1.2x with 16 threads when compared to non-speculative execution of the applications. The increase in speedup is significant, since the performance gain is achieved over an already parallelized version of the benchmarks.

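    The idea can be pictured on a loop whose iterations are only control-dependent. The plain-C sketch below is conceptual, not the paper's SMPSs syntax; test_candidate and the speculation depth K are invented for illustration. A batch of future iterations runs before the exit condition is known, and results past the exit point are squashed.

      #include <stdbool.h>

      enum { K = 4 };                    /* speculation depth, illustrative */

      static bool test_candidate(int i)  /* stand-in for one loop iteration */
      { return i == 7; }

      int find_first(void) {
        for (int i = 0; ; i += K) {
          bool hit[K];
          /* Speculative phase: in the paper's scheme these K calls would
           * be spawned as parallel tasks before iteration i has decided
           * whether the loop exits. */
          for (int s = 0; s < K; s++) hit[s] = test_candidate(i + s);
          /* Commit phase: accept results in program order; any work past
           * the first hit is mis-speculated and simply discarded. */
          for (int s = 0; s < K; s++)
            if (hit[s]) return i + s;
        }
      }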

  • Implementing OmpSs support for regions of data in architectures with multiple address spaces

     Bueno Hedo, Javier; Martorell Bofill, Xavier; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    ACM/IEEE International Conference on Supercomputing
    Presentation's date: 2013-06
    Presentation of work at congresses

    The need for features for managing complex data accesses in modern programming models has increased due to emerging hardware architectures. HPC hardware has moved towards clusters of accelerators and/or multicores, architectures with a complex memory hierarchy exposed to the programmer. We present the implementation of data regions in the OmpSs programming model, a high-productivity annotation-based programming model derived from OpenMP. This enables the programmer to specify regions of strided and/or overlapped data used by the parallel tasks of the application. The data is automatically managed by the underlying runtime environment, which can transparently apply optimization techniques to improve performance. This approach based on a high-productivity programming model contrasts with more direct approaches like MPI, where the programmer has to deal with data management explicitly. It is generally believed that the latter are capable of achieving the best possible performance, so we also compare the performance of several OmpSs applications against well-known counterpart MPI implementations, obtaining comparable or better results.

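    A minimal sketch of a region-annotated task, assuming OmpSs-style region syntax of the form a[lower;size] per dimension (the exact spelling varies across OmpSs releases): each task touches one block of a larger matrix, and the runtime reasons about strided and overlapping regions rather than whole arrays.

      #define N  1024
      #define BS 128

      /* One task updates a BSxBS block of an NxN matrix; declaring the
       * dependency as a region lets the runtime detect overlaps between
       * blocks instead of serializing on the whole array. */
      #pragma omp task inout(A[i;BS][j;BS])
      void update_block(float (*A)[N], int i, int j) {
        for (int ii = i; ii < i + BS; ii++)
          for (int jj = j; jj < j + BS; jj++)
            A[ii][jj] *= 0.5f;
      }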

  • Analysis of the Task Superscalar architecture hardware design  Open access

     Yazdanpanah Ahmadabadi, Fahimeh; Jimenez Gonzalez, Daniel; Alvarez Martinez, Carlos; Etsion, Yoav; Badia Sala, Rosa Maria
    International Conference on Computational Science
    Presentation's date: 2013-06
    Presentation of work at congresses

    In this paper, we analyze the operational flow of two hardware implementations of the Task Superscalar architecture. The Task Superscalar is an experimental task-based dataflow scheduler that dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks in an out-of-order manner. We present a base implementation of the Task Superscalar architecture, as well as a new design with improved performance. We study the behavior of processing dependent and independent tasks with both the base and the improved hardware designs, and present the simulation results compared with those of the runtime implementation.

  • Programmable and scalable reductions on clusters

     Ciesko, Jan; Bueno Hedo, Javier; Puzovic, Nikola; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    IEEE International Parallel and Distributed Processing Symposium
    Presentation's date: 2013-05
    Presentation of work at congresses

    Reductions matter and they are here to stay. Wide adoption of parallel processing hardware in a broad range of computer applications has encouraged recent research efforts on their efficient parallelization. Furthermore, the trend towards high-productivity languages in mainstream computing increases the demand for efficient programming support. In this paper we present a new approach to parallel reductions for distributed memory systems that provides both scalability and programmability. Using OmpSs, a task-based parallel programming model, the developer can express scalable reductions through a single pragma annotation. This pragma annotation is applicable to tasks as well as to work-sharing constructs (with implicit tasking) and instructs the compiler to generate the required runtime calls. The supporting runtime handles data and task distribution, parallel execution and data reduction. Scalability is achieved through a software cache that maximizes local and temporal data reuse and allows overlapped computation and communication. Results confirm scalability for up to 32 12-core cluster nodes.

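    A minimal sketch of the pattern, with the clause spelling reduction(+ : sum) assumed by analogy with OpenMP: each task accumulates one block into a runtime-managed private copy of sum, and the copies are combined when the reduction completes.

      #define BS 1024L

      double block_sum(const double *x, long n) {   /* n a multiple of BS */
        double sum = 0.0;
        for (long b = 0; b < n / BS; b++) {
          /* each task gets a private partial sum; the runtime combines
           * them (on a cluster, through the software cache) */
          #pragma omp task in(x[b*BS;BS]) reduction(+ : sum)
          for (long i = b * BS; i < (b + 1) * BS; i++)
            sum += x[i];
        }
        #pragma omp taskwait
        return sum;
      }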

  • Extracting the optimal sampling frequency of applications using spectral analysis

     Casas Guix, Marc; Servat Gelabert, Harald; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    Concurrency and computation. Practice and experience
    Date of publication: 2012-03-10
    Journal article

  • A high-productivity task-based programming model for clusters

     Tejedor, Enric; Farreras Esclusa, Montserrat; Grove, David; Badia Sala, Rosa Maria; Almási, George; Labarta Mancho, Jesus Jose
    Concurrency and computation. Practice and experience
    Date of publication: 2012-12-15
    Journal article

    Programming for large-scale, multicore-based architectures requires adequate tools that offer ease of programming and do not hinder application performance. StarSs is a family of parallel programming models based on automatic function-level parallelism that targets productivity. StarSs deploys a data-flow model: it analyzes dependencies between tasks and manages their execution, exploiting their concurrency as much as possible. This paper introduces Cluster Superscalar (ClusterSs), a new StarSs member designed to execute on clusters of SMPs (Symmetric Multiprocessors). ClusterSs tasks are asynchronously created and assigned to the available resources with the support of the IBM APGAS runtime, which provides an efficient and portable communication layer based on one-sided communication. We present the design of ClusterSs on top of APGAS, as well as the programming model and execution runtime for Java applications. Finally, we evaluate the productivity of ClusterSs, in terms of both programmability and performance, and compare it to that of the IBM X10 language.

  • OPTIMIS: A holistic approach to cloud service provisioning

     Juan, Ana; Hernández, Francisco; Tordsson, Johan; Elmroth, Erik; Ali-Eldin, Ahmed; Zsigri, Csilla; Sirvent Pardell, Raül; Guitart Fernández, Jordi; Badia Sala, Rosa Maria; Djemame, Karim; Ziegler, Wolfgang; Dimitrakos, Theo; Nair, Srijith K.; Kousiouris, George; Konstanteli, Kleopatra; Varvarigou, Theodora; Hudzia, Benoit; Kipp, Alexander; Wesner, Stefan; Corrales, Marcelo; Forgó, Nikolaus; Sharif, Tabassum; Sheridan, Craig
    Future generation computer systems
    Date of publication: 2012-01
    Journal article

    We present fundamental challenges for scalable and dependable service platforms and architectures that enable flexible and dynamic provisioning of cloud services. Our findings are incorporated in a toolkit targeting cloud service and infrastructure providers. The innovations behind the toolkit are aimed at optimizing the whole service life cycle, including service construction, deployment, and operation, on the basis of aspects such as trust, risk, eco-efficiency and cost. Notably, adaptive self-preservation is crucial to meet predicted and unforeseen changes in resource requirements. By addressing the whole service life cycle, taking into account several cloud architectures, and by taking a holistic approach to sustainable service provisioning, the toolkit aims to provide a foundation for a reliable, sustainable, and trustful cloud computing industry.

  • A dynamic load balancing approach with SMPSuperscalar and MPI

     Garcia Gasulla, Marta; Corbalan Gonzalez, Julita; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    Lecture notes in computer science
    Date of publication: 2012-06
    Journal article

    In this paper we compare the performance obtained with the hybrid programming models MPI+SMPSs and MPI+OpenMP, especially when executing with a Dynamic Load Balancing (DLB) library. We first describe the SMPSuperscalar programming model and how it hybridizes nicely with MPI. We also explain LeWI, the load balancing algorithm for hybrid applications, and how it can improve their performance, and analyze why SMPSs is able to exploit the benefits of LeWI further than OpenMP. The performance results show not only how the performance of hybrid applications can be improved with LeWI, but also the benefit of using the hybrid programming model MPI+SMPSs for load balancing instead of MPI+OpenMP.

  • HIPEAC 3 - European Network of Excellence on High Performance Embedded Architecture and Compilers

     Navarro Mas, Nacho; Gil Gómez, Maria Luisa; Martorell Bofill, Xavier; Valero Cortes, Mateo; Ayguade Parra, Eduard; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose; Llaberia Griño, Jose M.
    Participation in a competitive project

  • Integración de modelos de programación paralela en entornos de computación científica

     Niño Ramos, Alfonso; Badia Sala, Rosa Maria
    Participation in a competitive project

  • Productive programming of GPU clusters with OmpSs

     Bueno Hedo, Javier; Planas, Judit; Duran Gonzalez, Alejandro; Badia Sala, Rosa Maria; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    IEEE International Parallel and Distributed Processing Symposium
    Presentation's date: 2012
    Presentation of work at congresses

    Clusters of GPUs are emerging as a new computational scenario. Programming them requires the use of hybrid models that increase the complexity of the applications, reducing the productivity of programmers. We present the implementation of OmpSs for clusters of GPUs, which supports asynchrony and heterogeneity for task parallelism. It is based on annotating a serial application with directives that are translated by the compiler. With it, the same program that runs sequentially in a node with a single GPU can run in parallel on multiple GPUs, either local (single node) or remote (cluster of GPUs). Besides performing a task-based parallelization, the runtime system moves the data as needed between the different nodes and GPUs, minimizing the impact of communication by using affinity scheduling, caching, and overlapping communication with the computational task. We show several applications programmed with OmpSs and their performance with multiple GPUs in a local node and in remote nodes. The results show a good tradeoff between performance and effort from the programmer.

  • Transactional access to shared memory in StarSs, a task based programming model

     Gayatri, Rahulkumar; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Lujan, M; Watson, I.
    International Conference on Parallel and Distributed Computing
    Presentation's date: 2012-08-27
    Presentation of work at congresses

    With an increase in the number of processors on a single chip, programming environments which facilitate the exploitation of parallelism on multicore architectures have become a necessity. StarSs is a task-based programming model that enables flexible and high-level programming. Although task synchronization in StarSs is based on data flow and dependency analysis, some applications (e.g. reductions) require locks to access shared data. Transactional Memory is an alternative to lock-based synchronization for controlling access to shared data. In this paper we explore the idea of integrating a lightweight Software Transactional Memory (STM) library, TinySTM, into an implementation of StarSs (SMPSs). The SMPSs runtime and the compiler have been modified to include and use calls to the STM library. We evaluated this approach on four applications and observe better performance in applications with high lock contention.
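
    A sketch of the resulting pattern: tm_begin/tm_load/tm_store/tm_commit below are hypothetical wrappers standing in for the TinySTM calls emitted by the modified SMPSs compiler and runtime, and the task directive spelling is likewise assumed. Two tasks updating the histogram now conflict only when they touch the same element, instead of serializing on a lock.

      enum { NBINS = 64 };

      /* Hypothetical wrappers for the TinySTM entry points; retry on
       * abort is assumed to happen inside them. */
      void tm_begin(void);
      long tm_load(const long *addr);
      void tm_store(long *addr, long v);
      void tm_commit(void);

      /* A task updating a shared histogram: the lock around the update
       * is replaced by a transaction, so independent bins proceed in
       * parallel and only true conflicts roll back. */
      #pragma css task inout(hist[NBINS])
      void add_sample(long *hist, int bin) {
        tm_begin();
        long v = tm_load(&hist[bin]);
        tm_store(&hist[bin], v + 1);
        tm_commit();
      }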

  • Desarrollo de un workflow genérico para el modelado de problemas de barrido paramétrico en sistemas distribuidos  Open access

     Reyes Avila, Sebastian
    Defense's date: 2012-11-23
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    This work presents the development and experimental validation of a generic workflow model applicable to any parameter sweep problem: the Parameter Sweep Scientific Workflow (PSWF) model. As part of it, a model for the monitoring and management of scientific workflows on distributed systems is developed. This model, Star Superscalar Status (SsTAT), is applicable to the StarSs programming model family. PSWF and SsTAT can be used by the scientific community as a reference for solving problems using the parameter sweep strategy. As an integral part of the work, the treatment of the parameter sweep problem is formalized. This is achieved by developing a general solution based on the PSNSS (Parameter Sweep Nested Summation Symbol) algorithm, in both its original sequential form and a concurrent one. Both versions are implemented and validated, showing their applicability to all automatable phases of the PSWF lifecycle. Load testing shows that large-scale parameter sweep problems can be addressed efficiently with the proposed approach. In addition, the SsTAT monitoring and management generic model is instantiated for a Grid environment, yielding an operational implementation of SsTAT based on GRIDSs: GSTAT (GRID Superscalar Status). A series of tests performed on a real, heterogeneous Grid of computers shows that GSTAT performs its functions appropriately even in such a demanding environment. As a practical case, the proposed model is applied to the determination of molecular potential energy hypersurfaces, creating for this purpose a specific instance of the workflow called PSHYP (Parameter Sweep Hypersurfaces).

  • Programming Model and Run-Time Optimizations for the Cell/B.E.

     Bellens, Pieter
    Defense's date: 2012-09-27
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

  • OmpSs-OpenCL programming model for heterogeneous systems

     Elangovan, Vinoth Krishnan; Badia Sala, Rosa Maria; Ayguade Parra, Eduard
    International Workshop on Languages and Compilers for Parallel Computing
    Presentation's date: 2012-09
    Presentation of work at congresses

    The advent of heterogeneous computing has forced programmers to use platform-specific programming paradigms in order to achieve maximum performance. This approach has a steep learning curve for programmers and also has a detrimental influence on productivity and code re-usability. To help with this situation, OpenCL, an open-source parallel computing API for cross-platform computation, was conceived. OpenCL provides a homogeneous view of the computational resources (CPU and GPU), thereby enabling software portability across different platforms. Although OpenCL resolves software portability issues, the programming paradigm presents low programmability and additionally falls short in performance. In this paper we focus on integrating the OpenCL framework with the OmpSs task-based programming model, using the Nanos runtime infrastructure, to address these shortcomings. This enables the programmer to skip cumbersome OpenCL constructs, including OpenCL platform creation, compilation, kernel building, kernel argument setting and memory transfers, and instead write a sequential program with annotated pragmas. Our proposal mainly focuses on how to exploit the best of the underlying hardware platform with greater ease of programming and to gain significant performance using the data parallelism offered by the OpenCL runtime for GPUs and multicore architectures. We have evaluated the platform with important benchmarks and have noticed substantial ease in programming with comparable performance.

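    The contrast is easiest to see in code. With raw OpenCL the host must create platforms, contexts, queues, programs and buffers explicitly; under an OmpSs-style integration all of that collapses to an annotated kernel declaration plus an ordinary call. The sketch below assumes clause spellings such as ndrange and file, and the kernel itself is illustrative.

      /* An OpenCL kernel exposed as an OmpSs task. The runtime would
       * build the kernel, create contexts and queues, set arguments and
       * move a[] and b[] between host and device automatically; none of
       * that boilerplate appears in user code. */
      #pragma omp target device(opencl) ndrange(1, n, 128) copy_deps file(scale.cl)
      #pragma omp task in(a[0;n]) out(b[0;n])
      void scale(float *a, float *b, int n);   /* body lives in scale.cl */

      void caller(float *a, float *b, int n) {
        scale(a, b, n);        /* looks like a plain call; runs on a device */
        #pragma omp taskwait   /* wait for the result before using b[] */
      }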

  • Parallel implementation of the integral histogram

     Bellens, Pieter; Palaniappan, Kannappan; Badia Sala, Rosa Maria; Seetharaman, Guna; Labarta Mancho, Jesus Jose
    Lecture notes in computer science
    Date of publication: 2011-08-01
    Journal article

    The integral histogram is a recently proposed preprocessing technique to compute histograms of arbitrary rectangular gridded (i.e. image or volume) regions in constant time. We formulate a general parallel version of the integral histogram and analyse its implementation in Star Superscalar (StarSs). StarSs provides a uniform programming and runtime environment and facilitates the development of portable code for heterogeneous parallel architectures. In particular, we discuss the implementation for the multi-core IBM Cell Broadband Engine (Cell/B.E.) and provide extensive performance measurements and tradeoffs using two different scan orders or histogram propagation methods. For 640 × 480 images, a tile or block size of 28×28 and 16 histogram bins, the parallel algorithm is able to reach greater than real-time performance of more than 200 frames per second.
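
    A minimal scalar sketch of the underlying data structure (which the paper then tiles and parallelizes with StarSs); dimensions and bin count are illustrative. ih[y][x][b] holds the histogram of the rectangle from (0,0) to (x,y), so any rectangle's histogram costs O(bins) by inclusion-exclusion.

      enum { W = 640, H = 480, NBINS = 16 };

      /* Build pass: one raster scan; each cell extends the histograms of
       * its left, upper and upper-left neighbours. */
      void build(const unsigned char img[H][W], int ih[H][W][NBINS]) {
        for (int y = 0; y < H; y++)
          for (int x = 0; x < W; x++)
            for (int b = 0; b < NBINS; b++) {
              int hit = (img[y][x] * NBINS) / 256 == b;  /* pixel in bin b? */
              ih[y][x][b] = hit + (y ? ih[y-1][x][b] : 0)
                                + (x ? ih[y][x-1][b] : 0)
                                - (y && x ? ih[y-1][x-1][b] : 0);
            }
      }

      /* Histogram of rectangle (x0,y0)..(x1,y1), inclusive, in O(NBINS)
       * time by inclusion-exclusion -- the constant-time query. */
      void query(const int ih[H][W][NBINS], int x0, int y0, int x1, int y1,
                 int out[NBINS]) {
        for (int b = 0; b < NBINS; b++)
          out[b] = ih[y1][x1][b]
                 - (y0 ? ih[y0-1][x1][b] : 0)
                 - (x0 ? ih[y1][x0-1][b] : 0)
                 + (y0 && x0 ? ih[y0-1][x0-1][b] : 0);
      }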

  • G-means improved for Cell BE environment

     Foina, Aislan G.; Badia Sala, Rosa Maria; Ramirez Fernandez, Javier
    Lecture notes in computer science
    Date of publication: 2011-10-01
    Journal article

    The performance gain obtained by the adaptation of the G-means algorithm for a Cell BE environment using the CellSs framework is described. G-means is a clustering algorithm based on k-means, used to find the number of Gaussian distributions and their centers inside a multi-dimensional dataset. It is normally used for data mining applications, and its execution can be divided into six steps. This paper analyzes each step to select which of them could be improved. In the implementation, the algorithm was modified to use the specific SIMD instructions of the Cell processor and to introduce parallel computing using the CellSs framework to handle the SPU tasks. The hardware used was an IBM BladeCenter QS22 containing two PowerXCell processors. The results show that the algorithm executes 60% faster than the non-improved code.

  • Demonstration of the OPTIMIS toolkit for cloud service provisioning

     Badia Sala, Rosa Maria; Corrales, Marcelo; Dimitrakos, Theo; Djemame, Karim; Elmroth, Erik; Juan Ferrer, Ana; Forgó, Nikolaus; Guitart Fernández, Jordi; Hernández, Francisco; Hudzia, Benoit; Kipp, Alexander; Konstanteli, Kleopatra; Kousiouris, George; Nair, Srijith K.; Sharif, Tabassum; Sheridan, Craig; Sirvent Pardell, Raül; Tordsson, Johan; Varvarigou, Theodora; Wesner, Stefan; Ziegler, Wolfgang; Zsigri, Csilla
    Lecture notes in computer science
    Date of publication: 2011-10
    Journal article

    We demonstrate the OPTIMIS toolkit for scalable and dependable service platforms and architectures that enable flexible and dynamic provisioning of Cloud services. The innovations demonstrated are aimed at optimizing Cloud services and infrastructures based on aspects such as trust, risk, eco-efficiency, cost, performance and legal constraints. Adaptive self-preservation is part of the toolkit to meet predicted and unforeseen changes in resource requirements. By taking into account the whole service life cycle, the multitude of future Cloud architectures, and by taking a holistic approach to sustainable service provisioning, the toolkit provides a foundation for a reliable, sustainable, and trustful Cloud computing industry.

  • Optimizing the exploitation of multicore processors and GPUs with OpenMP and OpenCL

     Ferrer, Roger; Planas Carbonell, Judit; Bellens, Pieter; Duran Gonzalez, Alejandro; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    Lecture notes in computer science
    Date of publication: 2011
    Journal article

    In this paper, we present OMPSs, a programming model based on OpenMP and StarSs that can also incorporate the use of OpenCL or CUDA kernels. We evaluate the proposal on three different architectures, SMP, Cell/B.E. and GPUs, showing the broad applicability of the approach. The evaluation is done with four different benchmarks: Matrix Multiply, BlackScholes, Perlin Noise, and Julia Set. We compare the results obtained with the execution of the same benchmarks written in OpenCL on the same architectures. The results show that OMPSs greatly outperforms the OpenCL environment: it is more flexible when exploiting multiple accelerators and, due to the simplicity of the annotations, it increases the programmer's productivity.

  • Making the best of temporal locality: Just-in-time renaming and lazy write-back on the Cell/B.E.

     Bellens, Pieter; Perez, Josep M.; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    International journal of high performance computing applications
    Date of publication: 2011-05
    Journal article

    Cell Superscalar (CellSs) provides a simple, flexible and easy programming approach for the Cell Broadband Engine (Cell/B.E.) that automatically exploits the inherent concurrency of applications at a function or task level. The CellSs environment is based on a source-to-source compiler that translates annotated C or Fortran code and a runtime library tailored for the Cell/B.E. that orchestrates the concurrent execution of the application. We introduce a technique called bypassing that allows CellSs to perform core-to-core Direct Memory Access (DMA) transfers for generic applications. In this review we concisely summarize the bypassing practice and introduce two improvements: just-in-time renaming and lazy write-back. These extensions come at no additional cost and potentially increase performance by improving the perceived bandwidth of the Element Interconnect Bus (EIB). Experiments on five fundamental linear algebra kernels demonstrate the applicability of these techniques and quantify the benefit that can be reaped. We also present performance results for a first prototype of CellSs with bypassing.

  • Poster: programming clusters of GPUs with OMPSs

     Bueno Hedo, Javier; Duran Gonzalez, Alejandro; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Badia Sala, Rosa Maria
    International Conference for High Performance Computing, Networking, Storage and Analysis
    Presentation's date: 2011-11-18
    Presentation of work at congresses

  • EU-Brazil OpenBIO

     Badia Sala, Rosa Maria; Lezzi, Daniele
    Participation in a competitive project

  • Productive cluster programming with OmpSs

     Bueno Hedo, Javier; Martinell, Lluis; Duran Gonzalez, Alejandro; Farreras Esclusa, Montserrat; Martorell Bofill, Xavier; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    International European Conference on Parallel and Distributed Computing
    Presentation's date: 2011-09-01
    Presentation of work at congresses

  • Demonstration of the OPTIMIS toolkit for cloud service provisioning

     Badia Sala, Rosa Maria; Corrales, Marcelo; Dimitrakos, Theo; Djemame, Karim; Elmroth, Erik; Juan Ferrer, Ana; Forgó, Nikolaus; Guitart Fernández, Jordi; Hernández, Francisco; Hudzia, Benoit; Kipp, Alexander; Konstanteli, Kleopatra; Kousiouris, George; Nair, Srijith K.; Sharif, Tabassum; Sheridan, Craig; Sirvent Pardell, Raül; Tordsson, Johan; Varvarigou, Theodora; Wesner, Stefan; Ziegler, Wolfgang; Zsigri, Csilla
    ServiceWave Conference Series
    Presentation's date: 2011
    Presentation of work at congresses

  • ClusterSs: a task-based programming model for clusters

     Tejedor Saavedra, Enric; Farreras Esclusa, Montserrat; Badia Sala, Rosa Maria; Grove, David; Almási, George; Labarta Mancho, Jesus Jose
    International Symposium on High Performance Distributed Computing
    Presentation's date: 2011
    Presentation of work at congresses

  • Symmetric rank-k update on clusters of multicore processors with SMPSs  Open access

     Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose; Marjanovic, Vladimir; Martín Huertas, Alberto Francisco; Mayo, Rafael; Quintana Ortí, Enrique Salvador; Reyes, Ruymán
    International Conference on Parallel Computing
    Presentation's date: 2011-09
    Presentation of work at congresses

    We investigate the use of the SMPSs programming model to leverage task parallelism in the execution of a message-passing implementation of the symmetric rank-k update on clusters equipped with multicore processors. Our experience shows that the major difficulties in adapting the code to the MPI/SMPSs instance of this programming model are due to the usage of the conventional column-major layout of matrices in numerical libraries. On the other hand, the experimental results show a considerable increase in the performance and scalability of our solution when compared with the standard options based on the use of a pure MPI approach or a hybrid one that combines MPI with multi-threaded BLAS.

  • Exploiting semantics and virtualization for SLA-driven resource allocation in service providers

     Ejarque, Jorge; de Palol, Marc; Goiri Presa, Iñigo; Julià, Ferran; Guitart Fernández, Jordi; Badia Sala, Rosa Maria; Torres Viñals, Jordi
    Concurrency and computation. Practice and experience
    Date of publication: 2010-04-01
    Journal article

    Resource management is a key challenge that service providers must adequately face in order to accomplish their business goals. This paper introduces a framework, the Semantically Enhanced Resource Allocator (SERA), aimed at facilitating service provider management, reducing costs and at the same time fulfilling the QoS agreed with the customers. SERA assigns resources depending on the information given by the service providers according to their business goals and on the resource requirements of the tasks. Tasks and resources are semantically described, and these descriptions are used to infer the resource assignments. Virtualization is used to provide an application-specific and isolated virtual environment for each task. In addition, the system supports fine-grain dynamic resource distribution among these virtual environments based on Service-Level Agreements. The required adaptation is implemented using agents, guaranteeing enough resources to each task in order to meet the agreed performance goals.

  • Extending OpenMP to survive the heterogeneous multi-core era

     Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Bellens, Pieter; Cabrera, Daniel; Duran González, Alejandro; Ferrer, Roger; Gonzalez Tallada, Marc; Igual, Francisco D.; Jimenez Gonzalez, Daniel; Labarta Mancho, Jesus Jose; Martinell, Lluis; Martorell Bofill, Xavier; Mayo, Rafael; Pérez Cáncer, Josep Maria; Planas, Judit; Quintana Ortí, Enrique Salvador
    International journal of parallel programming
    Date of publication: 2010-10
    Journal article


    This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to easily write portable code for a number of different platforms, relieving him or her from developing the specific code to off-load tasks to the accelerators and to synchronize those tasks. Our results, obtained from the StarSs instantiations for SMPs, the Cell, and GPUs, report reasonable parallel performance. However, the real impact of our approach lies in the productivity gains it yields for the programmer.
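
    Extensions of this kind prefigure what OpenMP later standardized as the target construct. As a hedged illustration only (using the standardized OpenMP 4.x syntax rather than the paper's exact clauses), off-loading a loop then reduces to annotations instead of hand-written transfer code:

        /* SAXPY off-loaded to an accelerator; the map clauses replace the
         * explicit data-transfer code the programmer would otherwise write. */
        void saxpy(int n, float a, const float *x, float *y)
        {
            #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
            #pragma omp teams distribute parallel for
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }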

  • Automatic phase detection and structure extraction of MPI applications

     Casas Guix, Marc; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    International journal of high performance computing applications
    Date of publication: 2010-08
    Journal article


    In this paper we present an automatic system able to detect the internal structure of executions of high-performance computing applications. This system is able to rule out non-significant regions of executions, to detect redundancies and, finally, to select small but significant execution regions. The automatic detection process is based on spectral analysis (wavelet transform, Fourier transform, etc.) and works by detecting the most important frequencies of the application's execution. These main frequencies are strongly related to the internal loops of the application's source code. The automatic detection of small but significant execution regions shown in the paper remarkably reduces the load of the performance analysis process.
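
    To make the core idea concrete, the sketch below (illustrative only; the paper works on execution traces with wavelet and Fourier analysis, not this naive code) finds the dominant frequency of a sampled execution signal with a direct DFT; the main iteration period is then n / k samples.

        #include <math.h>

        #ifndef M_PI
        #define M_PI 3.14159265358979323846
        #endif

        /* Return the index k (1 <= k < n/2) of the strongest DFT bin of
         * `signal` (e.g. an instantaneous MPI-call rate sampled n times);
         * the application's main loop period is roughly n / k samples. */
        int dominant_frequency(const double *signal, int n)
        {
            int best_k = 1;
            double best_mag = -1.0;
            for (int k = 1; k < n / 2; k++) {
                double re = 0.0, im = 0.0;
                for (int t = 0; t < n; t++) {
                    double phi = 2.0 * M_PI * (double)k * t / n;
                    re += signal[t] * cos(phi);
                    im -= signal[t] * sin(phi);
                }
                double mag = re * re + im * im;
                if (mag > best_mag) { best_mag = mag; best_k = k; }
            }
            return best_k;
        }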

  • Spectral analysis of executions of computer programs and its applications on performance analysis  Open access

     Casas Guix, Marc
    Defense's date: 2010-03-09
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


    This work is motivated by the growing intricacy of high-performance computing infrastructures. For example, the MareNostrum supercomputer (installed at BSC in 2005) has 10,240 processors, and machines with more than 100,000 processors already exist. The complexity of these systems increases the complexity of manually analyzing the performance of parallel applications, making automatic tools and methodologies indispensable.

    The performance analysis group of BSC and UPC has long experience in analyzing parallel applications. Its approach consists mainly in analyzing tracefiles (obtained from executions of parallel applications) with performance analysis and visualization tools such as Paraver. Given the general characteristics of current systems, this method can be very expensive in time and sometimes inefficient. To overcome these problems, this thesis makes several contributions.

    The first is an automatic system able to detect the internal structure of executions of high-performance computing applications. The system rules out non-significant regions of executions, detects redundancies and, finally, selects small but significant execution regions. The detection process is based on spectral analysis (wavelet transform, Fourier transform, etc.) and works by detecting the most important frequencies of the application's execution. These main frequencies are strongly related to the internal loops of the application's source code. The automatic detection of small but significant execution regions remarkably reduces the complexity of the performance analysis process.

    The second contribution is an automatic methodology able to expose general but non-trivial performance trends, which are very useful for carrying out a performance analysis of the application. The methodology is based on an analytical model consisting of several performance factors that modify the value of the linear speedup to fit the real speedup: if the real speedup is far from the linear one, we immediately detect which performance factor is undermining the scalability of the application. The model can also be used to predict the performance of high-performance computing applications: from several executions on a few processors, we extract the model's performance factors and extrapolate them to executions on a higher number of processors, obtaining a speedup prediction from the analytical model.

    The third contribution is the automatic detection of the optimal sampling frequency of applications. We show that this frequency can be extracted using spectral analysis. For sequential applications, using this frequency improves the existing results of recognized techniques focused on reducing a serial application's instruction execution stream (SimPoint, SMARTS, etc.). For parallel benchmarks, the optimal frequency is very useful for extracting significant performance information efficiently and accurately.

    In summary, this thesis proposes a set of techniques based on signal processing whose main focus is the automatic analysis of applications, reporting an initial diagnostic of their performance and showing their internal iterative structure. These methods also provide a reduced tracefile from which it is easy to start a manual fine-grain performance analysis. The contributions of the thesis are not limited to proposals and publications: the research carried out over the last years has produced a tool for analyzing the structure of applications. Moreover, the methodology is general and can be adapted to many performance analysis methods, remarkably improving their efficiency, flexibility and generality.
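
    The flavor of the analytical model can be sketched as a product of efficiency factors that degrade the linear speedup; the factor names below are illustrative assumptions, not the thesis's exact model.

        /* Predicted speedup on p processors: linear speedup degraded by
         * multiplicative performance factors measured on small runs, e.g.
         * load balance, communication efficiency and overlap (all in [0,1]).
         * Example: predicted_speedup(128, 0.95, 0.90, 0.92) is roughly 101. */
        double predicted_speedup(int p, double load_balance,
                                 double comm_efficiency, double overlap)
        {
            return (double)p * load_balance * comm_efficiency * overlap;
        }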

  • Parallel programming models for heterogeneous multicore architectures

     Ferrer, Roger; Bellens, Pieter; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Yeom, Jae-Seung; Schneider, Scott; Koukos, Konstantinos; Alvanos, Michail; Nikolopoulos, Dimitrios S.; Bilas, Angelos
    IEEE micro
    Date of publication: 2010-09-01
    Journal article


  • HiPEAC Paper Award

     Etsion, Yoav; Cabarcas Jaramillo, Felipe; Rico Carro, Alejandro; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    Award or recognition


  • Exploiting Dataflow Parallelism in Teradevice Computing (TERAFLUX)

     Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Gil Gómez, Maria Luisa; Navarro Mas, Nacho
    Participation in a competitive project


  • Towards EXaflop applicaTions - TEXT

     Labarta Mancho, Jesus Jose; Badia Sala, Rosa Maria
    Participation in a competitive project


  • SIENA

     Badia Sala, Rosa Maria; Lezzi, Daniele
    Participation in a competitive project


  • BSC contributions in energy-aware resource management for large scale distributed systems  Open access

     Torres Viñals, Jordi; Ayguade Parra, Eduard; Carrera Perez, David; Guitart Fernández, Jordi; Beltran Querol, Vicenç; Becerra Fontal, Yolanda; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    Workshop of the COST Action IC0804 on Energy Efficiency in Large Scale Distributed Systems
    Presentation's date: 2010-04-15
    Presentation of work at congresses


    This paper introduces the work being carried out at the Barcelona Supercomputing Center in the area of Green Computing. We have been working on resource management for a long time and have recently included energy as a parameter in the decision process, considering that, for a more sustainable science, the paradigm will shift from “time to solution” to “kWh to the solution”. We present our proposals organized in four points that follow the cloud computing stack. For each point we enumerate the latest achievements, to be published during 2010, which form the basis of our future research. To conclude, we review our ongoing and future research and give an overview of the projects in which BSC participates.

  • Handling task dependencies under strided and aliased references

     Pérez Cáncer, Josep Maria; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    International Conference for High Performance Computing, Networking, Storage and Analysis
    Presentation's date: 2010-06
    Presentation of work at congresses


    The emergence of multicore processors has increased the need for simple parallel programming models usable by non-experts. The ability to specify subparts of a bigger data structure is an important trait of high-productivity programming languages. The same concept can be applied to dependency-aware task-parallel programming models, in which tasks have data dependencies that are used to schedule them in parallel. However, calculating dependencies between subparts of bigger data structures is challenging: accessed data may be strided, and may fully or partially overlap the accesses of other tasks. Techniques that are too approximate may produce too many extra dependencies and limit parallelism, while techniques that are too precise may be impractical in terms of time and space. We present the abstractions, data structures and algorithms to calculate dependencies between tasks with strided and possibly different memory access patterns. Our technique is performed at run time from a description of the inputs and outputs of each task and is affected by neither pointer arithmetic nor reshaping. We demonstrate how it can be applied to increase programming productivity, and that its scalability is comparable to other solutions and in some cases higher, thanks to better parallelism extraction.
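
    The run-time question at the heart of the paper can be phrased as an overlap test between strided memory regions. The naive O(count^2) sketch below (types and names are hypothetical; the paper's data structures answer the same question far more efficiently) shows what must be decided for every pair of task accesses.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        /* A strided access: `count` chunks of `len` contiguous bytes,
         * one chunk every `stride` bytes starting at address `start`. */
        typedef struct {
            uintptr_t start;
            size_t len, stride, count;
        } strided_access_t;

        static bool chunks_overlap(uintptr_t a, size_t alen,
                                   uintptr_t b, size_t blen)
        {
            return a < b + blen && b < a + alen;
        }

        /* Do two strided accesses touch any common byte? If so, the
         * corresponding tasks are dependent and must be ordered. */
        bool strided_overlap(strided_access_t x, strided_access_t y)
        {
            for (size_t i = 0; i < x.count; i++)
                for (size_t j = 0; j < y.count; j++)
                    if (chunks_overlap(x.start + i * x.stride, x.len,
                                       y.start + j * y.stride, y.len))
                        return true;
            return false;
        }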

  • Task superscalar: an out-of-order task pipeline  Open access

     Etsion, Yoav; Cabarcas Jaramillo, Felipe; Rico Carro, Alejandro; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    IEEE/ACM International Symposium on Microarchitecture
    Presentation's date: 2010-12-07
    Presentation of work at congresses


    We present Task Superscalar, an abstraction of the instruction-level out-of-order pipeline that operates at the task level. Like ILP pipelines, which uncover parallelism in a sequential instruction stream, task superscalar uncovers task-level parallelism among tasks generated by a sequential thread. Utilizing intuitive programmer annotations of task inputs and outputs, the task superscalar pipeline dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks out of order. Furthermore, we propose a design for a distributed task superscalar pipeline frontend that can be embedded into any manycore fabric and manages cores as functional units. We show that the proposed mechanism is capable of driving hundreds of cores simultaneously with non-speculative tasks, which allows the pipeline to sustain work windows consisting of tens of thousands of tasks. We further show that the pipeline can maintain a decode rate faster than 60 ns per task and dynamically uncover data dependencies among as many as ~50,000 in-flight tasks, using 7 MB of on-chip eDRAM storage. This configuration achieves speedups of 95–255x (average 183x) over sequential execution for nine scientific benchmarks, running on a simulated CMP with 256 cores. Task superscalar thus enables programmers to exploit manycore systems effectively, while simultaneously simplifying their programming model.

  • CellSs: Scheduling techniques to better exploit memory hierarchy

     Bellens, Pieter; Perez, Josep M.; Cabarcas Jaramillo, Felipe; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    Scientific programming
    Date of publication: 2009-01
    Journal article


  • Parallelizing dense and banded linear algebra libraries using SMPSs

     Badia Sala, Rosa Maria; Herrero Zaragoza, José Ramón; Labarta Mancho, Jesus Jose; Perez, Josep M.; Quintana Ortí, Enrique Salvador; Quintana-Ortí, Gregorio
    Concurrency and Computation: Practice and Experience
    Date of publication: 2009-12-25
    Journal article


    The promise of future many-core processors, with hundreds of threads running concurrently, has led the developers of linear algebra libraries to rethink their design in order to extract more parallelism, further exploit data locality, attain better load balance, and pay careful attention to the critical path of computation. In this paper we describe how existing serial libraries such as (C)LAPACK and FLAME can be easily parallelized using the SMPSs tools, consisting of a few OpenMP-like pragmas and a runtime system. In the LAPACK case, this usually requires the development of blocked algorithms for simple BLAS-level operations, which expose concurrency at a finer grain. For better performance, our experimental results indicate that the column-major order employed by this library needs to be abandoned in favor of a block data layout. This will require a deeper rewrite of LAPACK or, alternatively, a dynamic conversion of the storage pattern at run time. The parallelization of FLAME routines using SMPSs is simpler, as this library already includes blocked algorithms (algorithms-by-blocks in the FLAME argot) for most operations, and storage-by-blocks (block data layout) is already in place.
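
    As an example of the programming style the paper describes, the sketch below taskifies a Cholesky factorization by blocks with SMPSs-style pragmas; the tile size TS and the kernel names are assumptions, and each kernel would wrap the corresponding BLAS/LAPACK call.

        #define TS 256                      /* tile dimension (assumed) */
        typedef double tile_t[TS][TS];      /* block data layout: one tile */

        #pragma css task inout(A)
        void tile_potrf(tile_t A);                       /* LAPACK dpotrf */

        #pragma css task input(T) inout(B)
        void tile_trsm(tile_t T, tile_t B);              /* BLAS dtrsm    */

        #pragma css task input(A, B) inout(C)
        void tile_gemm(tile_t A, tile_t B, tile_t C);    /* BLAS dgemm    */

        #pragma css task input(A) inout(C)
        void tile_syrk(tile_t A, tile_t C);              /* BLAS dsyrk    */

        /* The driver looks sequential; the SMPSs runtime derives the task
         * graph from the pragma annotations and runs tiles out of order. */
        void cholesky_by_blocks(int nt, tile_t *A)  /* nt x nt tiles, row major */
        {
            for (int k = 0; k < nt; k++) {
                tile_potrf(A[k*nt + k]);
                for (int i = k + 1; i < nt; i++)
                    tile_trsm(A[k*nt + k], A[i*nt + k]);
                for (int i = k + 1; i < nt; i++) {
                    for (int j = k + 1; j < i; j++)
                        tile_gemm(A[i*nt + k], A[j*nt + k], A[i*nt + j]);
                    tile_syrk(A[i*nt + k], A[i*nt + i]);
                }
            }
        }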

  • A Proposal to Extend the OpenMP Tasking Model with Dependent Tasks

     Duran Gonzalez, Alejandro; Ferrer, Roger; Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose
    International journal of parallel programming
    Date of publication: 2009-06
    Journal article


    Tasking in OpenMP 3.0 has been conceived to handle the dynamic generation of unstructured parallelism. New directives have been added allowing the user to identify units of independent work (tasks) and to define points at which to wait for their completion (task barriers). In this document we propose extensions that allow the runtime detection of dependencies between generated tasks, broadening the range of applications that can benefit from tasking and improving performance when load balancing or locality are critical. The proposed extensions are evaluated on an SGI Altix multiprocessor architecture using a couple of small applications and a prototype runtime system implementation.
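
    Extensions along these lines later evolved into the depend clause standardized in OpenMP 4.0. A minimal sketch using the standardized syntax (not the paper's original proposal):

        #include <stdio.h>

        int main(void)
        {
            int x = 0;
            #pragma omp parallel
            #pragma omp single
            {
                #pragma omp task depend(out: x)
                x = 42;                           /* producer task */

                #pragma omp task depend(in: x)
                printf("consumer sees %d\n", x);  /* ordered after producer */

                #pragma omp taskwait
            }
            return 0;
        }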

  • Hierarchical Task-Based Programming With StarSs

     Planas Carbonell, Judit; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    International journal of high performance computing applications
    Date of publication: 2009-08
    Journal article


    Programming models for multicore and many-core systems are listed as one of the main challenges for computing research in the near future. These models should be able to exploit the underlying platform while offering the programmability needed for programmer productivity. They should take the heterogeneity and hierarchy of the underlying platforms into account, yet still shield the programmer from the complexity of the hardware. In this paper we present an extension of the StarSs syntax to support task hierarchy. We motivate this hierarchical approach through experimentation with CellSs, and present a prototype implementation of a hierarchical task-based programming model that combines a first task level using SMPSs with a second task level using CellSs. Preliminary results obtained when executing a matrix multiplication and a Cholesky factorization show the viability and potential of the approach, as well as the issues it currently raises.
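
    A minimal sketch of the two-level idea (StarSs-flavored, with illustrative syntax rather than the prototype's exact code): a coarse first-level task, of the kind SMPSs would schedule, is itself decomposed into finer second-level tasks, of the kind CellSs would run on the accelerator.

        #define NB 4                        /* tiles per coarse block (assumed) */
        #define TS 128                      /* fine tile dimension (assumed)    */
        typedef double tile_t[TS][TS];

        #pragma css task input(A, B) inout(C)
        void fine_gemm(tile_t A, tile_t B, tile_t C);    /* second level */

        #pragma css task input(A, B) inout(C)
        void coarse_gemm(tile_t A[NB][NB], tile_t B[NB][NB], tile_t C[NB][NB])
        {
            /* The coarse task spawns a grid of finer tasks. */
            for (int i = 0; i < NB; i++)
                for (int j = 0; j < NB; j++)
                    for (int k = 0; k < NB; k++)
                        fine_gemm(A[i][k], B[k][j], C[i][j]);
        }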

  • EMOTIVE: the BSC's engine for cloud solutions  Open access

     Goiri Presa, Iñigo; Guitart Fernández, Jordi; Macias Lloret, Mario; Torres Viñals, Jordi; Ayguade Parra, Eduard; Ejarque, Jorge; Sirvent Pardell, Raül; Lezzi, Daniele; Badia Sala, Rosa Maria
    Zero-In eMagazine: Building Insights, Breaking Boundaries
    Date of publication: 2009-10-01
    Journal article


    Cloud computing is strongly based on virtualization, which allows applications to be multiplexed onto a physical resource while remaining isolated from other applications sharing that resource. This technology simplifies the management of e-Infrastructures, but also requires additional effort if users are to benefit from it. Cloud computing must hide its underlying complexity from users: the key is to provide users with a simple but functional interface for accessing IT resources "as a service", while allowing providers to build cost-effective self-managed systems for transparently managing these resources. System developers should also be supported with simple tools that allow them to exploit the facilities of cloud infrastructures.

  • GRID superscalar: a programming model for the Grid.

     Sirvent Pardell, Raül
    Defense's date: 2009-02-03
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


  • MPEXPAR: Parallel Programming Models and Execution Environments

     Cortes Rossello, Antonio; Gil Gómez, Maria Luisa; Navarro Mas, Nacho; Corbalan Gonzalez, Julita; Costa Prats, Juan Jose; Farreras Esclusa, Montserrat; Herrero Zaragoza, José Ramón; Tejedor Saavedra, Enric; Gonzalez Tallada, Marc; Becerra Fontal, Yolanda; Nou Castell, Ramon; Sirvent Pardell, Raül; Guitart Fernández, Jordi; Carrera Perez, David; Alonso López, Javier; Labarta Mancho, Jesus Jose; Martorell Bofill, Xavier; Torres Viñals, Jordi; Badia Sala, Rosa Maria; Ayguade Parra, Eduard
    Participation in a competitive project
