Graphic summary
  • Show / hide key
  • Information


Scientific and technological production
  •  

1 to 50 of 519 results
  • Fine-grain parallel megabase sequence comparison with multiple heterogeneous GPUs

     De Sandes, Edans; Miranda Alamo, Guillermo; Melo, Alba; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    Presentation's date: 2014-02
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    This paper proposes and evaluates a parallel strategy to execute the exact Smith-Waterman (SW) algorithm for megabase DNA sequences in heterogeneous multi-GPU platforms. In our strategy, the computation of a single huge SW matrix is spread over multiple GPUs, which communicate border elements to the neighbour, using a circular buffer mechanism that hides the communication overhead. We compared 4 pairs of human-chimpanzee homologous chromosomes using 2 different GPU environments, obtaining a performance of up to 140.36 GCUPS (Billion of cells processed per second) with 3 heterogeneous GPUS.

    This paper proposes and evaluates a parallel strategy to execute the exact Smith-Waterman (SW) algorithm for megabase DNA sequences in heterogeneous multi-GPU platforms. In our strategy, the computation of a single huge SW matrix is spread over multiple GPUs, which communicate border elements to the neighbour, using a circular buffer mechanism that hides the communication overhead. We compared 4 pairs of human-chimpanzee homologous chromosomes using 2 different GPU environments, obtaining a performance of up to 140.36 GCUPS (Billion of cells processed per second) with 3 heterogeneous GPUS.

  • Deadline-based MapReduce workload management

     Polo, Jorda; Becerra Fontal, Yolanda; Carrera Perez, David; Steinder, Malgorzata; Whalley, Ian; Torres Viñals, Jordi; Ayguade Parra, Eduard
    IEEE transactions on network and service management
    Date of publication: 2013
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    This paper presents a scheduling technique for multi-job MapReduce workloads that is able to dynamically build performance models of the executing workloads, and then use these models for scheduling purposes. This ability is leveraged to adaptively manage workload performance while observing and taking advantage of the particulars of the execution environment of modern data analytics applications, such as hardware heterogeneity and distributed storage. The technique targets a highly dynamic environment in which new jobs can be submitted at any time, and in which MapReduce workloads share physical resources with other workloads. Thus the actual amount of resources available for applications can vary over time. Beyond the formulation of the problem and the description of the algorithm and technique, a working prototype (called Adaptive Scheduler) has been implemented. Using the prototype and medium-sized clusters (of the order of tens of nodes), the following aspects have been studied separately: the scheduler's ability to meet high-level performance goals guided only by user-defined completion time goals; the scheduler's ability to favor data-locality in the scheduling algorithm; and the scheduler's ability to deal with hardware heterogeneity, which introduces hardware affinity and relative performance characterization for those applications that can benefit from executing on specialized processors.

  • A systematic methodology to generate decomposable and responsive power models for CMPs

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    IEEE transactions on computers
    Date of publication: 2013-07
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Power modeling based on performance monitoring counters (PMCs) attracted the interest of researchers since it became a quick approach to understand the power behavior of real systems. Consequently, several power-aware policies use models to guide their decisions. Hence, the presence of power models that are informative, accurate, and capable of detecting power phases is critical to improve the success of power-saving techniques. Additionally, the design of current processors varied considerably with the appearance of CMPs (multiple cores sharing resources). Thus, PMC-based power models warrant further investigation on current energy-efficient multicore processors. In this paper, we present a systematic methodology to produce decomposable PMC-based power models on current multicore architectures. Apart from being able to estimate the power consumption accurately, the models provide per component power consumption, supplying extra insights about power behavior. Moreover, we study theirresponsiveness -the capacity to detect power phases-. Specifically, we produce power models for an Intel Core 2 Duo with one and two cores enabled for all the DVFS configurations. The models are empirically validated using the SPECcpu2006, NAS and LMBENCH benchmarks. Finally, we compare the models against existing approaches concluding that the proposed methodology produces more accurate, responsive, and informative models.

  • A template system for the efficient compilation of domain abstractions onto reconfigurable computers

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    Journal of systems architecture
    Date of publication: 2013-02
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Past research has addressed the issue of using FPGAs as accelerators for HPC systems. Such research has identified that writing low level code for the generation of an efficient, portable and scalable architecture is challenging. We propose to increase the level of abstraction in order to help developers of reconfigurable accelerators deal with these three key issues. Our approach implements domain specific abstractions for FPGA based accelerators using techniques from generic programming. In this paper we explain the main concepts behind our system to Design Accelerators by Template Expansions (DATE). The DATE system can be effectively used for expanding individual kernels of an application and also for the generation of interfaces between various kernels to implement a complete system architecture. We present evaluations for six kernels as examples of individual kernel generation using the proposed system. Our evaluations are mainly intended to provide a proof-of-concept. We also show the usage of the DATE system for integration of various kernels to build a complete system based on a Template Architecture for Reconfigurable Accelerator Designs (TARCAD).

  • Programmability and portability for exascale: top down programming methodology and tools with StarSs

     Subotic, Vladimir; Brinkmann, Steffen; Marjanovic, Vladimir; Badia Sala, Rosa Maria; Gracia, Jose; Niethammer, Chirstoph; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    Journal of computational science
    Date of publication: 2013-11
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    StarSs is a task-based programming model that allows to parallelize sequential applications by means of annotating the code with compiler directives. The model further supports transparent execution of designated tasks on heterogeneous platforms, including clusters of GPUs. This paper focuses on the methodology and tools that complements the programming model forming a consistent development environment with the objective of simplifying the live of application developers. The programming environment includes the tools TAREADOR and TEMANEJO, which have been designed specifically for StarSs. TAREADOR, a Valgrind-based tool, allows a top-down development approach by assisting the programmer in identifying tasks and their data-dependencies across all concurrency levels of an application. TEMANEJO is a graphical debugger supporting the programmer by visualizing the task dependency tree on one hand, but also allowing to manipulate task scheduling or dependencies. These tools are complemented with a set of performance analysis tools (Scalasca, Cube and Paraver) that enable to fine tune StarSs application.

  • Implementing OmpSs support for regions of data in architectures with multiple address spaces

     Bueno Hedo, Javier; Martorell Bofill, Xavier; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    ACM/IEEE International Conference on Supercomputing
    Presentation's date: 2013-06
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    The need for features for managing complex data accesses in modern programming models has increased due to the emerging hardware architectures. HPC hardware has moved towards clusters of accelerators and/or multicores, architectures with a complex memory hierarchy exposed to the programmer. We present the implementation of data regions on the OmpSs programming model, a high-productivity annotation-based programming model derived from OpenMP. This enables the programmer to specify regions of strided and/or overlapped data used by the parallel tasks of the application. The data will be automatically managed by the underlying run-time environment, which could transparently apply optimization techniques to improve performance. This approach based on a high-productivity programming model contrasts with more direct approaches like MPI, where the programmer has to explicitly deal with the data management. It is generally believed that these are capable of achieving the best possible performance, so we also compare the performance of several OmpSs applications against well-known counterparts MPI implementations obtaining comparable or better results.

    The need for features for managing complex data accesses in modern programming models has increased due to the emerging hardware architectures. HPC hardware has moved towards clusters of accelerators and/or multicores, architectures with a complex memory hierarchy exposed to the programmer. We present the implementation of data regions on the OmpSs programming model, a high-productivity annotation-based programming model derived from OpenMP. This enables the programmer to specify regions of strided and/or overlapped data used by the parallel tasks of the application. The data will be automatically managed by the underlying run-time environment, which could transparently apply optimization techniques to improve performance. This approach based on a high-productivity programming model contrasts with more direct approaches like MPI, where the programmer has to explicitly deal with the data management. It is generally believed that these are capable of achieving the best possible performance, so we also compare the performance of several OmpSs applications against well-known counterparts MPI implementations obtaining comparable or better results.

  • Evaluating techniques for parallelization tuning in MPI, OmpSs and MPI/OmpSs  Open access

     Subotic, Vladimir
    Defense's date: 2013-07-26
    Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    La programación paralela consiste en dividir un problema de computación entre múltiples unidades de procesamiento y definir como interactúan (comunicación y sincronización) para garantizar un resultado correcto. El rendimiento de unprograma paralelo normalmente está muy lejos de ser óptimo: el desequilibrio de la carga computacional y la excesiva interacción entre las unidades de procesamiento a menudo causa ciclos perdidos, reduciendo la eficiencia de la computación paralela.En esta tesis proponemos técnicas orientadas a explotar mejor el paralelismo en aplicaciones paralelas, poniendo énfasis en técnicas que incrementan el asincronismo. En teoría, estas técnicas prometen múltiples beneficios. Primero, tendrían que mitigar el retraso de la comunicación y la sincronización, y por lo tanto incrementar el rendimiento global. Además, la calibración de la paralelización tendría que exponer un paralelismo adicional, incrementando la escalabilidad de la ejecución. Finalmente, un incremente en el asincronismo proveería una tolerancia mayor a redes de comunicación lentas y ruido externo.En la primera parte de la tesis, estudiamos el potencial para la calibración del paralelismo a través de MPI. En concreto, exploramos técnicas automáticas para solapar la comunicación con la computación. Proponemos una técnica demensajería especulativa que incrementa el solapamiento y no requiere cambios en la aplicación MPI original. Nuestra técnica identifica automáticamente la actividad MPI de la aplicación y la reinterpreta usando solicitudes MPI no bloqueantes situadas óptimamente. Demostramos que esta técnica maximiza el solapamiento y, en consecuencia, acelera la ejecución y permite una mayor tolerancia a las reducciones de ancho de banda. Aún así, en el caso de cargas de trabajo científico realistas, mostramos que el potencial de solapamiento está significativamente limitado por el patrón según el cual cada proceso MPI opera localmente en el paso de mensajes.En la segunda parte de esta tesis, exploramos el potencial para calibrar el paralelismo híbrido MPI/OmpSs. Intentamos obtener una comprensión mejor del paralelismo de aplicaciones híbridas MPI/OmpSs para evaluar de qué manera se ejecutarían en futuras máquinas. Exploramos como las aplicaciones MPI/OmpSs pueden escalar en una máquina paralela con centenares de núcleos por nodo. Además, investigamos cómo este paralelismo de cada nodo se reflejaría en las restricciones de la red de comunicación. En especia, nos concentramos en identificar secciones críticas de código en MPI/OmpSs. Hemos concebido una técnica que rápidamente evalúa, para una aplicación MPI/OmpSs dada y la máquina objetivo seleccionada, qué sección de código tendría que ser optimizada para obtener la mayor ganancia de rendimiento.También estudiamos técnicas para explorar rápidamente el paralelismo potencial de OmpSs inherente en las aplicaciones.Proporcionamos mecanismos para evaluar fácilmente el paralelismo potencial de cualquier descomposición en tareas.Además, describimos una aproximación iterativa para buscar una descomposición en tareas que mostrará el suficiente paralelismo en la máquina objetivo dada. Para finalizar, exploramos el potencial para automatizar la aproximación iterativa.En el trabajo expuesto en esta tesis hemos diseñado herramientas que pueden ser útiles para otros investigadores de este campo. La más avanzada es Tareador, una herramienta para ayudar a migrar aplicaciones al modelo de programaciónMPI/OmpSs. Tareador proporciona una interfaz simple para proponer una descomposición del código en tareas OmpSs.Tareador también calcula dinámicamente las dependencias de datos entre las tareas anotadas, y automáticamente estima el potencial de paralelización OmpSs. Por último, Tareador da indicaciones adicionales sobre como completar el proceso de migración a OmpSs. Tareador ya se ha mostrado útil al ser incluido en las clases de programación de la UPC.

    Parallel programming is used to partition a computational problem among multiple processing units and to define how they interact (communicate and synchronize) in order to guarantee the correct result. The performance that is achieved when executing the parallel program on a parallel architecture is usually far from the optimal: computation unbalance and excessive interaction among processing units often cause lost cycles, reducing the efficiency of parallel computation. In this thesis we propose techniques oriented to better exploit parallelism in parallel applications, with emphasis in techniques that increase asynchronism. Theoretically, this type of parallelization tuning promises multiple benefits. First, it should mitigate communication and synchronization delays, thus increasing the overall performance. Furthermore, parallelization tuning should expose additional parallelism and therefore increase the scalability of execution. Finally, increased asynchronism would provide higher tolerance to slower networks and external noise. In the first part of this thesis, we study the potential for tuning MPI parallelism. More specifically, we explore automatic techniques to overlap communication and computation. We propose a speculative messaging technique that increases the overlap and requires no changes of the original MPI application. Our technique automatically identifies the application’s MPI activity and reinterprets that activity using optimally placed non-blocking MPI requests. We demonstrate that this overlapping technique increases the asynchronism of MPI messages, maximizing the overlap, and consequently leading to execution speedup and higher tolerance to bandwidth reduction. However, in the case of realistic scientific workloads, we show that the overlapping potential is significantly limited by the pattern by which each MPI process locally operates on MPI messages. In the second part of this thesis, we study the potential for tuning hybrid MPI/OmpSs parallelism. We try to gain a better understanding of the parallelism of hybrid MPI/OmpSs applications in order to evaluate how these applications would execute on future machines and to predict the execution bottlenecks that are likely to emerge. We explore how MPI/OmpSs applications could scale on the parallel machine with hundreds of cores per node. Furthermore, we investigate how this high parallelism within each node would reflect on the network constraints. We especially focus on identifying critical code sections in MPI/OmpSs. We devised a technique that quickly evaluates, for a given MPI/OmpSs application and the selected target machine, which code section should be optimized in order to gain the highest performance benefits. Also, this thesis studies techniques to quickly explore the potential OmpSs parallelism inherent in applications. We provide mechanisms to easily evaluate potential parallelism of any task decomposition. Furthermore, we describe an iterative trialand-error approach to search for a task decomposition that will expose sufficient parallelism for a given target machine. Finally, we explore potential of automating the iterative approach by capturing the programmers’ experience into an expert system that can autonomously lead the search process. Also, throughout the work on this thesis, we designed development tools that can be useful to other researchers in the field. The most advanced of these tools is Tareador – a tool to help porting MPI applications to MPI/OmpSs programming model. Tareador provides a simple interface to propose some decomposition of a code into OmpSs tasks. Tareador dynamically calculates data dependencies among the annotated tasks, and automatically estimates the potential OmpSs parallelization. Furthermore, Tareador gives additional hints on how to complete the process of porting the application to OmpSs. Tareador already proved itself useful, by being included in the academic classes on parallel programming at UPC.

    La programación paralela consiste en dividir un problema de computación entre múltiples unidades de procesamiento y definir como interactúan (comunicación y sincronización) para garantizar un resultado correcto. El rendimiento de un programa paralelo normalmente está muy lejos de ser óptimo: el desequilibrio de la carga computacional y la excesiva interacción entre las unidades de procesamiento a menudo causa ciclos perdidos, reduciendo la eficiencia de la computación paralela. En esta tesis proponemos técnicas orientadas a explotar mejor el paralelismo en aplicaciones paralelas, poniendo énfasis en técnicas que incrementan el asincronismo. En teoría, estas técnicas prometen múltiples beneficios. Primero, tendrían que mitigar el retraso de la comunicación y la sincronización, y por lo tanto incrementar el rendimiento global. Además, la calibración de la paralelización tendría que exponer un paralelismo adicional, incrementando la escalabilidad de la ejecución. Finalmente, un incremente en el asincronismo proveería una tolerancia mayor a redes de comunicación lentas y ruido externo. En la primera parte de la tesis, estudiamos el potencial para la calibración del paralelismo a través de MPI. En concreto, exploramos técnicas automáticas para solapar la comunicación con la computación. Proponemos una técnica de mensajería especulativa que incrementa el solapamiento y no requiere cambios en la aplicación MPI original. Nuestra técnica identifica automáticamente la actividad MPI de la aplicación y la reinterpreta usando solicitudes MPI no bloqueantes situadas óptimamente. Demostramos que esta técnica maximiza el solapamiento y, en consecuencia, acelera la ejecución y permite una mayor tolerancia a las reducciones de ancho de banda. Aún así, en el caso de cargas de trabajo científico realistas, mostramos que el potencial de solapamiento está significativamente limitado por el patrón según el cual cada proceso MPI opera localmente en el paso de mensajes. En la segunda parte de esta tesis, exploramos el potencial para calibrar el paralelismo híbrido MPI/OmpSs. Intentamos obtener una comprensión mejor del paralelismo de aplicaciones híbridas MPI/OmpSs para evaluar de qué manera se ejecutarían en futuras máquinas. Exploramos como las aplicaciones MPI/OmpSs pueden escalar en una máquina paralela con centenares de núcleos por nodo. Además, investigamos cómo este paralelismo de cada nodo se reflejaría en las restricciones de la red de comunicación. En especia, nos concentramos en identificar secciones críticas de código en MPI/OmpSs. Hemos concebido una técnica que rápidamente evalúa, para una aplicación MPI/OmpSs dada y la máquina objetivo seleccionada, qué sección de código tendría que ser optimizada para obtener la mayor ganancia de rendimiento. También estudiamos técnicas para explorar rápidamente el paralelismo potencial de OmpSs inherente en las aplicaciones. Proporcionamos mecanismos para evaluar fácilmente el paralelismo potencial de cualquier descomposición en tareas. Además, describimos una aproximación iterativa para buscar una descomposición en tareas que mostrará el suficiente paralelismo en la máquina objetivo dada. Para finalizar, exploramos el potencial para automatizar la aproximación iterativa. En el trabajo expuesto en esta tesis hemos diseñado herramientas que pueden ser útiles para otros investigadores de este campo. La más avanzada es Tareador, una herramienta para ayudar a migrar aplicaciones al modelo de programación MPI/OmpSs. Tareador proporciona una interfaz simple para proponer una descomposición del código en tareas OmpSs. Tareador también calcula dinámicamente las dependencias de datos entre las tareas anotadas, y automáticamente estima el potencial de paralelización OmpSs. Por último, Tareador da indicaciones adicionales sobre como completar el proceso de migración a OmpSs. Tareador ya se ha mostrado útil al ser incluido en las clases de programación de la UPC.

  • Access to the full text
    Aeneas: A tool to enable applications to effectively use non-relational databases  Open access

     Cugnasco, Cesare; Hernández, Roger; Becerra Fontal, Yolanda; Torres Viñals, Jordi; Ayguade Parra, Eduard
    International Conference on Computational Science
    Presentation's date: 2013-06
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Non-relational databases arise as a solution to solve the scalability problems of relational databases when dealing with big data applications. However, they are highly configurable prone to user decisions that can heavily affect their performance. In order to maximize the performance, different data models and queries should be analyzed to choose the best fit. This may involve a wide range of tests and may result in productivity issues. We present Aeneas, a tool to support the design of data management code for applications using non-relational databases. Aeneas provides an easy and fast methodology to support the decision about how to organize and retrieve data in order to improve the performance.

    Non-relational databases arise as a solution to solve the scalability problems of relational databases when dealing with big data applications. However, they are highly configurable prone to user decisions that can heavily affect their performance. In order to maximize the performance, different data models and queries should be analyzed to choose the best fit. This may involve a wide range of tests and may result in productivity issues. We present Aeneas, a tool to support the design of data management code for applications using non-relational databases. Aeneas provides an easy and fast methodology to support the decision about how to organize and retrieve data in order to improve the performance.

    Postprint (author’s final draft)

  • Access to the full text
    Self-adaptive OmpSs tasks in heterogeneous environments  Open access

     Planas Carbonell, Judit; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    IEEE International Parallel and Distributed Processing Symposium
    Presentation's date: 2013-05
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    As new heterogeneous systems and hardware accelerators appear, high performance computers can reach a higher level of computational power. Nevertheless, this does not come for free: the more heterogeneity the system presents, the more complex becomes the programming task in terms of resource management. OmpSs is a task-based programming model and framework focused on the runtime exploitation of parallelism from annotated sequential applications. This paper presents a set of extensions to this framework: we show how the application programmer can expose different specialized versions of tasks (i.e. pieces of specific code targeted and optimized for a particular architecture) and how the system can choose between these versions at runtime to obtain the best performance achievable for the given application. From the results obtained in a multi-GPU system, we prove that our proposal gives flexibility to application's source code and can potentially increase application's performance.

    As new heterogeneous systems and hardware accelerators appear, high performance computers can reach a higher level of computational power. Nevertheless, this does not come for free: the more heterogeneity the system presents, the more complex becomes the programming task in terms of resource management. OmpSs is a task-based programming model and framework focused on the runtime exploitation of parallelism from annotated sequential applications. This paper presents a set of extensions to this framework: we show how the application programmer can expose different specialized versions of tasks (i.e. pieces of specific code targeted and optimized for a particular architecture) and how the system can choose between these versions at runtime to obtain the best performance achievable for the given application. From the results obtained in a multi-GPU system, we prove that our proposal gives flexibility to application's source code and can potentially increase application's performance.

    Postprint (author’s final draft)

  • Access to the full text
    Enabling distributed key-value stores with low latency-impact snapshot support  Open access

     Polo, Jorda; Becerra Fontal, Yolanda; Carrera Perez, David; Torres Viñals, Jordi; Ayguade Parra, Eduard; Spreitzer, Mike; Steinder, Malgorzata
    IEEE International Symposium on Network Computing and Applications
    Presentation's date: 2013-08
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Current distributed key-value stores generally provide greater scalability at the expense of weaker consistency and isolation. However, additional isolation support is becoming increasingly important in the environments in which these stores are deployed, where different kinds of applications with different needs are executed, from transactional workloads to data analytics. While fully-fledged ACID support may not be feasible, it is still possible to take advantage of the design of these data stores, which often include the notion of multiversion concurrency control, to enable them with additional features at a much lower performance cost and maintaining its scalability and availability. In this paper we explore the effects that additional consistency guarantees and isolation capabilities may have on a state of the art key-value store: Apache Cassandra. We propose and implement a new multiversioned isolation level that provides stronger guarantees without compromising Cassandra's scalability and availability. As shown in our experiments, our version of Cassandra allows Snapshot Isolation-like transactions, preserving the overall performance and scalability of the system.

    Current distributed key-value stores generally provide greater scalability at the expense of weaker consistency and isolation. However, additional isolation support is becoming increasingly important in the environments in which these stores are deployed, where different kinds of applications with different needs are executed, from transactional workloads to data analytics. While fully-fledged ACID support may not be feasible, it is still possible to take advantage of the design of these data stores, which often include the notion of multiversion concurrency control, to enable them with additional features at a much lower performance cost and maintaining its scalability and availability. In this paper we explore the effects that additional consistency guarantees and isolation capabilities may have on a state of the art key-value store: Apache Cassandra. We propose and implement a new multiversioned isolation level that provides stronger guarantees without compromising Cassandra's scalability and availability. As shown in our experiments, our version of Cassandra allows Snapshot Isolation-like transactions, preserving the overall performance and scalability of the system.

    Postprint (author’s final draft)

  • Access to the full text
    Loop level speculation in a task based programming model  Open access

     Gayatri, Rahulkumar; Badia Sala, Rosa Maria; Ayguade Parra, Eduard
    International Conference on High Performance Computing
    Presentation's date: 2013-12
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Uncountable loops (such as while loops in C) and if-conditions are some of the most common constructs in programming. While-loops are widely used to determine the convergence in linear algebra algorithms or goal finding problems from graph algorithms, to name a few. In general while-loops are used whenever the loop iteration space, the number of iterations a loop executes is unknown. Usually in while-loops, the execution of the next iteration is decided inside the current loop iteration (i.e. the execution of iteration i depends on the values computed in iteration i-1). This precludes their parallel execution in today's ubiquitous multi-core architectures. In this paper a technique to speculatively create parallel tasks from the next iterations before the current one completes is proposed. If consecutive loop-iterations are only control dependent, then multiple iterations can be executed simultaneously; later in the execution path, the runtime system will decide to either commit the results of such speculatively executed iterations or undo the changes made by them. Data dependences within or between non-speculative and speculative work are honored to guarantee correctness. The proposed technique is implemented in SMPSs, a task-based dataflow programming model for shared-memory multiprocessor architectures. The approach is evaluated on a set of applications from graph algorithms and linear algebra. Results are promising with an average increase in the speedup of 1.2x with 16 threads when compared to non speculative execution of the applications. The increase in the speedup is significant, since the performance gain is achieved over an already parallelized version of the benchmarks.

    Uncountable loops (such as while loops in C) and if-conditions are some of the most common constructs in programming. While-loops are widely used to determine the convergence in linear algebra algorithms or goal finding problems from graph algorithms, to name a few. In general while-loops are used whenever the loop iteration space, the number of iterations a loop executes is unknown. Usually in while-loops, the execution of the next iteration is decided inside the current loop iteration (i.e. the execution of iteration i depends on the values computed in iteration i-1). This precludes their parallel execution in today's ubiquitous multi-core architectures. In this paper a technique to speculatively create parallel tasks from the next iterations before the current one completes is proposed. If consecutive loop-iterations are only control dependent, then multiple iterations can be executed simultaneously; later in the execution path, the runtime system will decide to either commit the results of such speculatively executed iterations or undo the changes made by them. Data dependences within or between non-speculative and speculative work are honored to guarantee correctness. The proposed technique is implemented in SMPSs, a task-based dataflow programming model for shared-memory multiprocessor architectures. The approach is evaluated on a set of applications from graph algorithms and linear algebra. Results are promising with an average increase in the speedup of 1.2x with 16 threads when compared to non speculative execution of the applications. The increase in the speedup is significant, since the performance gain is achieved over an already parallelized version of the benchmarks.

    Postprint (author’s final draft)

  • Access to the full text
    Design space explorations for streaming accelerators using streaming architectural simulator  Open access

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Bhurban Conference on Applied Sciences and Technology
    Presentation's date: 2013-01
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    In the recent years streaming accelerators like GPUs have been pop-up as an effective step towards parallel computing. The wish-list for these devices span from having a support for thousands of small cores to a nature very close to the general purpose computing. This makes the design space very vast for the future accelerators containing thousands of parallel streaming cores. This complicates to exercise a right choice of the architectural configuration for the next generation devices. However, accurate design space exploration tools developed for the massively parallel architectures can ease this task. The main objectives of this work are twofold. (i) We present a complete environment of a trace driven simulator named SArcs (Streaming Architectural Simulator) for the streaming accelerators. (ii) We use our simulation tool-chain for the design space explorations of the GPU like streaming architectures. Our design space explorations for different architectural aspects of a GPU like device a e with reference to a base line established for NVIDIA's Fermi architecture (GPU Tesla C2050). The explored aspects include the performation effects by the variations in the configurations of Streaming Multiprocessors Global Memory Bandwidth, Channles between SMs down to Memory Hierarchy and Cache Hierarchy. The explorations are performed using application kernels from Vector Reduction, 2D-Convolution. Matrix-Matrix Multiplication and 3D-Stencil. Results show that the configurations of the computational resources for the current Fermi GPU device can deliver higher performance with further improvement in the global memory bandwidth for the same device.

    In the recent years streaming accelerators like GPUs have been pop-up as an effective step towards parallel computing. The wish-list for these devices span from having a support for thousands of small cores to a nature very close to the general purpose computing. This makes the design space very vast for the future accelerators containing thousands of parallel streaming cores. This complicates to exercise a right choice of the architectural configuration for the next generation devices. However, accurate design space exploration tools developed for the massively parallel architectures can ease this task. The main objectives of this work are twofold. (i) We present a complete environment of a trace driven simulator named SArcs (Streaming Architectural Simulator) for the streaming accelerators. (ii) We use our simulation tool-chain for the design space explorations of the GPU like streaming architectures. Our design space explorations for different architectural aspects of a GPU like device a e with reference to a base line established for NVIDIA's Fermi architecture (GPU Tesla C2050). The explored aspects include the performation effects by the variations in the configurations of Streaming Multiprocessors Global Memory Bandwidth, Channles between SMs down to Memory Hierarchy and Cache Hierarchy. The explorations are performed using application kernels from Vector Reduction, 2D-Convolution. Matrix-Matrix Multiplication and 3D-Stencil. Results show that the configurations of the computational resources for the current Fermi GPU device can deliver higher performance with further improvement in the global memory bandwidth for the same device.

    Postprint (author’s final draft)

  • POTRA: a framework for building power models for next generation multicore architectures

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM SIGMETRICS performance evaluation review
    Date of publication: 2012-06
    Journal article

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Counter-based power modeling methods: top-down vs. bottom-up

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    The computer journal (Kalispell, Mont.)
    Date of publication: 2012-08-24
    Journal article

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Autonomic placement of mixed batch and transactional workloads

     Carrera Perez, David; Steinder, Malgorzata; Whalley, Ian; Torres Viñals, Jordi; Ayguade Parra, Eduard
    IEEE transactions on parallel and distributed systems
    Date of publication: 2012-02-01
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    To reduce the cost of infrastructure and electrical energy, enterprise datacenters consolidate workloads on the same physical hardware. Often, these workloads comprise both transactional and long-running analytic computations. Such consolidation brings new performance management challenges due to the intrinsically different nature of a heterogeneous set of mixed workloads, ranging from scientific simulations to multitier transactional applications. The fact that such different workloads have different natures imposes the need for new scheduling mechanisms to manage collocated heterogeneous sets of applications, such as running a web application and a batch job on the same physical server, with differentiated performance goals. In this paper, we present a technique that enables existing middleware to fairly manage mixed workloads: long running jobs and transactional applications. Our technique permits collocation of the workload types on the same physical hardware, and leverages virtualization control mechanisms to perform online system reconfiguration. In our experiments, including simulations as well as a prototype system built on top of state-of-the-art commercial middleware, we demonstrate that our technique maximizes mixed workload performance while providing service differentiation based on high-level performance goals.

  • A methodology for the evaluation of high response time on E-commerce users and sales

     Poggi, Nicolas; Carrera Perez, David; Gavaldà Mestre, Ricard; Ayguade Parra, Eduard; Torres Viñals, Jordi
    Information systems frontiers
    Date of publication: 2012-10-06
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    The widespread adoption of high speed Internet access and it¿s usage for everyday tasks are causing profound changes in users¿ expectations in terms of Web site performance and reliability. At the same time, server management is living a period of changes with the emergence of the cloud computing paradigm that enables scaling server infrastructures within minutes. To help set performance objectives for maximizing user satisfaction and sales, while minimizing the number of servers and their cost, we present a methodology to determine how user sales are affected as response time increases. We begin with the characterization of more than 6 months of Web performance measurements, followed by the study of how the fraction of buyers in the workload is higher at peak traffic times, to then build a model of sales through a learning process using a 5-year sales dataset. Finally, we present our evaluation of high response time on users for popular applications found in the Web.

  • Energy accounting for shared virtualized environments under DVFS using PMC-based power models

     Bertran Monfort, Ramon; Becerra Fontal, Yolanda; Carrera Perez, David; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Torres Viñals, Jordi; Ayguade Parra, Eduard
    Future generation computer systems
    Date of publication: 2012-02
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Virtualized infrastructure providers demand new methods to increase the accuracy of the accounting models used to charge their customers. Future data centers will be composed of many-core systems that will host a large number of virtual machines (VMs) each. While resource utilization accounting can be achieved with existing system tools, energy accounting is a complex task when per-VM granularity is the goal. In this paper, we propose a methodology that brings new opportunities to energy accounting by adding an unprecedented degree of accuracy on the per-VM measurements. We present a system ¿ which leverages CPU and memory power models based in performance monitoring counters (PMCs) ¿ to perform energy accounting in virtualized systems. The contribution of this paper is threefold. First, we show that PMC-based power modeling methods are still valid on virtualized environments. Second, we show that the Dynamic Voltage and Frequency Scaling (DVFS) mechanism, which commonly is used by infrastructure providers to avoid power and thermal emergencies, does not affect the accuracy of the models. And third, we introduce a novel methodology for accounting of energy consumption in virtualized systems. Accounting is done on a per-VM basis, even in the case where multiple VMs are deployed on top of the same physical hardware, bypassing the limitations of per-server aggregated power metering. Overall, the results for an Intel® Core¿ 2 Duo show errors in energy estimations <5%. Such an approach brings flexibility to the chargeback models used by service and infrastructure providers. For instance, we are able to detect cases where VMs executed during the same amount of time, present more than 20% differences in energy consumption even only taking into account the consumption of the CPU and the memory.

  • DMA++: on the fly data realignment for on-chip memories

     Vujic, Nikola; Cabarcas Jaramillo, Felipe; Gonzalez Tallada, Marc; Ramirez Bellido, Alejandro; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    IEEE transactions on computers
    Date of publication: 2012-02
    Journal article

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • HIPEAC 3 - European Network of Excellence on HighPerformance Embedded Architecture and Compilers

     Navarro Mas, Nacho; Gil Gómez, Maria Luisa; Martorell Bofill, Xavier; Valero Cortes, Mateo; Ayguade Parra, Eduard; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose; Llaberia Griño, Jose M.
    Participation in a competitive project

     Share

  • DMA-circular: an enhanced high level programmable DMA controller for optimized management of on-chip local memories

     Vujic, Nikola; Alvarez, Lluc; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    ACM International Conference on Computing Frontiers
    Presentation's date: 2012-05-15
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • PPMC: hardware scheduling and memory management support for multi accelerators

     Hussain, Tassadaq; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference on Field Programmable Logic and Applications
    Presentation's date: 2012-08
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Assessing the impact of network compression on molecular dynamics and finite element methods

     Dickov, Branimir; Pericas, Miquel; Houzeaux, Guillaume; Navarro Mas, Nacho; Ayguade Parra, Eduard
    IEEE International Conference on High Performance Computing and Communications
    Presentation's date: 2012-06
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • The task dependency analysis tool SSgrind

     Jimenez Gonzalez, Daniel; Alvarez Martinez, Carlos; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Workshop on Efficient Parallel Programming of Bioinformatics Applications on Heterogeneous HPC Platforms
    Presentation's date: 2012-06-28
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • OmpSs to FPGA and overview of applications

     Jimenez Gonzalez, Daniel; Alvarez Martinez, Carlos; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Workshop on Efficient Parallel Programming of Bioinformatics Applications on Heterogeneous HPC Platforms
    Presentation's date: 2012-06-25
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Access to the full text
    Introducing speculative optimizations in task dataflow with language extensions and runtime support  Open access

     Azuelos, Nathaniel; Etsion, Yoav; Keidar, Idit; Zaks, A.; Ayguade Parra, Eduard
    Workshop on Data-Flow Execution Models for Extreme Scale Computing
    Presentation's date: 2012-09
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    We argue that speculation leads to increased parallelism in the coarse-grain dataflow paradigm. To do so, we present a framework for adding speculation in a popular and well-established framework. We specify a limited set of additions to the OmpSs language and changes required in its supporting runtime environment. These modifications enable speculation across the system in a flexible way. We evaluate our implementation using a simple benchmark leading to a promising 10% speedup.

    Postprint (author’s final draft)

  • Optimizing resource utilization with software-based temporal multi-threading (sTMT)

     Beltran Querol, Vicenç; Ayguade Parra, Eduard
    International Conference on High Performance Computing
    Presentation's date: 2012-12
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Compute and memory access units are two of the most important resources to appropriately manage in current and future multi–/many–core architectures. Memory bandwidth and computational capacity need to be exploited in a combined way to achieve the best system performance. Coarse–grain multi– threading, also known as temporal multi–threading (TMT), is a well known technique that improves overall resource utilization by time–multiplexing the execution of a reduced number of hardware threads that are switched in case of a high–latency event, such as a memory miss. Hence, the processor does not stall on memory misses and the number of in–fly memory operations is increased, improving the overall processor resource utilization. In this paper, we propose a software–based implementation of TMT that supports and unbounded number of threads and enables a flexible combination of multiple computational kernels. Our TMT implementation is based on micro–threads that combine fast cooperative and preemptive context switches to overcome some intrinsic limitations of current TMT hardware implementations, such as the reduced and fixed number of hardware threads available. Our proposal is demonstrated with an implementation on the Cell/B.E. which is evaluated using heterogeneous mixes of memory–/CPU–bound kernels. Experimental results show how the proposed technique reduce the execution time of several benchmarks by up to 78%.

  • Integrating dataflow abstractions into the shared memory model

     Gajinov, Vladimir; Stipic, Srdjan; Unsal, Osman Sabri; Harris, Tim; Ayguade Parra, Eduard; Cristal Kestelman, Adrián
    International Symposium on Computer Architecture and High Performance Computing
    Presentation's date: 2012-10
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    In this paper we present Atomic Dataflow model (ADF), a new task-based parallel programming model for C/C++ which integrates dataflow abstractions into the shared memory programming model. The ADF model provides pragma directives that allow a programmer to organize a program into a set of tasks and to explicitly define input data for each task. The task dependency information is conveyed to the ADF runtime system which constructs the dataflow task graph and builds the necessary infrastructure for dataflow execution. Additionally, the ADF model allows tasks to share data. The key idea is that comput ation is triggered by dataflow between tasks but that, within a task, execution occurs by making atomic updates to common mutable state. To that end, the ADF model employs transactional memory which guarantees atomicity of shared memory updates. We show examples that illustrate how the programmability of shared memory can be improved using the ADF model. Moreover, our evaluation shows that the ADF model performs well in comparison with programs para llelized using OpenMP and transactional memory.

  • PPMC: a programmable pattern based memory controller

     Hussain, Tassadaq; Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International IEEE/ACM Symposium on Applied Reconfigurable Computing
    Presentation's date: 2012-03
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Access to the full text
    IT or not to be: the impact of Moodle in the education of developing countries  Open access

     Garcia Almiñana, Jordi; Somé, Michel; Ayguade Parra, Eduard; Cabre Garcia, Jose Maria; Casañ Guerrero, Maria Jose; Frigola Bourlon, Manel; Galanis ., Nikolaos; Garcia-cervigon Gutierrez, Manuel; Guerrero Zapata, Manel; Muñoz Gracia, Maria del Pilar
    Moodle Research Conference
    Presentation's date: 2012-09-15
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    E-learning environments, such as Moodle, provide a technology that fosters the improvement of the educational system in developed countries, where education is traditionally performed with relatively high standards of quality. A large number of case studies and research have been conducted to demonstrate how e-learning technologies can be applied to improve both training and learning processes. However, these technologies have not been proved efficient when applied to developing countries. The challenges that must be addressed in developing countries, both technological and societal, are much more complex and the possible solution margins are more constrained than those existing in the context where these technologies have been created. In this paper we show how Moodle can be used to improve the quality of education in developing countries and, even more important, how can be used to turn the educational system more sustainable and effective in the long-term. We describe our experience in implementing a programming course in Moodle for the Higher School of Informatics at the Université Polytechnique de Bobo-Dioulasso, in Burkina Faso (West Africa), joining efforts with local professors in designing and implementing the learning system. The case example has been designed having in mind a number of contextual problems: lack of lecturers, excessive teaching hours per lecturer, massive classes, and curricula organization and stability, among others. We finally discuss how the teaching effort is reduced, the students’ knowledge and capacity improves, and the institutional academic model can be guaranteed with the proposal. For this reason, we claim that information technologies in developing countries are a cost-effective way to guarantee the objectives originally defined in the academic curricula and, therefore, deal with the problem of the education.

  • Hardware-software coherence protocol for the coexistence of caches and local memories

     Alvarez, Lluc; Vilanova, Lluis; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference for High Performance Computing, Networking, Storage and Analysis
    Presentation's date: 2012-11-07
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • BSArc: blacksmith streaming architecture for HPC accelerators

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM International Conference on Computing Frontiers
    Presentation's date: 2012-05
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Task-based parallel breadth-first search in heterogeneous environments

     Munguía, Lluis Miquel; Bader, David A.; Ayguade Parra, Eduard
    International Conference on High Performance Computing
    Presentation's date: 2012-12
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Breadth-first search (BFS) is an essential graph traversal strategy widely used in many computing applications. Because of its irregular data access patterns, BFS has become a non-trivial problem hard to parallelize efficiently. In this paper, we introduce a parallelization strategy that allows the load balancing of computation resources as well as the execution of graph traversals in hybrid environments composed of CPUs and GPUs. To achieve that goal, we use a fine-grained task-based parallelization scheme and the OmpSs programming model. We obtain processing rates up to 2.8 billion traversed edges per second with a single GPU and a multi-core processor. Our study shows high processing rates are achievable with hybrid environments despite the GPU communication latency and memory coherence.

  • Supporting stateful tasks in a dataflow graph

     Gajinov, Vladimir; Stipic, Srdjan; Unsal, Osman Sabri; Harris, Tim; Ayguade Parra, Eduard; Cristal Kestelman, Adrián
    International Conference on Parallel Architectures and Compilation Techniques
    Presentation's date: 2012-09-19
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    This paper introduces Atomic Dataflow Model (ADF) - a programming model for shared-memory systems that combines aspects of dataflow programming with the use of explicitly mutable state. The model provides language constructs that allow a programmer to delineate a program into a set of tasks and to explicitly define input data for each task. This information is conveyed to the ADF runtime system which constructs the task dependency graph and builds the necessary infrastructure for dataflow execution. However, the key aspect of the proposed model is that it does not require the programmer to specify all of the task’s dependencies exp licitly, but only those that imply logical ordering between tasks. The ADF model manages the remainder of inter-task dependencies automatically, by executing the body of the task within an implicit memory transaction. This provides an easy- to -program optimistic concurrency substrate and enables a task to safely share data with other concurrent tasks. In this paper, we describe the ADF model and show how it can increase the programmability of shared memory systems.

  • Productive programming of GPU clusters with OmpSs

     Bueno Hedo, Javier; Planas, Judit; Duran Gonzalez, Alejandro; Badia Sala, Rosa Maria; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    IEEE International Parallel and Distributed Processing Symposium
    Presentation's date: 2012
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Clusters of GPUs are emerging as a new computational scenario. Programming them requires the use of hybrid models that increase the complexity of the applications, reducing the productivity of programmers. We present the implementation of OmpSs for clusters of GPUs, which supports asynchrony and heterogeneity for task parallelism. It is based on annotating a serial application with directives that are translated by the compiler. With it, the same program that runs sequentially in a node with a single GPU can run in parallel in multiple GPUs either local (single node) or remote (cluster of GPUs). Besides performing a task-based parallelization, the runtime system moves the data as needed between the different nodes and GPUs minimizing the impact of communication by using affinity scheduling, caching, and by overlapping communication with the computational task. We show several applicactions programmed with OmpSs and their performance with multiple GPUs in a local node and in remote nodes. The results show good tradeoff between performance and effort from the programmer.

  • Transactional access to shared memory in StarSs, a task based programming model

     Gayatri, Rahulkumar; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Lujan, M; Watson, I.
    International Conference on Parallel and Distributed Computing
    Presentation's date: 2012-08-27
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    With an increase in the number of processors on a single chip, programming environments which facilitate the exploitation of par- allelism on multicore architectures have become a necessity. StarSs is a task-based programming model that enables a flexible and high level programming. Although task synchronization in StarSs is based on data flow and dependency analysis, some applications (e.g. reductions )require locks to access shared data. Transactional Memory is an alternative to lock-based synchronization for controlling access to shared data. In this paper we explore the idea of integrating a lightweight Software Transactional Memory (STM) library, TinySTM , into an implementation of StarSs (SMPSs). The SMPSs run- time and the compiler have been modified to include and use calls to the STM library. We evaluated this approach on four applications and observe better performance in applications with high lock contention.

  • Architectural Explorations for Streaming Accelerators with Customized Memory Layouts  Open access

     Shafiq, Muhammad
    Defense's date: 2012-05-21
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    El concepto básico de la arquitectura mono-nucleo en los procesadores de propósito general se ajusta bien a un modelo de programación secuencial. La integración de multiples núcleos en un solo chip ha permitido a los procesadores correr partes del programa en paralelo. Sin embargo, la explotación del enorme paralelismo disponible en muchas aplicaciones de alto rendimiento y de los datos correspondientes es difícil de conseguir usando unicamente multicores de propósito general. La aparición de aceleradores tipo streaming y de los correspondientes modelos de programación han mejorado esta situación proporcionando arquitecturas orientadas al proceso de flujos de datos. La idea básica detrás del diseño de estas arquitecturas responde a la necesidad de procesar conjuntos enormes de datos. Estos dispositivos de alto rendimiento orientados a flujos permiten el procesamiento rapido de datos mediante el uso eficiente de computación paralela y comunicación entre procesos. Los aceleradores streaming orientados a flujos, igual que en otros procesadores, consisten en diversos componentes micro-arquitectonicos como por ejemplo las estructuras de memoria, las unidades de computo, las unidades de control, los canales de Entrada/Salida y controles de Entrada/Salida, etc. Sin embargo, los requisitos del flujo de datos agregan algunas características especiales e imponen otras restricciones que afectan al rendimiento. Estos dispositivos, por lo general, ofrecen un gran número de recursos computacionales, pero obligan a reorganizar los conjuntos de datos en paralelo, maximizando la independiencia para alimentar los recursos de computación en forma de flujos. La disposición de datos en conjuntos independientes de flujos paralelos no es una tarea sencilla. Es posible que se tenga que cambiar la estructura de un algoritmo en su conjunto o, incluso, puede requerir la reescritura del algoritmo desde cero. Sin embargo, todos estos esfuerzos para la reordenación de los patrones de las aplicaciones de acceso a datos puede que no sean muy útiles para lograr un rendimiento óptimo. Esto es debido a las posibles limitaciones microarquitectonicas de la plataforma de destino para los mecanismos hardware de prefetch, el tamaño y la granularidad del almacenamiento local, y la flexibilidad para disponer de forma serial los datos en el interior del almacenamiento local. Las limitaciones de una plataforma de streaming de proposito general para el prefetching de datos, almacenamiento y demas procedimientos para organizar y mantener los datos en forma de flujos paralelos e independientes podría ser eliminado empleando técnicas a nivel micro-arquitectonico. Esto incluye el uso de memorias personalizadas especificamente para las aplicaciones en el front-end de una arquitectura streaming. El objetivo de esta tesis es presentar exploraciones arquitectónicas de los aceleradores streaming con diseños de memoria personalizados. En general, la tesis cubre tres aspectos principales de tales aceleradores. Estos aspectos se pueden clasificar como: i) Diseño de aceleradores de aplicaciones específicas con diseños de memoria personalizados, ii) diseño de aceleradores con memorias personalizadas basados en plantillas, y iii) exploraciones del espacio de diseño para dispositivos orientados a flujos con las memorias estándar y personalizadas. Esta tesis concluye con la propuesta conceptual de una Blacksmith Streaming Architecture (BSArc). El modelo de computación Blacksmith permite la adopción a nivel de hardware de un front-end de aplicación específico utilizando una GPU como back-end. Esto permite maximizar la explotación de la localidad de datos y el paralelismo a nivel de datos de una aplicación mientras que proporciona un flujo mayor de datos al back-end. Consideramos que el diseño de estos procesadores con memorias especializadas debe ser proporcionado por expertos del dominio de aplicación en la forma de plantillas.

    The basic concept behind the architecture of a general purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip helped CPUs in running parts of a program in parallel. However, the utilization of huge parallelism available from many high performance applications and the corresponding data is hard to achieve from these general purpose multi-cores. Streaming accelerators and the corresponding programing models improve upon this situation by providing throughput oriented architectures. The basic idea behind the design of these architectures matches the everyday increasing requirements of processing huge data sets. These high-performance throughput oriented devices help in high performance processing of data by using efficient parallel computations and streaming based communications. The throughput oriented streaming accelerators ¿ similar to the other processors ¿ consist of numerous types of micro-architectural components including the memory structures, compute units, control units, I/O channels and I/O controls etc. However, the throughput requirements add some special features and impose other restrictions for the performance purposes. These devices, normally, offer a large number of compute resources but restrict the applications to arrange parallel and maximally independent data sets to feed the compute resources in the form of streams. The arrangement of data into independent sets of parallel streams is not an easy and simple task. It may need to change the structure of an algorithm as a whole or even it can require to write a new algorithm from scratch for the target application. However, all these efforts for the re-arrangement of application data access patterns may still not be very helpful to achieve the optimal performance. This is because of the possible micro-architectural constraints of the target platform for the hardware pre-fetching mechanisms, the size and the granularity of the local storage and the flexibility in data marshaling inside the local storage. The constraints of a general purpose streaming platform on the data pre-fetching, storing and maneuvering to arrange and maintain it in the form of parallel and independent streams could be removed by employing micro-architectural level design approaches. This includes the usage of application specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations for the streaming accelerators using customized memory layouts. In general the thesis covers three main aspects of such streaming accelerators in this research. These aspects can be categorized as : i) Design of Application Specific Accelerators with Customized Memory Layout ii) Template Based Design Support for Customized Memory Accelerators and iii) Design Space Explorations for Throughput Oriented Devices with Standard and Customized Memories. This thesis concludes with a conceptual proposal on a Blacksmith Streaming Architecture (BSArc). The Blacksmith Computing allow the hardware-level adoption of an application specific front-end with a GPU like streaming back-end. This gives an opportunity to exploit maximum possible data locality and the data level parallelism from an application while providing a throughput natured powerful back-end. We consider that the design of these specialized memory layouts for the front-end of the device are provided by the application domain experts in the form of templates. These templates are adjustable according to a device and the problem size at the device's configuration time. The physical availability of such an architecture may still take time. However, simulation framework helps in architectural explorations to give insight into the proposal and predicts potential performance benefits for such an architecture.

  • Software Caching Techniques and Hardware Optimizations for On-Chip Local Memories  Open access

     Vujic, Nikola
    Defense's date: 2012-06-05
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Despite the fact that the most viable L1 memories in processors are caches, on-chip local memories have been a great topic of consideration lately. Local memories are an interesting design option due to their many benefits: less area occupancy, reduced energy consumption and fast and constant access time. These benefits are especially interesting for the design of modern multicore processors since power and latency are important assets in computer architecture today. Also, local memories do not generate coherency traffic which is important for the scalability of the multicore systems. Unfortunately, local memories have not been well accepted in modern processors yet, mainly due to their poor programmability. Systems with on-chip local memories do not have hardware support for transparent data transfers between local and global memories, and thus ease of programming is one of the main impediments for the broad acceptance of those systems. This thesis addresses software and hardware optimizations regarding the programmability, and the usage of the on-chip local memories in the context of both single-core and multicore systems. Software optimizations are related to the software caching techniques. Software cache is a robust approach to provide the user with a transparent view of the memory architecture; but this software approach can suffer from poor performance. In this thesis, we start optimizing traditional software cache by proposing a hierarchical, hybrid software-cache architecture. Afterwards, we develop few optimizations in order to speedup our hybrid software cache as much as possible. As the result of the software optimizations we obtain that our hybrid software cache performs from 4 to 10 times faster than traditional software cache on a set of NAS parallel benchmarks. We do not stop with software caching. We cover some other aspects of the architectures with on-chip local memories, such as the quality of the generated code and its correspondence with the quality of the buffer management in local memories, in order to improve performance of these architectures. Therefore, we run our research till we reach the limit in software and start proposing optimizations on the hardware level. Two hardware proposals are presented in this thesis. One is about relaxing alignment constraints imposed in the architectures with on-chip local memories and the other proposal is about accelerating the management of local memories by providing hardware support for the majority of actions performed in our software cache.

    Malgrat les memòries cau encara son el component basic pel disseny del subsistema de memòria, les memòries locals han esdevingut una alternativa degut a les seves característiques pel que fa a l’ocupació d’àrea, el seu consum energètic i el seu rendiment amb un temps d’accés ràpid i constant. Aquestes característiques son d’especial interès quan les properes arquitectures multi-nucli estan limitades pel consum de potencia i la latència del subsistema de memòria.Les memòries locals pateixen de limitacions respecte la complexitat en la seva programació, fet que dificulta la seva introducció en arquitectures multi-nucli, tot i els avantatges esmentats anteriorment. Aquesta tesi presenta un seguit de solucions basades en programari i maquinari específicament dissenyat per resoldre aquestes limitacions.Les optimitzacions del programari estan basades amb tècniques d'emmagatzematge de memòria cau suportades per llibreries especifiques. La memòria cau per programari és un sòlid mètode per proporcionar a l'usuari una visió transparent de l'arquitectura, però aquest enfocament pot patir d'un rendiment deficient. En aquesta tesi, es proposa una estructura jeràrquica i híbrida. Posteriorment, desenvolupem optimitzacions per tal d'accelerar l’execució del programari que suporta el disseny de la memòria cau. Com a resultat de les optimitzacions realitzades, obtenim que el nostre disseny híbrid es comporta de 4 a 10 vegades més ràpid que una implementació tradicional de memòria cau sobre un conjunt d’aplicacions de referencia, com son els “NAS parallel benchmarks”.El treball de tesi inclou altres aspectes de les arquitectures amb memòries locals, com ara la qualitat del codi generat i la seva correspondència amb la qualitat de la gestió de memòria intermèdia en les memòries locals, per tal de millorar el rendiment d'aquestes arquitectures. La tesi desenvolupa propostes basades estrictament en el disseny de nou maquinari per tal de millorar el rendiment de les memòries locals quan ja no es possible realitzar mes optimitzacions en el programari. En particular, la tesi presenta dues propostes de maquinari: una relaxa les restriccions imposades per les memòries locals respecte l’alineament de dades, l’altra introdueix maquinari específic per accelerar les operacions mes usuals sobre les memòries locals.

  • OmpSs-OpenCL programming model for heterogeneous systems

     Elangovan, Vinoth Krishnan; Badia Sala, Rosa Maria; Ayguade Parra, Eduard
    International Workshop on Languages and Compilers for Parallel Computing
    Presentation's date: 2012-09
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    The advent of heterogeneous computing has forced programmers to use platform specific programming paradigms in order to achieve maximum performance. This approach has a steep learning curve for programmers and also has detrimental influence on productivity and code re-usability. To help with this situation, OpenCL an open-source, parallel computing API for cross platform computations was conceived. OpenCL provides a homogeneous view of the computational resources (CPU and GPU) thereby enabling software portability across different platforms. Although OpenCL resolves software portability issues, the programming paradigm presents low programmability and additionally falls short in performance. In this paper we focus on integrating OpenCL framework with the OmpSs task based programming model using Nanos run time infrastructure to address these shortcomings. This would enable the programmer to skip cumbersome OpenCL constructs including OpenCL plaform creation, compilation, kernel building, kernel argument setting and memory transfers, instead write a sequential program with annotated pragmas. Our proposal mainly focuses on how to exploit the best of the underlying hardware platform with greater ease in programming and to gain significant performance using the data parallelism offered by the OpenCL run time for GPUs and multicore architectures. We have evaluated the platform with important benchmarks and have noticed substantial ease in programming with comparable performance.

    The advent of heterogeneous computing has forced programmers to use platform specific programming paradigms in order to achieve maximum performance. This approach has a steep learning curve for programmers and also has detrimental influence on productivity and code re-usability. To help with this situation, OpenCL an open-source, parallel computing API for cross platform computations was conceived. OpenCL provides a homogeneous view of the computational resources (CPU and GPU) thereby enabling software portability across different platforms. Although OpenCL resolves software portability issues, the programming paradigm presents low programmability and additionally falls short in performance. In this paper we focus on integrating OpenCL framework with the OmpSs task based programming model using Nanos run time infrastructure to address these shortcomings. This would enable the programmer to skip cumbersome OpenCL constructs including OpenCL plaform creation, compilation, kernel building, kernel argument setting and memory transfers, instead write a sequential program with annotated pragmas. Our proposal mainly focuses on how to exploit the best of the underlying hardware platform with greater ease in programming and to gain significant performance using the data parallelism offered by the OpenCL run time for GPUs and multicore architectures. We have evaluated the platform with important benchmarks and have noticed substantial ease in programming with comparable performance.

  • POTRA: a framework for building power models for next generation multicore architectures

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM SIGMETRICS/PERFORMANCE joint International Conference on Measurement and Modeling of Computer Systems
    Presentation's date: 2012
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • On the instrumentation of OpenMP and OmpSs Tasking constructs  Open access

     Servat Gelabert, Harald; Teruel Garcia, Xavier; Llort Sanchez, German Matías; Duran Gonzalez, Alejandro; Giménez Lucas, Judit; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    Workshop on Productivity and Performance
    Presentation's date: 2012-08
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Parallelism has become more and more commonplace with the advent of the multicore processors. Although different parallel pro- gramming models have arisen to exploit the computing capabilities of such processors, developing applications that take benefit of these pro- cessors may not be easy. And what is worse, the performance achieved by the parallel version of the application may not be what the developer expected, as a result of a dubious ut ilization of the resources offered by the processor. We present in this paper a fruitful synergy of a shared memory parallel compiler and runtime, and a performance extraction library. The objective of this work is not only to reduce the performance analysis life-cycle when doing the parallelization of an application, but also to extend the analysis experience of the parallel application by incorporating data that is only known in the compiler and runtime side. Additionally we present performance results obtained with the execution of instrumented application and evaluate the overhead of the instrumentation.

  • OmpSs to OpenCL and to FPGAs

     Alvarez Martinez, Carlos; Jimenez Gonzalez, Daniel; Cabrera, Daniel; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Workshop on Efficient Parallel Programming of Bioinformatics Applications on Heterogeneous HPC Platforms
    Presentation's date: 2012-06-28
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Optimizing the exploitation of multicore processors and GPUs with OpenMP and OpenCL

     Ferrer, Roger; Planas Carbonell, Judit; Bellens, Pieter; Duran Gonzalez, Alejandro; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    Lecture notes in computer science
    Date of publication: 2011
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    In this paper, we present OMPSs, a programming model based on OpenMP and StarSs, that can also incorporate the use of OpenCL or CUDA kernels. We evaluate the proposal on three different architectures, SMP, Cell/B.E. and GPUs, showing the wide usefulness of the approach. The evaluation is done with four different benchmarks, Matrix Multiply, BlackScholes, Perlin Noise, and Julia Set. We compare the results obtained with the execution of the same benchmarks written in OpenCL, in the same architectures. The results show that OMPSs greatly outperforms the OpenCL environment. It is more flexible to exploit multiple accelerators. And due to the simplicity of the annotations, it increases programmer¿s productivity

    In this paper, we present OMPSs, a programming model based on OpenMP and StarSs, that can also incorporate the use of OpenCL or CUDA kernels. We evaluate the proposal on three different architectures, SMP, Cell/B.E. and GPUs, showing the wide usefulness of the approach. The evaluation is done with four different benchmarks, Matrix Multiply, BlackScholes, Perlin Noise, and Julia Set. We compare the results obtained with the execution of the same benchmarks written in OpenCL, in the same architectures. The results show that OMPSs greatly outperforms the OpenCL environment. It is more flexible to exploit multiple accelerators. And due to the simplicity of the annotations, it increases programmer’s productivity

  • The Abstract Streaming Machine: compile-time performance modelling of stream programs on heterogeneous multiprocessors

     Carpenter, Paul Matthew; Ramirez Bellido, Alejandro; Ayguade Parra, Eduard
    Transactions on HIPEAC
    Date of publication: 2011-10-01
    Journal article

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Access to the full text
    ACOTES project: Advanced compiler technologies for embedded streaming  Open access

     Munk, H.; Ayguade Parra, Eduard; Bastoul, C.; Carpenter, Paul Matthew; Chamski, Z.; Cohen, A.; Cornero, M.; Dumont, P.; Duranton, M.; Fellahi, M.; Ferrer, Roger; Ladelsky, R.; Lindwer, M.; Martorell Bofill, Xavier; Miranda, C.; Nuzman, D.; Ornstein, A.; Pop, A.; Pop, S.; Pouchet, L. N; Ramirez Bellido, Alejandro; Ródenas, D.; Rohou, E.; Rosen, I.; Shvadron, U.; Trifunovic, K.; Zaks, A.
    International journal of parallel programming
    Date of publication: 2011-04
    Journal article

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.

  • Assessing accelerator-based HPC reverse time migration

     Araya Polo, Mauricio; Cabezas, Javier; Hanzich, Mauricio; Pericas, Miquel; Rubio, Fèlix; Gelado Fernandez, Isaac; Shafiq, Muhammad; Morancho Llena, Enrique; Navarro Mas, Nacho; Ayguade Parra, Eduard; Cela Espin, Jose M.; Valero Cortes, Mateo
    IEEE transactions on parallel and distributed systems
    Date of publication: 2011-01
    Journal article

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • OmpSs: A proposal for programming heterogeneous multi-core architectures

     Duran Gonzalez, Alejandro; Ayguade Parra, Eduard; Badia,, R.M.; Labarta Mancho, Jesus Jose; Martinell, Lluis; Martorell Bofill, Xavier; Planas, Judit
    Parallel processing letters
    Date of publication: 2011-06
    Journal article

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Poster: programming clusters of GPUs with OMPSs

     Bueno Hedo, Javier; Duran Gonzalez, Alejandro; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Badia Sala, Rosa Maria
    International Conference for High Performance Computing, Networking, Storage and Analysis~
    Presentation's date: 2011-11-18
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Productive cluster programming with OmpSs

     Bueno Hedo, Javier; Martinell, Lluis; Duran Gonzalez, Alejandro; Farreras Esclusa, Montserrat; Martorell Bofill, Xavier; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    International European Conference on Parallel and Distributed Computing
    Presentation's date: 2011-09-01
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Implementation of a reverse time migration kernel using the HCE high level synthesis tool

     Hussain, Tassadaq; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference on Field-Programmable Technology
    Presentation's date: 2011-12
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window