Scientific and technological production

1 to 50 of 514 results
  • A systematic methodology to generate decomposable and responsive power models for CMPs

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    IEEE transactions on computers
    Date of publication: 2013-07
    Journal article

    Power modeling based on performance monitoring counters (PMCs) has attracted the interest of researchers since it became a quick approach to understanding the power behavior of real systems. Consequently, several power-aware policies use models to guide their decisions. Hence, the presence of power models that are informative, accurate, and capable of detecting power phases is critical to improving the success of power-saving techniques. Additionally, the design of current processors has varied considerably with the appearance of CMPs (multiple cores sharing resources). Thus, PMC-based power models warrant further investigation on current energy-efficient multicore processors. In this paper, we present a systematic methodology to produce decomposable PMC-based power models on current multicore architectures. Apart from being able to estimate the power consumption accurately, the models provide per-component power consumption, supplying extra insights about power behavior. Moreover, we study their responsiveness (the capacity to detect power phases). Specifically, we produce power models for an Intel Core 2 Duo with one and two cores enabled for all the DVFS configurations. The models are empirically validated using the SPECcpu2006, NAS and LMBENCH benchmarks. Finally, we compare the models against existing approaches, concluding that the proposed methodology produces more accurate, responsive, and informative models.

  • Deadline-based MapReduce workload management

     Polo, Jorda; Becerra Fontal, Yolanda; Carrera Perez, David; Steinder, Malgorzata; Whalley, Ian; Torres Viñals, Jordi; Ayguade Parra, Eduard
    IEEE transactions on network and service management
    Date of publication: 2013
    Journal article

    This paper presents a scheduling technique for multi-job MapReduce workloads that is able to dynamically build performance models of the executing workloads, and then use these models for scheduling purposes. This ability is leveraged to adaptively manage workload performance while observing and taking advantage of the particulars of the execution environment of modern data analytics applications, such as hardware heterogeneity and distributed storage. The technique targets a highly dynamic environment in which new jobs can be submitted at any time, and in which MapReduce workloads share physical resources with other workloads. Thus the actual amount of resources available for applications can vary over time. Beyond the formulation of the problem and the description of the algorithm and technique, a working prototype (called Adaptive Scheduler) has been implemented. Using the prototype and medium-sized clusters (of the order of tens of nodes), the following aspects have been studied separately: the scheduler's ability to meet high-level performance goals guided only by user-defined completion time goals; the scheduler's ability to favor data-locality in the scheduling algorithm; and the scheduler's ability to deal with hardware heterogeneity, which introduces hardware affinity and relative performance characterization for those applications that can benefit from executing on specialized processors.

  • Implementing OmpSs support for regions of data in architectures with multiple address spaces

     Bueno Hedo, Javier; Martorell Bofill, Xavier; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    ACM/IEEE International Conference on Supercomputing
    Presentation's date: 2013-06
    Presentation of work at congresses

    The need for features for managing complex data accesses in modern programming models has increased due to emerging hardware architectures. HPC hardware has moved towards clusters of accelerators and/or multicores, architectures with a complex memory hierarchy exposed to the programmer. We present the implementation of data regions in the OmpSs programming model, a high-productivity annotation-based programming model derived from OpenMP. This enables the programmer to specify regions of strided and/or overlapped data used by the parallel tasks of the application. The data is automatically managed by the underlying run-time environment, which can transparently apply optimization techniques to improve performance. This approach, based on a high-productivity programming model, contrasts with more direct approaches like MPI, where the programmer has to deal explicitly with data management. It is generally believed that such direct approaches are capable of achieving the best possible performance, so we also compare the performance of several OmpSs applications against their well-known MPI counterparts, obtaining comparable or better results.

  • Programmability and portability for exascale: top down programming methodology and tools with StarSs

     Subotic, Vladimir; Brinkmann, Steffen; Marjanovic, Vladimir; Badia Sala, Rosa Maria; Gracia, Jose; Niethammer, Christoph; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    Journal of computational science
    Date of publication: 2013-11
    Journal article

    StarSs is a task-based programming model that allows sequential applications to be parallelized by annotating the code with compiler directives. The model further supports transparent execution of designated tasks on heterogeneous platforms, including clusters of GPUs. This paper focuses on the methodology and tools that complement the programming model, forming a consistent development environment with the objective of simplifying the life of application developers. The programming environment includes the tools TAREADOR and TEMANEJO, which have been designed specifically for StarSs. TAREADOR, a Valgrind-based tool, allows a top-down development approach by assisting the programmer in identifying tasks and their data dependencies across all concurrency levels of an application. TEMANEJO is a graphical debugger that supports the programmer by visualizing the task dependency tree and by allowing task scheduling or dependencies to be manipulated. These tools are complemented with a set of performance analysis tools (Scalasca, Cube and Paraver) that enable fine-tuning of StarSs applications.

  • A template system for the efficient compilation of domain abstractions onto reconfigurable computers

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    Journal of systems architecture
    Date of publication: 2013-02
    Journal article

    Past research has addressed the issue of using FPGAs as accelerators for HPC systems. Such research has identified that writing low level code for the generation of an efficient, portable and scalable architecture is challenging. We propose to increase the level of abstraction in order to help developers of reconfigurable accelerators deal with these three key issues. Our approach implements domain specific abstractions for FPGA based accelerators using techniques from generic programming. In this paper we explain the main concepts behind our system to Design Accelerators by Template Expansions (DATE). The DATE system can be effectively used for expanding individual kernels of an application and also for the generation of interfaces between various kernels to implement a complete system architecture. We present evaluations for six kernels as examples of individual kernel generation using the proposed system. Our evaluations are mainly intended to provide a proof-of-concept. We also show the usage of the DATE system for integration of various kernels to build a complete system based on a Template Architecture for Reconfigurable Accelerator Designs (TARCAD).

  • Optimization techniques for fine-grained communication in PGAS environments

     Alvanos, Michail
    Defense's date: 2013-12-10
    Universitat Politècnica de Catalunya
    Theses

    Programming languages based on the Partitioned Global Address Space (PGAS) model promise better programmer productivity and good performance on large-scale parallel computers. However, it is difficult to achieve adequate performance for applications that rely on fine-grained communication without compromising their programmability. Manual or compiler assistance is usually required to optimize the code so as to avoid fine-grained data accesses. The drawback of applying code transformations manually is the increase in program complexity, which greatly reduces programmer productivity. On the other hand, the optimizations that the compiler can apply to fine-grained accesses require knowledge of the physical data mapping and the use of parallel loop constructs. This thesis presents optimizations to solve the three main problems of fine-grained communication: (i) the low efficiency of network communication, (ii) the large number of runtime calls, and (iii) the appearance of congestion in the communication network due to the non-uniform distribution of data. To solve these problems, the thesis presents three approaches. First, it presents an improved inspector-executor transformation that increases network efficiency through runtime data aggregation. Second, it presents additional optimizations to the inspector-executor loop transformation that automatically eliminate runtime calls. Finally, the thesis presents a loop transformation to avoid congestion in the communication network and the overloading of nodes.
Unlike previous work that uses static data aggregation, prefetching, limited data privatization, and software cache management, the solutions presented in this thesis cover all aspects related to fine-grained communication, including reducing the number of compiler-generated calls and minimizing the overhead of the inspector-executor optimizations. The proposals are evaluated with several microbenchmarks and benchmarks in order to determine their scalability and performance on the Power 775 architecture. The results indicate that applications with regular data accesses achieve up to 180% of the performance of hand-optimized versions, while for applications with irregular data accesses the transformations are expected to produce versions from 1.12x up to 6.3x faster. The loop scheduling techniques show performance improvements between 3% and 25% for NAS FT and sorting applications, and up to 3.4x in the microbenchmarks.

  • Enhancing the Efficiency and Practicality of Software Transactional Memory on Massively Multithreaded Systems  Open access

     Kestor, Gokcen
    Defense's date: 2013-03-22
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Chip Multithreading (CMT) processors promise to deliver higher performance by running more than one stream of instructions in parallel. To exploit CMT's capabilities, programmers have to parallelize their applications, which is not a trivial task. Transactional Memory (TM) is one of the parallel programming models that aim at simplifying synchronization by raising the level of abstraction between semantic atomicity and the means by which that atomicity is achieved. TM is a promising programming model, but there are still important challenges that must be addressed to make it more practical and efficient in mainstream parallel programming. The first challenge addressed in this dissertation is that of making the evaluation of TM proposals more solid with realistic TM benchmarks and being able to run the same benchmarks on different STM systems. We first introduce RMS-TM, a comprehensive benchmark suite to evaluate HTMs and STMs. RMS-TM consists of seven applications from the Recognition, Mining and Synthesis (RMS) domain that are representative of future workloads. RMS-TM features current TM research issues such as nesting and I/O inside transactions, while also providing various TM characteristics. Most STM systems are implemented as user-level libraries: the programmer is expected to manually instrument not only transaction boundaries, but also individual loads and stores within transactions. This library-based approach is increasingly tedious and error-prone, and also makes it difficult to make reliable performance comparisons. To enable an "apples-to-apples" performance comparison, we then develop a software layer that allows researchers to test the same applications with interchangeable STM back ends. The second challenge addressed is that of enhancing performance and scalability of TM applications running on aggressive multi-core/multi-threaded processors. 
Performance and scalability of current TM designs, in particular STM designs, do not always meet the programmer's expectation, especially at scale. To overcome this limitation, we propose a new STM design, STM2, based on an assisted execution model in which time-consuming TM operations are offloaded to auxiliary threads while application threads optimistically perform computation. Surprisingly, our results show that STM2 provides, on average, speedups between 1.8x and 5.2x over state-of-the-art STM systems. On the other hand, we notice that assisted-execution systems may show low processor utilization. To alleviate this problem and to increase the efficiency of STM2, we enriched STM2 with a runtime mechanism that automatically and adaptively detects the computing demands of application and auxiliary threads and dynamically partitions hardware resources between the pair through the hardware thread prioritization mechanism implemented in POWER machines. The third challenge is to define a notion of what it means for a TM program to be correctly synchronized. The current definition of transactional data race requires all transactions to be totally ordered "as if" serialized by a global lock, which limits the scalability of TM designs. To remove this constraint, we first propose to relax the current definition of transactional data race to allow a higher level of concurrency. Based on this definition we propose the first practical race detection algorithm for C/C++ applications (TRADE) and implement the corresponding race detection tool. Then, we introduce a new definition of transactional data race that is more intuitive, is transparent to the underlying TM implementation, and can be used for a broad set of C/C++ TM programs. Based on this new definition, we propose T-Rex, an efficient and scalable race detection tool for C/C++ TM applications. 
Using TRADE and T-Rex, we have discovered subtle transactional data races in widely-used STAMP applications which have not been reported in the past.

  • Programming and Parallelising Applications for Distributed Infrastructures  Open access

     Tejedor Saavedra, Enric
    Defense's date: 2013-07-15
    Universitat Politècnica de Catalunya
    Theses

    The last decade has witnessed unprecedented changes in parallel and distributed infrastructures. Due to the diminished gains in processor performance from increasing clock frequency, manufacturers have moved from uniprocessor architectures to multicores; as a result, clusters of computers have incorporated such new CPU designs. Furthermore, the ever-growing need of scientific applications for computing and storage capabilities has motivated the appearance of grids: geographically-distributed, multi-domain infrastructures based on sharing of resources to accomplish large and complex tasks. More recently, clouds have emerged by combining virtualisation technologies, service-orientation and business models to deliver IT resources on demand over the Internet. The size and complexity of these new infrastructures pose a challenge for programmers to exploit them. On the one hand, some of the difficulties are inherent to concurrent and distributed programming themselves, e.g. dealing with thread creation and synchronisation, messaging, data partitioning and transfer, etc. On the other hand, other issues are related to the singularities of each scenario, like the heterogeneity of Grid middleware and resources or the risk of vendor lock-in when writing an application for a particular Cloud provider. In the face of such a challenge, programming productivity - understood as a tradeoff between programmability and performance - has become crucial for software developers. There is a strong need for high-productivity programming models and languages, which should provide simple means for writing parallel and distributed applications that can run on current infrastructures without sacrificing performance. In that sense, this thesis contributes with Java StarSs, a programming model and runtime system for developing and parallelising Java applications on distributed infrastructures. 
The model has two key features: first, the user programs in a fully-sequential standard-Java fashion - no parallel construct, API call or pragma must be included in the application code; second, it is completely infrastructure-unaware, i.e. programs do not contain any details about deployment or resource management, so that the same application can run in different infrastructures with no changes. The only requirement for the user is to select the application tasks, which are the model's unit of parallelism. Tasks can be either regular Java methods or web service operations, and they can handle any data type supported by the Java language, namely files, objects, arrays and primitives. For the sake of simplicity of the model, Java StarSs shifts the burden of parallelisation from the programmer to the runtime system. The runtime is responsible for modifying the original application to make it create asynchronous tasks and synchronise data accesses from the main program. Moreover, the implicit inter-task concurrency is automatically found as the application executes, thanks to a data dependency detection mechanism that integrates all the Java data types. This thesis provides a fairly comprehensive evaluation of Java StarSs on three different distributed scenarios: Grid, Cluster and Cloud. For each of them, a runtime system was designed and implemented to exploit their particular characteristics as well as to address their issues, while keeping the infrastructure unawareness of the programming model. The evaluation compares Java StarSs against state-of-the-art solutions, both in terms of programmability and performance, and demonstrates how the model can bring remarkable productivity to programmers of parallel distributed applications.

  • Evaluating techniques for parallelization tuning in MPI, OmpSs and MPI/OmpSs  Open access

     Subotic, Vladimir
    Defense's date: 2013-07-26
    Universitat Politècnica de Catalunya
    Theses

    Parallel programming is used to partition a computational problem among multiple processing units and to define how they interact (communicate and synchronize) in order to guarantee the correct result. The performance that is achieved when executing the parallel program on a parallel architecture is usually far from optimal: computation imbalance and excessive interaction among processing units often cause lost cycles, reducing the efficiency of parallel computation. In this thesis we propose techniques oriented to better exploit parallelism in parallel applications, with emphasis on techniques that increase asynchronism. Theoretically, this type of parallelization tuning promises multiple benefits. First, it should mitigate communication and synchronization delays, thus increasing the overall performance. Furthermore, parallelization tuning should expose additional parallelism and therefore increase the scalability of execution. Finally, increased asynchronism would provide higher tolerance to slower networks and external noise. In the first part of this thesis, we study the potential for tuning MPI parallelism. More specifically, we explore automatic techniques to overlap communication and computation. We propose a speculative messaging technique that increases the overlap and requires no changes to the original MPI application. Our technique automatically identifies the application’s MPI activity and reinterprets that activity using optimally placed non-blocking MPI requests. We demonstrate that this overlapping technique increases the asynchronism of MPI messages, maximizing the overlap, and consequently leading to execution speedup and higher tolerance to bandwidth reduction. However, in the case of realistic scientific workloads, we show that the overlapping potential is significantly limited by the pattern by which each MPI process locally operates on MPI messages. 
In the second part of this thesis, we study the potential for tuning hybrid MPI/OmpSs parallelism. We try to gain a better understanding of the parallelism of hybrid MPI/OmpSs applications in order to evaluate how these applications would execute on future machines and to predict the execution bottlenecks that are likely to emerge. We explore how MPI/OmpSs applications could scale on a parallel machine with hundreds of cores per node. Furthermore, we investigate how this high parallelism within each node would reflect on the network constraints. We especially focus on identifying critical code sections in MPI/OmpSs. We devised a technique that quickly evaluates, for a given MPI/OmpSs application and the selected target machine, which code section should be optimized in order to gain the highest performance benefits. Also, this thesis studies techniques to quickly explore the potential OmpSs parallelism inherent in applications. We provide mechanisms to easily evaluate the potential parallelism of any task decomposition. Furthermore, we describe an iterative trial-and-error approach to search for a task decomposition that will expose sufficient parallelism for a given target machine. Finally, we explore the potential of automating the iterative approach by capturing the programmers’ experience into an expert system that can autonomously lead the search process. Also, throughout the work on this thesis, we designed development tools that can be useful to other researchers in the field. The most advanced of these tools is Tareador, a tool to help port MPI applications to the MPI/OmpSs programming model. Tareador provides a simple interface to propose a decomposition of a code into OmpSs tasks. Tareador dynamically calculates data dependencies among the annotated tasks, and automatically estimates the potential OmpSs parallelization. Furthermore, Tareador gives additional hints on how to complete the process of porting the application to OmpSs. 
Tareador has already proved useful, having been included in the academic classes on parallel programming at UPC.

  • Autonomic placement of mixed batch and transactional workloads

     Carrera Perez, David; Steinder, Malgorzata; Whalley, Ian; Torres Viñals, Jordi; Ayguade Parra, Eduard
    IEEE transactions on parallel and distributed systems
    Date of publication: 2012-02-01
    Journal article

  • Hardware-software coherence protocol for the coexistence of caches and local memories

     Alvarez, Lluc; Vilanova, Lluis; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference for High Performance Computing, Networking, Storage and Analysis
    Presentation's date: 2012-11-07
    Presentation of work at congresses

  • BSArc: blacksmith streaming architecture for HPC accelerators

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM International Conference on Computing Frontiers
    Presentation's date: 2012-05
    Presentation of work at congresses

  • IT or not to be: the impact of Moodle in the education of developing countries  Open access

     Garcia Almiñana, Jordi; Somé, Michel; Ayguade Parra, Eduard; Cabre Garcia, Jose Maria; Casañ Guerrero, Maria Jose; Frigola Bourlon, Manel; Galanis, Nikolaos; Garcia-Cervigon Gutierrez, Manuel; Guerrero Zapata, Manel; Muñoz Gracia, Maria del Pilar
    Moodle Research Conference
    Presentation's date: 2012-09-15
    Presentation of work at congresses

    E-learning environments, such as Moodle, provide a technology that fosters the improvement of the educational system in developed countries, where education is traditionally performed with relatively high standards of quality. A large number of case studies and research projects have been conducted to demonstrate how e-learning technologies can be applied to improve both training and learning processes. However, these technologies have not been proven efficient when applied to developing countries. The challenges that must be addressed in developing countries, both technological and societal, are much more complex, and the possible solution margins are more constrained, than those existing in the context where these technologies were created. In this paper we show how Moodle can be used to improve the quality of education in developing countries and, even more importantly, how it can be used to make the educational system more sustainable and effective in the long term. We describe our experience in implementing a programming course in Moodle for the Higher School of Informatics at the Université Polytechnique de Bobo-Dioulasso, in Burkina Faso (West Africa), joining efforts with local professors in designing and implementing the learning system. The case example has been designed with a number of contextual problems in mind: lack of lecturers, excessive teaching hours per lecturer, massive classes, and curricula organization and stability, among others. We finally discuss how, with the proposal, the teaching effort is reduced, the students’ knowledge and capacity improve, and the institutional academic model can be guaranteed. For this reason, we claim that information technologies in developing countries are a cost-effective way to guarantee the objectives originally defined in the academic curricula and, therefore, to deal with the problem of education.

  • Task-based parallel breadth-first search in heterogeneous environments

     Munguía, Lluis Miquel; Bader, David A.; Ayguade Parra, Eduard
    International Conference on High Performance Computing
    Presentation's date: 2012-12
    Presentation of work at congresses

    Breadth-first search (BFS) is an essential graph traversal strategy widely used in many computing applications. Because of its irregular data access patterns, BFS has become a non-trivial problem hard to parallelize efficiently. In this paper, we introduce a parallelization strategy that allows the load balancing of computation resources as well as the execution of graph traversals in hybrid environments composed of CPUs and GPUs. To achieve that goal, we use a fine-grained task-based parallelization scheme and the OmpSs programming model. We obtain processing rates up to 2.8 billion traversed edges per second with a single GPU and a multi-core processor. Our study shows high processing rates are achievable with hybrid environments despite the GPU communication latency and memory coherence.

  • On the instrumentation of OpenMP and OmpSs Tasking constructs  Open access

     Servat Gelabert, Harald; Teruel Garcia, Xavier; Llort Sanchez, German Matías; Duran Gonzalez, Alejandro; Giménez Lucas, Judit; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    Workshop on Productivity and Performance
    Presentation's date: 2012-08
    Presentation of work at congresses

    Parallelism has become more and more commonplace with the advent of the multicore processors. Although different parallel programming models have arisen to exploit the computing capabilities of such processors, developing applications that take benefit of these processors may not be easy. And what is worse, the performance achieved by the parallel version of the application may not be what the developer expected, as a result of a dubious utilization of the resources offered by the processor. We present in this paper a fruitful synergy of a shared memory parallel compiler and runtime, and a performance extraction library. The objective of this work is not only to reduce the performance analysis life-cycle when doing the parallelization of an application, but also to extend the analysis experience of the parallel application by incorporating data that is only known in the compiler and runtime side. Additionally we present performance results obtained with the execution of instrumented application and evaluate the overhead of the instrumentation.

  • OmpSs to OpenCL and to FPGAs

     Alvarez Martinez, Carlos; Jimenez Gonzalez, Daniel; Cabrera, Daniel; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Workshop on Efficient Parallel Programming of Bioinformatics Applications on Heterogeneous HPC Platforms
    Presentation's date: 2012-06-28
    Presentation of work at congresses

  • PPMC: hardware scheduling and memory management support for multi accelerators

     Hussain, Tassadaq; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference on Field Programmable Logic and Applications
    Presentation's date: 2012-08
    Presentation of work at congresses

  • Assessing the impact of network compression on molecular dynamics and finite element methods

     Dickov, Branimir; Pericas, Miquel; Houzeaux, Guillaume; Navarro Mas, Nacho; Ayguade Parra, Eduard
    IEEE International Conference on High Performance Computing and Communications
    Presentation's date: 2012-06
    Presentation of work at congresses

  • The task dependency analysis tool SSgrind

     Jimenez Gonzalez, Daniel; Alvarez Martinez, Carlos; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Workshop on Efficient Parallel Programming of Bioinformatics Applications on Heterogeneous HPC Platforms
    Presentation's date: 2012-06-28
    Presentation of work at congresses

  • Introducing speculative optimizations in task dataflow with language extensions and runtime support  Open access

     Azuelos, Nathaniel; Etsion, Yoav; Keidar, Idit; Zaks, A.; Ayguade Parra, Eduard
    Workshop on Data-Flow Execution Models for Extreme Scale Computing
    Presentation's date: 2012-09
    Presentation of work at congresses

    We argue that speculation leads to increased parallelism in the coarse-grain dataflow paradigm. To do so, we present a framework for adding speculation in a popular and well-established framework. We specify a limited set of additions to the OmpSs language and changes required in its supporting runtime environment. These modifications enable speculation across the system in a flexible way. We evaluate our implementation using a simple benchmark leading to a promising 10% speedup.

    Postprint (author’s final draft)

  • OmpSs to FPGA and overview of applications

     Jimenez Gonzalez, Daniel; Alvarez Martinez, Carlos; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Workshop on Efficient Parallel Programming of Bioinformatics Applications on Heterogeneous HPC Platforms
    Presentation's date: 2012-06-25
    Presentation of work at congresses

  • Optimizing resource utilization with software-based temporal multi-threading (sTMT)

     Beltran Querol, Vicenç; Ayguade Parra, Eduard
    International Conference on High Performance Computing
    Presentation's date: 2012-12
    Presentation of work at congresses

    Compute and memory access units are two of the most important resources to appropriately manage in current and future multi-/many-core architectures. Memory bandwidth and computational capacity need to be exploited in a combined way to achieve the best system performance. Coarse-grain multi-threading, also known as temporal multi-threading (TMT), is a well-known technique that improves overall resource utilization by time-multiplexing the execution of a reduced number of hardware threads that are switched in case of a high-latency event, such as a memory miss. Hence, the processor does not stall on memory misses and the number of in-flight memory operations is increased, improving the overall processor resource utilization. In this paper, we propose a software-based implementation of TMT that supports an unbounded number of threads and enables a flexible combination of multiple computational kernels. Our TMT implementation is based on micro-threads that combine fast cooperative and preemptive context switches to overcome some intrinsic limitations of current TMT hardware implementations, such as the reduced and fixed number of hardware threads available. Our proposal is demonstrated with an implementation on the Cell/B.E. which is evaluated using heterogeneous mixes of memory-/CPU-bound kernels. Experimental results show how the proposed technique reduces the execution time of several benchmarks by up to 78%.

  • Integrating dataflow abstractions into the shared memory model

     Gajinov, Vladimir; Stipic, Srdjan; Unsal, Osman Sabri; Harris, Tim; Ayguade Parra, Eduard; Cristal Kestelman, Adrián
    International Symposium on Computer Architecture and High Performance Computing
    Presentation's date: 2012-10
    Presentation of work at congresses

    In this paper we present Atomic Dataflow model (ADF), a new task-based parallel programming model for C/C++ which integrates dataflow abstractions into the shared memory programming model. The ADF model provides pragma directives that allow a programmer to organize a program into a set of tasks and to explicitly define input data for each task. The task dependency information is conveyed to the ADF runtime system which constructs the dataflow task graph and builds the necessary infrastructure for dataflow execution. Additionally, the ADF model allows tasks to share data. The key idea is that computation is triggered by dataflow between tasks but that, within a task, execution occurs by making atomic updates to common mutable state. To that end, the ADF model employs transactional memory which guarantees atomicity of shared memory updates. We show examples that illustrate how the programmability of shared memory can be improved using the ADF model. Moreover, our evaluation shows that the ADF model performs well in comparison with programs parallelized using OpenMP and transactional memory.

  • PPMC: a programmable pattern based memory controller

     Hussain, Tassadaq; Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International IEEE/ACM Symposium on Applied Reconfigurable Computing
    Presentation's date: 2012-03
    Presentation of work at congresses

  • Supporting stateful tasks in a dataflow graph

     Gajinov, Vladimir; Stipic, Srdjan; Unsal, Osman Sabri; Harris, Tim; Ayguade Parra, Eduard; Cristal Kestelman, Adrián
    International Conference on Parallel Architectures and Compilation Techniques
    Presentation's date: 2012-09-19
    Presentation of work at congresses

    This paper introduces Atomic Dataflow Model (ADF) - a programming model for shared-memory systems that combines aspects of dataflow programming with the use of explicitly mutable state. The model provides language constructs that allow a programmer to delineate a program into a set of tasks and to explicitly define input data for each task. This information is conveyed to the ADF runtime system which constructs the task dependency graph and builds the necessary infrastructure for dataflow execution. However, the key aspect of the proposed model is that it does not require the programmer to specify all of the task’s dependencies explicitly, but only those that imply logical ordering between tasks. The ADF model manages the remainder of inter-task dependencies automatically, by executing the body of the task within an implicit memory transaction. This provides an easy-to-program optimistic concurrency substrate and enables a task to safely share data with other concurrent tasks. In this paper, we describe the ADF model and show how it can increase the programmability of shared memory systems.

  • Productive programming of GPU clusters with OmpSs

     Bueno Hedo, Javier; Planas, Judit; Duran Gonzalez, Alejandro; Badia Sala, Rosa Maria; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    IEEE International Parallel and Distributed Processing Symposium
    Presentation's date: 2012
    Presentation of work at congresses

    Clusters of GPUs are emerging as a new computational scenario. Programming them requires the use of hybrid models that increase the complexity of the applications, reducing the productivity of programmers. We present the implementation of OmpSs for clusters of GPUs, which supports asynchrony and heterogeneity for task parallelism. It is based on annotating a serial application with directives that are translated by the compiler. With it, the same program that runs sequentially in a node with a single GPU can run in parallel in multiple GPUs either local (single node) or remote (cluster of GPUs). Besides performing a task-based parallelization, the runtime system moves the data as needed between the different nodes and GPUs minimizing the impact of communication by using affinity scheduling, caching, and by overlapping communication with the computational task. We show several applications programmed with OmpSs and their performance with multiple GPUs in a local node and in remote nodes. The results show a good tradeoff between performance and effort from the programmer.

  • Transactional access to shared memory in StarSs, a task based programming model

     Gayatri, Rahulkumar; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Lujan, M; Watson, I.
    International Conference on Parallel and Distributed Computing
    Presentation's date: 2012-08-27
    Presentation of work at congresses

    With an increase in the number of processors on a single chip, programming environments which facilitate the exploitation of parallelism on multicore architectures have become a necessity. StarSs is a task-based programming model that enables a flexible and high level programming. Although task synchronization in StarSs is based on data flow and dependency analysis, some applications (e.g. reductions) require locks to access shared data. Transactional Memory is an alternative to lock-based synchronization for controlling access to shared data. In this paper we explore the idea of integrating a lightweight Software Transactional Memory (STM) library, TinySTM, into an implementation of StarSs (SMPSs). The SMPSs runtime and the compiler have been modified to include and use calls to the STM library. We evaluated this approach on four applications and observe better performance in applications with high lock contention.

  • POTRA: a framework for building power models for next generation multicore architectures

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM SIGMETRICS/PERFORMANCE joint International Conference on Measurement and Modeling of Computer Systems
    Presentation's date: 2012
    Presentation of work at congresses

  • DMA-circular: an enhanced high level programmable DMA controller for optimized management of on-chip local memories

     Vujic, Nikola; Alvarez, Lluc; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    ACM International Conference on Computing Frontiers
    Presentation's date: 2012-05-15
    Presentation of work at congresses

  • A methodology for the evaluation of high response time on E-commerce users and sales

     Poggi, Nicolas; Carrera Perez, David; Gavaldà Mestre, Ricard; Ayguade Parra, Eduard; Torres Viñals, Jordi
    Information systems frontiers
    Date of publication: 2012-10-06
    Journal article

  • Counter-based power modeling methods: top-down vs. bottom-up

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    The computer journal (Kalispell, Mont.)
    Date of publication: 2012-08-24
    Journal article

  • POTRA: a framework for building power models for next generation multicore architectures

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM SIGMETRICS Performance Evaluation Review
    Date of publication: 2012-06
    Journal article

  • A Multicore Emulator with a Profiling Infrastructure for Transactional Memory on FPGA

     Sonmez, Nehir
    Defense's date: 2012-09-19
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

  • Programming Model and Run-Time Optimizations for the Cell/B.E.

     Bellens, Pieter
    Defense's date: 2012-09-27
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

  • Towards Lightweight and High-Performance Hardware Transactional Memory  Open access

     Tomic, Saša
    Defense's date: 2012-07-13
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Conventional lock-based synchronization serializes accesses to critical sections guarded by the same lock. Using multiple locks brings the possibility of a deadlock or a livelock in the program, making parallel programming a difficult task. Transactional Memory (TM) is a promising paradigm for parallel programming, offering an alternative to lock-based synchronization. TM eliminates the risk of deadlocks and livelocks, while it provides the desirable semantics of Atomicity, Consistency, and Isolation of critical sections. TM speculatively executes a series of memory accesses as a single, atomic, transaction. The speculative changes of a transaction are kept private until the transaction commits. If a transaction can break the atomicity or cause a deadlock or livelock, the TM system aborts the transaction and rolls back the speculative changes. To be effective, a TM implementation should provide high performance and scalability. While implementations of TM in pure software (STM) do not provide desirable performance, Hardware TM (HTM) implementations introduce much smaller overhead and have relatively good scalability, due to their better control of hardware resources. However, many HTM systems support only the transactions that fit limited hardware resources (for example, private caches), and fall back to software mechanisms if hardware limits are reached. These HTM systems, called best-effort HTMs, are not desirable since they force a programmer to think in terms of hardware limits, to use both HTM and STM, and to manage concurrent transactions in HTM and STM. In contrast with best-effort HTMs, unbounded HTM systems support overflowed transactions, that do not fit into private caches. Unbounded HTM systems often require complex protocols or expensive hardware mechanisms for conflict detection between overflowed transactions. In addition, an execution with overflowed transactions is often much slower than an execution that has only regular transactions. 
    This is typically due to the restrictive or approximate conflict management mechanisms used for overflowed transactions. In this thesis, we study hardware implementations of transactional memory, and make three main contributions. First, we improve the general performance of HTM systems by proposing a scalable protocol for conflict management. The protocol has precise conflict detection, in contrast with the often-employed inexact Bloom-filter-based conflict detection, which often falsely reports conflicts between transactions. Second, we propose a best-effort HTM that utilizes the new scalable conflict detection protocol, termed EazyHTM. EazyHTM allows parallel commits for all non-conflicting transactions, and generally simplifies transaction commits. Finally, we propose an unbounded HTM that extends and improves the initial protocol for conflict management, and we name it EcoTM. EcoTM features precise conflict detection, and it efficiently supports large as well as small and short transactions. The key idea of EcoTM is to leverage the observation that very few locations are actually conflicting, even if applications have high contention. In EcoTM, each core locally detects if a cache line is non-conflicting, and the conflict detection mechanism is invoked only for the few potentially conflicting cache lines.

    Traditional synchronization based on mutual-exclusion locks serializes accesses to the critical sections protected by the same lock. Using several locks concurrently and/or in parallel raises the possibility of a deadlock or a livelock in the program, which is one of the reasons why parallel programming is much harder than sequential programming. Transactional memory (TM) is a promising paradigm for parallel programming that offers an alternative to locks. Transactional memory has many advantages from both a practical and a theoretical point of view. TM eliminates the risk of deadlock and livelock while providing atomicity, consistency, and isolation semantics similar to those of critical sections. TM speculatively executes a series of memory accesses as a single atomic transaction. The speculative changes of a transaction are kept private until the transaction commits. If a transaction conflicts with another transaction, that is, if one of them writes to an address that the other has read or written, or if a deadlock or livelock arises, the TM system aborts the transaction and rolls back the speculative changes. To be effective, a TM implementation must provide high performance and scalability. Software implementations of TM (STM) do not provide this desirable performance, whereas hardware implementations of TM (HTM) perform better and scale relatively well, thanks to their better control of hardware resources and because conflict resolution as well as data maintenance and management are done in hardware. However, many HTM systems are limited by the available hardware resources, for example the size of the private caches, and rely on software mechanisms when those limits are exceeded. These HTM systems, called best-effort HTMs, are not desirable because they force the programmer to think in terms of the limits of the hardware being used, as well as of the STM system that is invoked when resources are exceeded. In addition, the programmer has to ensure that hardware and software transactions can run concurrently. In contrast, unbounded HTM systems support an unlimited number of operations; that is, they are not restricted by limits artificially imposed by the hardware, such as the size of the caches or internal buffers. Unbounded HTM systems generally require complex protocols or very expensive mechanisms for conflict detection and for maintaining data versions across transactions. Moreover, transaction execution is often much slower than on a bounded HTM system. This is because the mechanisms used in a bounded HTM work with relatively small data sets that fit in, or sit very close to, the processor core. In this thesis we study hardware implementations of TM and present three main contributions. First, we improve the general performance of these systems by proposing a scalable protocol for conflict management. The protocol detects conflicts precisely, in contrast with other techniques based on Bloom filters, which can report false conflicts between transactions. Second, we propose a best-effort HTM, named EazyHTM, that uses the new scalable conflict detection protocol. EazyHTM allows fully parallel execution of all non-conflicting transactions and generally simplifies execution. Finally, we propose an extension and improvement of the initial conflict management protocol, which we call EcoTM. EcoTM features precise and efficient conflict detection and supports both large and small transactions. The key idea of EcoTM is to exploit the observation that conflicts between transactions appear in very few memory locations, even in applications with many conflicts. In EcoTM, each core locally detects whether a line is conflicting; in addition, a detailed conflict detection mechanism is activated only for the few memory lines that are potentially conflicting.

  • Architectural Explorations for Streaming Accelerators with Customized Memory Layouts  Open access

     Shafiq, Muhammad
    Defense's date: 2012-05-21
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    The basic concept of the single-core architecture in general-purpose processors fits a sequential programming model well. Integrating multiple cores on a single chip has allowed processors to run parts of a program in parallel. However, exploiting the enormous parallelism available in many high-performance applications and in their data is hard to achieve using general-purpose multicores alone. The emergence of streaming accelerators and their corresponding programming models has improved this situation by providing architectures oriented to processing data streams. The basic idea behind the design of these architectures responds to the need to process huge data sets. These high-performance stream-oriented devices enable fast data processing through the efficient use of parallel computation and inter-process communication. Stream-oriented streaming accelerators, like other processors, consist of various micro-architectural components, such as memory structures, compute units, control units, I/O channels, and I/O controls. However, the data-flow requirements add some special features and impose other restrictions that affect performance. These devices generally offer a large number of computational resources but force data sets to be reorganized in parallel, maximizing independence, in order to feed the compute resources in the form of streams. Arranging data into independent sets of parallel streams is not a simple task. It may be necessary to change the structure of an algorithm as a whole, or it may even require rewriting the algorithm from scratch. However, all these efforts to rearrange the data access patterns of applications may not be very useful for achieving optimal performance. This is due to possible micro-architectural limitations of the target platform regarding the hardware prefetch mechanisms, the size and granularity of the local storage, and the flexibility to arrange data serially inside the local storage. The limitations of a general-purpose streaming platform for data prefetching, storage, and the other procedures needed to organize and maintain the data as parallel, independent streams could be removed by employing techniques at the micro-architectural level. This includes the use of application-specific customized memories in the front-end of a streaming architecture. The goal of this thesis is to present architectural explorations of streaming accelerators with customized memory designs. In general, the thesis covers three main aspects of such accelerators. These aspects can be classified as: i) design of application-specific accelerators with customized memory designs, ii) template-based design of accelerators with customized memories, and iii) design space explorations for stream-oriented devices with standard and customized memories. This thesis concludes with a conceptual proposal for a Blacksmith Streaming Architecture (BSArc). The Blacksmith computing model allows the hardware-level adoption of an application-specific front-end that uses a GPU as its back-end. This makes it possible to maximize the exploitation of an application's data locality and data-level parallelism while providing a larger data flow to the back-end. We consider that the design of these processors with specialized memories should be provided by application domain experts in the form of templates.

    The basic concept behind the architecture of a general purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip helped CPUs in running parts of a program in parallel. However, the utilization of huge parallelism available from many high performance applications and the corresponding data is hard to achieve from these general purpose multi-cores. Streaming accelerators and the corresponding programing models improve upon this situation by providing throughput oriented architectures. The basic idea behind the design of these architectures matches the everyday increasing requirements of processing huge data sets. These high-performance throughput oriented devices help in high performance processing of data by using efficient parallel computations and streaming based communications. The throughput oriented streaming accelerators ¿ similar to the other processors ¿ consist of numerous types of micro-architectural components including the memory structures, compute units, control units, I/O channels and I/O controls etc. However, the throughput requirements add some special features and impose other restrictions for the performance purposes. These devices, normally, offer a large number of compute resources but restrict the applications to arrange parallel and maximally independent data sets to feed the compute resources in the form of streams. The arrangement of data into independent sets of parallel streams is not an easy and simple task. It may need to change the structure of an algorithm as a whole or even it can require to write a new algorithm from scratch for the target application. However, all these efforts for the re-arrangement of application data access patterns may still not be very helpful to achieve the optimal performance. 
This is because of the possible micro-architectural constraints of the target platform for the hardware pre-fetching mechanisms, the size and the granularity of the local storage and the flexibility in data marshaling inside the local storage. The constraints of a general purpose streaming platform on the data pre-fetching, storing and maneuvering to arrange and maintain it in the form of parallel and independent streams could be removed by employing micro-architectural level design approaches. This includes the usage of application specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations for the streaming accelerators using customized memory layouts. In general the thesis covers three main aspects of such streaming accelerators in this research. These aspects can be categorized as : i) Design of Application Specific Accelerators with Customized Memory Layout ii) Template Based Design Support for Customized Memory Accelerators and iii) Design Space Explorations for Throughput Oriented Devices with Standard and Customized Memories. This thesis concludes with a conceptual proposal on a Blacksmith Streaming Architecture (BSArc). The Blacksmith Computing allow the hardware-level adoption of an application specific front-end with a GPU like streaming back-end. This gives an opportunity to exploit maximum possible data locality and the data level parallelism from an application while providing a throughput natured powerful back-end. We consider that the design of these specialized memory layouts for the front-end of the device are provided by the application domain experts in the form of templates. These templates are adjustable according to a device and the problem size at the device's configuration time. The physical availability of such an architecture may still take time. 
However, a simulation framework supports architectural exploration, giving insight into the proposal and predicting the potential performance benefits of such an architecture.

  • Software Caching Techniques and Hardware Optimizations for On-Chip Local Memories  Open access

     Vujic, Nikola
    Defense's date: 2012-06-05
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Although caches remain the most viable L1 memories in processors, on-chip local memories have received considerable attention lately. Local memories are an interesting design option because of their many benefits: lower area occupancy, reduced energy consumption, and fast, constant access time. These benefits are especially interesting for the design of modern multicore processors, since power and latency are key concerns in computer architecture today. Local memories also generate no coherency traffic, which is important for the scalability of multicore systems. Unfortunately, local memories have not yet been well accepted in modern processors, mainly because of their poor programmability. Systems with on-chip local memories have no hardware support for transparent data transfers between local and global memories, so difficulty of programming is one of the main impediments to the broad acceptance of those systems. This thesis addresses software and hardware optimizations for the programmability and usage of on-chip local memories in both single-core and multicore systems. The software optimizations concern software caching techniques. A software cache is a robust approach to giving the user a transparent view of the memory architecture, but this software approach can suffer from poor performance. In this thesis, we start by optimizing the traditional software cache, proposing a hierarchical, hybrid software-cache architecture. Afterwards, we develop a few optimizations to speed up our hybrid software cache as much as possible. As a result of these software optimizations, our hybrid software cache performs 4 to 10 times faster than a traditional software cache on a set of NAS parallel benchmarks. We do not stop with software caching. 
We also cover other aspects of architectures with on-chip local memories, such as the quality of the generated code and its correspondence with the quality of buffer management in local memories, in order to improve the performance of these architectures. We pursue this research until we reach the limits of software and then propose optimizations at the hardware level. Two hardware proposals are presented in this thesis. One relaxes the alignment constraints imposed in architectures with on-chip local memories; the other accelerates the management of local memories by providing hardware support for the majority of actions performed in our software cache.

    Although caches are still the basic component in the design of the memory subsystem, local memories have become an alternative because of their characteristics in terms of area occupancy, energy consumption, and performance with a fast and constant access time. These characteristics are of particular interest when upcoming multicore architectures are limited by power consumption and by the latency of the memory subsystem. Local memories suffer from limitations regarding the complexity of programming them, which hinders their introduction in multicore architectures despite the advantages mentioned above. This thesis presents a set of solutions based on software and on specifically designed hardware to address these limitations. The software optimizations are based on caching techniques supported by specific libraries. A software cache is a robust method for providing the user with a transparent view of the architecture, but this approach can suffer from poor performance. In this thesis, a hierarchical and hybrid structure is proposed. We then develop optimizations to accelerate the execution of the software that supports the cache design. As a result of these optimizations, our hybrid design performs 4 to 10 times faster than a traditional software cache implementation on a set of reference applications, namely the NAS parallel benchmarks. The thesis work includes other aspects of architectures with local memories, such as the quality of the generated code and its correspondence with the quality of buffer management in local memories, in order to improve the performance of these architectures. 
The thesis develops proposals based strictly on the design of new hardware to improve the performance of local memories when no further software optimizations are possible. In particular, the thesis presents two hardware proposals: one relaxes the restrictions imposed by local memories regarding data alignment, and the other introduces specific hardware to accelerate the most common operations on local memories.
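
    As an illustration of the software caching idea described in the abstract, the following is a minimal sketch of a plain direct-mapped, write-through software cache. It is not the hierarchical, hybrid design proposed in the thesis; the class name `SoftwareCache`, the line size, the number of lines and the use of a Python list as "global memory" are all assumptions made for the example.

```python
class SoftwareCache:
    """Hypothetical direct-mapped, write-through software cache.

    A Python list stands in for global memory; `lines` stands in for
    the on-chip local memory that the software manages explicitly.
    """

    def __init__(self, global_mem, num_lines=4, line_size=8):
        self.global_mem = global_mem
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # global line number held by each slot
        self.lines = [[0] * line_size for _ in range(num_lines)]
        self.hits = 0
        self.misses = 0

    def _lookup(self, addr):
        """Return (slot, offset) for addr, filling the slot on a miss."""
        line_no, offset = divmod(addr, self.line_size)
        slot = line_no % self.num_lines
        if self.tags[slot] == line_no:
            self.hits += 1
        else:
            self.misses += 1
            base = line_no * self.line_size
            # Stands in for a DMA transfer from global to local memory.
            self.lines[slot] = list(self.global_mem[base:base + self.line_size])
            self.tags[slot] = line_no
        return slot, offset

    def read(self, addr):
        slot, offset = self._lookup(addr)
        return self.lines[slot][offset]

    def write(self, addr, value):
        slot, offset = self._lookup(addr)
        self.lines[slot][offset] = value
        # Write-through: global memory is updated on every write.
        self.global_mem[addr] = value


memory = list(range(64))
cache = SoftwareCache(memory)
total = sum(cache.read(addr) for addr in range(16))
```

    Every access pays for the tag check in `_lookup`, which is the kind of per-access software overhead that the abstract identifies as the source of poor performance in software caches.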

  • Energy accounting for shared virtualized environments under DVFS using PMC-based power models

     Bertran Monfort, Ramon; Becerra Fontal, Yolanda; Carrera Perez, David; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Torres Viñals, Jordi; Ayguade Parra, Eduard
    Future generation computer systems
    Date of publication: 2012-02
    Journal article

  • DMA++: on the fly data realignment for on-chip memories

     Vujic, Nikola; Cabarcas Jaramillo, Felipe; Gonzalez Tallada, Marc; Ramirez Bellido, Alejandro; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    IEEE transactions on computers
    Date of publication: 2012-02
    Journal article

  • Implementation of a reverse time migration kernel using the HCE high level synthesis tool

     Hussain, Tassadaq; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference on Field-Programmable Technology
    Presentation's date: 2011-12-12
    Presentation of work at congresses

  • Non-intrusive estimation of QoS degradation impact on E-commerce user satisfaction

     Poggi, Nicolas; Carrera Perez, David; Gavaldà Mestre, Ricard; Ayguade Parra, Eduard
    IEEE International Symposium on Network Computing and Applications
    Presentation's date: 2011-08-26
    Presentation of work at congresses

  • A template system for the efficient compilation of domain abstractions onto reconfigurable computers

     Shafiq, Muhammad; Pericas, Miquel; Ayguade Parra, Eduard
    HiPEAC Workshop on Reconfigurable Computing
    Presentation's date: 2011-01-23
    Presentation of work at congresses

  • TARCAD: a template architecture for reconfigurable accelerator designs

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    IEEE Symposium on Application Specific Processors
    Presentation's date: 2011-06-05
    Presentation of work at congresses

  • Implementation of a reverse time migration kernel using the HCE high level synthesis tool

     Hussain, Tassadaq; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference on Field-Programmable Technology
    Presentation's date: 2011-12
    Presentation of work at congresses

  • Productive cluster programming with OmpSs

     Bueno Hedo, Javier; Martinell, Lluis; Duran Gonzalez, Alejandro; Farreras Esclusa, Montserrat; Martorell Bofill, Xavier; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    International European Conference on Parallel and Distributed Computing
    Presentation's date: 2011-09-01
    Presentation of work at congresses

  • Reconfigurable memory controller with programmable pattern support

     Hussain, Tassadaq; Pericas, Miquel; Ayguade Parra, Eduard
    HiPEAC Workshop on Reconfigurable Computing
    Presentation's date: 2011-01-23
    Presentation of work at congresses

    Heterogeneous architectures are increasingly popular due to their flexibility and high performance per watt. One kind of heterogeneous architecture, the reconfigurable system-on-chip, offers high performance per watt through reconfigurable logic and flexibility through multiprocessor cores. To achieve the performance goals, however, it is necessary to provide enough data to the accelerators. In this paper we describe a programmable, pattern-based memory controller (PMC) that aims to improve the performance of heterogeneous or reconfigurable SoC devices. The supported patterns include scatter-gather and strided 1D, 2D and 3D accesses. PMC can prefetch complete patterns into scratchpads that can then be accessed either by a microprocessor or by an accelerator. As a result, the microprocessors and accelerators can focus on computation and are relieved of having to perform address calculations. PMC has been implemented and tested on an ML505 evaluation board using the MicroBlaze softcore as the platform's microprocessor. While PMC adds some latency, it improves performance by offloading the processor and by making better use of the available bandwidth. When executing a thresholding application in a PMC-based SoC environment, PMC provides a 1.5x speed-up with the processor and a 27x speed-up with a hardware accelerator.
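
    The address calculations that a pattern-based controller takes over from the processor can be sketched in software. The following is a minimal illustration, assuming a pattern is described by a base address plus one (count, stride) pair per dimension; the function names `strided_addresses` and `prefetch` and this descriptor format are hypothetical and do not reflect the actual PMC interface.

```python
def strided_addresses(base, counts, strides):
    """Yield the addresses of a strided pattern, outermost dimension first.

    counts[d] is the number of elements along dimension d and strides[d]
    is the address step along it; one, two or three dimensions give
    1D, 2D or 3D patterns.
    """
    if len(counts) != len(strides):
        raise ValueError("counts and strides must have the same length")

    def walk(dim, addr):
        if dim == len(counts):
            yield addr
            return
        for i in range(counts[dim]):
            yield from walk(dim + 1, addr + i * strides[dim])

    yield from walk(0, base)


def prefetch(memory, base, counts, strides):
    """Copy a complete pattern into a list standing in for a scratchpad."""
    return [memory[a] for a in strided_addresses(base, counts, strides)]


memory = list(range(100))  # stands in for main memory; value equals address

# 1D: four elements, five addresses apart.
column = prefetch(memory, 0, counts=(4,), strides=(5,))

# 2D: a 3x4 tile of a row-major array that is 10 elements wide.
tile = prefetch(memory, 12, counts=(3, 4), strides=(10, 1))
```

    Once the pattern has been copied, the consumer reads `tile` sequentially and performs no address arithmetic of its own, which mirrors the offloading effect described in the abstract.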

  • Resource-aware adaptive scheduling for MapReduce clusters

     Polo, Jordà; Castillo, Claris; Carrera Perez, David; Becerra Fontal, Yolanda; Whalley, Ian; Steinder, Malgorzata; Torres Viñals, Jordi; Ayguade Parra, Eduard
    ACM/IFIP/USENIX International Middleware Conference
    Presentation's date: 2011-12-16
    Presentation of work at congresses

  • Integrating dataflow abstractions into transactional memory

     Gajinov, Vladimir; Milovanovic, Milos; Unsal, Osman Sabri; Cristal Kestelman, Adrián; Ayguade Parra, Eduard; Valero Cortes, Mateo
    Workshop on Systems for Future Multi-Core Architectures
    Presentation's date: 2011-04-16
    Presentation of work at congresses

    Many concurrent programs require some form of conditional synchronization to coordinate the execution of different program tasks. Programming these algorithms using transactional memory (TM) often results in a high conflict rate between transactions. In this paper we propose an atomic dataflow (ADF) model, which aims to reduce transaction conflicts by incorporating dataflow scheduling principles into transactional memory. The ADF model is based on the execution of atomic units of work called ADF tasks. A programmer explicitly defines the data dependencies of an ADF task using the trigger set extension. Trigger set data is implicitly tracked by the TM runtime system, which detects changes and enables the re-execution of a transaction when its dependencies are satisfied. In this paper we fully describe the ADF model, present its syntax, and show the advantages of the model with a practical example.

  • Assessing accelerator-based HPC reverse time migration

     Araya Polo, Mauricio; Cabezas, Javier; Hanzich, Mauricio; Pericas, Miquel; Rubio, Fèlix; Gelado Fernandez, Isaac; Shafiq, Muhammad; Morancho Llena, Enrique; Navarro Mas, Nacho; Ayguade Parra, Eduard; Cela Espin, Jose M.; Valero Cortes, Mateo
    IEEE transactions on parallel and distributed systems
    Date of publication: 2011-01
    Journal article
