

Scientific and technological production

1 to 50 of 205 results
  • Energy Characterization Methodologies for CMP/SMT Processor Systems

     Bertran Monfort, Ramon
    Universitat Politècnica de Catalunya
    Theses


    The performance of computer systems and their affordability are limited by their power consumption. Power density limits the operating frequency of systems, affecting their performance. Similarly, energy consumption affects the battery lifetime of battery-powered devices and increases the costs (electricity bills) of data centers. These energy-related limitations, known as the power wall, motivate research into new solutions to mitigate their negative effects. In this thesis, we develop methods to systematically characterize different aspects of the energy consumption of computer systems, and we present the software frameworks developed to apply them. The scientific contributions of this thesis are the following:
    - A systematic method to produce counter-based power models. Counter-based power models are a common solution for predicting the power consumption of computer systems. We advance the state of the art in the generation of this kind of model by proposing a methodology that produces models that are more responsive, informative and robust, while maintaining accuracy and affordability.
    - A method to measure energy consumption in shared virtualized systems using counter-based power models. This method provides cloud providers with a new form of accounting.
    - A systematic method to generate instruction-level power/performance characterizations. We present a technique to generate instruction-level descriptions of energy consumption. Such characterizations are useful in many different situations. For example, knowing which instructions draw more power and which are more energy-efficient can (a) guide the focus of attention during processor design; (b) influence the instruction-selection algorithms of compilers; or (c) be used to generate tests specifically targeting energy-related issues (power, energy, temperature).
    - A method to generate tests that maximize power consumption. We present a novel method, based on the instruction-level characterization above, to generate tests that maximize power consumption without resorting to searches for sub-optimal solutions based on genetic algorithms.
    The software contributions of this thesis are the following:
    - Microprobe: a test-generation framework. Microprobe includes detailed micro-architecture definitions in order to generate tests that produce specific activity in the architecture. For example, it enables the systematic generation of the training sets needed to apply our power-model generation method, the automatic generation of the tests needed to produce the instruction-level characterizations, and the automatic search for solutions (tests) that maximize power consumption.
    - POTRA: a model-generation framework. During this thesis, we developed POTRA (POwer TRace Analyse) to systematize the generation and evaluation of power models.
    - LibBGQT: a tracing library for Blue Gene/Q. LibBGQT is a library for automatic fine-grained tracing of power and performance on the IBM Blue Gene/Q platform.
    In summary, this thesis advances the state of the art in systematic methods to characterize energy-related metrics. It presents solid evaluations of these methods in different contexts, such as different platforms and power/performance configurations. In addition, it contributes a set of software tools that can be used and extended to continue the open lines of research.
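    The max-power test-generation contribution described above can be illustrated with a toy sketch: once an instruction-level power characterization exists, a high-power test can be built by directly picking the hottest instructions, with no genetic-algorithm search. The instruction names and power figures below are invented for illustration, not measurements from the thesis.

```python
# Hypothetical per-instruction average power figures (watts); illustrative
# placeholders, not data from the thesis.
INSTR_POWER = {
    "fmadd": 3.1,  # fused multiply-add: high functional-unit activity
    "load":  2.4,
    "store": 2.2,
    "add":   1.6,
    "nop":   0.9,
}

def max_power_test(length):
    """Build a test as a sequence of the highest-power instruction.

    A direct greedy pick from an instruction-level characterization,
    avoiding the stochastic search (e.g. genetic algorithms) that is
    otherwise needed when no such characterization is available.
    """
    hottest = max(INSTR_POWER, key=INSTR_POWER.get)
    return [hottest] * length

print(max_power_test(4))  # ['fmadd', 'fmadd', 'fmadd', 'fmadd']
```

A real generator must also respect dependencies, mix instructions to keep multiple units busy, and avoid stalls, which is where the detailed micro-architecture definitions come in.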

  • Enabling preemptive multiprogramming on GPUs  Open access

     Tanasic, Ivan; Gelado Fernandez, Isaac; Cabezas, Javier; Ramirez Bellido, Alejandro; Navarro, Nacho; Valero Cortes, Mateo
    International Symposium on Computer Architecture
    p. 193-204
    DOI: 10.1109/ISCA.2014.6853208
    Presentation's date: 2014-06-14
    Presentation of work at congresses


    GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to meet key multiprogrammed workload requirements, such as responsiveness, fairness or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, the average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
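    The priority-driven distribution of cores among concurrent kernels can be sketched as a proportional-share allocation. This is a toy host-side model, not the paper's hardware mechanism (which additionally handles preemption and draining of running thread blocks); the function and its rounding rule are our own illustration.

```python
def distribute_sms(total_sms, priorities):
    """Split total_sms streaming multiprocessors among concurrent kernels
    proportionally to their priorities.

    Each kernel first gets its proportional share rounded down; any
    leftover SMs go to the highest-priority kernels first, so the full
    GPU is always in use.
    """
    total_prio = sum(priorities)
    shares = [total_sms * p // total_prio for p in priorities]
    leftover = total_sms - sum(shares)
    for i in sorted(range(len(priorities)), key=lambda i: -priorities[i]):
        if leftover == 0:
            break
        shares[i] += 1
        leftover -= 1
    return shares

# Three concurrent kernels with priorities 4, 2 and 1 on a 14-SM GPU:
print(distribute_sms(14, [4, 2, 1]))  # [8, 4, 2]
```

When a new kernel arrives or one finishes, a scheduler would recompute the shares and preempt thread blocks on the SMs that change owner.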

    Postprint (author’s final draft)

  • Experimental assessment of a high performance back-end PCE for Flexgrid optical network re-optimization

     Gifre Renom, Lluis; Velasco Esteban, Luis; Navarro, Nacho; Junyent Giralt, Gabriel
    Optical Fiber Communication Conference and Exposition and National Fiber Optic Engineers Conference
    p. 1-3
    DOI: 10.1364/OFC.2014.W4A.3
    Presentation's date: 2014-03
    Presentation of work at congresses


    A specialized high performance Graphics Processing Unit (GPU)-based back-end Path Computation Element (PCE) to compute re-optimization in Flexgrid networks is presented. Experimental results show 6x speedups compared to a single centralized PCE.

  • Models de Programacio i Entorns d'eXecució PARal.lels

     Becerra Fontal, Yolanda; Carrera Perez, David; Corbalan Gonzalez, Julita; Cortes Rossello, Antonio; Costa Prats, Juan Jose; Farreras Esclusa, Montserrat; Gil Gómez, Maria Luisa; Gonzalez Tallada, Marc; Guitart Fernández, Jordi; Herrero Zaragoza, José Ramón; Labarta Mancho, Jesus Jose; Martorell Bofill, Xavier; Navarro, Nacho; Nin Guerrero, Jordi; Torres Viñals, Jordi; Tous Liesa, Ruben; Utrera Iglesias, Gladys Miriam; Ayguade Parra, Eduard
    Competitive project


  • The TERAFLUX Project: Exploiting the dataflow paradigm in next generation teradevices

     Solinas, Marco; Badia Sala, Rosa Maria; Bodin, François; Cohen, Albert; Evripidou, Paraskevas; Faraboschi, Paolo; Fechner, Bernhard; Gao, Guang R.; Garbade, Arne; Girbal, Sylvain; Goodman, Daniel; Khan, Behran; Koliai, Souad; Li, Feng; Lujan, Mikel; Morin, Laurent; Mendelson, Avi; Navarro, Nacho; Pop, Antoniu; Trancoso, Pedro; Ungerer, Theo; Valero Cortes, Mateo; Weis, Sebastian; Watson, Ian; Zuckermann, Stéphane; Giorgi, Roberto
    Euromicro Symposium on Digital Systems Design
    p. 272-279
    DOI: 10.1109/DSD.2013.39
    Presentation's date: 2013-09
    Presentation of work at congresses


    Thanks to improvements in semiconductor technologies, extreme-scale systems such as teradevices (i.e., composed of 1,000 billion transistors) will enable systems with 1000+ general purpose cores per chip, probably by 2020. Three major challenges have been identified: programmability, manageable architecture design, and reliability. TERAFLUX is a Future and Emerging Technology (FET) large-scale project funded by the European Union, which addresses all these challenges at once by leveraging dataflow principles. This paper describes the project and provides an overview of the research carried out by the TERAFLUX consortium.

  • A systematic methodology to generate decomposable and responsive power models for CMPs

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro, Nacho; Ayguade Parra, Eduard
    IEEE transactions on computers
    Vol. 62, num. 7, p. 1289-1302
    DOI: 10.1109/TC.2012.97
    Date of publication: 2013-07
    Journal article


    Power modeling based on performance monitoring counters (PMCs) has attracted the interest of researchers since it became a quick approach to understand the power behavior of real systems. Consequently, several power-aware policies use such models to guide their decisions. Hence, the presence of power models that are informative, accurate, and capable of detecting power phases is critical to the success of power-saving techniques. Additionally, the design of current processors has varied considerably with the appearance of CMPs (multiple cores sharing resources), so PMC-based power models warrant further investigation on current energy-efficient multicore processors. In this paper, we present a systematic methodology to produce decomposable PMC-based power models on current multicore architectures. Apart from estimating power consumption accurately, the models provide per-component power consumption, supplying extra insight into power behavior. Moreover, we study their responsiveness, i.e. the capacity to detect power phases. Specifically, we produce power models for an Intel Core 2 Duo with one and two cores enabled, for all DVFS configurations. The models are empirically validated using the SPECcpu2006, NAS and LMBENCH benchmarks. Finally, we compare the models against existing approaches, concluding that the proposed methodology produces more accurate, responsive, and informative models.
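    A decomposable PMC-based power model of the kind described can be sketched as a least-squares fit in which each fitted term maps to one hardware component. The component names, activity ratios and power readings below are invented for illustration; they are not data from the paper.

```python
import numpy as np

# Hypothetical training samples: each row holds PMC-derived activity ratios
# for three components (cores, L2 cache, front-side bus); values are made up.
activity = np.array([
    [0.9, 0.3, 0.1],
    [0.5, 0.7, 0.2],
    [0.2, 0.1, 0.8],
    [0.8, 0.8, 0.5],
])
measured_power = np.array([18.0, 16.5, 12.0, 22.0])  # watts, made up

# Fit P = P_static + sum_i w_i * activity_i by least squares.
X = np.hstack([np.ones((len(activity), 1)), activity])
coeffs, *_ = np.linalg.lstsq(X, measured_power, rcond=None)
p_static, weights = coeffs[0], coeffs[1:]

# Decomposability: each term weights[i] * activity[:, i] is a per-component
# power estimate, not just a black-box total prediction.
estimate = p_static + activity @ weights
```

In a real methodology the training set is generated by microbenchmarks that stress one component at a time, so the fitted weights stay physically meaningful.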

  • Architecture of a specialized back-end high performance computing-based PCE for flexgrid networks

     Gifre Renom, Lluis; Velasco Esteban, Luis; Navarro, Nacho
    International Conference on Transparent Optical Networks
    p. Mo.C4.3-1-Mo.C4.3-4
    DOI: 10.1109/ICTON.2013.6602716
    Presentation's date: 2013-06
    Presentation of work at congresses


    The need to execute network re-optimization operations to efficiently manage and deploy new-generation flexgrid-based optical networks has brought to light the need for specialized Path Computation Elements (PCEs) capable of performing such time-consuming computations. The objective of such re-optimizations is to compute network reconfigurations, based on the current state of network resources, that achieve near-optimal resource utilization. Such PCEs require high performance computing equipment to process the huge amount of data in both the Traffic Engineering Database (TED) and the Label Switched Path Database (LSP-DB) in a unified computation step. To deal with this problem, a High Performance Computing (HPC) Graphics Processing Unit (GPU)-based cluster architecture is proposed in this paper. This architecture is capable of attending to PCE requests demanding execution of network re-optimization tasks, performing such computations, and reporting a near-optimal solution in practical time.

  • Comparison based sorting for systems with multiple GPUs

     Tanasic, Ivan; Vilanova, Lluís; Jorda, Marc; Cabezas, Javier; Gelado Fernandez, Isaac; Navarro, Nacho; Hwu, Wen-mei W.
    Workshop on General Purpose Processing Using GPUs
    p. 1-11
    DOI: 10.1145/2458523.2458524
    Presentation's date: 2013-03
    Presentation of work at congresses


    As a basic building block of many applications, sorting algorithms that run efficiently on modern machines are key to the performance of those applications. With the recent shift to using GPUs for general purpose computing, researchers have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are designed to use all available GPUs in the system. In this paper we present a high performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single-GPU sorting algorithm. Then, a series of merge steps produce a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speedup of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, compared to sorting on one GPU.
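    The kind of pivot selection that keeps merge steps balanced can be sketched, in spirit, as a binary search over two sorted, equal-length arrays: find the split point such that one GPU keeps the n smallest elements and the other the n largest. This host-side Python sketch stands in for the paper's CUDA implementation and is our own simplification.

```python
def find_pivot(a, b):
    """For sorted, equal-length a and b, find k such that the len(a)
    smallest elements of a+b are exactly a[:k] and b[:len(a)-k].

    Binary search on k: the split is valid when no element kept on the
    "low" side exceeds an element sent to the "high" side. This keeps
    the data evenly distributed after every merge step.
    """
    n = len(a)
    INF = float("inf")
    lo, hi = 0, n
    while lo <= hi:
        k = (lo + hi) // 2
        a_left  = a[k - 1] if k > 0 else -INF
        a_right = a[k] if k < n else INF
        b_left  = b[n - k - 1] if k < n else -INF
        b_right = b[n - k] if k > 0 else INF
        if a_left > b_right:
            hi = k - 1      # too many elements taken from a
        elif b_left > a_right:
            lo = k + 1      # too many elements taken from b
        else:
            return k
    raise ValueError("inputs must be sorted")

a = [1, 4, 7, 9]   # sorted chunk on "GPU 0"
b = [2, 3, 5, 6]   # sorted chunk on "GPU 1"
k = find_pivot(a, b)
low  = sorted(a[:k] + b[:4 - k])   # n smallest, merged locally on GPU 0
high = sorted(a[k:] + b[4 - k:])   # n largest, merged locally on GPU 1
```

After exchanging the tails, each GPU performs a local merge, so inter-GPU traffic is limited to the swapped halves.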

  • A template system for the efficient compilation of domain abstractions onto reconfigurable computers

     Shafiq, Muhammad; Pericas, Miquel; Navarro, Nacho; Ayguade Parra, Eduard
    Journal of systems architecture
    Vol. 59, num. 2, p. 91-102
    DOI: 10.1016/j.sysarc.2012.10.002
    Date of publication: 2013-02
    Journal article


    Past research has addressed the issue of using FPGAs as accelerators for HPC systems. Such research has identified that writing low level code for the generation of an efficient, portable and scalable architecture is challenging. We propose to increase the level of abstraction in order to help developers of reconfigurable accelerators deal with these three key issues. Our approach implements domain specific abstractions for FPGA based accelerators using techniques from generic programming. In this paper we explain the main concepts behind our system to Design Accelerators by Template Expansions (DATE). The DATE system can be effectively used for expanding individual kernels of an application and also for the generation of interfaces between various kernels to implement a complete system architecture. We present evaluations for six kernels as examples of individual kernel generation using the proposed system. Our evaluations are mainly intended to provide a proof-of-concept. We also show the usage of the DATE system for integration of various kernels to build a complete system based on a Template Architecture for Reconfigurable Accelerator Designs (TARCAD).

  • Design space explorations for streaming accelerators using streaming architectural simulator  Open access

     Shafiq, Muhammad; Pericas, Miquel; Navarro, Nacho; Ayguade Parra, Eduard
    International Bhurban Conference on Applied Sciences and Technology
    p. 169-178
    DOI: 10.1109/IBCAST.2013.6512151
    Presentation's date: 2013-01
    Presentation of work at congresses


    In recent years, streaming accelerators like GPUs have emerged as an effective step towards parallel computing. The wish-list for these devices spans from support for thousands of small cores to a nature very close to general purpose computing. This makes the design space for future accelerators containing thousands of parallel streaming cores very vast, and complicates choosing the right architectural configuration for next-generation devices. However, accurate design space exploration tools developed for massively parallel architectures can ease this task. The main objectives of this work are twofold. (i) We present a complete environment for a trace-driven simulator named SArcs (Streaming Architectural Simulator) for streaming accelerators. (ii) We use our simulation tool-chain for design space explorations of GPU-like streaming architectures. Our design space explorations for different architectural aspects of a GPU-like device are made with reference to a baseline established for NVIDIA's Fermi architecture (GPU Tesla C2050). The explored aspects include the performance effects of variations in the configurations of the streaming multiprocessors, the global memory bandwidth, the channels between SMs down to the memory hierarchy, and the cache hierarchy. The explorations are performed using application kernels from vector reduction, 2D convolution, matrix-matrix multiplication and 3D stencil. Results show that the configurations of the computational resources of the current Fermi GPU device could deliver higher performance with further improvement in the global memory bandwidth for the same device.

    Postprint (author’s final draft)

  • Hardware-software coherence protocol for the coexistence of caches and local memories

     Alvarez, Lluc; Vilanova, Lluis; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro, Nacho; Ayguade Parra, Eduard
    International Conference for High Performance Computing, Networking, Storage and Analysis
    p. Article No. 89
    DOI: 10.1109/TC.2013.194
    Presentation's date: 2012-11-07
    Presentation of work at congresses


  • Counter-based power modeling methods: top-down vs. bottom-up

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro, Nacho; Ayguade Parra, Eduard
    The computer journal (Kalispell, Mont.)
    Vol. 56, num. 2, p. 198-213
    DOI: 10.1093/comjnl/bxs116
    Date of publication: 2012-08-24
    Journal article


  • PPMC: hardware scheduling and memory management support for multi accelerators

     Hussain, Tassadaq; Pericas, Miquel; Navarro, Nacho; Ayguade Parra, Eduard
    International Conference on Field Programmable Logic and Applications
    p. 571-574
    DOI: 10.1109/FPL.2012.6339373
    Presentation's date: 2012-08
    Presentation of work at congresses


  • POTRA: a framework for building power models for next generation multicore architectures

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro, Nacho; Ayguade Parra, Eduard
    ACM SIGMETRICS performance evaluation review
    Vol. 40, num. 1, p. 427-428
    DOI: 10.1145/2318857.2254827
    Date of publication: 2012-06
    Journal article


  • Assessing the impact of network compression on molecular dynamics and finite element methods

     Dickov, Branimir; Pericas, Miquel; Houzeaux, Guillaume; Navarro, Nacho; Ayguade Parra, Eduard
    IEEE International Conference on High Performance Computing and Communications
    p. 588-597
    DOI: 10.1109/HPCC.2012.85
    Presentation's date: 2012-06
    Presentation of work at congresses


  • Architectural Explorations for Streaming Accelerators with Customized Memory Layouts  Open access

     Shafiq, Muhammad
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses



    The basic concept behind the architecture of a general purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip helped CPUs run parts of a program in parallel. However, the huge parallelism available in many high performance applications, and the corresponding data, is hard to exploit with these general purpose multi-cores. Streaming accelerators and the corresponding programming models improve upon this situation by providing throughput-oriented architectures. The basic idea behind the design of these architectures matches the ever-increasing requirements of processing huge data sets. These high-performance, throughput-oriented devices enable fast processing of data through efficient parallel computation and streaming-based communication. Throughput-oriented streaming accelerators, similar to other processors, consist of numerous micro-architectural components including memory structures, compute units, control units, I/O channels and I/O controls, etc. However, the throughput requirements add some special features and impose other restrictions for performance reasons. These devices normally offer a large number of compute resources, but require applications to arrange parallel and maximally independent data sets to feed those resources in the form of streams. The arrangement of data into independent sets of parallel streams is not a simple task. It may require changing the structure of an algorithm as a whole, or even rewriting the algorithm from scratch for the target application. However, all these efforts to re-arrange an application's data access patterns may still not be enough to achieve optimal performance, because of possible micro-architectural constraints of the target platform: the hardware pre-fetching mechanisms, the size and granularity of the local storage, and the flexibility of data marshaling inside the local storage. The constraints a general purpose streaming platform places on pre-fetching, storing and maneuvering data into parallel and independent streams could be removed by employing micro-architectural design approaches, including the use of application-specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations for streaming accelerators using customized memory layouts. In general, the thesis covers three main aspects of such streaming accelerators: i) design of application-specific accelerators with customized memory layouts; ii) template-based design support for customized memory accelerators; and iii) design space explorations for throughput-oriented devices with standard and customized memories. This thesis concludes with a conceptual proposal, the Blacksmith Streaming Architecture (BSArc). Blacksmith computing allows the hardware-level adoption of an application-specific front-end with a GPU-like streaming back-end. This gives the opportunity to exploit the maximum possible data locality and data-level parallelism from an application while providing a powerful throughput-oriented back-end. We consider that the design of these specialized memory layouts for the front-end of the device is provided by application domain experts in the form of templates, which are adjustable to a device and to the problem size at the device's configuration time. The physical availability of such an architecture may still take time; however, a simulation framework helps in architectural explorations, giving insight into the proposal and predicting potential performance benefits for such an architecture.

  • BSArc: blacksmith streaming architecture for HPC accelerators

     Shafiq, Muhammad; Pericas, Miquel; Navarro, Nacho; Ayguade Parra, Eduard
    ACM International Conference on Computing Frontiers
    p. 23-32
    DOI: 10.1145/2212908.2212914
    Presentation's date: 2012-05
    Presentation of work at congresses


  • PPMC: a programmable pattern based memory controller

     Hussain, Tassadaq; Shafiq, Muhammad; Pericas, Miquel; Navarro, Nacho; Ayguade Parra, Eduard
    International IEEE/ACM Symposium on Applied Reconfigurable Computing
    p. 89-101
    DOI: 10.1007/978-3-642-28365-9_8
    Presentation's date: 2012-03
    Presentation of work at congresses


  • Energy accounting for shared virtualized environments under DVFS using PMC-based power models

     Bertran Monfort, Ramon; Becerra Fontal, Yolanda; Carrera Perez, David; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro, Nacho; Torres Viñals, Jordi; Ayguade Parra, Eduard
    Future generation computer systems
    Vol. 28, num. 2, p. 457-468
    DOI: 10.1016/j.future.2011.03.007
    Date of publication: 2012-02
    Journal article


    Virtualized infrastructure providers demand new methods to increase the accuracy of the accounting models used to charge their customers. Future data centers will be composed of many-core systems that will each host a large number of virtual machines (VMs). While resource utilization accounting can be achieved with existing system tools, energy accounting is a complex task when per-VM granularity is the goal. In this paper, we propose a methodology that brings new opportunities to energy accounting by adding an unprecedented degree of accuracy to per-VM measurements. We present a system, which leverages CPU and memory power models based on performance monitoring counters (PMCs), to perform energy accounting in virtualized systems. The contribution of this paper is threefold. First, we show that PMC-based power modeling methods remain valid in virtualized environments. Second, we show that the Dynamic Voltage and Frequency Scaling (DVFS) mechanism, commonly used by infrastructure providers to avoid power and thermal emergencies, does not affect the accuracy of the models. And third, we introduce a novel methodology for accounting energy consumption in virtualized systems. Accounting is done on a per-VM basis, even when multiple VMs are deployed on top of the same physical hardware, bypassing the limitations of per-server aggregated power metering. Overall, the results for an Intel® Core™ 2 Duo show errors in energy estimations below 5%. Such an approach brings flexibility to the chargeback models used by service and infrastructure providers. For instance, we are able to detect cases where VMs that executed for the same amount of time present more than 20% difference in energy consumption, even taking into account only the consumption of the CPU and the memory.
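    The per-VM accounting scheme described can be sketched as: sample per-VM PMC-derived activity periodically, run each sample through a fitted power model, and integrate over the sampling interval. The model coefficients, VM names and activity values below are invented for illustration.

```python
def vm_energy(samples, power_model, interval_s):
    """Attribute energy per VM from periodic PMC samples.

    samples: list of dicts mapping vm_name -> activity vector derived
             from that VM's performance counters over one interval.
    power_model: callable mapping an activity vector to watts
                 (e.g. a fitted PMC-based CPU+memory model).
    Returns joules per VM: each sample's power times the interval length.
    """
    energy = {}
    for sample in samples:
        for vm, act in sample.items():
            energy[vm] = energy.get(vm, 0.0) + power_model(act) * interval_s
    return energy

# Illustrative model (made up): 5 W static share plus weighted CPU and
# memory activity terms.
model = lambda act: 5.0 + 10.0 * act[0] + 4.0 * act[1]
samples = [
    {"vm1": (0.9, 0.2), "vm2": (0.1, 0.7)},
    {"vm1": (0.8, 0.3), "vm2": (0.2, 0.6)},
]
totals = vm_energy(samples, model, interval_s=1.0)  # vm1 ≈ 29.0 J, vm2 ≈ 18.2 J
```

Because the model decomposes by component, the same loop can also report separate CPU and memory energy per VM, which is what enables per-VM chargeback beyond aggregated server metering.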

  • HIPEAC 3 - European Network of Excellence on High-Performance Embedded Architecture and Compilers

     Gil Gómez, Maria Luisa; Navarro, Nacho; Martorell Bofill, Xavier; Valero Cortes, Mateo; Ayguade Parra, Eduard; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose; Llaberia Griño, Jose M.
    Competitive project


  • Hardware and software support for distributed shared memory in chip multiprocessors

     Villavieja Prados, Carlos
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


  • The data transfer engine: towards a software controlled memory hierarchy

     Garcia Flores, Victor; Rico Carro, Alejandro; Villavieja Prados, Carlos; Navarro, Nacho; Ramirez Bellido, Alejandro
    International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems
    p. 215-218
    Presentation's date: 2012
    Presentation of work at congresses


  • POTRA: a framework for building power models for next generation multicore architectures

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro, Nacho; Ayguade Parra, Eduard
    ACM SIGMETRICS/PERFORMANCE joint International Conference on Measurement and Modeling of Computer Systems
    p. 427-428
    DOI: 10.1145/2254756.2254827
    Presentation's date: 2012
    Presentation of work at congresses


  • Implementation of a reverse time migration kernel using the HCE high level synthesis tool

     Hussain, Tassadaq; Pericas, Miquel; Navarro, Nacho; Ayguade Parra, Eduard
    International Conference on Field-Programmable Technology
    p. 1-8
    DOI: 10.1109/FPT.2011.6132717
    Presentation's date: 2011-12-12
    Presentation of work at congresses


  • DiDi: mitigating the performance impact of TLB shootdowns using a shared TLB directory

     Villavieja Prados, Carlos; Karakostas, Vasileios; Vilanova, Lluis; Etsion, Yoav; Ramirez Bellido, Alejandro; Mendelson, Avi; Navarro, Nacho; Cristal Kestelman, Adrián; Unsal, Osman Sabri
    International Conference on Parallel Architectures and Compilation Techniques
    p. 340-349
    DOI: 10.1109/PACT.2011.65
    Presentation's date: 2011-10-04
    Presentation of work at congresses


  • FELI: HW/SW support for on-chip distributed shared memory in multicores

     Villavieja Prados, Carlos; Etsion, Yoav; Ramirez Bellido, Alejandro; Navarro, Nacho
    International European Conference on Parallel and Distributed Computing
    p. 282-294
    DOI: 10.1007/978-3-642-23400-2_27
    Presentation's date: 2011-09-02
    Presentation of work at congresses


  • Design space exploration for aggressive core replication schemes in CMPs

     Álvarez Martí, Lluc; Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro, Nacho; Ayguade Parra, Eduard
    International Symposium on High Performance Distributed Computing
    p. 269-270
    DOI: 10.1145/1996130.1996169
    Presentation's date: 2011-06-08
    Presentation of work at congresses


  • TARCAD: a template architecture for reconfigurable accelerator designs

     Shafiq, Muhammad; Pericas, Miquel; Navarro, Nacho; Ayguade Parra, Eduard
    IEEE Symposium on Application Specific Processors
    p. 8-15
    DOI: 10.1109/SASP.2011.5941071
    Presentation's date: 2011-06-05
    Presentation of work at congresses


  • Assessing accelerator-based HPC reverse time migration

     Araya Polo, Mauricio; Cabezas, Javier; Hanzich, Mauricio; Pericas, Miquel; Rubio, Fèlix; Gelado Fernandez, Isaac; Shafiq, Muhammad; Morancho Llena, Enrique; Navarro, Nacho; Ayguade Parra, Eduard; Cela Espin, Jose M.; Valero Cortes, Mateo
    IEEE transactions on parallel and distributed systems
    Vol. 22, num. 1, p. 147-162
    DOI: 10.1109/TPDS.2010.144
    Date of publication: 2011-01
    Journal article


  • Multicore: the view from Europe

     Valero Cortes, Mateo; Navarro, Nacho
    IEEE micro
    Vol. 30, num. 5, p. 2-4
    DOI: 10.1109/MM.2010.93
    Date of publication: 2010-11-18
    Journal article


  • FEM: A step towards a common memory layout for FPGA based accelerators

     Shafiq, Muhammad; Pericas, Miquel; Navarro, Nacho; Ayguade Parra, Eduard
    International Conference on Field Programmable Logic and Applications
    p. 568-573
    DOI: 10.1109/FPL.2010.111
    Presentation's date: 2010-08
    Presentation of work at congresses


    FPGA devices are mostly utilized for customized application designs with heavily pipelined and aggressively parallel computations. However, little attention is normally given to FPGA memory organizations that efficiently use the data fetched into the FPGA. This work presents a Front End Memory (FEM) layout based on BRAMs and distributed RAM for FPGA-based accelerators. The presented memory layout serves as a template for various data organizations, which is in fact a step towards standardizing a methodology for FPGA-based memory management inside an accelerator. We present example application kernels implemented as specializations of the template memory layout. Furthermore, the presented layout can be used for spatially mapped, shared-memory multi-kernel applications targeting FPGAs. This is evaluated by mapping two applications, an acoustic wave equation code and an N-body method, to three multi-kernel execution models on a Virtex-4 LX200 device. The results show that the shared memory model for the acoustic wave equation code outperforms the local and runtime-reconfigured models by 1.3× and 1.5×, respectively. For the N-body method the shared model is slightly more efficient with a small number of bodies, but for larger systems the runtime-reconfigured model shows a 3× speedup over the other two models.

  • On the programmability of heterogeneous massively-parallel computing systems  Open access

     Gelado Fernandez, Isaac
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


    Heterogeneous parallel computing combines general purpose processors with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous parallel computing impose added coding complexity when compared to traditional sequential shared-memory programming models for homogeneous systems. This extra code complexity is acceptable in supercomputing environments, where programmability is sacrificed in pursuit of high performance. However, heterogeneous parallel systems are massively reaching the desktop market (e.g., 425.4 million GPU cards were sold in 2009), where the trade-off between performance and programmability is the opposite. The code complexity required when using accelerators and the lack of compatibility prevent programmers from exploiting the full computing capabilities of heterogeneous parallel systems in general purpose applications. This dissertation aims to increase the programmability of CPU-accelerator systems without introducing major performance penalties. The key insight is that general purpose application programmers tend to favor programmability at the cost of system performance. This fact is illustrated by the tendency to use high-level programming languages, such as C++, to ease the task of programming at the cost of minor performance penalties. Moreover, many general purpose applications are currently developed in interpreted languages, such as Java, C# or Python, which raise the abstraction level even further while introducing relatively large performance overheads. This dissertation takes the approach of raising the level of abstraction for accelerators to improve programmability, and investigates hardware and software mechanisms to efficiently implement these high-level abstractions without introducing major performance overheads.
Heterogeneous parallel systems typically implement separate memories for CPUs and accelerators, although commodity systems might use a shared memory at the cost of lower performance. However, in these commodity shared-memory systems, coherence between accelerators and CPUs is not guaranteed. This system architecture implies that CPUs can only access system memory, and accelerators can only access their own local memory. This dissertation assumes separate system and accelerator memories and shows that low-level abstractions for these disjoint address spaces are the source of the poor programmability of heterogeneous parallel systems. A first consequence of having separate system and accelerator memories is the current set of data transfer models for heterogeneous parallel systems. In this dissertation two data transfer paradigms are identified: per-call and double-buffered. In both models, data structures used by accelerators are allocated in both system and accelerator memories. The models differ in how data between accelerator and system memories is managed. The per-call model transfers the input data needed by accelerators before each accelerator call, and transfers back the output data produced by accelerators when the call returns. The per-call model is quite simple, but might impose unacceptable performance penalties due to data transfer overheads. The double-buffered model aims to overlap data communication with CPU and accelerator computation. This model requires relatively complex code due to the parallel execution and the need for synchronization between data communication and processing tasks. The extra code required for data transfers in these two models is necessary due to the lack of by-reference parameter passing to accelerators. This dissertation presents a novel accelerator-hosted data transfer model.
In this model, data used by accelerators is hosted in the accelerator memory, so when the CPU accesses this data, it is effectively accessing the accelerator memory. Such a model cleanly supports by-reference parameter passing in accelerator calls, removing the need for explicit data transfers. The second consequence of separate system and accelerator memories is that current programming models export separate virtual system and accelerator address spaces to application programmers. This dissertation identifies the double-pointer problem as a direct consequence of these separate virtual memory spaces. The double-pointer problem is that data structures used by both accelerators and CPUs are referenced by different virtual memory addresses (pointers) in the CPU and accelerator code. The double-pointer problem requires programmers to add extra code to ensure that both pointers contain consistent values (e.g., when reallocating a data structure). Keeping consistency between system and accelerator pointers might penalize accelerator performance and increase the accelerator memory requirements when pointers are embedded within data structures (e.g., a linked list). For instance, the double-pointer problem requires increasing the number of global memory accesses by 2× in GPU code that reconstructs a linked list. This dissertation argues that a unified virtual address space that includes both system and accelerator memories is an efficient solution to the double-pointer problem. Moreover, such a unified virtual address space cleanly complements the accelerator-hosted data model previously discussed. This dissertation introduces the Non-Uniform Accelerator Memory Access (NUAMA) architecture as a hardware implementation of the accelerator-hosted data transfer model and the unified virtual address space. In NUAMA, an Accelerator Memory Collector (AMC) is included within the system memory controller to identify memory requests for accelerator-hosted data.
The AMC buffers and coalesces such memory requests to efficiently transfer data from the CPU to the accelerator memory. NUAMA also implements a hybrid L2 cache memory. The L2 cache in NUAMA follows a write-through/write-non-allocate policy for accelerator-hosted data. This policy ensures that the contents of the accelerator memory are updated eagerly and, therefore, when the accelerator is called, most of the data has already been transferred. The eager update of the accelerator memory contents effectively overlaps data communication and CPU computation. A write-back/write-allocate policy is used for data hosted by the system memory, so the performance of applications that do not use accelerators is not affected. In NUAMA, accelerator-hosted data is identified using a TLB-assisted mechanism. The page table entries are extended with a bit which is set for those memory pages that are hosted by the accelerator memory. NUAMA increases the average bandwidth requirements for the L2 cache memory and the interconnection network between the CPU and accelerators, but the instantaneous bandwidth requirements, which are the limiting factor, are lower than in traditional DMA-based architectures. The NUAMA architecture is compared to traditional DMA systems using cycle-accurate simulations. Experimental results show that NUAMA and traditional DMA-based architectures perform equally well. However, the application source code complexity of NUAMA is much lower than in DMA-based architectures. A software implementation of the accelerator-hosted model and the unified virtual address space is also explored. This dissertation presents the Asymmetric Distributed Shared Memory (ADSM) model. ADSM maintains a shared logical memory space for CPUs to access data in the accelerator physical memory, but not vice versa. The asymmetry allows light-weight implementations that avoid common pitfalls of symmetrical distributed shared memory systems.
ADSM allows programmers to assign data structures to performance-critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs. ADSM reduces programming efforts for heterogeneous parallel computing systems and enhances application portability. The design and implementation of an ADSM run-time, called GMAC, on top of CUDA in a GNU/Linux environment are presented. Experimental results show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This dissertation presents the GMAC system, evaluates different design choices, and further suggests additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model. Finally, the execution model of heterogeneous parallel systems is considered. Accelerator execution is abstracted in different ways in existing programming models. This dissertation explores three approaches implemented by existing programming models. OpenCL and the NVIDIA CUDA driver API use file-descriptor semantics to abstract accelerators: user processes access accelerators through descriptors. This approach increases the complexity of using accelerators because accelerator descriptors are needed in any call involving the accelerator (e.g., memory allocations or passing a parameter to the accelerator). The IBM Cell SDK abstracts accelerators as separate execution threads. This approach requires adding the necessary code to create new execution threads and synchronization primitives in order to use accelerators. Finally, the NVIDIA CUDA run-time API abstracts accelerators as Remote Procedure Calls (RPCs).
This approach is fundamentally incompatible with ADSM, because it assumes separate virtual address spaces for accelerator and CPU code. The Heterogeneous Parallel Execution (HPE) model is presented in this dissertation. This model extends the execution thread abstraction to incorporate different execution modes. Execution modes define the capabilities (e.g., accessible virtual address space, code ISA, etc.) of the code being executed. In this execution model, accelerator calls are implemented as execution mode switches, analogously to system calls. Accelerator calls in HPE are synchronous, unlike in CUDA, OpenCL and the IBM Cell SDK. Synchronous accelerator calls provide full compatibility with the existing sequential execution model provided by most operating systems. Moreover, abstracting accelerator calls as execution mode switches allows applications that use accelerators to run on systems without accelerators. In these systems, the execution mode switch falls back to an emulation layer, which emulates the accelerator execution on the CPU. This dissertation further presents different design and implementation choices for the HPE model in GMAC. The necessary hardware support for an efficient implementation of this model is also presented. Experimental results show that HPE introduces a low execution-time overhead while offering a clean and simple programming interface to applications.
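    The double-pointer problem the thesis describes, and its resolution by an accelerator-hosted unified address space, can be illustrated with a toy allocator simulation. All addresses, sizes and class names below are invented for illustration; this is not GMAC's actual API.

```python
# Toy illustration of the double-pointer problem and how an
# accelerator-hosted unified address space removes it. All addresses,
# sizes and class names are invented for illustration.

class PerCallModel:
    """Separate address spaces: every shared object needs two pointers
    (host and device) that the programmer must keep in sync."""
    def __init__(self):
        self.next_host = 0x1000
        self.next_dev = 0x80000000
    def alloc(self, size):
        obj = {"host_ptr": self.next_host, "dev_ptr": self.next_dev}
        self.next_host += size
        self.next_dev += size
        return obj

class ADSMModel:
    """Accelerator-hosted data: one pointer into a shared logical space
    (backed by accelerator memory) is valid on CPU and accelerator alike,
    so by-reference parameter passing needs no translation."""
    def __init__(self):
        self.next = 0x80000000
    def alloc(self, size):
        obj = {"ptr": self.next}
        self.next += size
        return obj
```

    Under the per-call model, a linked list shared with the accelerator must have every embedded `host_ptr` rewritten to its `dev_ptr` twin before each call; under the ADSM-style allocator the same pointer value works on both sides.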

  • Streaming scatter/gather DMA controller for hardware accelerators

     Hussain, Tassadaq; Pericas, Miquel; Navarro, Nacho; Ayguade Parra, Eduard
    International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems
    Presentation's date: 2010-07
    Presentation of work at congresses


    Feeding data to hardware accelerators is an intricate process that affects the performance and efficiency of systems. In system-on-chip environments, hardware accelerators act as targets/slaves, and the movement of data to/from memory is controlled by the microprocessor (initiator/master). Processors play a middleman role: they read data from the hardware accelerator and write it to memory, and vice versa. This technique provides flexibility but affects the performance of the system. To get the maximum benefit from parallelism, HPC applications need memory controllers that have CPU-like intelligence and the potential to synchronize with the hardware accelerator. In this abstract we present a memory controller that provides scatter/gather DMA functionality. This memory controller takes maximum benefit of the hardware fabric by feeding data in streaming format. Memory access patterns are defined by programmable descriptor blocks available in the controller. We measure gate count and speed by running the memory controller on a Xilinx Virtex 5 ML505 board and compare the results with an SoC designed in the Xilinx Base System Builder.
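    The programmable descriptor blocks mentioned in the abstract can be sketched as a chained list of strided transfers. The field names below (`base`, `stride`, `count`, `next`) are hypothetical stand-ins, not the controller's real register layout.

```python
# Sketch of programmable scatter/gather descriptor blocks: each block
# defines one strided access pattern and chains to the next. Field names
# are hypothetical, not the controller's real register layout.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Descriptor:
    base: int                            # start address of the region
    stride: int                          # bytes between elements
    count: int                           # number of elements to move
    next: Optional["Descriptor"] = None  # chain to the next block

def expand(desc):
    """Yield the address stream a DMA engine walking the chain would
    issue, in order."""
    while desc is not None:
        for i in range(desc.count):
            yield desc.base + i * desc.stride
        desc = desc.next

# Gather every other 4-byte word from one region, then a contiguous run:
chain = Descriptor(base=0x100, stride=8, count=4,
                   next=Descriptor(base=0x400, stride=4, count=3))
addrs = list(expand(chain))
```

    Chaining descriptors is what lets the engine stream an arbitrary gather pattern to the accelerator without the CPU mediating each transfer.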

  • Decomposable and responsive power models for multicore processors using performance counters

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro, Nacho; Ayguade Parra, Eduard
    International Conference for High Performance Computing, Networking, Storage and Analysis
    p. 147-158
    DOI: 10.1145/1810085.1810108
    Presentation's date: 2010-06-04
    Presentation of work at congresses


  • Local memory design space exploration for high-performance computing

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro, Nacho; Ayguade Parra, Eduard
    The Computer journal (paper)
    Vol. 54, num. 5, p. 786-799
    DOI: 10.1093/comjnl/bxq026
    Date of publication: 2010-03-23
    Journal article


    An asymmetric distributed shared memory model for heterogeneous parallel systems  Open access

     Gelado Fernandez, Isaac; E. Stone, John; Cabezas, Javier; Patel, Sanjay; Navarro, Nacho; W. Hwu, Wen-mei
    International Conference on Architectural Support for Programming Languages and Operating Systems
    p. 347-358
    DOI: 10.1145/1736020.1736059
    Presentation's date: 2010-03-13
    Presentation of work at congresses


    Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory. This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa. The asymmetry allows light-weight implementations that avoid common pitfalls of symmetrical distributed shared memory systems. ADSM allows programmers to assign data objects to performance critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs. We argue that ADSM reduces programming efforts for heterogeneous computing systems and enhances application portability. We present a software implementation of ADSM, called GMAC, on top of CUDA in a GNU/Linux environment. We show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This paper presents the GMAC system and evaluates different design choices. We further suggest additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model.

  • An asymmetric distributed shared memory model for heterogeneous parallel systems

     Gelado Fernandez, Isaac; Stone, John E.; Cabezas, Javier; Patel, Sanjay; Navarro, Nacho; Hwu, Wen-mei W.
    Computer architecture news
    Vol. 38, num. 1, p. 347-358
    DOI: 10.1145/1735970.1736059
    Date of publication: 2010
    Journal article


  • Exploiting Dataflow Parallelism in Teradevice Computing (TERAFLUX)

     Badia Sala, Rosa Maria; Ramirez Bellido, Alejandro; Navarro, Nacho; Gil Gómez, Maria Luisa
    Competitive project


  • DISEÑO DE REDES INALÁMBRICAS INTEROPERABLES CON CAPACIDAD PARA SENSORES HETEROGÉNEOS

     Jimenez Castells, Marta; Gil Gómez, Maria Luisa; Navarro, Nacho
    Competitive project


  • Row-interleaved streaming data flow implementation of sparse matrix vector multiplication in FPGA

     Dickov, Branimir; Pericas, Miquel; Navarro, Nacho; Ayguade Parra, Eduard
    HiPEAC Workshop on Reconfigurable Computing
    p. 1-10
    Presentation's date: 2010-01
    Presentation of work at congresses


    Sparse Matrix-Vector Multiplication (SMVM) is the critical computational kernel of many iterative solvers for systems of sparse linear equations. In this paper we propose an FPGA design for SMVM which interleaves rows of the CRS (Compressed Row Storage) format so that just a single floating-point accumulator is needed, which simplifies control, avoids idle clock cycles and sustains high throughput. For the evaluation of the proposed design we use a RASC RC100 blade attached to an SGI Altix multiprocessor. The limited memory bandwidth of this architecture heavily constrains the performance demonstrated. However, the use of FIFO buffers to stream input data makes the design portable to other FPGA-based platforms with higher memory bandwidth.
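    As a reference for the computation being accelerated, a plain CRS SMVM kernel looks as follows; the row interleaving itself is a hardware scheduling detail and is not modeled here.

```python
# Reference CRS (Compressed Row Storage) sparse matrix-vector product,
# the computation the FPGA design accelerates; the row interleaving is a
# hardware scheduling detail and is not modeled here.

def smvm_crs(values, col_idx, row_ptr, x):
    """Compute y = A @ x with A stored in CRS format."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):  # nonzeros of row r
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
values  = [2.0, 1.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = smvm_crs(values, col_idx, row_ptr, [1.0, 1.0, 1.0])
```

    The single accumulator per row is the serialization point the paper's interleaving removes: feeding elements of different rows in alternation keeps a deeply pipelined floating-point adder busy every cycle.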

  • Exploiting memory customization in FPGA for 3D stencil computations

     Shafiq, Muhammad; Pericas, Miquel; De la Cruz Martinez, Raul; Araya Polo, Mauricio; Navarro, Nacho; Ayguade Parra, Eduard
    International Conference on Field-Programmable Technology
    p. 38-45
    DOI: 10.1109/FPT.2009.5377644
    Presentation's date: 2009-12
    Presentation of work at congresses


    3D stencil computations are compute-intensive kernels often appearing in high-performance scientific and engineering applications. The key to efficiency in these memory-bound kernels is full exploitation of data reuse. This paper explores the design aspects of 3D stencil implementations that maximize the reuse of all input data on an FPGA architecture. The work focuses on the architectural design of 3D stencils of the form n × (n + 1) × n, where n = {2, 4, 6, 8, ...}. The performance of the architecture is evaluated using two design approaches, "Multi-Volume" and "Single-Volume". When n = 8, the designs achieve a sustained throughput of 55.5 GFLOPS in the "Single-Volume" approach and 103 GFLOPS in the "Multi-Volume" approach in a 100-200 MHz multi-rate implementation on a Virtex-4 LX200 FPGA. This corresponds to a stencil data delivery of 1500 bytes/cycle and 2800 bytes/cycle, respectively. The implementation is analyzed and compared to two CPU cache approaches and to the statically scheduled local stores of the IBM PowerXCell 8i. The FPGA approaches designed here achieve much higher bandwidth despite the FPGA device being the least recent of the chips considered. These numbers show how a custom memory organization can provide large data throughput when implementing 3D stencil kernels.
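    For reference, the kind of memory-bound kernel being accelerated is a nearest-neighbor 3D stencil sweep; a minimal 7-point version (with an illustrative averaging coefficient, not the paper's stencil) is:

```python
# Minimal 7-point 3D stencil sweep (Jacobi style) over the interior of a
# grid of nested lists. The averaging coefficient is illustrative, not
# the paper's stencil; boundary points are left unchanged.

def stencil3d(a):
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    out = [[[a[i][j][k] for k in range(nz)] for j in range(ny)]
           for i in range(nx)]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                # each update touches the point and its six neighbors:
                out[i][j][k] = (a[i][j][k]
                                + a[i-1][j][k] + a[i+1][j][k]
                                + a[i][j-1][k] + a[i][j+1][k]
                                + a[i][j][k-1] + a[i][j][k+1]) / 7.0
    return out
```

    Every interior update reads seven values, six of which are shared with neighboring updates; it is exactly this reuse that a custom on-chip memory organization can capture instead of refetching from external memory.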

    High-performance reverse time migration on GPU  Open access

     Cabezas, Javier; Araya Polo, Mauricio; Gelado Fernandez, Isaac; Morancho Llena, Enrique; Navarro, Nacho; Cela Espin, Jose M.
    International Conference of the Chilean Computer Science Society
    p. 77-86
    DOI: 10.1109/SCCC.2009.19
    Presentation's date: 2009-11
    Presentation of work at congresses


    Partial Differential Equations (PDEs) are at the heart of most simulations in many scientific fields, from fluid mechanics to astrophysics. One of the most popular mathematical schemes to solve a PDE is Finite Difference (FD). In this work we map a PDE-FD algorithm called Reverse Time Migration to a GPU using CUDA. This seismic imaging (geophysics) algorithm is widely used in the oil industry. GPUs are natural contenders in the aftermath of the clock race, in particular for High-Performance Computing (HPC). Due to GPU characteristics, the parallelism paradigm shifts from the classical threads plus SIMD to Single Program Multiple Data (SPMD). The NVIDIA GTX 280 implementation outperforms homogeneous CPUs by up to 9× (Intel Harpertown E5420) and up to 14× (IBM PPC 970). These preliminary results confirm that GPUs are a real option for HPC, from performance to programmability.
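    The PDE-FD core of RTM is a wavefield update of the form p[t+1] = 2·p[t] - p[t-1] + (c·dt/dx)²·∇²p[t]; a minimal 1D sketch (grid, parameters and fixed boundaries are illustrative, not the paper's 3D kernel) is:

```python
# Minimal 1D sketch of the finite-difference wavefield update at the
# core of RTM (constant velocity, second order in space and time).
# Grid, parameters and fixed boundaries are illustrative.

def fd_step(prev, cur, c, dt, dx):
    """Advance the wavefield one time step:
    p[t+1][i] = 2*p[t][i] - p[t-1][i]
                + r2 * (p[t][i-1] - 2*p[t][i] + p[t][i+1]),
    with r2 = (c*dt/dx)**2."""
    r2 = (c * dt / dx) ** 2
    nxt = cur[:]  # boundary samples are held fixed for simplicity
    for i in range(1, len(cur) - 1):
        lap = cur[i - 1] - 2.0 * cur[i] + cur[i + 1]
        nxt[i] = 2.0 * cur[i] - prev[i] + r2 * lap
    return nxt

# A point pulse splits into left- and right-going waves:
prev = [0.0] * 5
cur = [0.0, 0.0, 1.0, 0.0, 0.0]
nxt = fd_step(prev, cur, c=1.0, dt=1.0, dx=1.0)
```

    Each grid point reads only its neighbors from the previous time step, which is why the update maps naturally onto SPMD execution with one GPU thread per point.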

  • MPEXPAR: MODELS DE PROGRAMACIO I ENTORNS D'EXECUCIO PARAL·LELS

     Nou Castell, Ramon; Gonzalez Tallada, Marc; Gil Gómez, Maria Luisa; Navarro, Nacho; Sirvent Pardell, Raül; Guitart Fernández, Jordi; Carrera Perez, David; Martorell Bofill, Xavier; Herrero Zaragoza, José Ramón; Torres Viñals, Jordi; Badia Sala, Rosa Maria; Becerra Fontal, Yolanda; Cortes Rossello, Antonio; Corbalan Gonzalez, Julita; Costa Prats, Juan Jose; Farreras Esclusa, Montserrat; Alonso López, Javier; Tejedor Saavedra, Enric; Labarta Mancho, Jesus Jose; Ayguade Parra, Eduard
    Competitive project


  • A streaming based high performance FPGA core for 3D reverse time migration

     Shafiq, Muhammad; Pericas, Miquel; Navarro, Nacho; Ayguade Parra, Eduard
    International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems
    Presentation's date: 2009-07
    Presentation of work at congresses


    Reverse Time Migration (RTM) is a wave-equation depth migration method. It offers insights into geology that were previously impossible to interpret or understand using seismic data. Along with its unmatched benefits for seismic imaging, RTM is also highly expensive in terms of computation and data bandwidth. It imposes requirements of fast processing cores along with large memories and caches. In this work we have developed a high-performance 100-200 MHz multi-rate FPGA core for RTM that manages both data and computation efficiently. The core uses a generic streaming interface to input and output large volumes of data. The core's performance was tested using a Virtex-4 LX200 device. The results show that our RTM core can achieve 13.5 GFLOPS, and its use as a loosely coupled accelerator in a heterogeneous Altix 4700 environment achieves a minimum speedup of 3.5× compared to native software-only execution on a 1.6 GHz Itanium-2 core.

  • Mapping sparse matrix-vector multiplication (SMVM) on FPGA - reconfigurable supercomputing

     Dickov, Branimir; Pericas, Miquel; Ayguade Parra, Eduard; Navarro, Nacho
    International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems
    Presentation's date: 2009-07
    Presentation of work at congresses


    In many iterative solvers for systems of sparse linear equations, Sparse Matrix-Vector Multiplication (SMVM) is a critical computational kernel. In this work we use an FPGA as an accelerator attached to a host processor (an Altix supercomputer) to accelerate SMVM. However, due to the large data requirements, performance depends heavily on memory bandwidth, which is very limited on the Altix platform. Input data are streamed through FIFO buffers, which simplifies portability to other platforms with better memory bandwidth characteristics. By modifying the traditional CRS (Compressed Row Storage) representation of sparse matrices and by implementing an innovative floating-point accumulation, we avoid any idle clock cycles and achieve sustained high throughput.

  • CASES 2007 guest editors' introduction

     Lumetta, Steven S.; Navarro, Nacho
    Design automation for embedded systems
    Vol. 13, num. 1-2, p. 89-
    DOI: 10.1007/s10617-008-9037-8
    Date of publication: 2009-06
    Journal article


  • Cetra: A Trace and Analysis Framework for the Evaluation of Cell BE Systems

     Merino, Julio; Alvarez, Lluc; Gil Gómez, Maria Luisa; Navarro, Nacho
    IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2009)
    p. 43-52
    Presentation's date: 2009-04-27
    Presentation of work at congresses


    The Cell Broadband Engine Architecture (CBEA) is a heterogeneous multiprocessor architecture developed by Sony, Toshiba and IBM. The major implementation of this architecture is the Cell Broadband Engine (Cell for short), a processor that contains one generic PowerPC core and eight accelerators. The Cell is targeted at high-performance computing systems and at consumer-level devices with high computational requirements. The workloads of the former are generally run in a queue-based environment, while those of the latter are multiprogrammed. Applications for the Cell are composed of multiple parallel tasks: one runs on the PowerPC core and one or more run on the accelerators. The operating system (OS) is in charge of scheduling these tasks onto the physical processors, and such scheduling decisions become critical in multiprogrammed environments. System developers need a way to analyze how user applications behave under these conditions in order to tune the OS-internal algorithms. This article presents Cetra, a new tool-set that allows system developers to study how Cell workloads interact with the Linux kernel. First, we outline the major features of Cetra and provide a detailed description of its internals. Then, we demonstrate the usefulness of Cetra with a case study that shows the features of the tool-set and allows us to compare its results to those of other performance analysis tools available on the market. Finally, we describe another case study in which we discovered a scheduling starvation bug using Cetra.

  • Linux kernel compaction through cold code swapping

     Chanet, Dominique; Cabezas, Javier; Morancho Llena, Enrique; Navarro, Nacho; De Bosschere, Koen
    Lecture notes in computer science
    Vol. 5470, p. 173-200
    DOI: 10.1007/978-3-642-00904-4_10
    Date of publication: 2009-04-22
    Journal article


  • Predictive runtime code scheduling for heterogeneous architectures (open access)

     Jiménez, Víctor; Vilanova, Lluis; Gelado Fernandez, Isaac; Gil Gómez, Maria Luisa; Fursin, Gregori; Navarro, Nacho
    International Conference on High Performance and Embedded Architectures and Compilers
    p. 19-33
    Presentation's date: 2009-01-25
    Presentation of work at congresses


    Heterogeneous architectures are currently widespread. With the advent of easy-to-program general-purpose GPUs, virtually every recent desktop computer is a heterogeneous system. Combining the CPU and the GPU brings great amounts of processing power. However, such architectures are often used in a restricted way for domain-specific applications like scientific applications and games, and they tend to be used by a single application at a time. We envision future heterogeneous computing systems where all heterogeneous resources are continuously utilized by different applications with versioned critical parts, in order to better adapt their behavior and improve execution time, power consumption, response time and other constraints at runtime. Under such a model, adaptive scheduling becomes a critical component. In this paper, we propose a novel predictive user-level scheduler for heterogeneous systems based on past performance history. We developed several scheduling policies and present a study of their impact on system performance. We demonstrate that such a scheduler allows multiple applications to fully utilize all available processing resources in CPU/GPU-like systems and consistently achieves speedups ranging from 30% to 40% compared to using the GPU alone in single-application mode.
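    The history-based prediction the abstract describes can be illustrated with a toy sketch: pick, per task kind, the device with the lowest mean observed runtime, trying each device once before trusting the history. Class and policy details here are assumptions for illustration, not the authors' implementation:

    ```python
    from collections import defaultdict

    class HistoryScheduler:
        """Toy predictive user-level scheduler: routes each task to the
        device (CPU or GPU) predicted fastest from past runtimes."""

        def __init__(self, devices=("cpu", "gpu")):
            self.devices = devices
            self.history = defaultdict(list)  # (task, device) -> runtimes

        def record(self, task, device, runtime):
            """Feed back a measured runtime after a task version finishes."""
            self.history[(task, device)].append(runtime)

        def choose(self, task):
            # Run each versioned critical part once on every device first,
            # so the history covers all of them.
            for d in self.devices:
                if not self.history[(task, d)]:
                    return d
            # Then predict from the mean of the measured runtimes.
            def mean_runtime(d):
                runs = self.history[(task, d)]
                return sum(runs) / len(runs)
            return min(self.devices, key=mean_runtime)
    ```

    A real scheduler would additionally weigh device load and data-transfer cost, which is one reason the paper studies several policies rather than a single heuristic.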

  • Hardware support for explicit communications in scalable CMP's

     Villavieja Prados, Carlos; Katevenis, Manolis; Navarro, Nacho; Pnevmatikatos, Dionisios; Ramirez Bellido, Alejandro; Kavadias, Stamatis; Papaefstathiou, Vassilis; Nikolopoulos, Dimitrios S.
    Date: 2009-01
    Report
