Scientific and technological production

1 to 50 of 195 results
  • Experimental assessment of a high performance back-end PCE for Flexgrid optical network re-optimization

     Gifre Renom, Lluis; Velasco Esteban, Luis; Navarro Mas, Nacho; Junyent Giralt, Gabriel
    Optical Fiber Communication Conference and Exposition and National Fiber Optic Engineers Conference
    Presentation's date: 2014-03
    Presentation of work at congresses


    A specialized high-performance Graphics Processing Unit (GPU)-based back-end Path Computation Element (PCE) to compute re-optimization in Flexgrid networks is presented. Experimental results show 6x speedups compared to a single centralized PCE.


  • A systematic methodology to generate decomposable and responsive power models for CMPs

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    IEEE transactions on computers
    Date of publication: 2013-07
    Journal article


    Power modeling based on performance monitoring counters (PMCs) has attracted the interest of researchers since it provides a quick approach to understand the power behavior of real systems. Consequently, several power-aware policies use models to guide their decisions. Hence, the presence of power models that are informative, accurate, and capable of detecting power phases is critical to improve the success of power-saving techniques. Additionally, the design of current processors has varied considerably with the appearance of CMPs (multiple cores sharing resources). Thus, PMC-based power models warrant further investigation on current energy-efficient multicore processors. In this paper, we present a systematic methodology to produce decomposable PMC-based power models on current multicore architectures. Apart from being able to estimate the power consumption accurately, the models provide per-component power consumption, supplying extra insights about power behavior. Moreover, we study their responsiveness (the capacity to detect power phases). Specifically, we produce power models for an Intel Core 2 Duo with one and two cores enabled for all the DVFS configurations. The models are empirically validated using the SPECcpu2006, NAS and LMBENCH benchmarks. Finally, we compare the models against existing approaches, concluding that the proposed methodology produces more accurate, responsive, and informative models.
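    The decomposable modeling idea in this abstract can be sketched briefly: fit a linear model that maps per-component activity ratios (derived from PMCs) to measured power, so the fitted weights yield a per-component power breakdown. The component names, coefficients and training data below are synthetic illustrations, not values from the paper:

    ```python
    import numpy as np

    # Hypothetical per-component activity ratios derived from PMCs
    # (e.g. instructions retired, L2 accesses, bus transactions per cycle).
    # Rows = training samples, columns = components: [FE, INT, L2, BUS].
    rng = np.random.default_rng(0)
    activity = rng.uniform(0.0, 1.0, size=(200, 4))

    # Synthetic "measured" power: static power plus per-component weights.
    true_weights = np.array([3.0, 5.5, 2.0, 4.0])   # watts at full activity
    static_power = 10.0                              # idle watts
    measured = static_power + activity @ true_weights

    # Fit a decomposable linear model: P = c0 + sum_i w_i * A_i
    X = np.hstack([np.ones((activity.shape[0], 1)), activity])
    coeffs, *_ = np.linalg.lstsq(X, measured, rcond=None)
    c0, weights = coeffs[0], coeffs[1:]

    # Per-component power breakdown for one new sample: the decomposition
    # is simply each fitted weight times the observed activity ratio.
    sample = np.array([0.8, 0.4, 0.2, 0.6])
    per_component = weights * sample
    total = c0 + per_component.sum()
    print(f"static={c0:.2f} W, per-component={per_component.round(2)}, total={total:.2f} W")
    ```

    Because the model is linear, the same fit that estimates total power also attributes it per component, which is what makes the model "decomposable" in the paper's sense.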

  • A template system for the efficient compilation of domain abstractions onto reconfigurable computers

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    Journal of systems architecture
    Date of publication: 2013-02
    Journal article


    Past research has addressed the issue of using FPGAs as accelerators for HPC systems. Such research has identified that writing low level code for the generation of an efficient, portable and scalable architecture is challenging. We propose to increase the level of abstraction in order to help developers of reconfigurable accelerators deal with these three key issues. Our approach implements domain specific abstractions for FPGA based accelerators using techniques from generic programming. In this paper we explain the main concepts behind our system to Design Accelerators by Template Expansions (DATE). The DATE system can be effectively used for expanding individual kernels of an application and also for the generation of interfaces between various kernels to implement a complete system architecture. We present evaluations for six kernels as examples of individual kernel generation using the proposed system. Our evaluations are mainly intended to provide a proof-of-concept. We also show the usage of the DATE system for integration of various kernels to build a complete system based on a Template Architecture for Reconfigurable Accelerator Designs (TARCAD).

  • Architecture of a specialized back-end high performance computing-based PCE for flexgrid networks

     Gifre Renom, Lluis; Velasco Esteban, Luis; Navarro Mas, Nacho
    International Conference on Transparent Optical Networks
    Presentation's date: 2013-06
    Presentation of work at congresses


    The requirement of executing network re-optimization operations to efficiently manage and deploy new-generation flexgrid-based optical networks has brought to light the need for specialized PCEs capable of performing such time-consuming computations. The objective of such re-optimizations is to compute network reconfigurations, based on the current state of network resources, that achieve near-optimal resource utilization. Such PCEs require high performance computing equipment to process the huge amount of data in both the Traffic Engineering Database (TED) and the Label Switched Path Database (LSP-DB) in a unified computation step. To deal with this problem, a High Performance Computing (HPC) Graphics Processing Unit (GPU)-based cluster architecture is proposed in this paper. This architecture is capable of attending to Path Computation Element (PCE) requests that demand execution of network re-optimization tasks, performing such computations and reporting a near-optimal solution in practical times.

    Design space explorations for streaming accelerators using streaming architectural simulator  Open access

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Bhurban Conference on Applied Sciences and Technology
    Presentation's date: 2013-01
    Presentation of work at congresses


    In recent years, streaming accelerators like GPUs have emerged as an effective step towards parallel computing. The wish-list for these devices spans from support for thousands of small cores to a nature very close to general-purpose computing. This makes the design space very vast for future accelerators containing thousands of parallel streaming cores, and complicates the choice of the right architectural configuration for next-generation devices. However, accurate design space exploration tools developed for massively parallel architectures can ease this task. The main objectives of this work are twofold. (i) We present a complete environment of a trace-driven simulator named SArcs (Streaming Architectural Simulator) for streaming accelerators. (ii) We use our simulation tool-chain for design space explorations of GPU-like streaming architectures. Our design space explorations for different architectural aspects of a GPU-like device are with reference to a baseline established for NVIDIA's Fermi architecture (GPU Tesla C2050). The explored aspects include the performance effects of variations in the configurations of the Streaming Multiprocessors, the Global Memory Bandwidth, the channels between SMs down to the Memory Hierarchy, and the Cache Hierarchy. The explorations are performed using application kernels from Vector Reduction, 2D-Convolution, Matrix-Matrix Multiplication and 3D-Stencil. Results show that the configurations of the computational resources for the current Fermi GPU device can deliver higher performance with further improvement in the global memory bandwidth for the same device.

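    The exploration flow described above can be illustrated with a toy trace-driven timing model (an assumed roofline-style sketch, not SArcs itself): each trace record carries the compute operations and bytes moved for one kernel phase, and varying a peak parameter mimics a design-space sweep such as the paper's global-memory-bandwidth study. The peak figures are assumptions loosely based on a Fermi C2050:

    ```python
    # Minimal trace-driven timing sketch (hypothetical model, not SArcs):
    # each phase is bounded by whichever resource saturates first.
    PEAK_FLOPS = 1.03e12   # assumed peak compute rate, ops/s
    PEAK_BW = 144e9        # assumed global memory bandwidth, bytes/s

    def phase_time(flops, bytes_moved, peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
        """Time for one phase: limited by compute or memory bandwidth."""
        return max(flops / peak_flops, bytes_moved / peak_bw)

    def simulate(trace):
        """Sum per-phase times over a whole execution trace."""
        return sum(phase_time(f, b) for f, b in trace)

    # Toy trace: (flops, bytes) per phase of a bandwidth-bound kernel.
    trace = [(1e9, 4e9), (2e9, 1e9), (5e8, 8e9)]
    baseline = simulate(trace)

    # Design-space exploration: double the memory bandwidth and compare,
    # mimicking the paper's bandwidth sensitivity study.
    widened = sum(phase_time(f, b, peak_bw=2 * PEAK_BW) for f, b in trace)
    print(f"baseline {baseline*1e3:.2f} ms -> 2x BW {widened*1e3:.2f} ms")
    ```

    A real trace-driven simulator replaces the two peak numbers with a detailed machine description, but the sweep structure (fix the trace, vary the architecture parameters) is the same.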

    Postprint (author’s final draft)

  • POTRA: a framework for building power models for next generation multicore architectures

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM SIGMETRICS performance evaluation review
    Date of publication: 2012-06
    Journal article


  • Counter-based power modeling methods: top-down vs. bottom-up

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    The computer journal (Kalispell, Mont.)
    Date of publication: 2012-08-24
    Journal article


  • Energy accounting for shared virtualized environments under DVFS using PMC-based power models

     Bertran Monfort, Ramon; Becerra Fontal, Yolanda; Carrera Perez, David; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Torres Viñals, Jordi; Ayguade Parra, Eduard
    Future generation computer systems
    Date of publication: 2012-02
    Journal article


    Virtualized infrastructure providers demand new methods to increase the accuracy of the accounting models used to charge their customers. Future data centers will be composed of many-core systems that will each host a large number of virtual machines (VMs). While resource utilization accounting can be achieved with existing system tools, energy accounting is a complex task when per-VM granularity is the goal. In this paper, we propose a methodology that brings new opportunities to energy accounting by adding an unprecedented degree of accuracy to the per-VM measurements. We present a system, which leverages CPU and memory power models based on performance monitoring counters (PMCs), to perform energy accounting in virtualized systems. The contribution of this paper is threefold. First, we show that PMC-based power modeling methods are still valid in virtualized environments. Second, we show that the Dynamic Voltage and Frequency Scaling (DVFS) mechanism, commonly used by infrastructure providers to avoid power and thermal emergencies, does not affect the accuracy of the models. And third, we introduce a novel methodology for accounting of energy consumption in virtualized systems. Accounting is done on a per-VM basis, even in the case where multiple VMs are deployed on top of the same physical hardware, bypassing the limitations of per-server aggregated power metering. Overall, the results for an Intel Core 2 Duo show errors in energy estimations below 5%. Such an approach brings flexibility to the chargeback models used by service and infrastructure providers. For instance, we are able to detect cases where VMs that executed for the same amount of time present more than 20% difference in energy consumption, even when only taking into account the consumption of the CPU and the memory.
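    The per-VM accounting scheme the abstract describes can be sketched as follows, under stated assumptions: a per-DVFS-state linear power model (hypothetical coefficients) is applied to the PMC-derived activity of whichever VM ran in each scheduling quantum, and the resulting power is integrated into a per-VM energy total. All names and numbers below are illustrative, not the paper's:

    ```python
    # Hypothetical power models: watts = idle + weight * activity,
    # with separate coefficients per DVFS frequency state.
    POWER_MODEL = {
        "2.4GHz": (14.0, 12.0),
        "1.6GHz": (10.0, 6.5),
    }

    def quantum_energy(freq, activity, quantum_s=0.010):
        """Energy (joules) for one 10 ms scheduling quantum."""
        idle, weight = POWER_MODEL[freq]
        return (idle + weight * activity) * quantum_s

    def account(samples):
        """samples: list of (vm_id, freq, activity) per quantum.
        Integrates modeled power into a per-VM energy total."""
        energy = {}
        for vm, freq, activity in samples:
            energy[vm] = energy.get(vm, 0.0) + quantum_energy(freq, activity)
        return energy

    # Two VMs time-sharing one core across four quanta, under two
    # DVFS states: same runtime, different activity, different energy.
    samples = [
        ("vmA", "2.4GHz", 0.9), ("vmB", "2.4GHz", 0.3),
        ("vmA", "1.6GHz", 0.8), ("vmB", "1.6GHz", 0.2),
    ]
    print(account(samples))
    ```

    This is the essence of the abstract's 20% observation: two VMs can run for the same wall-clock time yet accumulate very different modeled energy, which per-server power metering cannot separate.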

  • HIPEAC 3 - European Network of Excellence on High-Performance Embedded Architecture and Compilers

     Navarro Mas, Nacho; Gil Gómez, Maria Luisa; Martorell Bofill, Xavier; Valero Cortes, Mateo; Ayguade Parra, Eduard; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose; Llaberia Griño, Jose M.
    Participation in a competitive project


  • The data transfer engine: towards a software controlled memory hierarchy

     Garcia Flores, Victor; Rico Carro, Alejandro; Villavieja Prados, Carlos; Navarro Mas, Nacho; Ramirez Bellido, Alejandro
    International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems
    Presentation's date: 2012
    Presentation of work at congresses


  • PPMC: hardware scheduling and memory management support for multi accelerators

     Hussain, Tassadaq; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference on Field Programmable Logic and Applications
    Presentation's date: 2012-08
    Presentation of work at congresses


  • Assessing the impact of network compression on molecular dynamics and finite element methods

     Dickov, Branimir; Pericas, Miquel; Houzeaux, Guillaume; Navarro Mas, Nacho; Ayguade Parra, Eduard
    IEEE International Conference on High Performance Computing and Communications
    Presentation's date: 2012-06
    Presentation of work at congresses


  • PPMC: a programmable pattern based memory controller

     Hussain, Tassadaq; Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International IEEE/ACM Symposium on Applied Reconfigurable Computing
    Presentation's date: 2012-03
    Presentation of work at congresses


  • Hardware-software coherence protocol for the coexistence of caches and local memories

     Alvarez, Lluc; Vilanova, Lluis; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference for High Performance Computing, Networking, Storage and Analysis
    Presentation's date: 2012-11-07
    Presentation of work at congresses


  • BSArc: blacksmith streaming architecture for HPC accelerators

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM International Conference on Computing Frontiers
    Presentation's date: 2012-05
    Presentation of work at congresses


  • Architectural Explorations for Streaming Accelerators with Customized Memory Layouts  Open access

     Shafiq, Muhammad
    Defense's date: 2012-05-21
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses



    The basic concept behind the architecture of a general purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip helped CPUs in running parts of a program in parallel. However, the utilization of the huge parallelism available from many high performance applications and the corresponding data is hard to achieve from these general purpose multi-cores. Streaming accelerators and the corresponding programming models improve upon this situation by providing throughput oriented architectures. The basic idea behind the design of these architectures matches the ever-increasing requirements of processing huge data sets. These high-performance throughput oriented devices help in high performance processing of data by using efficient parallel computations and streaming based communications. The throughput oriented streaming accelerators, similar to other processors, consist of numerous types of micro-architectural components including the memory structures, compute units, control units, I/O channels and I/O controls, etc. However, the throughput requirements add some special features and impose other restrictions for performance purposes. These devices normally offer a large number of compute resources but require the applications to arrange parallel and maximally independent data sets to feed the compute resources in the form of streams. The arrangement of data into independent sets of parallel streams is not an easy and simple task. It may require changing the structure of an algorithm as a whole, or even writing a new algorithm from scratch for the target application. However, all these efforts for the re-arrangement of application data access patterns may still not be very helpful to achieve the optimal performance. 
    This is because of the possible micro-architectural constraints of the target platform for the hardware pre-fetching mechanisms, the size and the granularity of the local storage, and the flexibility in data marshaling inside the local storage. The constraints of a general purpose streaming platform on data pre-fetching, storing and maneuvering to arrange and maintain it in the form of parallel and independent streams could be removed by employing micro-architectural level design approaches. This includes the usage of application specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations for streaming accelerators using customized memory layouts. In general, the thesis covers three main aspects of such streaming accelerators: i) design of application specific accelerators with customized memory layout, ii) template based design support for customized memory accelerators, and iii) design space explorations for throughput oriented devices with standard and customized memories. This thesis concludes with a conceptual proposal on a Blacksmith Streaming Architecture (BSArc). Blacksmith Computing allows the hardware-level adoption of an application specific front-end with a GPU-like streaming back-end. This gives an opportunity to exploit the maximum possible data locality and data level parallelism from an application while providing a powerful throughput-oriented back-end. We consider that the design of these specialized memory layouts for the front-end of the device is provided by application domain experts in the form of templates. These templates are adjustable according to a device and the problem size at the device's configuration time. The physical availability of such an architecture may still take time; however, a simulation framework helps in architectural explorations to give insight into the proposal and predict potential performance benefits for such an architecture.

  • Hardware and software support for distributed shared memory in chip multiprocessors

     Villavieja Prados, Carlos
    Defense's date: 2012-01-09
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


  • POTRA: a framework for building power models for next generation multicore architectures

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM SIGMETRICS/PERFORMANCE joint International Conference on Measurement and Modeling of Computer Systems
    Presentation's date: 2012
    Presentation of work at congresses


  • Assessing accelerator-based HPC reverse time migration

     Araya Polo, Mauricio; Cabezas, Javier; Hanzich, Mauricio; Pericas, Miquel; Rubio, Fèlix; Gelado Fernandez, Isaac; Shafiq, Muhammad; Morancho Llena, Enrique; Navarro Mas, Nacho; Ayguade Parra, Eduard; Cela Espin, Jose M.; Valero Cortes, Mateo
    IEEE transactions on parallel and distributed systems
    Date of publication: 2011-01
    Journal article


  • Implementation of a reverse time migration kernel using the HCE high level synthesis tool

     Hussain, Tassadaq; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference on Field-Programmable Technology
    Presentation's date: 2011-12
    Presentation of work at congresses



  • TARCAD: a template architecture for reconfigurable accelerator designs

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    IEEE Symposium on Application Specific Processors
    Presentation's date: 2011-06-05
    Presentation of work at congresses


  • Design space exploration for aggressive core replication schemes in CMPs

     Álvarez Martí, Lluc; Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Symposium on High Performance Distributed Computing
    Presentation's date: 2011-06-08
    Presentation of work at congresses


  • FELI: HW/SW support for on-chip distributed shared memory in multicores

     Villavieja Prados, Carlos; Etsion, Yoav; Ramirez Bellido, Alejandro; Navarro Mas, Nacho
    International European Conference on Parallel and Distributed Computing
    Presentation's date: 2011-09-02
    Presentation of work at congresses


  • DiDi: mitigating the performance impact of TLB shootdowns using a shared TLB directory

     Villavieja Prados, Carlos; Vilanova, Lluis; Karakostas, Vasileios; Etsion, Yoav; Ramirez Bellido, Alejandro; Mendelson, Avi; Navarro Mas, Nacho; Cristal Kestelman, Adrián; Unsal, Osman Sabri
    International Conference on Parallel Architectures and Compilation Techniques
    Presentation's date: 2011-10-04
    Presentation of work at congresses


  • Local memory design space exploration for high-performance computing

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    The Computer journal (paper)
    Date of publication: 2010-03-23
    Journal article


  • Multicore: the view from Europe

     Valero Cortes, Mateo; Navarro Mas, Nacho
    IEEE micro
    Date of publication: 2010-11-18
    Journal article


  • An asymmetric distributed shared memory model for heterogeneous parallel systems

     Gelado Fernandez, Isaac; Stone, John E.; Cabezas, Javier; Patel, Sanjay; Navarro Mas, Nacho; Hwu, Wen-mei
    Computer architecture news
    Date of publication: 2010
    Journal article


  • Exploiting Dataflow Parallelism in Teradevice Computing (TERAFLUX)

     Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Gil Gómez, Maria Luisa; Navarro Mas, Nacho
    Participation in a competitive project


  • DISEÑO DE REDES INALÁMBRICAS INTEROPERABLES CON CAPACIDAD PARA SENSORES HETEROGÉNEOS

     Jimenez Castells, Marta; Gil Gómez, Maria Luisa; Navarro Mas, Nacho
    Participation in a competitive project


  • On the programmability of heterogeneous massively-parallel computing systems  Open access

     Gelado Fernandez, Isaac
    Defense's date: 2010-07-02
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


    Heterogeneous parallel computing combines general purpose processors with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous parallel computing impose added coding complexity when compared to traditional sequential shared-memory programming models for homogeneous systems. This extra code complexity is assumable in supercomputing environments, where programmability is sacrificed in pursuit of high performance. However, heterogeneous parallel systems are massively reaching the desktop market (e.g., 425.4 million of GPU cards were sold in 2009), where the trade-off between performance and programmability is the opposite. The code complexity required when using accelerators and the lack of compatibility prevents programmers from exploiting the full computing capabilities of heterogeneous parallel systems in general purpose applications. This dissertation aims to increase the programmability of CPU - accelerator systems, without introducing major performance penalties. The key insight is that general purpose application programmers tend to favor programmability at the cost of system performance. This fact is illustrated by the tendency to use high-level programming languages, such as C++, to ease the task of programming at the cost of minor performance penalties. Moreover, currently many general purpose applications are being developed using interpreted languages, such as Java, C# or python, which raise the abstraction level even further introducing relatively large performance overheads. This dissertation also takes the approach of raising the level of abstraction for accelerators to improve programmability and investigates hardware and software mechanisms to efficiently implement these high-level abstractions without introducing major performance overheads. 
Heterogeneous parallel systems typically implement separate memories for CPUs and accelerators, although commodity systems might use a shared memory at the cost of lower performance. In these commodity shared-memory systems, however, coherence between accelerators and CPUs is not guaranteed. This system architecture implies that CPUs can only access system memory, and accelerators can only access their own local memory. This dissertation assumes separate system and accelerator memories and shows that low-level abstractions for these disjoint address spaces are the source of the poor programmability of heterogeneous parallel systems. A first consequence of having separate system and accelerator memories is the current set of data transfer models for heterogeneous parallel systems. This dissertation identifies two data transfer paradigms: per-call and double-buffered. In both models, data structures used by accelerators are allocated in both system and accelerator memories; the models differ in how data is managed between the two. The per-call model transfers the input data needed by accelerators before each accelerator call, and transfers back the output data produced by accelerators when the call returns. The per-call model is quite simple, but might impose unacceptable performance penalties due to data transfer overheads. The double-buffered model aims to overlap data communication with CPU and accelerator computation. This model requires relatively complex code due to parallel execution and the need for synchronization between data communication and processing tasks. The extra code required for data transfers in these two models is necessary because of the lack of by-reference parameter passing to accelerators. This dissertation presents a novel accelerator-hosted data transfer model.
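The per-call model described above can be sketched in plain C, with an ordinary array standing in for accelerator memory and memcpy standing in for the DMA transfers a real system would issue; all names here are illustrative, not from the dissertation:

```c
#include <string.h>

/* Per-call transfer model: the data structure exists twice (system and
 * accelerator memory), and explicit copies bracket every accelerator call. */

#define ACCEL_MEM_WORDS 1024

static float accel_mem[ACCEL_MEM_WORDS];     /* stand-in for accelerator memory */

/* Stand-in for the accelerator kernel: doubles every element. */
static void accel_kernel(float *dev, int n) {
    for (int i = 0; i < n; i++)
        dev[i] *= 2.0f;
}

void per_call_launch(float *host, int n) {
    memcpy(accel_mem, host, n * sizeof(float));  /* input transfer before the call */
    accel_kernel(accel_mem, n);                  /* accelerator computation */
    memcpy(host, accel_mem, n * sizeof(float));  /* output transfer on return */
}
```

The simplicity is apparent: two copies per call, serialized with the computation, which is exactly the data transfer overhead the abstract refers to.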
In this model, data used by accelerators is hosted in the accelerator memory, so when the CPU accesses this data, it is effectively accessing the accelerator memory. Such a model cleanly supports by-reference parameter passing in accelerator calls, removing the need for explicit data transfers. The second consequence of separate system and accelerator memories is that current programming models export separate virtual system and accelerator address spaces to application programmers. This dissertation identifies the double-pointer problem as a direct consequence of these separate virtual memory spaces: data structures used by both accelerators and CPUs are referenced by different virtual memory addresses (pointers) in the CPU and accelerator code. The double-pointer problem requires programmers to add extra code to ensure that both pointers contain consistent values (e.g., when reallocating a data structure). Keeping system and accelerator pointers consistent might penalize accelerator performance and increase the accelerator memory requirements when pointers are embedded within data structures (e.g., a linked list). For instance, the double-pointer problem doubles the number of global memory accesses in a GPU code that reconstructs a linked list. This dissertation argues that a unified virtual address space covering both system and accelerator memories is an efficient solution to the double-pointer problem. Moreover, such a unified virtual address space cleanly complements the accelerator-hosted data model previously discussed. This dissertation introduces the Non-Uniform Accelerator Memory Access (NUAMA) architecture as a hardware implementation of the accelerator-hosted data transfer model and the unified virtual address space. In NUAMA, an Accelerator Memory Collector (AMC) is included within the system memory controller to identify memory requests for accelerator-hosted data.
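The double-pointer problem for the linked-list case can be sketched with two hypothetical node layouts (field names are illustrative): with disjoint address spaces every link must be stored in two versions and kept consistent, while a unified virtual address space needs only one:

```c
/* With separate virtual address spaces, each link exists twice: the host
 * pointer is meaningless on the accelerator and vice versa, so rebuilding
 * the list for the accelerator must touch both copies of every link. */
struct node_split {
    int data;
    struct node_split *host_next;   /* valid only in CPU code          */
    struct node_split *dev_next;    /* valid only in accelerator code  */
};

/* With a unified virtual address space (the dissertation's proposal),
 * the same pointer value is valid on both sides. */
struct node_unified {
    int data;
    struct node_unified *next;      /* one virtual address, both sides */
};
```

Besides the consistency code, the split layout also inflates the per-node footprint, which matters when pointer-rich structures live in the comparatively small accelerator memory.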
The AMC buffers and coalesces such memory requests to efficiently transfer data from the CPU to the accelerator memory. NUAMA also implements a hybrid L2 cache memory. The L2 cache in NUAMA follows a write-through/write-non-allocate policy for accelerator-hosted data. This policy ensures that the contents of the accelerator memory are updated eagerly, so that when the accelerator is called, most of the data has already been transferred. The eager update of the accelerator memory contents effectively overlaps data communication and CPU computation. A write-back/write-allocate policy is used for data hosted by the system memory, so the performance of applications that do not use accelerators is not affected. In NUAMA, accelerator-hosted data is identified using a TLB-assisted mechanism: page table entries are extended with a bit that is set for those memory pages hosted by the accelerator memory. NUAMA increases the average bandwidth requirements for the L2 cache memory and the interconnection network between the CPU and accelerators, but the instantaneous bandwidth requirements, which are the limiting factor, are lower than in traditional DMA-based architectures. The NUAMA architecture is compared to traditional DMA systems using cycle-accurate simulations. Experimental results show that NUAMA and traditional DMA-based architectures perform equally well, while the application source code complexity of NUAMA is much lower than in DMA-based architectures. A software implementation of the accelerator-hosted model and the unified virtual address space is also explored. This dissertation presents the Asymmetric Distributed Shared Memory (ADSM) model. ADSM maintains a shared logical memory space for CPUs to access data in the accelerator physical memory, but not vice versa. The asymmetry allows light-weight implementations that avoid common pitfalls of symmetrical distributed shared memory systems.
ADSM allows programmers to assign data structures to performance-critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs. ADSM reduces programming effort for heterogeneous parallel computing systems and enhances application portability. The design and implementation of an ADSM run-time, called GMAC, on top of CUDA in a GNU/Linux environment is presented. Experimental results show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This dissertation presents the GMAC system, evaluates different design choices, and suggests additional architectural support that would likely allow GMAC to achieve higher application performance than the current CUDA model. Finally, the execution model of heterogeneous parallel systems is considered. Accelerator execution is abstracted in different ways by existing programming models; this dissertation explores three approaches. OpenCL and the NVIDIA CUDA driver API use file descriptor semantics to abstract accelerators: user processes access accelerators through descriptors. This approach increases the complexity of using accelerators because accelerator descriptors are needed in any call involving the accelerator (e.g., memory allocations or passing a parameter to the accelerator). The IBM Cell SDK abstracts accelerators as separate execution threads. This approach requires adding code to create new execution threads and synchronization primitives in order to use accelerators. Finally, the NVIDIA CUDA run-time API abstracts accelerators as Remote Procedure Calls (RPC).
This approach is fundamentally incompatible with ADSM, because it assumes separate virtual address spaces for accelerator and CPU code. The Heterogeneous Parallel Execution (HPE) model is presented in this dissertation. This model extends the execution thread abstraction to incorporate different execution modes. Execution modes define the capabilities (e.g., accessible virtual address space, code ISA, etc.) of the code being executed. In this execution model, accelerator calls are implemented as execution mode switches, analogously to system calls. Accelerator calls in HPE are synchronous, unlike in CUDA, OpenCL, and the IBM Cell SDK. Synchronous accelerator calls provide full compatibility with the sequential execution model provided by most operating systems. Moreover, abstracting accelerator calls as execution mode switches allows applications that use accelerators to run on systems without accelerators: in such systems, the execution mode switch falls back to an emulation layer, which emulates the accelerator execution on the CPU. This dissertation further presents different design and implementation choices for the HPE model in GMAC, along with the hardware support necessary for an efficient implementation. Experimental results show that HPE introduces a low execution-time overhead while offering a clean and simple programming interface to applications.
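The CPU fallback idea can be sketched as a hypothetical synchronous dispatch (the names and the startup probe are assumptions, not the dissertation's actual interface): the call site is identical whether the kernel runs on an accelerator or in an emulation layer on the CPU.

```c
#include <stdbool.h>

typedef void (*kernel_fn)(float *data, int n);

/* Assumption: accelerator availability is probed once at startup. */
static bool accelerator_present = false;

/* CPU emulation of a kernel that doubles each element. */
static void scale_cpu_emulation(float *data, int n) {
    for (int i = 0; i < n; i++)
        data[i] *= 2.0f;
}

/* Synchronous accelerator call: behaves like an ordinary function call
 * (analogous to a system call), wherever the kernel actually runs. */
void hpe_call(kernel_fn accel_version, kernel_fn cpu_version,
              float *data, int n) {
    if (accelerator_present)
        accel_version(data, n);   /* would switch execution mode */
    else
        cpu_version(data, n);     /* falls back to CPU emulation */
}
```

Because the call is synchronous and the fallback is transparent, the same binary logic runs unchanged on systems with and without accelerators, which is the compatibility property the abstract highlights.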

  • Decomposable and responsive power models for multicore processors using performance counters

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference for High Performance Computing, Networking, Storage and Analysis
    Presentation's date: 2010-06-04
    Presentation of work at congresses


  • An asymmetric distributed shared memory model for heterogeneous parallel systems (Open access)

     Gelado Fernandez, Isaac; Cabezas, Javier; Navarro Mas, Nacho; Stone, John E.; Patel, Sanjay; Hwu, Wen-mei W.
    International Conference on Architectural Support for Programming Languages and Operating Systems
    Presentation's date: 2010-03-13
    Presentation of work at congresses


    Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory. This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa. The asymmetry allows light-weight implementations that avoid common pitfalls of symmetrical distributed shared memory systems. ADSM allows programmers to assign data objects to performance critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs. We argue that ADSM reduces programming efforts for heterogeneous computing systems and enhances application portability. We present a software implementation of ADSM, called GMAC, on top of CUDA in a GNU/Linux environment. We show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This paper presents the GMAC system and evaluates different design choices. We further suggest additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model.
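The ADSM usage pattern can be sketched as follows; adsm_alloc and vector_scale are illustrative names, not GMAC's actual API, and plain malloc simulates memory hosted in the accelerator physical memory:

```c
#include <stdlib.h>

/* Sketch: allocation in the shared logical memory space. In a real ADSM
 * run-time this memory would physically live on the accelerator; here
 * plain malloc simulates it. */
static float *adsm_alloc(size_t n) {
    return malloc(n * sizeof(float));
}

/* Stand-in accelerator kernel: triples every element. */
static void vector_scale(float *v, int n) {
    for (int i = 0; i < n; i++)
        v[i] *= 3.0f;
}

/* Usage pattern: the CPU writes and reads through the SAME pointer the
 * kernel receives, so no explicit host<->device copies appear anywhere:
 *
 *   float *v = adsm_alloc(4);
 *   for (int i = 0; i < 4; i++) v[i] = (float)i;  // CPU writes
 *   vector_scale(v, 4);                           // "accelerator call"
 *   // CPU reads results directly; no copy-back
 *   free(v);
 */
```

Contrast this with the per-call model, where both an input and an output copy would surround the vector_scale call; removing them is precisely the programmability gain the abstract claims.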

  • Hardware support for explicit communications in scalable CMP's

     Villavieja Prados, Carlos; Katevenis, Manolis; Navarro Mas, Nacho; Pnevmatikatos, Dionisios; Ramirez Bellido, Alejandro; Kavadias, Stamatis; Papaefstathiou, Vassilis; Nikolopoulos, Dimitrios S.
    Date: 2009-01
    Report


  • Linux kernel compaction through cold code swapping

     Chanet, Dominique; Cabezas, Javier; Morancho Llena, Enrique; Navarro Mas, Nacho; De Bosschere, Koen
    Lecture notes in computer science
    Date of publication: 2009-04-22
    Journal article


  • MPEXPAR: MODELS DE PROGRAMACIO I ENTORNS D'EXECUCIO PARAL·LELS

     Gonzalez Tallada, Marc; Alonso López, Javier; Sirvent Pardell, Raül; Guitart Fernández, Jordi; Carrera Perez, David; Cortes Rossello, Antonio; Gil Gómez, Maria Luisa; Navarro Mas, Nacho; Martorell Bofill, Xavier; Torres Viñals, Jordi; Badia Sala, Rosa Maria; Corbalan Gonzalez, Julita; Costa Prats, Juan Jose; Farreras Esclusa, Montserrat; Herrero Zaragoza, José Ramón; Becerra Fontal, Yolanda; Tejedor Saavedra, Enric; Nou Castell, Ramon; Labarta Mancho, Jesus Jose; Ayguade Parra, Eduard
    Participation in a competitive project


  • Predictive runtime code scheduling for heterogeneous architectures (Open access)

     Jimenez, Victor; Vilanova, Lluis; Gelado Fernandez, Isaac; Gil Gómez, Maria Luisa; Fursin, Gregori; Navarro Mas, Nacho
    International Conference on High Performance and Embedded Architectures and Compilers
    Presentation's date: 2009-01-25
    Presentation of work at congresses


    Heterogeneous architectures are currently widespread. With the advent of easy-to-program general purpose GPUs, virtually every recent desktop computer is a heterogeneous system. Combining the CPU and the GPU brings great amounts of processing power. However, such architectures are often used in a restricted way for domain-specific applications like scientific applications and games, and they tend to be used by a single application at a time. We envision future heterogeneous computing systems where all their heterogeneous resources are continuously utilized by different applications with versioned critical parts to be able to better adapt their behavior and improve execution time, power consumption, response time and other constraints at runtime. Under such a model, adaptive scheduling becomes a critical component. In this paper, we propose a novel predictive user-level scheduler based on past performance history for heterogeneous systems. We developed several scheduling policies and present a study of their impact on system performance. We demonstrate that such a scheduler allows multiple applications to fully utilize all available processing resources in CPU/GPU-like systems and consistently achieve speedups ranging from 30% to 40% compared to just using the GPU in single application mode.
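A history-based predictive policy in the spirit of the paper can be sketched as follows (the averaging policy and all names are illustrative, not the paper's actual implementation): each device keeps a running average of past task runtimes, and new tasks are dispatched to the device with the lowest predicted time.

```c
/* Minimal sketch of a performance-history predictive scheduler. */

enum { CPU_DEV = 0, GPU_DEV = 1, NDEV = 2 };

static double total_time[NDEV];   /* accumulated runtime per device */
static int    runs[NDEV];         /* number of completed runs per device */

/* Predict from history; with no history, optimistically predict 0 so the
 * device gets tried at least once. */
static double predict(int dev) {
    return runs[dev] ? total_time[dev] / runs[dev] : 0.0;
}

/* Dispatch to the device with the lower predicted runtime (ties go to
 * the GPU in this sketch). */
int pick_device(void) {
    return predict(GPU_DEV) <= predict(CPU_DEV) ? GPU_DEV : CPU_DEV;
}

/* Feed measured runtimes back into the history after each task. */
void record_run(int dev, double seconds) {
    total_time[dev] += seconds;
    runs[dev]++;
}
```

Even this toy policy captures the key feedback loop: decisions improve as history accumulates, and a device that starts performing badly is automatically avoided.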

  • Cetra: A Trace and Analysis Framework for the Evaluation of Cell BE Systems

     Merino, Julio; Alvarez, Lluc; Gil Gómez, Maria Luisa; Navarro Mas, Nacho
    IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2009)
    Presentation's date: 2009-04-27
    Presentation of work at congresses


  • CASES 2007 guest editors' introduction

     Lumetta, Steven S.; Navarro Mas, Nacho
    Design automation for embedded systems
    Date of publication: 2009-06
    Journal article


  • High-performance reverse time migration on GPU (Open access)

     Cabezas, Javier; Ayala Polo, Mauricio; Gelado Fernandez, Isaac; Morancho Llena, Enrique; Navarro Mas, Nacho; Cela Espin, Jose M.
    International Conference of the Chilean Computer Science Society
    Presentation's date: 2009-11
    Presentation of work at congresses


    Partial Differential Equations (PDE) are the heart of most simulations in many scientific fields, from Fluid Mechanics to Astrophysics. One of the most popular mathematical schemes to solve a PDE is Finite Differences (FD). In this work we map a PDE-FD algorithm called Reverse Time Migration to a GPU using CUDA. This seismic imaging (Geophysics) algorithm is widely used in the oil industry. GPUs are natural contenders in the aftermath of the clock race, in particular for High-Performance Computing (HPC). Due to GPU characteristics, the parallelism paradigm shifts from the classical threads plus SIMD to Single Program Multiple Data (SPMD). The NVIDIA GTX 280 implementation outperforms homogeneous CPUs by up to 9x (Intel Harpertown E5420) and up to 14x (IBM PPC 970). These preliminary results confirm that GPUs are a real option for HPC, from performance to programmability.
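The finite-difference kernel at the heart of RTM can be illustrated with a 1-D second-order wave-equation time step (the real code is 3-D and higher order; this reduction is only a sketch, and the coefficient name is illustrative). On a GPU, each interior index i would map to one CUDA thread in an SPMD launch.

```c
/* One explicit FD time step of the 1-D wave equation:
 *   u(t+1, i) = 2*u(t, i) - u(t-1, i) + c2 * laplacian(u(t, i))
 * where c2 = (v * dt / dx)^2 collects velocity and grid spacing. */
void fd_step(const float *prev, const float *cur, float *next,
             int n, float c2) {
    for (int i = 1; i < n - 1; i++) {
        /* Second-order central difference approximating the Laplacian. */
        float lap = cur[i - 1] - 2.0f * cur[i] + cur[i + 1];
        next[i] = 2.0f * cur[i] - prev[i] + c2 * lap;
    }
}
```

RTM applies steps like this both forward and backward in time over large 3-D grids, which is why the arithmetic-dense, regular-access stencil maps so well to GPU hardware.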

  • H.264/AVC Decoder Parallelization in Context of CABAC Entropy Decoder

     Muhammad, Shafiq; Alvarez Mesa, Mauricio; Gil Gómez, Maria Luisa; Navarro Mas, Nacho
    Date: 2008-07
    Report


  • Resource Management in Virtualized Execution Environments: New Opportunities for Application Specific Decisions

     Becerra Fontal, Yolanda; Cortes Rossello, Antonio; Garcia Almiñana, Jordi; Navarro Mas, Nacho
    Date: 2008-06
    Report


  • On-Chip memories, the OS perspective (Open access)

     Villavieja Prados, Carlos; Gelado Fernandez, Isaac; Ramirez Bellido, Alejandro; Navarro Mas, Nacho
    HiPEAC Industrial Workshop
    Presentation's date: 2008-06-04
    Presentation of work at congresses


    This paper is a work-in-progress study of the operating system services required to manage on-chip memories. We are evaluating different CMP on-chip memory configurations. Chip-Multiprocessor (CMP) architectures integrating multiple computing and memory elements present different problems (coherency, latency, etc.) that must be solved. On-chip local memories are directly addressable, and their latency is much shorter than that of off-chip main memories. Since memory latency is a key factor for application performance, we study how the OS can help.

    Postprint (author’s final draft)

  • Evolución de la asignatura Proyecto de Redes de Computadores y Sistemas Operativos

     Martorell Bofill, Xavier; Navarro Moldes, Leandro; Navarro Mas, Nacho; Salavert Casamor, Antonio
    Jornades de Docència del Departament d'Arquitectura de Computadors. 10 Anys de Jornades
    Presentation of work at congresses


  • Synergy between Compiler Optimizations and Partitioning on the Cell processor

     Bertran Monfort, Ramon; Cavazos, John; Gil Gómez, Maria Luisa; Navarro Mas, Nacho; O'Boyle, Mike
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    Presentation of work at congresses


  • L'assignatura Estructura i Disseny de Sistemes Operatius

     Cortés, Toni; Garcia Vidal, Jorge; Garcia Almiñana, Jordi; Navarro Mas, Nacho
    Jornades de Docència del Departament d'Arquitectura de Computadors. 10 Anys de Jornades
    Presentation of work at congresses


  • Adaptive Optimization Methods for Heterogeneous Architectures

     Jimenez, Victor; Grigori, Fursin; Gil Gómez, Maria Luisa; Martorell Bofill, Xavier; Navarro Mas, Nacho
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    Presentation of work at congresses


  • RPE de la FIB: assignatures de Sistemes Operatius

     Navarro Mas, Nacho
    Jornades de Docència del Departament d'Arquitectura de Computadors. 10 Anys de Jornades
    Presentation of work at congresses


  • MPEG-4 Port to the Cell Processor

     Alvarez, Eric; Martorell Bofill, Xavier; Gil Gómez, Maria Luisa; Navarro Mas, Nacho
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    Presentation of work at congresses


  • Linux Kernel Compaction through Cold Code Swapping

     Chanet, Dominique; Cabezas, Javier; Morancho Llena, Enrique; Navarro Mas, Nacho; De Bosschere, Koen
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    Presentation of work at congresses
