Scientific and technological production
  • Experimental assessment of a high performance back-end PCE for Flexgrid optical network re-optimization

     Gifre Renom, Lluis; Velasco Esteban, Luis; Navarro Mas, Nacho; Junyent Giralt, Gabriel
    Optical Fiber Communication Conference and Exposition and National Fiber Optic Engineers Conference
    p. 1-3
    DOI: 10.1364/OFC.2014.W4A.3
    Presentation's date: 2014-03
    Presentation of work at congresses

    A specialized high performance Graphics Processing Unit (GPU)-based back-end Path Computation Element (PCE) to compute re-optimization in Flexgrid networks is presented. Experimental results show 6x speedups compared to a single centralized PCE.

  • Models de Programacio i Entorns d'eXecució PARal.lels

     Becerra Fontal, Yolanda; Carrera Perez, David; Corbalan Gonzalez, Julita; Cortes Rossello, Antonio; Costa Prats, Juan Jose; Farreras Esclusa, Montserrat; Gil Gómez, Maria Luisa; Gonzalez Tallada, Marc; Guitart Fernández, Jordi; Herrero Zaragoza, José Ramón; Labarta Mancho, Jesus Jose; Martorell Bofill, Xavier; Navarro Mas, Nacho; Nin Guerrero, Jordi; Torres Viñals, Jordi; Tous Liesa, Ruben; Utrera Iglesias, Gladys Miriam; Ayguade Parra, Eduard
    Competitive project

  • Architecture of a specialized back-end high performance computing-based PCE for flexgrid networks

     Gifre Renom, Lluis; Velasco Esteban, Luis; Navarro Mas, Nacho
    International Conference on Transparent Optical Networks
    p. Mo.C4.3-1-Mo.C4.3-4
    DOI: 10.1109/ICTON.2013.6602716
    Presentation's date: 2013-06
    Presentation of work at congresses

    The need to execute network re-optimization operations in order to efficiently manage and deploy new-generation flexgrid-based optical networks has brought to light the need for specialized PCEs capable of performing such time-consuming computations. The objective of such re-optimizations is to compute network reconfigurations, based on the current state of network resources, that achieve near-optimal resource utilization. Such PCEs require high performance computing equipment to process the huge amount of data in both the Traffic Engineering Database (TED) and the Label Switched Path Database (LSP-DB) in a unified computation step. To address this problem, a High Performance Computing (HPC) Graphics Processing Unit (GPU)-based cluster architecture is proposed in this paper. This architecture is capable of attending to Path Computation Element (PCE) requests that demand execution of network re-optimization tasks, performing such computations, and reporting a near-optimal solution in practical times.

  • Design space explorations for streaming accelerators using streaming architectural simulator  Open access

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Bhurban Conference on Applied Sciences and Technology
    p. 169-178
    DOI: 10.1109/IBCAST.2013.6512151
    Presentation's date: 2013-01
    Presentation of work at congresses

    In recent years, streaming accelerators like GPUs have emerged as an effective step towards parallel computing. The wish-list for these devices spans from support for thousands of small cores to a nature very close to general-purpose computing. This makes the design space for future accelerators containing thousands of parallel streaming cores very vast, and complicates choosing the right architectural configuration for next-generation devices. However, accurate design space exploration tools developed for massively parallel architectures can ease this task. The main objectives of this work are twofold. (i) We present a complete environment of a trace-driven simulator named SArcs (Streaming Architectural Simulator) for streaming accelerators. (ii) We use our simulation tool-chain for design space explorations of GPU-like streaming architectures. Our design space explorations for different architectural aspects of a GPU-like device are made with reference to a baseline established for NVIDIA's Fermi architecture (GPU Tesla C2050). The explored aspects include the performance effects of variations in the configurations of the Streaming Multiprocessors, the global memory bandwidth, the channels between SMs and the memory hierarchy, and the cache hierarchy. The explorations are performed using application kernels from Vector Reduction, 2D-Convolution, Matrix-Matrix Multiplication and 3D-Stencil. Results show that the configurations of the computational resources for the current Fermi GPU device can deliver higher performance with further improvement in the global memory bandwidth for the same device.
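
    The abstract's central tool, a trace-driven simulator, reduces to a simple pattern: replay a stream of recorded events against a timing model. The minimal C sketch below assumes a hypothetical two-field trace format and illustrative latencies; it is not the SArcs tool, only the idea behind it.

    #include <stdio.h>

    /* Minimal trace-driven timing loop. Each trace record is an opcode
       ('C' = compute, 'L' = load) plus an address; the latencies and the
       direct-mapped cache are illustrative placeholders, not SArcs. */
    typedef struct { char op; unsigned long addr; } TraceRec;

    #define CACHE_LINES 1024
    #define LINE_BYTES  128

    int main(void) {
        static unsigned long tag[CACHE_LINES];  /* stores line+1; 0 = empty */
        unsigned long cycles = 0, hits = 0, misses = 0;
        TraceRec r;

        /* Assumed trace format on stdin: "<op> <hex-address>" per record. */
        while (scanf(" %c %lx", &r.op, &r.addr) == 2) {
            if (r.op == 'C') {
                cycles += 1;                    /* one cycle per compute op */
            } else {                            /* 'L': model one cache level */
                unsigned long line = r.addr / LINE_BYTES;
                unsigned long set  = line % CACHE_LINES;
                if (tag[set] == line + 1) { cycles += 4;   hits++;   }
                else { tag[set] = line + 1; cycles += 200; misses++; } /* DRAM trip */
            }
        }
        printf("cycles=%lu hits=%lu misses=%lu\n", cycles, hits, misses);
        return 0;
    }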

    Postprint (author’s final draft)

  • Hardware-software coherence protocol for the coexistence of caches and local memories

     Alvarez, Lluc; Vilanova, Lluis; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference for High Performance Computing, Networking, Storage and Analysis
    p. Article No. 89
    DOI: 10.1109/TC.2013.194
    Presentation's date: 2012-11-07
    Presentation of work at congresses

  • PPMC: hardware scheduling and memory management support for multi accelerators

     Hussain, Tassadaq; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference on Field Programmable Logic and Applications
    p. 571-574
    DOI: 10.1109/FPL.2012.6339373
    Presentation's date: 2012-08
    Presentation of work at congresses

  • Assessing the impact of network compression on molecular dynamics and finite element methods

     Dickov, Branimir; Pericas, Miquel; Houzeaux, Guillaume; Navarro Mas, Nacho; Ayguade Parra, Eduard
    IEEE International Conference on High Performance Computing and Communications
    p. 588-597
    DOI: 10.1109/HPCC.2012.85
    Presentation's date: 2012-06
    Presentation of work at congresses

  • Architectural Explorations for Streaming Accelerators with Customized Memory Layouts  Open access

     Shafiq, Muhammad
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    The basic concept behind the architecture of a general purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip helped CPUs run parts of a program in parallel. However, the huge parallelism available in many high performance applications, and in the corresponding data, is hard to exploit on these general purpose multi-cores. Streaming accelerators and the corresponding programming models improve upon this situation by providing throughput-oriented architectures. The basic idea behind the design of these architectures matches the ever-increasing requirements of processing huge data sets. These high-performance, throughput-oriented devices enable fast processing of data by using efficient parallel computations and streaming-based communications. Throughput-oriented streaming accelerators, like other processors, consist of numerous micro-architectural components, including memory structures, compute units, control units, I/O channels and I/O controls. However, the throughput requirements add some special features and impose other restrictions for performance reasons. These devices normally offer a large number of compute resources, but require applications to arrange parallel and maximally independent data sets to feed the compute resources in the form of streams. Arranging data into independent sets of parallel streams is not a simple task. It may require changing the structure of an algorithm as a whole, or even rewriting the algorithm from scratch for the target application. Even so, all these efforts to re-arrange an application's data access patterns may still not achieve optimal performance, because of possible micro-architectural constraints of the target platform: the hardware pre-fetching mechanisms, the size and granularity of the local storage, and the flexibility of data marshaling inside the local storage. The constraints a general purpose streaming platform places on pre-fetching, storing and maneuvering data into parallel and independent streams could be removed by micro-architectural level design approaches, including the use of application-specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations for streaming accelerators using customized memory layouts. In general, the thesis covers three main aspects of such streaming accelerators: i) design of application-specific accelerators with customized memory layouts, ii) template-based design support for customized memory accelerators, and iii) design space explorations for throughput-oriented devices with standard and customized memories. The thesis concludes with a conceptual proposal for a Blacksmith Streaming Architecture (BSArc).
    The Blacksmith computing model allows the hardware-level adoption of an application-specific front-end with a GPU-like streaming back-end. This gives an opportunity to exploit the maximum possible data locality and data-level parallelism of an application while providing a powerful, throughput-oriented back-end. We consider that the designs of these specialized memory layouts for the front-end of the device are provided by application domain experts in the form of templates. These templates are adjustable to a device and problem size at the device's configuration time. The physical availability of such an architecture may still take time; however, a simulation framework helps in architectural explorations, giving insight into the proposal and predicting the potential performance benefits of such an architecture.

  • BSArc: blacksmith streaming architecture for HPC accelerators

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM International Conference on Computing Frontiers
    p. 23-32
    DOI: 10.1145/2212908.2212914
    Presentation's date: 2012-05
    Presentation of work at congresses

  • PPMC: a programmable pattern based memory controller

     Hussain, Tassadaq; Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International IEEE/ACM Symposium on Applied Reconfigurable Computing
    p. 89-101
    DOI: 10.1007/978-3-642-28365-9_8
    Presentation's date: 2012-03
    Presentation of work at congresses

  • HIPEAC 3 - European Network of Excellence on High-Performance Embedded Architecture and Compilers

     Gil Gómez, Maria Luisa; Navarro Mas, Nacho; Martorell Bofill, Xavier; Valero Cortes, Mateo; Ayguade Parra, Eduard; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose; Llaberia Griño, Jose M.
    Competitive project

  • Hardware and software support for distributed shared memory in chip multiprocessors

     Villavieja Prados, Carlos
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

  • POTRA: a framework for building power models for next generation multicore architectures

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM SIGMETRICS/PERFORMANCE joint International Conference on Measurement and Modeling of Computer Systems
    p. 427-428
    DOI: 10.1145/2254756.2254827
    Presentation's date: 2012
    Presentation of work at congresses

  • The data transfer engine: towards a software controlled memory hierarchy

     Garcia Flores, Victor; Rico Carro, Alejandro; Villavieja Prados, Carlos; Navarro Mas, Nacho; Ramirez Bellido, Alejandro
    International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems
    p. 215-218
    Presentation's date: 2012
    Presentation of work at congresses

  • Implementation of a reverse time migration kernel using the HCE high level synthesis tool

     Hussain, Tassadaq; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference on Field-Programmable Technology
    p. 1-8
    DOI: 10.1109/FPT.2011.6132717
    Presentation's date: 2011-12-12
    Presentation of work at congresses

  • DiDi: mitigating the performance impact of TLB shootdowns using a shared TLB directory

     Villavieja Prados, Carlos; Karakostas, Vasileios; Vilanova, Lluis; Etsion, Yoav; Ramirez Bellido, Alejandro; Mendelson, Avi; Navarro Mas, Nacho; Cristal Kestelman, Adrián; Unsal, Osman Sabri
    International Conference on Parallel Architectures and Compilation Techniques
    p. 340-349
    DOI: 10.1109/PACT.2011.65
    Presentation's date: 2011-10-04
    Presentation of work at congresses

  • FELI: HW/SW support for on-chip distributed shared memory in multicores

     Villavieja Prados, Carlos; Etsion, Yoav; Ramirez Bellido, Alejandro; Navarro Mas, Nacho
    International European Conference on Parallel and Distributed Computing
    p. 282-294
    DOI: 10.1007/978-3-642-23400-2_27
    Presentation's date: 2011-09-02
    Presentation of work at congresses

  • Design space exploration for aggressive core replication schemes in CMPs

     Álvarez Martí, Lluc; Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Symposium on High Performance Distributed Computing
    p. 269-270
    DOI: 10.1145/1996130.1996169
    Presentation's date: 2011-06-08
    Presentation of work at congresses

  • TARCAD: a template architecture for reconfigurable accelerator designs

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    IEEE Symposium on Application Specific Processors
    p. 8-15
    DOI: 10.1109/SASP.2011.5941071
    Presentation's date: 2011-06-05
    Presentation of work at congresses

  • FEM: A step towards a common memory layout for FPGA based accelerators

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference on Field Programmable Logic and Applications
    p. 568-573
    DOI: 10.1109/FPL.2010.111
    Presentation's date: 2010-08
    Presentation of work at congresses

    FPGA devices are mostly utilized for customized application designs with heavily pipelined and aggressively parallel computations. However, little attention is normally given to the FPGA memory organization needed to efficiently use the data fetched into the FPGA. This work presents a Front End Memory (FEM) layout based on BRAMs and distributed RAM for FPGA-based accelerators. The presented memory layout serves as a template for various data organizations, which is in fact a step towards standardizing a methodology for FPGA-based memory management inside an accelerator. We present example application kernels implemented as specializations of the template memory layout. Further, the presented layout can be used for spatially mapped, shared memory multi-kernel applications targeting FPGAs. This is evaluated by mapping two applications, an Acoustic Wave Equation code and an N-Body method, to three multi-kernel execution models on a Virtex-4 LX200 device. The results show that the shared memory model for the Acoustic Wave Equation code outperforms the local and runtime-reconfigured models by 1.3x and 1.5x, respectively. For the N-Body method the shared model is slightly more efficient with a small number of bodies, but for larger systems the runtime-reconfigured model shows a 3x speedup over the other two models.

  • Streaming scatter/gather DMA controller for hardware accelerators

     Hussain, Tassadaq; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems
    Presentation's date: 2010-07
    Presentation of work at congresses

    Feeding data to hardware accelerators is an intricate process that affects the performance and efficiency of systems. In System-on-Chip environments, hardware accelerators act as targets/slaves, and the movement of data to/from memory is controlled by a microprocessor (initiator/master) unit. Processors play a middleman role: they read data from the hardware accelerator and write it to memory, and vice versa. This technique provides flexibility but affects the performance of the system. To get maximum benefit from parallelism, HPC applications need memory controllers that have CPU-like intelligence and the potential to synchronize with the hardware accelerator. In this abstract we present a memory controller that provides scatter/gather DMA functionality. This memory controller takes maximum benefit of the hardware fabric by feeding data in streaming format. Memory access patterns are defined by programmable descriptor blocks available in the controller. We measure gate count and speed by running the memory controller on a Xilinx Virtex-5 ML505 board and compare the results with an SoC designed in the Xilinx Base System Builder.
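
    The descriptor-block idea mentioned above can be sketched as data structures: the controller walks a chain of programmable descriptors, each describing one strided access pattern. The field names and the software walk below are hypothetical illustrations, not the actual PPMC register layout.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Hypothetical scatter/gather descriptor: one strided access pattern.
       A chain of these lets the controller stream non-contiguous data
       without CPU involvement. */
    typedef struct dma_desc {
        uint64_t src_addr;      /* source base address                  */
        uint64_t dst_addr;      /* destination base address             */
        uint32_t chunk_bytes;   /* bytes per contiguous chunk           */
        uint32_t stride_bytes;  /* source distance between chunks       */
        uint32_t num_chunks;    /* chunks described by this descriptor  */
        struct dma_desc *next;  /* next descriptor; NULL ends the chain */
    } dma_desc_t;

    /* Software model of the hardware walk: gather strided source chunks
       into a contiguous destination (memcpy stands in for the bus). */
    static void dma_run(const dma_desc_t *d) {
        for (; d != NULL; d = d->next)
            for (uint32_t i = 0; i < d->num_chunks; i++)
                memcpy((void *)(uintptr_t)(d->dst_addr + (uint64_t)i * d->chunk_bytes),
                       (const void *)(uintptr_t)(d->src_addr + (uint64_t)i * d->stride_bytes),
                       d->chunk_bytes);
    }

    int main(void) {
        float src[8] = {0, 1, 2, 3, 4, 5, 6, 7}, dst[4];
        dma_desc_t d = { (uintptr_t)src, (uintptr_t)dst,
                         2 * sizeof(float),     /* gather pairs...           */
                         4 * sizeof(float),     /* ...skipping every other pair */
                         2, NULL };
        dma_run(&d);
        printf("%g %g %g %g\n", dst[0], dst[1], dst[2], dst[3]); /* 0 1 4 5 */
        return 0;
    }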

  • On the programmability of heterogeneous massively-parallel computing systems  Open access

     Gelado Fernandez, Isaac
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Heterogeneous parallel computing combines general purpose processors with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous parallel computing impose added coding complexity compared to traditional sequential shared-memory programming models for homogeneous systems. This extra code complexity is acceptable in supercomputing environments, where programmability is sacrificed in pursuit of high performance. However, heterogeneous parallel systems are massively reaching the desktop market (e.g., 425.4 million GPU cards were sold in 2009), where the trade-off between performance and programmability is the opposite. The code complexity required when using accelerators, and the lack of compatibility, prevent programmers from exploiting the full computing capabilities of heterogeneous parallel systems in general purpose applications. This dissertation aims to increase the programmability of CPU-accelerator systems without introducing major performance penalties. The key insight is that general purpose application programmers tend to favor programmability at the cost of system performance. This fact is illustrated by the tendency to use high-level programming languages, such as C++, to ease the task of programming at the cost of minor performance penalties. Moreover, many general purpose applications are currently developed in interpreted languages, such as Java, C# or Python, which raise the abstraction level even further while introducing relatively large performance overheads. This dissertation also takes the approach of raising the level of abstraction for accelerators to improve programmability, and investigates hardware and software mechanisms to efficiently implement these high-level abstractions without introducing major performance overheads. Heterogeneous parallel systems typically implement separate memories for CPUs and accelerators, although commodity systems might use a shared memory at the cost of lower performance. However, in these commodity shared memory systems, coherence between accelerators and CPUs is not guaranteed. This system architecture implies that CPUs can only access system memory, and accelerators can only access their own local memory. This dissertation assumes separate system and accelerator memories and shows that low-level abstractions for these disjoint address spaces are the source of the poor programmability of heterogeneous parallel systems. A first consequence of having separate system and accelerator memories is the current data transfer models for heterogeneous parallel systems. In this dissertation two data transfer paradigms are identified: per-call and double-buffered. In these two models, data structures used by accelerators are allocated in both system and accelerator memories. The models differ in how data between accelerator and system memories is managed. The per-call model transfers the input data needed by accelerators before each accelerator call, and transfers back the output data produced by accelerators on accelerator call return. The per-call model is quite simple, but might impose unacceptable performance penalties due to data transfer overheads. The double-buffered model aims to overlap data communication with CPU and accelerator computation. This model requires relatively complex code due to parallel execution and the need for synchronization between data communication and processing tasks.
    The extra code required for data transfers in these two models is necessary due to the lack of by-reference parameter passing to accelerators. This dissertation presents a novel accelerator-hosted data transfer model. In this model, data used by accelerators is hosted in the accelerator memory, so when the CPU accesses this data, it is effectively accessing the accelerator memory. Such a model cleanly supports by-reference parameter passing to accelerator calls, removing the need for explicit data transfers. The second consequence of separate system and accelerator memories is that current programming models export separate virtual system and accelerator address spaces to application programmers. This dissertation identifies the double-pointer problem as a direct consequence of these separate virtual memory spaces. The double-pointer problem is that data structures used by both accelerators and CPUs are referenced by different virtual memory addresses (pointers) in the CPU and accelerator code. The double-pointer problem requires programmers to add extra code to ensure that both pointers contain consistent values (e.g., when reallocating a data structure). Keeping consistency between system and accelerator pointers might penalize accelerator performance and increase the accelerator memory requirements when pointers are embedded within data structures (e.g., a linked list). For instance, the double-pointer problem doubles the number of global memory accesses in a GPU code that reconstructs a linked list. This dissertation argues that a unified virtual address space that includes both system and accelerator memories is an efficient solution to the double-pointer problem. Moreover, such a unified virtual address space cleanly complements the accelerator-hosted data model previously discussed. This dissertation introduces the Non-Uniform Accelerator Memory Access (NUAMA) architecture as a hardware implementation of the accelerator-hosted data transfer model and the unified virtual address space. In NUAMA, an Accelerator Memory Collector (AMC) is included within the system memory controller to identify memory requests for accelerator-hosted data. The AMC buffers and coalesces such memory requests to efficiently transfer data from the CPU to the accelerator memory. NUAMA also implements a hybrid L2 cache memory. The L2 cache in NUAMA follows a write-through/write-non-allocate policy for accelerator-hosted data. This policy ensures that the contents of the accelerator memory are updated eagerly and, therefore, most of the data has already been transferred when the accelerator is called. The eager update of the accelerator memory contents effectively overlaps data communication and CPU computation. A write-back/write-allocate policy is used for data hosted by the system memory, so the performance of applications that do not use accelerators is not affected. In NUAMA, accelerator-hosted data is identified using a TLB-assisted mechanism. The page table entries are extended with a bit that is set for those memory pages hosted by the accelerator memory. NUAMA increases the average bandwidth requirements for the L2 cache memory and the interconnection network between the CPU and accelerators, but the instantaneous bandwidth requirements, which are the limiting factor, are lower than in traditional DMA-based architectures. The NUAMA architecture is compared to traditional DMA systems using cycle-accurate simulations.
    Experimental results show that NUAMA and traditional DMA-based architectures perform equally well, but the application source code complexity of NUAMA is much lower than in DMA-based architectures. A software implementation of the accelerator-hosted model and the unified virtual address space is also explored. This dissertation presents the Asymmetric Distributed Shared Memory (ADSM) model. ADSM maintains a shared logical memory space for CPUs to access data in the accelerator physical memory, but not vice versa. The asymmetry allows light-weight implementations that avoid common pitfalls of symmetrical distributed shared memory systems. ADSM allows programmers to assign data structures to performance-critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs. ADSM reduces programming efforts for heterogeneous parallel computing systems and enhances application portability. The design and implementation of an ADSM run-time, called GMAC, on top of CUDA in a GNU/Linux environment is presented. Experimental results show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This dissertation presents the GMAC system, evaluates different design choices, and further suggests additional architectural support that would likely allow GMAC to achieve higher application performance than the current CUDA model. Finally, the execution model of heterogeneous parallel systems is considered. Accelerator execution is abstracted in different ways by existing programming models, and this dissertation explores three such approaches. OpenCL and the NVIDIA CUDA driver API use file descriptor semantics to abstract accelerators: user processes access accelerators through descriptors. This approach increases the complexity of using accelerators, because accelerator descriptors are needed in any call involving the accelerator (e.g., allocating memory or passing a parameter to the accelerator). The IBM Cell SDK abstracts accelerators as separate execution threads. This approach requires adding code to create new execution threads and synchronization primitives in order to use accelerators. Finally, the NVIDIA CUDA run-time API abstracts accelerators as Remote Procedure Calls (RPC). This approach is fundamentally incompatible with ADSM, because it assumes separate virtual address spaces for accelerator and CPU code. The Heterogeneous Parallel Execution (HPE) model is presented in this dissertation. This model extends the execution thread abstraction to incorporate different execution modes. Execution modes define the capabilities (e.g., accessible virtual address space, code ISA, etc.) of the code being executed. In this execution model, accelerator calls are implemented as execution mode switches, analogously to system calls. Accelerator calls in HPE are synchronous, in contrast to CUDA, OpenCL and the IBM Cell SDK. Synchronous accelerator calls provide full compatibility with the sequential execution model provided by most operating systems. Moreover, abstracting accelerator calls as execution mode switches allows applications that use accelerators to run on systems without accelerators.
    In such systems, the execution mode switch falls back to an emulation layer, which emulates accelerator execution on the CPU. This dissertation further presents different design and implementation choices for the HPE model in GMAC, along with the hardware support necessary for an efficient implementation. Experimental results show that HPE introduces a low execution-time overhead while offering a clean and simple programming interface to applications.
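
    The double-pointer problem and the by-reference alternative described above are easiest to see side by side in code. The sketch below uses stubbed, hypothetical calls (acc_malloc, acc_copy_in, adsm_alloc); they stand in for, but are not, the GMAC or CUDA APIs, and host memory stands in for accelerator memory.

    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    /* Stubbed, hypothetical accelerator API -- illustration only. */
    static void *acc_malloc(size_t n) { return malloc(n); }
    static void acc_copy_in(void *acc, const void *host, size_t n) { memcpy(acc, host, n); }
    static void acc_launch(float *p, size_t n) { for (size_t i = 0; i < n; i++) p[i] += 1.0f; }
    static void *adsm_alloc(size_t n) { return malloc(n); }  /* shared logical space */

    /* Per-call model: the same array lives twice and is named by two
       different pointers -- the double-pointer problem. */
    static void per_call(float *host_data, size_t n) {
        float *acc_data = acc_malloc(n * sizeof *acc_data);   /* second pointer */
        acc_copy_in(acc_data, host_data, n * sizeof *acc_data);
        acc_launch(acc_data, n);   /* kernel sees acc_data, CPU code host_data */
        free(acc_data);
    }

    /* ADSM / accelerator-hosted model: one allocation, one pointer valid on
       both sides, so the call is by-reference and no explicit copy appears. */
    static void adsm_style(size_t n) {
        float *data = adsm_alloc(n * sizeof *data);
        for (size_t i = 0; i < n; i++) data[i] = (float)i;    /* CPU writes   */
        acc_launch(data, n);                                  /* device reads */
        printf("data[1] = %g\n", data[1]);                    /* CPU reads back */
        free(data);
    }

    int main(void) {
        float host[4] = {0, 1, 2, 3};
        per_call(host, 4);
        adsm_style(4);
        return 0;
    }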

  • Decomposable and responsive power models for multicore processors using performance counters

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference for High Performance Computing, Networking, Storage and Analysis
    p. 147-158
    DOI: 10.1145/1810085.1810108
    Presentation's date: 2010-06-04
    Presentation of work at congresses

  • An asymmetric distributed shared memory model for heterogeneous parallel systems  Open access

     Gelado Fernandez, Isaac; E. Stone, John; Cabezas, Javier; Patel, Sanjay; Navarro Mas, Nacho; W. Hwu, Wen-mei
    International Conference on Architectural Support for Programming Languages and Operating Systems
    p. 347-358
    DOI: 10.1145/1736020.1736059
    Presentation's date: 2010-03-13
    Presentation of work at congresses

    Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory. This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory, but not vice versa. The asymmetry allows light-weight implementations that avoid common pitfalls of symmetrical distributed shared memory systems. ADSM allows programmers to assign data objects to performance-critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs. We argue that ADSM reduces programming efforts for heterogeneous computing systems and enhances application portability. We present a software implementation of ADSM, called GMAC, on top of CUDA in a GNU/Linux environment. We show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This paper presents the GMAC system and evaluates different design choices. We further suggest additional architectural support that would likely allow GMAC to achieve higher application performance than the current CUDA model.

  • Exploiting Dataflow Parallelism in Teradevice Computing (TERAFLUX)

     Badia Sala, Rosa Maria; Ramirez Bellido, Alejandro; Navarro Mas, Nacho; Gil Gómez, Maria Luisa
    Competitive project

  • DISEÑO DE REDES INALÁMBRICAS INTEROPERABLES CON CAPACIDAD PARA SENSORES HETEROGÉNEOS

     Jimenez Castells, Marta; Gil Gómez, Maria Luisa; Navarro Mas, Nacho
    Competitive project

  • Row-interleaved streaming data flow implementation of sparse matrix vector multiplication in FPGA

     Dickov, Branimir; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    HiPEAC Workshop on Reconfigurable Computing
    p. 1-10
    Presentation's date: 2010-01
    Presentation of work at congresses

    Sparse Matrix-Vector Multiplication (SMVM) is the critical computational kernel of many iterative solvers for systems of sparse linear equations. In this paper we propose an FPGA design for SMVM which interleaves the CRS (Compressed Row Storage) format so that just a single floating point accumulator is needed, which simplifies control, avoids idle clock cycles and sustains high throughput. For the evaluation of the proposed design we use a RASC RC100 blade attached to an SGI Altix multiprocessor architecture. The limited memory bandwidth of this architecture heavily constrains the demonstrated performance. However, the use of FIFO buffers to stream input data makes the design portable to other FPGA-based platforms with higher memory bandwidth.
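
    For reference, the baseline CRS kernel that such a design accelerates fits in a few lines of C. The paper's contribution is interleaving the row streams so a hardware pipeline needs only one floating point accumulator; this plain software version shows the data structures, not the interleaving.

    #include <stdio.h>

    /* Baseline CRS (Compressed Row Storage) sparse matrix-vector multiply,
       y = A*x. row_ptr[i]..row_ptr[i+1] delimit the nonzeros of row i. */
    static void smvm_crs(int nrows, const int *row_ptr, const int *col_idx,
                         const double *val, const double *x, double *y) {
        for (int i = 0; i < nrows; i++) {
            double acc = 0.0;                      /* one accumulator per row */
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                acc += val[k] * x[col_idx[k]];
            y[i] = acc;
        }
    }

    int main(void) {
        /* 3x3 example: [[4 0 1], [0 2 0], [3 0 5]] in CRS form. */
        int    row_ptr[] = {0, 2, 3, 5};
        int    col_idx[] = {0, 2, 1, 0, 2};
        double val[]     = {4, 1, 2, 3, 5};
        double x[]       = {1, 1, 1}, y[3];
        smvm_crs(3, row_ptr, col_idx, val, x, y);
        printf("%g %g %g\n", y[0], y[1], y[2]);    /* prints: 5 2 8 */
        return 0;
    }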

  • Exploiting memory customization in FPGA for 3D stencil computations

     Shafiq, Muhammad; Pericas, Miquel; De la Cruz Martinez, Raul; Araya Polo, Mauricio; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference on Field-Programmable Technology
    p. 38-45
    DOI: 10.1109/FPT.2009.5377644
    Presentation's date: 2009-12
    Presentation of work at congresses

    3D stencil computations are compute-intensive kernels often appearing in high-performance scientific and engineering applications. The key to efficiency in these memory-bound kernels is full exploitation of data reuse. This paper explores the design aspects of 3D-Stencil implementations that maximize the reuse of all input data on an FPGA architecture. The work focuses on the architectural design of 3D stencils with the form n x (n + 1) x n, where n = {2, 4, 6, 8, ...}. The performance of the architecture is evaluated using two design approaches, "Multi-Volume" and "Single-Volume". When n = 8, the designs achieve a sustained throughput of 55.5 GFLOPS in the "Single-Volume" approach and 103 GFLOPS in the "Multi-Volume" approach in a 100-200 MHz multi-rate implementation on a Virtex-4 LX200 FPGA. This corresponds to a stencil data delivery of 1500 bytes/cycle and 2800 bytes/cycle, respectively. The implementation is analyzed and compared to two CPU cache approaches and to the statically scheduled local stores on the IBM PowerXCell 8i. The FPGA approaches designed here achieve much higher bandwidth despite the FPGA device being the least recent of the chips considered. These numbers show how a custom memory organization can provide large data throughput when implementing 3D stencil kernels.
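
    To ground the terminology: a 3D stencil updates each grid point from a fixed neighborhood, so each output needs many reads and data reuse dominates performance. A minimal 7-point stencil in C is sketched below; the coefficients and grid size are illustrative assumptions, not the paper's n x (n + 1) x n kernels.

    #include <stdio.h>
    #include <stdlib.h>

    /* Minimal 7-point 3D stencil on an NX*NY*NZ grid (boundaries skipped). */
    #define NX 32
    #define NY 32
    #define NZ 32
    #define IDX(i, j, k) (((i) * NY + (j)) * NZ + (k))

    static void stencil7(const float *in, float *out) {
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                for (int k = 1; k < NZ - 1; k++)
                    out[IDX(i, j, k)] =
                        0.4f * in[IDX(i, j, k)] +          /* center point */
                        0.1f * (in[IDX(i - 1, j, k)] + in[IDX(i + 1, j, k)] +
                                in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)] +
                                in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]);
    }

    int main(void) {
        float *a = calloc(NX * NY * NZ, sizeof *a);
        float *b = calloc(NX * NY * NZ, sizeof *b);
        a[IDX(16, 16, 16)] = 1.0f;                 /* point source */
        stencil7(a, b);                            /* one sweep    */
        printf("neighbor value: %g\n", b[IDX(15, 16, 16)]);  /* 0.1 */
        free(a); free(b);
        return 0;
    }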

  • High-performance reverse time migration on GPU  Open access

     Cabezas, Javier; Araya Polo, Mauricio; Gelado Fernandez, Isaac; Morancho Llena, Enrique; Navarro Mas, Nacho; Cela Espin, Jose M.
    International Conference of the Chilean Computer Science Society
    p. 77-86
    DOI: 10.1109/SCCC.2009.19
    Presentation's date: 2009-11
    Presentation of work at congresses

    Partial Differential Equations (PDE) are the heart of most simulations in many scientific fields, from Fluid Mechanics to Astrophysics. One of the most popular mathematical schemes to solve a PDE is Finite Difference (FD). In this work we map a PDE-FD algorithm called Reverse Time Migration to a GPU using CUDA. This seismic imaging (Geophysics) algorithm is widely used in the oil industry. GPUs are natural contenders in the aftermath of the clock race, in particular for High-Performance Computing (HPC). Due to GPU characteristics, the parallelism paradigm shifts from the classical threads plus SIMD to Single Program Multiple Data (SPMD). The NVIDIA GTX 280 implementation outperforms homogeneous CPUs by up to 9x (Intel Harpertown E5420) and up to 14x (IBM PPC 970). These preliminary results confirm that GPUs are a real option for HPC, from performance to programmability.
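
    The finite-difference core that RTM iterates is a wave-equation time step. A minimal 1D, second-order version in C shows the update pattern that such work maps onto thousands of GPU threads; the grid size, coefficient and source below are illustrative assumptions.

    #include <stdio.h>

    #define N  512            /* illustrative 1D grid */
    #define NT 100            /* time steps           */

    /* 1D acoustic wave equation, 2nd order in time and space:
       p_next[i] = 2*p[i] - p_prev[i] + c*(p[i-1] - 2*p[i] + p[i+1]).
       RTM runs this kind of update forward and time-reversed on 3D volumes. */
    int main(void) {
        static float prev[N], cur[N], next[N];
        const float c = 0.25f;                /* (v*dt/dx)^2, chosen stable */
        cur[N / 2] = 1.0f;                    /* impulsive source           */

        for (int t = 0; t < NT; t++) {
            for (int i = 1; i < N - 1; i++)
                next[i] = 2.0f * cur[i] - prev[i]
                        + c * (cur[i - 1] - 2.0f * cur[i] + cur[i + 1]);
            /* rotate wavefield buffers for the next step */
            for (int i = 0; i < N; i++) { prev[i] = cur[i]; cur[i] = next[i]; }
        }
        printf("sample at source after %d steps: %f\n", NT, cur[N / 2]);
        return 0;
    }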

  • MPEXPAR: MODELS DE PROGRAMACIO I ENTORNS D'EXECUCIO PARAL·LELS

     Gonzalez Tallada, Marc; Labarta Mancho, Jesus Jose; Tejedor Saavedra, Enric; Alonso López, Javier; Farreras Esclusa, Montserrat; Costa Prats, Juan Jose; Corbalan Gonzalez, Julita; Cortes Rossello, Antonio; Becerra Fontal, Yolanda; Badia Sala, Rosa Maria; Torres Viñals, Jordi; Herrero Zaragoza, José Ramón; Martorell Bofill, Xavier; Carrera Perez, David; Guitart Fernández, Jordi; Sirvent Pardell, Raül; Navarro Mas, Nacho; Gil Gómez, Maria Luisa; Nou Castell, Ramon; Ayguade Parra, Eduard
    Competitive project

  • Mapping sparse matrix-vector multiplication (SMVM) on FPGA - reconfigurable supercomputing

     Dickov, Branimir; Pericas, Miquel; Ayguade Parra, Eduard; Navarro Mas, Nacho
    International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems
    Presentation's date: 2009-07
    Presentation of work at congresses

    In many iterative solvers for systems of sparse linear equations, Sparse Matrix-Vector Multiplication (SMVM) is a critical computational kernel. In this work we use an FPGA as an accelerator attached to a host processor (an Altix supercomputer) to accelerate SMVM. However, due to the large data requirements, performance depends heavily on memory bandwidth, which is very limited on the Altix platform. Input data are streamed through FIFO buffers, which simplifies porting to other platforms with better memory bandwidth characteristics. By modifying the traditional CRS (Compressed Row Storage) representation for sparse matrices and by implementing an innovative floating point accumulation, we avoid idle clock cycles and achieve sustained high throughput.

  • A streaming based high performance FPGA core for 3D reverse time migration

     Shafiq, Muhammad; Pericas, Miquel; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems
    Presentation's date: 2009-07
    Presentation of work at congresses

    Reverse Time Migration (RTM) is a wave equation depth migration method. It offers insights into geology that were previously impossible to interpret or understand using seismic data. With its unmatched benefits for seismic imaging, RTM is also highly expensive in terms of computation and data bandwidth. It imposes requirements for fast processing core(s) along with large memories and caches. In this work we have developed a high performance 100-200 MHz multi-rate FPGA core for RTM that manages both data and computations efficiently. The core uses a generic streaming interface to input and output large volumes of data. The core's performance was tested using a Virtex-4 LX200 device. The results show that our RTM core can achieve 13.5 GFLOPS, and its usage as a loosely coupled accelerator in a heterogeneous Altix-4700 environment achieves a minimum speedup of 3.5x compared to native software-only execution on a 1.6 GHz Itanium-2 core.

  • Cetra: A Trace and Analysis Framework for the Evaluation of Cell BE Systems

     Merino, Julio; Alvarez, Lluc; Gil Gómez, Maria Luisa; Navarro Mas, Nacho
    IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2009)
    p. 43-52
    Presentation's date: 2009-04-27
    Presentation of work at congresses

    The Cell Broadband Engine Architecture (CBEA) is a heterogeneous multiprocessor architecture developed by Sony, Toshiba and IBM. The major implementation of this architecture is the Cell Broadband Engine (Cell for short), a processor that contains one generic PowerPC core and eight accelerators. The Cell is targeted at high-performance computing systems and consumer-level devices that have high computational requirements. The workloads for the former are generally run in a queue-based environment, while those for the latter are multiprogrammed. Applications for the Cell are composed of multiple parallel tasks: one runs on the PowerPC core and one or more run on the accelerators. The operating system (OS) is in charge of scheduling these tasks on top of the physical processors, and such scheduling decisions become critical in multiprogrammed environments. System developers need a way to analyze how user applications behave in these conditions to be able to tune the OS internal algorithms. This article presents Cetra, a new tool-set that allows system developers to study how Cell workloads interact with Linux, the OS kernel. First, we outline the major features of Cetra and provide a detailed description of its internals. Then, we demonstrate the usefulness of Cetra by presenting a case study that shows the features of the tool-set and allows us to compare the results to those provided by other performance analysis tools available in the market. Finally, we describe another case study in which we discovered a scheduling starvation bug using Cetra.

  • Predictive runtime code scheduling for heterogeneous architectures  Open access

     Jiménez, Víctor; Vilanova, Lluis; Gelado Fernandez, Isaac; Gil Gómez, Maria Luisa; Fursin, Gregori; Navarro Mas, Nacho
    International Conference on High Performance and Embedded Architectures and Compilers
    p. 19-33
    Presentation's date: 2009-01-25
    Presentation of work at congresses

    Heterogeneous architectures are currently widespread. With the advent of easy-to-program general purpose GPUs, virtually every recent desktop computer is a heterogeneous system. Combining the CPU and the GPU brings great amounts of processing power. However, such architectures are often used in a restricted way for domain-specific applications like scientific applications and games, and they tend to be used by a single application at a time. We envision future heterogeneous computing systems where all their heterogeneous resources are continuously utilized by different applications with versioned critical parts, to be able to better adapt their behavior and improve execution time, power consumption, response time and other constraints at runtime. Under such a model, adaptive scheduling becomes a critical component. In this paper, we propose a novel predictive user-level scheduler based on past performance history for heterogeneous systems. We developed several scheduling policies and present a study of their impact on system performance. We demonstrate that such a scheduler allows multiple applications to fully utilize all available processing resources in CPU/GPU-like systems and consistently achieve speedups ranging from 30% to 40% compared to just using the GPU in a single application mode.
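
    A history-based policy of the kind studied in the paper can be sketched briefly: keep a per-task, per-device runtime estimate and dispatch each task to the device with the lowest prediction. The decay factor, two-device setup and function names below are assumptions for illustration, not the scheduler's actual implementation.

    #include <stdio.h>

    /* Per-(task,device) exponential moving average of observed runtimes. */
    #define NDEV   2          /* 0 = CPU, 1 = GPU (assumed setup)    */
    #define NTASKS 4
    #define ALPHA  0.5        /* history decay factor (assumed)      */

    static double est[NTASKS][NDEV];  /* predicted runtime; 0 = no history */

    static int pick_device(int task) {
        int best = 0;
        for (int d = 1; d < NDEV; d++) {
            if (est[task][d] == 0.0) return d;     /* explore unseen device */
            if (est[task][best] == 0.0) break;     /* device 0 still unseen */
            if (est[task][d] < est[task][best]) best = d;
        }
        return best;
    }

    static void record(int task, int dev, double seconds) {
        est[task][dev] = (est[task][dev] == 0.0)
                       ? seconds                   /* first observation     */
                       : ALPHA * seconds + (1.0 - ALPHA) * est[task][dev];
    }

    int main(void) {
        record(0, 0, 2.0);  record(0, 1, 0.5);     /* fake timing history   */
        printf("task 0 -> device %d\n", pick_device(0));   /* GPU (1)       */
        return 0;
    }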

  • Hardware support for explicit communications in scalable CMP's

     Villavieja Prados, Carlos; Katevenis, Manolis; Navarro Mas, Nacho; Pnevmatikatos, Dionisios; Ramirez Bellido, Alejandro; Kavadias, Stamatis; Papaefstathiou, Vassilis; Nikolopoulos, Dimitrios S.
    Date: 2009-01
    Report

  • H.264/AVC DECODER PARALLELIZATION IN CONTEXT OF CABAC ENTROPY DECODER

     Shafiq, Muhammad; Alvarez Mesa, Mauricio; Gil Gómez, Maria Luisa; Navarro Mas, Nacho
    Date: 2008-07
    Report

  • On-Chip memories, the OS perspective  Open access

     Villavieja Prados, Carlos; Gelado Fernandez, Isaac; Ramirez Bellido, Alejandro; Navarro Mas, Nacho
    HiPEAC Industrial Workshop
    p. 1-2
    Presentation's date: 2008-06-04
    Presentation of work at congresses

    This paper is a work-in-progress study of the operating system services required to manage on-chip memories. We are evaluating different CMP on-chip memory configurations. Chip-MultiProcessor (CMP) architectures integrating multiple computing and memory elements present different problems (coherency, latency, ...) that must be solved. On-chip local memories are directly addressable, and their latency is much shorter than that of off-chip main memories. Since memory latency is a key factor for application performance, we study how the OS can help.

    Postprint (author’s final draft)

  • Resource Management in Virtualized Execution Environments: New Opportunities for Application Specific Decisions

     Becerra Fontal, Yolanda; Cortes Rossello, Antonio; Garcia Almiñana, Jordi; Navarro Mas, Nacho
    Date: 2008-06
    Report

  • L'assignatura Estructura i Disseny de Sistemes Operatius

     Cortés, Toni; Garcia Vidal, Jorge; Garcia Almiñana, Jordi; Navarro Mas, Nacho
    Jornades de Docència del Departament d'Arquitectura de Computadors. 10 Anys de Jornades
    p. 1-10
    Presentation of work at congresses

  • Adaptive Optimization Methods for Heterogeneous Architectures

     Jiménez, Víctor; Fursin, Grigori; Gil Gómez, Maria Luisa; Martorell Bofill, Xavier; Navarro Mas, Nacho
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    p. 255-258
    Presentation of work at congresses

  • Synergy between Compiler Optimizations and Partitioning on the Cell processor

     Bertran Monfort, Ramon; Cavazos, John; Gil Gómez, Maria Luisa; Navarro Mas, Nacho; O'Boyle, Mike
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    p. 287-290
    Presentation of work at congresses

  • Estat actual de la RPE de la FIB al DAC

     Sole Pareta, Josep; Martorell Bofill, Xavier; Navarro Mas, Nacho
    Jornades de Docència del Departament d'Arquitectura de Computadors. 10 Anys de Jornades
    p. 1-10
    Presentation of work at congresses

  • Code Execution Runtime Support for Heterogeneous Platforms

     Hidalgo, Zoraida; Merino, Julio; Gil Gómez, Maria Luisa; Navarro Mas, Nacho
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    p. 263-266
    Presentation of work at congresses

  • APDA: La asignatura del DAC en el master de la ETSETB

     Navarro Moldes, Leandro; Navarro Mas, Nacho
    Jornades de Docència del Departament d'Arquitectura de Computadors. 10 Anys de Jornades
    p. 1-10
    Presentation of work at congresses

  • RPE de la FIB: assignatures d'Arquitectura de Computadors

     Navarro Mas, Nacho
    Jornades de Docència del Departament d'Arquitectura de Computadors. 10 Anys de Jornades
    p. 1-10
    Presentation of work at congresses

  • Support for Dynamically Adaptable Heterogeneous Applications

     Vilanova, Lluis; Navarro Mas, Nacho
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    p. 267-270
    Presentation of work at congresses

  • Evolución de la asignatura Proyecto de Redes de Computadores y Sistemas Operativos

     Martorell Bofill, Xavier; Navarro Moldes, Leandro; Navarro Mas, Nacho; Salavert Casamor, Antonio
    Jornades de Docència del Departament d'Arquitectura de Computadors. 10 Anys de Jornades
    p. 1-10
    Presentation of work at congresses

  • MPEG-4 Port to the Cell Processor

     Álvarez Casado, Enrique; Martorell Bofill, Xavier; Gil Gómez, Maria Luisa; Navarro Mas, Nacho
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    p. 271-274
    Presentation of work at congresses

  • Comparació de temaris d'ASO

     Gil Gómez, Maria Luisa; Fernandez Barta, Montserrat; Navarro Mas, Nacho
    Jornades de Docència del Departament d'Arquitectura de Computadors. 10 Anys de Jornades
    p. 1-10
    Presentation of work at congresses
