Scientific and technological production

  • Models de Programació i Entorns d'eXecució PARal·lels (Parallel Programming Models and Execution Environments)

     Becerra Fontal, Yolanda; Carrera Perez, David; Corbalan Gonzalez, Julita; Cortes Rossello, Antonio; Costa Prats, Juan Jose; Farreras Esclusa, Montserrat; Gil Gómez, Maria Luisa; Gonzalez Tallada, Marc; Guitart Fernández, Jordi; Herrero Zaragoza, José Ramón; Labarta Mancho, Jesus Jose; Martorell Bofill, Xavier; Navarro Mas, Nacho; Nin Guerrero, Jordi; Torres Viñals, Jordi; Tous Liesa, Ruben; Utrera Iglesias, Gladys Miriam; Ayguade Parra, Eduard
    Competitive project

  • Systematic energy characterization of CMP/SMT processor systems via automated micro-benchmarks (open access)

     Bertran Monfort, Ramon; Buyuktosunoglu, Alper; Gupta, Meeta S.; Gonzalez Tallada, Marc; Bose, Pradip
    IEEE/ACM International Symposium on Microarchitecture
    p. 199-211
    Presentation's date: 2012-12-01
    Presentation of work at congresses

    Microprocessor-based systems today are composed of multi-core, multi-threaded processors with complex cache hierarchies and gigabytes of main memory. Accurately characterizing such a system, whether through predictive pre-silicon modeling and/or diagnostic, measurement-based post-silicon analysis, is increasingly cumbersome and error-prone. This is especially true of energy-related characterization studies. In this paper, we take the position that automated micro-benchmarks generated with particular objectives in mind hold the key to obtaining accurate energy-related characterization. We first present a flexible micro-benchmark generation framework (MicroProbe) that is used to probe complex multi-core/multi-threaded systems with a variety and range of energy-related queries in mind. We then present experimental results on an IBM POWER7 CMP/SMT system to demonstrate how the systematically generated micro-benchmarks can be used to answer three specific queries: (a) How can application-specific (and, if needed, phase-specific) power consumption be projected with component-wise breakdowns? (b) How can energy-per-instruction (EPI) values be measured for the target machine? (c) How can the worst-case (maximum) power consumption be bounded in order to determine safe but practical (i.e., affordable) packaging or cooling solutions? The solution approaches to these problems are all new. Hardware-measurement-based analysis shows superior power projection accuracy (error margins below 2.3% across SPEC CPU2006) as well as max-power stressing capability (a 10.7% increase in processor power over the worst-case power observed during the execution of SPEC CPU2006 applications).

    Postprint (author’s final draft)
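    The energy-per-instruction (EPI) query in the abstract above can be illustrated with a short calculation. The sketch below assumes a simple definition that is not taken from the paper: EPI is the dynamic energy of a micro-benchmark run (measured power minus idle power, multiplied by run time) divided by the number of instructions retired. The function name and all input numbers are hypothetical.

```python
def energy_per_instruction(power_w: float, idle_power_w: float,
                           instructions: float, seconds: float) -> float:
    """Dynamic energy per instruction, in joules.

    Assumed model (illustrative only): dynamic energy = (P_run - P_idle) * t,
    attributed evenly to every instruction retired during the run.
    """
    if instructions <= 0 or seconds <= 0:
        raise ValueError("instructions and seconds must be positive")
    dynamic_energy_j = (power_w - idle_power_w) * seconds
    return dynamic_energy_j / instructions


if __name__ == "__main__":
    # Hypothetical measurements for two micro-benchmarks, each stressing a
    # single instruction type: (run power W, idle power W, instructions, s).
    runs = {
        "int_add": (150.0, 100.0, 1e11, 10.0),
        "fp_mul": (160.0, 100.0, 8e10, 10.0),
    }
    for name, (p, p_idle, n, t) in runs.items():
        epi_nj = energy_per_instruction(p, p_idle, n, t) * 1e9
        print(f"{name}: {epi_nj:.2f} nJ/instruction")
```

    Subtracting idle power isolates the energy attributable to the instruction stream; a real methodology would also have to account for effects such as frequency, temperature, and contention, which this sketch ignores.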

  • Hardware-software coherence protocol for the coexistence of caches and local memories

     Alvarez, Lluc; Vilanova, Lluis; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference for High Performance Computing, Networking, Storage and Analysis
    Article No. 89
    DOI: 10.1109/TC.2013.194
    Presentation's date: 2012-11-07
    Presentation of work at congresses

  • Software Caching Techniques and Hardware Optimizations for On-Chip Local Memories (open access)

     Vujic, Nikola
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Although caches remain the most viable L1 memories in processors, on-chip local memories have recently received considerable attention. Local memories are an interesting design option because of their many benefits: smaller area, lower energy consumption, and fast, constant access time. These benefits are especially attractive for the design of modern multicore processors, since power and latency are key constraints in computer architecture today. In addition, local memories do not generate coherence traffic, which is important for the scalability of multicore systems. Unfortunately, local memories have not yet been widely adopted in modern processors, mainly because of their poor programmability. Systems with on-chip local memories lack hardware support for transparent data transfers between local and global memories, so ease of programming is one of the main impediments to their broad acceptance. This thesis addresses software and hardware optimizations for the programmability and usage of on-chip local memories in both single-core and multicore systems. The software optimizations concern software caching techniques. A software cache is a robust approach to providing the user with a transparent view of the memory architecture, but this software approach can suffer from poor performance. In this thesis, we start by optimizing the traditional software cache, proposing a hierarchical, hybrid software-cache architecture. We then develop several optimizations to speed up our hybrid software cache as much as possible. As a result of these software optimizations, our hybrid software cache runs 4 to 10 times faster than a traditional software cache on a set of NAS Parallel Benchmarks.
    We also go beyond software caching and cover other aspects of architectures with on-chip local memories, such as the quality of the generated code and its relation to the quality of buffer management in local memories, in order to improve the performance of these architectures. Having pushed software optimizations to their limit, we then propose optimizations at the hardware level. Two hardware proposals are presented in this thesis: one relaxes the alignment constraints imposed by architectures with on-chip local memories, and the other accelerates local-memory management by providing hardware support for most of the actions performed by our software cache.

    Although caches are still the basic building block of the memory subsystem, local memories have become an alternative thanks to their characteristics in terms of area occupancy, energy consumption, and performance, with a fast and constant access time. These characteristics are especially relevant because upcoming multicore architectures are limited by power consumption and by the latency of the memory subsystem. Local memories, however, are difficult to program, which hinders their introduction in multicore architectures despite the advantages mentioned above. This thesis presents a set of software- and hardware-based solutions specifically designed to overcome these limitations. The software optimizations are based on caching techniques supported by specific libraries. A software cache is a robust way to give the user a transparent view of the architecture, but this approach can suffer from poor performance. This thesis proposes a hierarchical, hybrid structure, and then develops optimizations to speed up the execution of the software that implements the cache design. As a result of these optimizations, our hybrid design runs 4 to 10 times faster than a traditional software cache implementation on a set of reference applications, namely the NAS Parallel Benchmarks. The thesis also covers other aspects of architectures with local memories, such as the quality of the generated code and its relation to the quality of buffer management in local memories, with the aim of improving the performance of these architectures.
    The thesis then develops proposals based strictly on new hardware designs to improve the performance of local memories once no further software optimizations are possible. In particular, it presents two hardware proposals: one relaxes the data-alignment restrictions imposed by local memories, and the other introduces dedicated hardware to accelerate the most common operations on local memories.
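    The thesis abstract relies on the idea of a software cache that hides explicit transfers between global memory and the local store behind a lookup. A minimal sketch follows, assuming a direct-mapped, write-back organization; Python list slices stand in for DMA transfers, and the class name and parameters are hypothetical. It is a generic software cache, not the hierarchical hybrid design proposed in the thesis.

```python
class SoftwareCache:
    """Direct-mapped, write-back software cache over a 'global memory' list.

    Illustrative only: list slices stand in for DMA transfers between global
    memory and the local store, and len(global_memory) is assumed to be a
    multiple of line_size.
    """

    def __init__(self, global_memory, num_lines=4, line_size=8):
        self.mem = global_memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines            # global line number per slot
        self.lines = [[0] * line_size for _ in range(num_lines)]  # local copies
        self.dirty = [False] * num_lines
        self.hits = 0
        self.misses = 0

    def _lookup(self, addr):
        """Return (slot, offset), filling the slot from global memory on a miss."""
        line_no, offset = divmod(addr, self.line_size)
        slot = line_no % self.num_lines
        if self.tags[slot] == line_no:
            self.hits += 1
        else:
            self.misses += 1
            self._evict(slot)
            base = line_no * self.line_size
            self.lines[slot] = self.mem[base:base + self.line_size]  # "DMA get"
            self.tags[slot] = line_no
        return slot, offset

    def _evict(self, slot):
        """Write a dirty line back to global memory ("DMA put")."""
        if self.dirty[slot]:
            base = self.tags[slot] * self.line_size
            self.mem[base:base + self.line_size] = self.lines[slot]
            self.dirty[slot] = False

    def read(self, addr):
        slot, offset = self._lookup(addr)
        return self.lines[slot][offset]

    def write(self, addr, value):
        slot, offset = self._lookup(addr)
        self.lines[slot][offset] = value
        self.dirty[slot] = True

    def flush(self):
        """Write every dirty line back to global memory."""
        for slot in range(self.num_lines):
            self._evict(slot)


if __name__ == "__main__":
    memory = list(range(64))
    cache = SoftwareCache(memory)
    total = sum(cache.read(a) for a in range(64))
    print(total, cache.hits, cache.misses)
```

    Every access pays for the tag check in `_lookup`, even on a hit; that per-access overhead is what makes a naive software cache slow and what optimized designs try to remove from inner loops.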

  • DMA-circular: an enhanced high level programmable DMA controller for optimized management of on-chip local memories

     Vujic, Nikola; Alvarez, Lluc; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    ACM International Conference on Computing Frontiers
    p. 113-122
    DOI: 10.1145/2212908.2212925
    Presentation's date: 2012-05-15
    Presentation of work at congresses

  • POTRA: a framework for building power models for next generation multicore architectures

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM SIGMETRICS/PERFORMANCE joint International Conference on Measurement and Modeling of Computer Systems
    p. 427-428
    DOI: 10.1145/2254756.2254827
    Presentation's date: 2012
    Presentation of work at congresses

  • Design space exploration for aggressive core replication schemes in CMPs

     Álvarez Martí, Lluc; Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Symposium on High Performance Distributed Computing
    p. 269-270
    DOI: 10.1145/1996130.1996169
    Presentation's date: 2011-06-08
    Presentation of work at congresses

  • Optimizing the exploitation of multicore processors and GPUs with OpenMP and OpenCL

     Ferrer, Roger; Planas Carbonell, Judit; Bellens, Pieter; Duran Gonzalez, Alejandro; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    International Workshop on Languages and Compilers for Parallel Computing
    p. 215-229
    DOI: 10.1007/978-3-642-19595-2_15
    Presentation's date: 2010-10
    Presentation of work at congresses

    In this paper, we present OmpSs, a programming model based on OpenMP and StarSs that can also incorporate OpenCL or CUDA kernels. We evaluate the proposal on three different architectures (SMP, Cell/B.E., and GPUs), showing the broad applicability of the approach. The evaluation uses four benchmarks: Matrix Multiply, BlackScholes, Perlin Noise, and Julia Set. We compare the results with those of the same benchmarks written in OpenCL and run on the same architectures. The results show that OmpSs greatly outperforms the OpenCL environment, is more flexible in exploiting multiple accelerators, and, thanks to the simplicity of its annotations, increases programmer productivity.

  • Accurate energy accounting for shared virtualized environments using PMC-based power modeling techniques

     Bertran Monfort, Ramon; Becerra Fontal, Yolanda; Carrera Perez, David; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Torres Viñals, Jordi; Ayguade Parra, Eduard
    ACM/IEEE International Conference on Grid Computing
    p. 1-8
    DOI: 10.1109/GRID.2010.5697889
    Presentation's date: 2010-10-27
    Presentation of work at congresses

    Virtualized infrastructure providers demand new methods to increase the accuracy of the accounting models used to charge their customers. Future data centers will be composed of many-core systems that will each host a large number of virtual machines (VMs). While resource utilization accounting can be achieved with existing system tools, energy accounting is a complex task when per-VM granularity is the goal. In this paper, we propose a methodology that brings new opportunities to energy accounting by adding an unprecedented degree of accuracy to per-VM measurements. We present a system that leverages CPU and memory power models based on performance monitoring counters (PMCs) to perform energy accounting in virtualized systems. The contribution of this paper is twofold. First, we show that PMC-based power modeling methods remain valid in virtualized environments. Second, we introduce a novel methodology for accounting energy consumption in virtualized systems. Overall, the results for an Intel® Core™ 2 Duo show energy-estimation errors below 5%. Such an approach brings flexibility to the chargeback models used by service and infrastructure providers. For instance, we show that VMs executed for the same amount of time can differ by more than 20% in energy consumption, even when only CPU and memory consumption is taken into account.
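    PMC-based energy accounting can be pictured as a linear model in which each hardware event carries an energy cost and each VM is charged for the events counted while it was scheduled. The sketch below rests on assumptions not stated in the abstract: the event names, the per-event costs, and the purely linear form are hypothetical, and static (idle) power is left out rather than split among VMs.

```python
from collections import defaultdict

# Hypothetical per-event energy costs in joules. A real model would obtain
# these by fitting against measured power; these values are placeholders.
EVENT_ENERGY_J = {
    "instructions": 0.5e-9,
    "llc_misses": 20e-9,
}


def per_vm_energy(samples, event_energy=EVENT_ENERGY_J):
    """Attribute dynamic energy to VMs from per-interval PMC counts.

    samples: iterable of (vm_id, {event_name: count}) tuples, one per
    scheduling interval in which that VM ran. With an assumed linear power
    model P = sum_i w_i * rate_i, the energy of an interval of length dt is
    sum_i w_i * rate_i * dt = sum_i w_i * count_i, so dt cancels out.
    """
    energy = defaultdict(float)
    for vm_id, counts in samples:
        energy[vm_id] += sum(event_energy[e] * n for e, n in counts.items())
    return dict(energy)


if __name__ == "__main__":
    # Two VMs with made-up counter readings for one interval each.
    samples = [
        ("vm1", {"instructions": 2e9, "llc_misses": 1e7}),
        ("vm2", {"instructions": 1e9, "llc_misses": 5e7}),
    ]
    print(per_vm_energy(samples))
```

    In this toy example the two VMs receive different charges even if they ran for equally long intervals, because the memory-heavy VM triggers more costly events; this mirrors, in spirit only, the abstract's observation that equal run time does not imply equal energy.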

  • Decomposable and responsive power models for multicore processors using performance counters

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference on Supercomputing
    p. 147-158
    DOI: 10.1145/1810085.1810108
    Presentation's date: 2010-06-04
    Presentation of work at congresses

  • Towards accurate accounting of energy consumption in shared virtualized environments

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Becerra Fontal, Yolanda; Carrera Perez, David; Torres Viñals, Jordi; Ayguade Parra, Eduard
    International Conference on Energy-Efficient Computing and Networking
    Presentation's date: 2010-04
    Presentation of work at congresses

    Virtualized infrastructure providers demand new methods to increase the accuracy of the accounting models used to charge their customers. Future data centers will be composed of many-core systems that will each host a large number of VMs. While resource utilization accounting can be achieved with existing system tools, power metering is a complex task when per-VM granularity is the goal. In this paper we propose a novel methodology that brings new opportunities to power consumption accounting by adding an unprecedented degree of accuracy to per-VM measurements. We present a system prototype that leverages power models based on performance monitoring counters (PMCs) for energy accounting in virtualized systems. We validate the power modeling methodology in virtualized systems by comparing the power predictions in virtualized and non-virtualized systems. The validation was performed as a case study on the Intel® Core™ 2 Duo architecture and follows two steps: first we validate the power model for one core, and second, for the entire processor. The resulting model can account for CPU and memory power consumption at the process level. The main contribution of this paper is a novel methodology that allows accurate accounting of energy consumption in virtualized systems. Accounting is done on a per-VM basis, even when multiple VMs are deployed on the same physical hardware, overcoming the limitations of per-server aggregated power metering. Such an approach can bring an unprecedented level of flexibility to the char

  • HiPEAC Paper Award

     Vujic, Nikola; Gonzalez Tallada, Marc; Ramirez Bellido, Alejandro; Cabarcas Jaramillo, Felipe; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    Award or recognition

  • Analysis of task offloading for accelerators

     Ferrer, Roger; Beltran, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Conference on High Performance Embedded Architectures & Compilers
    p. 322-336
    DOI: 10.1007/978-3-642-11515-8_24
    Presentation's date: 2010-01
    Presentation of work at congresses

    As an answer to the forthcoming heterogeneous multicore and accelerator-based architectures, we have proposed syntactic extensions to C in the form of OpenMP-based pragmas that make it easier for programmers to offload parts of their applications to the auxiliary processors. Offloaded tasks can be made more profitable using a simple blocking strategy, and the runtime system better supports overlapping computation and communication while moving data to and from the accelerators. To prove the feasibility and usefulness of our proposal, we have targeted the IBM Cell architecture. The performance of the whole system has been evaluated using HPCC STREAM Triad and several NAS benchmarks. We present their evaluation and a detailed performance breakdown at the level of parallel regions. We also classify the parallel regions according to their suitability for exploitation on accelerators. Overall, our performance is better than that obtained with the IBM compiler for the Cell processor.
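    Blocking is what makes computation/communication overlap possible: while the accelerator computes on one block held in local memory, the DMA engine can fetch the next block into a second buffer. The toy timing model below illustrates this with two local buffers, one DMA engine, and one compute unit. It is a generic double-buffering sketch, not the runtime described in the paper; it ignores transfers of results back to main memory, and the function names and time values are hypothetical.

```python
def sequential_time(n_blocks: int, t_dma: float, t_comp: float) -> float:
    """Fetch a block, compute on it, then fetch the next: no overlap."""
    return n_blocks * (t_dma + t_comp)


def double_buffered_time(n_blocks: int, t_dma: float, t_comp: float) -> float:
    """Event-level model with one DMA engine, one compute unit and two buffers."""
    dma_free = 0.0          # when the DMA engine becomes idle
    comp_free = 0.0         # when the compute unit becomes idle
    buf_free = [0.0, 0.0]   # when each local buffer may be overwritten
    for i in range(n_blocks):
        b = i % 2
        # A fetch needs an idle DMA engine and a buffer whose previous
        # block has already been consumed by the compute unit.
        dma_end = max(dma_free, buf_free[b]) + t_dma
        dma_free = dma_end
        # Computation needs an idle compute unit and this block's data.
        comp_end = max(comp_free, dma_end) + t_comp
        comp_free = comp_end
        buf_free[b] = comp_end
    return comp_free


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, sequential_time(n, 2.0, 3.0), double_buffered_time(n, 2.0, 3.0))
```

    With 4 blocks that each need 2 time units of transfer and 3 of computation, the model gives 20 units without overlap and 14 with double buffering; as the number of blocks grows, the total approaches the cost of the slower stage alone.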

  • DMA++: on the fly data realignment for on-chip memories

     Vujic, Nikola; Gonzalez Tallada, Marc; Cabarcas Jaramillo, Felipe; Ramirez Bellido, Alejandro; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Symposium on High-Performance Computer Architecture (HPCA)
    p. 1-12
    DOI: 10.1109/HPCA.2010.5463057
    Presentation's date: 2010
    Presentation of work at congresses

  • Adaptive and speculative memory consistency support for multi-core architectures with on-chip local memories

     Vujic, Nikola; Álvarez, Lluc; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Workshop on Languages and Compilers for Parallel Computing
    p. 218-232
    DOI: 10.1007/978-3-642-13374-9_15
    Presentation's date: 2009-10
    Presentation of work at congresses

  • Achieving high memory performance from heterogeneous architectures with the SARC programming model

     Ferrer, Roger; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    Workshop on Memory Performance: dealing with Applications, Systems and Architecture
    p. 15-21
    DOI: 10.1145/1621960.1621963
    Presentation's date: 2009-09
    Presentation of work at congresses

  • MPEXPAR: Models de Programació i Entorns d'Execució Paral·lels (Parallel Programming Models and Execution Environments)

     Gonzalez Tallada, Marc; Labarta Mancho, Jesus Jose; Tejedor Saavedra, Enric; Alonso López, Javier; Farreras Esclusa, Montserrat; Costa Prats, Juan Jose; Corbalan Gonzalez, Julita; Cortes Rossello, Antonio; Becerra Fontal, Yolanda; Badia Sala, Rosa Maria; Torres Viñals, Jordi; Herrero Zaragoza, José Ramón; Martorell Bofill, Xavier; Carrera Perez, David; Guitart Fernández, Jordi; Sirvent Pardell, Raül; Navarro Mas, Nacho; Gil Gómez, Maria Luisa; Nou Castell, Ramon; Ayguade Parra, Eduard
    Competitive project

  • Speeding up distributed MapReduce applications using hardware accelerators (open access)

     Becerra Fontal, Yolanda; Beltran Querol, Vicenç; Carrera Perez, David; Gonzalez Tallada, Marc; Torres Viñals, Jordi; Ayguade Parra, Eduard
    International Conference on Parallel Processing
    p. 42-49
    DOI: 10.1109/ICPP.2009.59
    Presentation's date: 2009-09-22
    Presentation of work at congresses

    In an attempt to increase the performance/cost ratio, large compute clusters are becoming heterogeneous at multiple levels: from asymmetric processors to different system architectures, operating systems, and networks. Exploiting the intrinsic multi-level parallelism present in such a complex execution environment has become a challenging task with traditional parallel and distributed programming models. As a result, there is an increasing need for novel approaches to exploiting parallelism in these environments. MapReduce is a data-driven programming model originally proposed by Google in 2004 as a flexible alternative to existing models, specifically designed to hide the complexity of both developing and running massively distributed applications on large compute clusters. Some recent works have also used the MapReduce model to exploit parallelism in non-distributed environments, such as multi-cores, heterogeneous processors, and GPUs. In this paper we introduce a novel approach for exploiting the heterogeneity of a Cell BE cluster by linking an existing MapReduce runtime for distributed clusters with a runtime that exploits the parallelism of the Cell BE nodes. The novel contribution of this work is the design and evaluation of a MapReduce execution environment that effectively exploits the parallelism existing at both the Cell BE cluster level and the heterogeneous processor level.
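    The two levels of parallelism described above can be pictured as a MapReduce job whose input is split first across cluster nodes and then, inside each node, across the accelerator cores. The word-count sketch below only mimics that structure sequentially in Python; the function names and the round-robin partitioning are hypothetical and say nothing about how the two runtimes in the paper are actually linked.

```python
from collections import Counter
from functools import reduce


def map_words(records):
    """Map phase for one worker: count the words in its records."""
    counts = Counter()
    for line in records:
        counts.update(line.split())
    return counts


def merge(a, b):
    """Reduce phase: combine two partial word counts."""
    return a + b


def node_mapreduce(records, workers):
    """Inner level: split a node's records among its accelerator workers."""
    parts = [records[i::workers] for i in range(workers)]
    return reduce(merge, (map_words(p) for p in parts), Counter())


def cluster_mapreduce(records, nodes, workers_per_node):
    """Outer level: shard the input across nodes, then merge node results."""
    shards = [records[i::nodes] for i in range(nodes)]
    partials = (node_mapreduce(s, workers_per_node) for s in shards)
    return reduce(merge, partials, Counter())


if __name__ == "__main__":
    data = ["a b a", "b c", "a"]
    print(cluster_mapreduce(data, nodes=2, workers_per_node=2))
```

    Because word counting is associative, partial results can be merged at the node level before crossing the network, which is the property that lets the inner, per-node level be handled by a separate runtime.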

  • A proposal to extend the OpenMP tasking model for heterogeneous architectures

     Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Cabrera, Daniel; Duran Gonzalez, Alejandro; Igual, Francisco D.; Jimenez Gonzalez, Daniel; Labarta Mancho, Jesus Jose; Mayo, Rafael; Pérez, Josep M.; Quintana Ortí, Enrique Salvador; Martorell Bofill, Xavier; Gonzalez Tallada, Marc
    International Workshop on OpenMP
    p. 154-167
    DOI: 10.1007/978-3-642-02303-3
    Presentation's date: 2009-06-03
    Presentation of work at congresses

  • Laboratorio de Introducción a los Computadores: funcionamiento y dificultades docentes (Introduction to Computers Laboratory: operation and teaching difficulties)

     Navarro Guerrero, Juan Jose; Cruz Diaz, Josep-llorenç; Faúndez Zanuy, Marcos; Gonzalez Tallada, Marc; Manso Cortes, Oscar; Muntés Mulero, Víctor; Palomar Perez, Oscar; Rodero Castro, Ivan; Sanchez Castaño, Friman; Solé Simó, Marc
    Jornades de Docència del Departament d'Arquitectura de Computadors (Teaching Conference of the Department of Computer Architecture)
    p. 1-20
    Presentation's date: 2009-02
    Presentation of work at congresses

  • Evaluation of memory performance on the Cell BE with the SARC programming model

     Ferrer, Roger; Gonzalez Tallada, Marc; Silla, Federico; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    Workshop on Memory Performance: dealing with Applications, Systems and Architecture
    p. 77-84
    DOI: 10.1145/1509084.1509095
    Presentation's date: 2008-10
    Presentation of work at congresses

    With the advent of multicore architectures, especially heterogeneous ones, peak computational and memory performance are difficult to obtain using traditional programming models. Programmers usually have to reorganize the code and data of their applications completely to maximize resource usage, and must work with the low-level interfaces offered by vendor-provided SDKs to obtain high computational and memory performance. In this paper, we evaluate the SARC programming model on the Cell BE architecture with respect to memory performance. We show how we have annotated the HPCC STREAM and RandomAccess applications, and the memory bandwidth obtained. Results indicate that the programming model provides good productivity and competitive performance on this kind of architecture.
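    STREAM-style results are reported as a bandwidth derived from the bytes the kernel is defined to move, not from bytes observed on the memory bus. A minimal sketch of the Triad kernel and that bookkeeping follows, assuming 8-byte double-precision elements and counting two reads and one write per element, as STREAM does for Triad. Timing a pure-Python loop measures the interpreter rather than memory, so the timing part is only meant to show the arithmetic; the function names are hypothetical.

```python
import time


def triad(a, b, c, q):
    """STREAM Triad kernel: a[i] = b[i] + q * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bandwidth(n_elements: int, seconds: float,
                    bytes_per_element: int = 8) -> float:
    """Bytes per second credited to Triad: 2 reads + 1 write per element."""
    return 3 * bytes_per_element * n_elements / seconds


if __name__ == "__main__":
    n = 100_000
    a, b, c = [0.0] * n, [1.0] * n, [2.0] * n
    start = time.perf_counter()
    triad(a, b, c, 3.0)
    elapsed = time.perf_counter() - start
    print(f"{triad_bandwidth(n, elapsed) / 1e9:.3f} GB/s (interpreter-bound)")
```

    For example, one million elements processed in 0.01 s are credited with 24 MB moved, i.e. 2.4 GB/s.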

  • Hybrid Access-Specific Software Cache Techniques for the Cell BE Architecture

     Gonzalez Tallada, Marc
    Parallel Architectures and Compilation Techniques
    Presentation's date: 2008-10-29
    Presentation of work at congresses

  • Hybrid Access-Specific Software Cache Techniques for the Cell BE Architecture

     Gonzalez Tallada, Marc
    International Conference on Parallel Architectures and Compilation Techniques
    Presentation's date: 2008-10-25
    Presentation of work at congresses

  • Hybrid access-specific software cache techniques for the Cell BE architecture

     Gonzalez Tallada, Marc; Vujic, Nikola; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Eichenberger, Alexandre E.; Chen, Tong; Sura, Zehra; Zhang, Tao; O'Brien, Kevin; O'Brien, Kathryn
    International Conference on Parallel Architectures and Compilation Techniques
    p. 292-302
    DOI: 10.1145/1454115.1454156
    Presentation's date: 2008-10
    Presentation of work at congresses

    Ease of programming is one of the main impediments to the broad acceptance of multi-core systems with no hardware support for transparent data transfer between local and global memories. A software cache is a robust approach to providing the user with a transparent view of the memory architecture, but this software approach can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that classifies memory accesses at compile time into two classes, high-locality and irregular. Our approach then steers each memory reference toward one of two cache structures, each optimized for its respective access pattern. These cache structures are designed to enable high-level compiler optimizations that aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software cache overhead in the innermost loop. Performance evaluation indicates that the optimized software-cache structures, combined with the proposed code optimizations, yield speedups of 3.5x to 8.4x over a traditional software cache approach. As a result, we demonstrate that the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM POWER5 processor for a set of parallel NAS applications.
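    The claim that loop transformations can practically remove the software cache overhead from the innermost loop can be illustrated by counting cache lookups. The sketch below assumes a unit-stride access that starts at index 0 (so every strip begins on a line boundary) and uses a hypothetical `LineCache` class rather than the paper's cache structures. The untransformed loop performs one lookup per reference, while the strip-mined loop performs one lookup per cache line and reads the local copy directly in its inner loop.

```python
class LineCache:
    """Toy stand-in for a high-locality cache structure (hypothetical).

    Maps a line number to a local copy of that line; `lookups` counts how
    often code goes through the cache-check path.
    """

    def __init__(self, mem, line_elems):
        self.mem = mem
        self.line_elems = line_elems
        self.lines = {}
        self.lookups = 0

    def line_for(self, index):
        """Return the local copy of the line holding mem[index]."""
        self.lookups += 1
        line_no = index // self.line_elems
        if line_no not in self.lines:
            base = line_no * self.line_elems
            self.lines[line_no] = self.mem[base:base + self.line_elems]
        return self.lines[line_no]


def sum_naive(cache, n):
    """Untransformed loop: one cache check per reference."""
    total = 0
    for i in range(n):
        total += cache.line_for(i)[i % cache.line_elems]
    return total


def sum_strip_mined(cache, n):
    """Strip-mined loop: one check per line, unchecked inner loop."""
    total = 0
    for start in range(0, n, cache.line_elems):
        line = cache.line_for(start)            # single lookup for this line
        for j in range(min(cache.line_elems, n - start)):
            total += line[j]                    # direct local-store access
    return total


if __name__ == "__main__":
    data = list(range(100))
    naive, strip = LineCache(data, 16), LineCache(data, 16)
    print(sum_naive(naive, 100), naive.lookups)
    print(sum_strip_mined(strip, 100), strip.lookups)
```

    For 100 elements and 16-element lines, both loops compute the same sum, but the naive loop performs 100 lookups and the strip-mined loop performs 7.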

  • OPTIMIZED CODE GENERATION TARGETING A HIGH LOCALITY SOFTWARE CACHE

     Gonzalez Tallada, Marc; Chen, Tong; Eichenberger, Alex; Sura, Zehra; O'Brien, Kathryn; O'Brien, Kevin; Zhang, Tao
    Date of request: 2008-10-02
    Invention patent

  • DYNAMICALLY CONTROLLING A PREFETCHING RANGE OF A SOFTWARE CONTROLLED CACHE

     Gonzalez Tallada, Marc; Chen, Tong; Zhang, Tao; Sura, Zehra
    Date of request: 2008-04-02
    Invention patent

  • PREFETCHING IRREGULAR DATA REFERENCES FOR SOFTWARE CONTROLLED CACHE

     Gonzalez Tallada, Marc; Chen, Tong; Zhang, Tao; Sura, Zehra
    Date of request: 2008-04-02
    Invention patent

  • REDUCING CACHE POLLUTION OF A SOFTWARE CONTROLLED CACHE

     Gonzalez Tallada, Marc
    Date of request: 2008-04-02
    Invention patent

  • EFFICIENT SOFTWARE CACHE ACCESSING WITH HANDLING REUSE

     Gonzalez Tallada, Marc
    Date of request: 2008-04-02
    Invention patent

  • Prefetching Irregular References for Software Cache on Cell

     Gonzalez Tallada, Marc
    International Symposium on Code Generation and Optimization
    Presentation's date: 2008-04-06
    Presentation of work at congresses

  • DATA TRANSFER OPTIMIZED SOFTWARE CACHE FOR IRREGULAR MEMORY REFERENCES

     Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Eichenberger, Alex; Chen, Tong; Sura, Zehra; Zhang, Tao; O'Brien, Kathryn; O'Brien, Kevin
    Date of request: 2008-03-28
    Invention patent

  • DATA TRANSFER OPTIMIZED SOFTWARE CACHE FOR REGULAR MEMORY REFERENCES

     Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Chen, Tong; Eichenberger, Alex; Sura, Zehra; O'Brien, Kathryn; O'Brien, Kevin; Zhang, Tao
    Date of request: 2008-03-28
    Invention patent

  • Prefetching Irregular References for Software Cache on Cell

     Tong, Chen; Zhang, Tao; Zehra, Sura; Gonzalez Tallada, Marc; O'Brien, Kathryn; O'Brien, Kevin
    International Symposium on Code Generation and Optimization
    p. 155-164
    Presentation of work at congresses

  • A Novel Asynchronous Software Cache Implementation for the Cell-Be Processor

     Balart, Jairo; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Zehra, Sura; Tong, Chen; Zhang, Tao; O'Brien, Kevin; O'Brien, Kathryn
    Workshop on Languages and Compilers for Parallel Computing
    p. 125-140
    DOI: 10.1007/978-3-540-85261-2_9
    Presentation's date: 2007-10
    Presentation of work at congresses

  • Improving Data Locality in NAS BT Benchmark

     Vaquero, Jordi; Gonzalez Tallada, Marc; Costa Prats, Juan Jose; Bueno, Javier; Martorell Bofill, Xavier; Cortes Rossello, Antonio; Ayguade Parra, Eduard
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    p. 199-202
    Presentation's date: 2007-07
    Presentation of work at congresses

  • Mercurium C/C++ source-to-source compiler

     Ferrer, Roger; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    p. 239-242
    Presentation of work at congresses

  • A proposal for error handling in OpenMP

     Duran González, Alejandro; Ferrer, Roger; Costa Prats, Juan Jose; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    2nd International Workshop on OpenMP (IWOMP 2006)
    p. 1
    Presentation of work at congresses

  • Techniques Supporting threadprivate in OpenMP

     Martorell Bofill, Xavier; Gonzalez Tallada, Marc; Duran González, Alejandro; Balart, Jairo; Ferrer, Roger; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    Workshop on High-Level Parallel Programming Models and Supportive Environments
    p. 244-251
    Presentation of work at congresses

  • A Proposal for Error Handling in OpenMP

     Duran González, Alejandro; Ferrer, Roger; Costa Prats, Juan Jose; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    International Workshops, IWOMP 2005 and IWOMP 2006. OpenMP Shared Memory Parallel Programming
    p. 422-434
    Presentation of work at congresses

  • Runtime Address Space Computation for SDSM Systems

     Balart, Jairo; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    Languages and Compilers for Parallel Computing
    p. 330-344
    Presentation of work at congresses

  • Experiences Parallelizing a Web Server with OpenMP

     Balart, Jairo; Duran González, Alejandro; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    International Workshops, IWOMP 2005 and IWOMP 2006. OpenMP Shared Memory Parallel Programming
    p. 191-202
    Presentation of work at congresses

  • Runtime Address Space Computation for SDSM Systems

     Balart, Jairo; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    19th International Workshop on Languages and Compilers for Parallel Computing (LCPC'06)
    p. 330-334
    Presentation of work at congresses

  • Automatic Thread Distribution For Nested Parallelism In OpenMP

     Gonzalez Tallada, Marc
    19th ACM International Conference on Supercomputing (ICS'2005)
    Presentation's date: 2005-06-20
    Presentation of work at congresses

  • Automatic Thread Distribution For Nested Parallelism In OpenMP

     Duran González, Alejandro; Gonzalez Tallada, Marc; Corbalan Gonzalez, Julita
    19th ACM International Conference on Supercomputing (ICS'2005)
    p. 121-130
    Presentation of work at congresses

  • Experiences parallelizing a web server with OpenMP

     Duran González, Alejandro; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    Jornadas de Paralelismo
    p. 403-410
    Presentation of work at congresses

  • Skeleton driven transformations for an OpenMP compiler

     Gonzalez Tallada, Marc
    11th International Workshop on Compilers for Parallel Computers
    Presentation's date: 2004-07-07
    Presentation of work at congresses

  • Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications

     Gonzalez Tallada, Marc
    IEEE International Parallel and Distributed Processing Symposium
    Presentation's date: 2004-04-26
    Presentation of work at congresses

  • Nanos Mercurium: a Research Compiler for OpenMP

     Balart, J; Duran González, Alejandro; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    EWOMP 2004 6th European Workshop on OpenMP
    p. 103-109
    Presentation of work at congresses

  • Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications

     Ayguade Parra, Eduard; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Jost, Gabriele
    IEEE International Parallel and Distributed Processing Symposium
    p. 6
    Presentation of work at congresses

  • Skeleton Driven transformations for an OpenMP compiler

     Balart, J; Duran González, Alejandro; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    11th International Workshop on Compilers for Parallel Computers
    p. 123-134
    Presentation of work at congresses
