Scientific and technological production
1 to 50 of 89 results
  • A systematic methodology to generate decomposable and responsive power models for CMPs

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    IEEE transactions on computers
    Date of publication: 2013-07
    Journal article

    Power modeling based on performance monitoring counters (PMCs) has attracted the interest of researchers because it offers a quick way to understand the power behavior of real systems. Consequently, several power-aware policies use models to guide their decisions. Hence, power models that are informative, accurate, and capable of detecting power phases are critical to the success of power-saving techniques. Additionally, processor design has changed considerably with the appearance of CMPs (multiple cores sharing resources). Thus, PMC-based power models warrant further investigation on current energy-efficient multicore processors. In this paper, we present a systematic methodology to produce decomposable PMC-based power models on current multicore architectures. Apart from estimating power consumption accurately, the models provide per-component power consumption, supplying extra insight into power behavior. Moreover, we study their responsiveness, that is, their capacity to detect power phases. Specifically, we produce power models for an Intel Core 2 Duo with one and two cores enabled for all the DVFS configurations. The models are empirically validated using the SPECcpu2006, NAS and LMBENCH benchmarks. Finally, we compare the models against existing approaches, concluding that the proposed methodology produces more accurate, responsive, and informative models.
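
    As a rough illustration of what "decomposable" and "responsive" mean in the abstract above, the sketch below estimates total power as a static term plus one weighted term per component, and flags power phases as large jumps between consecutive estimates. The component names, weights, and static power are hypothetical placeholders, not values or formulas taken from the paper.

```python
# Hypothetical per-component weights (watts at full activity); illustrative only.
WEIGHTS = {"frontend": 4.0, "int_units": 3.0, "fp_units": 5.0, "l2_cache": 2.0}
STATIC_POWER_W = 10.0  # assumed constant power drawn regardless of activity


def estimate_power(activity):
    """Return (total_watts, per_component_breakdown).

    `activity` maps a component name to an activity ratio in [0, 1],
    assumed to be derived from performance monitoring counters.
    """
    breakdown = {name: w * activity.get(name, 0.0) for name, w in WEIGHTS.items()}
    breakdown["static"] = STATIC_POWER_W
    return sum(breakdown.values()), breakdown


def detect_phases(samples, threshold_w):
    """Return indices where estimated power jumps by more than threshold_w."""
    totals = [estimate_power(s)[0] for s in samples]
    return [i for i in range(1, len(totals))
            if abs(totals[i] - totals[i - 1]) > threshold_w]
```

    The per-component breakdown is what makes such a model decomposable; how the weights are actually obtained is the subject of the paper and is not reproduced here.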

  • Performance and Power Optimizations in Chip Multiprocessors for Throughput-Aware Computation  Open access

     Vega, Augusto Javier
    Defense's date: 2013-07-30
    Universitat Politècnica de Catalunya
    Theses

    The so-called "power (or power density) wall" has caused growth in core frequency (and single-thread performance) to slow down, giving rise to the era of multi-core/multi-thread processors. For example, the IBM POWER4 processor, released in 2001, incorporated two single-thread cores into the same chip. In 2010, IBM released the POWER7 processor with eight 4-thread cores on the same chip, for a total capacity of 32 execution contexts. The ever-increasing number of cores and threads gives rise to new opportunities and challenges for software and hardware architects. At the software level, applications can benefit from the abundant number of execution contexts to boost throughput. But this challenges programmers to create highly parallel applications and operating systems capable of scheduling them correctly. At the hardware level, the increasing core and thread count puts pressure on the memory interface, because memory bandwidth grows at a slower pace, a phenomenon known as the "bandwidth (or memory) wall". In addition to memory bandwidth issues, chip power consumption rises because of manufacturers' difficulty in lowering operating voltages sufficiently with every processor generation. This thesis presents innovations to improve bandwidth and power consumption in chip multiprocessors (CMPs) for throughput-aware computation: a bandwidth-optimized last-level cache (LLC), a bandwidth-optimized vector register file, and a power/performance-aware thread placement heuristic. In contrast to state-of-the-art LLC designs, our organization avoids data replication and, hence, does not require keeping data coherent. Instead, the address space is statically distributed all over the LLC (in a fine-grained interleaving fashion). The absence of data replication increases the effective cache capacity, which results in better hit rates and higher bandwidth compared to a coherent LLC. We use double buffering to hide the extra access latency due to the lack of data replication.
The proposed vector register file is composed of thousands of registers and organized as an aggregation of banks. We leverage this organization to attach small special-function "local computation elements" (LCEs) to each bank. This approach, referred to as the "processor-in-regfile" (PIR) strategy, overcomes the limited number of register file ports. Because each LCE is a SIMD computation element and all of them can proceed concurrently, the PIR strategy constitutes a highly parallel super-wide-SIMD device (ideal for throughput-aware computation). Finally, we present a heuristic to reduce chip power consumption by dynamically placing software (application) threads across hardware (physical) threads. The heuristic gathers chip-level power and performance information at runtime to infer characteristics of the applications being executed. For example, if an application's threads share data, the heuristic may decide to place them on fewer cores to favor inter-thread data sharing and communication. In such a case, the number of active cores decreases, which is a good opportunity to switch off the unused cores to save power. It is increasingly hard to find bulletproof (micro-)architectural solutions for the bandwidth and power scalability limitations in CMPs. Consequently, we think that architects should attack those problems from different flanks simultaneously, with complementary innovations. This thesis contributes a battery of solutions to alleviate those problems in the context of throughput-aware computation: 1) a bandwidth-optimized LLC; 2) a bandwidth-optimized register file organization; and 3) a simple technique to improve power-performance efficiency.
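
    The static, fine-grained distribution of the address space over the LLC described above can be pictured with a small sketch. It assumes a simple modulo interleaving of cache lines across banks; the line size, bank count, and function name are hypothetical and are not taken from the thesis.

```python
LINE_SIZE = 64   # bytes per cache line (assumed)
NUM_BANKS = 8    # number of LLC banks (assumed)


def home_bank(address):
    """Map a byte address to the single LLC bank that may hold its line.

    Consecutive cache lines go to consecutive banks (fine-grained
    interleaving). Because the mapping is a pure function of the address,
    a line can never be present in two banks at once.
    """
    return (address // LINE_SIZE) % NUM_BANKS
```

    With a mapping of this kind every line has exactly one possible location, which is why no coherence among banks is needed; the price is that a core may have to reach a remote bank, the latency the abstract says is hidden with double buffering.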

    The excessive power consumption of current processors has slowed the growth of their operating frequency, giving rise to the era of processors with multiple cores and multiple execution threads. For example, the IBM POWER7 processor, released in 2010, incorporates eight cores on the same chip, with four execution threads per core. This creates new opportunities and challenges for software and hardware architects. At the software level, applications can benefit from the abundant number of cores and execution threads to increase throughput. But this forces programmers to create highly parallel applications, and operating systems capable of scheduling them correctly. At the hardware level, the growing number of cores and execution threads puts pressure on the memory interface, since memory bandwidth grows at a slower pace. In addition to the memory bandwidth problems, chip energy consumption rises because of manufacturers' difficulty in sufficiently reducing operating voltages between processor generations. This thesis presents innovations to improve bandwidth and energy consumption in multicore processors in the field of throughput-aware computation: a last-level cache (LLC) optimized for bandwidth, a vector register file optimized for bandwidth, and a heuristic for scheduling the execution of parallel applications aimed at improving power and performance efficiency. In contrast to state-of-the-art LLC designs, our organization avoids data duplication and therefore does not require coherence techniques. The memory address space is statically distributed across the LLC with fine-grained interleaving.
The absence of data replication increases the effective capacity of the cache, which translates into better hit rates and higher bandwidth compared with a coherent LLC. We use double buffering to hide the additional latency needed to access remote data. The proposed vector register file is composed of thousands of registers and is organized as an aggregation of banks. We attach to each bank a small special-purpose computation unit (a "local computation element", or LCE). This approach, which we call "computation in the register file", overcomes the limited number of ports in the register file. Because each LCE is a computation unit with SIMD ("single instruction, multiple data") support and all of them can proceed concurrently, the "computation in the register file" strategy constitutes a highly parallel SIMD device. Finally, we present a heuristic for scheduling the execution of parallel applications aimed at reducing chip energy consumption by dynamically placing software-level threads across hardware-level threads. The heuristic obtains power consumption and performance information from the chip at run time to infer the characteristics of the applications. For example, if the software-level threads share data significantly, the heuristic may decide to place them on fewer cores to favor data exchange among them. In that case, the unused cores can be switched off to save energy. It is increasingly difficult to find "bulletproof" architectural solutions to the scalability limitations of current processors. Consequently, we believe that architects must attack these problems from different flanks simultaneously, with complementary innovations.

  • Hardware-software coherence protocol for the coexistence of caches and local memories

     Alvarez, Lluc; Vilanova, Lluis; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference for High Performance Computing, Networking, Storage and Analysis
    Presentation's date: 2012-11-07
    Presentation of work at congresses

  • Systematic energy characterization of CMP/SMT processor systems via automated micro-benchmarks  Open access

     Bertran Monfort, Ramon; Buyuktosunoglu, Alper; Gupta, Meeta S.; Gonzalez Tallada, Marc; Bose, Pradip
    IEEE/ACM International Symposium on Microarchitecture
    Presentation's date: 2012-12-01
    Presentation of work at congresses

    Microprocessor-based systems today are composed of multi-core, multi-threaded processors with complex cache hierarchies and gigabytes of main memory. Accurate characterization of such a system, through predictive pre-silicon modeling and/or diagnostic post-silicon measurement-based analysis, is increasingly cumbersome and error-prone. This is especially true of energy-related characterization studies. In this paper, we take the position that automated micro-benchmarks generated with particular objectives in mind hold the key to obtaining accurate energy-related characterization. As such, we first present a flexible micro-benchmark generation framework (MicroProbe) that is used to probe complex multi-core/multi-threaded systems with a variety and range of energy-related queries in mind. We then present experimental results centered around an IBM POWER7 CMP/SMT system to demonstrate how the systematically generated micro-benchmarks can be used to answer three specific queries: (a) How to project application-specific (and if needed, phase-specific) power consumption with component-wise breakdowns? (b) How to measure energy-per-instruction (EPI) values for the target machine? (c) How to bound the worst-case (maximum) power consumption in order to determine safe, but practical (i.e. affordable) packaging or cooling solutions? The solution approaches to the above problems are all new. Hardware measurement-based analysis shows superior power projection accuracy (with error margins of less than 2.3% across SPEC CPU2006) as well as max-power stressing capability (with a 10.7% increase in processor power over the very worst-case power seen during the execution of SPEC CPU2006 applications).

    Postprint (author’s final draft)

  • POTRA: a framework for building power models for next generation multicore architectures

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM SIGMETRICS/PERFORMANCE joint International Conference on Measurement and Modeling of Computer Systems
    Presentation's date: 2012
    Presentation of work at congresses

  • DMA-circular: an enhanced high level programmable DMA controller for optimized management of on-chip local memories

     Vujic, Nikola; Alvarez, Lluc; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    ACM International Conference on Computing Frontiers
    Presentation's date: 2012-05-15
    Presentation of work at congresses

  • Counter-based power modeling methods: top-down vs. bottom-up

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    The computer journal (Kalispell, Mont.)
    Date of publication: 2012-08-24
    Journal article

  • POTRA: a framework for building power models for next generation multicore architectures

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM SIGMETRICS Performance Evaluation Review
    Date of publication: 2012-06
    Journal article

  • Software Caching Techniques and Hardware Optimizations for On-Chip Local Memories  Open access

     Vujic, Nikola
    Defense's date: 2012-06-05
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Although caches are the most viable L1 memories in processors, on-chip local memories have received considerable attention lately. Local memories are an interesting design option due to their many benefits: less area occupancy, reduced energy consumption, and fast and constant access time. These benefits are especially interesting for the design of modern multicore processors, since power and latency are key concerns in computer architecture today. Also, local memories do not generate coherency traffic, which is important for the scalability of multicore systems. Unfortunately, local memories have not been well accepted in modern processors yet, mainly due to their poor programmability. Systems with on-chip local memories do not have hardware support for transparent data transfers between local and global memories, and thus difficulty of programming is one of the main impediments to the broad acceptance of those systems. This thesis addresses software and hardware optimizations regarding the programmability and the usage of on-chip local memories in the context of both single-core and multicore systems. The software optimizations are related to software caching techniques. A software cache is a robust approach to provide the user with a transparent view of the memory architecture, but this software approach can suffer from poor performance. In this thesis, we start by optimizing the traditional software cache, proposing a hierarchical, hybrid software-cache architecture. Afterwards, we develop several optimizations in order to speed up our hybrid software cache as much as possible. As a result of the software optimizations, our hybrid software cache runs 4 to 10 times faster than a traditional software cache on a set of NAS parallel benchmarks. We do not stop with software caching.
We cover some other aspects of architectures with on-chip local memories, such as the quality of the generated code and its correspondence with the quality of the buffer management in local memories, in order to improve the performance of these architectures. We pursue this research until we reach the limit of what software can achieve, and then propose optimizations at the hardware level. Two hardware proposals are presented in this thesis. One relaxes the alignment constraints imposed by architectures with on-chip local memories; the other accelerates the management of local memories by providing hardware support for the majority of actions performed in our software cache.
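
    To make the idea of a software cache concrete, the sketch below shows a plain direct-mapped cache managed entirely in software, where a miss triggers an explicit copy from global memory into a local line. It is a generic illustration, not the hierarchical hybrid design proposed in the thesis; the class name, sizes, and the use of a list slice in place of a DMA transfer are all assumptions.

```python
class SoftwareCache:
    """Direct-mapped software cache over a list that stands in for global memory."""

    def __init__(self, global_memory, num_lines=4, line_size=4):
        self.mem = global_memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines    # block number held by each slot
        self.lines = [None] * num_lines   # local copies of the cached blocks
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        block = addr // self.line_size
        slot = block % self.num_lines
        if self.tags[slot] != block:
            # Miss: fetch the whole block (a slice stands in for a DMA transfer).
            self.misses += 1
            start = block * self.line_size
            self.lines[slot] = self.mem[start:start + self.line_size]
            self.tags[slot] = block
        else:
            self.hits += 1
        return self.lines[slot][addr % self.line_size]
```

    Every access pays for the tag check in software, which is the kind of overhead that motivates the optimizations and the hardware support discussed in the abstract.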

    Although caches are still the basic component in the design of the memory subsystem, local memories have become an alternative because of their characteristics in terms of area occupancy, energy consumption, and performance, with a fast and constant access time. These characteristics are of particular interest when upcoming multicore architectures are limited by power consumption and the latency of the memory subsystem. Local memories suffer from limitations regarding programming complexity, which hinders their introduction in multicore architectures despite the advantages mentioned above. This thesis presents a series of solutions based on software and on specifically designed hardware to address these limitations. The software optimizations are based on software caching techniques supported by specific libraries. A software cache is a robust method for providing the user with a transparent view of the architecture, but this approach can suffer from poor performance. In this thesis, a hierarchical, hybrid structure is proposed. Subsequently, we develop optimizations to accelerate the execution of the software that supports the cache design. As a result of these optimizations, our hybrid design runs 4 to 10 times faster than a traditional software cache implementation on a set of reference applications, the NAS Parallel Benchmarks. The thesis work includes other aspects of architectures with local memories, such as the quality of the generated code and its correspondence with the quality of buffer management in local memories, in order to improve the performance of these architectures.
The thesis develops proposals based strictly on the design of new hardware to improve the performance of local memories when no further software optimizations are possible. In particular, the thesis presents two hardware proposals: one relaxes the restrictions imposed by local memories regarding data alignment; the other introduces specific hardware to accelerate the most common operations on local memories.

  • Energy accounting for shared virtualized environments under DVFS using PMC-based power models

     Bertran Monfort, Ramon; Becerra Fontal, Yolanda; Carrera Perez, David; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Torres Viñals, Jordi; Ayguade Parra, Eduard
    Future generation computer systems
    Date of publication: 2012-02
    Journal article

  • DMA++: on the fly data realignment for on-chip memories

     Vujic, Nikola; Cabarcas Jaramillo, Felipe; Gonzalez Tallada, Marc; Ramirez Bellido, Alejandro; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    IEEE transactions on computers
    Date of publication: 2012-02
    Journal article

  • Programming, Debugging, Profiling and Optimizing Transactional Memory Programs  Open access

     Hasanov Zyulkyarov, Ferad
    Defense's date: 2011-07-19
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Transactional memory (TM) is a new optimistic synchronization technique which has the potential to make shared-memory parallel programming easier than with locks, without giving up performance. This thesis explores four aspects of transactional memory research. First, it studies how programming with TM compares to locks. In the course of this work, it develops the first real transactional application, AtomicQuake. AtomicQuake is adapted from the parallel version of the Quake game server by replacing all lock-based synchronization with atomic blocks. Findings suggest that programming with TM is indeed easier than with locks. However, the performance of current software TM systems falls behind efficiently implemented lock-based versions of the same program. The same findings also show that the proposed language-level extensions are not sufficient for developing robust production-level software and that existing development tools such as compilers, debuggers, and profilers lack support for developing transactional applications. Second, this thesis introduces a new set of debugging principles and abstractions. These enable debugging synchronization errors which manifest at the coarse atomic-block level, wrong code inside atomic blocks, and performance errors related to the implementation of the atomic block. The new debugging principles distinguish between debugging at the level of language constructs such as atomic blocks and debugging the atomic blocks based on how they are implemented, whether by TM or by lock inference. These ideas are demonstrated by implementing a debugger extension for WinDbg and the ahead-of-time C# to X86 Bartok-STM compiler. Third, this thesis investigates the types of performance bottlenecks in TM applications and introduces new profiling techniques to find and understand these bottlenecks.
The new profiling techniques provide in-depth and comprehensive information about the wasted work caused by aborting transactions. The individual profiling abstractions fall into three groups: (i) techniques to identify multiple conflicts from a single program run, (ii) techniques to describe the data structures involved in conflicts by using a symbolic path through the heap, rather than a machine address, and (iii) visualization techniques to summarize which transactions conflict most. The ideas were demonstrated by building a lightweight profiling framework for Bartok-STM and an offline tool which processes and displays the profiling data. Fourth, this thesis explores and introduces new TM-specific optimizations which target the wasted work due to aborting transactions. Using the results obtained with the profiling tool, it analyzes and optimizes several applications from the STAMP benchmark suite. The profiling techniques effectively revealed TM-specific bottlenecks such as false conflicts and contentious accesses to data structures. The discovered bottlenecks were subsequently eliminated using the new optimization techniques. Among the optimization highlights are transaction checkpoints, which reduced the wasted work in Intruder by 40%; decomposing objects to eliminate false conflicts in Bayes; early release in Labyrinth, which decreased wasted work from 98% to 1%; and using less contentious data structures with a higher degree of parallelism, such as a chained hash table, in Intruder and Genome.

  • Efficient OpenMP over sequentially consistent distributed shared memory systems  Open access

     Costa Prats, Juan Jose
    Defense's date: 2011-07-20
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Clusters are now among the most widely used platforms in High Performance Computing, and most programmers use the Message Passing Interface (MPI) library to program applications for these distributed platforms and obtain maximum performance, although this is a complex task. On the other hand, OpenMP has been established as the de facto standard for programming applications on shared memory platforms because it is easy to use and obtains good performance without too much effort. So, could it be possible to join both worlds? Could programmers use the ease of OpenMP on distributed platforms? Many researchers think so. One of the ideas developed is distributed shared memory (DSM), a software layer on top of a distributed platform that gives applications an abstract shared memory view. Even though it seems a good solution, it also has some drawbacks. Memory coherence between the nodes in the platform is difficult to maintain (complex management, scalability issues, high overhead and others), and the latency of remote-memory accesses can be orders of magnitude greater than on a shared bus due to the interconnection network. This research therefore improves the performance of OpenMP applications executed on distributed memory platforms using a DSM with sequential consistency, and thoroughly evaluates the results with the NAS Parallel Benchmarks. The vast majority of DSM designs use a relaxed consistency model because it avoids some major problems in the area. In contrast, we use a sequential consistency model because we think that exposing these potential problems, which are otherwise hidden, may allow solutions to be found and then applied to both models. The main idea behind this work is that both runtimes, the OpenMP and the DSM layer, should cooperate to achieve good performance; otherwise they interfere with each other, degrading the final performance of applications.
We develop three different contributions to improve the performance of these applications: (a) a technique to avoid false sharing at runtime, (b) a technique to mimic the MPI behaviour, where produced data is forwarded to its consumers, and, finally, (c) a mechanism to avoid the network congestion due to the DSM coherence messages. The NAS Parallel Benchmarks are used to test the contributions. The results show that the severity of the false-sharing problem depends on each application. Another result is the importance of moving the data flow off the critical path: techniques that forward data as early as possible, as MPI does, benefit final application performance. Additionally, this data movement is usually concentrated at single points and affects application performance due to the limited bandwidth of the network. It is therefore necessary to provide mechanisms that allow this data to be distributed throughout the computation time, using an otherwise idle network. Finally, the results show that the proposed contributions improve the performance of OpenMP applications in this kind of environment.

  • Decomposable and responsive power models for multicore processors using performance counters

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference for High Performance Computing, Networking, Storage and Analysis
    Presentation's date: 2010-06-04
    Presentation of work at congresses

  • DMA++: on the fly data realignment for on-chip memories

     Vujic, Nikola; Gonzalez Tallada, Marc; Cabarcas Jaramillo, Felipe; Ramirez Bellido, Alejandro; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Symposium on High-Performance Computer Architecture (HPCA)
    Presentation's date: 2010
    Presentation of work at congresses

  • Accurate energy accounting for shared virtualized environments using PMC-based power modeling techniques

     Bertran Monfort, Ramon; Becerra Fontal, Yolanda; Carrera Perez, David; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Torres Viñals, Jordi; Ayguade Parra, Eduard
    ACM/IEEE International Conference on Grid Computing
    Presentation's date: 2010-10-27
    Presentation of work at congresses

  • Automatic Prefetch and Modulo Scheduling Transformations for the Cell BE Architecture

     Vujic, Nikola; Gonzalez Tallada, Marc; Martorell, Xavier; Ayguade Parra, Eduard
    IEEE transactions on parallel and distributed systems
    Date of publication: 2010-04
    Journal article

  • Parallel programming models for heterogeneous multicore architectures

     Ferrer, Roger; Bellens, Pieter; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Yeom, Jae-Seung; Schneider, Scott; Koukos, Konstantinos; Alvanos, Michail; Nikolopoulos, Dimitrios S.; Bilas, Angelos
    IEEE micro
    Date of publication: 2010-09-01
    Journal article

  • HiPEAC Paper Award

     Vujic, Nikola; Gonzalez Tallada, Marc; Ramirez Bellido, Alejandro; Cabarcas Jaramillo, Felipe; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    Award or recognition

  • Extending OpenMP to survive the heterogeneous multi-core era

     Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Bellens, Pieter; Cabrera, Daniel; Duran González, Alejandro; Ferrer, Roger; Gonzalez Tallada, Marc; Igual, Francisco D.; Jimenez Gonzalez, Daniel; Labarta Mancho, Jesus Jose; Martinell, Lluis; Martorell Bofill, Xavier; Mayo, Rafael; Pérez Cáncer, Josep Maria; Planas, Judit; Quintana Ortí, Enrique Salvador
    International journal of parallel programming
    Date of publication: 2010-10
    Journal article

    This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to write portable code easily for a number of different platforms, relieving him/her from developing the specific code to off-load tasks to the accelerators and to synchronize tasks. Our results obtained from the StarSs instantiations for SMPs, the Cell, and GPUs report reasonable parallel performance. However, the real impact of our approach is the productivity gains it yields for the programmer.

  • Local memory design space exploration for high-performance computing

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    The Computer journal (paper)
    Date of publication: 2010-03-23
    Journal article

  • A proposal to extend the OpenMP tasking model for heterogeneous architectures

     Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Cabrera, Daniel; Duran Gonzalez, Alejandro; Igual, Francisco D.; Jimenez Gonzalez, Daniel; Labarta Mancho, Jesus Jose; Mayo, Rafael; Pérez, Josep M.; Quintana Ortí, Enrique Salvador; Martorell Bofill, Xavier; Gonzalez Tallada, Marc
    International Workshop on OpenMP
    Presentation's date: 2009-06-03
    Presentation of work at congresses

  • Speeding up distributed MapReduce applications using hardware accelerators  Open access

     Becerra Fontal, Yolanda; Beltran Querol, Vicenç; Carrera Perez, David; Gonzalez Tallada, Marc; Torres Viñals, Jordi; Ayguade Parra, Eduard
    International Conference on Parallel Processing
    Presentation's date: 2009-09-22
    Presentation of work at congresses

    In an attempt to increase the performance/cost ratio, large compute clusters are becoming heterogeneous at multiple levels: from asymmetric processors, to different system architectures, operating systems and networks. Exploiting the intrinsic multi-level parallelism present in such a complex execution environment has become a challenging task using traditional parallel and distributed programming models. As a result, an increasing need for novel approaches to exploiting parallelism has arisen in these environments. MapReduce is a data-driven programming model originally proposed by Google back in 2004 as a flexible alternative to the existing models, especially devoted to hiding the complexity of both developing and running massively distributed applications in large compute clusters. In some recent work, the MapReduce model has also been used to exploit parallelism in other non-distributed environments, such as multi-cores, heterogeneous processors and GPUs. In this paper we introduce a novel approach for exploiting the heterogeneity of a Cell BE cluster by linking an existing MapReduce runtime implementation for distributed clusters and one runtime to exploit the parallelism of the Cell BE nodes. The novel contribution of this work is the design and evaluation of a MapReduce execution environment that effectively exploits the parallelism existing at both the Cell BE cluster level and the heterogeneous processors level.

  • Adaptive and speculative memory consistency support for multi-core architectures with on-chip local memories

     Vujic, Nikola; Álvarez, Lluc; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Workshop on Languages and Compilers for Parallel Computing
    Presentation's date: 2009-10
    Presentation of work at congresses

  • Laboratorio de Introducción a los Computadores: funcionamiento y dificultades docentes

     Navarro Guerrero, Juan Jose; Cruz Diaz, Josep-llorenç; Faúndez Zanuy, Marcos; Gonzalez Tallada, Marc; Manso Cortes, Oscar; Muntés Mulero, Víctor; Palomar Perez, Oscar; Rodero Castro, Ivan; Sanchez Castaño, Friman; Solé Simó, Marc
    Jornades de Docència del Departament d'Arquitectura de Computadors
    Presentation's date: 2009-02
    Presentation of work at congresses

  • MPEXPAR: MODELS DE PROGRAMACIO I ENTORNS D'EXECUCIO PARAL·LELS

     Gonzalez Tallada, Marc; Alonso López, Javier; Sirvent Pardell, Raül; Guitart Fernández, Jordi; Carrera Perez, David; Martorell Bofill, Xavier; Torres Viñals, Jordi; Badia Sala, Rosa Maria; Cortes Rossello, Antonio; Corbalan Gonzalez, Julita; Costa Prats, Juan Jose; Farreras Esclusa, Montserrat; Gil Gómez, Maria Luisa; Navarro Mas, Nacho; Herrero Zaragoza, José Ramón; Tejedor Saavedra, Enric; Becerra Fontal, Yolanda; Nou Castell, Ramon; Labarta Mancho, Jesus Jose; Ayguade Parra, Eduard
    Participation in a competitive project

  • Achieving high memory performance from heterogeneous architectures with the SARC programming model

     Ferrer, Roger; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    Workshop on Memory Performance: dealing with Applications, Systems and Architecture
    Presentation's date: 2009-09
    Presentation of work at congresses

  • A novel asynchronous Software Cache implementation for the Cell-BE processor

     Balart, J; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Sura, Z; Chen, T; Zhang, T; O'Brien, K
    Lecture notes in computer science
    Date of publication: 2008-10
    Journal article

  • Automatic Pre-Fetch and Modulo Scheduling Transformations for the Cell BE Architecture

     Vujic, N; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    Lecture notes in computer science
    Date of publication: 2008-01
    Journal article

  • Prefetching Irregular References for Software Cache on Cell

     Gonzalez Tallada, Marc
    International Symposium on Code Generation and Optimization
    Presentation's date: 2008-04-06
    Presentation of work at congresses

  • Prefetching Irregular References for Software Cache on Cell

     Chen, Tong; Zhang, Tao; Sura, Zehra; Gonzalez Tallada, Marc; O'Brien, Kathryn; O'Brien, Kevin
    International Symposium on Code Generation and Optimization
    Presentation of work at congresses

  • Hybrid Access-Specific Software Cache Techniques for the Cell BE Architecture

     Gonzalez Tallada, Marc
    Parallel Architectures and Compilation Techniques
    Presentation's date: 2008-10-29
    Presentation of work at congresses

  • Hybrid Access-Specific Software Cache Techniques for the Cell BE Architecture

     Gonzalez Tallada, Marc
    International Conference on Parallel Architectures and Compilation Techniques
    Presentation's date: 2008-10-25
    Presentation of work at congresses

  • Hybrid access-specific software cache techniques for the Cell BE architecture

     Gonzalez Tallada, Marc; Vujic, Nikola; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Eichenberger, Alexandre E.; Chen, Tong; Sura, Zehra; Zhang, Tao; O'Brien, Kevin; O'Brien, Kathryn
    International Conference on Parallel Architectures and Compilation Techniques
    Presentation's date: 2008-10
    Presentation of work at congresses

    Ease of programming is one of the main impediments for the broad acceptance of multi-core systems with no hardware support for transparent data transfer between local and global memories. Software cache is a robust approach to provide the user with a transparent view of the memory architecture; but this software approach can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that classifies memory accesses at compile time into two classes, high-locality and irregular. Our approach then steers the memory references toward one of two specific cache structures optimized for their respective access pattern. The specific cache structures are optimized to enable high-level compiler optimizations to aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software cache overhead in the innermost loop. Performance evaluation indicates that improvements due to the optimized software-cache structures combined with the proposed code optimizations translate into 3.5 to 8.4 speedup factors, compared to a traditional software cache approach. As a result, we demonstrate that the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.

  • Evaluation of memory performance on the Cell BE with the SARC programming model

     Ferrer, Roger; Gonzalez Tallada, Marc; Federico, Silla; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    Workshop on Memory Performance: dealing with Applications, Systems and Architecture
    Presentation's date: 2008-10
    Presentation of work at congresses

    With the advent of multicore architectures, especially heterogeneous ones, top computational and memory performance is difficult to obtain using traditional programming models. Usually, programmers have to fully reorganize the code and data of their applications in order to maximize resource usage, and work with the low-level interfaces offered by the vendor-provided SDKs, to obtain high computational and memory performance. In this paper, we present the evaluation of the SARC programming model on the Cell BE architecture, with respect to memory performance. We show how we have annotated the HPL STREAM and RandomAccess applications, and the memory bandwidth obtained. Results indicate that the programming model provides good productivity and competitive performance on this kind of architecture.

  • OPTIMIZED CODE GENERATION TARGETING A HIGH LOCALITY SOFTWARE CACHE

     Gonzalez Tallada, Marc; Chen, Tong; Eichenberger, Alex; Sura, Zehra; O'Brien, Kathryn; O'Brien, Kevin; Zhang, Tao
    Date of request: 2008-10-02
    Invention patent

  • DYNAMICALLY CONTROLLING A PREFETCHING RANGE OF A SOFTWARE CONTROLLED CACHE

     Gonzalez Tallada, Marc; Chen, Tong; Zhang, Tao; Sura, Zehra
    Date of request: 2008-04-02
    Invention patent

  • PREFETCHING IRREGULAR DATA REFERENCES FOR SOFTWARE CONTROLLED CACHE

     Gonzalez Tallada, Marc; Chen, Tong; Zhang, Tao; Sura, Zehra
    Date of request: 2008-04-02
    Invention patent

  • REDUCING CACHE POLLUTION OF A SOFTWARE CONTROLLED CACHE

     Gonzalez Tallada, Marc
    Date of request: 2008-04-02
    Invention patent

  • EFFICIENT SOFTWARE CACHE ACCESSING WITH HANDLING REUSE

     Gonzalez Tallada, Marc
    Date of request: 2008-04-02
    Invention patent

  • DATA TRANSFER OPTIMIZED SOFTWARE CACHE FOR IRREGULAR MEMORY REFERENCES

     Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Eichenberger, Alex; Chen, Tong; Sura, Zehra; Zhang, Tao; O'Brien, Kathryn; O'Brien, Kevin
    Date of request: 2008-03-28
    Invention patent

  • DATA TRANSFER OPTIMIZED SOFTWARE CACHE FOR REGULAR MEMORY REFERENCES

     Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Chen, Tong; Eichenberger, Alex; Sura, Zehra; O'Brien, Kathryn; O'Brien, Kevin; Zhang, Tao
    Date of request: 2008-03-28
    Invention patent

  • A proposal for error handling in OpenMP

     Duran González, Alejandro; Ferrer, Roger; Costa Prats, Juan Jose; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    International journal of parallel programming
    Date of publication: 2007-08
    Journal article

  • A Novel Asynchronous Software Cache Implementation for the Cell-BE Processor

     Balart, Jairo; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Sura, Zehra; Chen, Tong; Zhang, Tao; O'Brien, Kevin; O'Brien, Kathryn
    Workshop on Languages and Compilers for Parallel Computing
    Presentation's date: 2007-10
    Presentation of work at congresses

  • Improving Data Locality in NAS BT Benchmark

     Vaquero, Jordi; Gonzalez Tallada, Marc; Costa Prats, Juan Jose; Javier, Bueno; Martorell Bofill, Xavier; Cortes Rossello, Antonio; Ayguade Parra, Eduard
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    Presentation's date: 2007-07
    Presentation of work at congresses

  • Mercurium C/C++ source-to-source compiler

     Ferrer, Roger; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    Presentation of work at congresses

  • Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications

     Ayguade Parra, Eduard; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Jost, G
    Journal of parallel and distributed computing
    Date of publication: 2006-05
    Journal article

  • A Proposal for Error Handling in OpenMP

     Duran González, Alejandro; Ferrer, Roger; Costa Prats, Juan Jose; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    Lecture notes in computer science
    Date of publication: 2006-06
    Journal article

  • Runtime Address Space Computation for SDSM Systems

     Balart, J; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    Lecture notes in computer science
    Date of publication: 2006-11
    Journal article
