Graphic summary
  • Show / hide key
  • Information


Scientific and technological production
  •  

1 to 50 of 1063 results
  • DReAM: Per-task DRAM energy metering in multicore systems

     Liu, Qixiao; Moreto Planas, Miquel; Abella Ferrer, Jaume; Cazorla Almeida, Francisco Javier; Valero Cortes, Mateo
    International European Conference on Parallel and Distributed Computing
    p. 111-123
    DOI: 10.1007/978-3-319-09873-9-10
    Presentation's date: 2014-08
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Interaction across applications in DRAM memory impacts its energy consumption. This paper makes the case for accurate per-task DRAM energy metering in multicores, which opens new paths to energy/performance optimizations, such as per-task energy-aware task scheduling and energy-aware billing in datacenters. In particular, the contributions of this paper are (i) an ideal per-task energy metering model for DRAM memories; (ii) DReAM, an accurate, yet low cost, implementation of the ideal model (less than 5% accuracy error when 16 tasks share memory); and (iii) a comparison with standard methods (even distribution and access-count based) proving that DReAM is more accurate than these other methods.

  • Physical vs. physically-aware estimation flow: case study of design space exploration of adders

     Ratkovic, Ivan; Palomar Perez, Oscar; Stanic, Milan; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    IEEE Computer Society Annual Symposium on VLSI
    p. 118-123
    DOI: 10.1109/ISVLSI.2014.14
    Presentation's date: 2014-07
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Selecting an appropriate estimation method for a given technology and design is of crucial interest as the estimations guide future project and design decisions. The accuracy of the estimations of area, timing, and power (metrics of interest) depends on the phase of the design flow and the fidelity of the models. In this research, we use design space exploration of low-power adders as a case study for comparative analysis of two estimation flows: Physical layout Aware Synthesis (PAS) and Place and Route (PnR). We study and compare post-PAS and post-PnR estimations of the metrics of interest and the impact of various design parameters and input switching activity factor (aI). Adders are particularly interesting for this study because they are fundamental microprocessor units, and their designinvolves many parameters that create a vast design space. We show cases when the post-PAS and post-PnR estimations could lead to different design decisions, especially from a low-power designer point of view. Our experiments reveal that post-PAS results underestimate the side-effects of clock-gating, pipelining, and extensive timing optimizations compared to post-PnR results. We also observe that PnR estimation flow sometimes reports counterintuitive results

  • Techniques for Improving the Performance of Software Transactional Memory  Open access

     Stipic, Srdan
    Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Transactional Memory (TM) da al desallorador de software la oportunidad de escribir programas concurenter mas facil comparado con todos los paradigmas de programacion previas y da la rendimiento comarable a lock-based sincronisaciones.Actuales Software TM (STM) implementaciones tienen gastos generales de rendimiento que se pueden reducir con la intoduccion de nuevas abstracciones enTransactional Memory modelo de programacion.En la tesis, presentamos cuatro nuevas tecnicas para mejorar el rendimiento de Software TM: (i) Abstract Nested Transactions (ANT), (ii) TagTM, (iii) profile-guided transaction coalescing, and (iv) dynamic transaction coalescing.ANT mejora el rendimiento de aplicaciones transaccionales si romper la semantica de paradigma transaccional, TagTM accelera los accesos a meta-data transaccional,profile-guided transaction coalescing baja gastos generales transaccionales en tiempo de compilacion, y dynamic transaction coalescing baja gastos generales transaccionales en tiempo de ejecucion.Nuestra analisis muestra que Abstract Nested Transactions, TagTM, profile-guided transaction coalescing, y dynamic transaction coalescing mejoran el rendimiento de los programas originales que utilisan Software Transactional Memory.

    Transactional Memory (TM) gives software developers the opportunity to write concurrent programs more easily compared to any previous programming paradigms and gives a performance comparable to lock-based synchronizations. Current Software TM (STM) implementations have performance overheads that can be reduced by introducing new abstractions in Transactional Memory programming model. In this thesis we present four new techniques for improving the performance of Software TM: (i) Abstract Nested Transactions (ANT), (ii) TagTM, (iii) profile-guided transaction coalescing, and (iv) dynamic transaction coalescing. ANT improves performance of transactional applications without breaking the semantics of the transactional paradigm, TagTM speeds up accesses to transactional meta-data, profile-guided transaction coalescing lowers transactional overheads at compile time, and dynamic transaction coalescing lowers transactional overheads at runtime. Our analysis shows that Abstract Nested Transactions, TagTM, profile-guided transaction coalescing, and dynamic transaction coalescing improve the performance of the original programs that use Software Transactional Memory.

  • PVMC: Programmable Vector Memory Controller

     Hussain, Tassadaq; Palomar Perez, Oscar; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Ayguade Parra, Eduard; Valero Cortes, Mateo
    International Conference on Application-Specific Systems, Architectures and Processors
    p. 240-247
    DOI: 10.1109/ASAP.2014.6868668
    Presentation's date: 2014-06-18
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    In this work, we propose a Programmable Vector Memory Controller (PVMC), which boosts noncontiguous vector data accesses by integrating descriptors of memory patterns, a specialized local memory, a memory manager in hardware, and multiple DRAM controllers. We implemented and validated the proposed system on an Altera DE4 FPGA board. We compare the performance of our proposal with a vector system without PVMC as well as a scalar only system. When compared with a baseline vector system, the results show that the PVMC system transfers data sets up to 2.2× to 14.9× faster, achieves between 2.16× to 3.18× of speedup for 5 applications and consumes 2.56 to 4.04 times less energy. © 2014 IEEE.

  • Automatic exploration of potential parallelism in sequential applications

     Subotic, Vladimir; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    International Supercomputing Conference
    p. 156-171
    DOI: 10.1007/978-3-319-07518-1-10
    Presentation's date: 2014-06
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    The multicore era has increased the need for highly parallel software. Since automatic parallelization turned out ineffective for many production codes, the community hopes for the development of tools that may assist parallelization, providing hints to drive the parallelization process. In our previous work, we had designed Tareador, a tool based on dynamic instrumentation that identifies potential task-based parallelism inherent in applications. Also, we showed how a programmer can use Tareador to explore the potential of different parallelization strategies. In this paper, we build up on our previous work by automating the process of exploring parallelism. We have designed an environment that, given a sequential code and configuration of the target parallel architecture, iteratively runs Tareador to find an efficient parallelization strategy. We propose an autonomous algorithm based on simple metrics and a cost function. The algorithm finds an efficient parallelization strategy and provides the programmer with sufficient information to turn that parallelization strategy into an actual parallel program.

  • CPU Accounting in Multi-Threaded Processors  Open access

     Ruiz Luque, Jose Carlos
    Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    En los últimos años, los procesadores multihilo se han convertido más y más populares en la industria con el fin de aumentar el rendimiento del sistema y de cada aplicación, superando las limitaciones impuestas por el limitado paralelismo a nivel de instrucciones y por las restricciones de potencia y térmica. Los procesadores multihilo son ampliamente utilizados en servidores, ordenadores de sobremesa, ordenadores portátiles y dispositivos móviles.Sin embargo, los procesadores multihilo introducen complejidades en la medición de la capacidad del procesador, ya que la capacidad del procesador medida a una aplicación no sólo depende del tiempo que la aplicación está ejecutándose en un procesador, pero también de la cantidad de los recursos hardware que recibe durante ese intervalo. Y dado que en un procesador multihilo, los recursos hardware son compartidos dinámicamente entre las aplicaciones, la capacidad del procesador medida a una aplicación depende de las aplicaciones que se ejecutan concurrentemente en el procesador. Esto es un inconveniente debido a que la medición de la capacidad del procesador de la misma aplicación con el mismo conjunto de datos de entrada puede ser medida significativamente diferente dependiendo de la carga de trabajo en el que se ejecuta. El uso de sistemas con mecanismos precisos de medida de la capacidad del procesador es necesario para mejor el uso justo del sistema entre las aplicaciones en el sistema. Además, permite a los usuarios ser cargado justamente en centro de datos compartidos, facilitando la consolidación de servidores en sistemas futuros.Esta tesis analiza los conceptos de capacidad del procesador y medida de la capacidad del procesador para procesadores multihilo. En este estudio, nosotros mostramos que los mecanismos actuales de medida de la capacidad del procesador no son tan precisos que deberían ser en procesadores multihilo. Por esta razón, nosotros presentamos dos novedosos mecanismos hardware de medida de la capacidad del procesador que mejora la precisión en la medición de la capacidad de procesador en procesadores multihilo con un bajo coste hardware. Nosotros nos centramos en diferentes tipos de arquitectura de procesadores multihilo tales como procesadores multinúcleo y procesadores multinúcleo que cada núcleo soporta multihilo. Finalmente, nosotros analizamos el impacto de los recursos hardware compartidos en los procesadores multihilo en el planificador del procesador en el sistema operativo y proponemos varios modelos de planificador que mejora el conocimiento de los recursos hardware compartidos.

    In recent years, multi-threaded processors have become more and more popular in industry in order to increase the system aggregated performance and per-application performance, overcoming the limitations imposed by the limited instruction-level parallelism, and by power and thermal constraints. Multi-threaded processors are widely used in servers, desktop computers, lap-tops, and mobile devices. However, multi-threaded processors introduce complexities when accounting CPU (computation) capacity (CPU accounting), since the CPU capacity accounted to an application not only depends upon the time that the application is scheduled onto a CPU, but also on the amount of hardware resources it receives during that period. And given that in a multi-threaded processor hardware resources are dynamically shared between applications, the CPU capacity accounted to an application in a multi-threaded processor depends upon the workload in which it executes. This is inconvenient because the CPU accounting of the same application with the same input data set may be accounted significantly different depending upon the workload in which it executes. Deploying systems with accurate CPU accounting mechanisms is necessary to increase fairness among running applications. Moreover, it will allow users to be fairly charged on a shared data center, facilitating server consolidation in future systems. This Thesis analyses the concepts of CPU capacity and CPU accounting for multi-threaded processors. In this study, we demonstrate that current CPU accounting mechanisms are not as accurate as they should be in multi-threaded processors. For this reason, we present two novel CPU accounting mechanisms that improve the accuracy in measuring the CPU capacity for multi-threaded processors with low hardware overhead. We focus our attention on several current multi-threaded processors, including chip multiprocessors and simultaneous multithreading processors. Finally, we analyse the impact of shared resources in multi-threaded processors in operating system CPU scheduler and we propose several schedulers that improve the knowledge of shared hardware resources at the software level.

  • Scalable System Software for High Performance Large-scale Applications  Open access

     Morari, Alessandro
    Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    En las últimas décadas, los sistemas a gran escala de alto rendimiento han sido una herramienta fundamental para el descubrimiento científico y la ingeniería. El crecimiento de las peformance los supercomputadores y la consiguiente reducción de los costes han hecho que esta tecnología sea disponible para un gran número de científicos e ingenieros que trabajan en muchos problemas diferentes . El diseño de la próxima generación de supercomputadoras incluirán requisitos de High Perfomance Computing (HPC) tradicionales, así como los nuevos requisitos para manejar gran volumen de datos. Las aplicaciones de datos intensivos juegan un papel importante en una variedad de campos, y son el foco actual de varias líneas de investigación en HPC.Debido a los retos de escalabilidad y eficiencia, la próxima generación de superordenadores necesita un rediseño de todo lo stack del software. Se espera que el software del sistema va a cambiar drásticamente para adaptarse al próximo hardware y para satisfacer las nuevas necesidades de las aplicaciónes.Esta tesis doctoral estudia la escalabilidad del software del sistema. La tesis se inicia en el nivel de sistema operativo: primero estudia un OS general-purpose (Linux) y luego estudia light-weight kernels ( CNK ). A continuación, la tesi estudia el runtime system: implementamos un runtime system para sistemas de memoria distribuida que incluye muchos de los servicios de sistema requeridos por aplicaciones de próxima generación. Por fin, estudiamo las características hardware que pueden ser explotadas a nivel de usuario para mejorar las applicaciones, y potencialmente incluir estas en nuestro runtime system.Las contribuciones de esta tesis son las siguientes :Escalabilidad del sistema operativo: Proporcionamos un estudio preciso de los problemas de escalabilidad de los sistemas operativos modernos para HPC. Diseñamos y implementamos una metodología donde la información cuantitativa detallada puede ser obtenida para cada evento de OS noise. Validamos nuestro enfoque comparándolo con otras técnicas convencionales bien conocidas para analizar el ruido del sistema operativo, tales FTQ ( Fixed Time Quantum ) . Evaluación de la gestión de la TLB para un lightweight kernel: proporcionamos una evaluación del TLB handling - asignación de memoria dinámica, la asignación de memoria estática con las entradas de la TLB reemplazables , y asignación de memoria estática con las entradas de la TLB fijas (no TLB misses ) en un sistema IBM Blue Gene / P.Escalabilidad del runtime system : Diseñamos e implementamos un runtime system con todas las funciones y el modelo de programación para ejecutar aplicaciones irregulares en un clúster. El runtime system es una libreria llamad Global Memory and Threading ( GMT) y integra un modelo de comunicación PGAS y una estructura de programa fork/join. El runtime system usa aggregacion par cubrir la larencia de red. Comparamos GMT con otros modelos PGAS , con codigo MPI optimizado a mano y arquitecturas personalizadas ( Cray XMT) sobreun conjunto de aplicaciones irregulares a gran escala: breadth first search , random walk y concurrent hashamp. Nuestro runtime es órdenes de magnitud superior a otras soluciones para cluster systems con arquiectura similare.Escalabilidad de nivel de usuario explotando características del hardware : Mostramos la alta complejidad de las optimizaciones de hardware de bajo nivel como una motivación para incorporar esta lógica en un runtime system. Evaluamos los efectos de mecanismo de hardware-thread priority que controla la velocidad a la que cada hilo de clock decodifica la instrucciónes sobre IBM POWER5 y POWER6 . Finalmente, mostramos cómo se puede explotar eficazmente la localidad de caché y de network-on-chip en una arquitectura Tilera many-core para mejorar la escalabilidad intra-core.

    In the last decades, high-performance large-scale systems have been a fundamental tool for scientific discovery and engineering advances. The sustained growth of supercomputing performance and the concurrent reduction in cost have made this technology available for a large number of scientists and engineers working on many different problems. The design of next-generation supercomputers will include traditional HPC requirements as well as the new requirements to handle data-intensive computations. Data intensive applications will hence play an important role in a variety of fields, and are the current focus of several research trends in HPC. Due to the challenges of scalability and power efficiency, next-generation of supercomputers needs a redesign of the whole software stack. Being at the bottom of the software stack, system software is expected to change drastically to support the upcoming hardware and to meet new application requirements. This PhD thesis addresses the scalability of system software. The thesis start at the Operating System level: first studying general-purpose OS (ex. Linux) and then studying lightweight kernels (ex. CNK). Then, we focus on the runtime system: we implement a runtime system for distributed memory systems that includes many of the system services required by next-generation applications. Finally we focus on hardware features that can be exploited at user-level to improve applications performance, and potentially included into our advanced runtime system. The thesis contributions are the following: Operating System Scalability: We provide an accurate study of the scalability problems of modern Operating Systems for HPC. We design and implement a methodology whereby detailed quantitative information may be obtained for each OS noise event. We validate our approach by comparing it to other well-known standard techniques to analyze OS noise, such FTQ (Fixed Time Quantum). Evaluation of the address translation management for a lightweight kernel: we provide a performance evaluation of different TLB management approaches ¿ dynamic memory mapping, static memory mapping with replaceable TLB entries, and static memory mapping with fixed TLB entries (no TLB misses) on a IBM BlueGene/P system. Runtime System Scalability: We show that a runtime system can efficiently incorporate system services and improve scalability for a specific class of applications. We design and implement a full-featured runtime system and programming model to execute irregular appli- cations on a commodity cluster. The runtime library is called Global Memory and Threading library (GMT) and integrates a locality-aware Partitioned Global Address Space communication model with a fork/join program structure. It supports massive lightweight multi-threading, overlapping of communication and computation and small messages aggregation to tolerate network latencies. We compare GMT to other PGAS models, hand-optimized MPI code and custom architectures (Cray XMT) on a set of large scale irregular applications: breadth first search, random walk and concurrent hash map access. Our runtime system shows performance orders of magnitude higher than other solutions on commodity clusters and competitive with custom architectures. User-level Scalability Exploiting Hardware Features: We show the high complexity of low-level hardware optimizations for single applications, as a motivation to incorporate this logic into an adaptive runtime system. We evaluate the effects of controllable hardware-thread priority mechanism that controls the rate at which each hardware-thread decodes instruction on IBM POWER5 and POWER6 processors. Finally, we show how to effectively exploits cache locality and network-on-chip on the Tilera many-core architecture to improve intra-core scalability.

  • Stand-alone memory controller for graphics system

     Hussain, Tassadaq; Palomar Perez, Oscar; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Ayguade Parra, Eduard; Valero Cortes, Mateo; Haider, Amna
    International Symposium on Applied Reconfigurable Computing
    p. 108-120
    DOI: 10.1007/978-3-319-05960-0_10
    Presentation's date: 2014-04
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    There has been a dramatic increase in the complexity of graphics applications in System-on-Chip (SoC) with a corresponding increase in performance requirements. Various powerful and expensive platforms to support graphical applications appeared recently. All these platforms require a high performance core that manages and schedules the high speed data of graphics peripherals (camera, display, etc.) and an efficient on chip scheduler. In this article we design and propose a SoC based Programmable Graphics Controller (PGC) that handles graphics peripherals efficiently. The data access patterns are described in the pro- gram memory; the PGC reads them, generates transactions and manages both bus and connected peripherals without the support of a master core. The proposed system is highly reliable in terms of cost, performance and power. The PGC based system is implemented and tested on a Xilinx ML505 FPGA board. The performance of the PGC is compared with the Microblaze processor based graphic system. When compared with the baseline system, the results show that the PGC captures video at 2x of higher frame rate and achieves 3.4x to 7.4x of speedup while process- ing images. PGC consumes 30% less hardware resources and 22% less on-chip power than the baseline system.

  • EVX: vector execution on low power EDGE cores

     Duric, Milovan; Palomar Perez, Oscar; Smith, Aaron; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo; Burger, Doug
    Design, Automation and Test in Europe
    p. 1-4
    DOI: 10.7873/DATE.2014.035
    Presentation's date: 2014-03-24
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    In this paper, we present a vector execution model that provides the advantages of vector processors on low power, general purpose cores, with limited additional hardware. While accelerating data-level parallel (DLP) workloads, the vector model increases the efficiency and hardware resources utilization. We use a modest dual issue core based on an Explicit Data Graph Execution (EDGE) architecture to implement our approach, called EVX. Unlike most DLP accelerators which utilize additional hardware and increase the complexity of low power processors, EVX leverages the available resources of EDGE cores, and with minimal costs allows for specialization of the resources. EVX adds a control logic that increases the core area by 2.1%. We show that EVX yields an average speedup of 3x compared to a scalar baseline and outperforms multimedia SIMD extensions. © 2014 EDAA.

  • APMC: advanced pattern based memory controller

     Hussain, Tassadaq; Palomar Perez, Oscar; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Ayguade Parra, Eduard; Valero Cortes, Mateo; Rethinagiri, Santhosh Kumar
    ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
    p. 252
    DOI: 10.1145/2554688.2554732
    Presentation's date: 2014-02
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    In this paper, we present APMC, the Advanced Pattern based Memory Controller, that uses descriptors to support both regular and irregular memory access patterns without using a master core. It keeps pattern descriptors in memory and prefetches the complex 1D/2D/3D data structure into its special scratchpad memory. Support for irregular Memory accesses are arranged in the pattern descriptors at program-time and APMC manages multiple patterns at run-time to reduce access latency. The proposed APMC system reduces the limitations faced by processors/accelerators due to irregular memory access patterns and low memory bandwidth. It gathers multiple memory read/write requests and maximizes the reuse of opened SDRAM banks to decrease the overhead of opening and closing rows. APMC manages data movement between main memory and the specialized scratchpad memory; data present in the specialized scratchpad is reused and/or updated when accessed by several patterns. The system is implemented and tested on a Xilinx ML505 FPGA board. The performance of the system is compared with a processor with a high performance memory controller. The results show that the APMC system transfers regular and irregular datasets up to 20.4x and 3.4x faster respectively than the baseline system. When compared to the baseline system, APMC consumes 17% less hardware resources, 32% less on-chip power and achieves between 3.5x to 52x and 1.4x to 2.9x of speedup for regular and irregular applications respectively. The APMC core consumes 50% less hardware resources than the baseline system's memory controller. In this paper, we present APMC, the Advanced Pattern based Memory Controller, an intelligent memory controller that uses descriptors to supports both regular and irregular memory access patterns. support of the master core. It keeps pattern descriptors in memory and prefetches the complex data structure into its special scratchpad memory. Memory accesses are arranged in the pattern descriptors at program-time and APMC manages multiple patterns at run-time to reduce access latency. The proposed APMC system reduces the limitations faced by processors/accelerators due to irregular memory access patterns and low memory bandwidth. The system is implemented and tested on a Xilinx ML505 FPGA board. The performance of the system is compared with a processor with a high performance memory controller. The results show that the APMC system transfers regular and irregular datasets up to 20.4x and 3.4x faster respectively than the baseline system. When compared to the baseline system, APMC consumes 17% less hardware resources, 32% less on-chip power and achieves between 3.5x to 52x and 1.4x to 2.9x of speedup for regular and irregular applications respectively. The APMC core consumes 50% less hardware resources than the baseline system's memory controller.memory accesses. In this paper, we present APMC, the Advanced Pattern based Memory Controller, an intelligent memory controller that supports both regular and irregular memory access patterns. The proposed APMC system reduces the limitations faced by processors/accelerators due to irregular memory access patterns and low memory bandwidth. The system is implemented and tested on a Xilinx ML505 FPGA board. The performance of the system is compared with a processor with a high performance memory controller. The results show that the APMC system transfers regular and irregular datasets up to 20.4x and 3.4x faster respectively than the baseline system. When compared to the baseline system, APMC consumes 17% less hardware resources, 32% less on-chip power and achieves between 3.5x to 52x and 1.4x to 2.9x of speedup for regular and irregular applications respectively.

  • Arquitectura de Computadors d'Altes Prestacions (ACAP)

     Olive Duran, Angel; Ramirez Bellido, Alejandro; Llosa Espuny, Jose Francisco; Sanchez Carracedo, Fermin; Jimenez Castells, Marta; Fernandez Jimenez, Agustin; Jimenez Gonzalez, Daniel; Alvarez Martinez, Carlos; Morancho Llena, Enrique; Moretó Planas, Miquel; Carpenter, Paul M.; Palomar Perez, Oscar; Monreal Arnal, Teresa; Valero Cortes, Mateo
    Competitive project

     Share

  • Raising the Level of Abstraction: Simulation of Large Chip Multiprocessors Running Multithreaded Applications  Open access

     Rico Carro, Alejandro
    Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    El nombre de transistors en els circuits integrats continua doblant-se cada dos anys. Aquest nombre creixent de transistors s'utilitza per integrar més nuclis de processament en el mateix chip. No obstant això, degut a la densitat de potència i la disminució de guanys per l'explotació de paral·lelisme a nivell d'instrucció, el rendiment d'un sol fil en un nucli no es dobla cada dos anys, sinó cada tres anys i mig.La investigació en arquitectura de computadors està principalment basada en simulació. En els simuladors d'arquitectura de computadors, la complexitat de la màquina simulada incrementa amb el nombre de transistors disponibles. Quant més transistors, més nuclis, i més complexe és el model. No obstant això, el rendiment dels simuladors d'arquitectura de computadors depèn del rendiment d'un sol fil a la màquina que executa el simulador i, com hem dit abans, aquest no es dobla cada dos anys, sinó cada tres anys i mig. Aquesta diferència creixent entre la complexitat de la màquina simulada i la velocitat de simulació és el que anomenem la bretxa de la velocitat de simulació.Degut a la bretxa de la velocitat de simulació, els simuladors d'arquitectura de computadors són cada cop més lents. La simulació d'una aplicació de referència pot trigar setmanes o inclús mesos. Els investigadors són conscients d'aquest problema i han estat proposant tècniques per reduir el temps de simulació. Aquestes tècniques inclouen l'ús de conjunt d'entrades d'aplicació reduïts, simulació mostrejada i paral·lelització.Un altra tècnica per reduir el temps de simulació és elevar el nivell d'abstracció del model simulat. En aquesta tesi advoquem per aquesta estratègia. Primer, decidim utilitzar simulació basada en traces perquè no necessita proporcionar simulació funcional i, per tant, permet elevar el nivell d'abstracció més enllà de la representació a nivell d'instrucció.No obstant això, la simulació basada en traces té algunes limitacions. La més important d'aquestes és que no pot reproduir el comportament dinàmic de les aplicacions multifil. En aquesta tesi, proposem una metodologia de simulació que utilitza simulació basada en traces conjuntament amb un sistema en temps d'execució que permet una simulació correcta d'aplicacions multifil al reproduir el comportament depenent del rendiment en temps d'execució.Amb aquesta metodologia, avaluem l'ús de múltiples nivells d'abstracció per reduir el temps de simulació, des d'un mode de simulació a nivell d'aplicació fins a un mode detallat a nivell d'instrucció. Fem una avaluació exhaustiva de l'impacte en la precisió i velocitat de simulació d'aquests nivells d'abstracció i també mostrem la seva aplicabilitat i utilitat depenent de les avaluacions que es volen portar a terme. També comparem aquest nivell d'abstracció amb els emprats en els simuladors d'arquitectura de computadors més populars. A més, validem el nivell d'abstracció més alt contra la màquina real.Un dels nivells d'abstracció interessants per la simulació d'arquitectures multi-nucli és el mode de memòria. Aquest mode és capaç de modelar el rendiment d'un nucli superescalar fora d'ordre utilitzant traces d'accessos a memòria. A aquest nivell d'abstracció, treballs previs han utilitzat traces filtrades que no inclouen encerts en la memòria cau de primer nivell, i permeten simular només els accessos al segon nivell per simulacions d'un sol fil. No obstant això, simular aplicacions multifil utilitzant traces filtrades igual que en treballs previs comporta algunes imprecisions. Nosaltres proposem una tècnica per reduir aquestes imprecisions i avaluem el guany en velocitat, la seva aplicabilitat i utilitat per simulacions a nivell de memòria.En conjunt, aquesta tesi contribueix al coneixement amb tècniques per la simulació de chips multiprocessador amb centenars de nuclis utilitzant traces, alhora que estableix i avalua els compromisos d'utilitzar diversos nivells d'abstracció en termes de precisió i velocitat de simulació.

    The number of transistors on an integrated circuit keeps doubling every two years. This increasing number of transistors is used to integrate more processing cores on the same chip. However, due to power density and ILP diminishing returns, the single-thread performance of such processing cores does not double every two years, but doubles every three years and a half. Computer architecture research is mainly driven by simulation. In computer architecture simulators, the complexity of the simulated machine increases with the number of available transistors. The more transistors, the more cores, the more complex is the model. However, the performance of computer architecture simulators depends on the single-thread performance of the host machine and, as we mentioned before, this is not doubling every two years but every three years and a half. This increasing difference between the complexity of the simulated machine and simulation speed is what we call the simulation speed gap. Because of the simulation speed gap, computer architecture simulators are increasingly slow. The simulation of a reference benchmark may take several weeks or even months. Researchers are concious of this problem and have been proposing techniques to reduce simulation time. These techniques include the use of reduced application input sets, sampled simulation and parallelization. Another technique to reduce simulation time is raising the level of abstraction of the simulated model. In this thesis we advocate for this approach. First, we decide to use trace-driven simulation because it does not require to provide functional simulation, and thus, allows to raise the level of abstraction beyond the instruction-stream representation. However, trace-driven simulation has several limitations, the most important being the inability to reproduce the dynamic behavior of multithreaded applications. In this thesis we propose a simulation methodology that employs a trace-driven simulator together with a runtime sytem that allows the proper simulation of multithreaded applications by reproducing the timing-dependent dynamic behavior at simulation time. Having this methodology, we evaluate the use of multiple levels of abstraction to reduce simulation time, from a high-speed application-level simulation mode to a detailed instruction-level mode. We provide a comprehensive evaluation of the impact in accuracy and simulation speed of these abstraction levels and also show their applicability and usefulness depending on the target evaluations. We also compare these levels of abstraction with the existing ones in popular computer architecture simulators. Also, we validate the highest abstraction level against a real machine. One of the interesting levels of abstraction for the simulation of multi-cores is the memory mode. This simulation mode is able to model the performanceof a superscalar out-of-order core using memory-access traces. At this level of abstraction, previous works have used filtered traces that do not include L1 hits, and allow to simulate only L2 misses for single-core simulations. However, simulating multithreaded applications using filtered traces as in previous works has inherent inaccuracies. We propose a technique to reduce such inaccuracies and evaluate the speed-up, applicability, and usefulness of memory-level simulation. All in all, this thesis contributes to knowledge with techniques for the simulation of chip multiprocessors with hundreds of cores using traces. It states and evaluates the trade-offs of using varying degress of abstraction in terms of accuracy and simulation speed.

  • Evaluating techniques for parallelization tuning in MPI, OmpSs and MPI/OmpSs  Open access

     Subotic, Vladimir
    Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    La programación paralela consiste en dividir un problema de computación entre múltiples unidades de procesamiento y definir como interactúan (comunicación y sincronización) para garantizar un resultado correcto. El rendimiento de unprograma paralelo normalmente está muy lejos de ser óptimo: el desequilibrio de la carga computacional y la excesiva interacción entre las unidades de procesamiento a menudo causa ciclos perdidos, reduciendo la eficiencia de la computación paralela.En esta tesis proponemos técnicas orientadas a explotar mejor el paralelismo en aplicaciones paralelas, poniendo énfasis en técnicas que incrementan el asincronismo. En teoría, estas técnicas prometen múltiples beneficios. Primero, tendrían que mitigar el retraso de la comunicación y la sincronización, y por lo tanto incrementar el rendimiento global. Además, la calibración de la paralelización tendría que exponer un paralelismo adicional, incrementando la escalabilidad de la ejecución. Finalmente, un incremente en el asincronismo proveería una tolerancia mayor a redes de comunicación lentas y ruido externo.En la primera parte de la tesis, estudiamos el potencial para la calibración del paralelismo a través de MPI. En concreto, exploramos técnicas automáticas para solapar la comunicación con la computación. Proponemos una técnica demensajería especulativa que incrementa el solapamiento y no requiere cambios en la aplicación MPI original. Nuestra técnica identifica automáticamente la actividad MPI de la aplicación y la reinterpreta usando solicitudes MPI no bloqueantes situadas óptimamente. Demostramos que esta técnica maximiza el solapamiento y, en consecuencia, acelera la ejecución y permite una mayor tolerancia a las reducciones de ancho de banda. Aún así, en el caso de cargas de trabajo científico realistas, mostramos que el potencial de solapamiento está significativamente limitado por el patrón según el cual cada proceso MPI opera localmente en el paso de mensajes.En la segunda parte de esta tesis, exploramos el potencial para calibrar el paralelismo híbrido MPI/OmpSs. Intentamos obtener una comprensión mejor del paralelismo de aplicaciones híbridas MPI/OmpSs para evaluar de qué manera se ejecutarían en futuras máquinas. Exploramos como las aplicaciones MPI/OmpSs pueden escalar en una máquina paralela con centenares de núcleos por nodo. Además, investigamos cómo este paralelismo de cada nodo se reflejaría en las restricciones de la red de comunicación. En especia, nos concentramos en identificar secciones críticas de código en MPI/OmpSs. Hemos concebido una técnica que rápidamente evalúa, para una aplicación MPI/OmpSs dada y la máquina objetivo seleccionada, qué sección de código tendría que ser optimizada para obtener la mayor ganancia de rendimiento.También estudiamos técnicas para explorar rápidamente el paralelismo potencial de OmpSs inherente en las aplicaciones.Proporcionamos mecanismos para evaluar fácilmente el paralelismo potencial de cualquier descomposición en tareas.Además, describimos una aproximación iterativa para buscar una descomposición en tareas que mostrará el suficiente paralelismo en la máquina objetivo dada. Para finalizar, exploramos el potencial para automatizar la aproximación iterativa.En el trabajo expuesto en esta tesis hemos diseñado herramientas que pueden ser útiles para otros investigadores de este campo. La más avanzada es Tareador, una herramienta para ayudar a migrar aplicaciones al modelo de programaciónMPI/OmpSs. Tareador proporciona una interfaz simple para proponer una descomposición del código en tareas OmpSs.Tareador también calcula dinámicamente las dependencias de datos entre las tareas anotadas, y automáticamente estima el potencial de paralelización OmpSs. Por último, Tareador da indicaciones adicionales sobre como completar el proceso de migración a OmpSs. Tareador ya se ha mostrado útil al ser incluido en las clases de programación de la UPC.

    Parallel programming is used to partition a computational problem among multiple processing units and to define how they interact (communicate and synchronize) in order to guarantee the correct result. The performance that is achieved when executing the parallel program on a parallel architecture is usually far from the optimal: computation unbalance and excessive interaction among processing units often cause lost cycles, reducing the efficiency of parallel computation. In this thesis we propose techniques oriented to better exploit parallelism in parallel applications, with emphasis in techniques that increase asynchronism. Theoretically, this type of parallelization tuning promises multiple benefits. First, it should mitigate communication and synchronization delays, thus increasing the overall performance. Furthermore, parallelization tuning should expose additional parallelism and therefore increase the scalability of execution. Finally, increased asynchronism would provide higher tolerance to slower networks and external noise. In the first part of this thesis, we study the potential for tuning MPI parallelism. More specifically, we explore automatic techniques to overlap communication and computation. We propose a speculative messaging technique that increases the overlap and requires no changes of the original MPI application. Our technique automatically identifies the application’s MPI activity and reinterprets that activity using optimally placed non-blocking MPI requests. We demonstrate that this overlapping technique increases the asynchronism of MPI messages, maximizing the overlap, and consequently leading to execution speedup and higher tolerance to bandwidth reduction. However, in the case of realistic scientific workloads, we show that the overlapping potential is significantly limited by the pattern by which each MPI process locally operates on MPI messages. In the second part of this thesis, we study the potential for tuning hybrid MPI/OmpSs parallelism. We try to gain a better understanding of the parallelism of hybrid MPI/OmpSs applications in order to evaluate how these applications would execute on future machines and to predict the execution bottlenecks that are likely to emerge. We explore how MPI/OmpSs applications could scale on the parallel machine with hundreds of cores per node. Furthermore, we investigate how this high parallelism within each node would reflect on the network constraints. We especially focus on identifying critical code sections in MPI/OmpSs. We devised a technique that quickly evaluates, for a given MPI/OmpSs application and the selected target machine, which code section should be optimized in order to gain the highest performance benefits. Also, this thesis studies techniques to quickly explore the potential OmpSs parallelism inherent in applications. We provide mechanisms to easily evaluate potential parallelism of any task decomposition. Furthermore, we describe an iterative trialand-error approach to search for a task decomposition that will expose sufficient parallelism for a given target machine. Finally, we explore potential of automating the iterative approach by capturing the programmers’ experience into an expert system that can autonomously lead the search process. Also, throughout the work on this thesis, we designed development tools that can be useful to other researchers in the field. The most advanced of these tools is Tareador – a tool to help porting MPI applications to MPI/OmpSs programming model. Tareador provides a simple interface to propose some decomposition of a code into OmpSs tasks. Tareador dynamically calculates data dependencies among the annotated tasks, and automatically estimates the potential OmpSs parallelization. Furthermore, Tareador gives additional hints on how to complete the process of porting the application to OmpSs. Tareador already proved itself useful, by being included in the academic classes on parallel programming at UPC.

    La programación paralela consiste en dividir un problema de computación entre múltiples unidades de procesamiento y definir como interactúan (comunicación y sincronización) para garantizar un resultado correcto. El rendimiento de un programa paralelo normalmente está muy lejos de ser óptimo: el desequilibrio de la carga computacional y la excesiva interacción entre las unidades de procesamiento a menudo causa ciclos perdidos, reduciendo la eficiencia de la computación paralela. En esta tesis proponemos técnicas orientadas a explotar mejor el paralelismo en aplicaciones paralelas, poniendo énfasis en técnicas que incrementan el asincronismo. En teoría, estas técnicas prometen múltiples beneficios. Primero, tendrían que mitigar el retraso de la comunicación y la sincronización, y por lo tanto incrementar el rendimiento global. Además, la calibración de la paralelización tendría que exponer un paralelismo adicional, incrementando la escalabilidad de la ejecución. Finalmente, un incremente en el asincronismo proveería una tolerancia mayor a redes de comunicación lentas y ruido externo. En la primera parte de la tesis, estudiamos el potencial para la calibración del paralelismo a través de MPI. En concreto, exploramos técnicas automáticas para solapar la comunicación con la computación. Proponemos una técnica de mensajería especulativa que incrementa el solapamiento y no requiere cambios en la aplicación MPI original. Nuestra técnica identifica automáticamente la actividad MPI de la aplicación y la reinterpreta usando solicitudes MPI no bloqueantes situadas óptimamente. Demostramos que esta técnica maximiza el solapamiento y, en consecuencia, acelera la ejecución y permite una mayor tolerancia a las reducciones de ancho de banda. Aún así, en el caso de cargas de trabajo científico realistas, mostramos que el potencial de solapamiento está significativamente limitado por el patrón según el cual cada proceso MPI opera localmente en el paso de mensajes. En la segunda parte de esta tesis, exploramos el potencial para calibrar el paralelismo híbrido MPI/OmpSs. Intentamos obtener una comprensión mejor del paralelismo de aplicaciones híbridas MPI/OmpSs para evaluar de qué manera se ejecutarían en futuras máquinas. Exploramos como las aplicaciones MPI/OmpSs pueden escalar en una máquina paralela con centenares de núcleos por nodo. Además, investigamos cómo este paralelismo de cada nodo se reflejaría en las restricciones de la red de comunicación. En especia, nos concentramos en identificar secciones críticas de código en MPI/OmpSs. Hemos concebido una técnica que rápidamente evalúa, para una aplicación MPI/OmpSs dada y la máquina objetivo seleccionada, qué sección de código tendría que ser optimizada para obtener la mayor ganancia de rendimiento. También estudiamos técnicas para explorar rápidamente el paralelismo potencial de OmpSs inherente en las aplicaciones. Proporcionamos mecanismos para evaluar fácilmente el paralelismo potencial de cualquier descomposición en tareas. Además, describimos una aproximación iterativa para buscar una descomposición en tareas que mostrará el suficiente paralelismo en la máquina objetivo dada. Para finalizar, exploramos el potencial para automatizar la aproximación iterativa. En el trabajo expuesto en esta tesis hemos diseñado herramientas que pueden ser útiles para otros investigadores de este campo. La más avanzada es Tareador, una herramienta para ayudar a migrar aplicaciones al modelo de programación MPI/OmpSs. Tareador proporciona una interfaz simple para proponer una descomposición del código en tareas OmpSs. Tareador también calcula dinámicamente las dependencias de datos entre las tareas anotadas, y automáticamente estima el potencial de paralelización OmpSs. Por último, Tareador da indicaciones adicionales sobre como completar el proceso de migración a OmpSs. Tareador ya se ha mostrado útil al ser incluido en las clases de programación de la UPC.

  • Performance and Power Optimizations in Chip Multiprocessors for Throughput-Aware Computation  Open access

     Vega, Augusto Javier
    Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    El excesivo consumo de potencia de los procesadores actuales ha desacelerado el incremento en la frecuencia operativa de los mismos para dar lugar a la era de los procesadores con múltiples núcleos y múltiples hilos de ejecución. Por ejemplo, el procesador POWER7 de IBM, lanzado al mercado en 2010, incorpora ocho núcleos en el mismo chip, con cuatro hilos de ejecución por núcleo. Esto da lugar a nuevas oportunidades y desafíos para los arquitectos de software y hardware. A nivel de software, las aplicaciones pueden beneficiarse del abundante número de núcleos e hilos de ejecución para aumentar el rendimiento. Pero esto obliga a los programadores a crear aplicaciones altamente paralelas y sistemas operativos capaces de planificar correctamente la ejecución de las mismas. A nivel de hardware, el creciente número de núcleos e hilos de ejecución ejerce presión sobre la interfaz de memoria, ya que el ancho de banda de memoria crece a un ritmo más lento. Además de los problemas de ancho de banda de memoria, el consumo de energía del chip se eleva debido a la dificultad de los fabricantes para reducir suficientemente los voltajes de operación entre generaciones de procesadores. Esta tesis presenta innovaciones para mejorar el ancho de banda y consumo de energía en procesadores multinúcleo en el ámbito de la computación orientada a rendimiento ("throughput-aware computation"): una memoria caché de último nivel ("last-level cache" o LLC) optimizada para ancho de banda, un banco de registros vectorial optimizado para ancho de banda, y una heurística para planificar la ejecución de aplicaciones paralelas orientada a mejorar la eficiencia del consumo de potencia y desempeño.En contraste con los diseños de LLC de última generación, nuestra organización evita la duplicación de datos y, por tanto, no requiere de técnicas de coherencia. El espacio de direcciones de memoria se distribuye estáticamente en la LLC con un entrelazado de grano fino. La ausencia de replicación de datos aumenta la capacidad efectiva de la memoria caché, lo que se traduce en mejores tasas de acierto y mayor ancho de banda en comparación con una LLC coherente. Utilizamos la técnica de "doble buffering" para ocultar la latencia adicional necesaria para acceder a datos remotos.El banco de registros vectorial propuesto se compone de miles de registros y se organiza como una agregación de bancos. Incorporamos a cada banco una pequeña unidad de cómputo de propósito especial ("local computation element" o LCE). Este enfoque ---que llamamos "computación en banco de registros"--- permite superar el número limitado de puertos en el banco de registros. Debido a que cada LCE es una unidad de cómputo con soporte SIMD ("single instruction, multiple data") y todas ellas pueden proceder de forma concurrente, la estrategia de "computación en banco de registros" constituye un dispositivo SIMD altamente paralelo.Por último, presentamos una heurística para planificar la ejecución de aplicaciones paralelas orientada a reducir el consumo de energía del chip, colocando dinámicamente los hilos de ejecución a nivel de software entre los hilos de ejecución a nivel de hardware. La heurística obtiene, en tiempo de ejecución, información de consumo de potencia y desempeño del chip para inferir las características de las aplicaciones. Por ejemplo, si los hilos de ejecución a nivel de software comparten datos significativamente, la heurística puede decidir colocarlos en un menor número de núcleos para favorecer el intercambio de datos entre ellos. En tal caso, los núcleos no utilizados se pueden apagar para ahorrar energía.Cada vez es más difícil encontrar soluciones de arquitectura "a prueba de balas" para resolver las limitaciones de escalabilidad de los procesadores actuales. En consecuencia, creemos que los arquitectos deben atacar dichos problemas desde diferentes flancos simultáneamente, con innovaciones complementarias.

    The so-called "power (or power density) wall" has caused core frequency (and single-thread performance) to slow down, giving rise to the era of multi-core/multi-thread processors. For example, the IBM POWER4 processor, released in 2001, incorporated two single-thread cores into the same chip. In 2010, IBM released the POWER7 processor with eight 4-thread cores in the same chip, for a total capacity of 32 execution contexts. The ever increasing number of cores and threads gives rise to new opportunities and challenges for software and hardware architects. At software level, applications can benefit from the abundant number of execution contexts to boost throughput. But this challenges programmers to create highly-parallel applications and operating systems capable of scheduling them correctly. At hardware level, the increasing core and thread count puts pressure on the memory interface, because memory bandwidth grows at a slower pace ---phenomenon known as the "bandwidth (or memory) wall". In addition to memory bandwidth issues, chip power consumption rises due to manufacturers' difficulty to lower operating voltages sufficiently every processor generation. This thesis presents innovations to improve bandwidth and power consumption in chip multiprocessors (CMPs) for throughput-aware computation: a bandwidth-optimized last-level cache (LLC), a bandwidth-optimized vector register file, and a power/performance-aware thread placement heuristic. In contrast to state-of-the-art LLC designs, our organization avoids data replication and, hence, does not require keeping data coherent. Instead, the address space is statically distributed all over the LLC (in a fine-grained interleaving fashion). The absence of data replication increases the cache effective capacity, which results in better hit rates and higher bandwidth compared to a coherent LLC. We use double buffering to hide the extra access latency due to the lack of data replication. The proposed vector register file is composed of thousands of registers and organized as an aggregation of banks. We leverage such organization to attach small special-function "local computation elements" (LCEs) to each bank. This approach ---referred to as the "processor-in-regfile" (PIR) strategy--- overcomes the limited number of register file ports. Because each LCE is a SIMD computation element and all of them can proceed concurrently, the PIR strategy constitutes a highly-parallel super-wide-SIMD device (ideal for throughput-aware computation). Finally, we present a heuristic to reduce chip power consumption by dynamically placing software (application) threads across hardware (physical) threads. The heuristic gathers chip-level power and performance information at runtime to infer characteristics of the applications being executed. For example, if an application's threads share data, the heuristic may decide to place them in fewer cores to favor inter-thread data sharing and communication. In such case, the number of active cores decreases, which is a good opportunity to switch off the unused cores to save power. It is increasingly harder to find bulletproof (micro-)architectural solutions for the bandwidth and power scalability limitations in CMPs. Consequently, we think that architects should attack those problems from different flanks simultaneously, with complementary innovations. This thesis contributes with a battery of solutions to alleviate those problems in the context of throughput-aware computation: 1) proposing a bandwidth-optimized LLC; 2) proposing a bandwidth-optimized register file organization; and 3) proposing a simple technique to improve power-performance efficiency.

    El excesivo consumo de potencia de los procesadores actuales ha desacelerado el incremento en la frecuencia operativa de los mismos para dar lugar a la era de los procesadores con múltiples núcleos y múltiples hilos de ejecución. Por ejemplo, el procesador POWER7 de IBM, lanzado al mercado en 2010, incorpora ocho núcleos en el mismo chip, con cuatro hilos de ejecución por núcleo. Esto da lugar a nuevas oportunidades y desafíos para los arquitectos de software y hardware. A nivel de software, las aplicaciones pueden beneficiarse del abundante número de núcleos e hilos de ejecución para aumentar el rendimiento. Pero esto obliga a los programadores a crear aplicaciones altamente paralelas y sistemas operativos capaces de planificar correctamente la ejecución de las mismas. A nivel de hardware, el creciente número de núcleos e hilos de ejecución ejerce presión sobre la interfaz de memoria, ya que el ancho de banda de memoria crece a un ritmo más lento. Además de los problemas de ancho de banda de memoria, el consumo de energía del chip se eleva debido a la dificultad de los fabricantes para reducir suficientemente los voltajes de operación entre generaciones de procesadores. Esta tesis presenta innovaciones para mejorar el ancho de banda y consumo de energía en procesadores multinúcleo en el ámbito de la computación orientada a rendimiento ("throughput-aware computation"): una memoria caché de último nivel ("last-level cache" o LLC) optimizada para ancho de banda, un banco de registros vectorial optimizado para ancho de banda, y una heurística para planificar la ejecución de aplicaciones paralelas orientada a mejorar la eficiencia del consumo de potencia y desempeño. En contraste con los diseños de LLC de última generación, nuestra organización evita la duplicación de datos y, por tanto, no requiere de técnicas de coherencia. El espacio de direcciones de memoria se distribuye estáticamente en la LLC con un entrelazado de grano fino. La ausencia de replicación de datos aumenta la capacidad efectiva de la memoria caché, lo que se traduce en mejores tasas de acierto y mayor ancho de banda en comparación con una LLC coherente. Utilizamos la técnica de "doble buffering" para ocultar la latencia adicional necesaria para acceder a datos remotos. El banco de registros vectorial propuesto se compone de miles de registros y se organiza como una agregación de bancos. Incorporamos a cada banco una pequeña unidad de cómputo de propósito especial ("local computation element" o LCE). Este enfoque ---que llamamos "computación en banco de registros"--- permite superar el número limitado de puertos en el banco de registros. Debido a que cada LCE es una unidad de cómputo con soporte SIMD ("single instruction, multiple data") y todas ellas pueden proceder de forma concurrente, la estrategia de "computación en banco de registros" constituye un dispositivo SIMD altamente paralelo. Por último, presentamos una heurística para planificar la ejecución de aplicaciones paralelas orientada a reducir el consumo de energía del chip, colocando dinámicamente los hilos de ejecución a nivel de software entre los hilos de ejecución a nivel de hardware. La heurística obtiene, en tiempo de ejecución, información de consumo de potencia y desempeño del chip para inferir las características de las aplicaciones. Por ejemplo, si los hilos de ejecución a nivel de software comparten datos significativamente, la heurística puede decidir colocarlos en un menor número de núcleos para favorecer el intercambio de datos entre ellos. En tal caso, los núcleos no utilizados se pueden apagar para ahorrar energía. Cada vez es más difícil encontrar soluciones de arquitectura "a prueba de balas" para resolver las limitaciones de escalabilidad de los procesadores actuales. En consecuencia, creemos que los arquitectos deben atacar dichos problemas desde diferentes flancos simultáneamente, con innovaciones complementarias.

  • On the selection of adder unit in energy efficient vector processing

     Ratkovic, Ivan; Palomar Perez, Oscar; Stanic, Milan; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    International Symposium on Quality Electronic Design
    p. 143-150
    DOI: 10.1109/ISQED.2013.6523602
    Presentation's date: 2013-05
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Vector processors are a very promising solution for mobile devices and servers due to their inherently energy-efficient way of exploiting data-level parallelism. Previous research on vector architectures predominantly focused on performance, so vector processors require a new design space exploration to achieve low power. In this paper, we present a design space exploration of adder unit for vector processors (VA), as it is one of the crucial components in the core design with a non-negligible impact in overall performance and power. For this interrelated circuit-architecture exploration, we developed a novel framework with both architectural- and circuit-level tools. Our framework includes both design- (e.g. adder's family type) and vector architecture-related parameters (e.g. vector length). Finally, we present guidelines on the selection of the most appropriate VA for different types of vector processors according to different sets of metrics of interest. For example, we found that 2-lane configurations are more EDP (Energy×Delay)-efficient than single lane configurations for low-end mobile processors.

  • ACM Awards

     Valero Cortes, Mateo
    Award or recognition

    View View Open in new window  Share

  • Access to the full text
    Efficient cache architectures for reliable hybrid voltage operation using EDC codes  Open access

     Maric, Bojan; Abella Ferrer, Jaume; Valero Cortes, Mateo
    Design, Automation and Test in Europe
    p. 917-920
    Presentation's date: 2013-03
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Semiconductor technology evolution enables the design of sensor-based battery-powered ultra-low-cost chips (e.g., below 1 ¿) required for new market segments such as body, urban life and environment monitoring. Caches have been shown to be the highest energy and area consumer in those chips. This paper proposes a novel, hybrid-operation (high Vcc, ultra-low Vcc), single-Vcc domain cache architecture based on replacing energy-hungry bitcells (e.g., 10T) by more energy-efficient and smaller cells (e.g., 8T) enhanced with Error Detection and Correction (EDC) features for high reliability and performance predictability. Our architecture is proven to largely outperform existing solutions in terms of energy and area.

    Semiconductor technology evolution enables the design of sensor-based battery-powered ultra-low-cost chips (e.g., below 1 p) required for new market segments such as body, urban life and environment monitoring. Caches have been shown to be the highest energy and area consumer in those chips. This paper proposes a novel, hybrid-operation (high Vcc, ultra-low Vcc), single-Vcc domain cache architecture based on replacing energy-hungry bitcells (e.g., 10T) by more energy-efficient and smaller cells (e.g., 8T) enhanced with Error Detection and Correction (EDC) features for high reliability and performance predictability. Our architecture is proven to largely outperform existing solutions in terms of energy and area.

    Postprint (author’s final draft)

  • Computación de Altas Prestaciones VI

     Valero Cortes, Mateo
    Competitive project

     Share

  • APPLE: Adaptive performance-predictable low-energy caches for reliable hybrid voltage operation

     Maric, Bojan; Abella Ferrer, Jaume; Valero Cortes, Mateo
    Design Automation Conference
    p. 1-8
    DOI: 10.1145/2463209.2488837
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Semiconductor technology evolution enables the design of resource-constrained battery-powered ultra-low-cost chips required for new market segments such as environment, urban life and body monitoring. Caches have been shown to be the main energy and area consumer in those chips. This paper proposes simple, hybrid-operation (high Vcc, ultra-low Vcc), single-Vcc domain Adaptive Performance- Predictable Low-Energy (APPLE) L1 cache designs based on replacing energy-hungry SRAM cells by more energy-efficient and smaller cells enhanced with extra cache lines set up in an adapted victim cache to still enable strong performance guarantees. APPLE caches are proven to largely outperform existing solutions in terms of energy and area efficiency.

  • Global misrouting policies in two-level hierarchical networks

     Garcia, Marina; Vallejo, Enrique; Beivide Palacio, Ramon; Odriozola, Miguel; Camarero Coterillo, Cristobal; Valero Cortes, Mateo; Labarta Mancho, Jesus Jose; Rodriguez, German
    International Workshop on Interconnection Network Architectures: On-Chip, Multi-Chip
    p. 13-16
    DOI: 10.1145/2482759.2482763
    Presentation's date: 2013-01
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Dragonfly networks are composed of interconnected groups of routers. Adaptive routing allows packets to be forwarded minimally or non-minimally adapting to the traffic conditions in the network. While minimal routing sends traffic directly between groups, non-minimal routing employs an intermediate group to balance network load. A random selection of this intermediate group (denoted as RRG) typically implies an extra local hop in the source group, what increases average path length and can reduce performance. In this paper we identify different policies for the selection of such intermediate group and explore their performance. Interestingly, simulation results show that an eager policy (denoted as CRG) that selects the intermediate group only between those directly connected to the ongoing router causes starvation in some network nodes. On the contrary, the best performance is obtained by a "mixed mode" policy (denoted as MM) that adds a local hop when the packet has moved away from the source router

    Dragonfly networks are composed of interconnected groups of routers. Adaptive routing allows packets to be forwarded minimally or non-minimally adapting to the traffic conditions in the network. While minimal routing sends traffic directly between groups, non-minimal routing employs an intermediate group to balance network load. A random selection of this intermediate group (denoted as RRG) typically implies an extra local hop in the source group, what increases average path length and can reduce performance. In this paper we identify different policies for the selection of such intermediate group and explore their performance. Interestingly, simulation results show that an eager policy (denoted as CRG) that selects the intermediate group only between those directly connected to the ongoing router causes starvation in some network nodes. On the contrary, the best performance is obtained by a by a \mixed mode" policy (denoted as MM) that adds a local hop when the packet has moved away from the source router.

  • Identifying critical code sections in dataflow programming models

     Subotic, Vladimir; Sancho, Jose Carlos; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    Euromicro International Conference on Parallel, Distributed, and Network-Based Processing
    p. 29-37
    DOI: 10.1109/PDP.2013.15
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    The years of practice in optimizing applications point that the major issue is focus - identifying the critical code section whose optimization would yield the highest overall speedup. While this issue is mainly solved for sequential applications, it remains a serious hurdle in the world of parallel computing. Furthermore, the newest dataflow parallel programming models expose very irregular parallelism, making the identification of the critical code section even harder. To address this issue, we designed an environment that identifies critical code sections in applications. The programmer can use this environment to estimate the potential benefits of the optimization for a specific parallel platform. This is very important because the programmer can anticipate the benefits of his optimization and assure that the optimization is worth the effort. Furthermore, we showed that in many applications, the choice of the critical code section decisively depends on the configuration of the target machine. For instance, in HP Linpack, optimizing a task that takes 0.49% of the total computation time yields the overall speedup of less than 0.25% on one machine, and at the same time, yields the overall speedup of more than 24% on a machine with different number of cores.

  • Vector extensions for decision support DBMS acceleration

     Hayes, Timothy; Palomar Perez, Oscar; Unsal, Osman Sabri; Cristal Kestelman, Adrián; Valero Cortes, Mateo
    IEEE/ACM International Symposium on Microarchitecture
    p. 166-176
    DOI: 10.1109/MICRO.2012.24
    Presentation's date: 2012-12-04
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Database management systems (DBMS) have become an essential tool for industry and research and are often a significant component of data centres. As a result of this criticality, efficient execution of DBMS engines has become an important area of investigation. This work takes a top-down approach to accelerating decision support systems (DSS) on x86-64 microprocessors using vector ISA exten- sions. In the first step, a leading DSS DBMS is analysed for potential data-level parallelism. We discuss why the existing multimedia SIMD extensions (SSE/AVX) are not suitable for capturing this parallelism and propose a complementary instruction set reminiscent of classical vector architectures. The instruction set is implemented using unin- trusive modifications to a modern x86-64 microarchitecture tailored for DSS DBMS. The ISA and microarchitecture are evaluated using a cycle-accurate x86-64 microarchitectural simulator coupled with a highly-detailed memory simulator. We have found a single oper- ator is responsible for 41% of total execution time for the TPC-H DSS benchmark. Our results show performance speedups between 1.94x and 4.56x for an implementation of this operator run with our proposed hardware modifications.

  • Improving cache management policies using dynamic reuse distances

     Duong, Nam; Zhao, Dali; Kim, Taesu; Cammarota, Rosario; Valero Cortes, Mateo; Veidenbaum, Alexander V
    IEEE/ACM International Symposium on Microarchitecture
    p. 389-400
    DOI: 10.1109/MICRO.2012.43
    Presentation's date: 2012-12
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Cache management policies such as replacement, bypass, or shared cache partitioning have been relying on data reuse behavior to predict the future. This paper proposes a new way to use dynamic reuse distances to further improve such policies. A new replacement policy is proposed which prevents replacing a cache line until a certain number of accesses to its cache set, called a Protecting Distance (PD). The policy protects a cache line long enough for it to be reused, but not beyond that to avoid cache pollution. This can be combined with a bypass mechanism that also relies on dynamic reuse analysis to bypass lines with less expected reuse. A miss fetch is bypassed if there are no unprotected lines. A hit rate model based on dynamic reuse history is proposed and the PD that maximizes the hit rate is dynamically computed. The PD is recomputed periodically to track a program's memory access behavior and phases...

  • The Multi-State Processors

     Gonzalez Martin, Isidro
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

     Share Reference managers Reference managers Open in new window

  • Doctor Honoris Causa en Informàtica

     Valero Cortes, Mateo
    Award or recognition

    View View Open in new window  Share

  • Hardware/software techniques for assisted execution runtime systems

     Kestor, Gokcen; Gioiosa, Roberto; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    Runtime Environments, Systems, Layering and Virtualized Environments
    p. I-VIII
    Presentation's date: 2012-03
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    The increasing complexity of modern and future multi-core/multi- threaded processors rises the question of how to best utilize pro- cessor resources. On one side, Amdahl¿s Law limits the max- imum theoretical speedup of parallel applications while, on the other side, the increasing complexity of runtime programming lan- guage may introduce implicit serialization points. Several studies demonstrated that it is often more convenient to use some of the hardware threads to assist execution than running supplementary application threads. Assisted execution approaches, however, may lead to low pro- cessor utilization: in this paper, we explore fine-grained hardware resource allocation techniques to assign hardware resources to ap- plication and auxiliary threads at runtime, according to their actual computing power demand. As a test case, we apply fine-grained resource allocation to STM 2 , the first parallel STM system that of- fload STM management operations to auxiliary threads. We im- plemented an integrated hardware/software solution in which each level performs well-defined tasks efficiently: 1) STM 2 is enriched with a runtime mechanism to express computing power require- ments of application and auxiliary threads; 2) the hardware en- forces dynamic resource partitioning among running threads; 3) the operating system provides a simple and efficient interface between STM 2 and the hardware resource allocation mechanism. In this paper, we leverage the IBM POWER7 hardware thread prioritization mechanism to bias the allocation of hardware resources in favor of computing intensive application threads or overloaded auxiliary threads. We test fine-grained resource allocation solutions on a real IBM POWER7 system running a simple and malleable TM benchmark (Eigenbench) and applications from the STAMP benchmark suite. Results show that the proposed integrated so- lution achieves up to 65% and 11% of performance improvement over the standard STM 2 design for Eigenbench and STAMP appli- cations, respectively

  • Enhancing the performance of assisted execution runtime systems

     Kestor, Gokcen; Gioiosa, Roberto; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    International Conference on Architectural Support for Programming Languages and Operating Systems
    Presentation's date: 2012-03
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    To meet the expected performance, future exascale systems will require programmers to increase the level of parallelism of their applications. Novel programming models simplify parallel programming at the cost of increasing runtime overheard. Assisted execution models have the potential of reducing this overhead but they generally also reduce processor utilization. We propose an integrated hardware/software solution that automatically partition hardware resources between application and auxiliary threads. Each system level performs well-defined tasks efficiently: 1) the runtime system is enriched with a mechanism that automatically detects computing power requirements of running threads and drives the hardware actuators; 2) the hardware enforces dynamic resource partitioning; 3) the operating system provides an efficient interface between the runtime system and the hardware resource allocation mechanism. As a test case, we apply this adaptive approach to STM2, an software transactional memory system that implements the assisted execution model. We evaluate the proposed adaptive solution on an IBM POWER7 system using Eigenbench and STAMP benchmark suite. Results show that our approach performs equal or better than the original STM2 and achieves up to 65% and 86% performance improvement for Eigenbench and STAMP applications, respectively.

  • HIPEAC 3 - European Network of Excellence on HighPerformance Embedded Architecture and Compilers

     Gil Gómez, Maria Luisa; Navarro Mas, Nacho; Martorell Bofill, Xavier; Valero Cortes, Mateo; Ayguade Parra, Eduard; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Labarta Mancho, Jesus Jose; Llaberia Griño, Jose M.
    Competitive project

     Share

  • TIN2012-34557 - Computación de Altas Prestaciones VI

     Valero Cortes, Mateo; Monreal Arnal, Teresa
    Competitive project

     Share

  • Advanced Grant European Research Council (convocatòria 2012)

     Valero Cortes, Mateo
    Award or recognition

     Share

  • ADAM : an efficient data management mechanism for hybrid high and ultra-low voltage operation caches

     Maric, Bojan; Abella Ferrer, Jaume; Valero Cortes, Mateo
    ACM Great Lakes Symposium on VLSI
    p. 245-250
    DOI: 10.1145/2206781.2206840
    Presentation's date: 2012
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Semiconductor technology evolution enables the design of ultra-low-cost chips (e.g., below 1 USD) required for new market segments such as environment, urban life and body monitoring, etc. Recently, hybrid-operation (high Vcc, ultra-low Vcc) single-Vcc-domain cache designs have been proposed to tackle the needs of those chips. However, existing data management policies are far from being optimal during high Vcc operation. This paper presents ADAM, a new and extremely simple Adaptive Data Management mechanism, which is tailored to detect hit distribution and changing application conditions dynamically at ne grain with negligible hardware overhead. ADAM is proven to save signi cant energy (29% on average) in L1 caches with negligible performance degradation (1.7% on average), thus improving the energy-delay product (EDP) noticeably across di erent cache con gurations with respect to all existing data management approaches.

  • Optimal task assignment in multithreaded processors: a statistical approach

     Cakarevic, Vladimir; Radojkovic, Petar; Moreto Planas, Miquel; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Cazorla, Francisco J.; Nemirovsky, Mario; Valero Cortes, Mateo
    International Conference on Architectural Support for Programming Languages and Operating Systems
    p. 235-248
    DOI: 10.1145/2150976.2151002
    Presentation's date: 2012
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    The introduction of massively multithreaded (MMT) processors, comprised of a large number of cores with many shared resources, has made task scheduling, in particular task to hardware thread assignment, one of the most promising ways to improve system performance. However, finding an optimal task assignment for a workload running on MMT processors is an NP-complete problem. Due to the fact that the performance of the best possible task assignment is unknown, the room for improvement of current task-assignment algorithms cannot be determined. This is a major problem for the industry because it could lead to: (1)~A waste of resources if excessive effort is devoted to improving a task assignment algorithm that already provides a performance that is close to the optimal one, or (2)~significant performance loss if insufficient effort is devoted to improving poorly-performing task assignment algorithms. In this paper, we present a method based on Extreme Value Theory that allows the prediction of the performance of the optimal task assignment in MMT processors. We further show that executing a sample of several hundred or several thousand random task assignments is enough to obtain, with very high confidence, an assignment with a performance that is close to the optimal one. We validate our method with an industrial case study for a set of multithreaded network applications running on an UltraSPARC~T2 processor.

  • STM2: A parallel STM for high performance simultaneous multithreading systems

     Kestor, Gokcen; Gioiosa, Roberto; Harris, Tim; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Hur, Ibrahim; Valero Cortes, Mateo
    International Conference on Parallel Architectures and Compilation Techniques
    p. 221-231
    DOI: 10.1109/PACT.2011.54
    Presentation's date: 2011-10
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Extracting high performance from modern chip multithreading (CMT) processors is a complex task, especially for large CMT systems. Programmers must efficiently parallelize performance-critical software while avoiding deadlocks and race conditions. Transactional memory (TM) is a promising programming model that allows programmers to focus on parallelism rather than maintaining correctness and avoiding deadlock. Software-only implementations (STMs) are especially compelling because they run on commodity hardware, therefore providing high portability. Unfortunately, STM systems usually suffer from high overheads, which may limit their usage especially at scale. In this paper we present STM2, a novel parallel STM designed for high performance, aggressive multithreading systems. STM2 significantly lowers runtime overhead by offloading read-set validation, bookkeeping and conflict detection to auxiliary threads running on sibling hardware threads. Auxiliary threads perform STM operations in parallel with their paired application threads and absorb STM overhead, significantly improving performance. We exploit the fact that, on modern multi-core processors, sets of cores can share L1 or L2 caches. This lets us achieve closer coupling between the application thread and the auxiliary thread (when compared with a traditional multi-processor systems). Our results, performed on an IBM POWER7 machine, a state-of-the-art, aggressive multi-threaded system, show that our approach outperforms several well-known STM implementations. In particular, STM2 shows speedups between 1.8x and 5.2x over the tested STM systems, on average, with peaks up to 12.8x.

  • CASTELL: A HETEROGENEOUS CMP ARCHITECTURE SCALABLE TO HUNDREDS OF PROCESSORS  Open access

     Cabarcas Jaramillo, Felipe
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Technology improvements and power constrains have taken multicore architectures to dominate microprocessor designs over uniprocessors. At the same time, accelerator based architectures have shown that heterogeneous multicores are very efficient and can provide high throughput for parallel applications, but with a high-programming effort. We propose Castell a scalable chip multiprocessor architecture that can be programmed as uniprocessors, and provides the high throughput of accelerator-based architectures. Castell relies on task-based programming models that simplify software development. These models use a runtime system that dynamically finds, schedules, and adds hardware-specific features to parallel tasks. One of these features is DMA transfers to overlap computation and data movement, which is known as double buffering. This feature allows applications on Castell to tolerate large memory latencies and lets us design the memory system focusing on memory bandwidth. In addition to provide programmability and the design of the memory system, we have used a hierarchical NoC and added a synchronization module. The NoC design distributes memory traffic efficiently to allow the architecture to scale. The synchronization module is a consequence of the large performance degradation of application for large synchronization latencies. Castell is mainly an architecture framework that enables the definition of domain-specific implementations, fine-tuned to a particular problem or application. So far, Castell has been successfully used to propose heterogeneous multicore architectures for scientific kernels, video decoding (using H.264), and protein sequence alignment (using Smith-Waterman and clustalW). It has also been used to explore a number of architecture optimizations such as enhanced DMA controllers, and architecture support for task-based programming models. iii

  • PARALLEL VIDEO DECODING  Open access

     Alvarez Mesa, Mauricio
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Digital video is a popular technology used in many different applications. The quality of video, expressed in the spatial and temporal resolution, has been increasing continuously in the last years. In order to reduce the bitrate required for its storage and transmission, a new generation of video encoders and decoders (codecs) have been developed. The latest video codec standard, known as H.264/AVC, includes sophisticated compression tools that require more computing resources than any previous video codec. The combination of high quality video and the advanced compression tools found in H.264/AVC has resulted in a significant increase in the computational requirements of video decoding applications. The main objective of this thesis is to provide the performance required for real-time operation of high quality video decoding using programmable architectures. Our solution has been the simultaneous exploitation of multiple levels of parallelism. On the one hand, video decoders have been modified in order to extract as much parallelism as possible. And, on the other hand, general purpose architectures has been enhanced for exploiting the type of parallelism that is present in video codec applications.

    El vídeo digital es una tecnología popular utilizada en una gran variedad de aplicaciones. La calidad de vídeo, expresada en la resolución espacial y temporal, ha ido aumentando constantemente en los últimos años. Con el fin de reducir la tasa de bits requerida para su almacenamiento y transmisión, se ha desarrollado una nueva generación de codificadores y decodificadores (códecs) de vídeo. El códec estándar de vídeo más reciente, conocido como H.264/AVC, incluye herramientas sofisticadas de compresión que requieren más recursos de computación que los códecs de vídeo anteriores. El efecto combinado del vídeo de alta calidad y las herramientas de compresión avanzada incluidas en el H.264/AVC han llevado a un aumento significativo de los requerimientos computacionales de la decodificación de vídeo. El objetivo principal de esta tesis es proporcionar el rendimiento necesario para la decodificación en tiempo real de vídeo de alta calidad. Nuestra solución ha sido la explotación simultánea de múltiples niveles de paralelismo. Por un lado, se realizaron modificaciones en el decodificador de vídeo con el fin de extraer múltiples niveles de paralelismo. Y, por otro lado, se modificaron las arquitecturas de propósito general para mejorar la explotación del tipo paralelismo que está presente en las aplicaciones de vídeo. Primero hicimos un análisis de la escalabilidad de dos extensiones de Instrucción Simple con Múltiples Datos (SIMD por sus siglas en inglés): una de una dimensión (1D) y otra matricial de dos dimensiones (2D). Se demostró que al escalar la extensión 2D se obtiene un mayor rendimiento con una menor complejidad que al escalar la extensión 1D. Luego se realizó una caracterización de la decodificación de H.264/AVC en aplicaciones de alta definición (HD) donde se identificaron los núcleos principales. Debido a la falta de un punto de referencia (benchmark) adecuado para la decodificación de vídeo HD, desarrollamos uno propio, llamado HD-VideoBench el cual incluye aplicaciones completas de codificación y decodificación de vídeo junto con una serie de secuencias de vídeo en HD. Después optimizamos los núcleos más importantes del decodificador H.264/AVC usando instrucciones SIMD. Sin embargo, los resultados no alcanzaron el máximo rendimiento posible debido al efecto negativo de la desalineación de los datos en memoria. Como solución, evaluamos el hardware y el software necesarios para realizar accesos no alineados. Este soporte produjo mejoras significativas de rendimiento en la aplicación. Aparte se realizó una investigación sobre cómo extraer paralelismo de nivel de tarea. Se encontró que ninguno de los mecanismos existentes podía escalar para sistemas masivamente paralelos. Como alternativa, desarrollamos un nuevo algoritmo que fue capaz de encontrar miles de tareas independientes al explotar paralelismo de nivel de macrobloque. Luego implementamos una versión paralela del decodificador de H.264 en una máquina de memoria compartida distribuida (DSM por sus siglas en inglés). Sin embargo esta implementación no alcanzó el máximo rendimiento posible debido al impacto negativo de las operaciones de sincronización y al efecto del núcleo de decodificación de entropía. Con el fin de eliminar estos cuellos de botella se evaluó la paralelización al nivel de imagen de la fase de decodificación de entropía combinada con la paralelización al nivel de macrobloque de los demás núcleos. La sobrecarga de las operaciones de sincronización se eliminó casi por completo mediante el uso de operaciones aceleradas por hardware. Con todas las mejoras presentadas se permitió la decodificación, en tiempo real, de vídeo de alta definición y alta tasa de imágenes por segundo. Como resultado global se creó una solución escalable capaz de usar el número creciente procesadores en las arquitecturas multinúcleo.

  • Linear programming based parallel job scheduling for power constrained systems

     Etinski, Maja; Corbalan Gonzalez, Julita; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    International Conference on High Performance Computing and Simulation
    p. 72-80
    DOI: 10.1109/HPCSim.2011.5999809
    Presentation's date: 2011-07
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • An abstraction methodology for the evaluation of multi-core multi-threaded architectures

     Zilan, Ruken; Verdu Mula, Javier; Garcia Vidal, Jorge; Nemirovsky, Mario; Milito, Rodolfo; Valero Cortes, Mateo
    IEEE International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems
    p. 478-481
    DOI: 10.1109/MASCOTS.2011.11
    Presentation's date: 2011-07-25
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    As the evolution of multi-core multi-threaded processors continues, the complexity demanded to perform an extensive trade-off analysis, increases proportionally. Cycle-accurate or trace-driven simulators are too slow to execute the large amount of experiments required to obtain indicative results. To achieve a thorough analysis of the system, software benchmarks or traces are required. In many cases when an analysis is needed most, during the earlier stages of the processor design, benchmarks or traces are not available. Analytical models overcome these limitations but do not provide the fine grain details needed for a deep analysis of these architectures. In this work we present a new methodology to abstract processor architectures, at a level between cycle-accurate and analytical simulators. To apply our methodology we use queueing modeling techniques. Thus, we introduce Q-MAS, a queueing based tool targeting a real chip (the Ultra SPARC T2 processor) and aimed at facilitating the quantification of trade-offs during the design phase of multi-core multi-threaded processor architectures. The results demonstrate that Q-MAS, the tool that we developed, provides accurate results very close to the actual hardware, with a minimal cost of running what-if scenarios.

  • RVC-based time-predictable faulty caches for safety-critical systems

     Abella Ferrer, Jaume; Quiñones Moreno, Eduardo; Cazorla, Francisco J.; Valero Cortes, Mateo; Sazeides, Yanos
    IEEE International On-Line Testing Symposium
    p. 25-30
    DOI: 10.1109/IOLTS.2011.5993806
    Presentation's date: 2011-07-15
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Programming, Debugging, Profiling and Optimizing Transactional Memory Programs  Open access

     Hasanov Zyulkyarov, Ferad
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Transactional memory (TM) is a new optimistic synchronization technique which has the potential of making shared memory parallel programming easier compared to locks without giving up from the performance. This thesis explores four aspects in the research of transactional memory. First, it studies how programming with TM compares to locks. During the course of work, it develops the first real transactional application ¿ AtomicQuake. AtomicQuake is adapted from the parallel version of the Quake game server by replacing all lock-based synchronization with atomic blocks. Findings suggest that programming with TM is indeed easier than locks. However the performance of current software TM systems falls behind the efficiently implemented lock-based versions of the same program. Also, the same findings report that the proposed language level extensions are not sufficient for developing robust production level software and that the existing development tools such as compilers, debuggers, and profilers lack support for developing transactional application. Second, this thesis introduces new set of debugging principles and abstractions. These new debugging principles and abstractions enable debugging synchronization errors which manifest at coarse atomic block level, wrong code inside atomic blocks, and also performance errors related to the implementation of the atomic block. The new debugging principles distinguish between debugging at the language level constructs such as atomic blocks and debugging the atomic blocks based on how they are implemented whether TM or lock inference. These ideas are demonstrated by implementing a debugger extension for WinDbg and the ahead-of-time C# to X86 Bartok-STM compiler. Third, this thesis investigates the type of performance bottlenecks in TM applications and introduces new profiling techniques to find and understand these bottlenecks. The new profiling techniques provide in-depth and comprehensive information about the wasted work caused by aborting transactions. The individual profiling abstractions can be grouped in three groups: (i) techniques to identify multiple conflicts from a single program run, (ii) techniques to describe the data structures involved in conflicts by using a symbolic path through the heap, rather than a machine address, and (iii) visualization techniques to summarize which transactions conflict most. The ideas were demonstrated by building a lightweight profiling framework for Bartok-STM and an offline tool which process and display the profiling data. Forth, this thesis explores and introduces new TM specific optimizations which target the wasted work due to aborting transactions. Using the results obtained with the profiling tool it analyzes and optimizes several applications from the STAMP benchmark suite. The profiling techniques effectively revealed TM-specific bottlenecks such as false conflicts and contentions accesses to data structures. The discovered bottlenecks were subsequently eliminated with using the new optimization techniques. Among the optimization highlights are the transaction checkpoints which reduced the wasted work in Intruder with 40%, decomposing objects to eliminate false conflicts in Bayes, early release in Labyrinth which decreased wasted work from 98% to 1%, using less contentions data structures such as chained hashtable in Intruder and Genome which have higher degree of parallelism.

  • Access to the full text
    Object oriented execution model (OOM)  Open access

     Markovic, Nikola; Nemirovsky, Daniel; González Blanco, Ruben; Unsal, Osman Sabri; Valero Cortes, Mateo; Cristal Kestelman, Adrian
    New Directions in Computer Architecture
    Presentation's date: 2011-06
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    This paper considers implementing the Object Oriented Programming Model directly in the hardware to serve as a base to exploit object-level parallelism, speculation and heterogeneous computing. Towards this goal, we present a new execution model called Object Oriented execution Model - OOM - that implements the OO Programming Models. All OOM hardware structures are objects and the OOM Instruction Set directly utilizes objects while hiding other complex hardware structures. OOM maintains all high-level programming language information until execution time. This enables efficient extraction of available parallelism in OO serial code at execution time with minimal compiler support. Our results show that OOM utilizes the available parallelism better than the OoO (Out-of-Order) model

  • Hybrid high-performance low-power and ultra-low energy reliable caches

     Maric, Bojan; Abella Ferrer, Jaume; Cazorla Almeida, Francisco Javier; Valero Cortes, Mateo
    ACM International Conference on Computing Frontiers
    p. 1-2
    DOI: 10.1145/2016604.2016619
    Presentation's date: 2011-05
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Ubiquitous computing has become a very popular paradigm. The most suitable technological solution for those systems consists of using hybrid processors able to operate at high voltage to provide high performance and at near-/sub-threshold voltage to provide ultra-low energy consumption. This paper studies different non-hybrid and hybrid SRAM L1 cache designs using several SRAM cell types and compare them in terms of delay, dynamic energy, leakage power and area.

  • Integrating dataflow abstractions into transactional memory

     Gajinov, Vladimir; Milovanovic, Milos; Unsal, Osman Sabri; Cristal Kestelman, Adrián; Ayguade Parra, Eduard; Valero Cortes, Mateo
    Workshop on Systems for Future Multi-Core Architectures
    p. 1-7
    Presentation's date: 2011-04-16
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Many concurrent programs require some form of conditional synchronization to coordinate the execution of different program tasks. Programming these algorithms using transactional memory (TM) often results in a high conflict rate between transactions. In this paper we propose an Atomic dataflow model - ADF, which aims to reduce transaction conflicts by incorporating dataflow scheduling principles into transactional memory. The ADF model is based on the execution of atomic units of work called ADF tasks. A programmer explicitly defines data dependencies for the ADF task using the trigger set extension. Trigger set data is implicitly tracked by the TM runtime system, which detects changes and enables the re-execution of a transaction when its dependencies are satisfied. In this paper we fully describe the ADF model, present its syntax and show advantages of the model on a practical example.

  • Trace-driven simulation of multithreaded applications

     Rico Carro, Alejandro; Duran González, Alejandro; Cabarcas Jaramillo, Felipe; Etsion, Yoav; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    IEEE International Symposium on Performance Analysis of Systems and Software
    p. 87-96
    DOI: 10.1109/ISPASS.2011.5762718
    Presentation's date: 2011-04-12
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Over the past few years, computer architecture research has moved towards execution-driven simulation, due to the inability of traces to capture timing-dependent thread execution interleaving. However, trace-driven simulation has many advantages over execution-driven that are being missed in multithreaded application simulations. We present a methodology to properly simulate multithreaded applications using trace-driven environments. We distinguish the intrinsic application behavior from the computation for managing parallelism. Application traces capture the intrinsic behavior in the sections of code that are independent from the dynamic multithreaded nature, and the points where parallelism-management computation occurs. The simulation framework is composed of a trace-driven simulation engine and a dynamic-behavior component that implements the parallelism-management operations for the application. Then, at simulation time, these operations are reproduced by invoking their implementation in the dynamic-behavior component. The decisions made by these operations are based on the simulated architecture, allowing to dynamically reschedule sections of code taken from the trace to the target simulated components. As the captured sections of code are independent from the parallel state of the application, they can be simulated on the trace-driven engine, while the parallelism-management operations, that require to be re-executed, are carried out by the execution-driven component, thus achieving the best of both trace- and execution-driven worlds. This simulation methodology creates several new research opportunities, including research on scheduling and other parallelism-management techniques for future architectures, and hardware support for programming models.

  • IA3: An interference aware allocation algorithm for multicore hard real-time systems

     Paolieri, Marco; Quiñones Moreno, Eduardo; Cazorla Almeida, Francisco Javier; Davis, R.; Valero Cortes, Mateo
    The IEEE Real-Time and Embedded Technology and Applications Symposium
    p. 280-290
    DOI: 10.1109/RTAS.2011.34
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    In multicore processors, the execution environment is defined as the environment in which tasks run and it is determined by the hardware resources they get and the workload with which they are executed. Thus, different execution environments lead to different inter-task interferences accessing shared hardware resources due to conflicts with the other corunning tasks, making the WCET estimation of a task dependent on the execution environment in which it runs. Despite such dependency, current partitioned scheduling approaches use a single WCET estimation per task: typically the highest for all execution environments in which a task runs. In this paper we introduce IA3: an interference-aware allocation algorithm that considers not a single WCET estimation but a set of WCET estimations per task. IA3 is based on two novel concepts: the WCET-matrix and the WCET-sensitivity. The former associates every WCET estimation with its corresponding execution environment. The latter measures the impact of changing the execution environment on the WCET estimation. This allows IA3 to reduce the number of resources required to schedule a given taskset. In particular, our results show that in a four-core processor considering tasksets with a total utilization of 2.9, IA3 is able to schedule 70% of the tasksets using 3-cores while a classical partitioned approach with a First-Fit Decreasing heuristic is able to schedule only 5% of the tasksets using 3-cores.

  • RMS-TM: a comprehensive benchmark suite for transactional memory systems

     Kestor, Gokcen; Karakostas, Vasileios; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Hur, Ibrahim; Valero Cortes, Mateo
    ACM/SPEC International Conference on Performance Engineering
    p. 335-346
    DOI: 10.1145/1958746.1958795
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Transactional Memory (TM) has been proposed as an alternative concurrency mechanism for the shared memory parallel programming model. Its main goal is to make parallel programming for Chip Multiprocessors (CMPs) easier than using the traditional lock synchronization constructs, without compromising the performance and the scalability. This topic has received substantial research attention and several TM designs have been proposed using various TM benchmarks. We believe that the evaluation of TM proposals would be more solid if it included realistic applications, that address on-going TM research issues, and that provide the potential for straightforward comparison against locks. In this paper, we introduce RMS-TM, a Transactional Memory benchmark suite composed of seven real-world applications from the Recognition, Mining and Synthesis (RMS) domain. In addition to featuring current TM research issues such as nesting and I/O and system calls inside transactions, the RMS-TM applications also provide a mix of short and long transactions with small/large read and write sets with low/medium/high contention rates. These characteristics, as well as providing lock-based versions of the applications, make RMS-TM a useful TM tool. Current TM benchmarks do not explore all these features. In our evaluation with selected STM and HTM systems, we find that our benchmark suite is also scalable, which is useful for evaluating TM designs on high core counts.

  • Access to the full text
    From plasma to beefarm: Design experience of an FPGA-based multicore prototype  Open access

     Sonmez, N.; Arcas Abella, Oriol; Sayilar, G.; Unsal, Osman Sabri; Cristal Kestelman, Adrián; Hur, Ibrahim; Singh, S.; Valero Cortes, Mateo
    International Symposium on Applied Reconfigurable Computing
    p. 350-362
    DOI: 10.1007/978-3-642-19475-7_37
    Presentation's date: 2011
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    In this paper, we take a MIPS-based open-source uniprocessor soft core, Plasma, and extend it to obtain the Beefarm infrastructure for FPGA-based multiprocessor emulation, a popular research topic of the last few years both in the FPGA and the computer architecture communities. We discuss various design tradeo s and we demonstrate superior scalability through experimental results compared to traditional software instruction set simulators. Based on our experience of designing and building a complete FPGA-based multiprocessor emulation system that supports runtime and compiler infrastructure and on the actual executions of our experiments running Software Transactional Memory (STM) benchmarks, we comment on the pros, cons and future trends of using hardware-based emulation for research.

    Postprint (author’s final draft)

  • RVC: A mechanism for time-analyzable real-time processors with faulty caches

     Abella Ferrer, Jaume; Quiñones Moreno, Eduardo; Cazorla, Francisco J.; Sazeides, Yanos; Valero Cortes, Mateo
    International Conference on High Performance and Embedded Architectures and Compilers
    p. 97-106
    DOI: 10.1145/1944862.1944878
    Presentation's date: 2011-01
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Geometry scaling due to technology evolution as well as Vcc scaling lead to failures in large SRAM arrays such as caches. Faulty bits can be tolerated from the average performance perspective, but make critical realtime embedded systems non time-analyzable or worstcase execution time (WCET) estimations unacceptably large. This paper proposes a mechanism to tolerate faulty bits in caches while still providing safe and tightWCET. Our solution is based on adapting structures such as the victim cache, cache eviction buffers or miss state handle registers to serve as replacement for faulty cache storage. We show how modest modifications in the hardware help providing safe and tight WCET on the face of permanent faulty bits with negligible impact in power and performance.

  • Access to the full text
    Task superscalar: an out-of-order task pipeline  Open access

     Etsion, Yoav; Cabarcas Jaramillo, Felipe; Rico Carro, Alejandro; Ramirez Bellido, Alejandro; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    IEEE/ACM International Symposium on Microarchitecture
    p. 89-100
    DOI: 10.1109/MICRO.2010.13
    Presentation's date: 2010-12-07
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    We present Task Superscalar, an abstraction of instruction-level out-of-order pipeline that operates at the tasklevel. Like ILP pipelines, which uncover parallelism in a sequential instruction stream, task superscalar uncovers tasklevel parallelism among tasks generated by a sequential thread. Utilizing intuitive programmer annotations of task inputs and outputs, the task superscalar pipeline dynamically detects intertask data dependencies, identifies task-level parallelism, and executes tasks out-of-order. Furthermore, we propose a design for a distributed task superscalar pipeline frontend, that can be embedded into any manycore fabric, and manages cores as functional units. We show that our proposed mechanism is capable of driving hundreds of cores simultaneously with non-speculative tasks, which allows our pipeline to sustain work windows consisting of tens of thousands of tasks. We further show that our pipeline can maintain a decode rate faster than 60ns per task and dynamically uncover data dependencies among as many as ~50,000 in-flight tasks, using 7MB of on-chip eDRAM storage. This configuration achieves speedups of 95–255x (average 183x) over sequential execution for nine scientific benchmarks, running on a simulated CMP with 256 cores. Task superscalar thus enables programmers to exploit manycore systems effectively, while simultaneously simplifying their programming model.