Scientific and technological production

1 to 50 of 1109 results
  • PAMS: pattern aware memory system for embedded systems

     Hussain, Tassadaq; Sonmez, N.; Palomar Perez, Oscar; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Ayguade Parra, Eduard; Valero Cortes, Mateo; Gursal, Shakaib A.
    International Conference on ReConFigurable Computing and FPGAs
    p. 1-7
    DOI: 10.1109/ReConFig.2014.7032544
    Presentation's date: 2014-12-09
    Presentation of work at congresses


    In this paper, we propose a hardware mechanism for embedded multi-core memory systems called the Pattern Aware Memory System (PAMS). PAMS supports static and dynamic data structures using descriptors and specialized memory, and reduces area, cost, energy consumption, and hit latency. When compared with a baseline memory system, PAMS consumes between 3x and 9x less program memory for static data structures and between 1.13x and 2.66x less for dynamic data structures. Results for benchmark applications with static and dynamic data structures show that PAMS consumes 20% less hardware resources and 32% less on-chip power, and achieves a maximum speedup of 52x and 2.9x for static and dynamic data structures respectively. The results also show that the PAMS multi-core system transfers data structures up to 4.65x faster than the MicroBlaze baseline system.

  • Consolider-ingenio 2014 Supercomputación y e-Ciencia

     Valero Cortes, Mateo; Monreal Arnal, Teresa
    Competitive project


  • Hybrid cache designs for reliable hybrid high and ultra-low voltage operation

     Maric, Bojan; Abella Ferrer, Jaume; Cazorla, Francisco J.; Valero Cortes, Mateo
    ACM transactions on design automation of electronic systems
    Vol. 20, num. 1, p. Article No. 10-
    DOI: 10.1145/2658988
    Date of publication: 2014-11-01
    Journal article


    Geometry scaling of semiconductor devices enables the design of ultra-low-cost (e.g., below 1 USD) battery-powered resource-constrained ubiquitous devices for environment, urban life, and body monitoring. These sensor-based devices require high performance to react to infrequent events, as well as extreme energy efficiency to extend battery lifetime during the majority of time when low performance suffices. In addition, they require real-time guarantees. The most suitable technological solution for these devices consists of using hybrid processors able to operate at (i) high voltage, to provide high performance, and (ii) near-/subthreshold voltage, to provide ultra-low energy consumption. However, the most efficient SRAM memories for each voltage level differ, and trading off different SRAM designs is mandatory. This is particularly true for cache memories, which occupy most of the processor's area. In this article, we propose new, simple, single-Vcc-domain hybrid L1 cache architectures suitable for reliable hybrid high and ultra-low voltage operation. In particular, the cache is designed by combining heterogeneous SRAM cell types: some of the cache ways are optimized to satisfy high-performance requirements during high-voltage operation, whereas the rest of the ways provide ultra-low energy consumption and reliability during near-/subthreshold voltage operation. We analyze the performance, energy, and power impact of the proposed cache designs when using them to implement L1 caches in a processor. Experimental results show that our hybrid caches can efficiently and reliably operate across a wide range of voltages, consuming little energy at near-/subthreshold voltage as well as providing high performance at high voltage without decreasing reliability levels, to provide the strong performance guarantees required for our target market.
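    The way-partitioning idea in this abstract can be illustrated with a small sketch. This is not the paper's design: the class name, the 2+2 way split, and the cell-type labels are all our assumptions for illustration.

    ```python
    # Illustrative sketch of a hybrid L1 set: ways built from two SRAM cell
    # types share one Vcc domain. At high Vcc every way is usable; at
    # near-/subthreshold Vcc only the ways built from robust low-voltage
    # cells remain reliable, so the others are disabled.

    HIGH_VCC, LOW_VCC = "high", "low"

    class HybridCacheSet:
        def __init__(self, perf_ways=2, robust_ways=2):
            # First the high-performance ways, then the low-voltage-robust ones.
            self.ways = (["perf"] * perf_ways) + (["robust"] * robust_ways)

        def usable_ways(self, mode):
            # High-performance cells are unreliable at low voltage, so they
            # are excluded from the lookup in near-/subthreshold mode.
            if mode == LOW_VCC:
                return [i for i, t in enumerate(self.ways) if t == "robust"]
            return list(range(len(self.ways)))

    s = HybridCacheSet()
    print(s.usable_ways(HIGH_VCC))  # [0, 1, 2, 3]
    print(s.usable_ways(LOW_VCC))   # [2, 3]
    ```

    The effective associativity thus shrinks at low voltage, trading performance for reliability and energy.
    
    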

  • Analyzing the efficiency of L1 caches for reliable hybrid-voltage operation using EDC codes

     Maric, Bojan; Abella, Jaume; Valero Cortes, Mateo
    IEEE transactions on very large scale integration (VLSI) systems
    Vol. 22, num. 10, p. 2211-2215
    DOI: 10.1109/TVLSI.2013.2282498
    Date of publication: 2014-10-01
    Journal article


    The increasing demand for highly miniaturized battery-powered ultralow-cost systems (e.g., below 1 dollar) in emerging applications such as body, urban-life, and environment monitoring has introduced many challenges in chip design. Such applications require high performance occasionally and very little energy consumption most of the time, to extend battery lifetime. In addition, they require real-time guarantees. Caches have been shown to be the most critical blocks in these systems due to their high energy/area consumption and hard-to-predict behavior. New, simple, hybrid-voltage-operation (high Vcc and ultralow Vcc), single-Vcc-domain L1 cache architectures, based on replacing energy-hungry bitcells (e.g., 10T) with smaller, more energy-efficient cells (e.g., 8T) enhanced with error detection and correction codes, have recently been proposed. Such designs provide significant energy and area efficiency without jeopardizing reliability, and thus still provide strong performance guarantees. In this brief, we analyze the efficiency of these designs during ultralow-voltage operation. We identify the limits of such approaches by finding an energy-optimal voltage region through experimental models. The experimental results show that area efficiency is always achieved in the range 200-400 mV, whereas both energy and area gains occur above 250 mV, i.e., in the near-threshold regime.

  • DeTrans: Deterministic and parallel execution of transactions

     Smiljkovic, Vesna; Stipic, Srdjan; Fetzer, Christof; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    International Symposium on Computer Architecture and High Performance Computing
    p. 152-159
    DOI: 10.1109/SBAC-PAD.2014.20
    Presentation's date: 2014-10
    Presentation of work at congresses


    Deterministic execution of a multithreaded application guarantees the same output as long as the application runs with the same input parameters. Determinism helps a programmer to test and debug an application and to provide fault tolerance in systems based on replicas. Additionally, Transactional Memory (TM) greatly simplifies the development of multithreaded applications: applications use transactions (instead of locks) as a concurrency-control mechanism to synchronize accesses to shared memory. However, the deterministic systems proposed so far are not TM-aware: they violate the main properties of TM (atomicity, consistency, and isolation of transactions) and execute TM applications incorrectly. In this paper, we present DeTrans, a runtime system for deterministic execution of multithreaded TM applications. DeTrans executes nontransactional code serially in round-robin order and transactional code in parallel. We also show how DeTrans works with both eager and lazy software TM. We compare DeTrans with Dthreads, a state-of-the-art deterministic execution system. Unlike Dthreads, DeTrans uses neither memory-protection hardware nor facilities of the underlying operating system (OS) to execute multithreaded applications deterministically; instead, it uses properties of software TM to ensure deterministic execution. We evaluate DeTrans using the STAMP benchmark suite and compare the performance costs of DeTrans and Dthreads. DeTrans incurs less overhead because threads execute in the same address space without any OS system-call overhead. According to our results, DeTrans is on average 3.99x, 3.39x, and 2.44x faster than Dthreads for 2, 4, and 8 threads, respectively.
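    The scheduling discipline described above can be sketched in a few lines. This is a toy model, not the DeTrans runtime: the function name and the per-thread work-item representation are invented for illustration.

    ```python
    # Toy model of round-robin deterministic scheduling: nontransactional
    # ('seq') items commit in a fixed thread order, giving the same
    # interleaving on every run; 'tx' items could run in parallel because
    # the TM runtime resolves their conflicts deterministically. Here we
    # just record the order in which items are dispatched.

    def round_robin_schedule(threads):
        """threads: one list of ('tx'|'seq', work) items per thread."""
        log = []
        cursors = [0] * len(threads)
        while any(c < len(t) for c, t in zip(cursors, threads)):
            # Visit threads in a fixed order, one item each per round.
            for tid, t in enumerate(threads):
                if cursors[tid] < len(t):
                    kind, work = t[cursors[tid]]
                    log.append((tid, kind, work))
                    cursors[tid] += 1
        return log

    threads = [[("seq", "a1"), ("tx", "a2")], [("seq", "b1")]]
    print(round_robin_schedule(threads))
    # [(0, 'seq', 'a1'), (1, 'seq', 'b1'), (0, 'tx', 'a2')]
    ```

    Because the dispatch order depends only on the input, repeated runs produce an identical log, which is the property a replica-based fault-tolerance scheme relies on.
    
    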

  • Characterizing the communications demands of the Graph500 benchmark on a commodity cluster

     Fuentes Sáez, Pablo; Bosque Orero, José Luis; Beivide Palacio, Ramon; Valero Cortes, Mateo; Minkenberg, Cyriel
    International Symposium on Big Data Computing
    Presentation's date: 2014-09-12
    Presentation of work at congresses


  • MAPC: memory access pattern based controller

     Hussain, Tassadaq; Palomar Perez, Oscar; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Ayguade Parra, Eduard; Valero Cortes, Mateo
    International Conference on Field-Programmable Logic and Applications
    p. 1-4
    DOI: 10.1109/FPL.2014.6927397
    Presentation's date: 2014-09-03
    Presentation of work at congresses


    Traditionally, system designers have attempted to improve system performance by scheduling the processing cores and by exploring different memory-system configurations; comparatively little work has been done on scheduling accesses at the memory-system level and exploring data accesses on the memory system. In this paper, we propose a Memory Access Pattern based Controller (MAPC). MAPC organizes data accesses in descriptors and prioritizes them with respect to the number and size of transfer requests. When compared to the baseline multicore system, the MAPC-based system achieves speedups between 2.41× and 5.34× for different applications, consumes 28% less hardware resources, and draws 13% less dynamic power.
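    The descriptor-prioritization step can be sketched as follows. The abstract only states the criteria (number and size of transfer requests); the scoring function, class name, and field names below are our assumptions, not MAPC's actual policy.

    ```python
    # Hedged sketch of MAPC-style descriptor scheduling: each descriptor
    # groups a pattern's pending transfer requests, and the controller
    # services descriptors with more/larger transfers first. We assume a
    # simple "total bytes to move" score as the priority.

    from dataclasses import dataclass

    @dataclass
    class Descriptor:
        name: str
        num_requests: int
        request_size: int  # bytes per request

    def schedule(descriptors):
        # Largest total transfer volume (requests x size) is served first.
        return sorted(descriptors,
                      key=lambda d: d.num_requests * d.request_size,
                      reverse=True)

    descs = [Descriptor("row", 4, 64),
             Descriptor("tile", 16, 64),
             Descriptor("scalar", 1, 8)]
    print([d.name for d in schedule(descs)])  # ['tile', 'row', 'scalar']
    ```
    
    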

  • DReAM: Per-task DRAM energy metering in multicore systems

     Liu, Qixiao; Moreto Planas, Miquel; Abella Ferrer, Jaume; Cazorla Almeida, Francisco Javier; Valero Cortes, Mateo
    International European Conference on Parallel and Distributed Computing
    p. 111-123
    DOI: 10.1007/978-3-319-09873-9-10
    Presentation's date: 2014-08
    Presentation of work at congresses


    Interaction across applications in DRAM memory impacts its energy consumption. This paper makes the case for accurate per-task DRAM energy metering in multicores, which opens new paths to energy/performance optimizations, such as per-task energy-aware task scheduling and energy-aware billing in datacenters. In particular, the contributions of this paper are (i) an ideal per-task energy metering model for DRAM memories; (ii) DReAM, an accurate, yet low cost, implementation of the ideal model (less than 5% accuracy error when 16 tasks share memory); and (iii) a comparison with standard methods (even distribution and access-count based) proving that DReAM is more accurate than these other methods.
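    The two standard attribution methods that DReAM is compared against can be shown with a small numeric example. This sketch covers only those baselines; DReAM's own model (which also approximates the ideal per-task metering) is not reproduced here, and the function names are ours.

    ```python
    # Two simple ways to attribute total DRAM energy to tasks, per the
    # comparison in the abstract: split it evenly, or split it in
    # proportion to each task's memory access count.

    def even_split(total_energy, accesses):
        n = len(accesses)
        return {t: total_energy / n for t in accesses}

    def access_proportional(total_energy, accesses):
        total = sum(accesses.values())
        return {t: total_energy * a / total for t, a in accesses.items()}

    accesses = {"taskA": 900, "taskB": 100}  # taskA does 9x more accesses
    print(even_split(100.0, accesses))           # {'taskA': 50.0, 'taskB': 50.0}
    print(access_proportional(100.0, accesses))  # {'taskA': 90.0, 'taskB': 10.0}
    ```

    Neither baseline captures effects such as row-buffer locality or background energy, which is why a dedicated metering model can be markedly more accurate.
    
    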

  • Techniques for Improving the Performance of Software Transactional Memory  Open access

     Stipic, Srdan
    Universitat Politècnica de Catalunya
    Theses



    Transactional Memory (TM) gives software developers the opportunity to write concurrent programs more easily compared to any previous programming paradigms and gives a performance comparable to lock-based synchronizations. Current Software TM (STM) implementations have performance overheads that can be reduced by introducing new abstractions in Transactional Memory programming model. In this thesis we present four new techniques for improving the performance of Software TM: (i) Abstract Nested Transactions (ANT), (ii) TagTM, (iii) profile-guided transaction coalescing, and (iv) dynamic transaction coalescing. ANT improves performance of transactional applications without breaking the semantics of the transactional paradigm, TagTM speeds up accesses to transactional meta-data, profile-guided transaction coalescing lowers transactional overheads at compile time, and dynamic transaction coalescing lowers transactional overheads at runtime. Our analysis shows that Abstract Nested Transactions, TagTM, profile-guided transaction coalescing, and dynamic transaction coalescing improve the performance of the original programs that use Software Transactional Memory.

  • Per-task Energy Accounting in Computing Systems

     Liu, Qixiao; Jiménez, Víctor; Moreto Planas, Miquel; Abella, Jaume; Cazorla Almeida, Francisco Javier; Valero Cortes, Mateo
    IEEE computer architecture letters
    Vol. 13, num. 2, p. 85-88
    DOI: 10.1109/L-CA.2013.24
    Date of publication: 2014-07-01
    Journal article


    We present for the first time the concept of per-task energy accounting (PTEA) and relate it to per-task energy metering (PTEM). We show the benefits of supporting both in future computing systems. Using the shared last-level cache (LLC) as an example: (1) We illustrate the complexities in providing PTEM and PTEA; (2) we present an idealized PTEM model and an accurate and low-cost implementation of it; and (3) we introduce a hardware mechanism to provide accurate PTEA in the cache.

  • Evaluation of vectorization potential of Graph500 on Intel's Xeon Phi

     Stanic, Milan; Palomar Perez, Oscar; Ratkovic, Ivan; Duric, Milovan; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    International Conference on High Performance Computing & Simulation
    p. 47-54
    DOI: 10.1109/HPCSim.2014.6903668
    Presentation's date: 2014-07
    Presentation of work at congresses


    Graph500 is a data-intensive application for high-performance computing, and it is an increasingly important workload because graphs are a core part of most analytic applications. So far, no work has examined whether Graph500 is suitable for vectorization, mostly due to a lack of vector memory instructions for irregular memory accesses. The Xeon Phi is a massively parallel processor recently released by Intel, with new features such as a wide 512-bit vector unit and vector scatter/gather instructions. The Xeon Phi thus allows the parallelization of Graph500 to be combined with vectorization. In this paper, we vectorize Graph500 and analyze the impact of vectorization and prefetching on the Xeon Phi. We show that the combination of parallelization, vectorization, and prefetching yields a speedup of 27% over a parallel version with prefetching that does not leverage the vector capabilities of the Xeon Phi.
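    The gather operation that makes these irregular accesses vectorizable can be illustrated with a scalar emulation. This is only a model of the semantics a vector gather instruction provides; it is not the paper's kernel, and the BFS fragment below is an invented example.

    ```python
    # A vector gather reads base[idx[i]] for a whole index vector in one
    # instruction (what the Xeon Phi's gather instructions provide for
    # irregular accesses). Emulated here element by element.

    def gather(base, idx):
        return [base[i] for i in idx]

    # One BFS step: fetch the visited flags of all frontier neighbors in a
    # single gathered access instead of a scalar loop of random loads.
    visited = [1, 0, 0, 1, 0]
    frontier_neighbors = [4, 1, 3, 2]
    flags = gather(visited, frontier_neighbors)
    print(flags)  # [0, 0, 1, 0]
    next_frontier = [v for v, f in zip(frontier_neighbors, flags) if f == 0]
    print(next_frontier)  # [4, 1, 2]
    ```

    Without hardware gather/scatter, each of these loads would be a separate scalar instruction, which is why earlier vector ISAs struggled with graph workloads.
    
    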

  • Physical vs. physically-aware estimation flow: case study of design space exploration of adders

     Ratkovic, Ivan; Palomar Perez, Oscar; Stanic, Milan; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    IEEE Computer Society Annual Symposium on VLSI
    p. 118-123
    DOI: 10.1109/ISVLSI.2014.14
    Presentation's date: 2014-07
    Presentation of work at congresses


    Selecting an appropriate estimation method for a given technology and design is of crucial interest, as the estimations guide future project and design decisions. The accuracy of the estimations of area, timing, and power (the metrics of interest) depends on the phase of the design flow and the fidelity of the models. In this research, we use design-space exploration of low-power adders as a case study for a comparative analysis of two estimation flows: Physical-layout-Aware Synthesis (PAS) and Place and Route (PnR). We study and compare post-PAS and post-PnR estimations of the metrics of interest and the impact of various design parameters and the input switching activity factor (aI). Adders are particularly interesting for this study because they are fundamental microprocessor units, and their design involves many parameters that create a vast design space. We show cases where the post-PAS and post-PnR estimations could lead to different design decisions, especially from a low-power designer's point of view. Our experiments reveal that post-PAS results underestimate the side effects of clock gating, pipelining, and extensive timing optimizations compared to post-PnR results. We also observe that the PnR estimation flow sometimes reports counterintuitive results.

  • Advanced pattern based memory controller for FPGA based HPC applications

     Hussain, Tassadaq; Palomar Perez, Oscar; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Ayguade Parra, Eduard; Valero Cortes, Mateo
    International Conference on High Performance Computing & Simulation
    p. 287-294
    DOI: 10.1109/HPCSim.2014.6903697
    Presentation's date: 2014-07
    Presentation of work at congresses


    The ever-increasing complexity of high-performance computing applications limits performance due to memory constraints in FPGAs. To address this issue, we propose the Advanced Pattern based Memory Controller (APMC), which supports both regular and irregular memory patterns. The proposed memory controller systematically reduces the latency that processors/accelerators face due to irregular memory access patterns and low memory bandwidth, using a smart mechanism that collects and stores the different patterns and reuses them when needed. To prove the effectiveness of the proposed controller, we implemented and tested it on a Xilinx ML505 FPGA board, and to show that it is efficient in a variety of scenarios, we used several benchmarks with different memory access patterns. The benchmarking results show that our controller consumes 20% less hardware resources and 32% less on-chip power, and achieves a maximum speedup of 52× and 2.9× for regular and irregular applications respectively.

  • PVMC: Programmable Vector Memory Controller

     Hussain, Tassadaq; Palomar Perez, Oscar; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Ayguade Parra, Eduard; Valero Cortes, Mateo
    International Conference on Application-Specific Systems, Architectures and Processors
    p. 240-247
    DOI: 10.1109/ASAP.2014.6868668
    Presentation's date: 2014-06-18
    Presentation of work at congresses


    In this work, we propose a Programmable Vector Memory Controller (PVMC), which boosts noncontiguous vector data accesses by integrating descriptors of memory patterns, a specialized local memory, a hardware memory manager, and multiple DRAM controllers. We implemented and validated the proposed system on an Altera DE4 FPGA board, and compared the performance of our proposal with a vector system without PVMC as well as a scalar-only system. When compared with a baseline vector system, the results show that the PVMC system transfers data sets between 2.2× and 14.9× faster, achieves speedups between 2.16× and 3.18× for 5 applications, and consumes 2.56 to 4.04 times less energy.

  • Enabling preemptive multiprogramming on GPUs  Open access

     Tanasic, Ivan; Gelado Fernandez, Isaac; Cabezas, Javier; Ramirez Bellido, Alejandro; Navarro, Nacho; Valero Cortes, Mateo
    International Symposium on Computer Architecture
    p. 193-204
    DOI: 10.1109/ISCA.2014.6853208
    Presentation's date: 2014-06-14
    Presentation of work at congresses


    GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios, so such systems are unable to meet key multiprogrammed-workload requirements such as responsiveness, fairness, or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
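    The priority-proportional core distribution can be sketched with a small allocator. The paper does not specify a rounding scheme; the largest-remainder rounding, function name, and kernel names below are our choices for illustration.

    ```python
    # Sketch of dynamic core partitioning: divide the GPU's cores among
    # concurrently running kernels in proportion to their priorities,
    # rounding leftover cores to the largest fractional remainders.

    def distribute_cores(num_cores, priorities):
        total = sum(priorities.values())
        shares = {k: num_cores * p / total for k, p in priorities.items()}
        alloc = {k: int(s) for k, s in shares.items()}
        # Hand out the cores lost to truncation, biggest remainder first.
        leftover = num_cores - sum(alloc.values())
        by_remainder = sorted(shares, key=lambda k: shares[k] - alloc[k],
                              reverse=True)
        for k in by_remainder[:leftover]:
            alloc[k] += 1
        return alloc

    print(distribute_cores(15, {"kernelA": 3, "kernelB": 1, "kernelC": 1}))
    # {'kernelA': 9, 'kernelB': 3, 'kernelC': 3}
    ```

    A real scheduler would rerun this whenever a kernel arrives or finishes, preempting cores as the allocation shifts.
    
    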


  • Author retrospective for "Software trace cache"

     Ramirez Bellido, Alejandro; Falcón Samper, Ayose Jesus; Santana Jaria, Oliverio J.; Valero Cortes, Mateo
    International Supercomputing Conference
    p. 45-47
    DOI: 10.1145/2591635.2594508
    Presentation's date: 2014-06-10
    Presentation of work at congresses


    In superscalar processors, capable of issuing and executing multiple instructions per cycle, fetch performance represents an upper bound on overall processor performance: unless there is some form of instruction re-use mechanism, you cannot execute instructions faster than you can fetch them. Instruction-level parallelism, embodied by wide-issue out-of-order superscalar processors, was the trending topic of the late 1990s and early 2000s. It is indeed the most promising way to continue improving processor performance without impacting application development, unlike current multicore architectures, which require parallelizing the applications (a process that is still far from being automated in the general case). Widening superscalar processor issue promised never-ending improvements to single-thread performance, as identified by Yale N. Patt et al. in the 1997 special issue of IEEE Computer about "Billion transistor processors" [1]. However, instruction fetch performance is limited by the control flow of the program. The basic fetch-stage implementation can read instructions from a single cache line, starting from the current fetch address and up to the next control-flow instruction: one basic block per cycle at most. Given that the typical basic block size in SPEC integer benchmarks is 4-6 instructions, fetch performance was limited to those same 4-6 instructions per cycle, making 8-wide and 16-wide superscalar processors impractical. It became imperative to find mechanisms to fetch more than 8 instructions per cycle, and that meant fetching more than one basic block per cycle.
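    The arithmetic behind this argument is worth making explicit. The helper name is ours; the numbers come from the retrospective's own figures.

    ```python
    # If fetch delivers at most one basic block per cycle and SPECint basic
    # blocks average 4-6 instructions, sustained IPC is capped at 4-6 no
    # matter how wide the issue stage is. Fetching multiple basic blocks
    # per cycle (the trace-cache idea) raises the bound proportionally.

    def fetch_ipc_bound(avg_basic_block_size, blocks_per_cycle=1):
        return avg_basic_block_size * blocks_per_cycle

    for bb in (4, 6):
        print(bb, "->", fetch_ipc_bound(bb))       # an 8- or 16-wide core starves
    print(4, "x2 ->", fetch_ipc_bound(4, blocks_per_cycle=2))  # two blocks/cycle
    ```
    
    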

  • Automatic exploration of potential parallelism in sequential applications

     Subotic, Vladimir; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    International Supercomputing Conference
    p. 156-171
    DOI: 10.1007/978-3-319-07518-1-10
    Presentation's date: 2014-06
    Presentation of work at congresses


    The multicore era has increased the need for highly parallel software. Since automatic parallelization has turned out to be ineffective for many production codes, the community hopes for tools that assist parallelization by providing hints to drive the parallelization process. In our previous work, we designed Tareador, a tool based on dynamic instrumentation that identifies the potential task-based parallelism inherent in applications, and we showed how a programmer can use Tareador to explore the potential of different parallelization strategies. In this paper, we build on that work by automating the process of exploring parallelism. We have designed an environment that, given a sequential code and a configuration of the target parallel architecture, iteratively runs Tareador to find an efficient parallelization strategy. We propose an autonomous algorithm based on simple metrics and a cost function. The algorithm finds an efficient parallelization strategy and provides the programmer with sufficient information to turn that strategy into an actual parallel program.
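    The iterative-exploration loop can be sketched abstractly. The metrics, the refinement operator, and the cost function below are invented stand-ins; the paper defines its own, and each candidate evaluation would correspond to a Tareador run.

    ```python
    # Hedged sketch of a greedy exploration loop: repeatedly try refining
    # each candidate task, keep the refinement that most improves the cost
    # estimate, and stop when no refinement helps.

    def explore(initial_tasks, refine, cost):
        best = initial_tasks
        while True:
            candidates = [refine(best, i) for i in range(len(best))]
            improved = min(candidates, key=cost)
            if cost(improved) >= cost(best):
                return best      # no refinement improves the estimate
            best = improved

    # Toy model: a "task" is just its sequential length; refining a task
    # splits it in half, and the cost is the longest single task (a simple
    # lower bound on the parallel makespan).
    def refine(tasks, i):
        t = tasks[:]
        half = t.pop(i) / 2
        return t + [half, half]

    print(explore([8.0, 2.0], refine, max))  # [2.0, 4.0, 4.0]
    ```

    The real environment scores candidates with simulated parallel execution on the target architecture rather than this toy bound.
    
    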

  • CPU Accounting in Multi-Threaded Processors  Open access

     Ruiz Luque, Jose Carlos
    Universitat Politècnica de Catalunya
    Theses



    In recent years, multi-threaded processors have become more and more popular in industry as a way to increase both aggregate system performance and per-application performance, overcoming the limits imposed by limited instruction-level parallelism and by power and thermal constraints. Multi-threaded processors are widely used in servers, desktop computers, laptops, and mobile devices. However, multi-threaded processors complicate the accounting of CPU (computation) capacity (CPU accounting), since the CPU capacity accounted to an application depends not only on the time that the application is scheduled onto a CPU, but also on the amount of hardware resources it receives during that period. Given that in a multi-threaded processor hardware resources are dynamically shared between applications, the CPU capacity accounted to an application depends on the workload in which it executes. This is inconvenient because the CPU accounting of the same application with the same input data set may differ significantly depending on the workload in which it executes. Deploying systems with accurate CPU accounting mechanisms is necessary to increase fairness among running applications. Moreover, it will allow users to be fairly charged on a shared data center, facilitating server consolidation in future systems. This thesis analyses the concepts of CPU capacity and CPU accounting for multi-threaded processors. In this study, we demonstrate that current CPU accounting mechanisms are not as accurate as they should be in multi-threaded processors. For this reason, we present two novel CPU accounting mechanisms that improve the accuracy of measuring CPU capacity in multi-threaded processors with low hardware overhead. We focus our attention on several current multi-threaded processors, including chip multiprocessors and simultaneous multithreading processors.
Finally, we analyse the impact of shared hardware resources in multi-threaded processors on the operating system's CPU scheduler, and we propose several schedulers that improve awareness of shared hardware resources at the software level.

  • Scalable System Software for High Performance Large-scale Applications  Open access

     Morari, Alessandro
    Universitat Politècnica de Catalunya
    Theses



    In the last decades, high-performance large-scale systems have been a fundamental tool for scientific discovery and engineering advances. The sustained growth of supercomputing performance and the concurrent reduction in cost have made this technology available to a large number of scientists and engineers working on many different problems. The design of next-generation supercomputers will include traditional HPC requirements as well as new requirements to handle data-intensive computations. Data-intensive applications will hence play an important role in a variety of fields, and they are the current focus of several research trends in HPC. Due to the challenges of scalability and power efficiency, the next generation of supercomputers needs a redesign of the whole software stack. Being at the bottom of the software stack, system software is expected to change drastically to support the upcoming hardware and to meet new application requirements. This PhD thesis addresses the scalability of system software. The thesis starts at the operating-system level: first studying a general-purpose OS (e.g., Linux) and then lightweight kernels (e.g., CNK). Then, we focus on the runtime system: we implement a runtime system for distributed-memory systems that includes many of the system services required by next-generation applications. Finally, we focus on hardware features that can be exploited at user level to improve application performance and potentially be included in our advanced runtime system. The thesis contributions are the following. Operating System Scalability: We provide an accurate study of the scalability problems of modern operating systems for HPC. We design and implement a methodology whereby detailed quantitative information may be obtained for each OS noise event. We validate our approach by comparing it to other well-known standard techniques for analyzing OS noise, such as FTQ (Fixed Time Quantum).
Evaluation of the address translation management for a lightweight kernel: we provide a performance evaluation of different TLB management approaches (dynamic memory mapping, static memory mapping with replaceable TLB entries, and static memory mapping with fixed TLB entries, i.e., no TLB misses) on an IBM Blue Gene/P system. Runtime System Scalability: We show that a runtime system can efficiently incorporate system services and improve scalability for a specific class of applications. We design and implement a full-featured runtime system and programming model to execute irregular applications on a commodity cluster. The runtime library is called the Global Memory and Threading library (GMT) and integrates a locality-aware Partitioned Global Address Space communication model with a fork/join program structure. It supports massive lightweight multi-threading, overlapping of communication and computation, and small-message aggregation to tolerate network latencies. We compare GMT to other PGAS models, hand-optimized MPI code, and custom architectures (Cray XMT) on a set of large-scale irregular applications: breadth-first search, random walk, and concurrent hash map access. Our runtime system shows performance orders of magnitude higher than other solutions on commodity clusters, and competitive with custom architectures. User-level Scalability Exploiting Hardware Features: We show the high complexity of low-level hardware optimizations for single applications as a motivation to incorporate this logic into an adaptive runtime system. We evaluate the effects of the hardware-thread priority mechanism that controls the rate at which each hardware thread decodes instructions on IBM POWER5 and POWER6 processors. Finally, we show how to effectively exploit cache locality and the network-on-chip on the Tilera many-core architecture to improve intra-core scalability.

  • Scaling irregular applications through data aggregation and software multithreading

     Morari, A.; Tumeo, Antonio; Chavarria Miranda, Daniel; Villa, Oreste; Valero Cortes, Mateo
    IEEE International Parallel and Distributed Processing Symposium
    p. 1126-1135
    DOI: 10.1109/IPDPS.2014.117
    Presentation's date: 2014-05
    Presentation of work at congresses


    Emerging applications in areas such as bioinformatics, data analytics, semantic databases and knowledge discovery employ datasets from tens to hundreds of terabytes. Currently, only distributed-memory clusters have enough aggregate space to enable in-memory processing of datasets of this size. However, in addition to their large sizes, the data structures used by these new application classes are usually characterized by unpredictable and fine-grained accesses: i.e., they present an irregular behavior. Traditional commodity clusters, instead, exploit cache-based processors and high-bandwidth networks optimized for locality, regular computation and bulk communication. For these reasons, irregular applications are inefficient on these systems, and require custom, hand-coded optimizations to provide scaling in both performance and size. Lightweight software multithreading, which enables tolerating data access latencies by overlapping network communication with computation, and aggregation, which allows reducing overheads and increasing bandwidth utilization by coalescing fine-grained network messages, are key techniques that can speed up the performance of large-scale irregular applications on commodity clusters. In this paper we describe GMT (Global Memory and Threading), a runtime system library that couples software multithreading and message aggregation together with a Partitioned Global Address Space (PGAS) data model to enable higher performance and scaling of irregular applications on multi-node systems. We present the architecture of the runtime, explaining how it is designed around these two critical techniques. We show that irregular applications written using our runtime can outperform, even by orders of magnitude, the corresponding applications written using other programming models that do not exploit these techniques.
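
    The message-aggregation idea described in the abstract can be sketched in a few lines. The following Python model is illustrative only (the `Aggregator` class, its `capacity` parameter, and the request format are hypothetical, not GMT's actual API): fine-grained remote requests are buffered per destination node and shipped as one large message when a buffer fills, amortizing per-message network overhead.

```python
class Aggregator:
    """Toy per-destination message aggregator (illustrative sketch)."""

    def __init__(self, num_nodes, capacity=4):
        self.capacity = capacity                  # requests per aggregated message
        self.buffers = {n: [] for n in range(num_nodes)}
        self.sent = []                            # log of (dest, batch) "network" sends

    def put(self, dest, request):
        """Queue a fine-grained request; flush when the buffer is full."""
        buf = self.buffers[dest]
        buf.append(request)
        if len(buf) >= self.capacity:
            self.flush(dest)

    def flush(self, dest):
        """Send all buffered requests to `dest` as a single message."""
        if self.buffers[dest]:
            self.sent.append((dest, tuple(self.buffers[dest])))
            self.buffers[dest] = []

agg = Aggregator(num_nodes=2)
for i in range(8):                                # 8 fine-grained requests ...
    agg.put(i % 2, ("get", i))
assert len(agg.sent) == 2                         # ... became 2 aggregated messages
```

    A real runtime would also flush on a timeout so buffered requests cannot wait indefinitely; that detail is omitted here for brevity.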

  • Dynamic transaction coalescing

     Stipic, Srdjan; Karakostas, Vasileios; Smiljkovic, Vesna; Gajinov, Vladimir; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    ACM International Conference on Computing Frontiers
    p. Article No. 18-
    DOI: 10.1145/2597917.2597930
    Presentation's date: 2014-05
    Presentation of work at congresses


    Prior work in Software Transactional Memory has identified high overheads related to starting and committing transactions that may degrade the application performance. To amortize these overheads, transaction coalescing techniques have been proposed that coalesce two or more small transactions into one large transaction. However, these techniques either coalesce transactions statically at compile time, or lack on-line profiling mechanisms that allow coalescing transactions dynamically. Thus, such approaches lead to sub-optimal execution or they may even degrade the performance. In this paper, we introduce Dynamic Transaction Coalescing (DTC), a compile-time and run-time technique that improves transactional throughput. DTC reduces the overheads of starting and committing a transaction. At compile-time, DTC generates several code paths with a different number of coalesced transactions. At runtime, DTC implements low overhead online profiling and dynamically selects the corresponding code path that improves throughput. Compared to coalescing transactions statically, DTC provides two main improvements. First, DTC implements online profiling which removes the dependency on a pre-compilation profiling step. Second, DTC dynamically selects the best transaction granularity to improve the transaction throughput taking into consideration the abort rate. We evaluate DTC using common TM benchmarks and micro-benchmarks. Our findings show that: (i) DTC performs like static transaction coalescing in the common case, (ii) DTC does not suffer from performance degradation, and (iii) DTC outperforms static transaction coalescing when an application exposes phased behavior.
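
    The trade-off DTC navigates can be illustrated with a toy cost model (this simulation is an assumption-laden sketch, not the paper's implementation): each transaction pays a fixed begin/commit overhead, coalescing more work into one transaction amortizes that overhead, but larger transactions are more likely to abort under contention and waste their whole attempt.

```python
import random

def run_transaction(num_items, abort_prob):
    """Simulate one transaction over `num_items` items.
    Returns (items_committed, cost): begin/commit overhead is 10 units,
    each item costs 1, and an abort wastes the entire attempt."""
    cost = 10 + num_items
    if random.random() < abort_prob * num_items:   # bigger tx, more conflicts
        return 0, cost                              # aborted: no progress
    return num_items, cost

def throughput(factor, abort_prob, trials=2000):
    """Items committed per unit cost when coalescing `factor` items per tx."""
    done = spent = 0
    for _ in range(trials):
        d, c = run_transaction(factor, abort_prob)
        done += d
        spent += c
    return done / spent

random.seed(0)
# Low contention: coalescing amortizes the begin/commit overhead.
assert throughput(4, 0.01) > throughput(1, 0.01)
# High contention: large coalesced transactions abort too often.
assert throughput(1, 0.20) > throughput(4, 0.20)
```

    DTC's online profiling effectively performs this comparison at run time, selecting among compiler-generated code paths with different coalescing factors.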

  • Stand-alone memory controller for graphics system

     Hussain, Tassadaq; Palomar Perez, Oscar; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Ayguade Parra, Eduard; Valero Cortes, Mateo; Haider, Amna
    International Symposium on Applied Reconfigurable Computing
    p. 108-120
    DOI: 10.1007/978-3-319-05960-0_10
    Presentation's date: 2014-04
    Presentation of work at congresses


    There has been a dramatic increase in the complexity of graphics applications in System-on-Chip (SoC), with a corresponding increase in performance requirements. Various powerful and expensive platforms to support graphical applications have appeared recently. All these platforms require a high-performance core that manages and schedules the high-speed data of graphics peripherals (camera, display, etc.) and an efficient on-chip scheduler. In this article we design and propose a SoC-based Programmable Graphics Controller (PGC) that handles graphics peripherals efficiently. The data access patterns are described in the program memory; the PGC reads them, generates transactions, and manages both the bus and the connected peripherals without the support of a master core. The proposed system is highly efficient in terms of cost, performance and power. The PGC-based system is implemented and tested on a Xilinx ML505 FPGA board. The performance of the PGC is compared with a MicroBlaze processor-based graphics system. When compared with the baseline system, the results show that the PGC captures video at a 2x higher frame rate and achieves 3.4x to 7.4x speedup while processing images. The PGC consumes 30% less hardware resources and 22% less on-chip power than the baseline system.

  • EVX: vector execution on low power EDGE cores

     Duric, Milovan; Palomar Perez, Oscar; Smith, Aaron; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo; Burger, Doug
    Design, Automation and Test in Europe
    p. 1-4
    DOI: 10.7873/DATE.2014.035
    Presentation's date: 2014-03-24
    Presentation of work at congresses


    In this paper, we present a vector execution model that provides the advantages of vector processors on low-power, general-purpose cores with limited additional hardware. While accelerating data-level parallel (DLP) workloads, the vector model increases efficiency and hardware resource utilization. We use a modest dual-issue core based on an Explicit Data Graph Execution (EDGE) architecture to implement our approach, called EVX. Unlike most DLP accelerators, which utilize additional hardware and increase the complexity of low-power processors, EVX leverages the available resources of EDGE cores and allows specialization of those resources at minimal cost. EVX adds control logic that increases the core area by 2.1%. We show that EVX yields an average speedup of 3x compared to a scalar baseline and outperforms multimedia SIMD extensions. © 2014 EDAA.

  • Using dynamic runtime testing for rapid development of architectural simulators

     Tomic, Saša; Cristal Kestelman, Adrian; Unsal, Osman Sabri; Valero Cortes, Mateo
    International journal of parallel programming
    Vol. 42, num. 1, p. 119-139
    DOI: 10.1007/s10766-012-0208-7
    Date of publication: 2014-02-01
    Journal article


    Architectural simulator platforms are particularly complex and error-prone programs that aim to simulate all hardware details of a given target architecture. Development of a stable cycle-accurate architectural simulator can easily take several man-years. Discovering and fixing all visible errors in a simulator often requires significant effort, much higher than for writing the simulator code in the first place. In addition, there are no guarantees that all programming errors will be eliminated, no matter how much effort is put into testing and debugging. This paper presents dynamic runtime testing, a methodology for rapid development and accurate detection of functional bugs in architectural cycle-accurate simulators. Dynamic runtime testing consists of comparing an execution of a cycle-accurate simulator with an execution of a simple and functionally equivalent emulator. Dynamic runtime testing detects a possible functional error if there is a mismatch between the execution in the simulator and the emulator. Dynamic runtime testing provides a reliable and accurate verification of a simulator, during its entire development cycle, with very acceptable performance impact, and without requiring complex setup for the simulator execution. Based on our experience, dynamic testing reduced the simulator modification time from 12-18 person-months to only 3-4 person-months, while it only modestly reduced the simulator performance (in our case under 20 %). © 2012 Springer Science+Business Media, LLC.

  • APMC: advanced pattern based memory controller

     Hussain, Tassadaq; Palomar Perez, Oscar; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Ayguade Parra, Eduard; Valero Cortes, Mateo; Rethinagiri, Santhosh Kumar
    ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
    p. 252
    DOI: 10.1145/2554688.2554732
    Presentation's date: 2014-02
    Presentation of work at congresses


    In this paper, we present APMC, the Advanced Pattern based Memory Controller, an intelligent memory controller that uses descriptors to support both regular and irregular memory access patterns without the support of a master core. It keeps pattern descriptors in memory and prefetches complex 1D/2D/3D data structures into its special scratchpad memory. Memory accesses are arranged in the pattern descriptors at program-time, and APMC manages multiple patterns at run-time to reduce access latency. The proposed APMC system reduces the limitations faced by processors/accelerators due to irregular memory access patterns and low memory bandwidth. It gathers multiple memory read/write requests and maximizes the reuse of opened SDRAM banks to decrease the overhead of opening and closing rows. APMC manages data movement between main memory and the specialized scratchpad memory; data present in the specialized scratchpad is reused and/or updated when accessed by several patterns. The system is implemented and tested on a Xilinx ML505 FPGA board. Its performance is compared with that of a processor with a high-performance memory controller. The results show that the APMC system transfers regular and irregular datasets up to 20.4x and 3.4x faster respectively than the baseline system. When compared to the baseline system, APMC consumes 17% less hardware resources, 32% less on-chip power, and achieves speedups between 3.5x and 52x for regular applications and between 1.4x and 2.9x for irregular applications. The APMC core consumes 50% less hardware resources than the baseline system's memory controller.
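
    The descriptor idea behind controllers like APMC can be sketched as follows. The field names and nesting scheme here are hypothetical, chosen only to illustrate the principle: a compact descriptor encodes a (possibly nested) strided pattern, and the controller expands it into the address stream to prefetch.

```python
def expand(descriptor):
    """Expand a (base, stride, count, inner) descriptor into the flat
    list of addresses it describes; `inner` nests a sub-pattern for
    2D/3D structures, or is None for a simple strided access."""
    base, stride, count, inner = descriptor
    addrs = []
    for i in range(count):
        start = base + i * stride
        if inner is None:
            addrs.append(start)
        else:
            b, s, c, sub = inner
            addrs.extend(expand((start + b, s, c, sub)))
    return addrs

# A 2D tile: 4 rows spaced 1024 bytes apart, 3 consecutive 4-byte words per row.
tile = (0x1000, 1024, 4, (0, 4, 3, None))
assert expand(tile)[:3] == [0x1000, 0x1004, 0x1008]
assert len(expand(tile)) == 12
```

    The point of the descriptor is exactly this compression: twelve addresses are generated from four numbers, so the pattern can be stored at program-time and replayed by hardware at run-time.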

  • Arquitectura de Computadors d'Altes Prestacions (ACAP)

     Olive Duran, Angel; Ramirez Bellido, Alejandro; Llosa Espuny, Jose Francisco; Sanchez Carracedo, Fermin; Jimenez Castells, Marta; Fernandez Jimenez, Agustin; Jimenez Gonzalez, Daniel; Alvarez Martinez, Carlos; Morancho Llena, Enrique; Moretó Planas, Miquel; Carpenter, Paul M.; Palomar Perez, Oscar; Monreal Arnal, Teresa; Valero Cortes, Mateo
    Competitive project

     Share

  • CODOMs: Protecting software with code-centric memory domains

     Vilanova, Lluís; Ben-Yehuda, Muli; Navarro, Nacho; Etsion, Yoav; Valero Cortes, Mateo
    Annual International Symposium on Computer Architecture
    p. 469-480
    DOI: 10.1109/ISCA.2014.6853202
    Presentation's date: 2014
    Presentation of work at congresses


    Today's complex software systems are neither secure nor reliable. The rudimentary software protection primitives provided by current hardware force systems to run many distrusting software components (e.g., procedures, libraries, plugins, modules) in the same protection domain, or otherwise suffer degraded performance from address space switches. We present CODOMs (COde-centric memory DOMains), a novel architecture that can provide finer-grained isolation between software components with effectively zero run-time overhead, all at a fraction of the complexity of other approaches. An implementation of CODOMs in a cycle-accurate full-system x86 simulator demonstrates that with the right hardware support, finer-grained protection and run-time performance can peacefully coexist.

  • VALib and SimpleVector: Tools for rapid initial research on vector architectures

     Stanic, Milan; Palomar Perez, Oscar; Ratkovic, Ivan; Duric, Milovan; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    ACM International Conference on Computing Frontiers
    DOI: 10.1145/2597917.2597919
    Presentation's date: 2014
    Presentation of work at congresses


    Vector architectures have been traditionally applied to the supercomputing domain with many successful incarnations. The energy efficiency and high performance of vector processors, as well as their applicability in other emerging domains, encourage pursuing further research on vector architectures. However, there is a lack of appropriate tools to perform this research. This paper presents two tools for measuring and analyzing an application's suitability for vector microarchitectures. The first tool is VALib, a library that enables hand-crafted vectorization of applications and its main purpose is to collect data for detailed instruction level characterization and to generate input traces for the second tool. The second tool is SimpleVector, a fast trace-driven simulator that is used to estimate the execution time of a vectorized application on a candidate vector microarchitecture. The potential of the tools is demonstrated using six applications from emerging application domains such as speech and face recognition, video encoding, bioinformatics, machine learning and graph search. The results indicate that 63.2% to 91.1% of these contemporary applications are vectorizable. Then, over multiple use cases, we demonstrate that the tools can facilitate rapid evaluation of various vector architecture designs.
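
    A trace-driven estimator in the spirit of SimpleVector can be sketched very compactly. This toy model is an assumption on our part (the parameters `mvl`, `lanes`, and `mem_latency`, and the trace format, are invented for illustration and do not reflect the tool's actual interface): each vector instruction takes one cycle per group of `lanes` elements, plus a memory latency for loads and stores.

```python
def estimate_cycles(trace, mvl=64, lanes=8, mem_latency=20):
    """Estimate cycles for a vector instruction trace.
    trace: list of (opcode, vector_length) tuples."""
    cycles = 0
    for op, vl in trace:
        vl = min(vl, mvl)                 # clamp to the maximum vector length
        chimes = -(-vl // lanes)          # ceil(vl / lanes): element groups
        cycles += chimes
        if op in ("vload", "vstore"):
            cycles += mem_latency         # flat memory penalty (no overlap)
    return cycles

trace = [("vload", 64), ("vadd", 64), ("vstore", 64)]
assert estimate_cycles(trace) == (8 + 20) + 8 + (8 + 20)   # 64 cycles
```

    A faster candidate microarchitecture is then explored just by changing the parameters, e.g. `estimate_cycles(trace, lanes=16)`, which is the kind of rapid what-if evaluation trace-driven simulation enables.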

  • Runtime-aware architectures: A first approach

     Valero Cortes, Mateo; Moretó Planas, Miquel; Casas Guix, Marc; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    Supercomputing frontiers and innovations
    Vol. 1, num. 1, p. 29-44
    DOI: 10.14529/jsfi140102
    Date of publication: 2014
    Journal article


    In the last few years, the traditional ways to keep the increase of hardware performance at the rate predicted by Moore's Law have vanished. When uni-cores were the norm, hardware design was decoupled from the software stack thanks to a well-defined Instruction Set Architecture (ISA). This simple interface allowed developing applications without worrying too much about the underlying hardware, while hardware designers were able to aggressively exploit instruction-level parallelism (ILP) in superscalar processors. With the arrival of multi-cores and parallel applications, this simple interface started to leak. As a consequence, the role of decoupling applications from the hardware moved again to the runtime system. Efficiently using the underlying hardware from this runtime without exposing its complexities to the application has been the target of very active and prolific research in the last years. Current multi-cores are designed as simple symmetric multiprocessors (SMP) on a chip. However, we believe that this is not enough to overcome all the problems that multi-cores already have to face. It is our position that the runtime has to drive the design of future multi-cores to overcome the restrictions in terms of power, memory, programmability and resilience that multi-cores have. In this paper, we introduce a first approach towards a Runtime-Aware Architecture (RAA), a massively parallel architecture designed from the runtime's perspective.

  • Hardware support for accurate per-task energy metering in multicore systems

     Liu, Qixiao; Moreto Planas, Miquel; Jiménez, Víctor; Abella Ferrer, Jaume; Cazorla Almeida, Francisco Javier; Valero Cortes, Mateo
    ACM transactions on architecture and code optimization
    Vol. 10, num. 4, p. 34:1-34:27
    DOI: 10.1145/2555289.2555291
    Date of publication: 2013-12
    Journal article


    Accurately determining the energy consumed by each task in a system will become of prominent importance in future multicore-based systems because it offers several benefits, including (i) better application energy/performance optimizations, (ii) improved energy-aware task scheduling, and (iii) energy-aware billing in data centers. Unfortunately, existing methods for energy metering in multicores fail to provide accurate energy estimates for each task when several tasks run simultaneously. This article makes a case for accurate Per-Task Energy Metering (PTEM) based on tracking the resource utilization and occupancy of each task. Different hardware implementations with different trade-offs between energy prediction accuracy and hardware-implementation complexity are proposed. Our evaluation shows that the energy consumed in a multicore by each task can be accurately measured. For a 32-core, 2-way, simultaneous multithreaded core setup, PTEM reduces the average accuracy error from more than 12% when our hardware support is not used to less than 4% when it is used. The maximum observed error for any task in the workload we used reduces from 58% down to 9% when our hardware support is used.
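
    The attribution principle behind per-task energy metering can be shown with a small numerical sketch. The resource names, power figures, and proportional-sharing rule below are illustrative assumptions, not the paper's model: each shared resource draws some power, and a task is charged for that resource in proportion to its share of the resource's activity during the interval.

```python
def per_task_energy(resources, interval_s):
    """Attribute energy to tasks by their share of each resource's activity.
    resources: {name: (power_watts, {task: activity_count})}.
    Returns joules attributed to each task over `interval_s` seconds."""
    energy = {}
    for power, activity in resources.values():
        total = sum(activity.values())
        if total == 0:
            continue                      # idle resource: charge nobody here
        for task, count in activity.items():
            energy[task] = energy.get(task, 0.0) + power * interval_s * count / total
    return energy

usage = {
    "core":  (5.0, {"A": 300, "B": 100}),   # A does 75% of core activity
    "cache": (2.0, {"A": 50,  "B": 150}),   # B does 75% of cache activity
}
e = per_task_energy(usage, interval_s=1.0)
assert abs(e["A"] - (5.0 * 0.75 + 2.0 * 0.25)) < 1e-9   # 4.25 J
assert abs(sum(e.values()) - 7.0) < 1e-9                # all power accounted
```

    Note that a time-only accounting scheme would split the 7 J evenly; tracking per-resource occupancy is what lets the meter distinguish A's core-heavy behavior from B's cache-heavy behavior.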

  • Thread assignment of multithreaded network applications in multicore/multithreaded processors

     Radojkovic, Petar; Cakarevic, Vladimir; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Cazorla Almeida, Francisco Javier; Nemirovsky, Mario; Valero Cortes, Mateo
    IEEE transactions on parallel and distributed systems
    Vol. 24, num. 12, p. 2513-2525
    DOI: 10.1109/TPDS.2012.311
    Date of publication: 2013-12
    Journal article


    The introduction of multithreaded processors comprised of a large number of cores with many shared resources makes thread scheduling, and in particular the optimal assignment of running threads to processor hardware contexts, one of the most promising ways to improve system performance. However, finding optimal thread assignments for workloads running in state-of-the-art multicore/multithreaded processors is an NP-complete problem. In this paper, we propose the BlackBox scheduler, a systematic method for thread assignment of multithreaded network applications running on multicore/multithreaded processors. The method requires minimal information about the target processor architecture and no data about the hardware requirements of the applications under study. The proposed method is evaluated with an industrial case study for a set of multithreaded network applications running on the UltraSPARC T2 processor. In most of the experiments, the proposed thread assignment method detected the best actual thread assignment in the evaluation sample. The method improved system performance from 5 to 48 percent with respect to the load-balancing algorithms used in state-of-the-art OSs, and up to 60 percent with respect to a naive thread assignment.
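
    The black-box flavor of this approach can be sketched as sample-and-measure search. Everything below is a simplification for illustration (in particular the `measure` cost model, which penalizes co-locating threads on the same hardware context, is invented): performance of an assignment is treated as an opaque measurement, a sample of assignments is evaluated, and the best one is kept.

```python
import itertools, random

def measure(assignment):
    """Stand-in for actually running the workload under an assignment
    (tuple of hardware-context ids, one per thread). Hypothetical cost
    model: each pair of co-located threads loses 10 performance units."""
    contention = sum(1 for a, b in itertools.combinations(assignment, 2) if a == b)
    return 100 - 10 * contention          # higher is better

def blackbox_search(num_threads, num_contexts, samples=50, seed=0):
    """Evaluate `samples` random assignments and return the best one."""
    rng = random.Random(seed)
    best, best_perf = None, float("-inf")
    for _ in range(samples):
        cand = tuple(rng.randrange(num_contexts) for _ in range(num_threads))
        perf = measure(cand)
        if perf > best_perf:
            best, best_perf = cand, perf
    return best, best_perf

best, perf = blackbox_search(num_threads=4, num_contexts=8)
assert len(set(best)) == 4 and perf == 100   # found a contention-free assignment
```

    The appeal of the black-box formulation is visible even in this sketch: `blackbox_search` never inspects the processor topology or the applications' resource demands, only the measured outcome of each candidate assignment.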

  • Profile-guided transaction coalescing—lowering transactional overheads by merging transactions

     Stipic, Srdjan; Smiljkovic, Vesna; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    ACM transactions on architecture and code optimization
    Vol. 10, num. 4, p. Article No. 50-
    DOI: 10.1145/2541228.2555306
    Date of publication: 2013-12
    Journal article


    Previous studies in software transactional memory mostly focused on reducing the overhead of transactional read and write operations. In this article, we introduce transaction coalescing, a profile-guided compiler optimization technique that attempts to reduce the overheads of starting and committing a transaction by merging two or more small transactions into one large transaction. We develop a profiling tool and a transaction coalescing heuristic to identify candidate transactions suitable for coalescing. We implement a compiler extension to automatically merge the candidate transactions at the compile time. We evaluate the effectiveness of our technique using the hash table micro-benchmark and the STAMP benchmark suite. Transaction coalescing improves the performance of the hash table significantly and the performance of Vacation and SSCA2 benchmarks by 19.4% and 36.4%, respectively, when running with 12 threads.

  • Supercomputing with commodity CPUs: are mobile SoCs ready for HPC?

     Rajovic, Nikola; Carpenter, Paul Matthew; Gelado Fernandez, Isaac; Puzovic, Nikola; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    International Conference for High Performance Computing, Networking, Storage and Analysis
    p. Article No. 40-
    DOI: 10.1145/2503210.2503281
    Presentation's date: 2013-12
    Presentation of work at congresses

    In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in high-performance computing. This transformation has been so effective that the June 2013 TOP500 list is still dominated by x86. In 2013, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smartphones and tablets, most of which are built with ARM-based SoCs. This leads to the suggestion that, once mobile SoCs deliver sufficient performance, they can help reduce the cost of HPC. This paper addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. We also present our experience evaluating the performance and efficiency of mobile SoCs, deploying a cluster, and evaluating the network and the scalability of production applications. In summary, we give a first answer as to whether mobile SoCs are ready for HPC.

  • Programmability and portability for exascale: top down programming methodology and tools with StarSs

     Subotic, Vladimir; Brinkmann, Steffen; Marjanovic, Vladimir; Badia Sala, Rosa Maria; Gracia, Jose; Niethammer, Christoph; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    Journal of computational science
    Vol. 4, num. 6, p. 450-456
    DOI: 10.1016/j.jocs.2013.01.008
    Date of publication: 2013-11
    Journal article

    StarSs is a task-based programming model that allows programmers to parallelize sequential applications by annotating the code with compiler directives. The model further supports transparent execution of designated tasks on heterogeneous platforms, including clusters of GPUs. This paper focuses on the methodology and tools that complement the programming model, forming a consistent development environment whose objective is to simplify the life of application developers. The programming environment includes the tools TAREADOR and TEMANEJO, which have been designed specifically for StarSs. TAREADOR, a Valgrind-based tool, enables a top-down development approach by assisting the programmer in identifying tasks and their data dependencies across all concurrency levels of an application. TEMANEJO is a graphical debugger that supports the programmer by visualizing the task dependency tree, while also allowing task scheduling and dependencies to be manipulated. These tools are complemented by a set of performance analysis tools (Scalasca, Cube and Paraver) that enable fine-tuning of StarSs applications.

  • Raising the Level of Abstraction: Simulation of Large Chip Multiprocessors Running Multithreaded Applications  Open access

     Rico Carro, Alejandro
    Universitat Politècnica de Catalunya
    Theses

    The number of transistors on an integrated circuit keeps doubling every two years. This increasing number of transistors is used to integrate more processing cores on the same chip. However, due to power density and ILP diminishing returns, the single-thread performance of such processing cores does not double every two years, but every three and a half years. Computer architecture research is mainly driven by simulation. In computer architecture simulators, the complexity of the simulated machine increases with the number of available transistors: the more transistors, the more cores, the more complex the model. However, the performance of computer architecture simulators depends on the single-thread performance of the host machine and, as mentioned before, this doubles not every two years but every three and a half years. This increasing difference between the complexity of the simulated machine and simulation speed is what we call the simulation speed gap. Because of the simulation speed gap, computer architecture simulators are increasingly slow: the simulation of a reference benchmark may take several weeks or even months. Researchers are conscious of this problem and have been proposing techniques to reduce simulation time, including the use of reduced application input sets, sampled simulation, and parallelization. Another technique to reduce simulation time is raising the level of abstraction of the simulated model, and in this thesis we advocate this approach. First, we decide to use trace-driven simulation because it does not require functional simulation and thus allows raising the level of abstraction beyond the instruction-stream representation. However, trace-driven simulation has several limitations, the most important being the inability to reproduce the dynamic behavior of multithreaded applications.
In this thesis we propose a simulation methodology that employs a trace-driven simulator together with a runtime system that allows the proper simulation of multithreaded applications by reproducing their timing-dependent dynamic behavior at simulation time. With this methodology, we evaluate the use of multiple levels of abstraction to reduce simulation time, from a high-speed application-level simulation mode to a detailed instruction-level mode. We provide a comprehensive evaluation of the impact of these abstraction levels on accuracy and simulation speed, and show their applicability and usefulness depending on the target evaluations. We also compare these levels of abstraction with the existing ones in popular computer architecture simulators, and validate the highest abstraction level against a real machine. One of the interesting levels of abstraction for the simulation of multi-cores is the memory mode, which is able to model the performance of a superscalar out-of-order core using memory-access traces. At this level of abstraction, previous works have used filtered traces that do not include L1 hits and allow simulating only the L2 accesses of single-threaded runs. However, simulating multithreaded applications using such filtered traces has inherent inaccuracies. We propose a technique to reduce those inaccuracies and evaluate the speed-up, applicability, and usefulness of memory-level simulation. All in all, this thesis contributes to knowledge with techniques for the simulation of chip multiprocessors with hundreds of cores using traces, and states and evaluates the trade-offs, in terms of accuracy and simulation speed, of using varying degrees of abstraction.
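
    The memory-mode idea of driving a timing model from memory-access traces alone, with no functional simulation, can be sketched as a minimal trace-driven cache model. The direct-mapped geometry and the trace below are assumptions for illustration, not the simulator described in the thesis.

```python
def simulate_cache(trace, lines=4, line_size=64):
    """Tiny trace-driven, direct-mapped cache model: replay memory
    addresses from a trace and count hits/misses, without simulating
    the instructions that produced the accesses."""
    tags = [None] * lines
    hits = misses = 0
    for addr in trace:
        block = addr // line_size       # cache-line (block) number
        idx = block % lines             # direct-mapped set index
        if tags[idx] == block:
            hits += 1
        else:                           # miss: fill the line
            misses += 1
            tags[idx] = block
    return hits, misses

# Assumed toy trace: addresses 0 and 256 conflict in the same set,
# evicting each other, while 8 falls in the same line as 0.
trace = [0, 8, 64, 0, 256, 0]
hits, misses = simulate_cache(trace)
```

    A real memory-mode simulator attaches timing (latencies, bandwidth, reordering) to these classified accesses; the point of the sketch is only that a trace suffices to drive such a model.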

  • The TERAFLUX Project: Exploiting the dataflow paradigm in next generation teradevices

     Solinas, Marco; Badia Sala, Rosa Maria; Bodin, François; Cohen, Albert; Evripidou, Paraskevas; Faraboschi, Paolo; Fechner, Bernhard; Gao, Guang R.; Garbade, Arne; Girbal, Sylvain; Goodman, Daniel; Khan, Behran; Koliai, Souad; Li, Feng; Lujan, Mikel; Morin, Laurent; Mendelson, Avi; Navarro, Nacho; Pop, Antoniu; Trancoso, Pedro; Ungerer, Theo; Valero Cortes, Mateo; Weis, Sebastian; Watson, Ian; Zuckermann, Stéphane; Giorgi, Roberto
    Euromicro Symposium on Digital Systems Design
    p. 272-279
    DOI: 10.1109/DSD.2013.39
    Presentation's date: 2013-09
    Presentation of work at congresses

    Thanks to improvements in semiconductor technologies, extreme-scale systems such as teradevices (i.e., composed of 1,000 billion transistors) will enable systems with 1000+ general-purpose cores per chip, probably by 2020. Three major challenges have been identified: programmability, manageable architecture design, and reliability. TERAFLUX is a Future and Emerging Technology (FET) large-scale project funded by the European Union, which addresses all of these challenges at once by leveraging dataflow principles. This paper describes the project and provides an overview of the research carried out by the TERAFLUX consortium.

  • Performance and Power Optimizations in Chip Multiprocessors for Throughput-Aware Computation  Open access

     Vega, Augusto Javier
    Universitat Politècnica de Catalunya
    Theses

    The so-called "power (or power density) wall" has caused core frequency (and single-thread performance) to slow down, giving rise to the era of multi-core/multi-thread processors. For example, the IBM POWER4 processor, released in 2001, incorporated two single-thread cores into the same chip; in 2010, IBM released the POWER7 processor with eight 4-thread cores on the same chip, for a total capacity of 32 execution contexts. The ever-increasing number of cores and threads gives rise to new opportunities and challenges for software and hardware architects. At the software level, applications can benefit from the abundant number of execution contexts to boost throughput, but this challenges programmers to create highly parallel applications and operating systems capable of scheduling them correctly. At the hardware level, the increasing core and thread count puts pressure on the memory interface, because memory bandwidth grows at a slower pace, a phenomenon known as the "bandwidth (or memory) wall". In addition to memory bandwidth issues, chip power consumption rises due to manufacturers' difficulty in lowering operating voltages sufficiently with every processor generation. This thesis presents innovations to improve bandwidth and power consumption in chip multiprocessors (CMPs) for throughput-aware computation: a bandwidth-optimized last-level cache (LLC), a bandwidth-optimized vector register file, and a power/performance-aware thread placement heuristic. In contrast to state-of-the-art LLC designs, our organization avoids data replication and, hence, does not require keeping data coherent. Instead, the address space is statically distributed over the LLC in a fine-grained interleaving fashion. The absence of data replication increases the cache's effective capacity, which results in better hit rates and higher bandwidth compared to a coherent LLC. We use double buffering to hide the extra access latency due to the lack of data replication.
The proposed vector register file is composed of thousands of registers and organized as an aggregation of banks. We leverage this organization to attach small special-function "local computation elements" (LCEs) to each bank. This approach, referred to as the "processor-in-regfile" (PIR) strategy, overcomes the limited number of register file ports. Because each LCE is a SIMD computation element and all of them can proceed concurrently, the PIR strategy constitutes a highly parallel super-wide-SIMD device, ideal for throughput-aware computation. Finally, we present a heuristic to reduce chip power consumption by dynamically placing software (application) threads across hardware (physical) threads. The heuristic gathers chip-level power and performance information at runtime to infer characteristics of the applications being executed. For example, if an application's threads share data, the heuristic may decide to place them on fewer cores to favor inter-thread data sharing and communication; the number of active cores then decreases, which is a good opportunity to switch off the unused cores and save power. It is increasingly hard to find bulletproof (micro-)architectural solutions for the bandwidth and power scalability limitations of CMPs. Consequently, we think architects should attack those problems from different flanks simultaneously, with complementary innovations. This thesis contributes a battery of solutions to alleviate those problems in the context of throughput-aware computation: 1) a bandwidth-optimized LLC; 2) a bandwidth-optimized register file organization; and 3) a simple technique to improve power-performance efficiency.
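
    The fine-grained static interleaving of the address space over LLC banks can be sketched as a simple address-to-bank mapping; the line size and bank count below are assumed values for illustration, not those of the proposed design.

```python
LINE_SIZE = 128   # bytes per cache line (assumed)
NUM_BANKS = 16    # number of LLC banks (assumed)

def llc_bank(addr):
    """Static fine-grained interleaving: consecutive cache lines map to
    consecutive banks, so every line has exactly one home bank, no data
    is replicated, and no coherence protocol is needed among banks."""
    return (addr // LINE_SIZE) % NUM_BANKS

# Consecutive lines spread across banks, spreading bandwidth demand;
# any address within the same line always resolves to the same bank.
banks = [llc_bank(line * LINE_SIZE) for line in range(4)]
```

    The cost of this scheme is that most lines live in a remote bank, which is exactly the extra latency the thesis hides with double buffering.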

  • Evaluating techniques for parallelization tuning in MPI, OmpSs and MPI/OmpSs  Open access

     Subotic, Vladimir
    Universitat Politècnica de Catalunya
    Theses

    Parallel programming is used to partition a computational problem among multiple processing units and to define how they interact (communicate and synchronize) in order to guarantee the correct result. The performance that is achieved when executing the parallel program on a parallel architecture is usually far from the optimal: computation unbalance and excessive interaction among processing units often cause lost cycles, reducing the efficiency of parallel computation. In this thesis we propose techniques oriented to better exploit parallelism in parallel applications, with emphasis in techniques that increase asynchronism. Theoretically, this type of parallelization tuning promises multiple benefits. First, it should mitigate communication and synchronization delays, thus increasing the overall performance. Furthermore, parallelization tuning should expose additional parallelism and therefore increase the scalability of execution. Finally, increased asynchronism would provide higher tolerance to slower networks and external noise. In the first part of this thesis, we study the potential for tuning MPI parallelism. More specifically, we explore automatic techniques to overlap communication and computation. We propose a speculative messaging technique that increases the overlap and requires no changes of the original MPI application. Our technique automatically identifies the application’s MPI activity and reinterprets that activity using optimally placed non-blocking MPI requests. We demonstrate that this overlapping technique increases the asynchronism of MPI messages, maximizing the overlap, and consequently leading to execution speedup and higher tolerance to bandwidth reduction. However, in the case of realistic scientific workloads, we show that the overlapping potential is significantly limited by the pattern by which each MPI process locally operates on MPI messages. 
In the second part of this thesis, we study the potential for tuning hybrid MPI/OmpSs parallelism. We try to gain a better understanding of the parallelism of hybrid MPI/OmpSs applications in order to evaluate how these applications would execute on future machines and to predict the execution bottlenecks that are likely to emerge. We explore how MPI/OmpSs applications could scale on a parallel machine with hundreds of cores per node, and we investigate how this high parallelism within each node would reflect on the network constraints. We especially focus on identifying critical code sections in MPI/OmpSs. We devised a technique that quickly evaluates, for a given MPI/OmpSs application and the selected target machine, which code section should be optimized in order to gain the highest performance benefits. This thesis also studies techniques to quickly explore the potential OmpSs parallelism inherent in applications. We provide mechanisms to easily evaluate the potential parallelism of any task decomposition. Furthermore, we describe an iterative trial-and-error approach to search for a task decomposition that will expose sufficient parallelism for a given target machine. Finally, we explore the potential of automating the iterative approach by capturing the programmers' experience in an expert system that can autonomously lead the search process. Throughout the work on this thesis, we also designed development tools that can be useful to other researchers in the field. The most advanced of these tools is Tareador, a tool that helps port MPI applications to the MPI/OmpSs programming model. Tareador provides a simple interface for proposing a decomposition of the code into OmpSs tasks. It dynamically calculates data dependencies among the annotated tasks and automatically estimates the potential OmpSs parallelization. Furthermore, Tareador gives additional hints on how to complete the process of porting the application to OmpSs.
Tareador has already proved useful, having been included in parallel programming courses at UPC.
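The potential-parallelism estimate that a tool like Tareador reports can be sketched in a few lines: given a task decomposition and its data dependencies, an upper bound on speedup is the ratio of total work to the critical-path length. This is a toy illustration only, not Tareador's actual implementation; the task names and costs below are invented:

```python
# Toy sketch: estimate the potential parallelism of a task decomposition
# from its data-dependency graph, as total work / critical-path length
# (an upper bound on the speedup any schedule can achieve).
def potential_parallelism(durations, deps):
    """durations: {task: cost}; deps: {task: [tasks it depends on]}."""
    finish = {}

    def finish_time(task):  # length of the longest dependency chain ending at `task`
        if task not in finish:
            finish[task] = durations[task] + max(
                (finish_time(d) for d in deps.get(task, [])), default=0.0)
        return finish[task]

    total_work = sum(durations.values())
    critical_path = max(finish_time(t) for t in durations)
    return total_work / critical_path

# Hypothetical decomposition: t1 feeds t2 and t3, which both feed t4.
durations = {"t1": 1.0, "t2": 1.0, "t3": 1.0, "t4": 1.0}
deps = {"t2": ["t1"], "t3": ["t1"], "t4": ["t2", "t3"]}
print(potential_parallelism(durations, deps))  # 4 units of work over a 3-unit critical path
```

A trial-and-error search over decompositions, as described above, amounts to recomputing this bound for each candidate and keeping the one whose parallelism meets the target machine's core count.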


  • Moving from petaflops to petadata

     Flynn, Michael J.; Mencer, Oskar; Milutinovic, Veljko; Rakocevic, Goran; Stenstrom, Per; Trobec, Roman; Valero Cortes, Mateo
    Communications of the ACM
    Vol. 56, num. 5, p. 39-42
    DOI: 10.1145/2447976.2447989
    Date of publication: 2013-05
    Journal article


    The race to build ever-faster supercomputers is on, with more contenders than ever before. However, the current goals set for this race may not lead to the fastest computation for particular applications.

  • HPC system software for regular and irregular parallel applications

     Morari, Alessandro; Valero Cortes, Mateo
    IEEE International Parallel and Distributed Processing Symposium
    p. 2242-2245
    DOI: 10.1109/IPDPSW.2013.78
    Presentation's date: 2013-05
    Presentation of work at congresses


    The upcoming generation of system software for High Performance Computing is expected to provide a richer set of functionalities without compromising application performance. This Ph.D. thesis addresses the problem of designing scalable system software for both regular and irregular applications. The contributions are two-fold. First, we evaluate the drawbacks of current HPC system software for regular applications. We describe a methodology to precisely measure jitter on a general-purpose OS. Considering a lightweight operating system (IBM CNK), we analyze the overhead of adding support for a missing feature such as dynamic memory management. Second, we focus on irregular applications and build a specialized runtime system to support this kind of application on common FLOP-intensive HPC systems. The proposed runtime system provides a global address space abstraction of a distributed memory machine combined with a transparent fork/join execution model, and it also includes lightweight multithreading and network message aggregation.
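The jitter-measurement idea can be illustrated with a minimal fixed-work microbenchmark: time many repetitions of identical user-level work and treat deviations from the fastest run as interference introduced by the system software. This is a hedged sketch, not the thesis methodology; the iteration counts are arbitrary:

```python
# Minimal illustration of OS-jitter measurement: repeat a fixed-work loop,
# take the fastest run as the (assumed) interference-free baseline, and
# report each run's excess over that baseline as jitter.
import time

def measure_jitter(iterations=200, work=20000):
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        s = 0
        for i in range(work):      # fixed amount of user-level work
            s += i
        samples.append(time.perf_counter() - start)
    best = min(samples)            # assumed interference-free baseline
    return [t - best for t in samples]

jitter = measure_jitter()
print(f"max observed jitter: {max(jitter) * 1e6:.1f} us")
```

On a lightweight kernel such as CNK the excess over the baseline would be near zero; on a general-purpose OS, daemons and timer interrupts show up as outliers.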

  • On the selection of adder unit in energy efficient vector processing

     Ratkovic, Ivan; Palomar Perez, Oscar; Stanic, Milan; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    International Symposium on Quality Electronic Design
    p. 143-150
    DOI: 10.1109/ISQED.2013.6523602
    Presentation's date: 2013-05
    Presentation of work at congresses


    Vector processors are a very promising solution for mobile devices and servers due to their inherently energy-efficient way of exploiting data-level parallelism. Previous research on vector architectures predominantly focused on performance, so vector processors require a new design space exploration to achieve low power. In this paper, we present a design space exploration of the vector adder (VA) unit, as it is one of the crucial components of the core design, with a non-negligible impact on overall performance and power. For this interrelated circuit-architecture exploration, we developed a novel framework with both architectural- and circuit-level tools. Our framework includes both design-related parameters (e.g. the adder's family type) and vector architecture-related parameters (e.g. vector length). Finally, we present guidelines on the selection of the most appropriate VA for different types of vector processors according to different sets of metrics of interest. For example, we found that 2-lane configurations are more EDP (Energy×Delay Product)-efficient than single-lane configurations for low-end mobile processors.
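The metric driving such an exploration can be sketched as follows. The energy and delay numbers below are invented for illustration, not the paper's measurements; the configuration names are likewise hypothetical:

```python
# Illustrative sketch of an EDP-driven selection: compute the Energy-Delay
# Product for each candidate adder configuration and pick the minimum.
def edp(energy_nj, delay_ns):
    """Energy-Delay Product: lower is better for EDP-oriented designs."""
    return energy_nj * delay_ns

# Hypothetical vector-adder configurations: (energy per op in nJ, delay in ns).
configs = {
    "1-lane ripple-carry": (1.0, 4.0),
    "2-lane ripple-carry": (1.1, 2.1),   # ~2x throughput for slightly more energy
    "1-lane carry-lookahead": (1.6, 2.8),
}
best = min(configs, key=lambda name: edp(*configs[name]))
print(best)  # "2-lane ripple-carry" under these made-up numbers
```

Swapping in a different metric (energy alone, or Energy×Delay², which weights performance more heavily) can change which configuration wins, which is why the paper reports guidelines per set of metrics of interest.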

  • TM-dietlibc: A TM-aware real-world system library

     Smiljkovic, Vesna; Nowack, Martin; Miletic, Nebojša; Harris, Tim; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    IEEE International Parallel and Distributed Processing Symposium
    p. 1266-1274
    DOI: 10.1109/IPDPS.2013.45
    Presentation's date: 2013-05
    Presentation of work at congresses


    The simplicity of concurrent programming with Transactional Memory (TM) and its recent implementation in mainstream processors greatly motivates researchers and industry to investigate this field and propose new implementations and optimizations. However, there is still no standard C system library which a wide range of TM developers can adopt. TM application developers have been forced to avoid library calls inside of transactions or to execute them irrevocably (i.e. in serial order). In this paper, we present the first TM-aware system library, a complex software implementation integrated with TM principles and suited for software (STM), hardware (HTM) and hybrid TM (HyTM). The library we propose is derived from a modified lock-based implementation and can be used with the existing standard C API. In our work, we describe design challenges and code optimizations that would be specific to any TM-based system library or application. We discuss system call execution within transactions, highlighting the possibility of unexpected results from threads. For this reason we propose: (1) a mechanism for detecting conflicts over kernel data in user space, and (2) a new barrier to allow hybrid TM to be used effectively with system libraries. Our evaluation includes different TM implementations and focuses on memory management and file operations, since they are widely used in applications and require additional mechanisms for concurrent execution. We show the benefit gained from our libc modifications, which provide as much parallel execution as possible. The library we propose shows high scalability when linked with STM and HTM. For file operations it shows on average a 1.1, 2.6 and 3.7x performance speedup for 8 cores using HyTM, STM and HTM, respectively (over a lock-based single-threaded execution). For a red-black tree it shows on average a 3.14x performance speedup for 8 cores using STM (over a multi-read single-threaded execution).
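The optimistic read-validate-commit cycle at the heart of an STM can be sketched as follows. This is a toy illustration, far simpler than TM-dietlibc; the `Cell` class and `atomically` helper are inventions of this sketch, not the library's API:

```python
# Toy optimistic-STM sketch: each cell carries a version; a transaction
# records the versions it read, then validates and commits atomically
# under one lock, retrying from scratch if a read was invalidated.
import threading

_commit_lock = threading.Lock()

class Cell:
    def __init__(self, value):
        self.value, self.version = value, 0

def atomically(txn):
    while True:                       # retry loop on conflict
        reads, writes = {}, {}
        txn(reads, writes)            # txn records observed versions and new values
        with _commit_lock:
            if all(c.version == v for c, v in reads.items()):  # validate reads
                for c, new_value in writes.items():
                    c.value, c.version = new_value, c.version + 1
                return

acct = Cell(100)

def deposit(reads, writes):
    reads[acct] = acct.version        # version observed at read time
    writes[acct] = acct.value + 10    # proposed new value

threads = [threading.Thread(target=atomically, args=(deposit,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(acct.value)  # 180: all eight deposits commit exactly once
```

A system call inside `txn` would escape this retry discipline entirely, which is exactly the problem the paper's kernel-data conflict detection and barrier address.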

  • ACM Awards

     Valero Cortes, Mateo
    Award or recognition


  • SMT malleability in IBM POWER5 and POWER6 processors

     Morari, A.; Boneti, Carlos; Cazorla Almeida, Francisco Javier; Gioiosa, Roberto; Cher, Chen-Yong; Buyuktosunoglu, Alper; Bose, Prosenjit; Valero Cortes, Mateo
    IEEE transactions on computers
    Vol. 62, num. 4, p. 813-826
    DOI: 10.1109/TC.2012.34
    Date of publication: 2013-04
    Journal article


    While several hardware mechanisms have been proposed to control the interaction between hardware threads in an SMT processor, few have addressed the issue of software-controllable SMT performance. The IBM POWER5 and POWER6 are the first high-performance processors implementing a software-controllable hardware-thread prioritization mechanism that controls the rate at which each hardware thread decodes instructions. This paper shows the potential of this basic mechanism to improve several target metrics for various applications on POWER5 and POWER6 processors. Our results show that although the software interface is exactly the same, the software-controlled priority mechanism has a different effect on POWER5 and POWER6. For instance, hardware threads in POWER6 are less sensitive to priorities than in POWER5 due to the in-order design. We study the SMT thread malleability to enable user-level optimizations that leverage software-controlled thread priorities...

  • Trace filtering of multithreaded applications for CMP memory simulation

     Rico Carro, Alejandro; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    IEEE International Symposium on Performance Analysis of Systems and Software
    p. 134-135
    DOI: 10.1109/ISPASS.2013.6557160
    Presentation's date: 2013-04
    Presentation of work at congresses


    Recent works have shown that modelling the performance of out-of-order superscalar cores is doable using filtered memory traces for single-thread simulations. However, those techniques do not account for cache coherence actions, so they cannot be used reliably in multithreaded scenarios. In this paper, we leverage the structure of parallel applications to propose a simulation methodology that enables the use of filtered memory traces for the simulation of multithreaded applications on multicore architectures. In our experiments, our proposal reduced the simulation error of state-of-the-art techniques by 39% on average, while sacrificing only 9.5% of the simulation speedup.
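The basic single-thread filtering idea that this work builds on can be sketched as follows: run the trace through a small, fast cache model and keep only the accesses that miss, so the slower detailed simulator processes a much shorter trace. This toy direct-mapped model is illustrative only; the paper's contribution is handling cache-coherence actions across threads, which this sketch omits:

```python
# Toy trace filter: a direct-mapped cache model keeps only the accesses
# that miss in it, shrinking the trace fed to a detailed simulator.
def filter_trace(addresses, num_sets=4, line_bytes=64):
    tags = [None] * num_sets          # one tag per direct-mapped set
    misses = []
    for addr in addresses:
        line = addr // line_bytes
        s, tag = line % num_sets, line // num_sets
        if tags[s] != tag:            # miss: record the access and fill the line
            tags[s] = tag
            misses.append(addr)
    return misses

trace = [0x00, 0x04, 0x40, 0x00, 0x1000, 0x44]
print([hex(a) for a in filter_trace(trace)])  # ['0x0', '0x40', '0x1000']
```

Hits inside a cache line (0x04, the repeated 0x00, 0x44) are dropped, while the conflict at 0x1000 survives; a coherence-aware filter must additionally retain accesses that other threads may invalidate.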

    Efficient cache architectures for reliable hybrid voltage operation using EDC codes  Open access

     Maric, Bojan; Abella Ferrer, Jaume; Valero Cortes, Mateo
    Design, Automation and Test in Europe
    p. 917-920
    Presentation's date: 2013-03
    Presentation of work at congresses


    Semiconductor technology evolution enables the design of sensor-based battery-powered ultra-low-cost chips (e.g., below 1 €) required for new market segments such as body, urban life and environment monitoring. Caches have been shown to be the highest energy and area consumer in those chips. This paper proposes a novel, hybrid-operation (high Vcc, ultra-low Vcc), single-Vcc domain cache architecture based on replacing energy-hungry bitcells (e.g., 10T) by more energy-efficient and smaller cells (e.g., 8T) enhanced with Error Detection and Correction (EDC) features for high reliability and performance predictability. Our architecture is shown to largely outperform existing solutions in terms of energy and area.

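As an illustration of the EDC idea (the paper does not specify this particular code), a Hamming(7,4) codeword protects a 4-bit data nibble and corrects any single bit flip, which is the kind of protection that lets smaller bitcells operate reliably at ultra-low voltage:

```python
# Illustrative Hamming(7,4) encoder/decoder: three parity bits protect four
# data bits; the syndrome gives the 1-based position of a single flipped bit.
def encode(d):                        # d: 4-bit int (d0..d3)
    b = [(d >> i) & 1 for i in range(4)]
    p1 = b[0] ^ b[1] ^ b[3]           # covers codeword positions 1,3,5,7
    p2 = b[0] ^ b[2] ^ b[3]           # covers positions 2,3,6,7
    p3 = b[1] ^ b[2] ^ b[3]           # covers positions 4,5,6,7
    return [p1, p2, b[0], p3, b[1], b[2], b[3]]   # positions 1..7

def decode(c):                        # returns the corrected 4-bit int
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 0 if clean, else flipped bit's position
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1          # correct the single-bit error
    return c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)

word = encode(0b1011)
word[4] ^= 1                          # inject a single bit flip
print(decode(word) == 0b1011)         # True: the error is corrected
```

Real low-voltage caches use stronger codes over wider words, trading extra check bits and decode latency for tolerance to the bitcell failures that appear near the minimum operating voltage.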


  • Computación de Altas Prestaciones VI

     Guitart Fernández, Jordi; Monreal Arnal, Teresa; Valero Cortes, Mateo
    Competitive project


  • Fair CPU time accounting in CMP+SMT processors

     Luque, Carlos; Moreto Planas, Miquel; Cazorla, Francisco J.; Valero Cortes, Mateo
    ACM transactions on architecture and code optimization
    Vol. 9, num. 4, p. 1-25
    DOI: 10.1145/2400682.2400709
    Date of publication: 2013-01
    Journal article
