Scientific and technological production

1 to 50 of 89 results
  • SMT malleability in IBM POWER5 and POWER6 processors

     Morari, A.; Boneti, Carlos; Cazorla Almeida, Francisco Javier; Gioiosa, Roberto; Cher, Chen-Yong; Buyuktosunoglu, Alper; Bose, Prosenjit; Valero Cortes, Mateo
    IEEE transactions on computers
    Date of publication: 2013-04
    Journal article

    While several hardware mechanisms have been proposed to control the interaction between hardware threads in an SMT processor, few have addressed the issue of software-controllable SMT performance. The IBM POWER5 and POWER6 are the first high-performance processors implementing a software-controllable hardware-thread prioritization mechanism that controls the rate at which each hardware thread decodes instructions. This paper shows the potential of this basic mechanism to improve several target metrics for various applications on POWER5 and POWER6 processors. Our results show that although the software interface is exactly the same, the software-controlled priority mechanism has a different effect on POWER5 and POWER6. For instance, hardware threads in POWER6 are less sensitive to priorities than in POWER5 because of its in-order design. We study the SMT thread malleability to enable user-level optimizations that leverage software-controlled thread priorities...
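
    The software interface in question is exposed through the Power ISA priority-hint nops, which is what makes user-level control possible. A minimal sketch of their use (the or rX,rX,rX encodings are the documented hints; the C wrappers and names are ours, assuming GCC inline assembly on a POWER machine):

        /* Priority-hint nops from the Power ISA: "or 1,1,1" lowers and
         * "or 2,2,2" restores the decode priority of the issuing
         * hardware thread. Wrapper names are illustrative. */
        static inline void hmt_low(void)    { __asm__ volatile("or 1,1,1"); }
        static inline void hmt_medium(void) { __asm__ volatile("or 2,2,2"); }

        /* Example: spin politely so the sibling hardware thread keeps
         * most of the decode bandwidth while we wait. */
        void wait_on(volatile int *flag)
        {
            hmt_low();
            while (!*flag)
                ;
            hmt_medium();
        }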

  • DTM: degraded test mode for fault-aware probabilistic timing analysis

     Slijepcevic, Mladen; Kosmidis, Leonidas; Abella, Jaume; Quiñones Moreno, Eduardo; Cazorla Almeida, Francisco Javier
    Euromicro Conference on Real-Time Systems
    Presentation's date: 2013-07
    Presentation of work at congresses

    Existing timing analysis techniques to derive Worst-Case Execution Time (WCET) estimates assume that hardware in the target platform (e.g., the CPU) is fault-free. Given the increasing performance requirements of current Critical Real-Time Embedded Systems (CRTES), the use of high-performance features and smaller transistors in current and future hardware becomes a must. Smaller transistors help provide more performance while maintaining low energy budgets; however, hardware fault rates increase noticeably, affecting the temporal behaviour of the system in general, and the WCET in particular. In this paper, we reconcile these two emergent needs of CRTES, namely tight (and trustworthy) WCET estimates and the use of hardware implemented with smaller transistors. To that end we propose the Degraded Test Mode (DTM) which, in combination with fault-tolerant hardware designs and probabilistic timing analysis techniques, (i) enables the computation of tight and trustworthy WCET estimates in the presence of faults, (ii) provides graceful average and worst-case performance degradation under faults, and (iii) requires modifications neither to WCET analysis tools nor to applications. Our results show that DTM allows accounting for the effect of faults at analysis time with low impact on WCET estimates and negligible hardware modifications.

  • Probabilistic timing analysis on conventional cache designs

     Kosmidis, Leonidas; Curtsinger, Charlie; Quiñones Moreno, Eduardo; Abella Ferrer, Jaume; Berger, Emery D.; Cazorla Almeida, Francisco Javier
    Design, Automation and Test in Europe
    Presentation's date: 2013-03
    Presentation of work at congresses

    Probabilistic timing analysis (PTA), a promising alternative to traditional worst-case execution time (WCET) analyses, enables pairing time bounds (named probabilistic WCET or pWCET) with an exceedance probability (e.g., 10^-16), resulting in far tighter bounds than conventional analyses. However, the applicability of PTA has been limited by its dependence on relatively exotic hardware: fully-associative caches using random replacement. This paper extends the applicability of PTA to conventional cache designs via a software-only approach. We show that, by using a combination of compiler techniques and runtime system support to randomise the memory layout of both code and data, conventional caches behave as fully-associative ones with random replacement.
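
    A software-only sketch of the layout-randomization idea (our illustration, not the authors' compiler/runtime): over-allocate an object and start it at a random cache-line offset, so its cache-set mapping, and hence its hit/miss pattern, varies randomly across runs:

        #include <stdlib.h>

        /* Assumed way size of the targeted cache and line size, in bytes. */
        #define WAY_SIZE 4096u
        #define LINE     64u

        /* Returns 'size' usable bytes starting at a random line offset.
         * A real runtime would also randomize code and stack placement,
         * and would keep the base pointer so the block can be freed. */
        void *alloc_randomized(size_t size)
        {
            char *base = malloc(size + WAY_SIZE);
            if (!base)
                return NULL;
            size_t off = (size_t)(rand() % (WAY_SIZE / LINE)) * LINE;
            return base + off;
        }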

  • A cache design for probabilistically analysable real-time systems

     Kosmidis, Leonidas; Abella Ferrer, Jaume; Quiñones Moreno, Eduardo; Cazorla Almeida, Francisco Javier
    Design, Automation and Test in Europe
    Presentation's date: 2013-03
    Presentation of work at congresses

    Caches provide significant performance improvements, yet their use in the real-time industry is low because current WCET analysis tools require detailed knowledge of a program's cache accesses to provide tight WCET estimates. Probabilistic Timing Analysis (PTA) has emerged as a solution to reduce the amount of information needed to provide tight WCET estimates, although it imposes new requirements on hardware design. At the cache level, so far only fully-associative random-replacement caches have been proven to fulfill the needs of PTA, but they are expensive in size and energy. In this paper we propose a cache design that allows set-associative and direct-mapped caches to be analysed with PTA techniques. In particular, we propose a novel parametric random placement suitable for PTA that is proven to have low hardware complexity and energy consumption while providing performance comparable to that of conventional modulo placement.
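
    The essence of parametric random placement is that the set index becomes a hash of the line address and a random seed drawn at a defined point (e.g., per run), so placement is random across runs but stable within one. A minimal sketch of such an index function (the mixing function below is illustrative, not the paper's hardware design):

        #include <stdint.h>

        /* num_sets must be a power of two. */
        static inline uint32_t set_index(uint64_t line_addr,
                                         uint64_t seed,
                                         uint32_t num_sets)
        {
            uint64_t x = (line_addr ^ seed) * 0x9E3779B97F4A7C15ull;
            return (uint32_t)(x >> 32) & (num_sets - 1);
        }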

  • The next convergence: High-performance and mission-critical markets

     Girbal, Sylvain; Moreto Planas, Miquel; Grasset, Arnaud; Abella Ferrer, Jaume; Quiñones Moreno, Eduardo; Cazorla Almeida, Francisco Javier; Yehia, Sami
    High-performance and Real-time Embedded Systems
    Presentation's date: 2013-01
    Presentation of work at congresses

    The well-known convergence of the high-performance computing and mobile markets has been a dominating factor in the computing market during the last two decades. In this paper we witness a new type of convergence between the mission-critical market (such as avionics or automotive) and the mainstream consumer-electronics market. This convergence is fuelled by the common needs of both markets for more reliability, support for mission-critical functionality, and the challenge of harnessing the unsustainable increases in the safety margins used to guarantee either correctness or timing. In this position paper, we describe this new convergence, as well as the main challenges and opportunities that it brings to the computing industry.

  • Measurement-based probabilistic timing analysis: Lessons from an integrated-modular avionics case study

     Wartel, Franck; Kosmidis, Leonidas; Lo, Code; Triquet, Benoit; Quiñones Moreno, Eduardo; Abella Ferrer, Jaume; Gogonel, Adriana; Baldovin, Andrea; Mezzetti, Enrico; Cucu-Grosjean, Liliana; Vardanega, Tullio; Cazorla Almeida, Francisco Javier
    IEEE International Symposium on Industrial Embedded Systems
    Presentation's date: 2013-06-19
    Presentation of work at congresses

    Probabilistic Timing Analysis (PTA) in general, and its measurement-based variant MBPTA in particular, can mitigate some of the problems that impair current worst-case execution time (WCET) analysis techniques. MBPTA computes tight WCET bounds expressed as probabilistic exceedance functions, without needing much information on the hardware and software internals of the system. Classic WCET analysis has information needs that may be costly and difficult to satisfy, and their omission increases pessimism. Previous work has shown that MBPTA does well with benchmark programs. Real-world applications, however, place more demanding requirements on timing analysis than simple benchmarks, and it is interesting to see how PTA responds to them. This paper discusses the application of MBPTA to a real avionics system and presents the lessons learned in that process.
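
    MBPTA rests on extreme value theory: execution-time observations are fitted with an extreme value distribution, and the pWCET is the quantile at the target exceedance probability. A stripped-down numeric sketch of that core arithmetic (method-of-moments Gumbel fit; real MBPTA applies block maxima and statistical applicability tests, so this is a simplification):

        #include <math.h>

        /* pWCET estimate x such that P(time > x) = p, from n samples t[]. */
        double pwcet(const double *t, int n, double p)
        {
            double mean = 0.0, var = 0.0;
            for (int i = 0; i < n; i++) mean += t[i];
            mean /= n;
            for (int i = 0; i < n; i++) var += (t[i] - mean) * (t[i] - mean);
            var /= n - 1;

            double beta = sqrt(6.0 * var) / 3.141592653589793; /* Gumbel scale    */
            double mu   = mean - 0.57721566 * beta;            /* Gumbel location */

            /* log1p keeps the quantile finite for tiny p such as 1e-16. */
            return mu - beta * log(-log1p(-p));
        }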

  • Thread assignment of multithreaded network applications in multicore/multithreaded processors

     Radojkovic, Petar; Cakarevic, Vladimir; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Cazorla Almeida, Francisco Javier; Nemirovski, Mario; Valero Cortes, Mateo
    IEEE transactions on parallel and distributed systems
    Date of publication: 2013-12
    Journal article

    The introduction of multithreaded processors comprised of a large number of cores with many shared resources makes thread scheduling, and in particular the optimal assignment of running threads to processor hardware contexts, one of the most promising ways to improve system performance. However, finding optimal thread assignments for workloads running on state-of-the-art multicore/multithreaded processors is an NP-complete problem. In this paper, we propose the BlackBox scheduler, a systematic method for thread assignment of multithreaded network applications running on multicore/multithreaded processors. The method requires minimal information about the target processor architecture and no data about the hardware requirements of the applications under study. The proposed method is evaluated with an industrial case study for a set of multithreaded network applications running on the UltraSPARC T2 processor. In most of the experiments, the proposed thread assignment method detected the best actual thread assignment in the evaluation sample. The method improved system performance by 5 to 48 percent with respect to the load-balancing algorithms used in state-of-the-art OSs, and by up to 60 percent with respect to a naive thread assignment.
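
    The black-box character of the method is that it treats the machine as an oracle: candidate assignments are run and measured, with no processor model. A deliberately simplified sketch of such a loop (random sampling only, not the paper's exact algorithm; measure_throughput() stands for an actual run on the target machine, and all names are ours):

        #include <stdlib.h>

        double measure_throughput(const int *assignment, int nthreads);

        static void shuffle(int *a, int n)
        {
            for (int i = n - 1; i > 0; i--) {
                int j = rand() % (i + 1);
                int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
            }
        }

        /* Try 'samples' random thread-to-context bindings, keep the best. */
        double assign(int *best, int *cand, int nthreads, int samples)
        {
            double best_score = -1.0;
            for (int s = 0; s < samples; s++) {
                shuffle(cand, nthreads);
                double score = measure_throughput(cand, nthreads);
                if (score > best_score) {
                    best_score = score;
                    for (int i = 0; i < nthreads; i++) best[i] = cand[i];
                }
            }
            return best_score;
        }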

  • Hardware support for accurate per-task energy metering in multicore systems

     Liu, Qixiao; Moreto Planas, Miquel; Jiménez, Víctor; Abella Ferrer, Jaume; Cazorla Almeida, Francisco Javier; Valero Cortes, Mateo
    ACM transactions on architecture and code optimization
    Date of publication: 2013-12
    Journal article

    Accurately determining the energy consumed by each task in a system will become of prominent importance in future multicore-based systems because it offers several benefits, including (i) better application energy/performance optimizations, (ii) improved energy-aware task scheduling, and (iii) energy-aware billing in data centers. Unfortunately, existing methods for energy metering in multicores fail to provide accurate energy estimates for each task when several tasks run simultaneously. This article makes a case for accurate Per-Task Energy Metering (PTEM) based on tracking the resource utilization and occupancy of each task. Different hardware implementations with different trade-offs between energy-prediction accuracy and hardware-implementation complexity are proposed. Our evaluation shows that the energy consumed in a multicore by each task can be accurately measured. For a 32-core, 2-way simultaneous multithreaded core setup, PTEM reduces the average accuracy error from more than 12% when our hardware support is not used to less than 4% when it is used. The maximum observed error for any task in the workload we used drops from 58% to 9% when our hardware support is used.
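
    The accounting behind PTEM can be phrased resource by resource: each task is charged the dynamic energy of its own activity plus a share of each resource's static energy proportional to its occupancy. A minimal sketch of that bookkeeping (our simplification; field names are illustrative, and the paper's contribution is doing this in hardware at run time):

        typedef struct {
            double accesses;       /* task's accesses to the resource     */
            double energy_per_acc; /* dynamic energy per access (J)       */
            double occupancy;      /* fraction of the resource held, 0..1 */
            double static_power;   /* leakage power of the resource (W)   */
        } resource_t;

        double task_energy(const resource_t *r, int nres, double seconds)
        {
            double e = 0.0;
            for (int i = 0; i < nres; i++)
                e += r[i].accesses * r[i].energy_per_acc
                   + r[i].occupancy * r[i].static_power * seconds;
            return e;
        }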

  • A Multi-core Processor for Hard Real-Time Systems  Open access

     Paolieri, Marco
    Defense's date: 2011-11-04
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    The increasing demand for new functionalities in current and future hard real-time embedded systems, like the ones deployed in the automotive and avionics industries, is driving an increase in the performance required of current embedded processors. Multi-core processors represent a good design solution to cope with such higher performance requirements due to their better performance-per-watt ratio while keeping the core design simple. Moreover, multi-cores also allow executing mixed-criticality workloads composed of tasks with and without hard real-time requirements, maximizing the utilization of the hardware resources while guaranteeing low cost and low power consumption. Despite those benefits, current multi-core processors are less analyzable than single-core ones due to the interferences between tasks accessing hardware shared resources. As a result, computing a meaningful Worst-Case Execution Time (WCET) estimate, i.e. an upper bound of the application's execution time, becomes extremely difficult, if not impossible, because the execution time of a task may change depending on the other threads running at the same time. This makes the WCET of a task dependent on the set of inter-task interferences introduced by the co-running tasks. Providing a WCET estimate independent of the other tasks (the time-composability property) is a key requirement in hard real-time systems.

    This thesis proposes a new multi-core processor design in which time composability is achieved, hence enabling the use of multi-cores in hard real-time systems. With our proposals, the WCET estimate of a Hard Real-time Task (HRT) is independent of the other co-running tasks. To that end, we design a multi-core processor in which the maximum delay a request from an HRT accessing a hardware shared resource can suffer due to other tasks is bounded: our processor guarantees that a request to a shared resource cannot be delayed longer than a given Upper Bound Delay (UBD). In addition, the UBD allows identifying the impact that different processor configurations may have on the WCET by determining the sensitivity of an HRT to different resource allocations. This thesis proposes an off-line task allocation algorithm, the Interference-Aware Allocation Algorithm (IA3), that allocates the tasks in a task set based on the HRTs' sensitivity to different resource allocations. As a result, the hardware shared resources used by HRTs are minimized, allowing Non-Hard Real-time Tasks (NHRTs) to use the remaining resources. Overall, our proposals provide analyzability for the HRTs while allowing NHRTs to execute on the same chip without any effect on the HRTs.

    The first two proposals of this thesis focus on supporting the execution of multi-programmed workloads with mixed criticality levels (composed of HRTs and NHRTs). Higher performance could be achieved by implementing multi-threaded applications. As a first step towards supporting hard real-time parallel applications, this thesis proposes a new hardware/software approach to guarantee a predictable execution of software-pipelined parallel programs. This thesis also investigates a solution to verify the timing correctness of HRTs without requiring any modification of the core design: we design a hardware unit that is interfaced with the processor and integrated into a functional-safety-aware methodology. This unit monitors the execution time of a block of instructions and detects whether it exceeds the WCET. Concretely, we show how to handle timing faults on a real industrial automotive platform.

  • ITCA: Inter-Task Conflict-Aware CPU accounting for CMP

     Luque, Carlos; Moreto Planas, Miquel; Cazorla Almeida, Francisco Javier; Gioiosa, Roberto; Valero Cortes, Mateo
    Jornadas de Paralelismo
    Presentation of work at congresses

  • Thread to strand binding of parallel network applications in massive multi-threaded systems

     Radojkovic, Petar; Cakarevic, Vladimir; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Cazorla Almeida, Francisco Javier; Nemirovsky, Mario; Valero Cortes, Mateo
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    Presentation's date: 2010-01
    Presentation of work at congresses

    In processors with several levels of hardware resource sharing, like CMPs in which each core is an SMT, the scheduling process becomes more complex than in processors with a single level of resource sharing, such as pure-SMT or pure-CMP processors. Once the operating system selects the set of applications to simultaneously schedule on the processor (the workload), each application/thread must be assigned to one of the hardware contexts (strands). We call this last scheduling step the Thread to Strand Binding (TSB). In this paper, we show that the impact of the TSB on the performance of processors with several levels of shared resources is high. We measure a variation of up to 59% between different TSBs of real multithreaded network applications running on the UltraSPARC T2 processor, which has three levels of resource sharing. In our view, this problem is going to be more acute in future multithreaded architectures comprising more cores, more contexts per core, and more levels of resource sharing. We propose a resource-sharing-aware TSB algorithm (TSBSched) that significantly simplifies thread-to-strand binding for software-pipelined applications, representative of multithreaded network applications. Our systematic approach encapsulates both the characteristics of the multithreaded processor under study and the structure of the software-pipelined applications. Once calibrated for a given processor architecture, our proposal requires neither hardware knowledge on the side of the programmer nor extensive profiling of the application. We validate our algorithm on the UltraSPARC T2 processor running a set of real multithreaded network applications, on which we report improvements of up to 46% compared to the current state-of-the-art dynamic schedulers.

  • Adapting cache partitioning algorithms to pseudo-LRU replacement policies

     Kedzierski, Kamil; Moreto Planas, Miquel; Cazorla Almeida, Francisco Javier; Valero Cortes, Mateo
    IEEE International Parallel and Distributed Processing Symposium
    Presentation's date: 2010-04
    Presentation of work at congresses

  • An analyzable memory controller for hard real-time CMPs  Open access

     Paolieri, Marco; Quiñones Moreno, Eduardo; Cazorla Almeida, Francisco Javier; Valero Cortes, Mateo
    IEEE Embedded Systems Letters
    Date of publication: 2010-02-05
    Journal article

    Multicore processors (CMPs) represent a good solution to provide the performance required by current and future hard real-time systems. However, it is difficult to compute a tight WCET estimate for CMPs due to the interferences tasks suffer when accessing shared hardware resources. We propose an analyzable, JEDEC-compliant DDRx SDRAM memory controller (AMC) for hard real-time CMPs that reduces the impact of memory interferences caused by other tasks on WCET estimates, providing a predictable memory access time and allowing the computation of tight WCET estimates.
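
    The key property is that every memory request of a task is bounded by a fixed upper-bound delay (UBD), which makes the WCET composable: it can be computed once, in isolation, and remains valid for any co-runners. Schematically (our rendering of the accounting, with illustrative names):

        /* WCET bound that holds regardless of the co-running tasks:
         * execution assuming each of the task's memory requests takes
         * the controller's upper-bound delay. */
        unsigned long long wcet_bound(unsigned long long cycles_on_core,
                                      unsigned long long mem_requests,
                                      unsigned long long ubd_cycles)
        {
            return cycles_on_core + mem_requests * ubd_cycles;
        }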

  • Improving cache behavior in CMP architectures through cache partitioning techniques  Open access  awarded activity

     Moreto Planas, Miquel
    Defense's date: 2010-03-19
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    The evolution of microprocessor design in the last few decades has changed significantly, moving from simple in-order single-core architectures to superscalar and vector architectures in order to extract the maximum available instruction-level parallelism. Executing several instructions from the same thread in parallel allows significantly improving the performance of an application. However, there is only a limited amount of parallelism available in each thread, because of data and control dependences. Furthermore, designing a high-performance, single, monolithic processor has become very complex due to power and chip-latency constraints. These limitations have motivated the use of thread-level parallelism (TLP) as a common strategy for improving processor performance. Multithreaded processors allow executing different threads at the same time, sharing some hardware resources. There are several flavors of multithreaded processors that exploit TLP, such as chip multiprocessors (CMP), coarse-grain multithreading, fine-grain multithreading, simultaneous multithreading (SMT), and combinations of them.

    To improve cost and power efficiency, the computer industry has adopted multicore chips. In particular, CMP architectures have become the most common design decision (sometimes combined with multithreaded cores). Firstly, CMPs reduce design costs and average power consumption by promoting design re-use and simpler processor cores. For example, it is less complex to design a chip with many small, simple cores than a chip with fewer, larger, monolithic cores; furthermore, simpler cores have less power-hungry centralized hardware structures. Secondly, CMPs reduce costs by improving hardware resource utilization. On a multicore chip, co-scheduled threads can share costly microarchitecture resources that would otherwise be underutilized. Higher resource utilization improves aggregate performance and enables lower-cost design alternatives.

    One of the resources with the greatest impact on the final performance of an application is the cache hierarchy. Caches store data recently used by the applications in order to take advantage of temporal and spatial locality. Caches provide fast access to data, improving the performance of applications. Caches with low latencies have to be small, which prompts the design of a cache hierarchy organized into several levels of cache. In CMPs, the cache hierarchy is normally organized in a first level (L1) of instruction and data caches private to each core, while the last level of cache (LLC) is normally shared among the cores (L2, L3 or both). Shared caches increase resource utilization and system performance. Large caches improve performance and efficiency by increasing the probability that each application can access data from a closer level of the cache hierarchy, and they allow an application to make use of the entire cache if needed.

    A second advantage of having a shared cache in a CMP design has to do with cache coherence. In parallel applications, different threads share the same data and keep a local copy of it in their caches. With multiple processors, it is possible for one processor to change the data, leaving another processor's cache with outdated data. The cache coherence protocol monitors changes to data and ensures that all processor caches have the most recent version. When the parallel application executes on the same physical chip, the cache coherence circuitry can operate at the speed of on-chip communication, rather than having to use the much slower chip-to-chip communication required with discrete processors on separate chips. These coherence protocols are also simpler to design with a unified and shared level of cache on chip.

    Due to the advantages that multicore architectures offer, chip vendors use CMP architectures in current high-performance, network, real-time and embedded systems. Several of these commercial processors have a level of the cache hierarchy shared by different cores. For example, the Sun UltraSPARC T2 has a 16-way, 4MB L2 cache shared by 8 cores, each up to 8-way SMT. Other processors, like the Intel Core 2 family, also share up to a 12MB, 24-way L2 cache. In contrast, the AMD K10 family has a private L2 cache per core and a shared L3 cache of up to 6MB and 64 ways.

    As the long-term trend of increasing integration continues, the number of cores per chip is also projected to increase with each successive technology generation. Significant studies have shown that processors with hundreds of cores per chip will appear in the market in the following years. The manycore era has already begun. Although this era provides many opportunities, it also presents many challenges. In particular, higher hardware resource sharing among concurrently executing threads can cause an individual thread's performance to become unpredictable and might lead to violations of the individual applications' performance requirements. Current resource management mechanisms and policies are no longer adequate for future multicore systems.

    Some applications present low re-use of their data and pollute caches with data streams, such as multimedia, communications or streaming applications, or have many compulsory misses that cannot be solved by assigning more cache space to the application. Traditional eviction policies such as Least Recently Used (LRU), pseudo-LRU or random are demand-driven; that is, they tend to give more space to the application that has more accesses to the cache hierarchy. When no direct control over shared resources is exercised (the last-level cache in this case), it is possible for a particular thread to allocate most of the shared resources, degrading other threads' performance. As a consequence, high resource sharing and resource utilization can cause systems to become unstable and violate individual applications' requirements. If we want to provide Quality of Service (QoS) to applications, we need to enhance the control over shared resources and enrich the collaboration between the OS and the architecture.

    In this thesis, we propose software and hardware mechanisms to improve cache sharing in CMP architectures. We use a holistic approach, coordinating targets of software and hardware to improve system aggregate performance and provide QoS to applications. We use explicit resource allocation techniques to control the shared cache in a CMP architecture, with resource allocation targets driven by hardware and software mechanisms. The main contributions of this thesis are the following:
    - We have characterized different single- and multithreaded applications and classified workloads with a systematic method to better understand and explain the cache-sharing effects on a CMP architecture. We have made a special effort in studying previous cache partitioning techniques for CMP architectures, in order to acquire the insight to propose improved mechanisms.
    - In CMP architectures with out-of-order processors, cache misses can be served in parallel and share the miss penalty to access main memory. We take this fact into account to propose new cache partitioning algorithms guided by the memory-level parallelism (MLP) of each application (a generic skeleton of this family of algorithms is sketched after this list). With these algorithms, the system performance is improved (in terms of throughput and fairness) without significantly increasing the hardware required by previous proposals.
    - Driving cache-partition decisions with indirect indicators of performance such as misses, MLP or data re-use may lead to suboptimal cache partitions. Ideally, the appropriate metric to drive cache partitions should be the target metric to optimize, which is normally related to IPC. Thus, we have developed a hardware mechanism, OPACU, which is able to obtain at run time accurate predictions of the performance of an application when running with different cache assignments.
    - Using performance predictions, we have introduced a new framework to manage shared caches in CMP architectures, FlexDCP, which allows the OS to optimize different IPC-related target metrics like throughput or fairness and provide QoS to applications. FlexDCP allows an enhanced coordination between the hardware and the software layers, which leads to improved system performance and flexibility.
    - Next, we have made use of performance estimations to reduce the load-imbalance problem in parallel applications. We have built a run-time mechanism that detects parallel applications sensitive to cache allocation and, in these situations, reduces the load imbalance by assigning more cache space to the slowest threads. This mechanism helps reduce the long optimization time, in terms of man-years of effort, devoted to large-scale parallel applications.
    - Finally, we have stated the main characteristics that future multicore processors with thousands of cores should have. An enhanced coordination between the software and hardware layers has been proposed to better manage the shared resources in these architectures.
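
    The partitioning algorithms discussed above share a common skeleton: each application exposes a utility curve (the benefit of each additional way, whether measured in misses, MLP-weighted misses, or predicted IPC), and ways are handed out greedily to the application with the highest marginal utility. A sketch of that skeleton (illustrative of the family, not the exact MLP-aware or OPACU algorithms; assumes nways <= MAX_WAYS):

        #define MAX_APPS 8
        #define MAX_WAYS 16

        /* utility[a][w]: marginal benefit of giving app a its (w+1)-th way. */
        void partition(double utility[MAX_APPS][MAX_WAYS],
                       int napps, int nways, int ways[MAX_APPS])
        {
            for (int a = 0; a < napps; a++) ways[a] = 0;
            for (int w = 0; w < nways; w++) {   /* hand out ways one by one */
                int best = -1;
                for (int a = 0; a < napps; a++) {
                    if (ways[a] == nways) continue;   /* cannot take more */
                    if (best < 0 ||
                        utility[a][ways[a]] > utility[best][ways[best]])
                        best = a;
                }
                ways[best]++;
            }
        }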

  • Runahead threads  Open access

     Ramirez Garcia, Tanausu
    Defense's date: 2010-04-15
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Research on multithreading topics has gained a lot of interest in the computer architecture community due to new commercial multithreaded and multicore processors. Simultaneous Multithreading (SMT) is one of these relatively new paradigms; it combines the multiple-instruction-issue features of superscalar processors with the ability of multithreaded architectures to exploit thread-level parallelism (TLP). The main feature of SMT processors is to execute multiple threads simultaneously, increasing the utilization of the pipeline by sharing many more resources than in other types of processors.

    Shared resources are the key to simultaneous multithreading; they are what makes the technique worthwhile. This feature also entails important challenges, because threads compete for those resources in the processor core. On the one hand, although certain types and mixes of applications truly benefit from SMT, the different features of threads can unbalance the resource allocation among threads, diminishing the benefit of multithreaded execution. On the other hand, the memory-wall problem is still present in these processors. SMT processors alleviate some of the latency problems caused by main memory's slowness relative to the CPU. Nevertheless, threads with high cache-miss rates that use large working sets are one of the major pitfalls of SMT processors. These memory-intensive threads tend to use processor and memory resources poorly, creating the highest resource-contention problems: they can clog up shared resources on long-latency memory operations without making progress, thereby hindering overall system performance.

    The main goal of this thesis is to alleviate these shortcomings in SMT scenarios. To accomplish this, the key contribution of this thesis is the application of the paradigm of runahead execution to the design of multithreaded processors through Runahead Threads (RaT). RaT proves to be a promising alternative to prior SMT resource management mechanisms, which usually restrict memory-bound threads in order to obtain higher throughput. The idea of RaT is to transform a memory-intensive thread into a light resource consumer by allowing that thread to progress speculatively. As soon as a thread undergoes a long-latency load, RaT transforms it into a runahead thread while that miss is outstanding. The benefits of this simple action are twofold. While being a runahead thread, the thread uses the different shared resources without monopolizing or limiting the resources available to the other threads. At the same time, this fast speculative thread issues prefetches that overlap other memory accesses with the main miss, thereby exploiting memory-level parallelism.

    Regarding implementation, RaT adds very little extra hardware cost and complexity to an existing SMT processor. Through a simple checkpoint mechanism and a little additional control logic, we can equip the hardware contexts with the runahead capability (a schematic of the mechanism follows this abstract). Therefore, by means of runahead threads, we simultaneously alleviate the two shortcomings in the context of SMT processors, improving performance. First, RaT alleviates the long-latency load problem on SMT processors by exposing memory-level parallelism (MLP): a thread prefetches data in parallel (if MLP is available), improving its individual performance rather than stalling on an L2 miss. Second, RaT prevents threads from clogging resources on long-latency loads: by the nature of runahead speculative execution, the L2-missing thread recycles the shared resources it uses faster, so memory-intensive threads no longer tie up important processor resources.

    The main limitation of RaT is that runahead threads can execute useless instructions and unnecessarily consume execution resources on the SMT processor when there is no prefetching to be exploited. This drawback results in inefficient runahead threads which do not contribute to the performance gain and which increase dynamic energy consumption due to the extra speculatively executed instructions. Therefore, we also propose different solutions aimed at this major disadvantage of the Runahead Threads mechanism. The result of this line of research is a set of complementary solutions to enhance RaT in terms of power consumption and energy efficiency.

    On the one hand, code-semantic-aware runahead threads improve the efficiency of RaT using coarse-grain code-semantic analysis at run time. We provide different techniques that analyze the usefulness of certain code patterns, namely loops and subroutines, during runahead thread execution. By means of the proposed coarse-grain analysis, runahead threads oversee the usefulness of loops or subroutines depending on the prefetch opportunities during their execution. Thus, runahead threads decide which of these particular program structures to execute depending on the obtained usefulness information, deciding either to stall or to skip the loop or subroutine execution to reduce the number of useless runahead instructions. Some of the proposed techniques reduce the speculative instruction count and wasted energy while achieving performance similar to RaT.

    On the other hand, the efficient runahead thread proposal is another contribution focused on improving RaT efficiency. This approach is based on a generic technique which covers all runahead thread executions, independently of the characteristics of the executed program (unlike the code-semantic-aware runahead threads). The key idea behind this new scheme is to find out 'when' and 'how long' a thread should be executed in runahead mode by predicting the useful runahead distance. The results show that the best of these approaches, based on runahead distance prediction, significantly reduces the number of extra speculative instructions executed in runahead threads, as well as the power consumption. Likewise, it maintains the performance benefits of the runahead threads, thereby improving the energy efficiency of SMT processors using the RaT mechanism.

    The evolution of Runahead Threads developed in this research provides not only high performance but also an efficient way of using shared resources in SMT processors in the presence of long-latency memory operations. As designers of future SMT systems will be increasingly required to optimize for a combination of single-thread performance, total throughput, and energy consumption, RaT-based mechanisms are promising options that provide a better performance and energy balance than previous proposals in the field.
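
    The per-context state machine behind runahead threads is compact. A schematic of that logic (our simplification of the classic runahead mechanism that RaT applies to SMT contexts; types and function names are illustrative stubs, not the thesis's design):

        typedef struct { unsigned long regs[32]; unsigned long pc; } checkpoint_t;
        void save_checkpoint(checkpoint_t *c);          /* stub: also saves  */
        void restore_checkpoint(const checkpoint_t *c); /* the rename state  */

        typedef enum { NORMAL, RUNAHEAD } run_mode_t;

        typedef struct {
            run_mode_t   mode;
            checkpoint_t ckpt;
        } hw_context_t;

        /* A long-latency (L2) miss flips the thread into runahead mode: it
         * keeps fetching and executing speculatively with the miss's
         * destination register poisoned; later misses become prefetches. */
        void on_l2_miss(hw_context_t *t)
        {
            if (t->mode == NORMAL) {
                save_checkpoint(&t->ckpt);
                t->mode = RUNAHEAD;
            }
        }

        /* When the original miss returns, speculative state is discarded
         * and execution restarts from the checkpoint, now hitting on the
         * data the runahead episode prefetched. */
        void on_miss_return(hw_context_t *t)
        {
            if (t->mode == RUNAHEAD) {
                restore_checkpoint(&t->ckpt);
                t->mode = NORMAL;
            }
        }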

  • Measuring operating system overhead on Sun UltraSparc T1 processor

     Radojkovic, Petar; Cakarevic, Vladimir; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Gioiosa, Roberto; Cazorla Almeida, Francisco Javier; Nemirovsky, Mario; Valero Cortes, Mateo
    International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems
    Presentation's date: 2009-07
    Presentation of work at congresses

  • ITCA: Inter-Task Conflict-Aware CPU accounting for CMPs

     Luque, Carlos; Moreto Planas, Miquel; Cazorla Almeida, Francisco Javier; Gioiosa, Roberto; Buyuktosunoglu, Alper; Valero Cortes, Mateo
    International Conference on Parallel Architectures and Compilation Techniques
    Presentation's date: 2009
    Presentation of work at congresses

  • Using Randomized Caches in Probabilistic Real-Time Systems

     Cazorla Almeida, Francisco Javier
    IEEE Real-Time Systems Symposium
    Presentation's date: 2009-07-01
    Presentation of work at congresses

  • Hardware Support for WCET Analysis of Hard Real-Time Multicore Systems

     Cazorla Almeida, Francisco Javier
    The 36th Annual International Symposium on Computer Architecture
    Presentation's date: 2009-06-20
    Presentation of work at congresses

  • Thread to core assignment in SMT on-chip multiprocessors  Open access

     Acosta Ojeda, Carmelo Alexis; Cazorla Almeida, Francisco Javier; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    Computer architecture news
    Date of publication: 2009
    Journal article

    State-of-the-art high-performance processors like the IBM POWER5 and Intel i7 show a trend in industry towards on-chip Multiprocessors (CMP) involving Simultaneous Multithreading (SMT) in each core. In these processors, the way in which applications are assigned to cores plays a key role in the performance of each application and the overall system performance. In this paper we show that the system throughput highly depends on the Thread to Core Assignment (TCA), regardless of the SMT Instruction Fetch (IFetch) policy implemented in the cores. Our results indicate that a good TCA can improve the results of any underlying IFetch policy, yielding speedups of up to 28%. Given the relevance of TCA, we propose an algorithm to manage it in CMP+SMT processors. The proposed throughput-oriented TCA algorithm takes into account the workload characteristics and the underlying SMT IFetch policy. Our results show that the TCA algorithm obtains thread-to-core assignments within 3% of the optimal assignment for each case, yielding system throughput improvements of up to 21%.

  • FlexDCP: a QoS framework for CMP architectures

     Moreto Planas, Miquel; Cazorla Almeida, Francisco Javier; Ramirez Bellido, Alejandro; Sakellariou, Rizos; Valero Cortes, Mateo
    Operating systems review
    Date of publication: 2009-06
    Journal article

  • Heterogeneity-awareness in multithreaded multicore processors  Open access

     Acosta Ojeda, Carmelo Alexis
    Defense's date: 2009-07-07
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    During the last decades, Computer Architecture has experienced a great series of revolutionary changes. The increasing transistor count on a single chip has led to some of the main milestones in the field, from the release of the first superscalar (1965) to the state-of-the-art Multithreaded Multicore Architectures, like the Intel Core i7 (2009). Moore's Law has continued for almost half a century and is not expected to stop for at least another decade, and perhaps much longer. Moore observed a trend in process technology advances: the number of transistors that can be placed inexpensively on an integrated circuit has increased exponentially, doubling approximately every two years. Nevertheless, having more available transistors cannot always be directly translated into having more performance.

    The complexity of state-of-the-art software has reached heights unthinkable in prior ages, both in terms of the amount of computation and the complexity involved. If we deeply analyze this complexity in software, we realize that software is comprised of smaller execution processes that, although maintaining certain spatial/temporal locality, imply an inherently heterogeneous behavior. That is, during execution the hardware runs very different portions of software, with huge differences in terms of behavior and hardware requirements. This heterogeneity in the behavior of software is not specific to the latest videogame, but inherent to software programming itself, since the very beginning of algorithmics.

    In this PhD dissertation we deeply analyze the inherent heterogeneity present in software behavior. We identify the main issues and sources of this heterogeneity, which prevent most state-of-the-art processor designs from reaching their maximum potential. Hence, the heterogeneity in software renders most current processors, commonly called general-purpose processors, overdesigned: they have many more hardware resources than really needed to execute the software running on them. This fact would not represent a major problem if we were not concerned about the additional power consumption involved in software computation.

    The final goal of this PhD dissertation is to assign each portion of software exactly the amount of hardware resources really needed to fully exploit its maximal potential, without consuming more energy than strictly needed; that is, obtaining complexity-effective executions using the inherent heterogeneity in software behavior as a steering indicator. Thus, we start by deeply analyzing the heterogeneous behavior of the software run on top of general-purpose processors, and then match it onto heterogeneously distributed hardware, which explicitly exploits heterogeneous hardware requirements. Only by being heterogeneity-aware in software, and appropriately matching this software heterogeneity to hardware heterogeneity, may we effectively obtain better processor designs.

    The PhD dissertation is comprised of four main contributions that cover both multithreaded single-core (hdSMT) and multicore (TCA Algorithm, hTCA Framework and MFLUSH) scenarios, explained in depth in their corresponding chapters of the dissertation. Overall, these contributions cover a significant range of the design space of heterogeneity-aware processors. Within this design space, we have focused on the state-of-the-art trend in processor design: Multithreaded Multicore (CMP+SMT) processors.

    We place special emphasis on the MPsim simulation tool, specifically designed and developed for this PhD dissertation. This tool has already gone beyond this PhD dissertation, becoming a reference tool for an important group of researchers across the Computer Architecture Department (DAC) at the Polytechnic University of Catalonia (UPC), the Barcelona Supercomputing Center (BSC) and the University of Las Palmas de Gran Canaria (ULPGC).

  • Hardware Support for WCET Analysis of Hard Real-Time Multicore Systems

     Paolieri, Marco; Quiñones Moreno, Eduardo; Cazorla Almeida, Francisco Javier; Bernat, Guillem; Valero Cortes, Mateo
    The 36th Annual International Symposium on Computer Architecture
    Presentation of work at congresses

  • Architecture performance prediction using evolutionary artificial neural networks

     Castillo, Pedro Angel; Mora, Antonio; Merelo, Juan Julián; Laredo, Juan Luís; Moreto Planas, Miquel; Cazorla Almeida, Francisco Javier; Valero Cortes, Mateo; McKee, Sally
    European Workshop on Hardware Optimization Techniques
    Presentation's date: 2008
    Presentation of work at congresses

  • Understanding the overhead of the spin-lock loop in CMT architectures  Open access

     Cakarevic, Vladimir; Radojkovic, Petar; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Gioiosa, Roberto; Cazorla Almeida, Francisco Javier; Nemirovsky, Mario; Valero Cortes, Mateo
    Workshop on the Interaction between Operating Systems and Computer Architecture
    Presentation's date: 2008-06
    Presentation of work at congresses

    Spin locks are a synchronization mechanism used to provide mutual exclusion to shared software resources. Spin locks are preferred over other synchronization mechanisms in several situations, such as when the average waiting time to obtain the lock is short, in which case the probability of getting the lock is high, or when it is not possible to use other synchronization mechanisms. In this paper, we study the effect that the execution of the Linux spin-lock loop on the Sun UltraSPARC T1 and T2 processors introduces on other running tasks, especially in the worst-case scenario where the workload shows high contention on a lock. For this purpose, we create a task that continuously executes the spin-lock loop and run several instances of this task together with other active tasks. Our results show that, when the spin-lock tasks run with other applications in the same core of a T1 or a T2 processor, they introduce a significant overhead on the other applications: 31% in T1 and 42% in T2, on average. For both processors, we identify the fetch bandwidth as the main source of interaction between active threads and the spin-lock threads. We propose four different variants of the Linux spin-lock loop that require less fetch bandwidth. Our proposal reduces the overhead of the spin-lock tasks on the other applications down to 3.5% and 1.5% on average in T1 and T2, respectively. This is a reduction of 28 percentage points with respect to the Linux spin-lock loop for T1; for T2 the reduction is about 40 percentage points.
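
    The fetch-bandwidth issue can be made concrete: a naive spin loop occupies a fetch slot nearly every cycle, while a loop whose body contains a long-latency instruction fetches far less often while waiting. A sketch of the contrast (our illustration in C; the paper evaluates concrete UltraSPARC T1/T2 loop variants, not this code):

        typedef volatile unsigned lock_t;

        volatile unsigned spin_sink;   /* keeps the divide alive */

        /* Naive: a tight load-test loop that hogs fetch bandwidth. */
        void spin_naive(lock_t *l)
        {
            while (*l)
                ;
        }

        /* Throttled: a long-latency integer divide per iteration keeps
         * the waiting thread mostly out of the fetch stage. a and b are
         * runtime values so the divide is not optimized away. */
        void spin_throttled(lock_t *l, unsigned a, unsigned b)
        {
            while (*l)
                spin_sink = a / b;
        }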

  • Evolutionary system for prediction and optimization of hardware architecture performance

     Castillo, Pedro Angel; Merelo, Juan Julián; Moreto Planas, Miquel; Cazorla Almeida, Francisco Javier; Valero Cortes, Mateo; Mora, Antonio; Laredo, Juan Luis; McKee, Sally
    IEEE Congress on Evolutionary Computation
    Presentation's date: 2008-06-01
    Presentation of work at congresses

  • Balancing HPC Applications Through Smart Allocation of Resources in MT Processors

     Boneti, Carlos; Gioiosa, Roberto; Cazorla Almeida, Francisco Javier; Corbalan Gonzalez, Julita; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    IEEE International Parallel and Distributed Processing Symposium
    Presentation of work at congresses

  • Overhead of the spin-lock loop in UltraSPARC T2  Open access

     Cakarevic, Vladimir; Radojkovic, Petar; Cazorla Almeida, Francisco Javier; Gioiosa, Roberto; Nemirovsky, Mario; Valero Cortes, Mateo; Pajuelo González, Manuel Alejandro; Verdu Mula, Javier
    HiPEAC Industrial Workshop
    Presentation's date: 2008-06-04
    Presentation of work at congresses

    Spin locks are a task synchronization mechanism used to provide mutual exclusion on shared software resources. Spin locks perform well compared with other synchronization mechanisms in several situations, e.g., when tasks wait a short time on average to obtain the lock, when the probability of getting the lock is high, or when no other synchronization mechanism is available. In this paper we study the effect that the execution of spin locks creates in multithreaded processors. Besides the move to multicore architectures, recent industry trends show a big move toward hardware multithreaded processors: the Intel P4, IBM POWER5 and POWER6, and Sun UltraSPARC T1 and T2 all implement multithreading to various degrees. Sharing more processor resources increases the system's performance, but at the same time it increases the impact that simultaneously executing processes have on each other.

  • Measuring operating system overhead on CMT processors  Open access

     Radojkovic, Petar; Cakarevic, Vladimir; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Gioiosa, Roberto; Cazorla Almeida, Francisco Javier; Nemirovsky, Mario; Valero Cortes, Mateo
    Symposium on Computer Architecture and High Performance Computing
    Presentation's date: 2008-10-29
    Presentation of work at congresses

    Numerous studies have shown that Operating System (OS) noise is one of the reasons for significant performance degradation in clustered architectures. Although many studies examine OS noise for High Performance Computing (HPC), especially in multi-processor/core systems, most of them focus on 2- or 4-core systems. In this paper, we analyze the major sources of OS noise on a massively multithreaded processor, the Sun UltraSPARC T1, running Linux and Solaris. Since a real system is too complex to analyze, we compare those results against a low-overhead runtime environment: the Netra Data Plane Software Suite (Netra DPS). Our results show that the overhead introduced by the OS timer interrupt in Linux and Solaris depends on the particular core and hardware context in which the application is running. This overhead is up to 30% when the application executes on the same hardware context as the timer interrupt handler, and up to 10% when the application and the timer interrupt handler run on different contexts of the same core. We detect no overhead when the benchmark and the timer interrupt handler run on different cores of the processor.
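
    As a hedged illustration of the measurement idea only (not the paper's harness, and without the Netra DPS baseline), the Linux-specific sketch below pins a fixed unit of work to one hardware context and records the worst-case iteration time; timer-interrupt noise shows up as outliers on the context that services the interrupt. The CPU number, work size and iteration count are arbitrary assumptions.

      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>
      #include <time.h>

      static double elapsed_ns(struct timespec a, struct timespec b)
      {
          return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
      }

      int main(void)
      {
          int cpu = 0;                        /* hardware context under test */
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(cpu, &set);
          sched_setaffinity(0, sizeof set, &set);  /* pin to that context */

          double worst = 0.0;
          for (int it = 0; it < 100000; it++) {
              struct timespec t0, t1;
              clock_gettime(CLOCK_MONOTONIC, &t0);
              volatile unsigned x = 0;
              for (int i = 0; i < 10000; i++)      /* fixed unit of work */
                  x += i;
              clock_gettime(CLOCK_MONOTONIC, &t1);
              double ns = elapsed_ns(t0, t1);
              if (ns > worst)
                  worst = ns;             /* OS noise appears as outliers */
          }
          printf("cpu %d: worst-case iteration %.0f ns\n", cpu, worst);
          return 0;
      }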

  • Selection of the register file size and the resource policy on SMT processors  Open access

     Alastruey, Jesús; Monreal Arnal, Teresa; Cazorla Almeida, Francisco Javier; Viñals Yufera, Victor; Valero Cortes, Mateo
    International Symposium on Computer Architecture and High Performance Computing
    Presentation's date: 2008-11-30
    Presentation of work at congresses

    The performance impact of the Physical Register File (PRF) size on Simultaneous Multithreading processors has not been extensively studied, in spite of the PRF being a critical shared resource. In this paper we analyze the effect of the PRF size on performance for a broad set of resource allocation policies (Icount, Stall, Flush, Flush++, Static, Dcra and Hill-climbing) and evaluate them under two metrics: instructions per second (IPS) for throughput and the harmonic mean of weighted IPCs (Hmean-wIPC) for fairness. We have found that the resource allocation policy and the PRF size should be considered together in order to obtain the best score in the proposed metrics. For instance, for the analyzed 2- and 4-threaded SPEC CPU2000 workloads, small PRFs are best managed by Flush, whereas for larger PRFs, Hill-climbing and Static lead to the best values for the throughput and fairness metrics, respectively. The second contribution of this work is a simple procedure that, for a given resource allocation policy, selects the PRF size that maximizes IPS and obtains a Hmean-wIPC value close to its maximum. According to our results, Hill-climbing with a 320-entry PRF achieves the best figures for 2-threaded workloads. When executing 4-threaded workloads, Hill-climbing with a 384-entry PRF achieves the best throughput, whereas Static obtains the best throughput-fairness balance.
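
    Once the (policy, PRF size) design points have been measured, the selection procedure reduces to an argmax over throughput. The C sketch below shows only that final step for one policy; the PRF sizes mirror those named in the abstract, but the IPS figures are made-up placeholders, not results from the paper.

      #include <stdio.h>

      int main(void)
      {
          /* Candidate PRF sizes for one allocation policy; the IPS values
           * are hypothetical placeholders standing in for simulation
           * results. */
          int    prf_sizes[] = { 192, 256, 320, 384 };
          double ips[]       = { 1.10, 1.25, 1.31, 1.29 };
          int n = sizeof prf_sizes / sizeof prf_sizes[0];

          int best = 0;
          for (int i = 1; i < n; i++)
              if (ips[i] > ips[best])
                  best = i;              /* argmax over measured IPS */

          printf("selected PRF size: %d entries (IPS %.2f)\n",
                 prf_sizes[best], ips[best]);
          return 0;
      }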

  • A Two-Level Load/Store Queue Based on Execution Locality

     Pericas, Miquel; Cristal Kestelman, Adrian; Cazorla Almeida, Francisco Javier; González Garcia, Ruben; Veidenbaum, Alex; Jimenez, Daniel A; Valero Cortes, Mateo
    The 35th Annual International Symposium on Computer Architecture
    Presentation of work at congresses

  • Software-Controlled Priority Characterization of POWER5 Processor

     Boneti, Carlos; Cazorla Almeida, Francisco Javier; Gioiosa, Roberto; Buyuktosunoglu, Alper; Cher, Chen-Yong; Valero Cortes, Mateo
    The 35th Annual International Symposium on Computer Architecture
    Presentation of work at congresses

  • Balancing HPC Applications Through Smart Allocation of Resources in MT Processors

     Cazorla Almeida, Francisco Javier
    IEEE International Parallel and Distributed Processing Symposium
    Presentation's date: 2008-04-14
    Presentation of work at congresses

  • Core to Memory Interconnection Implications for Forthcoming On-Chip Multiprocessors

     Acosta Ojeda, Carmelo Alexis; Cazorla Almeida, Francisco Javier; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    IEEE/ACM International Symposium on Microarchitecture
    Presentation's date: 2007-12-01
    Presentation of work at congresses

  • A New Proposal to Evaluate Multithreaded Processors

     Cazorla Almeida, Francisco Javier; Pajuelo González, Manuel Alejandro; Santana, Oliverio J.; Fernandez Garcia, Enrique; Valero Cortes, Mateo
    XVIII Jornadas de Paralelismo. CEDI 2007 II Congreso Español de Informática.
    Presentation of work at congresses

  • Computación de Altas Prestaciones V: Arquitecturas, Compiladores, Sistemas Operativos, Herramientas y Aplicaciones

     Ramirez Bellido, Alejandro; Valero Cortes, Mateo; Moreto Planas, Miquel; Cazorla Almeida, Francisco Javier; Abella Ferrer, Jaume; Figueiredo Boneti, Carlos Santieri; Gioiosa, Roberto; Pajuelo González, Manuel Alejandro; Quiñones Moreno, Eduardo; Verdu Mula, Javier; Guitart Fernández, Jordi; Fernandez Jimenez, Agustin; Garcia Almiñana, Jordi; Utrera Iglesias, Gladys Miriam
    Participation in a competitive project

  • MLP-Aware Dynamic Cache Partitioning

     Moreto Planas, Miquel; Cazorla Almeida, Francisco Javier; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    16th International Conference on Parallel Architectures and Compilation Techniques (PACT'07)
    Presentation of work at congresses

  • Evaluating Multithreaded Architectures on Simulation Environments

     Vera Gomez, Javier; Cazorla Almeida, Francisco Javier; Pajuelo González, Manuel Alejandro; Santana, Oliverio J.; Fernández, Enrique; Valero Cortes, Mateo
    Date: 2007-04
    Report

  • A Reconfigurable Heterogeneous Multi-Core Architecture

     Pericas, Miquel; Cristal Kestelman, Adrián; Cazorla Almeida, Francisco Javier; González Garcia, Ruben; Valero Cortes, Mateo
    Date: 2007-02
    Report

  • IPC-Aware Dynamic Cache Partitioning for CMP processors

     Moreto Planas, Miquel; Cazorla Almeida, Francisco Javier; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    Date: 2007-09
    Report

  • Measuring the Performance of Multithreaded Architectures

     Cazorla Almeida, Francisco Javier; Pajuelo González, Manuel Alejandro; Santana, Oliverio J.; Fernandez Garcia, Enrique; Valero Cortes, Mateo
    2007 SPEC Benchmark Workshop
    Presentation of work at congresses

  • FAME: Evaluating multithreaded architectures

     Vera Rivera, Francisco Javier; Cazorla Almeida, Francisco Javier; Pajuelo González, Manuel Alejandro; Santana, Oliverio J.; Fernandez Garcia, Enrique; Valero Cortes, Mateo
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    Presentation of work at congresses

  • A Flexible Heterogeneous Multi-Core Architecture

     Pericas, Miquel; Cristal Kestelman, Adrián; Cazorla Almeida, Francisco Javier; González Garcia, Ruben; Valero Cortes, Mateo
    16th International Conference on Parallel Architectures and Compilation Techniques (PACT'07)
    Presentation of work at congresses

  • FAME: Fairly Measuring Multithreaded Architectures

     Cazorla Almeida, Francisco Javier; Pajuelo González, Manuel Alejandro; Santana, Oliverio J.; Fernandez Garcia, Enrique; Valero Cortes, Mateo
    16th International Conference on Parallel Architectures and Compilation Techniques (PACT'07)
    Presentation of work at congresses

  • Online Prediction of Throughput for Different Cache Sizes

     Moreto Planas, Miquel; Cazorla Almeida, Francisco Javier; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    XVIII Jornadas de Paralelismo. CEDI 2007 II Congreso Español de Informática.
    Presentation of work at congresses

  • Predictable Performance in SMT processors: Synergy Between the OS and SMT

     Cazorla Almeida, Francisco Javier; Knijnenburg, Peter; Sakellariou, Rizos; Fernandez, Enrique; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    IEEE transactions on computers
    Date of publication: 2006-07
    Journal article

  • Improving EDF for SMT processors

     Boneti, Carlos; Cazorla Almeida, Francisco Javier; Valero Cortes, Mateo
    Jornadas de Paralelismo
    Presentation of work at congresses

  • Improving EDF for SMT processors

     Cazorla Almeida, Francisco Javier
    Jornadas de Paralelismo
    Presentation's date: 2006-09-18
    Presentation of work at congresses
