Scientific and technological production

1 to 50 of 71 results
  • La influencia del orden de las preguntas en los exámenes de primer curso

     Lopez Alvarez, David; Cortes Martinez, Jordi; Fernandez Barta, Montserrat; Parcerisa Bundó, Joan Manuel; Tous Liesa, Ruben; Tubella Murgadas, Jordi
    Jornadas de Enseñanza Universitaria de la Informática
    Presentation's date: 2013-07-10
    Presentation of work at congresses

    The order of the questions in an exam should not influence its results. However, the authors have the impression that first-year students tend to work through their exams sequentially. Do they really? And if they do, does this way of answering exams affect the final results? In this paper we analyse these questions with an experiment carried out in Estructura de Computadores, a first-year course of the degree in Informatics Engineering.

  • TEAPOT: a toolset for evaluating performance, power and image quality on mobile graphics systems  Open access

     Arnau Montañes, Jose Maria; Parcerisa Bundó, Joan Manuel; Xekalakis, Polychronis
    ACM/IEEE International Conference on Supercomputing
    Presentation's date: 2013-06
    Presentation of work at congresses

    In this paper we present TEAPOT, a full system GPU simulator, whose goal is to allow the evaluation of the GPUs that reside in mobile phones and tablets. To this extent, it has a cycle accurate GPU model for evaluating performance, power models for the GPU, the memory subsystem and for OLED screens, and image quality metrics. Unlike prior GPU simulators, TEAPOT supports the OpenGL ES 1.1/2.0 API, so that it can simulate all commercial graphical applications available for Android systems. To illustrate potential uses of this simulating infrastructure, we perform two case studies. We first turn our attention to evaluating the impact of the OS when simulating graphical applications. We show that the overall GPU power/performance is greatly affected by common OS tasks, such as image composition, and argue that application level simulation is not sufficient to understand the overall GPU behavior. We then utilize the capabilities of TEAPOT to perform studies that trade image quality for energy. We demonstrate that by allowing for small distortions in the overall image quality, a significant amount of energy can be saved.

    Postprint (author’s final draft)

  • Parallel frame rendering: trading responsiveness for energy on a mobile GPU

     Arnau Montañes, Jose Maria; Parcerisa Bundó, Joan Manuel; Xekalakis, Polychronis
    International Conference on Parallel Architectures and Compilation Techniques
    Presentation's date: 2013-09-07
    Presentation of work at congresses

    Perhaps one of the most important design aspects for smartphones and tablets is improving their energy efficiency. Unfortunately, rich media content applications typically put significant pressure on the GPU's memory subsystem. In this paper we propose a novel means of dramatically improving the energy efficiency of these devices for this popular type of applications. The main hurdle in doing so is that GPUs require a significant amount of memory bandwidth in order to fetch all the necessary textures from memory. Although consecutive frames tend to operate on the same textures, their re-use distances are so big that, to the caches, fetching textures appears to be a streaming operation. Traditional designs improve the degree of multi-threading and the memory bandwidth as a means of improving performance. In order to meet the energy efficiency standards required by the mobile market, we need a different approach. We thus propose a technique which we term Parallel Frame Rendering (PFR). Under PFR, we split the GPU into two clusters where two consecutive frames are rendered in parallel. PFR exploits the high degree of similarity between consecutive frames to save memory bandwidth by improving texture locality. Since the physics part of the rendering has to be computed sequentially for two consecutive frames, this naturally leads to an increase in the input delay latency for PFR compared with traditional systems. However, we argue that this is rarely an issue, as the user interface in these devices is much slower than that of desktop systems. Moreover, we show that we can design reactive forms of PFR that allow us to bound the lag observed by the end user, thus maintaining the highest user experience when necessary. Overall, we show that PFR can achieve 28% memory bandwidth savings with only a minimal loss in system responsiveness. © 2013 IEEE.
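
    As an illustration only (not part of the record), the following Python toy models the texture-locality argument above with an invented LRU texture cache and invented frame sizes: interleaving the texture accesses of two consecutive frames turns most of the second frame's fetches into cache hits.

        # Toy model (invented numbers) of the texture-locality argument behind
        # Parallel Frame Rendering: interleaving the texture accesses of two
        # consecutive frames turns the second frame's fetches into cache hits.
        from collections import OrderedDict

        CACHE_BLOCKS = 64          # toy texture-cache capacity, in blocks
        BLOCKS_PER_FRAME = 256     # texture blocks touched while rendering one frame
        FRAMES = 8                 # must be even for the pairwise PFR schedule below

        def frame_blocks(frame_id):
            # Consecutive frames reuse almost the same textures: frame i touches
            # blocks i .. i+255, so two consecutive frames share all but one block.
            return range(frame_id, frame_id + BLOCKS_PER_FRAME)

        def memory_fetches(access_stream):
            # LRU cache of texture blocks; every miss counts as one memory fetch.
            cache, fetches = OrderedDict(), 0
            for block in access_stream:
                if block in cache:
                    cache.move_to_end(block)
                else:
                    fetches += 1
                    cache[block] = True
                    if len(cache) > CACHE_BLOCKS:
                        cache.popitem(last=False)
            return fetches

        # Conventional rendering: one frame after another. The inter-frame reuse
        # distance (256 blocks) far exceeds the cache, so textures stream in again
        # for every frame.
        sequential = [b for f in range(FRAMES) for b in frame_blocks(f)]

        # PFR schedule: two consecutive frames rendered in parallel, so accesses to
        # the same texture block arrive back to back and the second one hits.
        parallel = [b for f in range(0, FRAMES, 2)
                      for pair in zip(frame_blocks(f), frame_blocks(f + 1))
                      for b in pair]

        print("fetches, conventional:", memory_fetches(sequential))  # ~ FRAMES * 256
        print("fetches, PFR schedule:", memory_fetches(parallel))    # ~ half of that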

  • Design of a Distributed Memory Unit for Clustered Microarchitectures  Open access

     Bieschewski, Stefan
    Defense's date: 2013-06-20
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Power constraints led to the end of exponential growth in single-processor performance, which characterized the semiconductor industry for many years. Single-chip multiprocessors have allowed the performance growth to continue so far. Yet, Amdahl's law asserts that the overall performance of future single-chip multiprocessors will depend crucially on single-processor performance. In a multiprocessor, a small growth in single-processor performance can justify the use of significant resources. Partitioning the layout of critical components can improve the energy efficiency and ultimately the performance of a single processor. In a clustered microarchitecture, parts of these components form clusters. Instructions are processed locally in the clusters and benefit from the smaller size and complexity of the clusters' components. Because the clusters together process a single instruction stream, communications between clusters are necessary and introduce an additional cost. This thesis proposes the design of a distributed memory unit and first-level cache in the context of a clustered microarchitecture. While the partitioning of other parts of the microarchitecture has been well studied, the distribution of the memory unit and the cache has received comparatively little attention. The first proposal consists of a set of cache bank predictors. Eight different predictor designs are compared based on cost and accuracy. The second proposal is the distributed memory unit. The load and store queues are split into smaller queues for distributed disambiguation. The mapping of memory instructions to cache banks is delayed until addresses have been calculated. We show how disambiguation can be implemented efficiently with unordered queues. A bank predictor is used to map instructions that consume memory data near the data origin. We show that this organization significantly reduces both energy usage and latency. The third proposal introduces Dispatch Throttling and Pre-Access Queues. These mechanisms avoid load/store queue overflows that are a result of the late allocation of entries. The fourth proposal introduces Memory Issue Queues, which add functionality to select instructions for execution and re-execution to the memory unit. The fifth proposal introduces Conservative Deadlock Aware Entry Allocation. This mechanism is a deadlock-safe issue policy for the Memory Issue Queues. Deadlocks can result from certain queue allocations because entries are allocated out-of-order instead of in-order as in traditional architectures. The sixth proposal is the Early Release of Load Queue Entries. Architectures with weak memory ordering such as Alpha, PowerPC or ARMv7 can take advantage of this mechanism to release load queue entries before the commit stage. Together, these proposals allow significantly smaller and more energy-efficient load queues without the need for energy-hungry recovery mechanisms and without performance penalties. Finally, we present a detailed study that compares the proposed distributed memory unit to a centralized memory unit and confirms its advantages of reduced energy usage and improved performance.
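
    For illustration only: a cache bank predictor of the kind compared in the thesis can be as simple as a PC-indexed table that remembers the last bank each load touched. The sketch below (table size, indexing and bank mapping are assumptions, not any of the eight designs studied) shows the predict-at-dispatch / update-after-address-calculation flow.

        # Minimal sketch of a PC-indexed "last bank" cache bank predictor.
        # Table size, index function and bank mapping are assumptions for the
        # example; the eight designs compared in the thesis are not reproduced.
        class LastBankPredictor:
            def __init__(self, entries=1024, banks=4):
                self.banks = banks
                self.mask = entries - 1            # entries must be a power of two
                self.table = [0] * entries         # last bank observed per index

            def _index(self, pc):
                return (pc >> 2) & self.mask       # drop instruction-alignment bits

            def predict(self, pc):
                # Consulted early (before the effective address is known), so the
                # load can be steered towards the cluster that owns the bank.
                return self.table[self._index(pc)]

            def update(self, pc, address, line_bytes=64):
                # After address calculation, record the bank it really mapped to.
                actual_bank = (address // line_bytes) % self.banks
                self.table[self._index(pc)] = actual_bank
                return actual_bank

        pred = LastBankPredictor()
        pred.update(pc=0x400100, address=0x8000)   # this load hit bank 0
        print(pred.predict(0x400100))              # next occurrence predicted in bank 0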

  • Boosting mobile GPU performance with a decoupled access/execute fragment processor

     Parcerisa Bundó, Joan Manuel; Xekalakis, Polychronis; Arnau Montañes, Jose Maria
    International Symposium on Computer Architecture
    Presentation's date: 2012-06-11
    Presentation of work at congresses

  • Architecture Support for Intrusion Detection Systems  Open access

     Sreekar Shenoy, Govind
    Defense's date: 2012-10-30
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    System security is a prerequisite for efficient day-to-day transactions. As a consequence, Intrusion Detection Systems (IDS) are commonly used to provide an effective security ring to systems in a network. An IDS operates by inspecting packets flowing in the network for malicious content. To do so, an IDS like Snort[49] compares bytes in a packet with a database of prior reported attacks. This functionality can also be viewed as string matching of the packet bytes with the attack string database. Snort commonly uses the Aho-Corasick algorithm[2] to detect attacks in a packet. The Aho-Corasick algorithm works by first constructing a Finite State Machine (FSM) using the attack string database. Later the FSM is traversed with the packet bytes. The main advantage of this algorithm is that it provides a linear-time search irrespective of the number of strings in the database. The issue, however, lies in devising a practical implementation. The FSM thus constructed gets very bloated in terms of storage size, and so is area inefficient. This also affects its performance efficiency as the memory footprint also grows. Another issue is the limited scope for exploiting any parallelism due to the inherently sequential nature of an FSM traversal. This thesis explores hardware and software techniques to accelerate attack detection using the Aho-Corasick algorithm. In the first part of this thesis, we investigate techniques to improve the area and performance efficiency of an IDS. Notable among our contributions is a pipelined architecture that accelerates accesses to the most frequently accessed node in the FSM. The second part of this thesis studies the resilience of an IDS to evasion attempts. In an evasion attempt, an adversary saturates the performance of an IDS to disable it, and thereby gain access to the network. We explore an evasion attempt that significantly degrades the performance of the Aho-Corasick algorithm used in an IDS. As a countermeasure, we propose a parallel architecture that improves the resilience of an IDS to an evasion attempt. The final part of this thesis explores techniques to exploit the network traffic characteristics. In our study, we observe significant redundancy in the payload bytes. So we propose a mechanism to leverage this redundancy in the FSM traversal of the Aho-Corasick algorithm. We have also implemented our proposed redundancy-aware FSM traversal in Snort.
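
    The Aho-Corasick algorithm referred to above is standard; as a point of reference (this is the textbook software algorithm, not the hardware proposed in the thesis), a compact Python version builds the FSM from the attack strings and then scans the payload in a single linear pass:

        # Compact reference version of Aho-Corasick: build the FSM (trie plus
        # failure links) from the attack-string set, then scan the payload bytes
        # in one linear pass.
        from collections import deque

        def build_fsm(patterns):
            goto, fail, out = [{}], [0], [set()]
            for pat in patterns:                        # 1) trie of all patterns
                state = 0
                for ch in pat:
                    if ch not in goto[state]:
                        goto.append({})
                        fail.append(0)
                        out.append(set())
                        goto[state][ch] = len(goto) - 1
                    state = goto[state][ch]
                out[state].add(pat)
            queue = deque(goto[0].values())             # 2) failure links via BFS
            while queue:
                state = queue.popleft()
                for ch, nxt in goto[state].items():
                    queue.append(nxt)
                    f = fail[state]
                    while f and ch not in goto[f]:
                        f = fail[f]
                    fail[nxt] = goto[f].get(ch, 0)
                    out[nxt] |= out[fail[nxt]]
            return goto, fail, out

        def scan(payload, fsm):
            goto, fail, out = fsm
            state, matches = 0, []
            for pos, ch in enumerate(payload):          # 3) single pass, linear time
                while state and ch not in goto[state]:
                    state = fail[state]
                state = goto[state].get(ch, 0)
                for pat in out[state]:
                    matches.append((pos - len(pat) + 1, pat))
            return sorted(matches)

        fsm = build_fsm(["he", "she", "his", "hers"])
        print(scan("ushers", fsm))    # [(1, 'she'), (2, 'he'), (2, 'hers')]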

  • Architectural Support for High-Performing Hardware Transactional Memory Systems  Open access

     Lupon Navazo, Marc
    Defense's date: 2011-12-23
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Parallel programming presents an efficient solution to exploit future multicore processors. Unfortunately, traditional programming models depend on the programmer's skills for synchronizing concurrent threads, which makes the development of parallel software a hard and error-prone task. In addition to this, current synchronization techniques serialize the execution of those critical sections that conflict in shared memory and thus limit the scalability of multithreaded applications. Transactional Memory (TM) has emerged as a promising programming model that solves the trade-off between high performance and ease of use. In TM, the system is in charge of scheduling transactions (atomic blocks of instructions) and guaranteeing that they are executed in isolation, which simplifies writing parallel code and, at the same time, enables high concurrency when atomic regions access different data. Among all forms of TM environments, Hardware TM (HTM) is the only one that offers fast execution at the cost of adding dedicated logic in the processor. Existing HTM systems suffer considerable delays when they execute complex transactional workloads, especially when they deal with large and contending transactions, because they lack adaptability. Furthermore, most HTM implementations are ad hoc and require cumbersome hardware structures to be effective, which complicates the feasibility of the design. This thesis makes several contributions in the design and analysis of low-cost HTM systems that yield good performance for any kind of TM program. Our first contribution, FASTM, introduces a novel mechanism to elegantly manage speculative (and already validated) versions of transactional data by slightly modifying the on-chip memory engine. This approach permits fast recovery when a transaction that fits in private caches is discarded. At the same time, it keeps non-speculative values in software, which allows in-place memory updates. Thus, FASTM is neither hurt by capacity issues nor slowed down when it has to undo transactional modifications. Our second contribution includes two different HTM systems that integrate deferred resolution of conflicts in a conventional multicore processor, which reduces the complexity of the system with respect to previous proposals. The first one, FUSETM, combines different-mode transactions under a unified infrastructure to gracefully handle resource overflow. As a result, FUSETM brings fast transactional computation without requiring additional hardware or extra communication at the end of speculative execution. The second one, SPECTM, introduces a two-level data versioning mechanism to resolve conflicts in a speculative fashion even in the case of overflow. Our third and last contribution presents a couple of truly flexible HTM systems that can dynamically adapt their underlying mechanisms according to the characteristics of the program. DYNTM records statistics of previously executed transactions to select the best-suited strategy each time a new instance of a transaction starts. SWAPTM takes a different approach: it tracks information of the current transactional instance to change its priority level at runtime. Both alternatives obtain great performance improvements over existing proposals that employ fixed transactional policies, especially in applications with phase changes.
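
    As background only, the transactional programming model described above can be sketched in a few lines of Python: an atomic block buffers its writes, validates its reads at commit, and transparently retries on conflict. This toy uses lazy versioning and a single commit lock; it is not a model of FASTM, FUSETM, SPECTM, DYNTM or SWAPTM.

        # Toy software emulation of the transactional programming model: an atomic
        # block buffers its writes, validates its reads at commit time, and retries
        # transparently on conflict. Illustration only.
        import threading

        class TVar:
            def __init__(self, value):
                self.value, self.version = value, 0

        class Conflict(Exception):
            pass

        _commit_lock = threading.Lock()

        class Txn:
            def __init__(self):
                self.reads, self.writes = {}, {}       # TVar -> version / buffered value

            def read(self, var):
                if var in self.writes:
                    return self.writes[var]
                self.reads.setdefault(var, var.version)
                return var.value

            def write(self, var, value):
                self.writes[var] = value               # buffered until commit

            def commit(self):
                with _commit_lock:                     # commits are serialized in this toy
                    for var, seen in self.reads.items():
                        if var.version != seen:        # another commit got in first
                            raise Conflict()
                    for var, value in self.writes.items():
                        var.value, var.version = value, var.version + 1

        def atomic(body):
            while True:                                # re-execute until commit succeeds
                txn = Txn()
                body(txn)
                try:
                    txn.commit()
                    return
                except Conflict:
                    continue                           # discard buffers and retry

        # Usage: an atomic transfer with no per-account locking discipline.
        a, b = TVar(100), TVar(0)
        atomic(lambda txn: (txn.write(a, txn.read(a) - 50),
                            txn.write(b, txn.read(b) + 50)))
        print(a.value, b.value)                        # 50 50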

  • MICROARQUITECTURA Y COMPILADORES PARA FUTUROS PROCESADORES II

     Parcerisa Bundó, Joan Manuel; Canal Corretger, Ramon; Tubella Murgadas, Jordi; Cruz Diaz, Josep-llorenç; Gonzalez Colas, Antonio Maria
    Participation in a competitive project

  • High performance, ultra-low power streaming systems

     Arnau, Jose Maria; Parcerisa Bundó, Joan Manuel; Xekalakis, Polychronis; Gonzalez Colas, Antonio Maria
    Date: 2011-09-26
    Report

  • Leveraging register windows to reduce physical registers to the bare minimum

     Quiñones Moreno, Eduardo; Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    IEEE transactions on computers
    Date of publication: 2010-12
    Journal article

  • MICROARQUITECTURA I COMPILADORS (ARCO)

     Gibert Codina, Enric; Aliagas Castell, Carles; Aleta Ortega, Alexandre; Molina Clemente, Carlos; Unsal, Osman Sabri; Piñeiro Riobo, Jose Alejandro; Vera Rivera, Francisco Javier; Gonzalez Colas, Antonio Maria; Canal Corretger, Ramon; Cruz Diaz, Josep-llorenç; Parcerisa Bundó, Joan Manuel; Pons Solé, Marc; Magklis, Grigorios; Codina Viñas, Josep M; Tubella Murgadas, Jordi
    Participation in a competitive project

  • Predicated execution and register windows for out-of-order processors  Open access

     Quiñones Moreno, Eduardo
    Defense's date: 2008-11-18
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    ISA extensions are a very powerful approach to implement new hardware techniques that require or benefit from compiler support: decisions made at compile time can be complemented at runtime, achieving a synergistic effect between the compiler and the processor. This thesis is focused on two ISA extensions: predicate execution and register windows. Predicate execution is exploited by the if-conversion compiler technique. If-conversion removes control dependences by transforming them to data dependences, which helps to exploit ILP beyond a single basic block. Register windows help to reduce the amount of loads and stores required to save and restore registers across procedure calls by storing multiple contexts into a large architectural register file. In-order processors especially benefit from using both ISA extensions to overcome the limitations that control dependences and the memory hierarchy impose on static scheduling. Predicate execution allows moving control-dependent instructions past branches. Register windows reduce the amount of memory operations across procedure calls. Although the if-conversion and register window techniques have not been developed exclusively for in-order processors, their use for out-of-order processors has been studied very little. In this thesis we show that the use of if-conversion and register windows introduces new performance opportunities and new challenges to face in out-of-order processors. The use of if-conversion in out-of-order processors helps to eliminate hard-to-predict branches, alleviating the severe performance penalties caused by branch mispredictions. However, the removal of some conditional branches by if-conversion may adversely affect the predictability of the remaining branches, because it may reduce the amount of correlation information available to the branch predictor. Moreover, predicate execution in out-of-order processors has to deal with two performance issues. First, multiple definitions of the same logical register can be merged into a single control flow, where each definition is guarded with a different predicate. Second, instructions whose guarding predicate evaluates to false consume unnecessary resources. This thesis proposes a branch prediction scheme based on predicate prediction that solves the three problems mentioned above. This scheme, which is built on top of a predicated ISA that implements a compare-and-branch model such as the one considered in this thesis, has two advantages: first, branch accuracy is improved because the correlation information is not lost after if-conversion, and the mechanism we propose permits using the computed value of the branch predicate when available, achieving 100% accuracy; second, it avoids the predicate out-of-order execution problems. Regarding register windows, we propose a mechanism that reduces the physical register requirements of an out-of-order processor to the bare minimum with almost no performance loss. The mechanism is based on identifying which architectural registers are in use by current in-flight instructions. The registers which are not in use, i.e. those not referenced by any in-flight instruction, can be released early. In this thesis we propose a very efficient and low-cost hardware implementation of predicate execution and register windows that provides important benefits to out-of-order processors.
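
    For readers unfamiliar with if-conversion, the following source-level analogy (illustrative only; the thesis operates at the ISA/microarchitecture level) shows how hard-to-predict branches are replaced by predicate computations whose guarded results are merged, turning control dependences into data dependences:

        # Source-level analogy of if-conversion: the data-dependent branches below
        # are removed by computing both sides under complementary predicates and
        # merging the guarded results.

        def clamp_branchy(x, lo, hi):
            if x < lo:            # potentially hard-to-predict branches
                return lo
            if x > hi:
                return hi
            return x

        def clamp_if_converted(x, lo, hi):
            p_lo = x < lo                         # predicate defines
            p_hi = x > hi
            p_mid = not p_lo and not p_hi
            # three guarded definitions of the same logical result, merged:
            return p_lo * lo + p_hi * hi + p_mid * x

        for x in (-5, 3, 42):
            assert clamp_branchy(x, 0, 10) == clamp_if_converted(x, 0, 10)
        print("branchy and if-converted versions agree")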

  • Cómo mejorar el feedback mediante una herramienta de corrección automática

     Alvarez Martinez, Carlos; Jimenez Gonzalez, Daniel; Lopez Alvarez, David; Alonso López, Javier; Tous Liesa, Ruben; Parcerisa Bundó, Joan Manuel; Barlet Ros, Pere; Fernandez Barta, Montserrat; Tubella Murgadas, Jordi; Pérez, Christian
    Jornades de Docència del Departament d'Arquitectura de Computadors
    Presentation's date: 2008-02-15
    Presentation of work at congresses

  • SISA-EMU: feedback automático para ensamblador

     Alvarez Martinez, Carlos; Jimenez Gonzalez, Daniel; Lopez Alvarez, David; Alonso López, Javier; Tous Liesa, Ruben; Parcerisa Bundó, Joan Manuel; Barlet Ros, Pere; Fernandez Barta, Montserrat; Tubella Murgadas, Jordi; Pérez, Christian
    Jornadas de Enseñanza Universitaria de la Informática
    Presentation of work at congresses

  • 6º Premio Duran Farell de Investigación Tecnológica

     Gonzalez Colas, Antonio Maria; Aleta Ortega, Alexandre; Canal Corretger, Ramon; Parcerisa Bundó, Joan Manuel; Abella Ferrer, Jaume; Bieschewski, Stefan; Cai, Qiong; Codina, Josep Maria; Chaparro, Pedro; Gibert, Enric; Latorre, Fernando
    Award or recognition

  • Work in Progress-Improving Feedback Using an Automatic Assessment Tool

     Jimenez Gonzalez, Daniel; Alvarez Martinez, Carlos; Lopez Alvarez, David; Parcerisa Bundó, Joan Manuel; Alonso López, Javier; Pérez, Christian; Tous Liesa, Ruben; Barlet Ros, Pere; Fernandez Barta, Montserrat; Tubella Murgadas, Jordi
    Frontiers in Education Conference 2008
    Presentation's date: 2008-10-22
    Presentation of work at congresses

  • Una herramienta automática de feedback para ensamblador

     Alvarez Martinez, Carlos; Jimenez Gonzalez, Daniel; Lopez Alvarez, David; Alonso López, Javier; Tous Liesa, Ruben; Parcerisa Bundó, Joan Manuel; Barlet Ros, Pere; Fernandez Barta, Montserrat; Tubella Murgadas, Jordi; Pérez, Christian
    Date: 2008-10
    Report

  • Improving Branch Prediction and Predicated Execution in Out-Of-Order Processors

     Quiñones Moreno, Eduardo; Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    13th International Symposium on High-Performance Computer Architecture (HPCA-13)
    Presentation of work at congresses

  • Early Register Release for Out-of-Order Processors with Register Windows

     Quiñones Moreno, Eduardo; Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    16th International Conference on Parallel Architectures and Compilation Techniques (PACT'07)
    Presentation of work at congresses

  • Projecte pilot d'innovació docent de l'assignatura Estructura de Computadors 1, ref. 2007MQD-00203

     Alvarez Martinez, Carlos; Tubella Murgadas, Jordi; Parcerisa Bundó, Joan Manuel
    Participation in a competitive project

  • Microarquitectura y compiladores para futuros procesadores

     Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    Participation in a competitive project

  • Early Register Release for Out-of-Order Processors with Register Windows

     Quiñones Moreno, Eduardo; Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    Date: 2007-04
    Report

  • A Fully-Distributed First Level Memory Architecture

     Bieschewski, S; Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    Date: 2007-10
    Report

  • Selective Predicate Prediction for Out-of-Order Processors

     Quiñones Moreno, Eduardo; Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    20th ACM International Conference on Supercomputing (ICS'2006)
    Presentation of work at congresses

  • Arquitecturas y Compiladores, ref. 2005SGR00950

     Gonzalez Colas, Antonio Maria; Parcerisa Bundó, Joan Manuel; Canal Corretger, Ramon
    Participation in a competitive project

  • Selective Predicate Prediction for Out-of-Order Processors

     Quiñones Moreno, Eduardo; Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    Date: 2006-03
    Report

  • Improving branch prediction and predicated execution in out-of-order processors

     Quiñones Moreno, Eduardo; Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    Date: 2006-08
    Report

  • On-Chip Interconnects and Instruction Steering Schemes for Clustered Microarchitectures

     Parcerisa Bundó, Joan Manuel; Sahuquillo, J; Gonzalez Colas, Antonio Maria; Duato, J
    IEEE transactions on parallel and distributed systems
    Date of publication: 2005-02
    Journal article

  • HW/SW Parallelism Exploitation in Chip Multiprocessor Architectures

     Gonzalez Colas, Antonio Maria; Parcerisa Bundó, Joan Manuel; Canal Corretger, Ramon
    Participation in a competitive project

  • Memory Bank Predictors

     Bieschewski, S; Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    Date: 2005-05
    Report

  • Memory Bank Predictors

     Bieschewski, Stefan; Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    23rd IEEE International Conference on Computer Design, ICCD 2005
    Presentation of work at congresses

  • Design of Clustered Superscalar Microarchitectures  Open access

     Parcerisa Bundó, Joan Manuel
    Defense's date: 2004-06-17
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    The objective of this thesis is to propose new techniques to design efficient clustered superscalar microarchitectures. Clustered microarchitectures partition the layout of several critical hardware components as a means to keep most of the parallelism while improving the scalability. A clustered processor core, made up of several low-complexity blocks or clusters, can efficiently execute chains of dependent instructions without paying the overheads of long issue, register read or bypass latencies. Of course, when two dependent instructions execute in different clusters, an inter-cluster communication penalty is incurred. Moreover, distributed structures usually imply lower dynamic power requirements, and simplify power management via techniques such as selective clock/power gating and voltage scaling. The first target of this research is the assignment of instructions to clusters, since it plays a major role in performance, with the goals of keeping the workload of clusters balanced and reducing the penalty of critical communications. Two different approaches are proposed: first, a family of new schemes that dynamically identify groups of data-dependent instructions called slices, and make cluster assignments on a per-slice basis. The proposed schemes differ from previous approaches either because they are dynamic and/or because they include new mechanisms to deal explicitly with workload balance information gathered at runtime. Second, a family of new dynamic schemes that assign instructions to clusters on a per-instruction basis, based on prior assignments of the source register producers, on the cluster location of the source physical registers, and on the workload of clusters. The second contribution proposes value prediction as a means to mitigate the penalties of wire delays and, in particular, to hide inter-cluster communications while also improving workload balance. First, it is proven that the benefit of breaking dependences with value prediction grows with the number of clusters and the communication latency, thus it is higher than for a centralized architecture. Second, a cluster assignment scheme is proposed that exploits the less dense data dependence graph that results from predicting values to achieve a better workload balance. The third aspect considered is the cluster interconnect, which mainly determines communication latency, seeking the best trade-off between cost and performance. First, several cost-effective point-to-point interconnects are proposed, both synchronous and partially asynchronous, that approach the IPC of an ideal model with unlimited bandwidth while keeping the complexity low. The proposed interconnects have much lower impact than other approaches on the complexity of bypasses, issue queues and register files. Second, possible router implementations are proposed, which illustrate their feasibility with very simple and low-latency hardware solutions. Third, a new topology-aware improvement to the cluster assignment scheme is proposed to reduce the distance (and latency) of inter-cluster communications. The last contribution proposes techniques for distributing the main components of the processor front-end with the goals of reducing their complexity and avoiding replication. In particular, effective techniques are proposed to cluster the branch predictor and the steering logic that minimize the wire delay penalties caused by broadcasting recursive dependences in two critical hardware loops: the fetch address generation, and the cluster assignment logic, respectively. In the former case, the proposed technique converts the cross-structure wire delays of a centralized predictor into cross-cluster communication delays, which are smoothly pipelined. In the latter case, the partitioning of the instruction steering logic involves the parallelization of an inherently sequential task such as the dependence-based cluster assignment of instructions.
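
    As an illustration of the per-instruction, dependence-based steering described above, the sketch below (cluster count, imbalance threshold and instruction encoding are assumptions, not the thesis's exact schemes) sends each instruction to the cluster that produced its sources unless that cluster is too loaded:

        # Rough sketch of per-instruction, dependence-based cluster assignment:
        # send each instruction to the cluster that produced its source registers
        # unless the workload imbalance exceeds a threshold.
        from collections import Counter

        N_CLUSTERS = 4
        IMBALANCE_THRESHOLD = 8       # max tolerated difference in pending work

        def steer(instructions):
            producer = {}             # destination register -> cluster that writes it
            load = Counter({c: 0 for c in range(N_CLUSTERS)})
            assignment = []
            for dest, sources in instructions:
                # Prefer the cluster producing most source operands (fewer
                # inter-cluster communications); otherwise the least-loaded one.
                candidates = [producer[r] for r in sources if r in producer]
                least = min(load, key=load.get)
                preferred = (Counter(candidates).most_common(1)[0][0]
                             if candidates else least)
                # Override the dependence hint if it would unbalance the clusters.
                chosen = preferred if load[preferred] - load[least] <= IMBALANCE_THRESHOLD else least
                load[chosen] += 1
                if dest is not None:
                    producer[dest] = chosen
                assignment.append(chosen)
            return assignment

        # Tiny example: (destination register, source registers) per instruction.
        trace = [("r1", []), ("r2", ["r1"]), ("r3", ["r1", "r2"]), ("r4", []), ("r5", ["r4"])]
        print(steer(trace))           # dependent chains stay together: [0, 0, 0, 1, 1]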

  • Design of Clustered Superscalar Microarchitectures

     Parcerisa Bundó, Joan Manuel
    9th International Academic Forum 2004
    Presentation's date: 2004-04-20
    Presentation of work at congresses

  • TIN2004-07739-C02-01 Computación de Altas Prestaciones IV: Arquitecturas, Compiladores, Sistemas Operativos, Herramientas y Aplicaciones

     Valero Cortes, Mateo; Utrera Iglesias, Gladys Miriam; Martorell Bofill, Xavier; Muntés Mulero, Víctor; Gil Gómez, Maria Luisa; Ramirez Bellido, Alejandro; Alvarez Martinez, Carlos; Torres Viñals, Jordi; Farreras Esclusa, Montserrat; Gallardo Gomez, Antonia; Herrero Zaragoza, José Ramón; Guitart Fernández, Jordi; Parcerisa Bundó, Joan Manuel; Morancho Llena, Enrique; Salamí San Juan, Esther; Canal Corretger, Ramon; Moreto Planas, Miquel
    Participation in a competitive project

  • On-Chip Interconnects and Instruction Steering Schemes for Clustered Microarchitectures

     Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria; Sahuquillo, J; Kaeli, D
    Date: 2004-12
    Report

  • Partitioning the Front-End on Clustered Microarchitectures

     Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria; Smith, J; Fu, W
    Date: 2004-12
    Report

  • Efficient Interconnects for Clustered Microarchitectures

     Parcerisa Bundó, Joan Manuel; Sahuquillo, Julio; Gonzalez Colas, Antonio Maria; Duato, José
    11th International Conference on Parallel Architectures and Compilation Techniques (PACT'02)
    Presentation's date: 2002-09-22
    Presentation of work at congresses

  • A Clustered Front-End for Superscalar Processors

     Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria; Smith, J
    Date: 2002-07
    Report

  • Dynamic Code Partitioning for Clustered Architectures

     Canal Corretger, Ramon; Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    International journal of parallel programming
    Date of publication: 2001-02
    Journal article

  • Improving Latency Tolerance of Multithreading through Decoupling

     Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    IEEE transactions on computers
    Date of publication: 2001-10
    Journal article

  • Computación de Altas Prestaciones III: Arquitecturas, Compiladores, Sistemas Operativos, Herramientas y Algoritmos, ref. TIC2001-0995-C02-01

     Valero Cortes, Mateo; Utrera Iglesias, Gladys Miriam; Martorell Bofill, Xavier; Muntés Mulero, Víctor; Gil Gómez, Maria Luisa; Ramirez Bellido, Alejandro; Alvarez Martinez, Carlos; Torres Viñals, Jordi; Farreras Esclusa, Montserrat; Herrero Zaragoza, José Ramón; Guitart Fernández, Jordi; Parcerisa Bundó, Joan Manuel; Morancho Llena, Enrique; Salamí San Juan, Esther; Marín Tordera, Eva; Canal Corretger, Ramon
    Participation in a competitive project

  • Efficient Interconnects for Clustered Microarchitectures

     Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria; Sahuquillo, J; Duato, J
    Date: 2001-11
    Report

  • Building Fully Distributed Microarchitectures with Processor Slices

     Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria; Smith, J
    Date: 2001-11
    Report

  • Dynamic Cluster Assignment Mechanisms

     Canal Corretger, Ramon; Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    Sixth International Symposium on High-Performance Computer Architecture (HPCA-6)
    Presentation of work at congresses

  • Best Student Paper Award of the 6th International Symposium on High-Performance Computer Architecture

     Canal Corretger, Ramon; Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    Award or recognition

  • Dynamic Code Partitioning for Clustered Architectures

     Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria; Canal Corretger, Ramon
    Date: 2000-02
    Report

  • Reducing Wire Delay Penalty through Value Prediction

     Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    Date: 2000-06
    Report

  • Reducing Wire Delay Penalty through Value Prediction

     Parcerisa Bundó, Joan Manuel; Gonzalez Colas, Antonio Maria
    IEEE/ACM International Symposium on Microarchitecture
    Presentation's date: 2000-12-10
    Presentation of work at congresses

  • Dynamic Cluster Assignment Mechanisms

     Parcerisa Bundó, Joan Manuel
    Sixth International Symposium on High-Performance Computer Architecture (HPCA-6)
    Presentation's date: 2000-01-08
    Presentation of work at congresses

  • A Cost-Effective Clustered Architecture

     Parcerisa Bundó, Joan Manuel
    8th International Conference on Parallel Architectures and Compilation Techniques (PACT'99)
    Presentation's date: 1999-10-12
    Presentation of work at congresses
