

Scientific and technological production

1 to 50 of 776 results
  • Efficient Hardware/Software Co-Designed Schemes for Low-Power Processors

     Lopez Muñoz, Pedro
    Defense's date: 2014-03-17
    Universitat Politècnica de Catalunya
    Theses


    Industry and the research community have adopted multi-core designs to keep improving processor performance while maintaining reasonable power consumption and die size. These designs, however, force a trade-off: devote the die area to many simple cores that favor thread-level parallelism (TLP) at the expense of instruction-level parallelism (ILP), or devote it to a small number of complex cores that favor ILP and exploit TLP only moderately. This thesis advocates HW/SW co-designed processors as an alternative that continues to exploit both TLP and ILP. These designs improve the ILP of simple cores by introducing a software layer that dynamically adapts applications to make better use of the hardware resources. We propose three techniques aimed at simplifying the hardware design and improving the performance of low-power co-designed processors. They target the three key points of such designs: detecting frequently executed code (hot code), optimizing it, and executing it. The first technique is a profiling mechanism, the LIU Profiler, that quickly detects hot code. The LIU Profiler uses a small hardware table implementing a novel replacement policy designed for hot-code detection, complemented by a software component that builds, from the table's contents, the code regions that will later be optimized and executed. With a 128-entry table, the code detected by the LIU Profiler covers 85.5% of the dynamic code, whereas other proposals need tables 4 to 8 times larger to obtain similar results. Implementing the LIU Profiler increases the area of a simple low-power processor by only 1% and its power consumption by 0.87%. The second technique is a mechanism to save and restore the architectural state of the registers efficiently. Its goal is to allow the execution of code regions optimized with very aggressive speculation while keeping recovery costs on a misspeculation minimal. The proposed mechanism, HRC (Hybrid Register Checkpointing), combines hardware and software: the software prepares the code to save and restore register values, while the hardware supports these operations with specialized resources. HRC reduces the register file area by 11% and its energy consumption by 24.4% compared to common techniques that require doubling the number of registers, while degrading their performance by only 1%. The third technique, Loop Parallelization (LP), increases instruction-level parallelism by executing multiple iterations of the same loop in parallel, using multiple threads on an SMT processor. The software layer detects loops, prepares them, and adapts them with special optimizations for later execution. LP uses a novel mechanism to communicate values between registers belonging to different threads. LP also reuses existing processor resources to reduce the overheads of loop parallelization, making it possible to parallelize both small loops and loops that iterate few times. LP improves loop execution by 16.5% over a specially optimized baseline. The technique thus favors integrating a large number of simple cores on a single die and lets them cooperate to improve instruction-level parallelism.
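
    The abstract reports what the LIU Profiler achieves but not its exact replacement policy, so the following is only a minimal software sketch of table-based hot-code detection, assuming a fixed-capacity table of execution counters; the table size matches the 128-entry evaluation above, while the hotness threshold and the evict-the-coldest-entry rule are illustrative assumptions, not the thesis's LIU policy.

        # Minimal sketch of table-based hot-code profiling (Python).
        TABLE_SIZE = 128      # entries, as in the LIU Profiler evaluation
        HOT_THRESHOLD = 64    # executions before code is deemed "hot" (assumed)

        class HotCodeTable:
            def __init__(self):
                self.entries = {}  # branch-target address -> execution counter

            def observe(self, target):
                """Record one execution of a potential region entry point."""
                if target in self.entries:
                    self.entries[target] += 1
                elif len(self.entries) < TABLE_SIZE:
                    self.entries[target] = 1
                else:
                    # Replacement: evict the coldest entry (assumed rule).
                    victim = min(self.entries, key=self.entries.get)
                    del self.entries[victim]
                    self.entries[target] = 1

            def hot_regions(self):
                """Entry points the software layer would build regions around."""
                return [a for a, c in self.entries.items() if c >= HOT_THRESHOLD]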

  • Speculative dynamic vectorization to assist static vectorization in a HW/SW co-designed environment

     Kumar, Rakesh; Martinez, Alejandro; Gonzalez Colas, Antonio Maria
    International Conference on High Performance Computing
    Presentation's date: 2013-12-18
    Presentation of work at congresses


    Compiler-based static vectorization is widely used to extract data-level parallelism from computation-intensive applications. Static vectorization is very effective for traditional array-based applications. However, the compilers' inability to reorder ambiguous memory references severely limits vectorization opportunities, especially in pointer-rich applications. HW/SW co-designed processors provide an excellent opportunity to optimize applications at runtime. The availability of dynamic application behavior at runtime helps capture vectorization opportunities generally missed by compilers. This paper proposes to complement static vectorization with a speculative dynamic vectorizer in a HW/SW co-designed processor. We present a speculative dynamic vectorization algorithm that speculatively reorders ambiguous memory references to uncover vectorization opportunities. The hardware checks for any memory dependence violation due to speculative vectorization and takes corrective action in case of violation. Our experiments show that the combined (static + dynamic) vectorization approach provides 2x the performance benefit of static vectorization alone for SPECFP2006. Moreover, the dynamic vectorization scheme is as effective at vectorizing pointer-based applications as array-based ones, whereas compilers lose significant vectorization opportunities in pointer-based applications.
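
    As a rough illustration of the speculate/check/recover structure described above (our own sketch, not the paper's algorithm or its hardware support), the fragment below vectorizes a simple loop while treating the load and store address ranges as ambiguous: each chunk executes speculatively, a dependence check compares the address sets, and a violation triggers scalar re-execution.

        # Sketch: compute mem[dst+i] = mem[src+i] + 1, vectorizing `width`
        # elements at a time under the speculation that loads and stores
        # in a chunk do not alias. Names and checks are illustrative.
        def speculative_vector_add(mem, dst, src, n, width=4):
            for base in range(0, n - n % width, width):
                loads = range(src + base, src + base + width)
                stores = range(dst + base, dst + base + width)
                chunk = [mem[a] + 1 for a in loads]       # speculative vector load+op
                if set(loads) & set(stores):              # dependence violation?
                    for i in range(base, base + width):   # recover: scalar re-execution
                        mem[dst + i] = mem[src + i] + 1
                else:
                    for off, a in enumerate(stores):      # commit vector store
                        mem[a] = chunk[off]
            for i in range(n - n % width, n):             # scalar remainder
                mem[dst + i] = mem[src + i] + 1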

  • Effectiveness of hybrid recovery techniques on parametric failures

     Ganapathy, Shrikanth; Canal Corretger, Ramon; Gonzalez Colas, Antonio Maria; Rubio Sola, Jose Antonio
    International Symposium on Quality Electronic Design
    Presentation's date: 2013-03
    Presentation of work at congresses


    Modern microprocessors effectively utilise supply voltage scaling for tremendous power reduction. The minimum voltage below which a processor cannot operate reliably is defined as Vddmin. On-chip memories like caches are the most susceptible to voltage-noise-induced failures because of process variations and reduced noise margins, and thereby dictate the whole processor's Vddmin. In this paper, we evaluate the effectiveness of a new class of hybrid techniques in improving cache yield through failure prevention and correction. Proactive read/write assist techniques like body biasing (BB) and wordline boosting (WLB), when combined with reactive techniques like ECC and redundancy, are shown to offer better quality-energy-area trade-offs than their standalone configurations. Proactive techniques can help lower Vddmin (improving functional margin) for significant power savings, and reactive techniques ensure that the resulting large number of failures are corrected (improving functional yield). Our results in 22nm technology indicate that at scaled supply voltages, hybrid techniques can improve parametric yield by at least 28% when considering worst-case process variations.

  • Reducing DUE-FIT of caches by exploiting acoustic wave detectors for error recovery

     Upasani, Gaurang; Vera, Xavier; Gonzalez Colas, Antonio Maria
    IEEE International On-Line Testing Symposium
    Presentation's date: 2013-07-08
    Presentation of work at congresses


    Cosmic-radiation-induced soft errors have emerged as a key challenge in computer system design. The exponential increase in transistor count will drive the per-chip fault rate sky high. New techniques for detecting errors in the logic and memories that allow meeting the desired failures-in-time (FIT) budget in future chip multiprocessors (CMPs) are essential. Of the two major contributors to the soft error rate, silent data corruption (SDC) and detected unrecoverable error (DUE), DUE is the largest. Moreover, processors can experience a super-linear increase in DUE when the size of the write-back cache is doubled. This paper targets the DUE problem in write-back data caches. We analyze the cost of protection against single-bit and multi-bit upsets in caches. Our results show that the proposed mechanism can reduce the DUE to zero with minimum area, power and performance overheads.

  • Deconfigurable microprocessor architectures for silicon debug acceleration

     Foutris, Nikos; Gizopoulos, Dimitris; Vera, Xavier; Gonzalez Colas, Antonio Maria
    International Symposium on Computer Architecture
    Presentation's date: 2013-06-23
    Presentation of work at congresses


    The share of silicon debug in the overall microprocessor chip development cycle is rapidly expanding due to the ever-growing design complexity and the limited efficiency of pre-silicon validation methods. Massive application of short random test programs on the prototype microprocessor chips is one of the most effective parts of silicon debug. However, a major bottleneck and source of "noise" in this phase is that large numbers of random test programs fail due to the same or similar design bugs. This redundant behavior adds long delays to the debug flow, since each failing random program must be examined separately although it does not usually bring new debug information. The development of effective techniques that detect dominant modes of failure among random programs and triage them into common categories eliminates redundant debug sessions and significantly boosts silicon debug.
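
    The triage idea lends itself to a small sketch: bucket failing random test programs by a failure signature so each dominant failure mode is debugged once. This is a hedged illustration; the signature function and record fields below are hypothetical, not the paper's mechanism.

        from collections import defaultdict

        def triage(failures, signature):
            """Group failing random test programs into buckets that likely
            share one design bug, so each bucket is examined only once."""
            buckets = defaultdict(list)
            for failure in failures:
                buckets[signature(failure)].append(failure)
            return buckets

        # Hypothetical signature: the first architectural register whose value
        # mismatched between the prototype chip and the golden reference model.
        failures = [{"test": "t1", "first_mismatch": "r7"},
                    {"test": "t2", "first_mismatch": "r7"},
                    {"test": "t3", "first_mismatch": "pc"}]
        print(triage(failures, lambda f: f["first_mismatch"]))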

  • HRC: An efficient hybrid register checkpointing for HW/SW co-designed processors

     Lopez, Pedro; Codina, Josep Maria; Gibert Codina, Enric; Latorre, Fernando; Gonzalez Colas, Antonio Maria
    Workshop on Architectural and Microarchitectural Support for Binary Translation
    Presentation's date: 2013-06-24
    Presentation of work at congresses


  • Dynamic selective devectorization for efficient power gating of SIMD units in a HW/SW co-designed environment

     Kumar, Rakesh; Martinez, Alejandro; Gonzalez Colas, Antonio Maria
    International Symposium on Computer Architecture and High Performance Computing
    Presentation's date: 2013-10-23
    Presentation of work at congresses


    Leakage power is a growing concern in current and future microprocessors. Functional units of microprocessors are responsible for a major fraction of this power; therefore, reducing functional unit leakage has received much attention in recent years. Power gating is one of the most widely used techniques to minimize leakage energy: it turns off the functional units during idle periods, so the amount of leakage energy saved is directly proportional to the duration of the idle time. This paper focuses on increasing the idle interval for the higher SIMD lanes. The applications are profiled dynamically, in a HW/SW co-designed environment, to find the usage pattern of the higher SIMD lanes. If the higher lanes would need to be turned on only for short periods, the corresponding portion of the code is devectorized to keep the higher lanes off; the devectorized code is executed on the lowest SIMD lane. Our experimental results show average SIMD accelerator energy savings of 12% and 24% relative to power gating, for SPECFP2006 and Physicsbench respectively. Moreover, the slowdown caused by devectorization is less than 1%.
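
    The devectorization decision can be pictured as a power-gating break-even test (all constants below are illustrative assumptions, not the paper's model): gating the upper SIMD lanes pays off only when the leakage avoided during the idle interval exceeds the fixed energy cost of switching them off and on, so predicted short wake-ups favor devectorizing instead.

        # Illustrative break-even test for power gating the upper SIMD lanes.
        LEAK_PER_CYCLE = 1.0   # leakage energy per idle cycle (assumed)
        GATE_OVERHEAD = 500.0  # energy cost of one off/on transition (assumed)

        def gating_pays_off(idle_cycles):
            return idle_cycles * LEAK_PER_CYCLE > GATE_OVERHEAD

        def should_devectorize(predicted_idle_cycles):
            """If the upper lanes would wake before the break-even point,
            run the region devectorized on the low lane and keep them off."""
            return not gating_pays_off(predicted_idle_cycles)

        print(should_devectorize(200))   # True: interval too short, devectorize
        print(should_devectorize(5000))  # False: long idle period, let gating act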

  • Vectorizing for wider vector units in a HW/SW co-designed environment

     Kumar, Rakesh; Martinez, Alejandro; Gonzalez Colas, Antonio Maria
    IEEE International Conference on High Performance Computing and Communications
    Presentation's date: 2013-11-13
    Presentation of work at congresses


    SIMD accelerators provide an energy-efficient way of improving the computational power of modern microprocessors. Due to their hardware simplicity, these accelerators have evolved in width from the 64-bit vectors of Intel's MMX to the 512-bit wide vector units of Intel's Xeon Phi. Although SIMD accelerators are simple in terms of hardware design, code generation for them has always been a challenge. This paper explores the scalability of SIMD accelerators from the code generation point of view. We explore the potential problems in vectorization at higher vector lengths. Furthermore, we propose Variable Length Vectorization and Selective Writing in a HW/SW co-designed environment to get around these problems. We evaluate our proposals using a set of SPECFP2006 and Physicsbench applications. Our experimental results show an average dynamic instruction elimination of 33% and 40% and an average speedup of 15% and 10% for SPECFP2006 and Physicsbench respectively, for 512-bit vector length, over the scalar baseline code.
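
    One way to picture variable-length vectorization (a sketch of the general idea; whether the paper's VLV proposal selects widths exactly this way is our assumption) is to cover a loop's trip count with the widest vector operations first and progressively narrower ones, leaving as little scalar remainder as possible.

        # Cover a trip count with multiple vector lengths (widths illustrative).
        WIDTHS = [16, 8, 4, 2]   # elements per vector op, widest first

        def vector_plan(trip_count):
            plan, left = [], trip_count
            for w in WIDTHS:
                ops, left = divmod(left, w)
                plan += [w] * ops
            return plan + [1] * left   # scalar leftovers

        print(vector_plan(23))   # [16, 4, 2, 1]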


  • Memory controller-level extensions for GDDR5 single device data correct support

     Carretero Casado, Javier; Hernández, Isaac; Vera, Xavier; Juan Hormigo, Antonio; Herrero, Enric; Ramirez, Tanausu; Monchiero, Matteo; Gonzalez Colas, Antonio Maria; Axelos, Nikolaos; Sanchez Pedreño, Daniel
    Intel technology journal
    Date of publication: 2013-05-01
    Journal article


    Support for Reliability, Availability, and Serviceability (RAS) is one of the quintessential features of computing systems targeting the server and mission-critical markets. Among these RAS features, Chipkill* stands out as the most crucial for main memory protection. IBM Chipkill protects the main memory from the failure of an entire memory chip, as well as from multi-bit faults in any portion of a memory chip. Similar technologies from other vendors are Single Device Data Correction (SDDC) from Intel, Sun Extended ECC* and HP Chipspare*. However, some advanced memory technologies (such as GDDR5) do not allow a traditional SDDC implementation, since their specification does not include extra devices to store error correction codes (ECC). Some future high performance computing products hitting the server market will be based on these advanced memory technologies. In this article we propose a method to provide SDDC support at the memory controller level for memory technologies that inherently have no RAS support for memory contents protection. Specifically, we focus on how to provide single-device SDDC support for GDDR5 memory. The technique allows the failure of 1/8 of the memory devices to be tolerated by using 25 percent of the memory to store error correction codes. We also describe how the technique can be implemented for RAS-less memory technologies feeding a wider data bus than GDDR5 (such as DDR3, which in fact uses narrower devices). This opens the possibility of offering high reliability with cheap DIMM devices. We also describe how to provide SDDC support without the use of lockstepped memory channels.
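
    The stated overhead is easy to sanity-check: dedicating two of every eight device-words to check information is exactly 25 percent of the memory, which in a RAID-6-style P+Q organization is enough to locate and correct one failed device out of eight. The sketch below (a generic RAID-style illustration, not the article's memory-controller design) shows only the simpler half of such a scheme: rebuilding a device whose failure location is already known, using XOR parity; locating an unknown failing device is what the extra check information buys.

        from functools import reduce
        from operator import xor

        def parity(words):
            """XOR parity across one stripe of per-device data words."""
            return reduce(xor, words, 0)

        def reconstruct(stripe, failed_idx):
            """Rebuild the word of a device whose failure location is known
            (an erasure): XOR of the surviving words, parity included."""
            return parity(w for i, w in enumerate(stripe) if i != failed_idx)

        data = [5, 9, 2, 7, 1, 4, 8]          # seven data devices
        stripe = data + [parity(data)]        # plus one parity device
        assert reconstruct(stripe, 3) == 7    # device 3 rebuilt exactly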

  • Replacement techniques for dynamic NUCA cache designs on CMPs

     Lira Rueda, Javier; Molina, Carlos; Rakvic, Ryan N.; Gonzalez Colas, Antonio Maria
    Journal of supercomputing
    Date of publication: 2013-05
    Journal article


    The growing influence of wire delay in cache design means that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessor (CMP) architectures to reduce requests to the off-chip memory, because of the significant speed gap between processor and memory. Therefore, a bank replacement policy that efficiently manages the NUCA cache is desirable. However, the decentralized nature of NUCA has eliminated the effectiveness of traditional replacement policies, because banks operate independently of each other and their replacement decisions are thus restricted to a single NUCA bank. In this paper, we propose three different techniques to deal with replacements in NUCA caches.

  • Design of a Distributed Memory Unit for Clustered Microarchitectures  Open access

     Bieschewski, Stefan
    Defense's date: 2013-06-20
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


    Power constraints led to the end of exponential growth in single-processor performance, which characterized the semiconductor industry for many years. Single-chip multiprocessors have allowed the performance growth to continue so far. Yet, Amdahl's law asserts that the overall performance of future single-chip multiprocessors will depend crucially on single-processor performance. In a multiprocessor, a small growth in single-processor performance can justify the use of significant resources. Partitioning the layout of critical components can improve the energy efficiency and ultimately the performance of a single processor. In a clustered microarchitecture, parts of these components form clusters. Instructions are processed locally in the clusters and benefit from the smaller size and complexity of the clusters' components. Because the clusters together process a single instruction stream, communications between clusters are necessary and introduce an additional cost. This thesis proposes the design of a distributed memory unit and first-level cache in the context of a clustered microarchitecture. While the partitioning of other parts of the microarchitecture has been well studied, the distribution of the memory unit and the cache has received comparatively little attention. The first proposal consists of a set of cache bank predictors; eight different predictor designs are compared based on cost and accuracy. The second proposal is the distributed memory unit. The load and store queues are split into smaller queues for distributed disambiguation, and the mapping of memory instructions to cache banks is delayed until addresses have been calculated. We show how disambiguation can be implemented efficiently with unordered queues. A bank predictor is used to map instructions that consume memory data near the data's origin. We show that this organization significantly reduces both energy usage and latency. The third proposal introduces Dispatch Throttling and Pre-Access Queues, mechanisms that avoid load/store queue overflows resulting from the late allocation of entries. The fourth proposal introduces Memory Issue Queues, which add to the memory unit the functionality to select instructions for execution and re-execution. The fifth proposal introduces Conservative Deadlock Aware Entry Allocation, a deadlock-safe issue policy for the Memory Issue Queues; deadlocks can result from certain queue allocations because entries are allocated out of order instead of in order as in traditional architectures. The sixth proposal is the Early Release of Load Queue Entries: architectures with weak memory ordering, such as Alpha, PowerPC or ARMv7, can take advantage of this mechanism to release load queue entries before the commit stage. Together, these proposals allow significantly smaller and more energy-efficient load queues without the need for energy-hungry recovery mechanisms and without performance penalties. Finally, we present a detailed study that compares the proposed distributed memory unit to a centralized memory unit and confirms its advantages of reduced energy usage and improved performance.
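
    As a flavor of the simplest such design (our own minimal sketch; the thesis compares eight predictor designs whose details are not given in this abstract), a last-bank predictor remembers, per static load, the cache bank its previous access went to.

        class LastBankPredictor:
            """Minimal PC-indexed bank predictor. Table size, indexing and
            the address-to-bank mapping are illustrative assumptions."""
            def __init__(self, entries=256, banks=4):
                self.entries = entries
                self.banks = banks
                self.table = [0] * entries   # last bank observed per entry

            def _index(self, pc):
                return (pc >> 2) % self.entries

            def predict(self, pc):
                return self.table[self._index(pc)]

            def update(self, pc, address):
                # Banks assumed interleaved by address.
                self.table[self._index(pc)] = address % self.banks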

  • Premi Jaume I

     Gonzalez Colas, Antonio Maria
    Award or recognition


  • Improving the performance efficiency of an IDS by exploiting temporal locality in network traffic

     Sreekar Shenoy, Govind; Tubella Murgadas, Jordi; Gonzalez Colas, Antonio Maria
    IEEE International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems
    Presentation's date: 2012
    Presentation of work at congresses


    Network traffic has traditionally exhibited temporal locality in the header field of packets. Such locality is intuitive and is a consequence of the semantics of network protocols. In contrast, the locality in the packet payload has not been studied in significant detail. In this work we study temporal locality in the packet payload. Temporal locality can also be viewed as redundancy, and we observe significant redundancy in the packet payload. We investigate mechanisms to exploit it in a networking application, choosing Intrusion Detection Systems (IDS) as a case study. An IDS like the popular Snort operates by scanning packet payloads for known attack strings: it first builds a Finite State Machine (FSM) from a database of attack strings, and traverses this FSM using bytes from the packet payload. Temporal locality in network traffic therefore provides an opportunity to accelerate this FSM traversal. Our mechanism dynamically identifies redundant bytes in the packet and skips their redundant FSM traversal. We further parallelize our mechanism by performing the redundancy identification concurrently with stages of Snort packet processing. IDS are commonly deployed on commodity processors, and we evaluate our mechanism on an Intel Core i3. Our performance study indicates that the length of the redundant chunk is a key factor in performance. We also observe important performance benefits in deploying our redundancy-aware mechanism in the Snort IDS [32].
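
    A minimal sketch of the redundancy-exploiting idea, assuming the FSM is a dense per-state transition table (our simplification, not Snort's data structure): the effect of a payload chunk on the automaton depends only on the starting state and the bytes, so repeated (state, chunk) pairs can be memoized and their traversal skipped.

        def scan(fsm, accepting, payload, chunk_size=8, memo=None):
            """fsm[state][byte] -> next state; payload is a bytes object.
            Memoizes the net effect of each (state, chunk) pair: the final
            state plus any accepting offsets inside the chunk."""
            memo = {} if memo is None else memo
            state, alerts = 0, []
            for base in range(0, len(payload), chunk_size):
                piece = payload[base:base + chunk_size]
                key = (state, piece)
                if key not in memo:               # first sighting: traverse
                    s, hits = state, []
                    for off, b in enumerate(piece):
                        s = fsm[s][b]
                        if s in accepting:
                            hits.append(off)
                    memo[key] = (s, hits)
                state, hits = memo[key]           # redundant chunk: skip work
                alerts.extend(base + off for off in hits)
            return alerts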

  • Analysis of CPI variance for dynamic binary translators/optimizers modules

     Brankovic, Aleksandar; Stavrou, Kyriakos; Gibert Codina, Enric; Gonzalez Colas, Antonio Maria
    International Symposium on Computer Architecture
    Presentation's date: 2012-06-10
    Presentation of work at congresses


    Dynamic Binary Translators and Optimizers (DBTOs) have become a hot research topic. They are used in many different systems, such as emulation, instrumentation tools and innovative HW/SW co-designed microarchitectures. Although many researchers have worked on characterizing and reducing the emulation overhead, to the best of our knowledge there are no published results that explain how the microarchitectural behavior of the emulation software is affected by the guest application being emulated. In this paper we study the DBTO as an independent application, divided into modules with specific functionality. We show the variance in the microarchitectural behavior of the DBTO across 48 applications. Moreover, we locate and explain the sources of variance. The results show that the variance is caused by interaction with the code cache (the emulated application) and non-uniform module execution characteristics. The insights presented in this paper can be exploited in the design of more efficient DBTOs.

  • Speculative Dynamic Vectorization for HW/SW Co-designed Processors

     Kumar, Rakesh; Martinez Vicente, Alejandro; Gonzalez Colas, Antonio Maria
    International Conference on Parallel Architectures and Compilation Techniques
    Presentation's date: 2012-09-19
    Presentation of work at congresses


  • Improving the resilience of an IDS against performance throttling attacks

     Sreekar Shenoy, Govind; Tubella Murgadas, Jordi; Gonzalez Colas, Antonio Maria
    International Conference on Security and Privacy in Communication Networks
    Presentation's date: 2012-09-03
    Presentation of work at congresses


    Intrusion Detection Systems (IDS) have emerged as one of the most promising ways to secure systems in the network. To be effective against evasion attempts, the IDS must provide tight bounds on performance; otherwise an adversary can bypass the IDS by carefully crafting and sending packets that throttle it. This can render the IDS ineffective, leaving the network vulnerable. We present a performance throttling attack mounted against the computationally intensive string matching algorithm. This algorithm performs string matching by traversing a finite state machine (FSM). We observe that there are some input bytes that sequentially traverse a chain of 30 pointers. This chain of traversals drastically degrades performance, and we observe a 22X performance drop compared to the average-case performance. We investigate hardware and software mechanisms to counter this performance degradation. The software mechanism is targeted at commodity general-purpose CPUs, while the hardware-based mechanism uses a parallel traversal suitable for network processor architectures. Our results show that our proposed mechanisms significantly improve (by over 3X) the string matching algorithm's worst-performing cases.
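
    The worst case this attack exploits can be reproduced with a textbook Aho-Corasick automaton (our own construction, not the paper's code): a byte with no outgoing transition forces the traversal to chase the failure-pointer chain toward the root, one dereference per chain link.

        from collections import deque

        def build(patterns):
            """Textbook Aho-Corasick goto/failure construction (outputs omitted)."""
            goto, fail = [{}], [0]
            for p in patterns:
                s = 0
                for c in p:
                    if c not in goto[s]:
                        goto.append({})
                        fail.append(0)
                        goto[s][c] = len(goto) - 1
                    s = goto[s][c]
            q = deque(goto[0].values())
            while q:
                s = q.popleft()
                for c, t in goto[s].items():
                    q.append(t)
                    f = fail[s]
                    while f and c not in goto[f]:
                        f = fail[f]
                    fail[t] = goto[f].get(c, 0)
            return goto, fail

        def step(goto, fail, state, c):
            """Consume one byte, counting the failure-pointer chases it costs."""
            chases = 0
            while state and c not in goto[state]:
                state, chases = fail[state], chases + 1
            return goto[state].get(c, 0), chases

        goto, fail = build(["a" * 30 + "b"])
        s = 0
        for ch in "a" * 30:
            s, _ = step(goto, fail, s, ch)
        print(step(goto, fail, s, "x")[1])   # 30 chases for one crafted byte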

  • Setting an error detection infrastructure with low cost acoustic wave detectors

     Upasani, Gaurang; Vera Rivera, Francisco Javier; Gonzalez Colas, Antonio Maria
    International Symposium on Computer Architecture
    Presentation's date: 2012-06-12
    Presentation of work at congresses


    The continuing decrease in dimensions and operating voltage of transistors has increased their sensitivity against radiation phenomena making soft errors an important challenge in future chip multiprocessors (CMPs). Hence, new techniques for detecting errors in the logic and memories that allow meeting the desired failures-in-time (FIT) budget in CMPs are required. This paper proposes a low-cost dynamic particle strike detection mechanism through acoustic wave detectors. Our results show that our mechanism can protect both the logic and the memory arrays. As a case study, we also show how this technique can be combined with error codes to protect the last-level cache at low cost.

  • A novel variation-tolerant 4T-DRAM with enhanced soft-error tolerance

     Ganapathy, Shrikanth; Canal Corretger, Ramon; Alexandrescu, Dan; Costenaro, Enrico; Gonzalez Colas, Antonio Maria; Rubio Sola, Jose Antonio
    IEEE International Conference on Computer Design: VLSI in Computers and Processors
    Presentation's date: 2012-09-30
    Presentation of work at congresses


    In view of device scaling issues, embedded DRAM (eDRAM) technology is being considered as a strong alternative to conventional SRAM for use in on-chip memories. Memory cells designed using eDRAM technology, in addition to being logic-compatible, are variation tolerant and immune to noise present at low supply voltages. However, two major causes of concern are the data retention capability, which is worsened by parameter variations leading to frequent data refreshes (resulting in a large dynamic power overhead), and the transient reduction of stored charge, which increases soft-error (SE) susceptibility. In this paper, we present a novel variation-tolerant 4T-DRAM cell whose power consumption is 20.4% lower than that of a similarly sized eDRAM cell. The retention time is improved by 2.04X on average, while incurring a delay overhead of 3% on the read-access time. Most importantly, using a soft-error rate analysis tool, we have confirmed that the cell's sensitivity to SEs is reduced by 56% on average in a natural working environment.

  • Modelling HW/SW co-designed processors

     Cano Reyes, Jose; Brankovic, Aleksandar; Kumar, Rakesh; Zivanovic, Darko; Gibert Codina, Enric; Stavrou, Kyriakos; Pavlou, Demos; Martínez Vicente, Alejandro; Dot Artigas, Gem; Latorre, Fernando; Barcelo Cuerda, Alex; Gonzalez Colas, Antonio Maria
    International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems
    Presentation's date: 2012-07-08
    Presentation of work at congresses


  • A novel variation-tolerant 4T-DRAM cell with enhanced soft-error tolerance

     Ganapathy, Shrikanth; Canal Corretger, Ramon; Alexandrescu, Dan; Costenaro, Eric; Gonzalez Colas, Antonio Maria; Rubio Sola, Jose Antonio
    IEEE International Conference on Computer Design: VLSI in Computers and Processors
    Presentation's date: 2012-10-02
    Presentation of work at congresses


    In view of device scaling issues, embedded DRAM (eDRAM) technology is being considered as a strong alternative to conventional SRAM for use in on-chip memories. Memory cells designed using eDRAM technology, in addition to being logic-compatible, are variation tolerant and immune to noise present at low supply voltages. However, two major causes of concern are the data retention capability, which is worsened by parameter variations leading to frequent data refreshes (resulting in a large dynamic power overhead), and the transient reduction of stored charge, which increases soft-error (SE) susceptibility. In this paper, we present a novel variation-tolerant 4T-DRAM cell whose power consumption is 20.4% lower than that of a similarly sized eDRAM cell. The retention time is improved by 2.04X on average, while incurring a delay overhead of 3% on the read-access time. Most importantly, using a soft-error rate analysis tool, we have confirmed that the cell's sensitivity to SEs is reduced by 56% on average in a natural working environment.


  • The migration prefetcher: anticipating data promotion in dynamic NUCA caches

     Lira Rueda, Javier; Jones, Timothy M.; Molina, Carlos; Gonzalez Colas, Antonio Maria
    International Conference on High Performance and Embedded Architectures and Compilers
    Presentation's date: 2012-01-24
    Presentation of work at congresses


  • Hardware/software mechanisms for protecting an IDS against algorithmic complexity attacks

     Sreekar Shenoy, Govind; Tubella Murgadas, Jordi; Gonzalez Colas, Antonio Maria
    IEEE International Parallel and Distributed Processing Symposium
    Presentation's date: 2012
    Presentation of work at congresses


    Intrusion Detection Systems (IDS) have emerged as one of the most promising ways to secure systems in the network. An IDS like the popular Snort [17] detects attacks on the network using a database of previous attacks. To detect these attack strings in a packet, Snort uses the Aho-Corasick algorithm, which first constructs a Finite State Machine (FSM) from the attack strings and subsequently traverses the FSM using bytes from the packet. We observe that there are input bytes that result in a traversal of a series of FSM states (also viewed as pointers). This chain of pointer traversals significantly degrades (by 22X) the processing time of an input byte. Such a wide variance in the processing time of an input byte can be exploited by an adversary to throttle the IDS: if the IDS is unable to keep pace with the network traffic, it gets disabled, and in the process the network becomes vulnerable. Attacks of this kind are referred to as algorithmic complexity attacks, and arise due to weaknesses in IDS processing. In this work, we explore defense mechanisms against the above algorithmic complexity attack. Our proposed mechanisms provide over 3X improvement in the worst-case performance.

  • Exploiting temporal locality in network traffic using commodity multi-cores

     Sreekar Shenoy, Govind; Tubella Murgadas, Jordi; Gonzalez Colas, Antonio Maria
    IEEE International Symposium on Performance Analysis of Systems and Software
    Presentation's date: 2012-04-02
    Presentation of work at congresses


  • Impact of positive bias temperature instability (PBTI) on 3T1D-DRAM cells

     Aymerich Capdevila, Nivard; Ganapathy, Shrikanth; Rubio Sola, Jose Antonio; Canal Corretger, Ramon; Gonzalez Colas, Antonio Maria
    Integration. The VLSI journal
    Date of publication: 2012-06
    Journal article


  • The migration prefetcher: anticipating data promotion in dynamic NUCA caches  Open access

     Lira Rueda, Javier; Jones, Timothy M.; Molina, Carlos; Gonzalez Colas, Antonio Maria
    ACM transactions on architecture and code optimization
    Date of publication: 2012-01
    Journal article


  • A HW/SW co-designed programmable functional unit

     Deb, Abhishek; Codina, Josep Maria; Gonzalez Colas, Antonio Maria
    IEEE computer architecture letters
    Date of publication: 2012-07-02
    Journal article


  • Code Optimizations for Narrow Bitwidth Architectures  Open access

     Bhagat, Indu
    Defense's date: 2012-02-23
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


    This thesis takes a HW/SW collaborative approach to tackle the problem of computational inefficiency in a holistic manner. The hardware is redesigned by restraining the datapath to merely 16-bit datawidth (integer datapath only) to provide an extremely simple, low-cost, low-complexity execution core which is best at executing the most common case efficiently. This redesign, referred to as the Narrow Bitwidth Architecture, is unique in that although the datapath is squeezed to 16 bits, it continues to offer the advantage of higher memory addressability like contemporary wider-datapath architectures. Its interface to the outside (software) world is termed the Narrow ISA. The software is responsible for efficiently mapping the current stack of 64-bit applications onto the 16-bit hardware. However, this HW/SW approach introduces a non-negligible penalty both in dynamic code size and in performance, even with a reasonably smart code translator that maps the 64-bit applications onto the 16-bit processor. The goal of this thesis is to design a software layer that harnesses the power of compiler optimizations to assuage this negative performance penalty of the Narrow ISA. More specifically, this thesis focuses on compiler optimizations targeting the problem of how to compile a 64-bit program to a 16-bit datapath machine from the perspective of Minimum Required Computations (MRC). Given a program, the notion of MRC aims to infer how much computation is really required to generate the same (correct) output as the original program. Approaching perfect MRC is an intrinsically ambitious goal and requires oracle predictions of program behavior. Towards this end, the thesis proposes three heuristic-based optimizations to closely infer the MRC. The perspective of MRC unfolds into a definition of productiveness: if a computation does not alter the storage location, it is non-productive and hence not necessary to perform. In this research, the definition of productiveness has been applied to different granularities of the data flow as well as the control flow of programs. Three profile-based code optimization techniques have been proposed: 1. Global Productiveness Propagation (GPP), which applies the concept of productiveness at the granularity of a function. 2. Local Productiveness Pruning (LPP), which applies the same concept at the much finer granularity of a single instruction. 3. Minimal Branch Computation (MBC), a profile-based code-reordering optimization technique which applies the principles of MRC to conditional branches. The primary aim of all these techniques is to reduce the dynamic code footprint of the Narrow ISA. The first two optimizations (GPP and LPP) speculatively prune the non-productive (useless) computations using profiles. Further, these two techniques perform a backward traversal of the optimization regions to embed checks into the non-speculative slices, making them self-sufficient to detect mis-speculation dynamically. The MBC optimization is a use case of the broader concept of a lazy computation model: the idea behind MBC is to reorder the backslices containing narrow computations such that the minimal computations necessary to generate the same (correct) output are performed in the most frequent case, and the rest are performed only when necessary. With the proposed optimizations, it can be concluded that there do exist ways to smartly compile a 64-bit application to a 16-bit ISA such that the overheads are considerably reduced.
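
    A toy illustration of the productiveness definition itself (not of the GPP/LPP machinery above, which prunes speculatively from profiles and embeds runtime checks): a computation whose result equals what is already stored changes no state and is therefore unnecessary.

        # Toy model: each trace entry computes a value for an address. A write
        # that stores the value already present is non-productive under the
        # thesis's definition and is a pruning candidate. Names illustrative.
        def run(trace, memory):
            productive = 0
            for addr, compute in trace:
                value = compute(memory)
                if memory.get(addr) != value:   # productive: alters stored state
                    memory[addr] = value
                    productive += 1
                # else: non-productive; a profile-guided pass would prune it
            return productive

        mem = {0: 7}
        trace = [(0, lambda m: 7),          # rewrites the same value: prunable
                 (1, lambda m: m[0] + 1)]   # productive
        print(run(trace, mem), mem)         # 1 {0: 7, 1: 8}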


  • Architecture Support for Intrusion Detection Systems  Open access

     Sreekar Shenoy, Govind
    Defense's date: 2012-10-30
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


    System security is a prerequisite for efficient day-to-day transactions. As a consequence, Intrusion Detection Systems (IDS) are commonly used to provide an effective security ring to systems in a network. An IDS operates by inspecting packets flowing in the network for malicious content. To do so, an IDS like Snort [49] compares bytes in a packet with a database of prior reported attacks. This functionality can also be viewed as string matching of the packet bytes against the attack string database. Snort commonly uses the Aho-Corasick algorithm [2] to detect attacks in a packet. The Aho-Corasick algorithm works by first constructing a Finite State Machine (FSM) from the attack string database; the FSM is then traversed with the packet bytes. The main advantage of this algorithm is that it provides a linear-time search irrespective of the number of strings in the database. The issue, however, lies in devising a practical implementation: the FSM thus constructed gets very bloated in terms of storage size, and so is area inefficient. This also affects its performance efficiency as the memory footprint grows. Another issue is the limited scope for exploiting any parallelism, due to the inherently sequential nature of an FSM traversal. This thesis explores hardware and software techniques to accelerate attack detection using the Aho-Corasick algorithm. In the first part of this thesis, we investigate techniques to improve the area and performance efficiency of an IDS. Notable among our contributions is a pipelined architecture that accelerates accesses to the most frequently accessed node in the FSM. The second part of this thesis studies the resilience of an IDS to evasion attempts, in which an adversary saturates the performance of an IDS to disable it and thereby gain access to the network. We explore an evasion attempt that significantly degrades the performance of the Aho-Corasick algorithm used in an IDS. As a countermeasure, we propose a parallel architecture that improves the resilience of an IDS to an evasion attempt. The final part of this thesis explores techniques to exploit network traffic characteristics. In our study, we observe significant redundancy in the payload bytes, so we propose a mechanism to leverage this redundancy in the FSM traversal of the Aho-Corasick algorithm. We have also implemented our proposed redundancy-aware FSM traversal in Snort.

  • Mitosis Based Speculative Multithreaded Architectures  Open access

     Madriles Gimeno, Carlos
    Defense's date: 2012-07-23
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


    In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines, and these have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP), while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl's law. Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main directions have been followed in the research community to benefit from multi-core platforms: Speculative Multithreading (SpMT) and non-speculative clustered architectures. The former splits a sequential application into speculative threads, while the latter partitions the instructions among the cores based on data dependences but avoids a large degree of speculation. Despite the large amount of research on both approaches, the techniques proposed so far have shown marginal performance improvements. In this thesis we propose novel schemes to speed up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices). Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique fine-grain thread decomposition algorithm that adapts to the available parallelism in the application.
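
    A conceptual sketch of the p-slice idea, sequentialized for clarity (thread spawning, the slice contents and the validation policy here are illustrative assumptions, not the Mitosis design): at a spawn point, a pruned copy of the preceding code predicts the speculative thread's live-in values cheaply, and the prediction is validated when the threads join, squashing and re-executing on a mismatch.

        def full_computation(x):
            y = 3 * x + 1
            if x % 100 == 0:      # rarely executed path
                y += 7
            return y

        def p_slice(x):
            return 3 * x + 1      # pruned copy: the rare path was dropped

        def speculative_thread(y):
            return sum(range(y))  # body of the speculatively spawned thread

        def spawn_and_validate(x):
            y_pred = p_slice(x)                   # fast live-in prediction at spawn
            result = speculative_thread(y_pred)   # would run in parallel with...
            y_true = full_computation(x)          # ...the non-speculative thread
            if y_pred != y_true:                  # validation when threads join
                result = speculative_thread(y_true)   # squash and re-execute
            return result

        print(spawn_and_validate(7))     # prediction holds: speculation pays off
        print(spawn_and_validate(100))   # rare path taken: squash, re-execute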

  • HW/SW Mechanisms for Instruction Fusion, Issue and Commit in Modern u-Processors  Open access

     Deb, Abhishek
    Defense's date: 2012-05-03
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


    In this thesis we have explored the co-designed paradigm to show alternative processor design points. Specifically, we have provided HW/SW mechanisms for instruction fusion, issue and commit in modern processors. We have implemented a co-designed virtual machine monitor that binary-translates x86 instructions into RISC-like micro-ops. The translations are stored as superblocks, which are traces of basic blocks. These superblocks are further optimized using speculative and non-speculative optimizations, and hardware mechanisms exist to take corrective action in case of misspeculation. During the course of this PhD we have made the following contributions. First, we have provided a novel Programmable Functional Unit (PFU) to speed up general-purpose applications. The PFU consists of a grid of functional units, similar to the CCA, and a distributed internal register file. The inputs of the macro-op are brought from the physical register file to the internal register file using a set of moves and a set of loads. A macro-op fusion algorithm fuses micro-ops at runtime; the algorithm is based on a scheduling step that indicates whether the current fused instruction is beneficial or not. The micro-ops corresponding to a macro-op are stored as control signals in a configuration, and the macro-op carries a configuration ID that helps locate its configuration. A small configuration cache inside the PFU holds these configurations; on a miss in the configuration cache, configurations are loaded from the I-cache. Moreover, to support bulk commit of atomic superblocks that are larger than the ROB, we have proposed a speculative commit mechanism with a speculative commit register map table that holds the mappings of the speculatively committed instructions; when all the instructions of the superblock have committed, the speculative state is copied to the backend register rename table. Second, we proposed a co-designed in-order processor with two kinds of accelerators. These FU-based accelerators run a pair of fused instructions. We have considered two kinds of instruction fusion: first, we fused pairs of independent loads into vector loads and executed them on vector load units; second, we fused pairs of dependent simple ALU instructions and executed them on Interlock Collapsing ALUs (ICALUs). Moreover, we have evaluated the performance of various code optimizations such as list scheduling, load-store telescoping and load hoisting, among others, and compared our co-designed processor with small-instruction-window out-of-order processors. Third, we have proposed a co-designed out-of-order processor in which we reduce complexity in two areas. First, we have co-designed the commit mechanism to enable bulk commit of atomic superblocks. In this solution we get rid of the conventional ROB and instead introduce the Superblock Ordering Buffer (SOB), which ensures that program order is maintained at the granularity of the superblock by bulk-committing the program state. The program state consists of the register state and the memory state: the register state is held in a per-superblock register map table, whereas the memory state is held in a gated store buffer and updated in bulk. Furthermore, we have tackled the complexity of the out-of-order issue logic by using FIFOs. We have proposed an enhanced steering heuristic that fixes the inefficiencies of the existing dependence-based heuristic, together with a mechanism to release FIFO entries earlier, which further improves the performance of the steering heuristic.
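
    A minimal sketch of the dependent-pair fusion step (our own micro-op encoding and pairing heuristic; the thesis's fusion algorithm additionally weighs a scheduling benefit test): scan the micro-op stream and collapse a simple ALU op with an immediately following consumer, so the pair can issue to one interlock-collapsing ALU slot.

        # Micro-ops encoded as (op, dst, src1, src2); encoding is illustrative.
        SIMPLE_ALU = {"add", "sub", "and", "or", "xor"}

        def fuse_pairs(uops):
            """Fuse uop i+1 into uop i when both are simple ALU ops and
            i+1 consumes i's result; the pair targets one ICALU slot."""
            fused, i = [], 0
            while i < len(uops):
                a = uops[i]
                if (i + 1 < len(uops)
                        and a[0] in SIMPLE_ALU
                        and uops[i + 1][0] in SIMPLE_ALU
                        and a[1] in uops[i + 1][2:]):
                    fused.append(("icalu", a, uops[i + 1]))
                    i += 2
                else:
                    fused.append(a)
                    i += 1
            return fused

        print(fuse_pairs([("add", "r1", "r2", "r3"),
                          ("xor", "r4", "r1", "r5"),    # consumes r1: fused
                          ("sub", "r6", "r7", "r8")]))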

    In this thesis we have explored the co-designed virtual machine paradigm for modern processors. We have implemented a virtual machine that translates x86 binaries into RISC-like micro-ops. These translations are stored as superblocks, which are essentially traces of basic blocks; in particular, we have proposed HW/SW mechanisms for instruction fusion. These superblocks are optimized using both speculative and non-speculative optimizations, and for the speculative ones we consider mechanisms to handle misspeculation. This thesis makes the following contributions.

    First, we have proposed a new programmable functional unit (PFU) to improve the execution of general-purpose applications. The PFU consists of a set of functional units, similar to the CCA, with a register file internal to the PFU that is distributed among its functional units. The inputs of the macro-operation executed on the PFU are moved from the conventional physical register file to the internal one using a set of moves and loads. A fusion algorithm combines micro-operations at run time; it is based on a scheduling pass that measures the benefit of each fusion decision. The micro-operations corresponding to a macro-operation are stored as control signals in a configuration, and each macro-operation carries a configuration identifier that helps locate it. A small configuration cache inside the PFU stores these configurations; on a miss, configurations are loaded from the instruction cache. In addition, to support the atomic commit of superblocks that exceed the size of the ROB, a speculative commit mechanism has been proposed: a speculative register map table is copied onto the non-speculative table once all the instructions of the superblock have committed.

    Second, we have proposed a co-designed in-order processor that combines two types of accelerators, which execute pairs of fused instructions. Two types of instruction fusion have been considered: combining pairs of independent loads into vector loads executed on a vector unit, and fusing pairs of dependent simple ALU instructions to be executed on an Interlock Collapsing ALU (ICALU). We have evaluated these techniques together with several optimizations such as list scheduling, load-store telescoping and load hoisting, among others, and compared the proposal against an out-of-order processor.

    Third, we have proposed an efficient co-designed out-of-order processor that reduces complexity in two main areas. First, we have co-designed the commit mechanism to enable efficient atomic commit of superblocks: the conventional ROB is replaced by the Superblock Ordering Buffer (SOB), which maintains program order at superblock granularity. Program state consists of registers and memory; register state is kept in a per-superblock table, while memory state is buffered and updated atomically. The second main area of complexity reduction is the use of FIFOs in the issue logic, where we have proposed a steering heuristic that solves the inefficiencies of the previously proposed dependence-based heuristic and, together with the FIFOs, a mechanism to release FIFO entries early.
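
    The speculative-commit idea above can be pictured with a small sketch. The following is a minimal illustration, not the thesis implementation; the class name and map representation are assumptions made for this example. A speculative register map absorbs in-flight renames and is copied onto the architectural map only when the whole superblock commits.

        class RegMapCommit:
            """Illustrative speculative/architectural register map pair."""
            def __init__(self, nregs):
                self.arch_map = {r: r for r in range(nregs)}  # committed state
                self.spec_map = dict(self.arch_map)           # in-flight state

            def rename(self, arch_reg, phys_reg):
                # Instructions inside a superblock update only the
                # speculative map.
                self.spec_map[arch_reg] = phys_reg

            def commit_superblock(self):
                # Atomic commit: the speculative map becomes architectural
                # once every instruction of the superblock has committed.
                self.arch_map = dict(self.spec_map)

            def squash_superblock(self):
                # Misspeculation: discard in-flight renames cheaply.
                self.spec_map = dict(self.arch_map)

    On a misspeculation the speculative map is simply re-seeded from the architectural one, which is what keeps recovery cheap.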

  • Quantitative characterization of the software layer of a HW/SW co-designed processor

     Pavlou, Demos; Gibert Codina, Enric; Brankovic, Aleksandar; Kumar, Rakesh; Stavrou, Kyriakos; Gonzalez Colas, Antonio Maria
    Date: 2012-02-24
    Report

  • Detecting soft errors via selective re-execution

     Abella Ferrer, Jaume; Ergin, Oguz; Gonzalez Colas, Antonio Maria; Unsal, Osman Sabri; Vera Rivera, Francisco Javier
    Date of request: 2012-01-03
    Invention patent

  • Access of Register Files of Other Threads Using Synchronization

     Gibert Codina, Enric; Codina, Josep Maria; Latorre, Fernando; López, Pedro; Piñeiro Riobo, Jose Alejandro; Gonzalez Colas, Antonio Maria
    Date of request: 2012-09-04
    Invention patent

  • Register Checkpointing Mechanism for Multithreading

     Gonzalez Colas, Antonio Maria; Madriles Gimeno, Carles; Latorre, Fernando; Martinez Martinez, Raul; López, Pedro; Codina, Josep Maria; Gibert Codina, Enric; Martinez Vicente, Alejandro
    Date of request: 2012-11-21
    Invention patent

  • Disabling cache portions during low voltage operations

     Abella Ferrer, Jaume; Carretero Casado, Javier Sebastian; De, Vivek; Gonzalez Colas, Antonio Maria; Khellah, Muhammad; Chaparro, Pedro; Vera Rivera, Francisco Javier; Wilkerson, Chris; Zhang, Ming
    Date of request: 2012-10-16
    Invention patent

  • Achieving coherence between dynamically optimized code and original code

     Codina, Josep Maria; Latorre, Fernando; Magklis, Grigorios; Gibert Codina, Enric; Gonzalez Colas, Antonio Maria; Vera Rivera, Francisco Javier
    Date of request: 2012-05-29
    Invention patent

  • Disabling cache portions during low voltage operations - continuation of 8,103,830

     Wilkerson, Chris; Khellah, Muhammad; De, Vivek; Zhang, Ming; Abella Ferrer, Jaume; Carretero Casado, Javier Sebastian; Chaparro, Pedro; Gonzalez Colas, Antonio Maria; Vera Rivera, Francisco Javier
    Date of request: 2012-10-16
    Invention patent

  • Enabling speculative state information in a cache coherency protocol

     Madriles Gimeno, Carles; Marcuello, Pedro; Pérez García, Carlos; Sanchez, Jesus; Latorre Salinas, Fernando; Gonzalez Colas, Antonio Maria
    Date of request: 2012-05-22
    Invention patent

  • Dynamically Estimating Lifetime Of a Semiconductor Device

     Abella Ferrer, Jaume; Ergin, Oguz; Gonzalez Colas, Antonio Maria; Unsal, Osman Sabri; Vera Rivera, Francisco Javier
    Date of request: 2012-04-03
    Invention patent

  • Hardware/software-based diagnosis of load-store queues using expandable activity logs

     Vera Rivera, Francisco Javier; Carretero Casado, Javier Sebastian; Abella Ferrer, Jaume; Monchiero, Matteo; Ramirez Garcia, Tanausu; Gonzalez Colas, Antonio Maria
    International Symposium on High-Performance Computer Architecture (HPCA)
    Presentation's date: 2011-02-14
    Presentation of work at congresses

    The increasing device count and design complexity are posing significant challenges to post-silicon validation. Bug diagnosis is the most difficult step during post-silicon validation. Limited reproducibility and low testing speeds are common limitations in current testing techniques. Moreover, low observability hinders full-speed testing approaches. Modern solutions like on-chip trace buffers alleviate these issues, but are unable to store long activity traces. As a consequence, the cost of post-Si validation now represents a large fraction of the total design cost. This work describes a hybrid post-Si approach to validate a modern load-store queue. We use an effective error detection mechanism and an expandable logging mechanism to observe the microarchitectural activity for long periods of time, at full processor speed. Validation is performed by analyzing the logged activity by means of a diagnosis algorithm. Correct memory ordering is checked to determine the root cause of errors.
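
    A toy version of the diagnosis step may help make this concrete. The sketch below, under an assumed log format, replays a logged sequence of memory operations and flags loads that did not observe the latest older store to the same address; the paper's algorithm and log encoding are of course richer.

        def diagnose(log):
            """log: list of ('ST', addr, value) or ('LD', addr, observed)."""
            mem = {}       # last value stored per address
            errors = []
            for seq, (kind, addr, val) in enumerate(log):
                if kind == 'ST':
                    mem[addr] = val
                elif addr in mem and val != mem[addr]:
                    # Load did not see the latest older store: flag it.
                    errors.append((seq, addr, mem[addr], val))
            return errors

    For instance, diagnose([('ST', 0x10, 5), ('LD', 0x10, 7)]) would report one ordering violation at sequence number 1.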

  • A co-designed HW/SW approach to general purpose program acceleration using a programmable functional unit

     Deb, Abhishek; Codina Viñas, Josep M; Gonzalez Colas, Antonio Maria
    Workshop on Interaction between Compilers and Computer Architectures
    Presentation's date: 2011-02-12
    Presentation of work at congresses

    In this paper, we propose a novel programmable functional unit (PFU) to accelerate general-purpose application execution on a modern out-of-order x86 processor in a complexity-effective way. Code is transformed and instructions are generated that run on the PFU using a co-designed virtual machine (Cd-VM). Groups of frequently executed micro-operations (micro-ops) are identified and fused into a macro-op (MOP) by the Cd-VM. The MOPs are executed on the PFU. Results presented in this paper show that this HW/SW co-designed approach produces average speedups of 17% in SPECFP and 10% in SPECINT, and up to 33%, over a modern out-of-order processor. Moreover, we also show that the proposed scheme not only outperforms dynamic vectorization using SIMD accelerators but also outperforms an 8-wide-issue out-of-order processor.
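
    To give a flavor of the fusion step, the sketch below greedily pairs a producer micro-op with its first later consumer in a hot trace and emits a fused MOP. This is only an assumed illustration of the general technique; the Cd-VM's actual algorithm and MOP legality constraints are more involved.

        def fuse_trace(trace):
            """trace: list of (idx, dest, srcs) micro-ops in program order."""
            fused, used = [], set()
            for i, (idx, dest, srcs) in enumerate(trace):
                if idx in used:
                    continue
                for jdx, _, s2 in trace[i + 1:]:
                    if jdx not in used and dest in s2:
                        fused.append(('MOP', idx, jdx))  # fused pair
                        used.update((idx, jdx))
                        break
                else:
                    fused.append(('UOP', idx))           # left unfused
            return fused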

  • Fg-STP: fine-grain single thread partitioning on multicores

     Ranjan, Rakesh; Latorre, Fernando; Marcuello, Pedro; Gonzalez Colas, Antonio Maria
    International Symposium on High-Performance Computer Architecture (HPCA)
    Presentation's date: 2011-02-12
    Presentation of work at congresses

    Power and complexity issues have led the microprocessor industry to shift to Chip Multiprocessors in order to better utilize the additional transistors ensured by Moore's law. While parallel programs are able to take full advantage of these CMPs, single-threaded applications are not equipped to benefit from them. In this paper we propose Fine-Grain Single-Thread Partitioning (Fg-STP), a hardware-only scheme that takes advantage of CMP designs to speed up single-threaded applications. Our proposal improves single-thread performance by reconfiguring two cores to collaborate on the fetching and execution of the instructions. These cores are basically conventional out-of-order cores in which execution is orchestrated by dedicated hardware that has a minimal and localized impact on the original design of the cores. This approach partitions the code at instruction granularity and differs from previous proposals in its extensive use of dependence speculation, replication and communication. These features are combined with the ability to look for parallelism in large instruction windows without any software intervention (no re-compilation or profiling hints are needed). These characteristics allow Fg-STP to speed up single-threaded execution by 18% and 7% on average over similar hardware-only approaches like Core Fusion, on medium-sized and small-sized 2-core CMPs respectively, for SPEC 2006 benchmarks.
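
    Although Fg-STP itself is a hardware mechanism, the partitioning intuition can be sketched in software. The toy heuristic below, entirely an assumption made for illustration, places each instruction on the core that already holds most of its producers and balances load on ties, which keeps cross-core value communication low.

        def partition(instrs):
            """instrs: list of (dest, srcs); returns a core id (0/1) each."""
            owner, load, placement = {}, [0, 0], []
            for dest, srcs in instrs:
                votes = [0, 0]
                for s in srcs:
                    if s in owner:
                        votes[owner[s]] += 1
                if votes[0] != votes[1]:
                    core = 0 if votes[0] > votes[1] else 1  # follow producers
                else:
                    core = 0 if load[0] <= load[1] else 1   # balance on ties
                owner[dest] = core
                load[core] += 1
                placement.append(core)
            return placement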

  • Dynamic fine-grain body biasing of caches with latency and leakage 3T1D-based monitors

     Ganapathy, Shrikanth; Canal Corretger, Ramon; Gonzalez Colas, Antonio Maria; Rubio Sola, Jose Antonio
    IEEE International Conference on Computer Design: VLSI in Computers and Processors
    Presentation's date: 2011
    Presentation of work at congresses

  • Implementing a hybrid SRAM/eDRAM NUCA architecture  Open access

     Lira Rueda, Javier; Molina, Carlos; Brooks, David; Gonzalez Colas, Antonio Maria
    International Conference on High Performance Computing
    Presentation's date: 2011-12-18
    Presentation of work at congresses

    In this paper, we propose a hybrid cache architecture that exploits the main features of both memory technologies: the speed of SRAM and the high density of eDRAM. We demonstrate that, due to the high locality found in emerging applications, a high percentage of the data that enters the on-chip last-level cache is not accessed again before it is replaced.
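
    The reuse observation can be reproduced with a simple experiment. The sketch below, with illustrative parameters (not the paper's), simulates a small fully associative LRU cache and measures the fraction of lines evicted without ever being re-referenced after insertion; such zero-reuse lines are natural candidates for the dense eDRAM portion.

        from collections import OrderedDict

        def zero_reuse_fraction(addr_trace, capacity=1024):
            cache = OrderedDict()   # addr -> hits since insertion (LRU order)
            dead = evicted = 0
            for addr in addr_trace:
                if addr in cache:
                    cache.move_to_end(addr)   # refresh LRU position
                    cache[addr] += 1
                else:
                    if len(cache) >= capacity:
                        _, hits = cache.popitem(last=False)  # evict LRU line
                        evicted += 1
                        dead += (hits == 0)   # never touched after insertion
                    cache[addr] = 0
            return dead / evicted if evicted else 0.0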

  • DDGacc: boosting dynamic DDG-based binary optimizations through specialized hardware support

     Pavlou, Demos; Gibert Codina, Enric; Latorre, Fernando; Gonzalez Colas, Antonio Maria
    ACM SIGPLAN/SIGOPS Conference on Virtual Execution Environments
    Presentation's date: 2011
    Presentation of work at congresses

    Dynamic Binary Translators (DBT) and Dynamic Binary Optimization (DBO) by software are used widely for several reasons including performance, design simplification and virtualization. However, the software layer in such systems introduces non-negligible overheads which affect performance and user experience. Hence, reducing DBT/DBO overheads is of paramount importance. In addition, reduced overheads have interesting collateral effects in the rest of the software layer, such as allowing optimizations to be applied earlier. A cost-effective solution to this problem is to provide hardware support to speed up the primitives of the software layer, paying special attention to automating DBT/DBO mechanisms and leaving the heuristics to the software, which is more flexible. In this work, we have characterized the overheads of a DBO system using DynamoRIO implementing several basic optimizations. We have seen that the computation of the Data Dependence Graph (DDG) accounts for 5%-10% of the execution time. For this reason, we propose to add hardware support for this task in the form of a new functional unit, called DDGacc, which is integrated in a conventional pipeline processor and is operated through new ISA instructions. Our evaluation shows that DDGacc reduces the cost of computing the DDG by 32x, which reduces overall execution time by 5%-10% on average and up to 18% for applications where the DBO optimizes large code footprints.
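
    The primitive that DDGacc accelerates is easy to state in software terms: one linear pass over a code region that links each instruction to the latest producer of each of its source registers. The sketch below is a minimal software rendering under an assumed instruction representation, i.e. the kind of work the new ISA instructions would offload to hardware.

        def build_ddg(region):
            """region: list of (dest, srcs); returns RAW edges (prod, cons)."""
            last_writer, edges = {}, []
            for i, (dest, srcs) in enumerate(region):
                for s in srcs:
                    if s in last_writer:
                        edges.append((last_writer[s], i))  # RAW dependence
                last_writer[dest] = i
            return edges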

  • Beforehand migration on D-NUCA caches

     Lira Rueda, Javier; Jones, Timothy M.; Molina, Carlos; Gonzalez Colas, Antonio Maria
    International Conference on Parallel Architectures and Compilation Techniques
    Presentation's date: 2011-10-10
    Presentation of work at congresses

  • Moore's Law implications on energy reduction

     Gonzalez Colas, Antonio Maria
    International Conference on High Performance and Embedded Architectures and Compilers
    Presentation's date: 2011-06-24
    Presentation of work at congresses

  • Fast time-to-market with via-configurable transistor array regular fabric: A delay-locked loop design case study

     Pons Solé, Marc; Barajas Ojeda, Enrique; Mateo Peña, Diego Cesar; Moll Echeto, Francesc de Borja; Rubio Sola, Jose Antonio; Gonzalez Jimenez, Jose Luis; Abella Ferrer, Jaume; Vera Rivera, Francisco Javier; Gonzalez Colas, Antonio Maria
    IEEE International Conference on Design & Technology of Integrated Systems in Nanoscale Era
    Presentation's date: 2011-04-06
    Presentation of work at congresses

    Time-to-market is a critical issue for today's integrated circuit manufacturers. In this paper the Via-Configurable Transistor Array (VCTA) regular layout fabric, which aims to minimize time-to-market and its associated costs, is studied for a Delay-Locked Loop (DLL) design. The comparison with a full-custom design demonstrates that VCTA can be used without loss of functionality while accelerating the design time. Layout implementations in a 90 nm CMOS process, as well as delay, energy and jitter electrical simulations, are provided.

  • Thread shuffling: combining DVFS and thread migration to reduce energy consumptions for multi-core systems

     Cai, Qiong; González, José; Magklis, Grigorios; Chaparro, Pedro; Gonzalez Colas, Antonio Maria
    International Symposium on Low Power Electronics and Design
    Presentation's date: 2011-08-01
    Presentation of work at congresses
