Canal Corretger, Ramon
Total activity: 119
Expertise
DRAM, Memory Design, Microarchitecture, Nanotechnology Circuit Design, Near and Subthreshold Architectures, Non-Volatile Memories, Processor Design, SRAM, Variability
h index
14
Professional category
University lecturer
Doctoral courses
Doctor per la Universitat Politècnica de Catalunya
University degree
Enginyer en Informàtica
Research group
ARCO - Microarchitecture and Compilers
Department
Department of Computer Architecture
School
Barcelona School of Informatics (FIB)
E-mail
rcanalac.upc.edu
Contact details
UPC directory Open in new window
Links of interest
Personal webpage Open in new window

Graphic summary
  • Show / hide key
  • Information


Scientific and technological production
  •  

1 to 50 of 119 results
  • Thread row buffers: Improving memory performance isolation and throughput in multiprogrammed environments

     Herrero Abellanas, Enric; Gonzalez, Jose; Canal Corretger, Ramon; Tullsen, Dean
    IEEE transactions on computers
    Date of publication: 2013-09
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    The widespread adoption of chip multiprocessors in recent years has increased the number of applications simultaneously accessing DRAM memories. Therefore, memory access patterns have also changed and this has reduced row buffer locality significantly, degrading performance and energy efficiency. Furthermore, concurrent execution of applications also has shown the need of performance isolation among threads in the memory controller to enforce a quality of service in virtualized environments. Existing DRAM memories, however, enforce a tradeoff between throughput and isolation. To solve these problems, this paper proposes the addition of Thread Row Buffers (TRBs) to DRAM memories. TRBs keep an active row per thread, thereby increasing DRAM efficiency by avoiding alternate accesses to a limited number of rows and allowing the implementation of a memory scheduler not bound to the throughput-isolation tradeoff. Thread Row Buffers with Service Partitioning (TRB-SP) increase the row hit-rate by 38 percent with respect to FR-FCFS and by 11 percent with respect to Cache DRAM. This, in turn, increases overall performance by 17 and 7 percent, respectively. TRB-SP is also able to reduce the standard deviation of the memory access time of an application by 40 percent over FR-FCFS, 31 percent over PAR-BS, and 42 percent over Cache DRAM. 1968-2012 IEEE.

  • Effectiveness of hybrid recovery techniques on parametric failures

     Ganapathy, Shrikanth; Canal Corretger, Ramon; Gonzalez Colas, Antonio Maria; Rubio Sola, Jose Antonio
    International Symposium on Quality Electronic Design
    Presentation's date: 2013-03
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Modern day microprocessors effectively utilise supply voltage scaling for tremendous power reduction. The minimum voltage beyond which a processor cannot operate reliably is defined as V ddmin. On-chip memories like caches are the most susceptible to voltage-noise induced failures because of process variations and reduced noise-margins thereby arbitrating whole processor's V ddmin. In this paper, we evaluate the effectiveness of a new class of hybrid techniques in improving cache yield through failure prevention and correction. Proactive read/write assist techniques like body-biasing (BB) and wordline boosting (WLB) when combined with reactive techniques like ECC and redundancy are shown to offer better quality-energy-area trade offs when compared to their standalone configurations. Proactive techniques can help lower V ddmin (improving functional margin) for significant power savings and reactive techniques ensure that the resulting large number of failures are corrected (improving functional yield). Our results in 22nm technology indicate that at scaled supply voltages, hybrid techniques can improve parametric yield by atleast 28% when considering worst-case process variations

  • An energy-efficient and scalable eDRAM-based register file architecture for GPGPU

     Jing, Naifeng; Shen, Yao; Lu, Yao; Ganapathy, Shrikanth; Mao, Zhigang; Guo, Minyi; Canal Corretger, Ramon; Liang, Xiaoyao
    Annual International Symposium on Computer Architecture
    Presentation's date: 2013-06
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    The heavily-threaded data processing demands of streaming multiprocessors (SM) in a GPGPU require a large register file (RF). The fast increasing size of the RF makes the area cost and power consumption unaffordable for traditional SRAM designs in the future technologies. In this paper, we propose to use embedded-DRAM (eDRAM) as an alternative in future GPGPUs. Compared with SRAM, eDRAM provides higher density and lower leakage power. However, the limited data retention time in eDRAM poses new challenges. Periodic refresh operations are needed to maintain data integrity. This is exacerbated with the scaling of eDRAM density, process variations and temperature. Unlike conventional CPUs which make use of multi-ported RF, most of the RFs in modern GPGPU are heavily banked but not multi-ported to reduce the hardware cost. This provides a unique opportunity to hide the refresh overhead. We propose two different eDRAM implementations based on 3T1D and 1T1C memory cells. To mitigate the impact of periodic refresh, we propose two novel refresh solutions using bank bubble and bank walk-through. Plus, for the 1T1C RF, we design an interleaved bank organization together with an intelligent warp scheduling strategy to reduce the impact of the destructive reads. The analysis shows that our schemes present better energy efficiency, scalability and variation tolerance than traditional SRAM-based designs.

  • Variability robustness enhancement for 7nm FinFET 3T1D-DRAM cells

     Amat Bertran, Esteve; Garcia Almudever, Carmen; Aymerich Capdevila, Nivard; Rubio Sola, Jose Antonio; Canal Corretger, Ramon
    IEEE International Midwest Symposium on Circuits and Systems
    Presentation's date: 2013-08-05
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    3T1D-DRAM cells will still be operative with 7nm FinFETs but their performance is significantly degraded when factoring in variability. In order to improve the cell robustness against device process variation and high environment temperatures, we propose a Dual-VT strategy. Our results show a larger retention time, significant cell spread reduction and reliable behavior up to 100°C.

  • Combining RAM technologies for hard-error recovery in L1 data caches working at very-low power modes

     Lorente, Vicente; Valero, Alejandro; Sahuquillo, Julio; Petit, Salvador; Canal Corretger, Ramon; López, Pedro; Duato, José
    Design, Automation and Test in Europe
    Presentation's date: 2013-03-20
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Low-power modes in modern microprocessors rely on low frequencies and low voltages to reduce the energy budget. Nevertheless, manufacturing induced parameter variations can make SRAM cells unreliable producing hard errors at supply voltages below Vccmin.

    Low-power modes in modern microprocessors rely on low frequencies and low voltages to reduce the energy budget. Nevertheless, manufacturing induced parameter variations can make SRAM cells unreliable producing hard errors at supply voltages below Vccmin. Recent proposals provide a rather low fault-coverage due to the fault coverage/overhead trade-off. We propose a new fault- tolerant L1 cache, which combines SRAM and eDRAM cells in L1 data caches to provide 100% SRAM hard-error fault coverage. Results show that, compared to a conventional cache and assuming 50% failure probability at low-power mode, leakage and dynamic energy savings are by 85% and 62%, respectively, with a minimal impact on performance.

  • Impact of finfet and III-V/Ge technology on logic and memory cell behavior

     Amat, Esteve; Calomarde Palomino, Antonio; Garcia Almudever, Carmen; Aymerich Capdevila, Nivard; Canal Corretger, Ramon; Rubio Sola, Jose Antonio
    IEEE transactions on device and materials reliability
    Date of publication: 2013-11-20
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    In this work, we assess the performance of a ring oscillator and a DRAM cell when they are implemented with different technologies (planar CMOS, FinFET and III-V MOSFETs), and subjected to different reliability scenarios (variability and soft errors). FinFET-based circuits show the highest robustness against variability and soft error environments.

  • Impact of FinFET technology introduction in the 3T1D-DRAM memory cell

     Amat Bertran, Esteve; Garcia Almudever, Carmen; Aymerich Capdevila, Nivard; Canal Corretger, Ramon; Rubio Sola, Jose Antonio
    IEEE transactions on device and materials reliability
    Date of publication: 2013-01-09
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    n this paper, the 3T1D-DRAM cell based on FinFET devices is studied as an alternative to the bulk one. We observe an improvement in its behavior when IG and SG FinFETs are properly mixed, since together they provide a relevant increase in the memory circuit retention time. Moreover, our FinFET cell shows larger variability robustness, better performance at low supply voltage, and higher tolerance to elevated temperatures.

  • Improving multithreading performance for clustered VLIW architectures.  Open access

     Gupta, Manoj
    Defense's date: 2013-06-14
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Very Long Instruction Word (VLIW) processors are very popular in embedded and mobile computing domain. Use of VLIW processors range from Digital Signal Processors (DSPs) found in a plethora of communication and multimedia devices to Graphics Processing Units (GPUs) used in gaming and high performance computing devices. The advantage of VLIWs is their low complexity and low power design which enable high performance at a low cost. Scalability of VLIWs is limited by the scalability of register file ports. It is not viable to have a VLIW processor with a single large register file because of area and power consumption implications of the register file. Clustered VLIW solve the register file scalability issue by partitioning the register file into multiple clusters and a set of functional units that are attached to register file of that cluster. Using a clustered approach, higher issue width can be achieved while keeping the cost of register file within reasonable limits. Several commercial VLIW processors have been designed using the clustered VLIW model. VLIW processors can be used to run a larger set of applications. Many of these applications have a good Lnstruction Level Parallelism (ILP) which can be efficiently utilized. However, several applications, specially the ones that are control code dominated do not exibit good ILP and the processor is underutilized. Cache misses is another major source of resource underutiliztion. Multithreading is a popular technique to improve processor utilization. Interleaved MultiThreading (IMT) hides cache miss latencies by scheduling a different thread each cycle but cannot hide unused instructions slots. Simultaneous MultiThread (SMT) can also remove ILP under-utilization by issuing multiple threads to fill the empty instruction slots. However, SMT has a higher implementation cost than IMT. The thesis presents Cluster-level Simultaneous MultiThreading (CSMT) that supports a limited form of SMT where VLIW instructions from different threads are merged at a cluster-level granularity. This lowers the hardware implementation cost to a level comparable to the cheap IMT technique. The more complex SMT combines VLIW instructions at the individual operation-level granularity which is quite expensive especially in for a mobile solution. We refer to SMT at operation-level as OpSMT to reduce ambiguity. While previous studies restricted OpSMT on a VLIW to 2 threads, CSMT has a better scalability and upto 8 threads can be supported at a reasonable cost. The thesis proposes several other techniques to further improve CSMT performance. In particular, Cluster renaming remaps the clusters used by instructions of different threads to reduce resource conflicts. Cluster renaming is quite effective in reducing the issue-slots under-utilization and significantly improves CSMT performance.The thesis also proposes: a hybrid between IMT and CSMT which increases the number of supported threads, heterogeneous instruction merging where some instructions are combined using SMT and CSMT rest, and finally, split-issue, a technique that allows to launch partially an instruction making it easier to be combined with others.

  • Variability robustness enhancement for 7nm FinFET 3T1D-DRAM cells

     Amat Bertran, Esteve; Garcia Almudever, Carmen; Aymerich Capdevila, Nivard; Rubio Sola, Jose Antonio; Canal Corretger, Ramon
    IEEE International Midwest Symposium on Circuits and Systems
    Presentation's date: 2013-08
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    3T1D-DRAM cells will still be operative with 7nm FinFETs but their performance is significantly degraded when factoring in variability. In order to improve the cell robustness against device process variation and high environment temperatures, we propose a Dual-VT strategy. Our results show a larger retention time, significant cell spread reduction and reliable behavior up to 100°C.

  • Comparison of SRAM cells for 10-nm SOI FinFETs under process and environmental variations

     Jaksic, Zoran; Canal Corretger, Ramon
    IEEE transactions on electron devices
    Date of publication: 2012-12
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    We explore the 6T and 8T SRAM design spaces through read static noise margin (RSNM), word-line write margin, and leakage for future 10-nm FinFETs. Process variations are based on the ITRS and modeled at device (TCAD) level. We propose a method to incorporate them into a BSIM-CMG model card for time-efficient simulation. We analyze cells with different fin numbers, supply voltages, and temperatures. Results show a 1.8× improvement of RSNM for 8T SRAM cells, the need for stronger pull-downs to secure read stability in 6Ts, and high leakage sensitivity to temperature (10× between 40°C and 100°C). As a specific example, we show how the RSNM of a 6T SRAM cell can be improved by using back-gate biasing techniques for independent-gate FinFETs. We show how WLMN is increased by reducing the strength of pull-up transistors when reverse back-gate biasing is applied on it and how the RSNM can be increased by reducing the strength of access transistor by reverse back-gate biasing of pass-gate transistors. When combining these two techniques, RSNM can be improved up to 25% without compromising cell write ability for any sample. In general, when compared to previous technologies, read stability is untouched, writeability is reduced, and leakage keeps stable.

    We explore the 6T and 8T SRAM design spaces through read static noise margin (RSNM), word-line write margin, and leakage for future 10-nm FinFETs. Process variations are based on the ITRS and modeled at device (TCAD) level. We propose a method to incorporate them into a BSIM-CMG model card for time-efficient simulation. We analyze cells with different fin numbers, supply voltages, and temperatures. Results show a 1.8× improvement of RSNM for 8T SRAM cells, the need for stronger pull-downs to secure read stability in 6Ts, and high leakage sensitivity to temperature (10× between 40°C and 100°C). As a specific example, we show how the RSNM of a 6T SRAM cell can be improved by using back-gate biasing techniques for independent-gate FinFETs. We show how WLMN is increased by reducing the strength of pull-up transistors when reverse back-gate biasing is applied on it and how the RSNM can be increased by reducing the strength of access transistor by reverse back-gate biasing of pass-gate transistors. When combining these two techniques, RSNM can be improved up to 25% without compromising cell write ability for any sample. In general, when compared to previous technologies, read stability is untouched, writeability is reduced, and leakage keeps stable.

  • Analysis of FinFET technology on memories

     Amat, E.; ASenov, Asen; Canal Corretger, Ramon; Cheng, B.; Cruz Diaz, Josep-llorenç; Jaksic, Zoran; Miranda, Miguel; Rubio Sola, Jose Antonio; Zuber, Paul
    IEEE International On-Line Testing Symposium
    Presentation's date: 2012-06-29
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Access to the full text
    Mitigation strategies of the variability in 3T1D cell memories scaled beyond 22nm  Open access  awarded activity

     Amat Bertran, Esteve; Garcia Almudever, Carmen; Aymerich Capdevila, Nivard; Canal Corretger, Ramon; Rubio Sola, Jose Antonio
    Conference on Design of Circuits and Integrated Systems
    Presentation's date: 2012
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    3T1D cell has been stated as a valid alternative to be implemented on L1 memory cache to substitute 6T, highly affected by device variability. In this contribution, we have shown that 22nm 3T1D memory cells present significant tolerance to high levels of device parameter fluctuation. Moreover, we have observed that the variability of the write access transistor has turn into the more detrimental device for the 3T1D cell performance. Furthermore, resizing and temperature control have been presented as some strategies to mitigate the cell variability.

    3T1D cell has been stated as a valid alternative to be implemented on L1 memory cache to substitute 6T, highly affected by device variability. In this contribution, we have shown that 22nm 3T1D memory cells present significant tolerance to high levels of device parameter fluctuation. Moreover, we have observed that the variability of the write access transistor has turn into the more detrimental device for the 3T1D cell performance. Furthermore, resizing and temperature control have been presented as some strategies to mitigate the cell variability.

  • Strain relevance on the improvement of the 3T1D cell performance

     Amat Bertran, Esteve; Garcia Almudever, Carmen; Aymerich Capdevila, Nivard; Canal Corretger, Ramon; Rubio Sola, Jose Antonio
    International Conference Mixed Design of Integrated Circuits and Systems
    Presentation's date: 2012-05-26
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • A Novel variation-tolerant 4T-DRAM with enhance soft-error tolerance

     Ganapathy, Shrikanth; Canal Corretger, Ramon; Alexandrescu, Dan; Costenaro, Enrico; Gonzalez Colas, Antonio Maria; Rubio Sola, Jose Antonio
    IEEE International Conference on Computer Design: VLSI in Computers and Processors
    Presentation's date: 2012-09-30
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    In view of device scaling issues, embedded DRAM (eDRAM) technology is being considered as a strong alternative to conventional SRAM for use in on-chip memories. Memory cells designed using eDRAM technology in addition to being logic-compatible, are variation tolerant and immune to noise present at low supply voltages. However, two major causes of concern are the data retention capability which is worsened by parameter variations leading to frequent data refreshes (resulting in large dynamic power overhead) and the transient reduction of stored charge increasing soft-error (SE) susceptibility. In this paper, we present a novel variation-tolerant 4T-DRAM cell whose power consumption is 20.4% lower when compared to a similar sized eDRAM cell. The retention time on-average is improved by 2.04X while incurring a delay overhead of 3% on the read-access time. Most importantly, using a soft-error (SE) rate analysis tool, we have confirmed that the cell sensitivity to SEs is reduced by 56% on-average in a natural working environment.

  • Enhancing 3T DRAMs for SRAM replacement under 10nm tri-gate SOI FinFETs

     Jaksic, Zoran; Canal Corretger, Ramon
    IEEE International Conference on Computer Design: VLSI in Computers and Processors
    Presentation's date: 2012-10-02
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    In this paper, we present the dynamic 3T memory cell for future 10 nm tri-gate FinFETs as a potential replacement for classical 6T SRAM cell for implementation in high speed cache memories. We investigate read access time, retention time, and static power consumption of the cell when it is exposed to the effects of process and environmental variations. Process variations are extracted from the ITRS predictions and they are modeled at device level. For simulation, we use 10 nm SOI tri-gate FinFET BSIM-CMG model card developed by the University of Glasgow, Device Modeling Group. When compared to the classical 6T SRAM, 3T cell has 40% smaller area, leakage is reduced up to 14 times while access time is approximately the same. In order to achieve higher retention times,we propose several cell extensions which, at the same time, enable post-fabrication/run-time adaptability.

    In this paper, we pr esent the dynamic 3T memory cell for future 10nm tri-gate FinFETs as a potential replacement for classical 6T SRAM cell for implementation in high speed cache memories. We investigate read access time, retention time, and static power consumption of the cell when it is exposed to the effects of process and environmental variations. Process variations are extracted from the ITRS predictions and they are modeled at device level. For simulation, we use 10nm SOI tri-gate FinFET BSIM-CMG model card developed by the University of Glasgow, Device Modeling Group. When compared to the classical 6T SRAM, 3T cell has 40% smaller area, leakage is reduced up to 14 times while access time is approximately the same. In order to achieve higher retention times, we propose several cell extensions which, at the same time, enable post- fabrication/run-time adaptability.

  • A novel variation-tolerant 4T-DRAM cell with enhanced soft-error tolerance

     Ganapathy, Shrikanth; Canal Corretger, Ramon; Alexandrescu, Dan; Costenaro, Eric; Gonzalez Colas, Antonio Maria; Rubio Sola, Jose Antonio
    IEEE International Conference on Computer Design: VLSI in Computers and Processors
    Presentation's date: 2012-10-02
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    In view of device scaling issues, embedded DRAM (eDRAM) technology is being considered as a strong alternative to conventional SRAM for use in on-chip memories. Memory cells designed using eDRAM technology in addition to being logic-compatible, are variation tolerant and immune to noise present at low supply voltages. However, two major causes of concern are the data retention capability which is worsened by parameter variations leading to frequent data refreshes (resulting in large dynamic power overhead) and the transient reduction of stored charge increasing soft-error (SE) susceptibility. In this paper, we present a novel variation-tolerant 4T-DRAM cell whose power consumption is 20.4% lower when compared to a similar sized eDRAM cell. The retention time on-average is improved by 2.04X while incurring a delay overhead of 3% on the read-access time. Most importantly, using a soft-error (SE) rate analysis tool, we have confirmed that the cell sensitivity to SEs is reduced by 56% on-average in a natural working environment.

    In view of device scaling issues, embedded DRAM (eDRAM) technology is being considered as a strong alternative to conventional SRAM for use in on-chip memories. Memory cells designed using eDRAM technology in addition to being logic-compatible, are variation tolerant and immune to noise present at low supply voltages. However, two major causes of concern are the data retention capability which is worsened by parameter variations leading to frequent data refreshes (resulting in large dynamic power overhead) and the transient reduction of stored charge increasing soft-error (SE) susceptibility. In this paper, we present a novel variation-tolerant 4T-DRAM cell whose power consumption is 20.4% lower when compared to a similar sized eDRAM cell. The retention time on-average is improved by 2.04X while incurring a delay overhead of 3% on the read-access time. Most importantly, using a soft-error (SE) rate analysis tool, we have confirmed that the cell sensitivity to SEs is reduced by 56% on-average in a natural working environment

  • Impact of bulk/SOI 10nm FinFETs on 3T1D-DRAM cell performance

     Amat Bertran, Esteve; Garcia Almudever, Carmen; Aymerich Capdevila, Nivard; Canal Corretger, Ramon; Rubio Sola, Jose Antonio
    International Conference on Solid-State and Integrated Circuit Technology
    Presentation's date: 2012-10
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    While the feasibility of SOI or bulk substrates for 10nm FinFETs has been shown, their impact on 3T1D memory performance has not been studied yet. In our study, bulk-based FinFETs show a better behavior for golden devices. Nevertheless, when variation is factored in, SOI-based FinFETs present better tolerance and, consequently, lower performance spread than bulk-based devices. When considering environment temperature it is always a detrimental factor for both multi-gate devices, but the impact is lower for the bulk ones.

  • Enhancing 6T SRAM cell stabilitty by back gate biasing techniques for 10nm SOI FinFETs under process and environmental variations

     Jaksic, Zoran; Canal Corretger, Ramon
    International Conference Mixed Design of Integrated Circuits and Systems
    Presentation's date: 2012-05-26
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Distributed cooperative caching: an energy efficient memory scheme for chip multiprocessors

     Herrero Abellanas, Enric; González, José; Canal Corretger, Ramon
    IEEE transactions on parallel and distributed systems
    Date of publication: 2012-05
    Journal article

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Impact of positive bias temperature instability (PBTI) on 3T1D-DRAM cells

     Aymerich Capdevila, Nivard; Ganapathy, Shrikanth; Rubio Sola, Jose Antonio; Canal Corretger, Ramon; Gonzalez Colas, Antonio Maria
    Integration. The VLSI journal
    Date of publication: 2012-06
    Journal article

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Variability mitigation mechanisms in scaled 3T1D-DRAM memories to 22 nm and beyond

     Amat Bertran, Esteve; Garcia Almudever, Carmen; Aymerich Capdevila, Nivard; Canal Corretger, Ramon; Rubio Sola, Jose Antonio
    IEEE transactions on device and materials reliability
    Date of publication: 2012-09-06
    Journal article

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Code Optimizations for Narrow Bitwidth Architectures  Open access

     Bhagat, Indu
    Defense's date: 2012-02-23
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    This thesis takes a HW/SW collaborative approach to tackle the problem of computational inefficiency in a holistic manner. The hardware is redesigned by restraining the datapath to merely 16-bit datawidth (integer datapath only) to provide an extremely simple, low-cost, low-complexity execution core which is best at executing the most common case efficiently. This redesign, referred to as the Narrow Bitwidth Architecture, is unique in that although the datapath is squeezed to 16-bits, it continues to offer the advantage of higher memory addressability like the contemporary wider datapath architectures. Its interface to the outside (software) world is termed as the Narrow ISA. The software is responsible for efficiently mapping the current stack of 64-bit applications onto the 16-bit hardware. However, this HW/SW approach introduces a non-negligible penalty both in dynamic code-size and performance-impact even with a reasonably smart code-translator that maps the 64- bit applications on to the 16-bit processor. The goal of this thesis is to design a software layer that harnesses the power of compiler optimizations to assuage this negative performance penalty of the Narrow ISA. More specifically, this thesis focuses on compiler optimizations targeting the problem of how to compile a 64-bit program to a 16-bit datapath machine from the perspective of Minimum Required Computations (MRC). Given a program, the notion of MRC aims to infer how much computation is really required to generate the same (correct) output as the original program. Approaching perfect MRC is an intrinsically ambitious goal and it requires oracle predictions of program behavior. Towards this end, the thesis proposes three heuristic-based optimizations to closely infer the MRC. The perspective of MRC unfolds into a definition of productiveness - if a computation does not alter the storage location, it is non-productive and hence, not necessary to be performed. In this research, the definition of productiveness has been applied to different granularities of the data-flow as well as control-flow of the programs. Three profile-based, code optimization techniques have been proposed : 1. Global Productiveness Propagation (GPP) which applies the concept of productiveness at the granularity of a function. 2. Local Productiveness Pruning (LPP) applies the same concept but at a much finer granularity of a single instruction. 3. Minimal Branch Computation (MBC) is an profile-based, code-reordering optimization technique which applies the principles of MRC for conditional branches. The primary aim of all these techniques is to reduce the dynamic code footprint of the Narrow ISA. The first two optimizations (GPP and LPP) perform the task of speculatively pruning the non-productive (useless) computations using profiles. Further, these two optimization techniques perform backward traversal of the optimization regions to embed checks into the nonspeculative slices, hence, making them self-sufficient to detect mis-speculation dynamically. The MBC optimization is a use case of a broader concept of a lazy computation model. The idea behind MBC is to reorder the backslices containing narrow computations such that the minimal necessary computations to generate the same (correct) output are performed in the most-frequent case; the rest of the computations are performed only when necessary. With the proposed optimizations, it can be concluded that there do exist ways to smartly compile a 64-bit application to a 16- bit ISA such that the overheads are considerably reduced.

    Esta tesis deriva su motivación en la inherente ineficiencia computacional de los procesadores actuales: a pesar de que muchas aplicaciones contemporáneas tienen unos requisitos de ancho de bits estrechos (aplicaciones de enteros, de red y multimedia), el hardware acaba utilizando el camino de datos completo, utilizando más recursos de los necesarios y consumiendo más energía. Esta tesis utiliza una aproximación HW/SW para atacar, de forma íntegra, el problema de la ineficiencia computacional. El hardware se ha rediseñado para restringir el ancho de bits del camino de datos a sólo 16 bits (únicamente el de enteros) y ofrecer así un núcleo de ejecución simple, de bajo consumo y baja complejidad, el cual está diseñado para ejecutar de forma eficiente el caso común. El rediseño, llamado en esta tesis Arquitectura de Ancho de Bits Estrecho (narrow bitwidth en inglés), es único en el sentido que aunque el camino de datos se ha estrechado a 16 bits, el sistema continúa ofreciendo las ventajas de direccionar grandes cantidades de memoria tal como procesadores con caminos de datos más anchos (64 bits actualmente). Su interface con el mundo exterior se denomina ISA estrecho. En nuestra propuesta el software es responsable de mapear eficientemente la actual pila software de las aplicaciones de 64 bits en el hardware de 16 bits. Sin embargo, esta aproximación HW/SW introduce penalizaciones no despreciables tanto en el tamaño del código dinámico como en el rendimiento, incluso con un traductor de código inteligente que mapea las aplicaciones de 64 bits en el procesador de 16 bits. El objetivo de esta tesis es el de diseñar una capa software que aproveche la capacidad de las optimizaciones para reducir el efecto negativo en el rendimiento del ISA estrecho. Concretamente, esta tesis se centra en optimizaciones que tratan el problema de como compilar programas de 64 bits para una máquina de 16 bits desde la perspectiva de las Mínimas Computaciones Requeridas (MRC en inglés). Dado un programa, la noción de MRC intenta deducir la cantidad de cómputo que realmente se necesita para generar la misma (correcta) salida que el programa original. Aproximarse al MRC perfecto es una meta intrínsecamente ambiciosa y que requiere predicciones perfectas de comportamiento del programa. Con este fin, la tesis propone tres heurísticas basadas en optimizaciones que tratan de inferir el MRC. La utilización de MRC se desarrolla en la definición de productividad: si un cálculo no altera el dato que ya había almacenado, entonces no es productivo y por lo tanto, no es necesario llevarlo a cabo. Se han propuesto tres optimizaciones del código basadas en profile: 1. Propagación Global de la Productividad (GPP en inglés) aplica el concepto de productividad a la granularidad de función. 2. Poda Local de Productividad (LPP en inglés) aplica el mismo concepto pero a una granularidad mucho más fina, la de una única instrucción. 3. Computación Mínima del Salto (MBC en inglés) es una técnica de reordenación de código que aplica los principios de MRC a los saltos condicionales. El objetivo principal de todas esta técnicas es el de reducir el tamaño dinámico del código estrecho. Las primeras dos optimizaciones (GPP y LPP) realizan la tarea de podar especulativamente las computaciones no productivas (innecesarias) utilizando profiles. Además, estas dos optimizaciones realizan un recorrido hacia atrás de las regiones a optimizar para añadir chequeos en el código no especulativo, haciendo de esta forma la técnica autosuficiente para detectar, dinámicamente, los casos de fallo en la especulación. La idea de la optimización MBC es reordenar las instrucciones que generan el salto condicional tal que las mínimas computaciones que general la misma (correcta) salida se ejecuten en la mayoría de los casos; el resto de las computaciones se ejecutarán sólo cuando sea necesario.

  • HW/SW Mechanisms for Instruction Fusion, Issue and Commit in Modern u-Processors  Open access

     Deb, Abhishek
    Defense's date: 2012-05-03
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    In this thesis we have explored the co-designed paradigm to show alternative processor design points. Specifically, we have provided HW/SW mechanisms for instruction fusion, issue and commit for modern processors. We have implemented a co-designed virtual machine monitor that binary translates x86 instructions into RISC like micro-ops. Moreover, the translations are stored as superblocks, which are a trace of basic blocks. These superblocks are further optimized using speculative and non-speculative optimizations. Hardware mechanisms exists in-order to take corrective action in case of misspeculations. During the course of this PhD we have made following contributions. Firstly, we have provided a novel Programmable Functional unit, in-order to speed up general-purpose applications. The PFU consists of a grid of functional units, similar to CCA, and a distributed internal register file. The inputs of the macro-op are brought from the Physical Register File to the internal register file using a set of moves and a set of loads. A macro-op fusion algorithm fuses micro-ops at runtime. The fusion algorithm is based on a scheduling step that indicates whether the current fused instruction is beneficial or not. The micro-ops corresponding to the macro-ops are stored as control signals in a configuration. The macro-op consists of a configuration ID which helps in locating the configurations. A small configuration cache is present inside the Programmable Functional unit, that holds these configurations. In case of a miss in the configuration cache configurations are loaded from I-Cache. Moreover, in-order to support bulk commit of atomic superblocks that are larger than the ROB we have proposed a speculative commit mechanism. For this we have proposed a Speculative commit register map table that holds the mappings of the speculatively committed instructions. When all the instructions of the superblock have committed the speculative state is copied to Backend Register Rename Table. Secondly, we proposed a co-designed in-order processor with with two kinds of accelerators. These FU based accelerators run a pair of fused instructions. We have considered two kinds of instruction fusion. First, we fused a pair of independent loads together into vector loads and execute them on vector load units. For the second kind of instruction fusion we have fused a pair of dependent simple ALU instructions and execute them in Interlock Collapsing ALUs (ICALU). Moreover, we have evaluated performance of various code optimizations such as list-scheduling, load-store telescoping and load hoisting among others. We have compared our co-designed processor with small instruction window out-of-order processors. Thirdly, we have proposed a co-designed out-of-order processor. Specifically we have reduced complexity in two areas. First of all, we have co-designed the commit mechanism, that enable bulk commit of atomic superblocks. In this solution we got rid of the conventional ROB, instead we introduce the Superblock Ordering Buffer (SOB). SOB ensures program order is maintained at the granularity of the superblock, by bulk committing the program state. The program state consists of the register state and the memory state. The register state is held in a per superblock register map table, whereas the memory state is held in gated store buffer and updated in bulk. Furthermore, we have tackled the complexity of Out-of-Order issue logic by using FIFOs. We have proposed an enhanced steering heuristic that fixes the inefficiencies of the existing dependence-based heuristic. Moreover, a mechanism to release the FIFO entries earlier is also proposed that further improves the performance of the steering heuristic.

    En aquesta tesis hem explorat el paradigma de les màquines issue i commit per processadors actuals. Hem implementat una màquina virtual que tradueix binaris x86 a micro-ops de tipus RISC. Aquestes traduccions es guarden com a superblocks, que en realitat no és més que una traça de virtuals co-dissenyades. En particular, hem proposat mecanismes hw/sw per a la fusió d’instruccions, blocs bàsics. Aquests superblocks s’optimitzen utilitzant optimizacions especualtives i d’altres no speculatives. En cas de les optimizations especulatives es consideren mecanismes per a la gestió de errades en l’especulació. Al llarg d’aquesta tesis s’han fet les següents contribucions: Primer, hem proposat una nova unitat functional programmable (PFU) per tal de millorar l’execució d’aplicacions de proposit general. La PFU està formada per un conjunt d’unitats funcionals, similar al CCA, amb un banc de registres intern a la PFU distribuït a les unitats funcionals que la composen. Les entrades de la macro-operació que s’executa en la PFU es mouen del banc de registres físic convencional al intern fent servir un conjunt de moves i loads. Un algorisme de fusió combina més micro-operacions en temps d’execució. Aquest algorisme es basa en un pas de planificació que mesura el benefici de les decisions de fusió. Les micro operacions corresponents a la macro operació s’emmagatzemen com a senyals de control en una configuració. Les macro-operacions tenen associat un identificador de configuració que ajuda a localitzar d’aquestes. Una petita cache de configuracions està present dintre de la PFU per tal de guardar-les. En cas de que la configuració no estigui a la cache, les configuracions es carreguen de la cache d’instruccions. Per altre banda, per tal de donar support al commit atòmic dels superblocks que sobrepassen el tamany del ROB s’ha proposat un mecanisme de commit especulatiu. Per aquest mecanisme hem proposat una taula de mapeig especulativa dels registres, que es copia a la taula no especulativa quan totes les instruccions del superblock han comitejat. Segon, hem proposat un processador en order co-dissenyat que combina dos tipus d’acceleradors. Aquests acceleradors executen un parell d’instruccions fusionades. S’han considerat dos tipus de fusió d’instructions. Primer, combinem un parell de loads independents formant loads vectorials i els executem en una unitat vectorial. Segon, fusionem parells d’instruccions simples d’alu que són dependents i que s’executaran en una Interlock Collapsing ALU (ICALU). Per altra aquestes tecniques les hem evaluat conjuntament amb diverses optimizacions com list scheduling, load-store telescoping i hoisting de loads, entre d’altres. Aquesta proposta ha estat comparada amb un processador fora d’ordre. Tercer, hem proposat un processador fora d’ordre co-dissenyat efficient reduint-ne la complexitat en dos areas principals. En primer lloc, hem co-disenyat el mecanisme de commit per tal de permetre un eficient commit atòmic del superblocks. En aquesta solució hem substituït el ROB convencional, i en lloc hem introduït el Superblock Ordering Buffer (SOB). El SOB manté l’odre de programa a granularitat de superblock. L’estat del programa consisteix en registres i memòria. L’estat dels registres es manté en una taula per superblock, mentre que l’estat de memòria es guarda en un buffer i s’actulitza atòmicament. La segona gran area de reducció de complexitat considerarada és l’ús de FIFOs a la lògica d’issue. En aquest últim àmbit hem proposat una heurística de distribució que solventa les ineficiències de l’heurística basada en dependències anteriorment proposada. Finalment, i junt amb les FIFOs, s’ha proposat un mecanisme per alliberar les entrades de la FIFO anticipadament.

  • Process variability in sub-16nm bulk CMOS technology

     Rubio Sola, Jose Antonio; Figueras Pamies, Juan; Vatajelu, Elena Ioana; Canal Corretger, Ramon
    Date: 2012-03-01
    Report

     Share Reference managers Reference managers Open in new window

  • IEEE On-Line Testing Symposium 2012

     Cruz Diaz, Josep-llorenç; Canal Corretger, Ramon
    Participation in a competitive project

     Share

  • Dynamic fine-grain body biasing of caches with latency and leakage 3T1D-based monitors

     Ganapathy, Shrikanth; Canal Corretger, Ramon; Gonzalez Colas, Antonio Maria; Rubio Sola, Jose Antonio
    IEEE International Conference on Computer Design: VLSI in Computers and Processors
    Presentation's date: 2011
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • TRAMS Project: variability and reliability of SRAM memories in sub-22 nm Bulk-CMOS technologies

     Canal Corretger, Ramon; Rubio Sola, Jose Antonio; ASenov, Asen; Brown, A.; Miranda, Miguel; Zuber, Paul; Gonzalez Colas, Antonio Maria; Vera, Xavier
    European Future Technologies Conference and Exhibition
    Presentation's date: 2011
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Impact of positive bias temperature instability (PBTI) on 3T1D-DRAM cells

     Aymerich Capdevila, Nivard; Ganapathy, Shrikanth; Rubio Sola, Jose Antonio; Canal Corretger, Ramon; Gonzalez Colas, Antonio Maria
    ACM Great Lakes Symposium on VLSI
    Presentation's date: 2011-05-18
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Memory circuits are playing a key role in complex multicore systems with both data and instructions storage and mailbox communication functions. There is a general concern that conventional SRAM cell based on the 6T structure could exhibit serious limitations in future CMOS technologies due to the instability caused by transistor mismatching as well as for leakage consumption reasons. For L1 data caches the new cell 3T1D DRAM is considered a potential candidate to substitute 6T SRAMs. We first evaluate the impact of the positive bias temperature instability, PBTI, on the access and retention time of the 3T1D memory cell implemented with 45 nm technology. Then, we consider all sources of variations and the effect of the degradation caused by the aging of the device on the yield at system level.

  • New reliability mechanisms in memory design for sub-22nm technologies

     Aymerich Capdevila, Nivard; Brown, A.; Canal Corretger, Ramon; Cheng, B.; Figueras Pamies, Juan; Gonzalez Colas, Antonio Maria; Herrero Abellanas, Enric; Markov, S.; Miranda, Miguel; Pouyan, Peyman; Ramirez Garcia, Tanausu; Rubio Sola, Jose Antonio; Vatajelu, I.; Vera, Xavier; Wang, W.; Zuber, Paul; ASenov, Asen
    IEEE International On-Line Testing Symposium
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Cooperative caching for clip multiprocessors

     Chang, J. Chang; Herrero Abellanas, Enric; Canal Corretger, Ramon; Sohi, G.
    Date of publication: 2011
    Book chapter

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Adaptive Memory Hierarchies For Next Generation Tiled Microarchitectures  Open access  awarded activity

     Herrero Abellanas, Enric
    Defense's date: 2011-07-05
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Les últimes dècades el rendiment dels processadors i de les memòries ha millorat a diferent ritme, limitant el rendiment dels processadors i creant el conegut memory gap. Sol·lucionar aquesta diferència de rendiment és un camp d'investigació d'actualitat i que requereix de noves sol·lucions. Una sol·lució a aquest problema són les memòries “cache”, que permeten reduïr l'impacte d'unes latències de memòria creixents i que conformen la jerarquia de memòria. La majoria de d'organitzacions de les “caches” estan dissenyades per a uniprocessadors o multiprcessadors tradicionals. Avui en dia, però, el creixent nombre de transistors disponible per xip ha permès l'aparició de xips multiprocessador (CMPs). Aquests xips tenen diferents propietats i limitacions i per tant requereixen de jerarquies de memòria específiques per tal de gestionar eficientment els recursos disponibles. En aquesta tesi ens hem centrat en millorar el rendiment i la eficiència energètica de la jerarquia de memòria per CMPs, des de les “caches” fins als controladors de memòria. A la primera part d'aquesta tesi, s'han estudiat organitzacions tradicionals per les “caches” com les privades o compartides i s'ha pogut constatar que, tot i que funcionen bé per a algunes aplicacions, un sistema que s'ajustés dinàmicament seria més eficient. Tècniques com el Cooperative Caching (CC) combinen els avantatges de les dues tècniques però requereixen un mecanisme centralitzat de coherència que té un consum energètic molt elevat. És per això que en aquesta tesi es proposa el Distributed Cooperative Caching (DCC), un mecanisme que proporciona coherència en CMPs i aplica el concepte del cooperative caching de forma distribuïda. Mitjançant l'ús de directoris distribuïts s'obté una sol·lució més escalable i que, a més, disposa d'un mecanisme de marcatge més flexible i eficient energèticament. A la segona part, es demostra que les aplicacions fan diferents usos de la “cache” i que si es realitza una distribució de recursos eficient es poden aprofitar els que estan infrautilitzats. Es proposa l'Elastic Cooperative Caching (ElasticCC), una organització capaç de redistribuïr la memòria “cache” dinàmicament segons els requeriments de cada aplicació. Una de les contribucions més importants d'aquesta tècnica és que la reconfiguració es decideix completament a través del maquinari i que tots els mecanismes utilitzats es basen en estructures distribuïdes, permetent una millor escalabilitat. ElasticCC no només és capaç de reparticionar les “caches” segons els requeriments de cada aplicació, sinó que, a més a més, és capaç d'adaptar-se a les diferents fases d'execució de cada una d'elles. La nostra avaluació també demostra que la reconfiguració dinàmica de l'ElasticCC és tant eficient que gairebé proporciona la mateixa taxa de fallades que una configuració amb el doble de memòria.Finalment, la tesi es centra en l'estudi del comportament de les memòries DRAM i els seus controladors en els CMPs. Es demostra que, tot i que els controladors tradicionals funcionen eficientment per uniprocessadors, en CMPs els diferents patrons d'accés obliguen a repensar com estan dissenyats aquests sistemes. S'han presentat múltiples sol·lucions per CMPs però totes elles es veuen limitades per un compromís entre el rendiment global i l'equitat en l'assignació de recursos. En aquesta tesi es proposen els Thread Row Buffers (TRBs), una zona d'emmagatenament extra a les memòries DRAM que permetria guardar files de dades específiques per a cada aplicació. Aquest mecanisme permet proporcionar un accés equitatiu a la memòria sense perjudicar el seu rendiment global. En resum, en aquesta tesi es presenten noves organitzacions per la jerarquia de memòria dels CMPs centrades en la escalabilitat i adaptativitat als requeriments de les aplicacions. Els resultats presentats demostren que les tècniques proposades proporcionen un millor rendiment i eficiència energètica que les millors tècniques existents fins a l'actualitat.

    Processor performance and memory performance have improved at different rates during the last decades, limiting processor performance and creating the well known "memory gap". Solving this performance difference is an important research field and new solutions must be proposed in order to have better processors in the future. Several solutions exist, such as caches, that reduce the impact of longer memory accesses and conform the system memory hierarchy. However, most of the existing memory hierarchy organizations were designed for single processors or traditional multiprocessors. Nowadays, the increasing number of available transistors has allowed the apparition of chip multiprocessors, which have different constraints and require new ad-hoc memory systems able to efficiently manage memory resources. Therefore, in this thesis we have focused on improving the performance and energy efficiency of the memory hierarchy of chip multiprocessors, ranging from caches to DRAM memories. In the first part of this thesis we have studied traditional cache organizations such as shared or private caches and we have seen that they behave well only for some applications and that an adaptive system would be desirable. State-of-the-art techniques such as Cooperative Caching (CC) take advantage of the benefits of both worlds. This technique, however, requires the usage of a centralized coherence structure and has a high energy consumption. Therefore we propose the Distributed Cooperative Caching (DCC), a mechanism to provide coherence to chip multiprocessors and apply the concept of cooperative caching in a distributed way. Through the usage of distributed directories we obtain a more scalable solution and, in addition, has a more flexible and energy-efficient tag allocation method. We also show that applications make different uses of cache and that an efficient allocation can take advantage of unused resources. We propose Elastic Cooperative Caching (ElasticCC), an adaptive cache organization able to redistribute cache resources dynamically depending on application requirements. One of the most important contributions of this technique is that adaptivity is fully managed by hardware and that all repartitioning mechanisms are based on distributed structures, allowing a better scalability. ElasticCC not only is able to repartition cache sizes to application requirements, but also is able to dynamically adapt to the different execution phases of each thread. Our experimental evaluation also has shown that the cache partitioning provided by ElasticCC is efficient and is almost able to match the off-chip miss rate of a configuration that doubles the cache space. Finally, we focus in the behavior of DRAM memories and memory controllers in chip multiprocessors. Although traditional memory schedulers work well for uniprocessors, we show that new access patterns advocate for a redesign of some parts of DRAM memories. Several organizations exist for multiprocessor DRAM schedulers, however, all of them must trade-off between memory throughput and fairness. We propose Thread Row Buffers, an extended storage area in DRAM memories able to store a data row for each thread. This mechanism enables a fair memory access scheduling without hurting memory throughput. Overall, in this thesis we present new organizations for the memory hierarchy of chip multiprocessors which focus on the scalability and of the proposed structures and adaptivity to application behavior. Results show that the presented techniques provide a better performance and energy-efficiency than existing state-of-the-art solutions.

  • MICROARQUITECTURA Y COMPILADORES PARA FUTUROS PROCESADORES II

     Parcerisa Bundó, Joan Manuel; Canal Corretger, Ramon; Tubella Murgadas, Jordi; Cruz Diaz, Josep-llorenç; Gonzalez Colas, Antonio Maria
    Participation in a competitive project

     Share

  • On the effectiveness of hybrid mechanisms on reduction of parametric failures in caches

     Ganapathy, Shrikanth; Canal Corretger, Ramon; Gonzalez Colas, Antonio Maria; Rubio Sola, Jose Antonio
    Date: 2011-12-05
    Report

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    In this paper, we provide an insight on the different proactive read/write assist methods (wordline boosting & adaptive body biasing) that help in preventing (and reducing) parametric failures when coupled with reactive techniques like ECC and redundancy which cope with already existent failures. While proactive and reactive have been previously viewed as complementary techniques, we show that it is not necessarily the case when considering the benefits of such hybrid schemes.

    Postprint (author’s final draft)

  • Access to the full text
    Dynamic fine-grain body biasing of caches with latency and leakage 3T1D-based monitors  Open access

     Ganapathy, Shrikanth; Canal Corretger, Ramon; Gonzalez Colas, Antonio Maria; Rubio Sola, Jose Antonio
    Date: 2011-04-15
    Report

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    In this paper, we propose a dynamically tunable fine-grain body biasing mechanism to reduce active & standby leakage power in caches under process variations.

  • TRAMS Project: variability and reliability of SRAM memories in sub-22nm bulk-CMOS technologies

     Canal Corretger, Ramon; Rubio Sola, Jose Antonio; ASenov, Asen; Brown, Andrew; Miranda, Miguel; Zuber, Paul; Gonzalez Colas, Antonio Maria; Vera, Xavier
    Procedia Computer Science
    Date of publication: 2011-12-22
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    The TRAMS (Terascale Reliable Adaptive MEMORY Systems) project addresses in an evolutionary way the ultimate CMOS scaling technologies and paves the way for revolutionary, most promising beyond-CMOS technologies. In this abstract we show the significant variability levels of future 18 and 13 nm device bulk-CMOS technologies as well as its dramatic effect on the yield of memory cells and circuits.

  • Elastic cooperative caching: an autonomous dynamically adaptive memory hierarchy for chip multiprocessors

     Herrero Abellanas, Enric; González, José; Canal Corretger, Ramon
    International Symposium on Computer Architecture
    Presentation's date: 2010-06-19
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Power-efficient spilling techniques for chip multiprocessors

     Herrero Abellanas, Enric; González, José; Canal Corretger, Ramon
    International European Conference on Parallel and Distributed Computing
    Presentation's date: 2010-09-02
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Current trends in CMPs indicate that the core count will increase in the near future. One of the main performance limiters of these forthcoming microarchitectures is the latency and high-demand of the on-chip network and the off-chip memory communication. To optimize the usage of on-chip memory space and reduce off-chip traffic several techniques have proposed to use the N-chance forwarding mechanism, a solution for distributing unused cache space in chip multiprocessors. This technique, however, can lead in some cases to extra unnecessary network traffic or inefficient cache allocation. This paper presents two alternative power-efficient spilling methods to improve the efficiency of the N-chance forwarding mechanism. Compared to traditional Spilling, our Distance-Aware Spilling technique provides an energy efficiency improvement (MIPS3/W) of 16% on average, and a reduction of the network usage of 14% in a ring configuration while increasing performance 6%. Our Selective Spilling technique is able to avoid most of the unnecessary reallocations and it doubles the reuse of spilled blocks, reducing network traffic by an average of 22%. A combination of both techniques allows to reduce the network usage by 30% on average without degrading performance, allowing a 9% increase of the energy efficiency.

  • MODEST: a model for energy estimation under spatio-temporal variability

     Ganapathy, Shrikanth; Canal Corretger, Ramon; Gonzalez Colas, Antonio Maria; Rubio Sola, Jose Antonio
    International Symposium on Low Power Electronics and Design
    Presentation's date: 2010-08-20
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Circuit propagation delay estimation through multivariate regression-based modeling under spatio-temporal variability

     Ganapathy, Shrikanth; Canal Corretger, Ramon; Gonzalez Colas, Antonio Maria; Rubio Sola, Jose Antonio
    Design, Automation and Test in Europe
    Presentation's date: 2010-03-08
    Presentation of work at congresses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Non-Speculative Enhancements for the Scheduling Logic

     Gran Tejero, Ruben
    Defense's date: 2010-11-05
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Runahead threads  Open access

     Ramirez Garcia, Tanausu
    Defense's date: 2010-04-15
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Los temas de investigación sobre multithreading han ganado mucho interés en la arquitectura de computadores con la aparición de procesadores multihilo y multinucleo. Los procesadores SMT (Simultaneous Multithreading) son uno de estos nuevos paradigmas, combinando la capacidad de emisión de múltiples instrucciones de los procesadores superscalares con la habilidad de explotar el paralelismo a nivel de hilos (TLP). Así, la principal característica de los procesadores SMT es ejecutar varios hilos al mismo tiempo para incrementar la utilización de las etapas del procesador mediante la compartición de recursos.Los recursos compartidos son el factor clave de los procesadores SMT, ya que esta característica conlleva tratar con importantes cuestiones pues los hilos también compiten por estos recursos en el núcleo del procesador. Si bien distintos grupos de aplicaciones se benefician de disponer de SMT, las diferentes propiedades de los hilos ejecutados pueden desbalancear la asignación de recursos entre los mismos, disminuyendo los beneficios de la ejecución multihilo. Por otro lado, el problema con la memoria está aún presente en los procesadores SMT. Estos procesadores alivian algunos de los problemas de latencia provocados por la lentitud de la memoria con respecto a la CPU. Sin embargo, hilos con grandes cargas de trabajo y con altas tasas de fallos en las caches son unas de las mayores dificultades de los procesadores SMT. Estos hilos intensivos en memoria tienden a crear importantes problemas por la contención de recursos. Por ejemplo, pueden llegar a bloquear recursos críticos debido a operaciones de larga latencia impidiendo no solo su ejecución, sino el progreso de la ejecución de los otros hilos y, por tanto, degradando el rendimiento general del sistema.El principal objetivo de esta tesis es aportar soluciones novedosas a estos problemas y que mejoren el rendimiento de los procesadores SMT. Para conseguirlo, proponemos los Runahead Threads (RaT) aplicando una ejecución especulativa basada en runahead. RaT es un mecanismo alternativo a las políticas previas de gestión de recursos las cuales usualmente restringían a los hilos intensivos en memoria para conseguir más productividad.La idea clave de RaT es transformar un hilo intensivo en memoria en un hilo ligero en el uso de recursos que progrese especulativamente. Así, cuando un hilo sufre de un acceso de larga latencia, RaT transforma dicho hilo en un hilo de runahead mientras dicho fallo está pendiente. Los principales beneficios de esta simple acción son varios. Mientras un hilo está en runahead, éste usa los diferentes recursos compartidos sin monopolizarlos o limitarlos con respecto a los otros hilos. Al mismo tiempo, esta ejecución especulativa realiza prebúsquedas a memoria que se solapan con el fallo principal, por tanto explotando el paralelismo a nivel de memoria y mejorando el rendimiento.RaT añade muy poco hardware extra y complejidad en los procesadores SMT con respecto a su implementación. A través de un mecanismo de checkpoint y lógica de control adicional, podemos dotar a los contextos hardware con la capacidad de ejecución en runahead. Por medio de RaT, contribuímos a aliviar simultaneamente dos problemas en el contexto de los procesadores SMT. Primero, RaT reduce el problema de los accesos de larga latencia en los SMT mediante el paralelismo a nivel de memoria (MLP). Un hilo prebusca datos en paralelo en vez de estar parado debido a un fallo de L2 mejorando su rendimiento individual. Segundo, RaT evita que los hilos bloqueen recursos bajo fallos de larga latencia. RaT asegura que el hilo intensivo en memoria recicle más rápido los recursos compartidos que usa debido a la naturaleza de la ejecución especulativa.La principal limitación de RaT es que los hilos especulativos pueden ejecutar instrucciones extras cuando no realizan prebúsqueda e innecesariamente consumir recursos de ejecución en el procesador SMT. Este inconveniente resulta en hilos de runahead ineficientes pues no contribuyen a la ganancia de rendimiento e incrementan el consumo de energía debido al número extra de instrucciones especulativas. Por consiguiente, en esta tesis también estudiamos diferentes soluciones dirigidas a solventar esta desventaja del mecanismo RaT. El resultado es un conjunto de soluciones complementarias para mejorar la eficiencia de RaT en términos de consumo de potencia y gasto energético.Por un lado, mejoramos la eficiencia de RaT aplicando ciertas técnicas basadas en el análisis semántico del código ejecutado por los hilos en runahead. Proponemos diferentes técnicas que analizan y controlan la utilidad de ciertos patrones de código durante la ejecución en runahead. Por medio de un análisis dinámico, los hilos en runahead supervisan la utilidad de ejecutar los bucles y subrutinas dependiendo de las oportunidades de prebúsqueda. Así, RaT decide cual de estas estructuras de programa ejecutar dependiendo de la información de utilidad obtenida, decidiendo entre parar o saltar el bucle o la subrutina para reducir el número de las instrucciones no útiles. Entre las técnicas propuestas, conseguimos reducir las instrucciones especulativas y la energía gastada mientras obtenemos rendimientos similares a la técnica RaT original.Por otro lado, también proponemos lo que denominamos hilos de runahead eficientes. Esta propuesta se basa en una técnica más fina que cubre todo el rango de ejecución en runahead, independientemente de las características del programa ejecutado. La idea principal es averiguar "cuando" y "durante cuanto" un hilo en runahead debe ser ejecutado prediciendo lo que denominamos distancia útil de runahead. Los resultados muestran que la mejor de estas propuestas basadas en la predicción de la distancia de runahead reducen significativamente el número de instrucciones extras así como también el consumo de potencia. Asimismo, conseguimos mantener los beneficios de rendimiento de los hilos en runahead, mejorando de esta forma la eficiencia energética de los procesadores SMT usando el mecanismo RaT.La evolución de RaT desarrollada durante toda esta investigación nos proporciona no sólo una propuesta orientada a un mayor rendimiento sino también una forma eficiente de usar los recursos compartidos en los procesadores SMT en presencia de operaciones de memoria de larga latencia.Dado que los diseños SMT en el futuro estarán orientados a optimizar una combinación de rendimiento individual en las aplicaciones, la productividad y el consumo de energía, los mecanismos basados en RaT aquí propuestos son interesantes opciones que proporcionan un mejor balance de rendimiento y energía que las propuestas previas en esta área.

    Research on multithreading topics has gained a lot of interest in the computer architecture community due to new commercial multithreaded and multicore processors. Simultaneous Multithreading (SMT) is one of these relatively new paradigms, which combines the multiple instruction issue features of superscalar processors with the ability of multithreaded architectures to exploit thread level parallelism (TLP). The main feature of SMT processors is to execute multiple threads that increase the utilization of the pipeline by sharing many more resources than in other types of processors.Shared resources are the key of simultaneous multithreading, what makes the technique worthwhile.This feature also entails important challenges to deal with because threads also compete for resources in the processor core. On the one hand, although certain types and mixes of applications truly benefit from SMT, the different features of threads can unbalance the resource allocation among threads, diminishing the benefit of multithreaded execution. On the other hand, the memory wall problem is still present in these processors. SMT processors alleviate some of the latency problems arisen by main memory's slowness relative to the CPUs. Nevertheless, threads with high cache miss rates that use large working sets are one of the major pitfalls of SMT processors. These memory intensive threads tend to use processor and memory resources poorly creating the highest resource contention problems. Memory intensive threads can clog up shared resources due to long latency memory operations without making progress on a SMT processor, thereby hindering overall system performance.The main goal of this thesis is to alleviate these shortcomings on SMT scenarios. To accomplish this, the key contribution of this thesis is the application of the paradigm of Runahead execution in the design of multithreaded processors by Runahead Threads (RaT). RaT shows to be a promising alternative to prior SMT resource management mechanisms which usually restrict memory bound threads in order to get higher throughputs.The idea of RaT is to transform a memory intensive thread into a light-consumer resource thread by allowing that thread to progress speculatively. Therefore, as soon as a thread undergoes a long latency load, RaT transforms the thread to a runahead thread while it has that long latency miss outstanding. The main benefits of this simple action performed by RaT are twofold. While being a runahead thread, this thread uses the different shared resources without monopolizing or limiting the available resources for other threads. At the same time, this fast speculative thread issues prefetches that overlap other memory accesses with the main miss, thereby exploiting the memory level parallelism.Regarding implementation issues, RaT adds very little extra hardware cost and complexity to an existing SMT processor. Through a simple checkpoint mechanism and little additional control logic, we can equip the hardware contexts with the runahead thread capability. Therefore, by means of runahead threads, we contribute to alleviate simultaneously the two shortcomings in the context of SMT processor improving the performance. First, RaT alleviates the long latency load problem on SMT processors by exposing memory level parallelism (MLP). A thread prefetches data in parallel (if MLP is available) improving its individual performance rather than be stalled on an L2 miss. Second, RaT prevents threads from clogging resources on long latency loads. RaT ensures that the L2-missing thread recycles faster the shared resources it uses by the nature of runahead speculative execution. This avoids memory intensive threads clogging the important processor resources up.The main limitation of RaT though is that runahead threads can execute useless instructions and unnecessarily consume execution resources on the SMT processor when there is no prefetching to be exploited. This drawback results in inefficient runahead threads which do not contribute to the performance gain and increase dynamic energy consumption due to the number of extra speculatively executed instructions. Therefore, we also propose different solutions aimed at this major disadvantage of the Runahead Threads mechanism. The result of the research on this line is a set of complementary solutions to enhance RaT in terms of power consumption and energy efficiency.On the one hand, code semantic-aware Runahead threads improve the efficiency of RaT using coarse-grain code semantic analysis at runtime. We provide different techniques that analyze the usefulness of certain code patterns during runahead thread execution. The code patterns selected to perform that analysis are loops and subroutines. By means of the proposed coarse grain analysis, runahead threads oversee the usefulness of loops or subroutines depending on the prefetches opportunities during their executions. Thus, runahead threads decide which of these particular program structures execute depending on the obtained usefulness information, deciding either stall or skip the loop or subroutine executions to reduce the number of useless runahead instructions. Some of the proposed techniques reduce the speculative instruction and wasted energy while achieving similar performance to RaT.On the other hand, the efficient Runahead thread proposal is another contribution focused on improving RaT efficiency. This approach is based on a generic technique which covers all runahead thread executions, independently of the executed program characteristics as code semantic-aware runahead threads are. The key idea behind this new scheme is to find out --when' and --how long' a thread should be executed in runahead mode by predicting the useful runahead distance. The results show that the best of these approaches based on the runahead distance prediction significantly reduces the number of extra speculative instructions executed in runahead threads, as well as the power consumption. Likewise, it maintains the performance benefits of the runahead threads, thereby improving the energy-efficiency of SMT processors using the RaT mechanism.The evolution of Runahead Threads developed in this research provides not only a high performance but also an efficient way of using shared resources in SMT processors in the presence of long latency memory operations. As designers of future SMT systems will be increasingly required to optimize for a combination of single thread performance, total throughput, and energy consumption, RaT-based mechanisms are promising options that provide better performance and energy balance than previous proposals in the field.

  • TERASCALE RELIABLE ADAPTIVE MEMORY SYSTEMS

     Figueras Pamies, Juan; Calomarde Palomino, Antonio; Aymerich Capdevila, Nivard; Moll Echeto, Francesc de Borja; Vatajelu, Elena Ioana; Garcia Almudever, Carmen; Canal Corretger, Ramon; Pouyan, Peyman; Rubio Sola, Jose Antonio
    Participation in a competitive project

     Share

  • TERASCALE RELIABLE ADAPTIVE MEMORY SYSTEMS

     Canal Corretger, Ramon; Cruz Diaz, Josep-llorenç; Tubella Murgadas, Jordi; Gonzalez Colas, Antonio Maria
    Participation in a competitive project

     Share

  • Access to the full text
    vPROBE: Variation aware post-silicon power/performance binning using embedded 3T1D cells  Open access

     Ganapathy, Shrikanth; Canal Corretger, Ramon; Gonzalez Colas, Antonio Maria; Rubio Sola, Jose Antonio
    Date: 2010-09-05
    Report

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    In this paper, we present an on-die post-silicon binning methodology that takes into account the effect of static and dynamic variations and categorizes every processor based on power/performance.The proposed scheme is composed of a discretization hardware that exploits the delay/leakage dependence on variability sources characteristic for categorization

  • Elastic Cooperative Caching: An Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors

     Herrero Abellanas, Enric; González, Jose; Canal Corretger, Ramon
    Computer architecture news
    Date of publication: 2010
    Journal article

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Access to the full text
    Using coherence information and decay techniques to optimize L2 cache leakage in CMPs  Open access

     Monchiero, Matteo; Canal Corretger, Ramon; Gonzalez Colas, Antonio Maria
    International Conference on Parallel Processing
    Presentation's date: 2009
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    This paper evaluates several techniques to save leakage in CMP L2 caches by selectively switching off the less used lines. We primarily focus on private snoopy L2 caches. In this case, coherence must be enforced in all situations and specially when a line is turned off to save power. In particular, we introduce three techniques: the first one turns off the cache lines by using the coherence protocol invalidations, the second one is an implementation of a cache decay technique specific for coherent caches, the third one is a performance-optimized decay-based technique for coherent caches. Experimental results, carried out by using accurate performance/thermal/energy models, show that appreciable power savings can be achieved by properly designing a leakage optimization technique. We target a CMP composed of 4 cores and 1 to 8 MB of total cache. For 4MB, the proposed techniques show a 13%, 30%, and 21% energy reduction, respectively, at the cost of 0%, 8%, and 2% performance loss. For other cache sizes the behavior is qualitatively similar.

  • Circuit propagation delay estimation through multivariate regression-based modeling under spatio-temporal variability

     Ganapathy, Shrikanth; Canal Corretger, Ramon; Gonzalez Colas, Antonio Maria; Rubio, Antonio
    Date: 2009-09-07
    Report

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • MICROARQUITECTURA I COMPILADORS (ARCO)

     Gibert Codina, Enric; Aliagas Castell, Carles; Aleta Ortega, Alexandre; Molina Clemente, Carlos; Unsal, Osman Sabri; Piñeiro Riobo, Jose Alejandro; Vera Rivera, Francisco Javier; Gonzalez Colas, Antonio Maria; Canal Corretger, Ramon; Cruz Diaz, Josep-llorenç; Parcerisa Bundó, Joan Manuel; Pons Solé, Marc; Magklis, Grigorios; Codina Viñas, Josep M; Tubella Murgadas, Jordi
    Participation in a competitive project

     Share

  • An hybrid eDRAM/SRAM macrocell to implement first-level data caches

     Valero, Alejandro; Sahuquillo, Julio; Petit, Salvador; Lorente, Vicente; Canal Corretger, Ramon; López, Pedro; Duato, José
    IEEE/ACM International Symposium on Microarchitecture
    Presentation's date: 2009-12-14
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    SRAM and DRAM cells have been the predominant technologies used to implement memory cells in computer systems, each one having its advantages and shortcomings. SRAM cells are faster and require no refresh since reads are not destructive. In contrast, DRAM cells provide higher density and minimal leakage energy since there are no paths within the cell from Vdd to ground. Recently, DRAM cells have been embedded in logic-based technology, thus overcoming the speed limit of typical DRAM cells. In this paper we propose an n-bit macrocell that implements one static cell, and n-1 dynamic cells. This cell is aimed at being used in an n-way set-associative first-level data cache. Our study shows that in a four-way set-associative cache with this macrocell compared to an SRAM based with the same capacity, leakage is reduced by about 75% and area more than half with a minimal impact on performance. Architectural mechanisms have also been devised to avoid refresh logic. Experimental results show that no performance is lost when the retention time is larger than 50K processor cycles. In addition, the proposed delayed writeback policy that avoids refreshing performs a similar amount of writebacks than a conventional cache with the same organization, so no power wasting is incurred.

  • Replacing 6T Srams with 3T1D Dram's in the L1 Data Cache to Combat Process Variability

     Liang, X; Canal Corretger, Ramon; Yeon, Wei G
    IEEE micro
    Date of publication: 2008-02
    Journal article

     Share Reference managers Reference managers Open in new window