Graphic summary
  • Show / hide key
  • Information


Scientific and technological production
  •  

1 to 50 of 92 results
  • Techniques to Improve Concurrency in Hardware Transactional Memory

     Armejach Sanosa, Adrià
    Defense's date: 2014-06-13
    Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract  Share Reference managers Reference managers Open in new window

    Los mecanismos de Memoria Transaccional (TM) tienen como objetivo facilitar la programación paralela proporcionando una abstracción sobre la gestión de los datos compartidos. El programador define secciones de código, llamadas transacciones, y el sistema de TM garantiza que estas se ejecutan de forma atómica y aislada del resto del sistema . El programador no tiene que implementar este comportamiento, como ocurre en las técnicas tradicionales de exclusión mutua que usan cerrojos - esta responsabilidad se delega al sistema de TM subyacente. Además, con transacciones se puede aprovechar el paralelismo, que no estaría disponible en técnicas de exclusión mutua. Esto se logra permitiendo la ejecución optimista de transacciones, asumiendo que ninguna otra transacción opera simultáneamente en los mismos datos. Si esta hipótesis es cierta, la transacción expone sus cambios a la memoria compartida al final de su ejecución, de lo contrario , se produce un conflicto y el sistema de TM debe abortar una de las transacciones en conflicto para garantizar una correcta ejecución; la operación abortada revierte sus cambios locales y se ejecuta de nuevo. Implementaciones hardware y software de TM se han estudiado en detalle. Sin embargo, la adopción a gran escala de los enfoques software se ha visto frenada debido a limitaciones de rendimiento.En esta tesis, nos centramos en la identificación y solución de problemas en enfoques hardware (HTM) con el fin de mejorar la concurrencia y escalabilidad. Dos dimensiones clave determinan el espacio de diseño en HTM: la detección de conflictos y gestión de versiones especulativas. El primero determina cómo se detectan conflictos entre transacciones concurrentes y la forma de resolverlos. El último define donde se almacenan las actualizaciones transaccionales y cómo el sistema trata con dos versiones de los mismos datos lógicos. Esta tesis propone un mecanismo flexible que permite el almacenamiento y el acceso a dos versiones de los mismos datos lógicos, que mejora el rendimiento general del sistema y la eficiencia energética.Además, en esta tesis se exploran dos soluciones para reducir la contención del sistema - circunstancias en que las transacciones abortan debido a dependencias de datos - con el fin de mejorar la concurrencia de los sistemas HTM. El primer mecanismo proporciona un diseño adecuado para aplicar precarga de datos con el fin de acelerar la ejecución de transacciones, reduciendo de la ventana de tiempo en que dichas transacciones pueden experimentar contención. El segundo, es un mecanismo preciso de predicción de conflictos. Este mecanismo utiliza la historia del comportamiento de las transacciones y la localidad temporal en las referencias a memoria para inferir predicciones.Por último, esta tesis también analiza protocolos de HTM que han sido implementados en recientes productos comerciales. Estos protocolos han sido diseñados para ser simples y fáciles de incorporar en procesadores existentes. Sin embargo, esta simplicidad supone una degradación del rendimiento grave debido a condiciones bloqueo activo transitorias o persistentes, lo que podría prevenir progreso de una aplicación. Demostramos que las técnicas existentes no son capaces de mitigar esta degradación de forma efectiva. Para hacer frente a este problema se proponen un conjunto de técnicas que conservan la sencillez del protocolo al mismo tiempo que proporcionan un mejor rendimiento y garantías de progreso sobre una amplia variedad de cargas de trabajo transaccionales.

  • Using dynamic runtime testing for rapid development of architectural simulators

     Tomic, Sa¿a; Cristal Kestelman, Adrian; Unsal, Osman Sabri; Valero Cortes, Mateo
    International journal of parallel programming
    Date of publication: 2014-02-01
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Architectural simulator platforms are particularly complex and error-prone programs that aim to simulate all hardware details of a given target architecture. Development of a stable cycle-accurate architectural simulator can easily take several man-years. Discovering and fixing all visible errors in a simulator often requires significant effort, much higher than for writing the simulator code in the first place. In addition, there are no guarantees that all programming errors will be eliminated, no matter how much effort is put into testing and debugging. This paper presents dynamic runtime testing, a methodology for rapid development and accurate detection of functional bugs in architectural cycle-accurate simulators. Dynamic runtime testing consists of comparing an execution of a cycle-accurate simulator with an execution of a simple and functionally equivalent emulator. Dynamic runtime testing detects a possible functional error if there is a mismatch between the execution in the simulator and the emulator. Dynamic runtime testing provides a reliable and accurate verification of a simulator, during its entire development cycle, with very acceptable performance impact, and without requiring complex setup for the simulator execution. Based on our experience, dynamic testing reduced the simulator modification time from 12-18 person-months to only 3-4 person-months, while it only modestly reduced the simulator performance (in our case under 20 %). © 2012 Springer Science+Business Media, LLC.

  • Enhancing the Efficiency and Practicality of Software Transactional Memory on Massively Multithreaded Systems  Open access

     Kestor, Gokcen
    Defense's date: 2013-03-22
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Los procesadores con múltiple hilos (CMT) prometen dar un mejor rendimiento al correr más de una secuencia de instrucciones en paralelo. Para aprovechar las capacidades de los procesadores CMT, los programadores deben paralelizar sus aplicaciones. La Memoria Transaccional (TM) es uno de los modelos de programación que tiene por objetivo simplificar la sincronización al subir el nivel de abstracción entre la semántica de accesos atómicos y los métodos que son usados para lograr dicha atomicidad. TM es un sistema de programación prometedor pero todavía existen desafíos importantes que deben ser resueltos para que el modelo sea mas práctico y eficiente.El primer desafío que esta disertación confronta es crear sistemas de evaluación para las propuestas de TMs que sean mas sustanciales al añadir benchmarks que sean más realisticos. Primero, introducimos un grupo exhaustivo de benchmarks, RMS-TM, para evaluar sistemas HTMs y STMs. RMS-TM consiste de 7 aplicaciones que incluyen programas del area de reconocimiento, extracción y síntesis (RMS) que representan cargas de trabajo comunes para aplicaciones futuras. RMS-TM incluye temas de investigación en la area de TM, como anidación y operaciones de I/O dentro de transaciones. Muchos sistemas de STM son implementados como librerías de usuario: el programador es responsable de instrumentar manualmente los límites de la transacción y cada lectura y escritura a la memoria dentro de la transacción. Esta técnica hace comparaciones de rendimiento entre sistemas más difíciles. Para poder crear comparasiones de rendimiento que sean ¿manzanas con manzanas¿, creamos una capa de software que ayuda a los investigadores a evaluar las mismas aplicaciones en diferente STMs.El segundo desafío trata de mejorar el rendimiento y la escalabilidad de applicaciones de TM que corren en sistemas de procesadores CMT. El rendimiento y la escalabilidad de los diseños de TM de hoy en día, en particular los diseños de STM, no siempre satisfacen las expectaciones de los programadores. Para superar esta limitación, proponemos un nuevo diseño de STM, STM2, basado en un sistema de execución asistida en el cual operaciones de TM son mandadas a hilos auxiliares mientras los hilos de la aplicación corren el trabajo en una forma optimistica. Los resultados muestran que STM2 provee mejoras entre 1.8x y 5.2x en comparasión con los mejores de otros STMs. Sistemas de execución asistida pueden crear condiciones de utilizacion baja en el procesador. Para aliviar este problema, enriquecimos STM2 con un mecanismo que automaticamente detecta las demandas de los hilos de aplicación y los auxiliares y dinamicamente separa los recursos del hardware entre los hilos basado en los mecanismos de prioridad de hilo implementados en las máquinas de POWER de IBM.El tercer desafío es definir un concepto de que significa para un programa de TM que este correctamente sincronizado. La definición común de la información transaccional sincronizada requiere que todas las transacciones sean ordenadas en un orden total como si fueran serializadas por un reloj global que limita la escalabilidad de los diseños de TM. Para eliminar esta restricción, propusimos relajar la definición de carreras de datos en los sistemas de TM. Esto permite un nivel más alto de concurencia para los programas que usen este concepto. Basado en esta definición, propusimos el primer algoritmo que detecta carreras de datos para applicaciones escritas en C/C++ (TRADE) e implementamos la herramienta que utiliza este algoritmo. Ademas, creamos una nueva definición de carrera de datos transaccionales que es mas intuitiva y transparente a los subyacentes implementaciones de TM. Basado en esta definicion, propusimos T-Rex, una herramienta que detecta carreras de datos que es escalable, eficiente y creada para programas de TM en C/C++. Utilizando TRADE y T-Rex, descubrimos carreras de datos sutiles en las aplicaciones de STAMP que no han sido reportadas en el pasado.

    Chip Multithreading (CMT) processors promise to deliver higher performance by running more than one stream of instructions in parallel. To exploit CMT's capabilities, programmers have to parallelize their applications, which is not a trivial task. Transactional Memory (TM) is one of parallel programming models that aims at simplifying synchronization by raising the level of abstraction between semantic atomicity and the means by which that atomicity is achieved. TM is a promising programming model but there are still important challenges that must be addressed to make it more practical and efficient in mainstream parallel programming. The first challenge addressed in this dissertation is that of making the evaluation of TM proposals more solid with realistic TM benchmarks and being able to run the same benchmarks on different STM systems. We first introduce a benchmark suite, RMS-TM, a comprehensive benchmark suite to evaluate HTMs and STMs. RMS-TM consists of seven applications from the Recognition, Mining and Synthesis (RMS) domain that are representative of future workloads. RMS-TM features current TM research issues such as nesting and I/O inside transactions, while also providing various TM characteristics. Most STM systems are implemented as user-level libraries: the programmer is expected to manually instrument not only transaction boundaries, but also individual loads and stores within transactions. This library-based approach is increasingly tedious and error prone and also makes it difficult to make reliable performance comparisons. To enable an "apples-to-apples" performance comparison, we then develop a software layer that allows researchers to test the same applications with interchangeable STM back ends. The second challenge addressed is that of enhancing performance and scalability of TM applications running on aggressive multi-core/multi-threaded processors. Performance and scalability of current TM designs, in particular STM desings, do not always meet the programmer's expectation, especially at scale. To overcome this limitation, we propose a new STM design, STM2, based on an assisted execution model in which time-consuming TM operations are offloaded to auxiliary threads while application threads optimistically perform computation. Surprisingly, our results show that STM2 provides, on average, speedups between 1.8x and 5.2x over state-of-the-art STM systems. On the other hand, we notice that assisted-execution systems may show low processor utilization. To alleviate this problem and to increase the efficiency of STM2, we enriched STM2 with a runtime mechanism that automatically and adaptively detects application and auxiliary threads' computing demands and dynamically partition hardware resources between the pair through the hardware thread prioritization mechanism implemented in POWER machines. The third challenge is to define a notion of what it means for a TM program to be correctly synchronized. The current definition of transactional data race requires all transactions to be totally ordered "as if'' serialized by a global lock, which limits the scalability of TM designs. To remove this constraint, we first propose to relax the current definition of transactional data race to allow a higher level of concurrency. Based on this definition we propose the first practical race detection algorithm for C/C++ applications (TRADE) and implement the corresponding race detection tool. Then, we introduce a new definition of transactional data race that is more intuitive, transparent to the underlying TM implementation, can be used for a broad set of C/C++ TM programs. Based on this new definition, we proposed T-Rex, an efficient and scalable race detection tool for C/C++ TM applications. Using TRADE and T-Rex, we have discovered subtle transactional data races in widely-used STAMP applications which have not been reported in the past.

  • Improving the energy efficiency of hardware-assisted watchpoint systems

     Karakostas, Vasileios; Tomic, Sa¿a; Unsal, Osman Sabri; Nemirovsky, Mario; Cristal Kestelman, Adrian
    Design Automation Conference
    Presentation's date: 2013-05-29
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Hardware-assisted watchpoint systems enhance the execu- tion of numerous dynamic software techniques, such as mem- ory protection, module isolation, deterministic execution, and data race detection. In this paper, we show that pre- vious hardware proposals may introduce signicant energy overheads, and propose WatchPoint Filtering (WPF), a novel ltering mechanism that eliminates unnecessary watchpoint checks. We evaluate WPF on two state-of-the-art proposals for hardware-assisted watchpoints using two common mem- ory checkers. WPF eliminates 83% of the watchpoint checks (up to 99.7%) and reduces 57% of the dynamic energy over- head (up to 78%) on average, without introducing additional performance execution overhead

    Hardware-assisted watchpoint systems enhance the execution of numerous dynamic software techniques, such as memory protection, module isolation, deterministic execution, and data race detection. In this paper, we show that previous hardware proposals may introduce significant energy overheads, and propose WatchPoint Filtering (WPF), a novel filtering mechanism that eliminates unnecessary watchpoint checks. We evaluate WPF on two state-of-the-art proposals for hardware-assisted watchpoints using two common memory checkers. WPF eliminates 83% of the watchpoint checks (up to 99.7%) and reduces 57% of the dynamic energy overhead (up to 78%) on average, without introducing additional performance execution overhead.

  • Access to the full text
    Circuit design of a novel adaptable and reliable L1 data cache  Open access

     Seyedi, Azam; Yalcin, Gulay; Unsal, Osman Sabri; Cristal Kestelman, Adrian
    ACM Great Lakes Symposium on VLSI
    Presentation's date: 2013-05
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    This paper proposes a novel adaptable and reliable L1 data cache design (Adapcache) with the unique capability of automatically adapting itself for different supply voltage levels and providing the highest reliability. Depending on the supply voltage level, Adapcache defines three operating modes: In high supply voltages, Adapcache provides reliability through single-bit parity. In middle range of supply voltages, Adapcache writes data to two separate cache-lines simultaneously in order to use one line for error recovery when the other line is faulty. In near threshold supply voltages, Adapcache writes data to three separate cache-lines simultaneously in order to provide the correct data based on bitwise majority voter. We design and simulate one embodiment of the Adapcache as a 64-KB L1 data cache with 45-nm CMOS technology at 2GHz processor frequency for almost nominal supply voltages (1V-0.6V), at 900MHz for middle supply voltages (0.6V-0.4V), and at 400MHz for near threshold supply voltages (0.4V-0.32V). According to our experimental results, the energy reduction and latency as well as cache capacity usage are improved compared to typical previous proposals, Triple Modular Redundancy (TMR) and Double Modular Redundancy (DMR) techniques and also to the state of the art proposal, Parichute Error Correction Code (ECC)

    This paper proposes a novel adaptable and reliable L1 data cache design (Adapcache) with the unique capability of automatically adapting itself for different supply voltage levels and providing the highest reliability. Depending on the supply voltage level, Adapcache defines three operating modes: In high supply voltages, Adapcache provides reliability through single-bit parity. In middle range of supply voltages, Adapcache writes data to two separate cache-lines simultaneously in order to use one line for error recovery when the other line is faulty. In near threshold supply voltages, Adapcache writes data to three separate cache-lines simultaneously in order to provide the correct data based on bitwise majority voter. We design and simulate one embodiment of the Adapcache as a 64-KB L1 data cache with 45-nm CMOS technology at 2GHz processor frequency for almost nominal supply voltages (1V-0.6V), at 900MHz for middle supply voltages (0.6V-0.4V), and at 400MHz for near threshold supply voltages (0.4V-0.32V). According to our experimental results, the energy reduction and latency as well as cache capacity usage are improved compared to typical previous proposals, Triple Modular Redundancy (TMR) and Double Modular Redundancy (DMR) techniques and also to the state of the art proposal, Parichute Error Correction Code (ECC).

  • Techniques to improve performance in requester-wins hardware transactional memory

     Armejach, Adrià; Titos Gil, Ruben; Negi, Anurag; Unsal, Osman Sabri; Cristal Kestelman, Adrian
    ACM transactions on architecture and code optimization
    Date of publication: 2013-12
    Journal article

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    The simplicity of requester-wins Hardware Transactional Memory (HTM) makes it easy to incorporate in existing chip multiprocessors. Hence, such systems are expected to be widely available in the near future. Unfortunately, these implementations are prone to suffer severe performance degradation due to transient and persistent livelock conditions. This article shows that existing techniques are unable to mitigate this degradation effectively. It then proposes and evaluates four novel techniques - two software-based that employ information provided by the hardware and two that require simple core-local hardware additions - which have the potential to boost the performance of requester-wins HTM designs substantially. © 2013 ACM.

  • FaulTM: Error detection and recovery using hardware transactional memory

     Yalcin, Gulay; Unsal, Osman Sabri; Cristal Kestelman, Adrian
    Design, Automation and Test in Europe
    Presentation's date: 2013-03
    Presentation of work at congresses

    Read the abstract Read the abstract View View Open in new window  Share Reference managers Reference managers Open in new window

    Reliability is an essential concern for processor designers due to increasing transient and permanent fault rates. Executing instruction streams redundantly in chip multi processors (CMP) provides high reliability since it can detect both transient and permanent faults. Additionally, it also minimizes the Silent Data Corruption rate. However, comparing the results of the instruction streams, checkpointing the entire system and recovering from the detected errors might lead to substantial performance degradation. In this study we propose FaulTM, an error detection and recovery schema utilizing Hardware Transactional Memory (HTM) in order to reduce these performance degradations. We show how a minimally modified HTM that features lazy conflict detection and lazy data versioning can provide low-cost reliability in addition to HTM's intended purpose of supporting optimistic concurrency. Compared with lockstepping, FaulTM reduces the performance degradation by 2.5X for SPEC2006 benchmark.

    Reliability is an essential concern for processor designers due to increasing transient and permanent fault rates. Executing instruction streams redundantly in chip multi processors (CMP) provides high reliability since it can detect both transient and permanent faults. Additionally, it also minimizes the Silent Data Corruption rate. However, comparing the results of the instruction streams, checkpointing the entire system and recovering from the detected errors might lead to substantial performance degradation. In this study we propose FaulTM, an error detection and recovery schema utilizing Hardware Transactional Memory (HTM) in order to reduce these performance degradations. We show how a minimally modified HTM that features lazy conflict detection and lazy data versioning can provide low-cost reliability in addition to HTM's intended purpose of supporting optimistic concurrency. Compared with lockstepping, FaulTM reduces the performance degradation by 2.5X for SPEC2006 benchmark.

  • A Multicore Emulator with a Profiling Infrastructure for Transactional Memory on FPGA

     Sonmez, Nehir
    Defense's date: 2012-09-19
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

     Share Reference managers Reference managers Open in new window

  • The Multi-State Processors

     Gonzalez Martin, Isidro
    Defense's date: 2012-10-26
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

     Share Reference managers Reference managers Open in new window

  • Towards Lightweight and High-Performance Hardware Transactional Memory  Open access

     Tomic, Sa¿a
    Defense's date: 2012-07-13
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Conventional lock-based synchronization serializes accesses to critical sections guarded by the same lock. Using multiple locks brings the possibility of a deadlock or a livelock in the program, making parallel programming a difficult task. Transactional Memory (TM) is a promising paradigm for parallel programming, offering an alternative to lock-based synchronization. TM eliminates the risk of deadlocks and livelocks, while it provides the desirable semantics of Atomicity, Consistency, and Isolation of critical sections. TM speculatively executes a series of memory accesses as a single, atomic, transaction. The speculative changes of a transaction are kept private until the transaction commits. If a transaction can break the atomicity or cause a deadlock or livelock, the TM system aborts the transaction and rolls back the speculative changes. To be effective, a TM implementation should provide high performance and scalability. While implementations of TM in pure software (STM) do not provide desirable performance, Hardware TM (HTM) implementations introduce much smaller overhead and have relatively good scalability, due to their better control of hardware resources. However, many HTM systems support only the transactions that fit limited hardware resources (for example, private caches), and fall back to software mechanisms if hardware limits are reached. These HTM systems, called best-effort HTMs, are not desirable since they force a programmer to think in terms of hardware limits, to use both HTM and STM, and to manage concurrent transactions in HTM and STM. In contrast with best-effort HTMs, unbounded HTM systems support overflowed transactions, that do not fit into private caches. Unbounded HTM systems often require complex protocols or expensive hardware mechanisms for conflict detection between overflowed transactions. In addition, an execution with overflowed transactions is often much slower than an execution that has only regular transactions. This is typically due to restrictive or approximative conflict management mechanism used for overflowed transactions. In this thesis, we study hardware implementations of transactional memory, and make three main contributions. First, we improve the general performance of HTM systems by proposing a scalable protocol for conflict management. The protocol has precise conflict detection, in contrast with often-employed inexact Bloom-filter-based conflict detection, which often falsely report conflicts between transactions. Second, we propose a best-effort HTM that utilizes the new scalable conflict detection protocol, termed EazyHTM. EazyHTM allows parallel commits for all non-conflicting transactions, and generally simplifies transaction commits. Finally, we propose an unbounded HTM that extends and improves the initial protocol for conflict management, and we name it EcoTM. EcoTM features precise conflict detection, and it efficiently supports large as well as small and short transactions. The key idea of EcoTM is to leverage an observation that very few locations are actually conflicting, even if applications have high contention. In EcoTM, each core locally detects if a cache line is non-conflicting, and conflict detection mechanism is invoked only for the few potentially conflicting cache lines.

    La Sincronización tradicional basada en los cerrojos de exclusión mutua (locks) serializa los accesos a las secciones críticas protegidas este cerrojo. La utilización de varios cerrojos en forma concurrente y/o paralela aumenta la posibilidad de entrar en abrazo mortal (deadlock) o en un bloqueo activo (livelock) en el programa, está es una de las razones por lo cual programar en forma paralela resulta ser mucho mas dificultoso que programar en forma secuencial. La memoria transaccional (TM) es un paradigma prometedor para la programación paralela, que ofrece una alternativa a los cerrojos. La memoria transaccional tiene muchas ventajas desde el punto de vista tanto práctico como teórico. TM elimina el riesgo de bloqueo mutuo y de bloqueo activo, mientras que proporciona una semántica de atomicidad, coherencia, aislamiento con características similares a las secciones críticas. TM ejecuta especulativamente una serie de accesos a la memoria como una transacción atómica. Los cambios especulativos de la transacción se mantienen privados hasta que se confirma la transacción. Si una transacción entra en conflicto con otra transacción o sea que alguna de ellas escribe en una dirección que la otra leyó o escribió, o se entra en un abrazo mortal o en un bloqueo activo, el sistema de TM aborta la transacción y revierte los cambios especulativos. Para ser eficaz, una implementación de TM debe proporcionar un alto rendimiento y escalabilidad. Las implementaciones de TM en el software (STM) no proporcionan este desempeño deseable, en cambio, las mplementaciones de TM en hardware (HTM) tienen mejor desempeño y una escalabilidad relativamente buena, debido a su mejor control de los recursos de hardware y que la resolución de los conflictos así el mantenimiento y gestión de los datos se hace en hardware. Sin embargo, muchos de los sistemas de HTM están limitados a los recursos de hardware disponibles, por ejemplo el tamaño de las caches privadas, y dependen de mecanismos de software para cuando esos límites son sobrepasados. Estos sistemas HTM, llamados best-effort HTM no son deseables, ya que obligan al programador a pensar en términos de los límites existentes en el hardware que se esta utilizando, así como en el sistema de STM que se llama cuando los recursos son sobrepasados. Además, tiene que resolver que transacciones hardware y software se ejecuten concurrentemente. En cambio, los sistemas de HTM ilimitados soportan un numero de operaciones ilimitadas o sea no están restringidos a límites impuestos artificialmente por el hardware, como ser el tamaño de las caches o buffers internos. Los sistemas HTM ilimitados por lo general requieren protocolos complejos o mecanismos muy costosos para la detección de conflictos y el mantenimiento de versiones de los datos entre las transacciones. Por otra parte, la ejecución de transacciones es a menudo mucho más lenta que en una ejecución sobre un sistema de HTM que este limitado. Esto es debido al que los mecanismos utilizados en el HTM limitado trabaja con conjuntos de datos relativamente pequeños que caben o están muy cerca del núcleo del procesador. En esta tesis estudiamos implementaciones de TM en hardware. Presentaremos tres contribuciones principales: Primero, mejoramos el rendimiento general de los sistemas, al proponer un protocolo escalable para la gestión de conflictos. El protocolo detecta los conflictos de forma precisa, en contraste con otras técnicas basadas en filtros Bloom, que pueden reportar conflictos falsos entre las transacciones. Segundo, proponemos un best-effort HTM que utiliza el nuevo protocolo escalable detección de conflictos, denominado EazyHTM. EazyHTM permite la ejecución completamente paralela de todas las transacciones sin conflictos, y por lo general simplifica la ejecución. Por último, proponemos una extensión y mejora del protocolo inicial para la gestión de conflictos, que llamaremos EcoTM. EcoTM cuenta con detección de conflictos precisa, eficiente y es compatible tanto con transacciones grandes como con pequeñas. La idea clave de EcoTM es aprovechar la observación que en muy pocas ubicaciones de memoria aparecen los conflictos entre las transacciones, incluso en aplicaciones tienen muchos conflictos. En EcoTM, cada núcleo detecta localmente si la línea es conflictiva, además existe un mecanismo de detección de conflictos detallado que solo se activa para las pocas líneas de memoria que son potencialmente conflictivas.

  • Programming, Debugging, Profiling and Optimizing Transactional Memory Programs  Open access

     Hasanov Zyulkyarov, Ferad
    Defense's date: 2011-07-19
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Transactional memory (TM) is a new optimistic synchronization technique which has the potential of making shared memory parallel programming easier compared to locks without giving up from the performance. This thesis explores four aspects in the research of transactional memory. First, it studies how programming with TM compares to locks. During the course of work, it develops the first real transactional application ¿ AtomicQuake. AtomicQuake is adapted from the parallel version of the Quake game server by replacing all lock-based synchronization with atomic blocks. Findings suggest that programming with TM is indeed easier than locks. However the performance of current software TM systems falls behind the efficiently implemented lock-based versions of the same program. Also, the same findings report that the proposed language level extensions are not sufficient for developing robust production level software and that the existing development tools such as compilers, debuggers, and profilers lack support for developing transactional application. Second, this thesis introduces new set of debugging principles and abstractions. These new debugging principles and abstractions enable debugging synchronization errors which manifest at coarse atomic block level, wrong code inside atomic blocks, and also performance errors related to the implementation of the atomic block. The new debugging principles distinguish between debugging at the language level constructs such as atomic blocks and debugging the atomic blocks based on how they are implemented whether TM or lock inference. These ideas are demonstrated by implementing a debugger extension for WinDbg and the ahead-of-time C# to X86 Bartok-STM compiler. Third, this thesis investigates the type of performance bottlenecks in TM applications and introduces new profiling techniques to find and understand these bottlenecks. The new profiling techniques provide in-depth and comprehensive information about the wasted work caused by aborting transactions. The individual profiling abstractions can be grouped in three groups: (i) techniques to identify multiple conflicts from a single program run, (ii) techniques to describe the data structures involved in conflicts by using a symbolic path through the heap, rather than a machine address, and (iii) visualization techniques to summarize which transactions conflict most. The ideas were demonstrated by building a lightweight profiling framework for Bartok-STM and an offline tool which process and display the profiling data. Forth, this thesis explores and introduces new TM specific optimizations which target the wasted work due to aborting transactions. Using the results obtained with the profiling tool it analyzes and optimizes several applications from the STAMP benchmark suite. The profiling techniques effectively revealed TM-specific bottlenecks such as false conflicts and contentions accesses to data structures. The discovered bottlenecks were subsequently eliminated with using the new optimization techniques. Among the optimization highlights are the transaction checkpoints which reduced the wasted work in Intruder with 40%, decomposing objects to eliminate false conflicts in Bayes, early release in Labyrinth which decreased wasted work from 98% to 1%, using less contentions data structures such as chained hashtable in Intruder and Genome which have higher degree of parallelism.

  • Debugging Programs that use Atomic Blocks and Transactional Memory

     Harris, Tim; Zyulkyarov, Ferad Hasanov; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    ACM SIGPLAN notices
    Date of publication: 2010-05
    Journal article

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Turbocharging Boosted Transactions or: How I Learnt to Stop Worrying and Love Longer Transactions

     Kulkarni, Chinmay; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Ayguade Parra, Eduard; Valero Cortes, Mateo
    ACM SIGPLAN notices
    Date of publication: 2009-04
    Journal article

     Share Reference managers Reference managers Open in new window

  • Atomic Quake: Using Transactional Memory in an Interactive Multiplayer Game Server

     Zyulkyarov, Ferad Hasanov; Gajinov, Vladimir; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Ayguade Parra, Eduard; Harris, Tim; Valero Cortes, Mateo
    ACM SIGPLAN notices
    Date of publication: 2009-04
    Journal article

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Access to the full text
    Atomic quake: using transactional memory in an interactive mulitplayer game Server  Open access

     Gajinov, Vladimir; Zyulkyarov, Ferad Hasanov; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Ayguade Parra, Eduard; Harris, Tim; Valero Cortes, Mateo
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    Presentation's date: 2009-02
    Presentation of work at congresses

    Read the abstract Read the abstract Access to the full text Access to the full text Open in new window  Share Reference managers Reference managers Open in new window

    Transactional Memory (TM) is being studied widely as a new technique for synchronizing concurrent accesses to shared memory data structures for use in multi-core systems. Much of the initial work on TM has been evaluated using microbenchmarks and application kernels; it is not clear whether conclusions drawn from these workloads will apply to larger systems. In this work we make the first attempt to develop a large, complex, application that uses TM for all of its synchronization. We describe how we have taken an existing parallel implementation of the Quake game server and restructured it to use transactions. In doing so we have encountered examples where transactions simplify the structure of the program. We have also encountered cases where using transactions occludes the structure of the existing code. Compared with existing TM benchmarks, our workload exhibits non-block-structured transactions within which there are I/O operations and system call invocations. There are long and short running transactions (200– 1.3M cycles) with small and large read and write sets (a few bytes to 1.5MB). There are nested transactions reaching up to 9 levels at runtime. There are examples where error handling and recovery occurs inside transactions. There are also examples where data changes between being accessed transactionally and accessed nontransactionally. However, we did not see examples where the kind of access to one piece of data depended on the value of another.

  • Content aware architectures

     González Garcia, Ruben
    Defense's date: 2009-12-04
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

     Share Reference managers Reference managers Open in new window

  • A Two level Load/Store Queue Based on Execution Locality

     Pericas, Miquel; Cristal Kestelman, Adrian; Cazorla Almeida, Francisco Javier; González Garcia, Ruben; Veidenbaum, Alex; Jimenez, Daniel A; Valero Cortes, Mateo
    The 35th Annual International Symposium on Computer Arquitecture
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Transactional Memory and OpenMP

     Milovanovic, Milos; Ferrer, Roger; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    3rd International Workshop on OpenMP (IWOMP 2007)
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Transactional Memory and OpenMp

     Milovanovic, Milos; Ferrer, Roger; Unsal, Osman Sabri; Cristal Kestelman, Adrian; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Valero Cortes, Mateo
    Lecture notes in computer science
    Date of publication: 2007-06
    Journal article

     Share Reference managers Reference managers Open in new window

  • Transactional memory: An overview

     Harris, T; Cristal Kestelman, Adrian; Unsal, Osman Sabri; Ayguade Parra, Eduard; Gagliardi, F; Smith, B; Valero Cortes, Mateo
    IEEE micro
    Date of publication: 2007-05
    Journal article

     Share Reference managers Reference managers Open in new window

  • Kilo-Instruction Processors RunAhead and Prefetch

     Cristal Kestelman, Adrian
    Third ACM International Conference on Computing Frontiers (CF'06)
    Presentation's date: 2006-05-03
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Boosting ILP & TLP with the Flexible Multi-Core (FMC)

     Pericas Gleim, Miquel; Cristal Kestelman, Adrian; González Garcia, Ruben; Cazorla Almeida, Francisco Javier; Valero Cortes, Mateo
    Second International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2006)
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • A Decoupled KILO-Instruction Processor

     Pericas Gleim, Miquel; Cristal Kestelman, Adrian; González Garcia, Ruben; Valero Cortes, Mateo
    12th International Symposium on High-Performance Computer Architecture (HPCA-12)
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Kilo-Instruction Processors, RunAhead and Prefetch

     Ramirez Garcia, Tanausu; Cristal Kestelman, Adrian; Oliverio, J Santana; Pajuelo González, Manuel Alejandro; Valero Cortes, Mateo
    Third ACM International Conference on Computing Frontiers (CF'06)
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Kilo Instruction Processors

     Cristal Kestelman, Adrian
    Defense's date: 2006-04-18
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

     Share Reference managers Reference managers Open in new window

  • Instruction Wakeup Mechanism: power and timming evaluation

     Cristal Kestelman, Adrian; Valero Cortes, Mateo
    Date of publication: 2005-01
    Book chapter

     Share Reference managers Reference managers Open in new window

  • the Loop Processor Architecture

     García, Alfredo; Medina, P; Fernández, E; Santana Pérez, Orlando Onofre; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    Jornadas de Paralelismo
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Arquitectura Clusterizada Asimetrica Basada en el Contenido

     González Garcia, Ruben; Cristal Kestelman, Adrian; Pericas Gleim, Miquel; Valero Cortes, Mateo
    Jornadas de Paralelismo
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Toward the Loop Processor Architecture (LPA)

     García Arbelo, J. Alejandro; Medina, Pedro; Cristal Kestelman, Adrian; Santana Jaria, Oliverio J.; FERNANDEZ GARCIA, ENRIQUE; Valero Cortes, Mateo
    XVI Jornadas de Paralelismo. CEDI 2005 I Congreso Español de Informática.
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • An Asymmetric Clustered Processor Based on Value Content

     González Garcia, Ruben; Cristal Kestelman, Adrian; Pericas Gleim, Miquel; Valero Cortes, Mateo; Veidenbaum, Alex
    19th ACM International Conference on Supercomputing (ISC'2005)
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Exploiting Instruction Locality with a Decoupled kilo-Instruction Processor

     Pericas Gleim, Miquel; Cristal Kestelman, Adrian; González Garcia, Ruben; Jiménez, D A; Valero Cortes, Mateo
    International Symposium on High Performance Computing
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • KIMP: Multicheckpointing Multiprocessors

     Enrrique, Vallejo; Galluzzi, Marco; Cristal Kestelman, Adrian; Fernando, Vallejo; Stenström, Per; James, E Smith; Valero Cortes, Mateo; Beivide Palacio, Julio Ramon
    Jornadas de Paralelismo
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Decoupled State-Execute Architecture

     Pericas Gleim, Miquel; Cristal Kestelman, Adrian; González Garcia, Ruben; Valero Cortes, Mateo
    International Symposium on High Performance Computing
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Decoupled State-Execute Architecture

     Pericás, M; Cristal Kestelman, Adrian; González Garcia, Ruben; Valero Cortes, Mateo
    International Symposium on High Performance Computing
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Instruction Locality with a Decoupled kilo-Instruction Processor

     Pericas Gleim, Miquel; Cristal Kestelman, Adrian; González Garcia, Ruben; Jiménez, D A; Valero Cortes, Mateo
    International Symposium on High Performance Computing
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • An Asymmetric Clustered Processor based on Value Content

     Cristal Kestelman, Adrian
    19th ACM International Conference on Supercomputing (ISC'2005)
    Presentation's date: 2005-06-20
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • A New Pointer-based Instruction Queue Design and Its Power-Performance Evaluation

     Ramírez, Marco A; Cristal Kestelman, Adrian; Valero Cortes, Mateo; Veidenbaum, Alexander V; Villa, Luis
    23rd IEEE International Conference on Computer Design, ICCD 2005
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Different approaches using Kilo-instruction processors

     Ramirez Garcia, Tanausu; Galluzzi, Marco; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems
    Presentation's date: 2005
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Overcoming the memory wall with D-KIPs

     Pericas Gleim, Miquel; González Garcia, Ruben; Cristal Kestelman, Adrian; Valero Cortes, Mateo
    International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems
    Presentation's date: 2005
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Chained In-Order/Out-of-Order DoubleCore Architecture

     Pericas Gleim, Miquel; Cristal Kestelman, Adrian; González Garcia, Ruben; Jimenez, Daniel A; Valero Cortes, Mateo
    International Symposium on Computer Architecture and High Performance Computing
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Exploiting Instruction Locality with a Decoupled kilo-Instruction Processor

     Pericas Gleim, Miquel; Cristal Kestelman, Adrian; González Garcia, Ruben; Jiménez, D A; Valero Cortes, Mateo
    International Symposium on High Performance Computing
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Eficacia vs. Eficiencia: Una decision de diseño en RunAhead

     Ramirez Garcia, Tanausu; Cristal Kestelman, Adrian; Oliverio, J Santana; Pajuelo González, Manuel Alejandro; Valero Cortes, Mateo
    Date: 2005-12
    Report

     Share Reference managers Reference managers Open in new window

  • Kilo-instruction processors: overcoming the memory wall

     Cristal Kestelman, Adrian; Galluzzi, Marco; Valero Cortes, Mateo
    IEEE micro
    Date of publication: 2005-05
    Journal article

    View View Open in new window  Share Reference managers Reference managers Open in new window

  • Arquitectura Simétrica

     González Garcia, Ruben; Cristal Kestelman, Adrian; Pericas Gleim, Miquel; Veidenbaum, A; Valero Cortes, Mateo
    Jornadas de Paralelismo
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Eficacia vs Eficiencia: Una decision de diseño en RunAhead

     Ramirez Garcia, Tanausu; Cristal Kestelman, Adrian; Santana, O J; Pajuelo González, Manuel Alejandro; Valero Cortes, Mateo
    Jornadas de Paralelismo
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Banked Front-End Physical Register File

     Pericas Gleim, Miquel; González, Rubén; Cristal Kestelman, Adrian; Veidenbaum, Alex; Valero Cortes, Mateo
    Date: 2004-10
    Report

     Share Reference managers Reference managers Open in new window

  • A clustered Processor based on Content-Aware Register File

     Gonzalez, Ruben; Cristal Kestelman, Adrian; Weidenbaum, Alexander; Pericas Gleim, Miquel; Valero Cortes, Mateo
    Date: 2004-10
    Report

     Share Reference managers Reference managers Open in new window

  • Implementing Kilo-Instruction Multiprocessors

     Vallejo, E; Galluzzi, Marco; Cristal Kestelman, Adrian; Vallejo, F; Beivide Palacio, Julio Ramon; Stenstrom, P; Smith, J; Valero Cortes, Mateo
    The IEEE/ACS International Conference on Pervasive Services (ICPS'04)
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • A comprehensive description of kilo-instruction processors

     Valero Cortes, Mateo; Cristal Kestelman, Adrian; Santana Jaria, Oliverio J.
    5º Congreso Nacional de Computación (CORE-2004)
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window

  • Maintaining Thousands of In-Flight Instructions

     Cristal Kestelman, Adrian
    Euro-Par
    Presentation's date: 2004-08-31
    Presentation of work at congresses

     Share Reference managers Reference managers Open in new window