Multi-core processors are ubiquitous in all market segments, from embedded to high-performance computing, but only a few applications can utilize them efficiently. Existing parallel frameworks aim to support thread-level parallelism in applications, but the overhead they impose prevents their use for small problem instances. This work presents Micro-threads (Mth), a hardware-software proposal focused on a shared thread management model that enables the use of parallel resources in applications with small chunks of parallel code or small problem inputs. It does so through a combination of software and hardware: delegation of resource control to the application, an improved mechanism to store and fill the processor's context, and an efficient synchronization system. Four sample applications are used to test our proposal: HSL filter (trivially parallel), FFT Radix2 (recursive algorithm), LU decomposition (barrier every cycle) and Dantzig algorithm (graph based, matrix manipulation). The results encourage the use of Mth, which could ease the use of multiple cores for applications that currently cannot take advantage of the proliferation of parallel resources available in each chip.
Markovic, N.; Nemirovsky, D.; Unsal, O.; Valero, M.; Cristal, A. IEEE computer architecture letters Vol. 14, num. 2, p. 160-163 DOI: 10.1109/LCA.2014.2357805 Publication date: 2015-07 Journal article
As thread-level parallelism in applications has continued to expand, so has research in chip multi-core processors. As more and more applications become multi-threaded, we expect to find a growing number of threads executing on a machine. As a consequence, the operating system will require increasingly larger amounts of CPU time to schedule these threads efficiently. Instead of perpetuating the trend of performing more complex thread scheduling in the operating system, we propose a scheduling mechanism that can be efficiently implemented in hardware as well. Our approach of identifying multi-threaded application bottlenecks, such as thread synchronization sections, complements the Fairness-aware Scheduler method. It achieves an average speedup of 11.5 percent (geometric mean) compared to the state-of-the-art Fairness-aware Scheduler.
Liu, Q.; Jiménez, V.; Moreto, M.; Abella, J.; Cazorla, F. J.; Valero, M. IEEE computer architecture letters Vol. 13, num. 2, p. 85-88 DOI: 10.1109/L-CA.2013.24 Publication date: 2014-07-01 Journal article
We present for the first time the concept of per-task energy accounting (PTEA) and relate it to per-task energy metering (PTEM). We show the benefits of supporting both in future computing systems. Using the shared last-level cache (LLC) as an example: (1) we illustrate the complexities in providing PTEM and PTEA; (2) we present an idealized PTEM model and an accurate and low-cost implementation of it; and (3) we introduce a hardware mechanism to provide accurate PTEA in the cache.
In this paper, we propose a novel programmable functional unit (PFU) to accelerate general-purpose application execution on a modern out-of-order x86 processor. Code is transformed, and instructions that run on the PFU are generated, using a co-designed virtual machine (Cd-VM). Results presented in this paper show that this HW/SW co-designed approach produces average speedups of 29% in SPECFP and 19% in SPECINT, and up to 55%, over a modern out-of-order processor.
Luque, C.; Moreto, M.; Cazorla, F. J.; Gioiosa, R.; Buyuktosunoglu, A.; Valero, M. IEEE computer architecture letters Vol. 8, num. 1, p. 17-20 DOI: 10.1109/L-CA.2009.3 Publication date: 2009-01 Journal article
Chip-MultiProcessors (CMP) introduce complexities when accounting CPU utilization to processes because the progress made by a process during an interval of time depends heavily on the activity of the other processes it is co-scheduled with. We propose a new hardware accounting mechanism to improve the accuracy of CPU utilization measurement in CMPs and compare it with previous accounting mechanisms. Our results show that currently known mechanisms could lead to a 12% average error in CPU utilization accounting.
Our proposal reduces this error to less than 1% in a modeled 4-core processor system.
Cache partitioning has been proposed as an interesting alternative to traditional eviction policies of shared cache levels in modern CMP architectures: throughput is improved at the expense of a reasonable cost. However, these new policies present different behaviors depending on the applications that are running in the architecture. In this paper, we introduce some metrics that characterize applications and allow us to give a clear and simple model to explain final throughput speed ups.
Soft errors are an important challenge in contemporary microprocessors. Particle hits on the components of a processor are expected to create an increasing number of transient errors with each new microprocessor generation. In this paper we propose simple mechanisms that effectively reduce the vulnerability to soft errors in a processor. Our designs are motivated by the fact that many of the values produced and consumed in processors are narrow, so their upper-order bits are meaningless. Soft errors caused by a particle strike to these higher-order bits can be avoided by simply identifying these narrow values. Alternatively, soft errors can be detected or corrected on the narrow values by replicating the vulnerable portion of the value inside the storage space provided for the upper-order bits of these operands. We offer a variety of schemes that make use of narrow values and analyze their efficiency in reducing the soft error vulnerability of the processor's level-1 data cache.
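The replication idea in the abstract above can be illustrated with a minimal sketch. It is not the paper's design: the 32-bit word, the 16-bit narrow width, the separate narrow flag (assumed here to be protected against errors), and the function names `is_narrow`, `store`, and `load` are all assumptions made for illustration.

```python
# Hypothetical parameters: 32-bit storage word, values counted as "narrow"
# when they fit in 16 bits after sign extension.
WORD_BITS = 32
HALF_BITS = 16
HALF_MASK = (1 << HALF_BITS) - 1
WORD_MASK = (1 << WORD_BITS) - 1


def is_narrow(word):
    """True if the upper half is only the sign extension of the lower half."""
    upper = (word >> HALF_BITS) & HALF_MASK
    sign = (word >> (HALF_BITS - 1)) & 1
    return upper == (HALF_MASK if sign else 0)


def store(word):
    """Return (stored_bits, narrow_flag).

    For a narrow value, the otherwise meaningless upper half is reused
    to hold a second copy of the lower half.
    """
    word &= WORD_MASK
    if is_narrow(word):
        low = word & HALF_MASK
        return (low << HALF_BITS) | low, True
    return word, False


def load(stored, narrow):
    """Return (value, error_detected).

    For a narrow value, the two copies are compared; a mismatch means
    a bit flipped in one of them, so the error is detected.
    """
    if not narrow:
        return stored, False
    low = stored & HALF_MASK
    copy = (stored >> HALF_BITS) & HALF_MASK
    if low != copy:
        return None, True
    # Rebuild the sign-extended upper half that was overwritten on store.
    upper = HALF_MASK if (low >> (HALF_BITS - 1)) & 1 else 0
    return (upper << HALF_BITS) | low, False
```

With a single extra copy, a mismatch can only be detected, not corrected; correction would need additional redundancy, which this sketch does not model.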
Modern out-of-order processors tolerate long-latency memory operations by supporting a large number of in-flight instructions. This is achieved in part through proper sizing of critical resources, such as register files or instruction queues. In light of the increasing gap between processor speed and memory latency, tolerating upcoming latencies in this way would require impractical sizes of such critical resources. To tackle this scalability problem, we make a case for resource-conscious out-of-order processors. We present quantitative evidence that critical resources are increasingly underutilized in these processors. We advocate that better use of such resources should be a priority in future research in processor architectures.
In recent years, the market for mid/low-end portable systems such as PDAs and mobile phones has experienced a revolution in both sales volume and features as handheld devices incorporate multimedia applications. This leads to an increase in the computational demands of the devices, which remain limited by power (and energy) consumption.
Instruction memoization is a promising technique to help alleviate the problem of power consumption of expensive functional units such as the floating-point one. Unfortunately, this technique could be energy-inefficient for low-end systems due to the additional power consumption of the relatively big tables required.
In this paper we present a novel way of understanding multimedia floating-point operations based on the fuzzy computation paradigm: small losses in computation precision can be traded for performance at the cost of negligible errors in the output. Exploiting the implicit characteristics of media FP computation, we propose a new technique called fuzzy memoization. Fuzzy memoization expands the capabilities of classic memoization by attaching entries with similar inputs to the same output. We present a case study for an SH4-like processor and report good performance and power-delay improvements with feasible hardware requirements.
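As a rough software illustration of the idea above, the sketch below makes similar inputs share one table entry by clearing the low mantissa bits of each float32 operand before the lookup. This is one possible way to define "similar" and is an assumption, not the mechanism described in the paper; the names `float_key` and `FuzzyMemoTable`, the default of 12 dropped bits, and the unbounded dictionary (a hardware table would be small and fixed-size) are likewise hypothetical.

```python
import struct


def float_key(x, dropped_bits):
    """Return the float32 bit pattern of x with its low mantissa bits cleared."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    mask = ~((1 << dropped_bits) - 1) & 0xFFFFFFFF
    return bits & mask


class FuzzyMemoTable:
    """Memoizes multiplications, treating nearby operands as identical."""

    def __init__(self, dropped_bits=12):
        self.dropped_bits = dropped_bits
        self.table = {}   # unbounded here; real hardware tables are finite
        self.hits = 0
        self.misses = 0

    def multiply(self, a, b):
        key = (float_key(a, self.dropped_bits),
               float_key(b, self.dropped_bits))
        if key in self.table:
            # Reuse the result computed earlier for a similar operand pair.
            self.hits += 1
            return self.table[key]
        self.misses += 1
        result = a * b
        self.table[key] = result
        return result
```

For example, after `multiply(1.5, 2.0)` a call to `multiply(1.5001, 2.0)` hits the same entry and returns the stored product, so the functional unit is skipped at the cost of a small output error. Dropping more bits raises the hit rate and the error together.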