Roca, J.; Moya, V.; González, C.; Escandell, V.; Murciego, A.; Fernandez, A.; Espasa, R. The Visual Computer, Vol. 26, num. 6-8, p. 707-719. DOI: 10.1007/s00371-010-0492-4. Publication date: 2010-06. Journal article.
Roca, J.; Moya, V.; Gonzalez, C.; Escandell, V.; Murciego, A.; Fernandez, A.; Espasa, R. Computer Graphics International Conference, p. 707-719. Presentation date: 2010-06. Conference presentation.
This paper shows that the 1 triangle/clock rasterization-rate barrier for microtriangles in modern GPU architectures can be broken efficiently. The fixed throughput of the special-purpose culling and triangle setup stages of the classic pipeline limits the GPU's ability to rasterize many triangles in parallel when each covers very few pixels. In contrast, the growing shader core counts and GFLOPs of modern GPUs clearly suggest parallelizing this computation entirely across multiple shader threads, making use of the powerful wide-ALU instructions. In this paper, we present a very efficient SIMD-like rasterization code targeted at very small triangles that scales very well with the number of shader cores and outperforms traditional edge-equation-based algorithms. We have extended the ATTILA GPU shader ISA (del Barrio et al., IEEE International Symposium on Performance Analysis of Systems and Software, pp. 231-241, 2006) with two fixed-point instructions to meet the rasterization precision requirements. This paper also introduces a novel subpixel bounding-box size optimization that adjusts the bounds much more finely, which is critical for small triangles and doubles the 2x2-pixel stamp test efficiency. The proposed shader rasterization program can run on top of the original pixel shader program so that selected fragments are rasterized, attribute-interpolated, and pixel-shaded in a single pass. Our results show that our technique outperforms a classic rasterizer at 8 or more shader cores, with speedups as high as 4x for 16 shader cores.
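Since the abstract above hinges on evaluating edge equations at fixed-point precision inside a tight subpixel bounding box, a minimal scalar sketch in C may help fix the idea. Everything here is illustrative: the 4-bit subpixel format, the vertex values, and the scalar loop stand in for the paper's SIMD shader code and the ATTILA fixed-point instructions.

    #include <stdint.h>
    #include <stdio.h>

    /* 4 subpixel bits of fixed point (an illustrative choice, not the
       paper's exact format). */
    #define SUB 4
    #define FIX(v) ((int32_t)((v) * (1 << SUB)))

    /* Edge function: its sign tells which side of edge (a->b) point p
       lies on; >= 0 on all three edges means "inside" for CCW winding. */
    static int64_t edge(int32_t ax, int32_t ay, int32_t bx, int32_t by,
                        int32_t px, int32_t py)
    {
        return (int64_t)(bx - ax) * (py - ay) - (int64_t)(by - ay) * (px - ax);
    }

    static int32_t min3(int32_t a, int32_t b, int32_t c)
    { int32_t m = a < b ? a : b; return m < c ? m : c; }
    static int32_t max3(int32_t a, int32_t b, int32_t c)
    { int32_t m = a > b ? a : b; return m > c ? m : c; }

    int main(void)
    {
        /* A microtriangle spanning about one pixel (CCW winding assumed). */
        int32_t x0 = FIX(2.10), y0 = FIX(2.20);
        int32_t x1 = FIX(3.40), y1 = FIX(2.55);
        int32_t x2 = FIX(2.60), y2 = FIX(3.30);

        /* Subpixel bounding box: taking the bounds in fixed point and only
           then converting to pixels (rather than rounding each vertex to
           whole pixels first) is the kind of tightening that matters for
           triangles this small. */
        int px0 = min3(x0, x1, x2) >> SUB, px1 = max3(x0, x1, x2) >> SUB;
        int py0 = min3(y0, y1, y2) >> SUB, py1 = max3(y0, y1, y2) >> SUB;

        for (int py = py0; py <= py1; py++)
            for (int px = px0; px <= px1; px++) {
                /* Sample at the pixel center, in fixed point. */
                int32_t sx = (px << SUB) + (1 << (SUB - 1));
                int32_t sy = (py << SUB) + (1 << (SUB - 1));
                if (edge(x0, y0, x1, y1, sx, sy) >= 0 &&
                    edge(x1, y1, x2, y2, sx, sy) >= 0 &&
                    edge(x2, y2, x0, y0, sx, sy) >= 0)
                    printf("pixel (%d,%d) covered\n", px, py);
            }
        return 0;
    }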
Two of the most important performance limiters in today's processor families come from the memory wall and from handling control dependencies. To address these issues, cache memories and branch predictors are well-known hardware proposals that exploit, among other things, temporal memory reuse and branch correlation. In other words, they try to exploit the dynamic redundancy existing in programs. This redundancy comes partly from the way programmers write source code, but also from limitations in the traditional compilation model, which introduces unnecessary memory and conditional branch instructions. We believe that today's optimizing compilers should be very aggressive in optimizing programs, and should therefore be expected to optimize a significant part of this redundancy away. On the other hand, optimizations performed at link time or applied directly to final program executables have received increased attention in recent years, due to limitations in the traditional compilation model. First, even when performing sophisticated interprocedural analyses and transformations, traditional compilers do not have the opportunity to optimize the program as a whole. A similar problem arises when applying profile-directed compilation techniques: large projects are forced to rebuild every source file to take advantage of profile information. By contrast, it would be more convenient to build the full application, instrument it to obtain profile data, and then re-optimize the final binary without recompiling a single source file.

In this thesis we present new profile-guided compiler optimizations for eliminating the redundancy encountered in executable programs at the binary level (i.e., binary redundancy), even though these programs have been compiled with full optimizations using a state-of-the-art commercial compiler. In particular, our Binary Redundancy Elimination (BRE) techniques are targeted at eliminating both redundant memory operations and redundant conditional branches, which are the most important ones for addressing the performance issues mentioned above in today's microprocessors. These new proposals are mainly based on Partial Redundancy Elimination (PRE) techniques for eliminating partial redundancies in a path-sensitive fashion. Our results show that, by applying our optimizations, we are able to achieve a 14% execution-time reduction on our benchmark suite.

In this work we also review the problem of alias analysis at the executable-program level, identifying why memory disambiguation is one of the weak points of object-code modification. We then propose several alias analyses to be applied in the context of link-time or executable-code optimizers. First, we present a must-alias analysis to recognize memory dependencies in a path-sensitive fashion, which is used in our optimization for eliminating redundant memory operations. Next, we propose two speculative may-alias data-flow algorithms to recognize memory independencies. These may-alias analyses are based on introducing unsafe speculation at analysis time, which increases alias precision on important portions of code while keeping the analysis reasonably cost-efficient. Our results show that our analyses prove very useful for increasing the memory-disambiguation accuracy of binary code, which translates into opportunities for applying optimizations.
All our algorithms, both the analyses and the optimizations, have been implemented within a binary optimizer, which overcomes most of the existing limitations of traditional source-code compilers. Our work therefore also points out the most relevant issues of applying these algorithms at the executable-code level, where most of the high-level information available in traditional compilers is lost.
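To make the core transformation concrete, here is a source-level C analogue of eliminating a partially redundant load; the thesis performs the equivalent rewriting on binary code, and the legality of keeping the value in a temporary is exactly what the must-alias analysis has to establish (no intervening store may alias *p). The function names are illustrative, not from the thesis.

    #include <stdio.h>

    /* In `before`, the second load of *p is partially redundant: on the
       path where the branch is taken, *p was already loaded and p is not
       written in between. */
    static int before(const int *p, int cond)
    {
        int sum = 0;
        if (cond)
            sum += *p;      /* load #1, only on this path  */
        sum += *p;          /* load #2: partially redundant */
        return sum;
    }

    /* After PRE: the load happens exactly once on every path and the
       value lives in a register-like temporary, so the second use costs
       no memory access. A binary optimizer does the same rewriting on
       machine instructions instead of C. */
    static int after_pre(const int *p, int cond)
    {
        int t = *p;         /* single load, available on all paths */
        int sum = 0;
        if (cond)
            sum += t;
        sum += t;
        return sum;
    }

    int main(void)
    {
        int x = 21;
        printf("%d %d\n", before(&x, 1), after_pre(&x, 1));  /* 42 42 */
        return 0;
    }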
Moya, V.; Gonzalez, C.; Roca, J.; Fernandez, A.; Espasa, R. 2005 International Conference on High Performance Embedded Architectures & Compilers (HiPEAC'2005), p. 286-301. Conference presentation.
Fernandez, M.; Espasa, R. Annual Workshop on Interaction between Compilers and Computer Architecture, in conjunction with the IEEE International Symposium on High-Performance Computer Architecture. Conference presentation.
Vector processors have good performance, cost, and adaptability when targeting multimedia applications. However, for a significant number of media programs, conventional memory configurations fail to deliver enough memory references per cycle to feed the SIMD functional units. This paper addresses this memory bandwidth problem. We propose a novel mechanism suitable for 2-dimensional vector architectures and targeted at providing high effective bandwidth for SIMD memory instructions. The basis of this mechanism is the extension of the scope of vectorization to the memory level, so that 3-dimensional memory patterns can be fetched into a second-level register file. By fetching long blocks of data and by reusing 2-dimensional memory streams at this second-level register file, we obtain a significant increase in effective memory bandwidth. As side benefits, the new 3-dimensional load instructions provide high robustness to memory latency and a significant reduction in cache activity, thus reducing power and energy requirements. At the cost of 50% more area than a regular SIMD register file, we measure an average speedup of 13% and a potential for 30% power savings in the L2 cache.
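A rough software model of the mechanism, under assumed sizes and names: one 3-dimensional load sweeps memory once into a second-level register file, and the compute loop then reuses the buffered 2-dimensional slices instead of re-issuing cache accesses, which is where the effective-bandwidth gain comes from.

    #include <stdio.h>

    #define PLANES 4   /* outer dimension of the 3-D pattern */
    #define ROWS   8   /* rows per 2-D slice                 */
    #define COLS  16   /* elements per row (one vector load) */

    static int memory[PLANES * ROWS * COLS];   /* backing "DRAM"       */
    static int l2rf[PLANES][ROWS][COLS];       /* second-level reg file */

    /* One 3-D load instruction: three nested strides, one memory sweep. */
    static void load3d(const int *base, int pstride, int rstride)
    {
        for (int p = 0; p < PLANES; p++)
            for (int r = 0; r < ROWS; r++)
                for (int c = 0; c < COLS; c++)
                    l2rf[p][r][c] = base[p * pstride + r * rstride + c];
    }

    int main(void)
    {
        for (int i = 0; i < PLANES * ROWS * COLS; i++)
            memory[i] = i;

        load3d(memory, ROWS * COLS, COLS);

        /* The kernel now reuses each 2-D slice several times without
           touching the cache again. */
        long sum = 0;
        for (int pass = 0; pass < 3; pass++)      /* reuse, no refetch */
            for (int p = 0; p < PLANES; p++)
                for (int r = 0; r < ROWS; r++)
                    for (int c = 0; c < COLS; c++)
                        sum += l2rf[p][r][c];
        printf("checksum %ld\n", sum);
        return 0;
    }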
Corbal, J.; Espasa, R.; Valero, M. International Conference on Parallel Architectures and Compilation Techniques, p. 83-94. DOI: 10.1109/PACT.2001.953290. Presentation date: 2001-09. Conference presentation.
Many important multimedia applications contain a significant fraction of reduction operations. Although multimedia applications are, in general, characterized by high amounts of data-level parallelism, reductions and accumulations are difficult to parallelize and show poor tolerance to increases in instruction latency. This is especially significant for µ-SIMD extensions such as MMX or AltiVec. To overcome the problem of reductions in µ-SIMD ISAs, designers tend to include ever more complex instructions able to deal with the most common forms of reductions in multimedia. As the number of processor pipeline stages grows, the number of cycles needed to execute these multimedia instructions increases with every processor generation, severely compromising performance. The paper presents an in-depth discussion of how reductions/accumulations are performed in current µ-SIMD architectures and evaluates the performance trade-offs for near-future, highly aggressive superscalar processors with three different styles of µ-SIMD extensions. We compare an MMX-like alternative to an MDMX-like extension that has packed accumulators to attack the reduction problem, and we also compare it to MOM, a matrix-register ISA. We show that while packed accumulators present several advantages, they introduce artificial recurrences that severely degrade performance for processors with a large number of registers and long-latency operations. On the other hand, the paper demonstrates that longer SIMD media extensions such as MOM can take great advantage of accumulators by exploiting the associative parallelism implicit in reductions.
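The recurrence problem and the associative fix can be shown in plain C: a scalar reduction chains every addition through one accumulator, while keeping several independent partial sums (the software analogue of packed accumulators) breaks the chain and exposes the associative parallelism the paper exploits. The width W = 8 and the code are illustrative only.

    #include <stdio.h>

    #define W 8   /* illustrative accumulator width */

    static long reduce_partial(const short *x, int n)
    {
        long acc[W] = {0};                 /* W independent accumulators */
        for (int i = 0; i + W <= n; i += W)
            for (int k = 0; k < W; k++)    /* no cross-iteration chain  */
                acc[k] += x[i + k];
        long sum = 0;
        for (int k = 0; k < W; k++)        /* final cross-lane combine  */
            sum += acc[k];
        for (int i = n - n % W; i < n; i++)
            sum += x[i];                   /* scalar tail               */
        return sum;
    }

    int main(void)
    {
        short x[100];
        for (int i = 0; i < 100; i++) x[i] = (short)i;
        printf("%ld\n", reduce_partial(x, 100));   /* prints 4950 */
        return 0;
    }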
Ayguade, E.; Dahlgren, F.; Eisenbeis, C.; Espasa, R.; Gao, G.; Muller, H.; Sakellariou, R.; Seznec, A. International European Conference on Parallel and Distributed Computing, p. 385. DOI: 10.1007/3-540-44681-8_56. Presentation date: 2001-08. Conference presentation.
The papers presented in this combined topic consider issues related to the broad theme of computer architecture research. The program reflects the current emphasis of research on the exploitation of instruction-level parallelism and thread-level parallelism, with the papers presented covering several important aspects of both approaches: branch prediction, speculative multithreading, pipelining and superscalar architecture design, SIMD extensions, and dynamic scheduling issues in multithreaded architectures.
Quintana, F.; Corbal, J.; Espasa, R.; Valero, M. ACM Symposium on Parallelism in Algorithms and Architectures, p. 103-112. DOI: 10.1145/378580.378602. Presentation date: 2001-07. Conference presentation.
This paper analyzes the performance of vector-dominated regions of code in numerical and multimedia applications on a superscalar+vector architecture and compares it to an 8-way superscalar processor. The ability to split a program's execution into scalar and vector regions allows us to show that (1) as expected, the vector unit is much better than the wide-issue superscalar at executing the vector-dominated regions of the code; (2) on the scalar regions, the 8-way superscalar, although better than a 4-way superscalar, is clearly not worth the extra complexity in terms of extra transistors and potential cycle-time limitations. Overall, the vector-enhanced superscalar is from 6% to 303% better than an 8-way superscalar. We also present detailed data on the performance of the memory system, which is usually the key limiting factor when running numerical and multimedia applications. We evaluate two additional cache designs that try to alleviate problems created by non-unit-stride memory references.
Corbal, J.; Espasa, R.; Valero, M. Seventh International Symposium on High Performance Computer Architecture (HPCA-7), p. 219-228. DOI: 10.1109/HPCA.2001.903265. Presentation date: 2001-01. Conference presentation.
Future media workloads will require about two orders of magnitude more performance than current general-purpose processors achieve. High single-threaded performance will be needed to meet real-time constraints, together with huge computational throughput, as the next generation of media workloads (MPEG-4/MPEG-7) will be eminently multithreaded. In order to meet the challenge of providing both good single-threaded performance and high throughput, we propose to combine the simultaneous multithreading (SMT) execution paradigm with the ability to execute media-oriented streaming µ-SIMD instructions. This paper evaluates the performance of two different aggressive SMT processors: one with conventional µ-SIMD extensions (such as MMX) and one with longer streaming vector µ-SIMD extensions. We show that future media workloads are, in fact, dominated by scalar performance. The combination of SMT plus streaming vector µ-SIMD helps alleviate the performance bottleneck of the integer unit. SMT allows "hiding" vector execution underneath integer execution by overlapping the two types of computation, while the streaming vector µ-SIMD extension reduces the pressure on issue width and fetch bandwidth, and provides a powerful latency-tolerance mechanism that allows implementing smart decoupled cache hierarchies.
This paper proposes and evaluates MOM: a novel ISA paradigm targeted at multimedia applications. By fusing conventional vector ISA approaches with more recent SIMD-like (single instruction, multiple data) ISAs such as MMX, we have developed a new matrix-oriented ISA which efficiently deals with the small matrix structures typically found in multimedia applications. MOM exploits a level of DLP reachable by neither conventional vector ISAs nor SIMD-like media ISA extensions. Our results show that MOM provides a 1.3x to 4x performance improvement when compared with two different multimedia extensions (MMX and MDMX) on several kernels, which translates into up to 50% performance gain when measuring full applications (20% on average). Furthermore, the streaming nature of MOM provides additional advantages for executing multimedia applications, such as very low fetch pressure and high tolerance to memory latency, making MOM an ideal candidate for the embedded domain.
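The "small matrix structures" MOM targets are blocks like the 8x8 tiles of video codecs; a C rendering of one matrix-style operation, with the inner loop standing for the packed (MMX-like) dimension and the outer loop for the vector dimension, may clarify the two nested levels of DLP. This is a sketch of the idea, not the actual ISA.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 8   /* block size typical of video codecs */

    /* One matrix-style op: 8x8 sum of absolute differences (SAD). */
    static int sad8x8(const unsigned char a[N][N], const unsigned char b[N][N])
    {
        int sad = 0;
        for (int r = 0; r < N; r++)          /* vector dimension (rows)  */
            for (int c = 0; c < N; c++)      /* packed dimension (lanes) */
                sad += abs(a[r][c] - b[r][c]);
        return sad;
    }

    int main(void)
    {
        unsigned char a[N][N], b[N][N];
        for (int r = 0; r < N; r++)
            for (int c = 0; c < N; c++) {
                a[r][c] = (unsigned char)(r * N + c);
                b[r][c] = (unsigned char)(r * N + c + 1);
            }
        printf("SAD = %d\n", sad8x8(a, b));   /* 64 differences of 1 = 64 */
        return 0;
    }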
The importance of media processing has produced a revolution in the design of embedded processors. In order to face the high computational and technological demands of near-future media applications, new embedded processors are including features that were commonly restricted to the general-purpose and supercomputing domains. In this paper we evaluate the performance of various DLP (data-level parallelism) oriented embedded architectures and analyze quantitative data in order to determine the strengths and disadvantages of each approach. Additionally, we analyze the differences between the explicitly parallel versions of code (often based on the standard algorithms) and the highly tuned, non-vectorizable versions usually found in real multimedia programs. We show that sub-word SIMD architectures (like MMX) are a very cost-effective solution, and that, while long vector architectures provide few improvements at a very high cost, a smart combination of vector and SIMD-like architectures is the alternative that delivers the best performance at a reasonable cost. We also show that the memory-latency tolerance typical of vector architectures is partially offset by the worse spatial locality found when executing vector code.
MOM is a novel matrix-oriented ISA paradigm for multimedia applications, based on fusing conventional vector ISAs with SIMD ISAs such as MMX. This paper justifies why MOM is a suitable alternative for the multimedia domain due to its efficiency in handling the small matrix structures typically found in most multimedia kernels. MOM delivers a performance boost of between 1.3x and 4x over more conventional multimedia extensions (such as MMX and MDMX), which themselves achieve performance benefits ranging from 1.3x to 15x over conventional Alpha code. Moreover, MOM exhibits high relative performance at low issue rates and a high tolerance to memory latency. Both advantages make MOM an attractive alternative for the embedded domain.
Decoupling techniques can be applied to a vector processor, resulting in a large increase in the performance of vectorizable programs. We simulate a selection of the Perfect Club and Specfp92 benchmark suites and compare their execution time on a conventional single-port vector architecture with that of a decoupled vector architecture. Decoupling increases performance by a factor greater than 1.4 for realistic memory latencies, and for an ideal memory system with zero latency there is still a speedup of as much as 1.3. A significant portion of this paper is devoted to studying the trade-offs involved in choosing a suitable size for the queues of the decoupled architecture. The hardware cost of the queues need not be large to achieve most of the performance advantages of decoupling.
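A toy cycle-counting model can illustrate the queue-size trade-off the paper studies: an address unit runs ahead of the compute unit by at most the queue depth, so once the queue covers the memory latency the compute unit is never starved. The depth, latency, and accounting below are schematic assumptions, not the simulated machine.

    #include <stdio.h>

    #define QSIZE   8     /* load data queue depth (the trade-off studied) */
    #define LATENCY 6     /* memory latency in cycles                      */
    #define N      64

    int main(void)
    {
        int ready_cycle[N];            /* when each load's data arrives */
        int issue = 0, consume = 0, cycle = 0;
        long sum = 0;

        while (consume < N) {
            /* Address unit: issue one load per cycle while the queue
               has room. */
            if (issue < N && issue - consume < QSIZE) {
                ready_cycle[issue] = cycle + LATENCY;
                issue++;
            }
            /* Compute unit: consume the oldest load once its data is back. */
            if (consume < issue && ready_cycle[consume] <= cycle) {
                sum += consume;
                consume++;
            }
            cycle++;
        }
        /* With QSIZE >= LATENCY the steady state is one element per cycle;
           a shallow queue stalls the compute unit every iteration. */
        printf("queue=%d latency=%d -> %d cycles for %d elements\n",
               QSIZE, LATENCY, cycle, N);
        return 0;
    }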
The focus of this paper is on adding a vector unit to a superscalar core as a way to scale current state-of-the-art superscalar processors. The proposed architecture has a vector register file that shares functional units with both the integer datapath and the floating-point datapath. A key point in our proposal is the design of a high-performance cache interface that delivers high bandwidth to the vector unit at low cost and low latency. We propose a double-banked cache with alignment circuitry to serve vector accesses, and we study two cache hierarchies: one feeds the vector unit from the L1, the other from the L2. Our results show that large IPC values (higher than 10 in some cases) can be achieved. Moreover, scaling our architecture simply requires adding functional units, without requiring more issue bandwidth. As a consequence, the proposed vector unit achieves high performance for numerical and multimedia codes with minimal impact on the cycle time of the processor or on the performance of integer codes.
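The double-banked interface can be pictured with a small address-classification sketch: interleaving words across two banks lets any pair of consecutive words of a unit-stride vector access be read in the same cycle and then rotated into place by the alignment circuitry. The bank count and line size here are assumptions for illustration, not the paper's parameters.

    #include <stdio.h>

    #define WORDS_PER_LINE 8

    static void classify(unsigned addr_word)   /* address in words */
    {
        unsigned bank   = addr_word & 1;              /* even/odd banking  */
        unsigned offset = addr_word % WORDS_PER_LINE; /* place within line */
        printf("word %2u -> bank %u, line offset %u\n",
               addr_word, bank, offset);
    }

    int main(void)
    {
        /* A misaligned unit-stride access starting at word 3: each pair of
           consecutive words still splits across the two banks, so the
           access proceeds at two words per cycle after alignment. */
        for (unsigned w = 3; w < 11; w++)
            classify(w);
        return 0;
    }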
In this work we study the influence of the vector register size on two different concepts of vector architecture. Long vector registers play an important role in a conventional vector architecture; however, even with highly vectorisable codes, only a small fraction of those large vector registers is used. Reducing the vector register size on a conventional vector architecture results in severe performance degradation, with slowdowns in the range of 1.8 to 3.8. When we add out-of-order execution to a vector architecture, the need for long vector registers is reduced. We used a trace-driven approach to simulate a selection of the Perfect Club and Specfp92 programs. The simulation results show that reducing the register size on an out-of-order vector architecture leads to slowdowns in the range of 1.04 to 1.9. These compare favourably with the values found for a conventional vector machine. Even when reducing the register size to 1/4 of the original on an out-of-order machine, the slowdown was between 1.04 and 1.5, still better than on a conventional vector machine. Finally, comparing both architectures with the same register-file size (8 KB), we found that the performance gains from out-of-order execution were between 1.13 and 1.40.
This paper presents a comparison between superscalar and vector processors. We start with a detailed ISA analysis of the vector machine, including data related to masked execution, vector length, and vector-first facilities. We then compare the two models at the instruction set architecture (ISA) level, showing that the vector model has several advantages: it executes fewer instructions, fewer overall operations, and generally fewer memory accesses. We then analyse both models in terms of speculative execution, each in its own context. Results show that superscalar processors make extensive use of speculation and that a large number of instructions are misspeculated. In the vector model, speculation is achieved using vector masks and, in general, fewer operations are misspeculated.
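The speculation contrast in the last abstract reduces to a familiar code shape, sketched below in C: the scalar loop pays one hard-to-predict branch per element, while the vector form computes a mask once and executes the operation under it, so nothing is misspeculated. The vector length and data are illustrative.

    #include <stdio.h>

    #define VL 8   /* vector length, an illustrative value */

    int main(void)
    {
        int x[VL] = {5, -2, 7, 0, -9, 3, -1, 4};
        int y[VL] = {0};

        /* Scalar form: one data-dependent branch per element, which a
           superscalar must predict and may misspeculate. */
        for (int i = 0; i < VL; i++)
            if (x[i] > 0)
                y[i] = x[i] * 2;

        /* Vector-mask form: compare once into a mask, then execute the
           multiply for all lanes under the mask; no branches on data. */
        int mask[VL], z[VL] = {0};
        for (int i = 0; i < VL; i++) mask[i] = x[i] > 0;          /* vcmp      */
        for (int i = 0; i < VL; i++) if (mask[i]) z[i] = x[i] * 2; /* masked op */

        for (int i = 0; i < VL; i++)
            printf("%d ", z[i]);
        printf("\n");
        return 0;
    }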