Alvarez Mesa, Mauricio
Total activity: 20
Department
Department of Computer Architecture
School
Barcelona School of Informatics (FIB)
E-mail
mauricio.alvarezestudiant.upc.edu
Contact details
UPC directory
Orcid
0000-0002-1751-0993



Scientific and technological production

1 to 20 of 20 results
  • PARALLEL VIDEO DECODING  Open access

     Alvarez Mesa, Mauricio
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


    Digital video is a popular technology used in many different applications. The quality of video, expressed in spatial and temporal resolution, has been increasing continuously in recent years. In order to reduce the bitrate required for its storage and transmission, a new generation of video encoders and decoders (codecs) has been developed. The latest video codec standard, known as H.264/AVC, includes sophisticated compression tools that require more computing resources than any previous video codec. The combination of high quality video and the advanced compression tools found in H.264/AVC has resulted in a significant increase in the computational requirements of video decoding applications. The main objective of this thesis is to provide the performance required for real-time operation of high quality video decoding using programmable architectures. Our solution has been the simultaneous exploitation of multiple levels of parallelism. On the one hand, video decoders have been modified in order to extract as much parallelism as possible. On the other hand, general purpose architectures have been enhanced to exploit the type of parallelism that is present in video codec applications.

    First, we analyzed the scalability of two Single Instruction, Multiple Data (SIMD) extensions: a one-dimensional (1D) extension and a two-dimensional (2D) matrix extension. We showed that scaling the 2D extension yields higher performance with lower complexity than scaling the 1D extension. We then characterized H.264/AVC decoding for high definition (HD) applications and identified the main kernels. Because no suitable benchmark existed for HD video decoding, we developed our own, called HD-VideoBench, which includes complete video encoding and decoding applications together with a set of HD video sequences. Next, we optimized the most important kernels of the H.264/AVC decoder using SIMD instructions. However, the results did not reach the maximum possible performance because of the negative effect of data misalignment in memory. As a solution, we evaluated the hardware and software support needed to perform unaligned accesses. This support produced significant performance improvements in the application.

    Separately, we investigated how to extract task-level parallelism. We found that none of the existing mechanisms could scale to massively parallel systems. As an alternative, we developed a new algorithm that was able to find thousands of independent tasks by exploiting macroblock-level parallelism. We then implemented a parallel version of the H.264 decoder on a distributed shared memory (DSM) machine. However, this implementation did not reach the maximum possible performance because of the negative impact of synchronization operations and the effect of the entropy decoding kernel. To remove these bottlenecks, we evaluated frame-level parallelization of the entropy decoding stage combined with macroblock-level parallelization of the remaining kernels. The overhead of the synchronization operations was almost completely eliminated by using hardware-accelerated operations. Together, these improvements enabled real-time decoding of high definition, high frame rate video. As an overall result, we created a scalable solution capable of using the growing number of processors in multicore architectures.

  • Learning the principles of parallel computing with games

     Alvarez Mesa, Mauricio; Bofill Soliguer, Pau; Sánchez Castaño, Friman; Farreras Esclusa, Montserrat
    Active Learning in Engineering Education Workshop
    Presentation of work at congresses


    The trend towards parallel computers requires a fundamental change in the way software is developed in order to maintain performance scalability. Because of that, most software developers need a solid knowledge of how to develop parallel programs. In this paper we present a methodology for learning parallel computing that gives priority to the general principles rather than the technologies that use them. Parallel computing is presented as a specific case of the general coordination problem and, based on that, the fundamental issues of coordination systems are presented. Coordination is modelled as a cooperative game, in which learners (players) contribute to a common goal. Two games are presented as examples: the "orange game", and a game based on the "dining philosophers" problem. These games use only ordinary materials (not computers) such as cards, drawing paper, colours and oranges, and make it possible to illustrate problems in coordination systems such as mutual exclusion and deadlocks.
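
The mutual exclusion and deadlock issues mentioned in this abstract can also be shown in software. The following is a minimal sketch of the dining philosophers problem, not the game described in the paper: each fork is a lock, and deadlock is avoided by one common technique, always picking up the lower-numbered fork first. The function name `dine` and its parameters are hypothetical names chosen for illustration.

```python
import threading


def dine(num_philosophers=5, meals=3):
    """Run the dining philosophers with ordered fork acquisition.

    Each fork is a lock (mutual exclusion). Every philosopher takes the
    lower-numbered fork first, so a circular wait cannot form.
    """
    forks = [threading.Lock() for _ in range(num_philosophers)]
    eaten = [0] * num_philosophers  # eaten[i] is written only by thread i

    def philosopher(i):
        left, right = i, (i + 1) % num_philosophers
        first, second = min(left, right), max(left, right)
        for _ in range(meals):
            with forks[first]:
                with forks[second]:
                    eaten[i] += 1  # "eat" while holding both forks

    threads = [
        threading.Thread(target=philosopher, args=(i,))
        for i in range(num_philosophers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return eaten


meals_eaten = dine()
```

If every philosopher instead took the left fork first, all five could hold one fork and wait forever for the other, which is the deadlock the game is meant to illustrate.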

  • The SARC architecture  Open access

     Ramirez Bellido, Alejandro; Cabarcas Jaramillo, Felipe; Juurlink, Ben; Alvarez Mesa, Mauricio; Sanchez Castaño, Friman; Azevedo, Arnaldo; Meenderinck, Cor; Ciobanu, Catalin; Isaza, Sebastian; Gaydadjiev, Georgi
    IEEE micro
    Vol. 30, num. 5, p. 16-29
    Date of publication: 2010-10
    Journal article


    The SARC architecture is composed of multiple processor types and a set of user-managed direct memory access (DMA) engines that let the runtime scheduler overlap data transfer and computation. The runtime system automatically allocates tasks on the heterogeneous cores and schedules the data transfers through the DMA engines. SARC's programming model supports various highly parallel applications, with matching support from specialized accelerator processors.

  • Scalability of macroblock-level parallelism for H.264 decoding

     Alvarez Mesa, Mauricio; Ramirez Bellido, Alejandro; Martorell, Xavier; Ayguade Parra, Eduard
    International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems
    p. 59-62
    Presentation's date: 2010-07-15
    Presentation of work at congresses


  • Parallel scalability of video decoders

     Meenderinck, Cor; Azevedo, Arnaldo; Juurlink, Ben; Alvarez Mesa, Mauricio; Ramirez Bellido, Alejandro
    Journal of Signal Processing Systems
    Vol. 57, num. 2, p. 173-194
    DOI: 10.1007/s11265-008-0256-9
    Date of publication: 2009-11
    Journal article


    An important question is whether emerging and future applications exhibit sufficient parallelism, in particular thread-level parallelism, to exploit the large numbers of cores future chip multiprocessors (CMPs) are expected to contain. As a case study we investigate the parallelism available in video decoders, an important application domain now and in the future. Specifically, we analyze the parallel scalability of the H.264 decoding process. First we discuss the data structures and dependencies of H.264 and show what types of parallelism can be exploited. We also show that previously proposed parallelization strategies, such as slice-level, frame-level, and intra-frame macroblock (MB) level parallelism, are not sufficiently scalable. Based on the observation that inter-frame dependencies have a limited spatial range, we propose a new parallelization strategy, called Dynamic 3D-Wave. It allows certain MBs of consecutive frames to be decoded in parallel. Using this new strategy we analyze the limits to the available MB-level parallelism in H.264. Using real movie sequences we find a maximum MB parallelism ranging from 4000 to 7000. We also perform a case study to assess the practical value and possibilities of a highly parallelized H.264 application. The results show that H.264 exhibits sufficient parallelism to efficiently exploit the capabilities of future manycore CMPs.
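
To make the limit of intra-frame MB-level parallelism concrete, the sketch below models a single frame only; it is not the Dynamic 3D-Wave analysis from the paper. It assumes that each macroblock depends on its left, upper-left, upper and upper-right neighbours and that every macroblock takes one time slot to decode. Both are simplifying assumptions that the abstract does not state, and the function names are hypothetical.

```python
from collections import Counter


def wavefront_slots(width_mbs, height_mbs):
    """Earliest time slot in which each macroblock can be decoded.

    Assumed dependencies: left, upper-left, upper and upper-right
    neighbours. Raster order guarantees they are computed first.
    """
    slot = [[0] * width_mbs for _ in range(height_mbs)]
    for y in range(height_mbs):
        for x in range(width_mbs):
            deps = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            ready = [
                slot[dy][dx] + 1
                for dx, dy in deps
                if 0 <= dx < width_mbs and 0 <= dy < height_mbs
            ]
            slot[y][x] = max(ready, default=0)
    return slot


def parallelism_profile(width_mbs, height_mbs):
    """Number of macroblocks that can be decoded in each time slot."""
    slots = wavefront_slots(width_mbs, height_mbs)
    return Counter(s for row in slots for s in row)


# A 1920x1088 frame holds 120x68 macroblocks of 16x16 pixels.
profile = parallelism_profile(120, 68)
max_parallel = max(profile.values())
total_slots = len(profile)
```

Under these assumptions at most 60 macroblocks of a 120x68 frame are independent at any time, far below the 4000 to 7000 reported for the inter-frame strategy, which is consistent with the claim that intra-frame parallelism alone does not scale.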

  • Performance evaluation of macroblock-level parallelization of H.264 decoding on a cc-NUMA multiprocessor architecture

     Alvarez Mesa, Mauricio; Ramirez Bellido, Alejandro; Valero Cortes, Mateo; Azevedo, Arnaldo; Meenderinck, Cor; Juurlink, Ben
    Avances en sistemas e informática
    Vol. 6, num. 1, p. 219-228
    Date of publication: 2009-06
    Journal article


    This article presents a study of the performance scalability of macroblock-level parallelism in an H.264 decoder for high definition (HD) applications on multiprocessor architectures. We implemented this parallelization on a cache-coherent Non-Uniform Memory Access (cc-NUMA) shared memory multiprocessor (SMP) and compared the results with the theoretical expectations. The study evaluates three scheduling techniques: static, dynamic, and dynamic with tail-submit. The dynamic scheduling approach with the tail-submit optimization performs best, reaching a maximum speed-up of 9.5 with 24 processors. A detailed analysis revealed that the handling of synchronization is one of the factors limiting better scalability. The article includes an evaluation of the impact of blocking synchronization APIs such as POSIX threads and POSIX real-time extensions. The results showed that macroblock-level parallelism, as a very fine-grained form of Thread-Level Parallelism (TLP), is strongly affected by thread synchronization; other methods, possibly with hardware support, are required to make macroblock-level parallelization more scalable.

  • A highly scalable parallel implementation of H.264

     Azevedo, Arnaldo; Juurlink, Ben; Meenderinck, Cor; Terechko, Andrei; Hoogerbrugge, Jan; Alvarez Mesa, Mauricio; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    Transactions on High-Performance Embedded Architectures and Compilers (Transactions on HiPEAC)
    Vol. 4, num. 2, p. 1-25
    DOI: 10.1007/978-3-642-24568-8-6
    Date of publication: 2009-09-21
    Journal article


    Developing parallel applications that can harness and efficiently use future many-core architectures is the key challenge for scalable computing systems. We contribute to this challenge by presenting a parallel implementation of H.264 that scales to a large number of cores. The algorithm exploits the fact that independent macroblocks (MBs) can be processed in parallel, but whereas a previous approach exploits only intra-frame MB-level parallelism, our algorithm exploits intra-frame as well as inter-frame MB-level parallelism. It is based on the observation that inter-frame dependencies have a limited spatial range. The algorithm has been implemented on a many-core architecture consisting of NXP TriMedia TM3270 embedded processors. This required developing a subscription mechanism, where MBs are subscribed to the kick-off lists associated with the reference MBs. Extensive simulation results show that the implementation scales very well, achieving a speedup of more than 54 on a 64-core processor, in which case the previous approach achieves a speedup of only 23. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase since there can be many frames in flight, and that the frame latency might increase. Scheduling policies to address these drawbacks are also presented. The results show that these policies combat memory and latency issues with a negligible effect on the performance scalability. Results analyzing the impact of the memory latency, L1 cache size, and the synchronization and thread management overhead are also presented. Finally, we present performance requirements for entropy (CABAC) decoding.
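
The subscription idea in this abstract can be sketched in a few lines. The code below is a single-threaded illustration under stated assumptions, not the TriMedia implementation: the class `KickOffScheduler`, its methods, and the `(frame, x, y)` macroblock identifiers are all hypothetical. A macroblock subscribes to the kick-off list of every reference it still waits for, and becomes ready once its last reference has been decoded.

```python
from collections import defaultdict, deque


class KickOffScheduler:
    """Toy dependency scheduler based on per-macroblock kick-off lists."""

    def __init__(self):
        self.kick_off = defaultdict(list)  # reference MB -> subscribed MBs
        self.pending = {}                  # MB -> undecoded reference count
        self.decoded = set()
        self.ready = deque()

    def add(self, mb, references):
        """Register a macroblock and subscribe it to its open references."""
        waiting = [r for r in references if r not in self.decoded]
        self.pending[mb] = len(waiting)
        for ref in waiting:
            self.kick_off[ref].append(mb)
        if not waiting:
            self.ready.append(mb)

    def run(self):
        """Decode ready macroblocks and kick off their subscribers."""
        order = []
        while self.ready:
            mb = self.ready.popleft()
            order.append(mb)  # stand-in for the actual decoding work
            self.decoded.add(mb)
            for sub in self.kick_off.pop(mb, []):
                self.pending[sub] -= 1
                if self.pending[sub] == 0:
                    self.ready.append(sub)
        return order


scheduler = KickOffScheduler()
scheduler.add((0, 0, 0), [])                        # no dependencies
scheduler.add((0, 1, 0), [(0, 0, 0)])               # intra-frame dependency
scheduler.add((1, 0, 0), [(0, 0, 0), (0, 1, 0)])    # inter-frame references
decode_order = scheduler.run()
```

In a parallel decoder the ready queue would be drained by several worker threads; the sketch keeps one loop so that the subscribe and kick-off steps stay easy to follow.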

  • Performance evaluation of macroblock-level parallelization of H.264 decoding on a cc-NUMA multiprocessor architecture

     Alvarez Mesa, Mauricio; Ramirez Bellido, Alejandro; Valero Cortes, Mateo; Azevedo, Arnaldo; Meenderinck, Cor; Juurlink, Ben
    Colombian Computing Conference
    p. 108-117
    Presentation's date: 2009-04-23
    Presentation of work at congresses


    This paper presents a study of the performance scalability of a macroblock-level parallelization of the H.264 decoder for High Definition (HD) applications on a multiprocessor architecture. We have implemented this parallelization on a cache coherent Non-uniform Memory Access (cc-NUMA) shared memory multiprocessor (SMP) and compared the results with the theoretical expectations. Three different scheduling techniques were analyzed: static, dynamic and dynamic with tail-submit. A dynamic scheduling approach with a tail-submit optimization presents the best performance obtaining a maximum speed-up of 9.5 using 24 processors. A detailed profiling analysis showed that thread synchronization is one of the limiting factors for achieving a better parallel scalability. The paper includes an evaluation of the impact of using blocking synchronization APIs like POSIX threads and POSIX real-time extensions. Results showed that macroblock-level parallelism as a very fine-grain form of Thread-Level Parallelism (TLP) is highly affected by the thread synchronization overhead generated by these APIs. Other synchronization methods, possibly with hardware support, are required in order to make MB-level parallelization more scalable.
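
The reported speed-up can be read as a parallel efficiency. The short sketch below uses only the two figures from this abstract (a speed-up of 9.5 on 24 processors) and two standard definitions: efficiency as speed-up divided by processor count, and the Karp-Flatt metric as an experimentally determined serial fraction. The function names are hypothetical, and the paper itself may use a different model.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speed-up that was achieved."""
    return speedup / processors


def karp_flatt(speedup, processors):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1 / speedup - 1 / processors) / (1 - 1 / processors)


# Figures taken from the abstract: speed-up of 9.5 using 24 processors.
efficiency = parallel_efficiency(9.5, 24)
serial_fraction = karp_flatt(9.5, 24)
```

An efficiency of roughly 40% fits the abstract's conclusion that synchronization overhead, rather than a lack of available parallelism, limits scalability on this machine.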

  • Parallel H.264 decoding on an embedded multicore processor

     Azevedo, Arnaldo; Meenderinck, Cor; Juurlink, Ben; Terechko, Andrei; Hoogerbrugge, Jan; Alvarez Mesa, Mauricio; Ramirez Bellido, Alejandro
    International Conference on High Performance and Embedded Architectures and Compilers
    p. 404-418
    DOI: 10.1007/978-3-540-92990-1_29
    Presentation's date: 2009-01-25
    Presentation of work at congresses


    In previous work the 3D-Wave parallelization strategy was proposed to increase the parallel scalability of H.264 video decoding. This strategy is based on the observation that inter-frame dependencies have a limited spatial range. The previous results, however, investigate application scalability on an idealized multiprocessor. This work presents an implementation of the 3D-Wave strategy on a multicore architecture composed of NXP TriMedia TM3270 embedded processors. The results show that the parallel H.264 implementation scales very well, achieving a speedup of more than 54 on a 64-core processor. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase since there can be many frames in flight, and that the latencies of some frames might increase. To address these drawbacks, policies to reduce the number of frames in flight and the frame latency are also presented. The results show that our policies combat memory and latency issues with a negligible effect on the performance scalability.

  • Is engagement with a purpose the essence of active learning?  Open access

     Alvarez Mesa, Mauricio
    Active Learning in Engineering Education Workshop
    Presentation's date: 2009-06-10
    Presentation of work at congresses


    In the 2009 edition of the conference on “Active Learning in Engineering Education”, there were several fruitful discussions within a small workgroup about the essence of active learning. In the end we came up with an attempt to sum up our whole discussion in one question. Our question is the same as the title of this essay. Taking this question as a starting point, this article proposes a specific purpose on which active learning can be based.

  • Scalability of macroblock-level parallelism for H.264 decoding  Open access

     Alvarez Mesa, Mauricio; Ramirez Bellido, Alejandro; Azevedo, Arnaldo; Meenderinck, Cor; Juurlink, Ben; Valero Cortes, Mateo
    IEEE International Conference on Parallel and Distributed Systems
    p. 236-243
    DOI: 10.1109/ICPADS.2009.124
    Presentation's date: 2009-12-09
    Presentation of work at congresses


    This paper investigates the scalability of macroblock (MB) level parallelization of the H.264 decoder for High Definition (HD) applications. The study includes three parts. First, a formal model predicts the maximum performance that can be obtained, taking into account the variable processing time of tasks and the thread synchronization overhead. Second, an implementation on a real multiprocessor architecture includes a comparison of different scheduling strategies and a profiling analysis that identifies the performance bottlenecks. Finally, a trace-driven simulation methodology is used to identify acceleration opportunities for removing the main bottlenecks. It covers the acceleration potential of the entropy decoding stage and of thread synchronization and scheduling. Our study presents a quantitative analysis of the main bottlenecks of the application and estimates the acceleration levels that are required to make the MB-level parallel decoder scalable.

    Postprint (author’s final draft)

  • Parallel Scalability of Video Decoders

     Meenderinck, Cor; Azevedo, Arnaldo; Juurlink, Ben; Alvarez Mesa, Mauricio; Ramirez González, Alejandro
    Date: 2008-05
    Report


  • H.264/AVC DECODER PARALLELIZATION IN CONTEXT OF CABAC ENTROPY DECODER

     Muhammad, Shafiq; Alvarez Mesa, Mauricio; Gil Gómez, Maria Luisa; Navarro Mas, Nacho
    Date: 2008-07
    Report


  • Analysis of Video Filtering on the Cell Processor

     Azevedo, Arnaldo; Meenderinck, Cor; Juurlink, Ben; Alvarez Mesa, Mauricio; Ramirez Bellido, Alejandro
    IEEE International Symposium on Circuits and Systems
    p. 1
    Presentation of work at congresses


  • Performance Impact of Unaligned Memory Operations in SIMD Extensions for Video CODEC Applications

     Alvarez Mesa, Mauricio; Salamí San Juan, Esther; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    2007 IEEE International Symposium on Performance Analysis of Systems And Software (ISPASS'07)
    p. 62-71
    Presentation of work at congresses


  • HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications

     Alvarez Mesa, Mauricio; Salamí San Juan, Esther; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    2007 IEEE International Symposium on Workload Characterization, IISWC-2007
    p. 120-128
    Presentation of work at congresses


  • On the Scalability of 1- and 2- Dimensional SIMD Extensions for Multimedia Applications

     Sanchez Castaño, Friman; Alvarez Mesa, Mauricio; Salamí San Juan, Esther; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    2005 IEEE International Symposium on Performance Analysis of Systems And Software (ISPASS'05)
    p. 167-176
    Presentation of work at congresses


  • A performance evaluation of high definition digital video decoding using the H.264/AVC standard

     Alvarez Mesa, Mauricio; Salamí San Juan, Esther; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems
    p. 255-258
    Presentation's date: 2005
    Presentation of work at congresses


  • A performance characterization of high definition digital video decoding using H.264/AVC

     Alvarez Mesa, Mauricio; Salamí San Juan, Esther; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    2005 IEEE International Symposium on Workload Characterization, IISWC-2005
    p. 24-33
    Presentation of work at congresses


  • Scalability and Complexity of 2-Dimensional SIMD Extensions

     Alvarez Mesa, Mauricio; Sanchez Castaño, Friman; Salamí San Juan, Esther; Ramirez Bellido, Alejandro; Valero Cortes, Mateo
    Jornadas de Paralelismo
    Presentation of work at congresses
