Scientific and technological production

1 to 50 of 254 results
  • File System Metadata Virtualization

     Artiaga Amouroux, Ernest
    Defense's date: 2014-01-17
    Universitat Politècnica de Catalunya
    Theses

    The advancement of computing systems has created new ways of using and accessing data, pushing the architecture of traditional file systems to its limits and making them inadequate for the new needs. Current challenges affect both system performance and usability from the applications' perspective. On the one hand, high-performance computing systems are becoming aggregations of multiple computing elements in the form of clusters, grids or clouds. On the other hand, there is an ever wider range of scientific and commercial applications trying to exploit the new possibilities. The requirements of these applications are often heterogeneous, introducing large variations in file system usage patterns. Data centres have tried to compensate for this situation by providing different file systems for different needs. Typically, the characteristics and usage modes of these systems are communicated to users so that they take responsibility for using them properly. The same philosophy is used in personal computing environments, where there is usually a clear distinction between the portion of the file system name space dedicated to local storage, the part corresponding to remote file systems and, recently, the parts linked to cloud services such as directories for synchronizing data across devices, sharing files with other users, or making backups. In practice, this explicit exposure of functionality hinders the usability of file systems and the possibility of reaping all their potential benefits. In this thesis we consider that these difficulties can be alleviated by having characteristics and functionality determined file by file, rather than based on the location within a rigid directory tree.
    Moreover, usability could be increased by providing multiple dynamic name spaces that can adapt to the specific needs of different applications. This thesis contributes to this goal by proposing a mechanism to decouple the user's view of the storage system from its actual structure. This mechanism consists of virtualizing the file system metadata (including the name space and the object attributes) and interposing an intelligent layer that determines how files should be stored in order to get the most benefit from the underlying file systems, without incurring performance or usability problems caused by inadequate use of the low-level systems. This technique makes it possible to offer, simultaneously, multiple virtual views of the name space and of the file system object attributes, which can adapt to the specific needs of applications without modifying the organization of the underlying storage system. The first contribution of the thesis introduces the design of a metadata virtualization infrastructure that makes the aforementioned decoupling from the physical structure possible; the second contribution is a method to improve the performance of large-scale file systems by means of this decoupling; finally, the third contribution is a method that uses metadata virtualization to improve the usability of cloud-based storage systems.

  • Rendering of Bézier surfaces on handheld devices

     Concheiro Figueroa, Raquel; Amor López, Margarita; Padrón González, Emilio José; Gil Gómez, Maria Luisa; Martorell Bofill, Xavier
    Journal of WSCG (Plzen, Print)
    Date of publication: 2013
    Journal article

    Bézier surfaces have been widely employed in the design of complex scenes with high-quality results. Nevertheless, parametric surfaces cannot be directly rendered in the current GPUs of modern handheld devices. This work proposes a non-adaptive method for tessellating Bézier surfaces on a GPU without a primitive generator, such as the GPUs implemented in handheld devices. Our technique is based on the utilization of a parametric map of virtual vertices, and its operation can be adapted to the hardware resources available in the GPU by tuning a series of parameters. Additionally, an analysis of the most relevant hardware constraints in the graphics hardware of current handheld devices has been carried out. As those constraints prevent interactive high-quality results from being achieved, even with our proposal, we present an algorithmic approach focused on real-time rendering on future handheld devices.

  • A systematic methodology to generate decomposable and responsive power models for CMPs

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    IEEE transactions on computers
    Date of publication: 2013-07
    Journal article

    Power modeling based on performance monitoring counters (PMCs) has attracted the interest of researchers since it became a quick approach to understand the power behavior of real systems. Consequently, several power-aware policies use models to guide their decisions. Hence, the presence of power models that are informative, accurate, and capable of detecting power phases is critical to improve the success of power-saving techniques. Additionally, the design of current processors has varied considerably with the appearance of CMPs (multiple cores sharing resources). Thus, PMC-based power models warrant further investigation on current energy-efficient multicore processors. In this paper, we present a systematic methodology to produce decomposable PMC-based power models on current multicore architectures. Apart from being able to estimate the power consumption accurately, the models provide per-component power consumption, supplying extra insights about power behavior. Moreover, we study their responsiveness, that is, the capacity to detect power phases. Specifically, we produce power models for an Intel Core 2 Duo with one and two cores enabled for all the DVFS configurations. The models are empirically validated using the SPECcpu2006, NAS and LMBENCH benchmarks. Finally, we compare the models against existing approaches, concluding that the proposed methodology produces more accurate, responsive, and informative models.

  • An OpenMP* barrier using SIMD instructions for Intel® Xeon Phi™ coprocessor

     Caballero, Diego; Duran Gonzalez, Alejandro; Martorell Bofill, Xavier
    International Workshop on OpenMP
    Presentation's date: 2013-09
    Presentation of work at congresses

    Barrier synchronisation has been a widely studied topic since the supercomputer era due to its significant impact on the overall performance of parallel applications. With the current shift to many-core architectures, such as the Intel® Many Integrated Core Architecture, software barriers need to be revisited from an on-chip point of view to exploit their new specific resources. In this paper, we propose a tree-based barrier that takes advantage of SIMD instructions and the inter-thread cache locality provided by the 4-way SMT of the Intel® Xeon Phi™ coprocessor. Our SIMD approach shows a speed-up of up to 2.84x over the default Intel OpenMP* barrier in the EPCC barrier microbenchmark. It also improves the Livermore Loop kernel number six and the NAS MG benchmark by up to 60% and 21%, respectively.

  • Improving communication in PGAS environments: Static and dynamic coalescing in UPC

     Alvanos, Michail; Farreras Esclusa, Montserrat; Tiotto, Ettore; Amaral, José Nelson; Martorell Bofill, Xavier
    ACM/IEEE International Conference on Supercomputing
    Presentation's date: 2013-06
    Presentation of work at congresses

    The goal of Partitioned Global Address Space (PGAS) languages is to improve programmer productivity on large-scale parallel machines. However, PGAS programs may have many fine-grained shared accesses that lead to performance degradation. Manual code transformations or compiler optimizations are required to improve the performance of programs with fine-grained accesses. The downside of manual code transformations is the increased program complexity, which hinders programmer productivity. On the other hand, most compiler optimizations of fine-grained accesses require knowledge of the physical data mapping and the use of parallel loop constructs. This paper presents an optimization for the Unified Parallel C language that combines compile-time (static) and runtime (dynamic) coalescing of shared data, without knowledge of the physical data mapping. Larger messages increase the network efficiency, and static coalescing decreases the overhead of library calls. The performance evaluation uses two microbenchmarks and three benchmarks to obtain scaling and absolute performance numbers on up to 32768 cores of a Power 775 machine. Our results show that the compiler transformation yields speedups from 1.15X up to 21X compared with the baseline versions and achieves up to 63% of the performance of the MPI versions.

  • Improving performance of all-to-all communication through loop scheduling in PGAS environments

     Alvanos, Michail; Tanase, Gabriel; Farreras Esclusa, Montserrat; Tiotto, Ettore; Amaral, José Nelson; Martorell Bofill, Xavier
    ACM/IEEE International Conference on Supercomputing
    Presentation's date: 2013-06
    Presentation of work at congresses

  • Implementing OmpSs support for regions of data in architectures with multiple address spaces

     Bueno Hedo, Javier; Martorell Bofill, Xavier; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    ACM/IEEE International Conference on Supercomputing
    Presentation's date: 2013-06
    Presentation of work at congresses

    The need for features to manage complex data accesses in modern programming models has increased due to emerging hardware architectures. HPC hardware has moved towards clusters of accelerators and/or multicores, architectures with a complex memory hierarchy exposed to the programmer. We present the implementation of data regions in the OmpSs programming model, a high-productivity annotation-based programming model derived from OpenMP. This enables the programmer to specify regions of strided and/or overlapped data used by the parallel tasks of the application. The data will be automatically managed by the underlying run-time environment, which could transparently apply optimization techniques to improve performance. This approach, based on a high-productivity programming model, contrasts with more direct approaches like MPI, where the programmer has to deal explicitly with the data management. It is generally believed that such direct approaches are capable of achieving the best possible performance, so we also compare the performance of several OmpSs applications against well-known MPI counterparts, obtaining comparable or better results.

  • Rendering of Bézier surfaces on handheld devices

     Concheiro Figueroa, Raquel; Amor López, Margarita; Padrón González, Emilio José; Gil Gómez, Maria Luisa; Martorell Bofill, Xavier
    International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision
    Presentation's date: 2013-06-26
    Presentation of work at congresses

  • Optimization techniques for fine-grained communication in PGAS environments

     Alvanos, Michail
    Defense's date: 2013-12-10
    Universitat Politècnica de Catalunya
    Theses

    Programming languages based on the Partitioned Global Address Space (PGAS) model promise better programmer productivity and good performance on large-scale parallel computers. However, it is difficult to achieve adequate performance for applications that rely on fine-grained communication without compromising their programmability. Manual or compiler assistance is usually required to optimize the code so as to avoid fine-grained data accesses. The drawback of applying code transformations manually is the increased program complexity, which greatly reduces programmer productivity. On the other hand, the optimizations that the compiler can perform on fine-grained accesses require knowledge of the physical data mapping and the use of parallel loop constructs. This thesis presents optimizations to solve the three main problems of fine-grained communication: (i) the low efficiency of network communication, (ii) the large number of runtime calls, and (iii) the appearance of congestion in the communication network due to the non-uniform distribution of data. To solve these problems, the thesis presents three approaches. First, it presents an improved inspector-executor transformation to increase network efficiency through runtime data aggregation. Second, it presents additional optimizations to the inspector-executor loop transformation to automatically eliminate runtime calls. Finally, the thesis presents a loop transformation to avoid congestion in the communication network and the overloading of nodes.
    Unlike previous work that uses static data aggregation, prefetching, limited data privatization, and software cache management, the solutions presented in this thesis cover all aspects of fine-grained communication, including reducing the number of compiler-generated calls and minimizing the overhead of the inspector-executor optimizations. The proposals are evaluated with several microbenchmarks and benchmarks in order to determine their scalability and performance on the Power 775 architecture. The results indicate that applications with regular data accesses achieve up to 180% of the performance of hand-optimized versions, while for applications with irregular data accesses the transformations are expected to produce versions from 1.12x up to 6.3x faster. The loop scheduling techniques show performance improvements between 3% and 25% for NAS FT and sorting applications, and up to 3.4x in the microbenchmarks.

  • Visible, near infrared and thermal hand-based image biometric recognition.  Open access

     Font Aragones, Xavier
    Defense's date: 2013-05-30
    Universitat Politècnica de Catalunya
    Theses

    Biometric recognition refers to the automatic identification of a person based on an anatomical characteristic or modality (e.g., fingerprint, palmprint, face) or a behavioural characteristic (e.g., signature). It is a key issue in any process concerned with security, shared resources, or network transactions, among many others. It arises as a fundamental problem widely known as recognition, and becomes a mandatory step before permission is granted. It is meant to protect key resources by allowing them to be used only by users who have been granted authority to use or access them. Biometric systems can operate in verification mode, where the question to be answered is "Am I who I claim to be?", or in identification mode, where the question is "Who am I?". The scientific community has increased its efforts to improve the performance of biometric systems. Depending on the application, many solutions work with several modalities or combine different classification methods. Since adding modalities causes some inconvenience to the user, many of these approaches will never reach the market. For example, working with iris, face and fingerprints requires some user effort to help acquisition. This thesis addresses hand-based biometric systems in a thorough way. The main contributions are a new multi-spectral hand-based image database and methods for performance improvement: A) The first multi-spectral hand-based image database covering both faces of the hand, palmar and dorsal. Biometric databases are a precious commodity for research, especially when they offer something new, such as visible (VIS), near-infrared (NIR) and thermal (TIR) images at the same time. This database, with 100 users and 10 samples per user, constitutes a good starting point for checking algorithms and the suitability of the hand for recognition.
    B) To deal correctly with raw hand data, some image preprocessing steps are necessary. Three different segmentation phases are deployed to deal specifically with VIS, NIR and TIR images. Some of the tough issues to address are overexposed images, rings on fingers, cuffs, cold fingers and image noise. Once the image is segmented, two different approaches are prepared to deal with the segmented data. These two approaches, called Holistic and Geometric, define the main focus for extracting the feature vector. These feature vectors can be used alone or combined in some way. Many questions can be posed, e.g.: Which approach is better for recognition? Can fingers alone obtain better performance than the whole hand? Is thermographic hand information suitable for recognition given its thermoregulation properties? A complete data set ready for analysis, derived from the holistic and geometric approaches, has been designed and saved for testing. An innovative geometric approach related to curvature is demonstrated. C) Finally, the Biometric Dispersion Matcher (BDM) is used to explore how it works under different fusion schemes, as well as with different classification methods. This research intends to contrast what happens when using other methods close to BDM, such as Linear Discriminant Analysis (LDA). At this point, some interesting questions are addressed, e.g. by taking advantage of the finger segmentation (as five different modalities) to find out whether the fingers can outperform the whole-hand data.

    Biometric recognition refers to the automatic identification of people using some anatomical characteristic or modality (fingerprint) or some behavioural characteristic (signature). It is a fundamental aspect of any process related to security, resource sharing or electronic transactions, among others. It becomes an indispensable step before authorization is granted. This authorization is understood to protect key resources, so that they are used only by the users who have been authorized to use or access them. Biometric systems can operate in verification mode, which answers the question "Am I who I say I am?", or in identification mode, which answers the question "Who am I?". The scientific community has increased its efforts to improve the performance of biometric systems. Depending on the application, various solutions work with multiple modalities or combine different classification methods. Since increasing the number of modalities also creates problems for users, many of these approaches never reach the market. The thesis contributes mainly in three major areas, all with the following common denominator: biometric recognition through the hands. i) The first constitutes the basis of any study: the data. In order to interpret and establish a sufficiently robust biometric recognition system, clearly oriented to multiple sources of information but with minimum effort on the user's part, this multi-spectral hand database is built. Biometric databases are a highly valued resource for research, especially if they offer some new element, as is the case here: hand images in different electromagnetic spectra, namely visible (VIS), near infrared (NIR) and thermal (TIR).
    With a total of 100 users and 10 samples per user, it constitutes a good starting point for studying and testing multi-biometric systems focused on the hands. ii) The second block addresses the two approaches found in the literature for handling raw data. These two approaches, called Holistic (treats the image as a whole) and Geometric (uses geometric calculations), define the focus when extracting the feature vector. Before applying either approach, however, different digital image preprocessing techniques must be applied to obtain the desired regions of interest. Various problems present in the images had to be solved in an original way for each image type: VIS (overexposed images, rings, sleeves, bracelets), NIR (painted nails, noise-like distortion in the images) and TIR (cold fingers). The second area presents innovative aspects since, besides segmenting the hand image, each and every finger is segmented (feature-based approach). This makes it possible to contrast their recognition capability against that of the complete hand. Additionally, a set of geometric procedures is presented with the idea of comparing them with those derived from holistic extraction. The third and last area contrasts the classification procedure called Biometric Dispersion Matcher (BDM) in different situations. The first concerns its effectiveness relative to other recognition methods, such as Linear Discriminant Analysis (LDA), or methods such as KNN or logistic regression. The other situations analysed involve multiple sources of information, where normalization techniques and/or combination (fusion) strategies are applied to improve the results.
    The results obtained leave no room for confusion and are certainly promising, in that they bring to light the importance of combining complementary information to obtain superior performance.

  • Migration of a generic multi-physics framework to HPC environments  Open access

     Dadvand, Pooyan; Rossi, Riccardo; Gil Gómez, Maria Luisa; Martorell Bofill, Xavier; Cotela Dalmau, Jordi; Juanpere Cañameras, Edgar; Idelsohn Barg, Sergio Rodolfo; Oñate Ibáñez de Navarra, Eugenio
    Computers and fluids
    Date of publication: 2013-07-10
    Journal article

    Creating a highly parallelizable code is a challenge, especially for Distributed Memory Machines (DMMs). Moreover, algorithms and data structures suitable for these platforms can be very different from the ones used in serial code. For this reason, many programmers in the field prefer to start their own code from scratch. However, for an existing framework backed by long-standing expertise, the idea of transformation becomes attractive in order to reuse the effort invested over years of development. In this presentation we explain how a relatively complex framework with a modular structure can be prepared for high performance computing with minimal modification. Kratos Multi-Physics [1] is an open source generic multi-disciplinary platform for the solution of coupled problems consisting of fluid, structure, thermal and electromagnetic fields. The parallelization of this framework is performed with the objective of making the fewest possible changes to its different solver modules and encapsulating the changes as much as possible in its common kernel. This objective is achieved thanks to the Kratos design and an innovative way of dealing with data transfers for a multi-disciplinary code. This work is completed by the migration of the framework from the x86 architecture to the Marenostrum Supercomputing platform. The migration has been verified by a set of benchmarks which show high scalability, from which we present the Telescope problem in this paper.

  • Accelerating boosting-based face detection on GPUs

     Oro, David; Fernández, Carles; Segura, Carlos; Martorell Bofill, Xavier; Hernando Pericas, Francisco Javier
    International Conference on Parallel Processing
    Presentation's date: 2012-09-13
    Presentation of work at congresses

    The goal of face detection is to determine the presence of faces in arbitrary images, along with their locations and dimensions. As with any graphics workload, these algorithms benefit from data-level parallelism. Existing parallelization efforts strictly focus on mapping different divide and conquer strategies into multicore CPUs and GPUs. However, even the most advanced single-chip many-core processors to date are still struggling to effectively handle real-time face detection under high-definition video workloads. To address this challenge, face detection algorithms typically avoid computations by dynamically evaluating a boosted cascade of classifiers. Unfortunately, this technique yields a low ALU occupancy in architectures such as GPUs, which heavily rely on large SIMD widths for maximizing data-level parallelism. In this paper we present several techniques to increase the performance of the cascade evaluation kernel, which is the most resource-intensive part of the face detection pipeline. In particular, the usage of concurrent kernel execution in combination with cascades generated with the GentleBoost algorithm solves the problem of GPU underutilization, and achieves a 5X speedup in 1080p videos on average over the fastest known implementations, while slightly improving the accuracy. Finally, we also studied the parallelization of the cascade training process and its scalability under SMP platforms. The proposed parallelization strategy exploits both task and data-level parallelism and achieves a 3.5X speedup over single-threaded implementations.

  • Hardware-software coherence protocol for the coexistence of caches and local memories

     Alvarez, Lluc; Vilanova, Lluis; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference for High Performance Computing, Networking, Storage and Analysis
    Presentation's date: 2012-11-07
    Presentation of work at congresses


  • On the instrumentation of OpenMP and OmpSs Tasking constructs  Open access

     Servat Gelabert, Harald; Teruel Garcia, Xavier; Llort Sanchez, German Matías; Duran Gonzalez, Alejandro; Giménez Lucas, Judit; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    Workshop on Productivity and Performance
    Presentation's date: 2012-08
    Presentation of work at congresses


    Parallelism has become more and more commonplace with the advent of multicore processors. Although different parallel programming models have arisen to exploit the computing capabilities of such processors, developing applications that take advantage of these processors may not be easy. Worse still, the performance achieved by the parallel version of the application may not be what the developer expected, as a result of a poor utilization of the resources offered by the processor. In this paper we present a fruitful synergy between a shared-memory parallel compiler and runtime, and a performance extraction library. The objective of this work is not only to shorten the performance analysis life-cycle when parallelizing an application, but also to extend the analysis experience of the parallel application by incorporating data that is known only on the compiler and runtime side. Additionally, we present performance results obtained by executing instrumented applications and evaluate the overhead of the instrumentation.

  • OmpSs to OpenCL and to FPGAs

     Alvarez Martinez, Carlos; Jimenez Gonzalez, Daniel; Cabrera, Daniel; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Workshop on Efficient Parallel Programming of Bioinformatics Applications on Heterogeneous HPC Platforms
    Presentation's date: 2012-06-28
    Presentation of work at congresses


  • The task dependency analysis tool SSgrind

     Jimenez Gonzalez, Daniel; Alvarez Martinez, Carlos; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Workshop on Efficient Parallel Programming of Bioinformatics Applications on Heterogeneous HPC Platforms
    Presentation's date: 2012-06-28
    Presentation of work at congresses


  • OmpSs to FPGA and overview of applications

     Jimenez Gonzalez, Daniel; Alvarez Martinez, Carlos; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Workshop on Efficient Parallel Programming of Bioinformatics Applications on Heterogeneous HPC Platforms
    Presentation's date: 2012-06-25
    Presentation of work at congresses


  • Productive programming of GPU clusters with OmpSs

     Bueno Hedo, Javier; Planas, Judit; Duran Gonzalez, Alejandro; Badia Sala, Rosa Maria; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    IEEE International Parallel and Distributed Processing Symposium
    Presentation's date: 2012
    Presentation of work at congresses


    Clusters of GPUs are emerging as a new computational scenario. Programming them requires the use of hybrid models that increase the complexity of the applications, reducing the productivity of programmers. We present the implementation of OmpSs for clusters of GPUs, which supports asynchrony and heterogeneity for task parallelism. It is based on annotating a serial application with directives that are translated by the compiler. With it, the same program that runs sequentially in a node with a single GPU can run in parallel on multiple GPUs, either local (single node) or remote (cluster of GPUs). Besides performing a task-based parallelization, the runtime system moves the data as needed between the different nodes and GPUs, minimizing the impact of communication by using affinity scheduling, caching, and by overlapping communication with the computational task. We show several applications programmed with OmpSs and their performance with multiple GPUs in a local node and in remote nodes. The results show a good tradeoff between performance and programmer effort.

  • POTRA: a framework for building power models for next generation multicore architectures

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM SIGMETRICS/PERFORMANCE joint International Conference on Measurement and Modeling of Computer Systems
    Presentation's date: 2012
    Presentation of work at congresses


  • DMA-circular: an enhanced high level programmable DMA controller for optimized management of on-chip local memories

     Vujic, Nikola; Alvarez, Lluc; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    ACM International Conference on Computing Frontiers
    Presentation's date: 2012-05-15
    Presentation of work at congresses


  • Extending OpenMP* with vector constructs for modern multicore SIMD architectures

     Klemm, Michael; Tian, X.; Duran Gonzalez, Alejandro; Saito, Hideki; Caballero, Diego; Martorell Bofill, Xavier
    Lecture notes in computer science
    Date of publication: 2012
    Journal article


  • Counter-based power modeling methods: top-down vs. bottom-up

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    The computer journal (Kalispell, Mont.)
    Date of publication: 2012-08-24
    Journal article


  • POTRA: a framework for building power models for next generation multicore architectures

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    ACM SIGMETRICS Performance Evaluation Review
    Date of publication: 2012-06
    Journal article


  • Architectural Explorations for Streaming Accelerators with Customized Memory Layouts  Open access

     Shafiq, Muhammad
    Defense's date: 2012-05-21
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses



    The basic concept behind the architecture of a general purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip helped CPUs run parts of a program in parallel. However, exploiting the huge parallelism available in many high performance applications and their data is hard to achieve with these general purpose multi-cores. Streaming accelerators and the corresponding programming models improve upon this situation by providing throughput oriented architectures. The basic idea behind the design of these architectures matches the ever-increasing requirements of processing huge data sets. These high-performance throughput oriented devices help in high performance processing of data by using efficient parallel computations and streaming based communications. The throughput oriented streaming accelerators, like other processors, consist of numerous types of micro-architectural components, including the memory structures, compute units, control units, I/O channels and I/O controls. However, the throughput requirements add some special features and impose other restrictions for performance purposes. These devices normally offer a large number of compute resources but require applications to arrange parallel and maximally independent data sets to feed the compute resources in the form of streams. Arranging data into independent sets of parallel streams is not a simple task. It may require changing the structure of an algorithm as a whole, or even writing a new algorithm from scratch for the target application. However, all these efforts to rearrange application data access patterns may still not be enough to achieve optimal performance. 
This is because of the possible micro-architectural constraints of the target platform on the hardware pre-fetching mechanisms, the size and granularity of the local storage, and the flexibility of data marshaling inside the local storage. The constraints of a general purpose streaming platform on pre-fetching, storing and maneuvering data to arrange and maintain it in the form of parallel and independent streams could be removed by employing micro-architectural level design approaches. This includes the use of application specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations for streaming accelerators using customized memory layouts. In general, the thesis covers three main aspects of such streaming accelerators: i) Design of Application Specific Accelerators with Customized Memory Layout, ii) Template Based Design Support for Customized Memory Accelerators, and iii) Design Space Explorations for Throughput Oriented Devices with Standard and Customized Memories. This thesis concludes with a conceptual proposal for a Blacksmith Streaming Architecture (BSArc). Blacksmith Computing allows the hardware-level adoption of an application specific front-end with a GPU-like streaming back-end. This gives an opportunity to exploit the maximum possible data locality and data level parallelism of an application while providing a powerful throughput oriented back-end. We consider that the designs of these specialized memory layouts for the front-end of the device are provided by the application domain experts in the form of templates. These templates are adjustable according to the device and the problem size at the device's configuration time. The physical availability of such an architecture may still take time. 
However, a simulation framework helps in architectural explorations, giving insight into the proposal and predicting potential performance benefits for such an architecture.

  • Hardware and software support for distributed shared memory in chip multiprocessors

     Villavieja Prados, Carlos
    Defense's date: 2012-01-09
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


  • Energy accounting for shared virtualized environments under DVFS using PMC-based power models

     Bertran Monfort, Ramon; Becerra Fontal, Yolanda; Carrera Perez, David; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Torres Viñals, Jordi; Ayguade Parra, Eduard
    Future generation computer systems
    Date of publication: 2012-02
    Journal article


  • DMA++: on the fly data realignment for on-chip memories

     Vujic, Nikola; Cabarcas Jaramillo, Felipe; Gonzalez Tallada, Marc; Ramirez Bellido, Alejandro; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    IEEE transactions on computers
    Date of publication: 2012-02
    Journal article


  • Migration of a generic multi-physics framework to HPC environments  Open access

     Dadvand, Pooyan; Rossi, Riccardo; Gil Gómez, Maria Luisa; Martorell Bofill, Xavier; Cotela Dalmau, Jordi; Juanpere Cañameras, Edgar; Idelsohn Barg, Sergio Rodolfo; Oñate Ibáñez de Navarra, Eugenio
    International Conference on Parallel Computational Fluid Dynamics
    Presentation's date: 2011-05-17
    Presentation of work at congresses


    Creating a highly parallelizable code is a challenge, and development for distributed memory machines (DMMs) can be very different from developing a serial code in terms of algorithms and structure. For this reason, many developers in the field prefer to develop their own code from scratch. However, for an existing framework with a large development background, the idea of transformation becomes attractive in order to reuse the effort made during years of development. In this presentation we explain how a relatively complex framework with a modular structure can be prepared for high performance computing with minimum modification. Kratos Multi-Physics [1] is an open source generic multi-disciplinary platform for the solution of coupled problems consisting of fluid, structure, thermal and electromagnetic fields. The parallelization of this framework is performed with the objective of making the fewest possible changes to its different solver modules and encapsulating the changes as much as possible in its common kernel. This objective is achieved thanks to the Kratos design and to an innovative way of dealing with data transfers for a multi-disciplinary code. This work is completed by the migration of the framework from the x86 architecture to the Marenostrum Supercomputing platform. The migration has been verified by a set of benchmarks which show very good scalability, from which we present the Telescope problem in this paper.

  • Real-time GPU-based face detection in HD video sequences

     Oro, David; Fernandez, Carles; Rodriguez Saeta, Javier; Martorell Bofill, Xavier; Hernando Pericas, Francisco Javier
    International Conference on Computer Vision
    Presentation's date: 2011-11-07
    Presentation of work at congresses


    Modern GPUs have evolved into fully programmable parallel stream multiprocessors. Due to the nature of graphics workloads, computer vision algorithms are in a good position to leverage the computing power of these devices. An interesting problem that greatly benefits from parallelism is face detection. This paper presents a highly optimized Haar-based face detector that works in real time on high definition videos. The proposed kernel operations exploit both coarse and fine grain parallelism for performing integral image computations and filter evaluations, thus being beneficial not only for face detection but also for other computer vision techniques. In contrast to previous implementations, the experiments show that our proposal achieves a sustained throughput of 35 fps at 1080p resolution using a sliding window with a step of one pixel.
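
    The integral image computation that this abstract refers to can be illustrated with a short sequential sketch. It is a textbook summed-area table, not the paper's parallel GPU kernels, and the function names `integral_image` and `rect_sum` are hypothetical. Once the table is built, the sum of any rectangle, which is the building block of a Haar-like filter, takes four lookups regardless of the rectangle's size.

```python
from typing import List


def integral_image(image: List[List[int]]) -> List[List[int]]:
    """Build a summed-area table with one extra row and column of zeros."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0  # running sum of the current image row
        for c in range(cols):
            row_sum += image[r][c]
            # Everything above (previous table row) plus the current row so far.
            table[r + 1][c + 1] = table[r][c + 1] + row_sum
    return table


def rect_sum(table: List[List[int]], top: int, left: int, height: int, width: int) -> int:
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom = top + height
    right = left + width
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
table = integral_image(image)
```

    A two-rectangle Haar-like feature would then be the difference between two `rect_sum` calls over adjacent regions of the same window.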

  • Productive cluster programming with OmpSs

     Bueno Hedo, Javier; Martinell, Lluis; Duran Gonzalez, Alejandro; Farreras Esclusa, Montserrat; Martorell Bofill, Xavier; Badia Sala, Rosa Maria; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose
    International European Conference on Parallel and Distributed Computing
    Presentation's date: 2011-09-01
    Presentation of work at congresses


  • Automatic generation and testing of application specific hardware accelerators on a new reconfigurable OpenSPARC platform

     González Álvarez, Cecilia; Fernández, Mikel; Jimenez Gonzalez, Daniel; Alvarez Martinez, Carlos; Martorell Bofill, Xavier
    HiPEAC Workshop on Reconfigurable Computing
    Presentation's date: 2011-01-23
    Presentation of work at congresses


  • Physical Impairments Aware Planning and Operation of Transparent Optical Networks

     Azodolmolky, Siamak
    Defense's date: 2011-02-25
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


  • Programming, Debugging, Profiling and Optimizing Transactional Memory Programs  Open access

     Hasanov Zyulkyarov, Ferad
    Defense's date: 2011-07-19
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


    Transactional memory (TM) is a new optimistic synchronization technique which has the potential to make shared memory parallel programming easier than with locks without giving up performance. This thesis explores four aspects of transactional memory research. First, it studies how programming with TM compares to locks. In the course of this work, it develops the first real transactional application, AtomicQuake. AtomicQuake is adapted from the parallel version of the Quake game server by replacing all lock-based synchronization with atomic blocks. Findings suggest that programming with TM is indeed easier than with locks. However, the performance of current software TM systems falls behind that of efficiently implemented lock-based versions of the same program. The same findings also show that the proposed language level extensions are not sufficient for developing robust production level software and that existing development tools such as compilers, debuggers, and profilers lack support for developing transactional applications. Second, this thesis introduces a new set of debugging principles and abstractions. These enable debugging of synchronization errors which manifest at the coarse atomic block level, wrong code inside atomic blocks, and performance errors related to the implementation of the atomic block. The new debugging principles distinguish between debugging at the level of language constructs such as atomic blocks and debugging the atomic blocks based on how they are implemented, whether with TM or lock inference. These ideas are demonstrated by implementing a debugger extension for WinDbg and the ahead-of-time C# to X86 Bartok-STM compiler. Third, this thesis investigates the types of performance bottlenecks in TM applications and introduces new profiling techniques to find and understand these bottlenecks. 
The new profiling techniques provide in-depth and comprehensive information about the wasted work caused by aborting transactions. The individual profiling abstractions fall into three groups: (i) techniques to identify multiple conflicts from a single program run, (ii) techniques to describe the data structures involved in conflicts by using a symbolic path through the heap, rather than a machine address, and (iii) visualization techniques to summarize which transactions conflict most. The ideas were demonstrated by building a lightweight profiling framework for Bartok-STM and an offline tool which processes and displays the profiling data. Fourth, this thesis explores and introduces new TM specific optimizations which target the wasted work due to aborting transactions. Using the results obtained with the profiling tool, it analyzes and optimizes several applications from the STAMP benchmark suite. The profiling techniques effectively revealed TM-specific bottlenecks such as false conflicts and contended accesses to data structures. The discovered bottlenecks were subsequently eliminated using the new optimization techniques. Among the optimization highlights are transaction checkpoints, which reduced the wasted work in Intruder by 40%; decomposing objects to eliminate false conflicts in Bayes; early release in Labyrinth, which decreased wasted work from 98% to 1%; and using less contended data structures with a higher degree of parallelism, such as a chained hashtable, in Intruder and Genome.

  • Proactive Software Rejuvenation solution for web enviroments on virtualized platforms

     Alonso López, Javier
    Defense's date: 2011-02-21
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


  • CASTELL: A HETEROGENEOUS CMP ARCHITECTURE SCALABLE TO HUNDREDS OF PROCESSORS  Open access

     Cabarcas Jaramillo, Felipe
    Defense's date: 2011-09-19
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses


    Technology improvements and power constraints have led multicore architectures to dominate microprocessor designs over uniprocessors. At the same time, accelerator-based architectures have shown that heterogeneous multicores are very efficient and can provide high throughput for parallel applications, but at a high programming effort. We propose Castell, a scalable chip multiprocessor architecture that can be programmed like a uniprocessor and provides the high throughput of accelerator-based architectures. Castell relies on task-based programming models that simplify software development. These models use a runtime system that dynamically finds, schedules, and adds hardware-specific features to parallel tasks. One of these features is the use of DMA transfers to overlap computation and data movement, which is known as double buffering. This feature allows applications on Castell to tolerate large memory latencies and lets us design the memory system focusing on memory bandwidth. In addition to providing programmability and designing the memory system, we have used a hierarchical NoC and added a synchronization module. The NoC design distributes memory traffic efficiently to allow the architecture to scale. The synchronization module is a response to the large performance degradation that applications suffer under large synchronization latencies. Castell is mainly an architecture framework that enables the definition of domain-specific implementations, fine-tuned to a particular problem or application. So far, Castell has been successfully used to propose heterogeneous multicore architectures for scientific kernels, video decoding (using H.264), and protein sequence alignment (using Smith-Waterman and ClustalW). It has also been used to explore a number of architecture optimizations such as enhanced DMA controllers and architecture support for task-based programming models.

  • Poster: programming clusters of GPUs with OMPSs

     Bueno Hedo, Javier; Duran Gonzalez, Alejandro; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Labarta Mancho, Jesus Jose; Badia Sala, Rosa Maria
    International Conference for High Performance Computing, Networking, Storage and Analysis
    Presentation's date: 2011-11-18
    Presentation of work at congresses


  • ACOTES project: Advanced compiler technologies for embedded streaming  Open access

     Munk, H.; Ayguade Parra, Eduard; Bastoul, C.; Carpenter, Paul Matthew; Chamski, Z.; Cohen, A.; Cornero, M.; Dumont, P.; Duranton, M.; Fellahi, M.; Ferrer, Roger; Ladelsky, R.; Lindwer, M.; Martorell Bofill, Xavier; Miranda, C.; Nuzman, D.; Ornstein, A.; Pop, A.; Pop, S.; Pouchet, L. N; Ramirez Bellido, Alejandro; Ródenas, D.; Rohou, E.; Rosen, I.; Shvadron, U.; Trifunovic, K.; Zaks, A.
    International journal of parallel programming
    Date of publication: 2011-04
    Journal article


    Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.

  • OmpSs: A proposal for programming heterogeneous multi-core architectures

     Duran Gonzalez, Alejandro; Ayguade Parra, Eduard; Badia, R.M.; Labarta Mancho, Jesus Jose; Martinell, Lluis; Martorell Bofill, Xavier; Planas, Judit
    Parallel processing letters
    Date of publication: 2011-06
    Journal article


  • Decomposable and responsive power models for multicore processors using performance counters

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    International Conference for High Performance Computing, Networking, Storage and Analysis
    Presentation's date: 2010-06-04
    Presentation of work at congresses


  • DMA++: on the fly data realignment for on-chip memories

     Vujic, Nikola; Gonzalez Tallada, Marc; Cabarcas Jaramillo, Felipe; Ramirez Bellido, Alejandro; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    International Symposium on High-Performance Computer Architecture (HPCA)
    Presentation's date: 2010
    Presentation of work at congresses


  • Transient congestion avoidance in software distributed shared memory systems

     Costa Prats, Juan Jose; Cortes Rossello, Antonio; Martorell Bofill, Xavier; Bueno Hedo, Javier; Ayguade Parra, Eduard
    International Conference on Parallel and Distributed Computing, Applications and Technologies
    Presentation's date: 2010-12-08
    Presentation of work at congresses

  • Reducing data access latency in SDSM systems using runtime optimizations

     Bueno Hedo, Javier; Martorell Bofill, Xavier; Costa Prats, Juan Jose; Cortes Rossello, Antonio; Ayguade Parra, Eduard; Zhang, Guansong; Barton, Christopher; Silvera, Raul
    Conference of the Center for Advanced Studies on Collaborative Research (CASCON)
    Presentation's date: 2010-11-01
    Presentation of work at congresses

  • Transient congestion avoidance in software distributed shared memory systems

     Costa Prats, Juan Jose; Cortes Rossello, Antonio; Martorell Bofill, Xavier; Bueno Hedo, Javier; Ayguade Parra, Eduard
    International Conference on Parallel and Distributed Computing, Applications and Technologies
    Presentation's date: 2010-12
    Presentation of work at congresses

  • GPFPGA: entorno para la generación automática de códigos HDL portables entre FPGAs

     Jimenez Gonzalez, Daniel; Sánchez Fernández, Raúl; Alvarez Martinez, Carlos; Morillo Pozo, Julian David; Cabrera, Daniel; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    Jornadas de Computación Reconfigurable y Aplicaciones
    Presentation's date: 2010-09
    Presentation of work at congresses

  • Accurate energy accounting for shared virtualized environments using PMC-based power modeling techniques

     Bertran Monfort, Ramon; Becerra Fontal, Yolanda; Carrera Perez, David; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Torres Viñals, Jordi; Ayguade Parra, Eduard
    ACM/IEEE International Conference on Grid Computing
    Presentation's date: 2010-10-27
    Presentation of work at congresses

  • Parallel programming models for heterogeneous multicore architectures

     Ferrer, Roger; Bellens, Pieter; Beltran Querol, Vicenç; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Yeom, Jae-Seung; Schneider, Scott; Koukos, Konstantinos; Alvanos, Michail; Nikolopoulos, Dimitrios S.; Bilas, Angelos
    IEEE micro
    Date of publication: 2010-09-01
    Journal article

  • Analysis and Optimization of Question Answering Systems  Open access

     Dominguez Sala, David
    Defense's date: 2010-04-23
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

  • HiPEAC Paper Award

     Vujic, Nikola; Gonzalez Tallada, Marc; Ramirez Bellido, Alejandro; Cabarcas Jaramillo, Felipe; Martorell Bofill, Xavier; Ayguade Parra, Eduard
    Award or recognition

  • Extending OpenMP to survive the heterogeneous multi-core era

     Ayguade Parra, Eduard; Badia Sala, Rosa Maria; Bellens, Pieter; Cabrera, Daniel; Duran González, Alejandro; Ferrer, Roger; Gonzalez Tallada, Marc; Igual, Francisco D.; Jimenez Gonzalez, Daniel; Labarta Mancho, Jesus Jose; Martinell, Lluis; Martorell Bofill, Xavier; Mayo, Rafael; Pérez Cáncer, Josep Maria; Planas, Judit; Quintana Ortí, Enrique Salvador
    International journal of parallel programming
    Date of publication: 2010-10
    Journal article

    This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to write portable code easily for a number of different platforms, relieving him/her from developing the specific code to off-load tasks to the accelerators and to synchronize tasks. Our results obtained from the StarSs instantiations for SMPs, the Cell, and GPUs report reasonable parallel performance. However, the real impact of our approach is the productivity gains it yields for the programmer.

  • Local memory design space exploration for high-performance computing

     Bertran Monfort, Ramon; Gonzalez Tallada, Marc; Martorell Bofill, Xavier; Navarro Mas, Nacho; Ayguade Parra, Eduard
    The Computer journal (paper)
    Date of publication: 2010-03-23
    Journal article
