Scientific and technological production

1 to 45 of 45 results
  • Thread assignment of multithreaded network applications in multicore/multithreaded processors

     Radojkovic, Petar; Cakarevic, Vladimir; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Cazorla Almeida, Francisco Javier; Nemirovsky, Mario; Valero Cortes, Mateo
    IEEE transactions on parallel and distributed systems
    Date of publication: 2013-12
    Journal article

    The introduction of multithreaded processors comprised of a large number of cores with many shared resources makes thread scheduling, and in particular the optimal assignment of running threads to processor hardware contexts, one of the most promising ways to improve system performance. However, finding optimal thread assignments for workloads running in state-of-the-art multicore/multithreaded processors is an NP-complete problem. In this paper, we propose the BlackBox scheduler, a systematic method for thread assignment of multithreaded network applications running on multicore/multithreaded processors. The method requires minimal information about the target processor architecture and no data about the hardware requirements of the applications under study. The proposed method is evaluated with an industrial case study for a set of multithreaded network applications running on the UltraSPARC T2 processor. In most of the experiments, the proposed thread assignment method detected the best actual thread assignment in the evaluation sample. The method improved system performance by 5 to 48 percent with respect to the load-balancing algorithms used in state-of-the-art OSs, and by up to 60 percent with respect to a naive thread assignment.
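
    As a rough illustration of why exhaustive search is impractical (assuming the UltraSPARC T2's 64 hardware contexts, i.e. 8 cores with 8 strands each, and ignoring hardware symmetries), the number of ways to place n distinct threads onto k contexts is k!/(k-n)!, so for k = 64 and a workload of only n = 16 threads the assignment space is already on the order of 10^28 candidates, which is why systematic or sampling-based methods are used instead of enumeration.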

  • Improving the Effective Use of Multithreaded Architectures: Implications on Compilation, Thread Assignment, and Timing Analysis  Open access

     Radojkovic, Petar
    Defense's date: 2013-07-19
    Universitat Politècnica de Catalunya
    Theses

    This thesis presents cross-domain approaches that improve the effective use of multithreaded architectures. The contributions of the thesis can be classified in three groups. First, we propose several methods for thread assignment of network applications running in multithreaded network servers. Second, we analyze the problem of graph partitioning that is part of the compilation process of multithreaded streaming applications. Finally, we present a method that improves the measurement-based timing analysis of multithreaded architectures used in time-critical environments. The following sections summarize each of the contributions. (1) Thread assignment on multithreaded processors: State-of-the-art multithreaded processors have different levels of resource sharing (e.g. between threads running on the same core, and globally shared resources). Thus, the way that threads of a given workload are assigned to the processor's hardware contexts determines which resources the threads share, which, in turn, may significantly affect the system performance. In this thesis, we demonstrate the importance of thread assignment for network applications running in multithreaded servers. We also present TSBSched and the BlackBox scheduler, methods for thread assignment of multithreaded network applications running on processors with several levels of resource sharing. Finally, we propose a statistical approach to the thread assignment problem. In particular, we show that running a sample of several hundred or several thousand random thread assignments is sufficient to capture, with a very high probability, at least one of the 1% best-performing assignments. We also describe a method that estimates the optimal system performance for a given workload. We successfully applied TSBSched, the BlackBox scheduler, and the presented statistical approach to a case study of thread assignment of multithreaded network applications running on the UltraSPARC T2 processor. (2) Kernel partitioning of streaming applications: An important step in compiling a stream program to multiple processors is kernel partitioning. Finding an optimal kernel partition is, however, an intractable problem. We propose a statistical approach to the kernel partitioning problem. We describe a method that statistically estimates the performance of the optimal kernel partition. We demonstrate that the sampling method is an important part of the analysis, and that not all methods that generate random samples provide good results. We also show that random sampling on its own can be used to find a good kernel partition, and that it could be an alternative to heuristics-based approaches. The presented statistical method is applied successfully to the benchmarks included in the StreamIt 2.1.1 suite. (3) Multithreaded processors in time-critical environments: Despite the benefits that multithreaded commercial off-the-shelf (MT COTS) processors may offer in embedded real-time systems, the time-critical market has not yet embraced a shift toward these architectures. The main challenge with MT COTS architectures is the difficulty of predicting the execution time of concurrently running (co-running) time-critical tasks. Providing a timing analysis for real industrial applications running on MT COTS processors becomes extremely difficult because the execution time of a task, and hence its worst-case execution time (WCET), depends on the interference with co-running tasks in shared processor resources. We show that the measurement-based timing analysis used for single-threaded processors cannot be directly extended to MT COTS architectures. We also propose a methodology that quantifies the slowdown that a task may experience because of collisions with co-running tasks in the shared resources of an MT COTS processor. The methodology is applied to a case study in which different time-critical applications were executed on several MT COTS multithreaded processors.
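
    The statistical claim above can be made concrete with a simple calculation (a sketch, not taken from the thesis): if thread assignments are drawn uniformly at random, the probability that a sample of n assignments contains at least one of the best-performing 1% is P = 1 - (1 - 0.01)^n = 1 - 0.99^n, which is about 0.87 for n = 200, 0.993 for n = 500, and 0.99996 for n = 1000, consistent with the "several hundred or several thousand" sample sizes mentioned.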

  • The problem of evaluating CPU-GPU systems with 3d visualization applications

     Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Valero Cortes, Mateo
    IEEE micro
    Date of publication: 2012-12
    Journal article

    Complex, computationally demanding 3D visualization applications can be used as benchmarks to evaluate CPU-GPU systems. However, because those applications are time dependent, their execution is not deterministic. Thus, measurements can vary from one execution to another. This article proposes a methodology that enforces the starting times of frames so that applications behave deterministically.
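
    The abstract does not give implementation details; the following minimal C sketch (POSIX clocks, with a hypothetical render_frame() stub and a hypothetical 60 Hz frame period) illustrates the general idea of enforcing frame start times against absolute deadlines so that runs become repeatable:

        #define _POSIX_C_SOURCE 200809L
        #include <stdint.h>
        #include <stdio.h>
        #include <time.h>

        #define FRAME_PERIOD_NS 16666667L            /* hypothetical 60 Hz frame period */

        /* Hypothetical stand-in for the benchmark's per-frame work. */
        static void render_frame(uint64_t frame_index)
        {
            printf("frame %llu starts\n", (unsigned long long)frame_index);
        }

        int main(void)
        {
            struct timespec next;
            clock_gettime(CLOCK_MONOTONIC, &next);   /* time origin of the run */

            for (uint64_t i = 0; i < 10; i++) {
                /* Enforce the scheduled start time of the frame: sleep until an
                   absolute deadline, so timing does not depend on how long the
                   previous frames took and the execution becomes repeatable. */
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
                render_frame(i);

                next.tv_nsec += FRAME_PERIOD_NS;
                if (next.tv_nsec >= 1000000000L) {   /* normalize the timespec */
                    next.tv_nsec -= 1000000000L;
                    next.tv_sec  += 1;
                }
            }
            return 0;
        }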

  • Concurs Wayra Barcelona 2012

     Arroyo, Ignacio; Pajuelo González, Manuel Alejandro; Verdu Mula, Javier
    Award or recognition

  • Optimal task assignment in multithreaded processors: a statistical approach

     Cakarevic, Vladimir; Radojkovic, Petar; Moreto Planas, Miquel; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Cazorla, Francisco J.; Nemirovsky, Mario; Valero Cortes, Mateo
    International Conference on Architectural Support for Programming Languages and Operating Systems
    Presentation's date: 2012
    Presentation of work at congresses

    The introduction of massively multithreaded (MMT) processors, comprised of a large number of cores with many shared resources, has made task scheduling, in particular task-to-hardware-thread assignment, one of the most promising ways to improve system performance. However, finding an optimal task assignment for a workload running on MMT processors is an NP-complete problem. Because the performance of the best possible task assignment is unknown, the room for improvement of current task-assignment algorithms cannot be determined. This is a major problem for the industry because it could lead to: (1) a waste of resources if excessive effort is devoted to improving a task assignment algorithm that already provides performance close to the optimal one, or (2) significant performance loss if insufficient effort is devoted to improving poorly performing task assignment algorithms. In this paper, we present a method based on Extreme Value Theory that allows the prediction of the performance of the optimal task assignment in MMT processors. We further show that executing a sample of several hundred or several thousand random task assignments is enough to obtain, with very high confidence, an assignment with performance close to the optimal one. We validate our method with an industrial case study for a set of multithreaded network applications running on an UltraSPARC T2 processor.
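
    A minimal sketch (not the paper's code) of the random-sampling step described above: draw several hundred or several thousand random task-to-strand assignments, evaluate each one, and keep the best observed performance. The evaluate_assignment() stub and the workload size are illustrative assumptions.

        #include <stdio.h>
        #include <stdlib.h>

        #define NUM_STRANDS 64        /* UltraSPARC T2: 8 cores x 8 strands */
        #define NUM_TASKS   16        /* illustrative workload size */
        #define NUM_SAMPLES 1000      /* "several hundred or several thousand" */

        /* Hypothetical: deploy the workload with this binding and measure its
           throughput; here a random score keeps the sketch runnable. */
        static double evaluate_assignment(const int *assignment)
        {
            (void)assignment;
            return (double)rand() / RAND_MAX;
        }

        int main(void)
        {
            int strands[NUM_STRANDS];
            double best = 0.0;

            for (int s = 0; s < NUM_SAMPLES; s++) {
                for (int i = 0; i < NUM_STRANDS; i++)
                    strands[i] = i;
                /* Fisher-Yates shuffle: the first NUM_TASKS entries form one random
                   assignment of tasks to distinct hardware strands. */
                for (int i = NUM_STRANDS - 1; i > 0; i--) {
                    int j = rand() % (i + 1);
                    int tmp = strands[i]; strands[i] = strands[j]; strands[j] = tmp;
                }
                double perf = evaluate_assignment(strands);
                if (perf > best)
                    best = perf;
            }
            printf("best observed performance over %d samples: %f\n", NUM_SAMPLES, best);
            return 0;
        }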

  • Procedimiento, sistema y pieza de código ejecutable para controlar el uso de recursos de hardware de un sistema informático

     Pajuelo González, Manuel Alejandro; Verdu Mula, Javier
    Date of request: 2012-04-16
    Invention patent

    Method, system and executable piece of code for controlling the use of the hardware resources of a computer system.

    The invention relates to a method for controlling the use of the hardware resources of a computer system by an application running on an operating system that comprises at least one application programming interface (API) and that runs on this computer system, by means of an executable piece of code adapted to be injected into a process belonging to the application. The method comprises intercepting the process's call to the API service, and acting on a software entity belonging to the running process upon interception of the process's call to the API service.

  • Procedimiento, sistema y pieza de código ejecutable para virtualizar un recurso de hardware asociado a un sistema informático

     Pajuelo González, Manuel Alejandro; Verdu Mula, Javier
    Date of request: 2012-04-16
    Invention patent

    Method, system and executable piece of code for virtualizing a hardware resource associated with a computer system.

    A method for virtualizing hardware resources associated with a computer system by means of an executable piece of code adapted to be injected into a process belonging to an application running on an operating system that comprises at least one API and runs on the computer system. The method comprises intercepting a call from the process to an API service related to the management of the data flow between the process and the hardware resource, and having the piece of code manage that data flow upon interception of the process's call to the API service related to the management of the data flow between the process and the hardware resource.

  • An abstraction methodology for the evaluation of multi-core multi-threaded architectures

     Zilan, Ruken; Verdu Mula, Javier; Garcia Vidal, Jorge; Nemirovsky, Mario; Milito, Rodolfo; Valero Cortes, Mateo
    IEEE International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems
    Presentation's date: 2011-07-25
    Presentation of work at congresses

    As multi-core multi-threaded processors continue to evolve, the complexity of performing an extensive trade-off analysis increases proportionally. Cycle-accurate or trace-driven simulators are too slow to execute the large number of experiments required to obtain indicative results. A thorough analysis of the system also requires software benchmarks or traces, and in many cases when the analysis is needed most, during the early stages of processor design, benchmarks or traces are not available. Analytical models overcome these limitations but do not provide the fine-grained detail needed for a deep analysis of these architectures. In this work we present a new methodology to abstract processor architectures at a level between cycle-accurate and analytical simulators. To apply our methodology we use queueing modeling techniques. Thus, we introduce Q-MAS, a queueing-based tool targeting a real chip (the UltraSPARC T2 processor) and aimed at facilitating the quantification of trade-offs during the design phase of multi-core multi-threaded processor architectures. The results demonstrate that Q-MAS provides results very close to the actual hardware, at a minimal cost for running what-if scenarios.
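
    The abstract does not detail Q-MAS's internal model; as a generic illustration of the queueing level of abstraction it works at, a single shared resource modeled as an M/M/1 station with arrival rate lambda and service rate mu is summarized by its utilization rho = lambda/mu and mean response time T = 1/(mu - lambda) for rho < 1, so contention at a shared pipeline or cache port can be captured with a handful of parameters rather than cycle-accurate state.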

  • Thread to strand binding of parallel network applications in massive multi-threaded systems

     Radojkovic, Petar; Cakarevic, Vladimir; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Cazorla, Francisco J.; Nemirovsky, Mario; Valero Cortes, Mateo
    ACM SIGPLAN notices
    Date of publication: 2010-05
    Journal article

    In processors with several levels of hardware resource sharing, like CMPs in which each core is an SMT, the scheduling process becomes more complex than in processors with a single level of resource sharing, such as pure-SMT or pure-CMP processors. Once the operating system selects the set of applications to simultaneously schedule on the processor (workload), each application/thread must be assigned to one of the hardware contexts (strands). We call this last scheduling step the Thread to Strand Binding or TSB. In this paper, we show that the impact of the TSB on the performance of processors with several levels of shared resources is high. We measure a variation of up to 59% between different TSBs of real multithreaded network applications running on the UltraSPARC T2 processor, which has three levels of resource sharing. In our view, this problem is going to be more acute in future multithreaded architectures comprising more cores, more contexts per core, and more levels of resource sharing. We propose a resource-sharing-aware TSB algorithm (TSBSched) that significantly facilitates the problem of thread to strand binding for software-pipelined applications, representative of multithreaded network applications. Our systematic approach encapsulates both the characteristics of the multithreaded processors under study and the structure of the software-pipelined applications. Once calibrated for a given processor architecture, our proposal requires neither hardware knowledge on the side of the programmer nor extensive profiling of the application. We validate our algorithm on the UltraSPARC T2 processor running a set of real multithreaded network applications, on which we report improvements of up to 46% compared to the current state-of-the-art dynamic schedulers.
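
    The paper targets the UltraSPARC T2 under Solaris-based environments, where the binding primitive differs; as a rough Linux analogue (each strand appears as a logical CPU), the sketch below shows the basic operation a TSB algorithm ultimately performs: pinning a software thread to one chosen hardware context. The strand number is a hypothetical output of TSBSched.

        #define _GNU_SOURCE
        #include <pthread.h>
        #include <sched.h>
        #include <stdio.h>

        /* Restrict the calling thread to a single hardware context (strand). */
        static int bind_to_strand(int strand_id)
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(strand_id, &set);
            return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        }

        static void *worker(void *arg)
        {
            int strand = *(int *)arg;            /* strand chosen by the TSB policy */
            if (bind_to_strand(strand) != 0)
                fprintf(stderr, "could not bind to strand %d\n", strand);
            /* ... a pipeline stage of the network application would run here ... */
            return NULL;
        }

        int main(void)
        {
            pthread_t tid;
            int strand = 3;                      /* hypothetical output of TSBSched */
            pthread_create(&tid, NULL, worker, &strand);
            pthread_join(tid, NULL);
            return 0;
        }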

  • Thread to strand binding of parallel network applications in massive multi-threaded systems

     Radojkovic, Petar; Cakarevic, Vladimir; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Cazorla Almeida, Francisco Javier; Nemirovsky, Mario; Valero Cortes, Mateo
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    Presentation's date: 2010-01
    Presentation of work at congresses

    In processors with several levels of hardware resource sharing, like CMPs in which each core is an SMT, the scheduling process becomes more complex than in processors with a single level of resource sharing, such as pure-SMT or pure-CMP processors. Once the operating system selects the set of applications to simultaneously schedule on the processor (workload), each application/thread must be assigned to one of the hardware contexts (strands). We call this last scheduling step the Thread to Strand Binding or TSB. In this paper, we show that the impact of the TSB on the performance of processors with several levels of shared resources is high. We measure a variation of up to 59% between different TSBs of real multithreaded network applications running on the UltraSPARC T2 processor, which has three levels of resource sharing. In our view, this problem is going to be more acute in future multithreaded architectures comprising more cores, more contexts per core, and more levels of resource sharing. We propose a resource-sharing-aware TSB algorithm (TSBSched) that significantly facilitates the problem of thread to strand binding for software-pipelined applications, representative of multithreaded network applications. Our systematic approach encapsulates both the characteristics of the multithreaded processors under study and the structure of the software-pipelined applications. Once calibrated for a given processor architecture, our proposal requires neither hardware knowledge on the side of the programmer nor extensive profiling of the application. We validate our algorithm on the UltraSPARC T2 processor running a set of real multithreaded network applications, on which we report improvements of up to 46% compared to the current state-of-the-art dynamic schedulers.

  • ARQUITECTURA DE COMPUTADORS D'ALTRES PRESTACIONS (CAP)

     Jimenez Castells, Marta; Pericas Gleim, Miquel; Navarro Guerrero, Juan Jose; Llaberia Griño, Jose M.; Llosa Espuny, Jose Francisco; Villavieja Prados, Carlos; Alvarez Martinez, Carlos; Jimenez Gonzalez, Daniel; Ramirez Bellido, Alejandro; Morancho Llena, Enrique; Fernandez Jimenez, Agustin; Pajuelo González, Manuel Alejandro; Olive Duran, Angel; Sanchez Carracedo, Fermin; Moreto Planas, Miquel; Verdu Mula, Javier; Abella Ferrer, Jaume; Valero Cortes, Mateo
    Participation in a competitive project

  • Internet traffic and the behavior of processing workloads  Open access

     Zilan, Ruken; Verdu Mula, Javier; Garcia Vidal, Jorge; Nemirovsky, Mario; Valero Cortes, Mateo
    International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems
    Presentation's date: 2009-07
    Presentation of work at congresses

    Nowadays, the evolution of network services provided at the edge of the Internet is increasing the requirements of network applications. Such applications are becoming more complex; thus, processors need to execute workloads that deal not only with the packet header but also with the packet payload (e.g. Deep Packet Inspection). Unlike common routing applications, which show similar processing for every packet, the next generation of network applications presents variations in the processing procedure from packet to packet. Thus, different traffic behaviors can produce different processing patterns and present different memory and processing requirements. The aim of this work is to present ongoing work towards correlating Internet traffic features with variations of processing workloads on the next generation of edge routers.

  • Experiencias en el uso de un mapa conceptual global en SO

     Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Lopez Alvarez, David
    Jornades de Docència del Departament d'Arquitectura de Computadors
    Presentation's date: 2009-02-13
    Presentation of work at congresses

  • Measuring operating system overhead on Sun UltraSparc T1 processor

     Radojkovic, Petar; Cakarevic, Vladimir; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Gioiosa, Roberto; Cazorla Almeida, Francisco Javier; Nemirovsky, Mario; Valero Cortes, Mateo
    International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems
    Presentation's date: 2009-07
    Presentation of work at congresses

  • Characterizing the resource-sharing levels of the UltraSparc T2 processor

     Cakarevic, Vladimir; Radojkovic, Petar; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Cazorla, Francisco J.; Nemirovsky, Mario; Valero Cortes, Mateo
    IEEE/ACM International Symposium on Microarchitecture
    Presentation's date: 2009-12
    Presentation of work at congresses

    Thread level parallelism (TLP) has become a popular trend to improve processor performance, overcoming the limitations of extracting instruction level parallelism. Each TLP paradigm, such as Simultaneous Multithreading or Chip-Multiprocessors, provides different benefits, which has motivated processor vendors to combine several TLP paradigms in each chip design. Even if most of these combined-TLP designs are homogeneous, they present different levels of hardware resource sharing, which introduces complexities in operating system scheduling and load balancing. Commonly, processor designs provide two levels of resource sharing: Inter-core, in which only the highest levels of the cache hierarchy are shared, and Intra-core, in which most of the hardware resources of the core are shared. Recently, Sun Microsystems has released the UltraSPARC T2, a processor with three levels of hardware resource sharing: InterCore, IntraCore, and IntraPipe. In this work, we provide the first characterization of a three-level resource sharing processor, the UltraSPARC T2, and we show how multi-level resource sharing affects the operating system design. We further identify the most critical hardware resources in the T2 and the characteristics of applications that are not sensitive to resource sharing. Finally, we present a case study in which we run a real multithreaded network application, showing that a resource-sharing-aware scheduler can improve the system throughput by up to 55%.

  • Mapa Conceptual Global como herramienta para la vision global de un sistema operativo  Open access

     Verdu Mula, Javier; Lopez Alvarez, David; Pajuelo González, Manuel Alejandro
    Jornadas de Enseñanza Universitaria de la Informática
    Presentation's date: 2009-07
    Presentation of work at congresses

    Many courses consist of a syllabus whose topics are closely interrelated. By the end of the course, students should have acquired the knowledge of each topic but, more importantly, they should understand how the different topics interact with each other in order to obtain a global view of the course. However, students often focus on the topics in isolation, partly because we do not offer them tools that help them relate the different parts of the course. In this work we present the use of a Global Concept Map (Mapa Conceptual Global, MCG) of a course as a teaching resource that helps students obtain an overall view of the whole syllabus. The experience was carried out as a complement to an active-learning class in an Operating Systems course, but we believe it can easily be applied to other courses.

  • Analysis and Architectural Support for Parallel Stateful Packet Processing  Open access

     Verdu Mula, Javier
    Defense's date: 2008-07-09
    Department of Computer Architecture, Universitat Politècnica de Catalunya
    Theses

    The evolution of network services is closely related to the network technology trend. Originally, network nodes forwarded packets from a source to a destination in the network by executing lightweight packet processing, or even negligible workloads. As links provide more complex services, packet processing demands the execution of more computationally intensive applications. Complex network applications deal with both the packet header and the payload (i.e. packet contents) to provide upper-layer network services, such as enhanced security, system utilization policies, and video-on-demand management. Applications that provide complex network services exhibit two key capabilities that differ from lower-layer network applications: a) deep packet inspection examines the packet payload, typically searching for a matching string or regular expression, and b) stateful processing keeps track of information from previous packet processing, unlike other applications that do not keep any data about other packets. In most cases, deep packet inspection also integrates stateful processing. Computer architecture research aims to maximize the system throughput to sustain the required network processing performance as well as other demands, such as memory and I/O bandwidth. In fact, there are different processor architectures depending on the degree of hardware resource sharing among streams (i.e. hardware contexts). Multicore architectures present multiple processing engines within a single chip that share cache levels of the memory hierarchy and the interconnection network. Multithreaded architectures integrate multiple streams in a single processing engine, sharing functional units, the register file, the fetch unit, and the inner levels of the cache hierarchy. Scalable multicore multithreaded architectures emerge as a solution to meet the requirements of high-throughput systems. We use the term massively multithreaded architectures for architectures that comprise tens to hundreds of streams distributed across multiple cores on a chip. Nevertheless, the efficient utilization of these architectures depends on the application characteristics. On the one hand, emerging network applications show large computational workloads with significant variations in packet processing behavior. It is therefore important to analyze the behavior of each packet's processing in order to optimally assign packets to threads (i.e. software contexts) and reduce any negative interaction among them. On the other hand, network applications present Packet Level Parallelism (PLP), in which several packets can be processed in parallel. As in other paradigms, dependencies among packets limit the amount of PLP. Lower network layer applications show negligible packet dependencies. In contrast, complex upper-layer network applications show dependencies among packets that reduce the amount of PLP. In this thesis, we address the limitations of parallelism in stateful network applications to maximize the throughput of advanced network devices. This dissertation comprises three complementary sets of contributions focused on network analysis, workload characterization, and an architectural proposal. The network analysis evaluates the impact of network traffic on stateful network applications. We especially study the impact of network traffic aggregation on memory hierarchy performance. We categorize and characterize network applications according to their data management. The results point out that stateful processing presents reduced instruction level parallelism and a high rate of long-latency memory accesses. Our analysis reveals that stateful applications expose a variety of levels of parallelism related to stateful data categories. Thus, we propose MultiLayer Processing (MLP) as an execution model to exploit multiple levels of parallelism. MLP is a thread-migration-based mechanism that increases the synergy among streams in the memory hierarchy and alleviates the contention in critical sections of parallel stateful workloads.
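
    As a minimal illustration of the stateful processing analyzed in the thesis (not code from the thesis), the sketch below keeps per-flow state keyed on the packet 5-tuple, so handling one packet depends on information recorded by earlier packets of the same flow; the field names, table size, and hash are assumptions.

        #include <stdint.h>
        #include <stdio.h>

        struct five_tuple {
            uint32_t src_ip, dst_ip;
            uint16_t src_port, dst_port;
            uint8_t  proto;
        };

        struct flow_state {
            struct five_tuple key;
            uint64_t packets;        /* example of state carried across packets */
            uint64_t bytes;
            int      in_use;
        };

        #define TABLE_SIZE 4096
        static struct flow_state table[TABLE_SIZE];

        static uint32_t hash_tuple(const struct five_tuple *t)
        {
            /* Illustrative mixing only; real code would use a stronger hash. */
            uint32_t h = t->src_ip ^ (t->dst_ip * 2654435761u);
            h ^= ((uint32_t)t->src_port << 16) | t->dst_port;
            return (h ^ t->proto) % TABLE_SIZE;
        }

        static int same_flow(const struct five_tuple *a, const struct five_tuple *b)
        {
            return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
                   a->src_port == b->src_port && a->dst_port == b->dst_port &&
                   a->proto == b->proto;
        }

        /* Find the flow record for a packet, creating it on the first packet of
           the flow (linear probing; eviction policy omitted for brevity). */
        static struct flow_state *lookup_or_insert(const struct five_tuple *t)
        {
            uint32_t idx = hash_tuple(t);
            for (int probe = 0; probe < TABLE_SIZE; probe++) {
                struct flow_state *f = &table[(idx + probe) % TABLE_SIZE];
                if (!f->in_use) {
                    f->key = *t;
                    f->in_use = 1;
                    return f;
                }
                if (same_flow(&f->key, t))
                    return f;
            }
            return 0;                /* table full */
        }

        static void process_packet(const struct five_tuple *t, uint32_t len)
        {
            struct flow_state *f = lookup_or_insert(t);
            if (f) {
                f->packets++;        /* state from earlier packets is updated here */
                f->bytes += len;
            }
        }

        int main(void)
        {
            struct five_tuple t = { 0x0a000001u, 0x0a000002u, 40000, 80, 6 };
            process_packet(&t, 1500);
            process_packet(&t, 1500); /* second packet of the same flow */
            printf("%llu packets seen in this flow\n",
                   (unsigned long long)lookup_or_insert(&t)->packets);
            return 0;
        }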

  • Sistemes Operatius. Quadern de Laboratori

     Pajuelo González, Manuel Alejandro; Lopez Alvarez, David; Millan Vizuete, Amador; Heredero Lazaro, Ana M.; Durán, Alex; Herrero Zaragoza, José Ramón; Verdu Mula, Javier; Becerra Fontal, Yolanda; Morancho Llena, Enrique
    Date of publication: 2008-07
    Book

  • Measuring operating system overhead on CMT processors  Open access

     Radojkovic, Petar; Cakarevic, Vladimir; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Gioiosa, Roberto; Cazorla Almeida, Francisco Javier; Nemirovsky, Mario; Valero Cortes, Mateo
    Symposium on Computer Architecture and High Performance Computing
    Presentation's date: 2008-10-29
    Presentation of work at congresses

    Numerous studies have shown that Operating System (OS) noise is one of the reasons for significant performance degradation in clustered architectures. Although many studies examine OS noise for High Performance Computing (HPC), especially in multi-processor/core systems, most of them focus on 2- or 4-core systems. In this paper, we analyze the major sources of OS noise on a massively multithreaded processor, the Sun UltraSPARC T1, running Linux and Solaris. Since a real system is too complex to analyze, we compare those results with a low-overhead runtime environment: the Netra Data Plane Software Suite (Netra DPS). Our results show that the overhead introduced by the OS timer interrupt in Linux and Solaris depends on the particular core and hardware context in which the application is running. This overhead is up to 30% when the application is executed on the same hardware context as the timer interrupt handler, and up to 10% when the application and the timer interrupt handler run on different contexts but on the same core. We detect no overhead when the benchmark and the timer interrupt handler run on different cores of the processor.
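
    A minimal sketch of the measurement idea (not the paper's benchmark): time a fixed, purely CPU-bound unit of work many times; iterations perturbed by the timer interrupt or other OS activity show up as outliers relative to the minimum.

        #define _POSIX_C_SOURCE 200809L
        #include <stdio.h>
        #include <time.h>

        #define ITERS 10000
        #define WORK  100000                     /* fixed CPU-bound unit of work */

        static volatile unsigned long sink;

        static void do_work(void)
        {
            unsigned long x = 0;
            for (int i = 0; i < WORK; i++)
                x += (unsigned long)i * 2654435761u;
            sink = x;                            /* keep the loop from being optimized away */
        }

        static long ns_between(const struct timespec *a, const struct timespec *b)
        {
            return (b->tv_sec - a->tv_sec) * 1000000000L + (b->tv_nsec - a->tv_nsec);
        }

        int main(void)
        {
            long min = -1, max = 0;
            for (int i = 0; i < ITERS; i++) {
                struct timespec t0, t1;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                do_work();
                clock_gettime(CLOCK_MONOTONIC, &t1);
                long ns = ns_between(&t0, &t1);
                if (min < 0 || ns < min) min = ns;
                if (ns > max) max = ns;
            }
            /* min approximates noise-free execution; max/min hints at the worst
               perturbation seen on this hardware context. */
            printf("min %ld ns, max %ld ns\n", min, max);
            return 0;
        }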

  • Montando el puzzle: visión global de un sistema operativo

     Verdu Mula, Javier; Lopez Alvarez, David; Pajuelo González, Manuel Alejandro
    Jornadas de Enseñanza Universitaria de la Informática
    Presentation's date: 2008-07
    Presentation of work at congresses

  • Overhead of the spin-lock loop in UltraSPARC T2  Open access

     Cakarevic, Vladimir; Radojkovic, Petar; Cazorla Almeida, Francisco Javier; Gioiosa, Roberto; Nemirovsky, Mario; Valero Cortes, Mateo; Pajuelo González, Manuel Alejandro; Verdu Mula, Javier
    HiPEAC Industrial Workshop
    Presentation's date: 2008-06-04
    Presentation of work at congresses

    Spin locks are a task synchronization mechanism used to provide mutual exclusion on shared software resources. Spin locks perform better than other synchronization mechanisms in several situations, e.g., when tasks on average wait a short time to obtain the lock, when the probability of getting the lock is high, or when no other synchronization mechanism is available. In this paper we study the effect that the execution of spin locks creates in multithreaded processors. Besides moving to multicore architectures, recent industry trends show a big move toward hardware multithreaded processors. The Intel P4, IBM POWER5 and POWER6, and Sun's UltraSPARC T1 and T2 all implement multithreading to various degrees. By sharing more processor resources we can increase system performance, but at the same time the impact that simultaneously executing processes introduce to each other also increases.

  • Una metodología para obtener la visión global de un SO

     Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Lopez Alvarez, David
    Jornades de Docència del Departament d'Arquitectura de Computadors
    Presentation's date: 2008-02-14
    Presentation of work at congresses

  • MultiLayer processing - an execution model for parallel stateful packet processing

     Verdu Mula, Javier; Nemirovsky, Mario; Valero Cortes, Mateo
    ACM/IEEE Symposium on Architectures for Networking and Communications Systems
    Presentation's date: 2008-11
    Presentation of work at congresses

  • Understanding the overhead of the spin-lock loop in CMT architectures  Open access

     Cakarevic, Vladimir; Radojkovic, Petar; Verdu Mula, Javier; Cazorla Almeida, Francisco Javier; Gioiosa, Roberto; Pajuelo González, Manuel Alejandro; Nemirovsky, Mario; Valero Cortes, Mateo
    Workshop on the Interaction between Operating Systems and Computer Architecture
    Presentation's date: 2008-06-18
    Presentation of work at congresses

    Spin locks are a synchronization mechanism used to provide mutual exclusion on shared software resources. Spin locks are preferred over other synchronization mechanisms in several situations, for example when the average waiting time to obtain the lock is short, in which case the probability of getting the lock is high, or when it is not possible to use other synchronization mechanisms. In this paper, we study the effect that the execution of the Linux spin-lock loop in the Sun UltraSPARC T1 and T2 processors introduces on other running tasks, especially in the worst-case scenario where the workload shows high contention on a lock. For this purpose, we create a task that continuously executes the spin-lock loop and execute several instances of this task together with other active tasks. Our results show that, when the spin-lock tasks run with other applications in the same core of a T1 or a T2 processor, they introduce a significant overhead on the other applications: 31% in T1 and 42% in T2, on average. For the T1 and T2 processors, we identify the fetch bandwidth as the main source of interaction between active threads and the spin-lock threads. We propose four different variants of the Linux spin-lock loop that require less fetch bandwidth. Our proposal reduces the overhead of the spin-lock tasks on the other applications down to 3.5% and 1.5% on average, in T1 and T2 respectively. This is a reduction of 28 percentage points with respect to the Linux spin-lock loop for T1. For T2 the reduction is about 40 percentage points.

  • Understanding the overhead of the spin-lock loop in CMT architectures  Open access

     Cakarevic, Vladimir; Radojkovic, Petar; Verdu Mula, Javier; Pajuelo González, Manuel Alejandro; Gioiosa, Roberto; Cazorla Almeida, Francisco Javier; Nemirovsky, Mario; Valero Cortes, Mateo
    Workshop on the Interaction between Operating Systems and Computer Architecture
    Presentation's date: 2008-06
    Presentation of work at congresses

    Spin locks are a synchronization mechanism used to provide mutual exclusion on shared software resources. Spin locks are preferred over other synchronization mechanisms in several situations, for example when the average waiting time to obtain the lock is short, in which case the probability of getting the lock is high, or when it is not possible to use other synchronization mechanisms. In this paper, we study the effect that the execution of the Linux spin-lock loop in the Sun UltraSPARC T1 and T2 processors introduces on other running tasks, especially in the worst-case scenario where the workload shows high contention on a lock. For this purpose, we create a task that continuously executes the spin-lock loop and execute several instances of this task together with other active tasks. Our results show that, when the spin-lock tasks run with other applications in the same core of a T1 or a T2 processor, they introduce a significant overhead on the other applications: 31% in T1 and 42% in T2, on average. For the T1 and T2 processors, we identify the fetch bandwidth as the main source of interaction between active threads and the spin-lock threads. We propose four different variants of the Linux spin-lock loop that require less fetch bandwidth. Our proposal reduces the overhead of the spin-lock tasks on the other applications down to 3.5% and 1.5% on average, in T1 and T2 respectively. This is a reduction of 28 percentage points with respect to the Linux spin-lock loop for T1. For T2 the reduction is about 40 percentage points.
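
    A minimal test-and-test-and-set sketch (GCC/Clang __atomic builtins) of the general technique, not the Linux kernel spin lock nor one of the four variants proposed in the paper: the busy-wait spins on a plain load, so a waiting thread executes a very small loop body and consumes less fetch bandwidth than one that retries the atomic read-modify-write on every iteration.

        typedef struct { int locked; } spinlock_t;

        void spin_lock(spinlock_t *l)
        {
            for (;;) {
                /* Attempt the atomic acquire only when the lock looks free. */
                if (!__atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE))
                    return;
                /* Busy-wait on a plain load while the lock is held; an
                   architecture-specific pause/backoff hint could go here. */
                while (__atomic_load_n(&l->locked, __ATOMIC_RELAXED))
                    ;
            }
        }

        void spin_unlock(spinlock_t *l)
        {
            __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
        }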

  • Computación de Altas Prestaciones V: Arquitecturas, Compiladores, Sistemas Operativos, Herramientas y Aplicaciones

     Ramirez Bellido, Alejandro; Valero Cortes, Mateo; Moreto Planas, Miquel; Cazorla Almeida, Francisco Javier; Abella Ferrer, Jaume; Figueiredo Boneti, Carlos Santieri; Gioiosa, Roberto; Pajuelo González, Manuel Alejandro; Quiñones Moreno, Eduardo; Verdu Mula, Javier; Guitart Fernández, Jordi; Fernandez Jimenez, Agustin; Garcia Almiñana, Jordi; Utrera Iglesias, Gladys Miriam
    Participation in a competitive project

  • Parallelizing deep packet processing in highly parallel architectures

     Verdu Mula, Javier; Nemirovsky, Mario; Valero Cortes, Mateo
    Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES 2007)
    Presentation's date: 2007-07
    Presentation of work at congresses

  • The Impact of Traffic Aggregation on the Memory Performance of Networking Applications

     Verdu Mula, Javier; Garcia Vidal, Jorge; Nemirovsky, Mario; Valero Cortes, Mateo
    Journal of embedded computing
    Date of publication: 2006-10
    Journal article

  • The Impact of Traffic Aggregation on the Memory Performance of Networking Applications

     Verdu Mula, Javier; Garcia Vidal, Jorge; Nemirovsky, Mario; Valero Cortes, Mateo
    Computer architecture news
    Date of publication: 2005-03
    Journal article

  • Workload analyses of networking applications

     Verdu Mula, Javier; Garcia Mateos, Jorge; Nemirvsky, Mario; Valero Cortes, Mateo
    Jornadas de Paralelismo
    Presentation of work at congresses

  • Performance Analysis of a New Packet Trace Compressor based on TCP Flow Clustering

     Holanda Filho, Raimir; Verdu Mula, Javier; Garcia Vidal, Jorge; Valero Cortes, Mateo
    2005 IEEE International Symposium on Performance Analysis of Systems And Software (ISPASS'05)
    Presentation of work at congresses

  • Performance Analysis of a New Packet Trace Compressor based on TCP Flow Clustering

     Holanda Filho, Raimir; Verdu Mula, Javier; Garcia Vidal, Jorge; Valero Cortes, Mateo
    IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
    Presentation's date: 2005-03
    Presentation of work at congresses

  • Architectural Impact of Stateful Networking Applications

     Verdu Mula, Javier; Garcia Vidal, Jorge; Nemirovsky, Mario; Valero Cortes, Mateo
    1st ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS-I).
    Presentation's date: 2005-10-26
    Presentation of work at congresses

  • Workload Characterization of Stateful Networking Applications

     Verdu Mula, Javier; Garcia Vidal, Jorge; Nemirovsky, Mario; Valero Cortes, Mateo
    International Symposium on High Performance Computing
    Presentation of work at congresses

  • Analysis of Traffic Traces for Stateful Applications

     Verdu Mula, Javier; Garcia Vidal, Jorge; Nemirovsky, M; Valero Cortes, Mateo
    Jornadas de Paralelismo
    Presentation of work at congresses

  • Traffic Aggregation Impact on the Memory Performance of Networking Applications

     Valero Cortes, Mateo; Verdu Mula, Javier; Nemirosvky, M; Garcia Vidal, Jorge
    MEDEA Workshop (MEmory performance: DEaling with Applications, systems and architecture), in conjunction with the PACT 2004 Conference
    Presentation of work at congresses

  • Analysis of Traffic Traces for Stateful Applications

     Verdu Mula, Javier
    Third Workshop on Network Processors & Applications (NP3) in conjunction with the Tenth International Symposium on High-Performance Computer Architecture (HPCA-10)
    Presentation's date: 2004-02-14
    Presentation of work at congresses

  • The impact of traffic aggregation on the memory performance of networking applications

     Verdu Mula, Javier; Garcia Vidal, Jorge; Nemirovsky, Mario; Valero Cortes, Mateo
    MEDEA Workshop (MEmory performance: DEaling with Applications, systems and architecture), in conjunction with the PACT 2004 Conference
    Presentation's date: 2004-09
    Presentation of work at congresses

  • Traffic Aggregation Impact on the Memory Performance of Networking Applications

     Verdu Mula, Javier
    MEDEA Workshop (MEmory performance: DEaling with Applications, systems and architecture), in conjunction with the PACT 2004 Conference
    Presentation's date: 2004-09-01
    Presentation of work at congresses

  • Analysis of Traffic Traces for Stateful Applications

     Valero Cortes, Mateo; Verdu Mula, Javier; Garcia Vidal, Jorge; Nemirowsky, M
    Third Workshop on Network Processors & Applications (NP3) in conjunction with the Tenth International Symposium on High-Performance Computer Architecture (HPCA-10)
    Presentation of work at congresses

  • Analysis of traffic traces for stateful applications

     Verdu Mula, Javier
    Jornadas de Paralelismo
    Presentation's date: 2004-09-15
    Presentation of work at congresses

  • The Impact of Traffic Aggregation on the Memory Performance of Networking Applications

     Verdu Mula, Javier; Garcia Vidal, Jorge; Nemirovsky, M; Valero Cortes, Mateo
    Date: 2004-07
    Report

  • Workload Characterization of Emerging Stateful Networking Applications

     Verdu Mula, Javier; Garcia Vidal, Jorge; Nemirovsky, M; Valero Cortes, Mateo
    Date: 2004-09
    Report

  • Analysis of Traffic Traces for Stateful Applications

     Verdu Mula, Javier; Garcia Vidal, Jorge; Nemirovsky, M; Valero Cortes, Mateo
    Date: 2003-11
    Report

  • Retos en el Diseño de Network Processors

     Verdu Mula, Javier; Corbal San Adrian, Jesus; Garcia Vidal, Jorge; Valero Cortes, Mateo
    XIII Jornadas de Paralelismo
    Presentation of work at congresses
