There is a clear industrial trend towards chip multiprocessors (CMP) as the most power efficient way of further increasing performance. Heterogeneous CMP architectures take one more step along this power efficiency trend by using multiple types of processors, tailored to the workloads they will execute. Programming these CMP architectures has been identified as one of the main challenges in the near future, and programming heterogeneous systems is even more challenging. High-level programming models which allow the programmer to identify parallel tasks, and the runtime management of the inter-task dependencies, have been identified as a suitable model for programming such heterogeneous CMP architectures.
In this paper we analyze the performance of Cell Superscalar, a task-based programming model for the Cell Broadband Engine Architecture, in terms of its scalability to higher number of on-chip processors. Our results show that the low performance of the PPE component limits the scalability of some applications to less than 16 processors. Since the PPE has been identified as the limiting element, we perform a set of simulation studies evaluating the impact of out-of-order execution, branch prediction and larger caches on the task management overhead.
We conclude that out-of-order execution is a very desirable feature, since it increases task management performance by 50%. We also identify memory latency as a fundamental aspect in performance, while the working set is not that large. We expect a significant performance impact if task management would run using a fast private memory to store the task dependency graph instead of relying on the cache hierarchy.
Bellens, P.; Perez, Josep M.; Cabarcas, F.; Alex Ramirez; Badia, R.M.; Labarta, J. Scientific programming Vol. 17, num. 1-2, p. 77-95 DOI: 10.3233/SPR-2009-0272 Data de publicació: 2009-01 Article en revista
In this paper we describe the extension of the CAPO parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report some results for several benchmark codes and one full application that have been parallelized using our system.
In this paper we explore the idea of customizing and reusing loop schedules to improve the scalability of non-regular numerical codes in shared-memory architectures with non-uniform memory access latency. The main objective is to implicitly setup affinity links between threads and data, by devising loop schedules that achieve balanced work distribution within irregular data spaces and reusing them as much as possible along the execution of the program for better memory access locality. This transformation provides a great deal of flexibility in optimizing locality, without compromising the simplicity of the shared-memory programming paradigm. In particular, the programmer does not need to explicitly distribute data between processors. The paper presents practical examples from real applications and experiments showing the efficiency of the approach.
Many scientific applications involve array operations that are sparse in nature, ie array elements depend on the values of relatively few elements of the same or another array. When parallelised in the shared-memory model, there are often inter-thread dependencies which require that the individual array updates are protected in some way. Possible strategies include protecting all the updates, or having each thread compute local temporary results which are then combined globally across threads. However, for the extremely common situation of sparse array access, neither of these approaches is particularly efficient. The key point is that data access patterns usually remain constant for a long time, so it is possible to use an inspector/executor approach. When the sparse operation is first encountered, the access pattern is inspected to identify those updates which have potential inter-thread dependencies. Whenever the code is actually executed, only these selected updates are protected. We propose a new OpenMP clause, indirect, for parallel loops that have irregular data access patterns. This is trivial to implement in a conforming way by protecting every array update, but also allows for an inspector/executor compiler implementation which will be more efficient in sparse cases. We describe efficient compiler implementation strategies for the new directive. We also present timings from the kernels of a Discrete Element Modelling application and a Finite Element code where the inspector/executor approach is used. The results demonstrate that the method can be extremely efficient in practice.
Nikolopoulos, D.; Papatheodorou, T.; Polychronopoulos, C.; Labarta, J.; Ayguade, E. Scientific programming Vol. 8, num. 3, p. 143-162 DOI: 10.1155/2000/417570 Data de publicació: 2001-07 Article en revista
This paper makes two important contributions. First, the paper investigates the performance implications of data placement in OpenMP programs running on modern NUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that due to the low remote-to-local memory access latency ratio of contemporary NUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution, incur modest performance losses. Second, the paper presents a transparent, user-level page migration engine with an ability to gain back any performance loss that stems from suboptimal placement of pages in iterative OpenMP programs. The main body of the paper describes how our OpenMP runtime environment uses page migration for implementing implicit data distribution and redistribution schemes without programmer intervention. Our experimental results verify the effectiveness of the proposed framework and provide a proof of concept that it is not necessary to introduce data distribution directives in OpenMP and warrant the simplicity or the portability of the programming model.
In order to ascertain the current status of the implementation of data parallel languages, with particular emphasis on High Performance Fortran (HPF), a survey of proposed or existing implementations has been carried out. The results of the survey are presented in this article. The main objectives of the survey were to determine: (1) what is currently available; (2) the current status of the implementations; (3) how they adhered to the proposed standards; and (4) what future developments are planned. The survey was carried out over the period February to May 1996. It is a snapshot of the state of HPF implementations.
This article describes the main features and implementation of our automatic data distribution research tool. The tool (DDT) accepts programs written in Fortran 77 and generates High Performance Fortran (HPF) directives to map arrays onto the memories of the processors and parallelize loops, and executable statements to remap these arrays. DDT works by identifying a set of computational phases (procedures and loops). The algorithm builds a search space of candidate solutions for these phases which is explored looking for the combination that minimizes the overall cost; this cost includes data movement cost and computation cost. The movement cost reflects the cost of accessing remote data during the execution of a phase and the remapping costs that have to be paid in order to execute the phase with the selected mapping. The computation cost includes the cost of executing a phase in parallel according to the selected mapping and the owner computes rule. The tool supports interprocedural analysis and uses control flow information to identify how phases are sequenced during the execution of the application.