In this paper, we present OmpSs, a programming model based on OpenMP and StarSs that can also incorporate the use of OpenCL or CUDA kernels. We evaluate the proposal on different architectures, namely SMPs, GPUs, and hybrid SMP/GPU environments, showing the broad applicability of the approach. The evaluation uses six benchmarks: Matrix Multiply, BlackScholes, Perlin Noise, Julia Set, PBPI, and FixedGrid. We compare the results with those obtained by executing the same benchmarks written in OpenCL or OpenMP on the same architectures. The results show that OmpSs greatly outperforms both environments. OmpSs offers a more flexible programming environment than traditional approaches for exploiting multiple accelerators, and the simplicity of its annotations increases programmers' productivity.
Many parallel algorithms exhibit a hypercube communication topology. Such algorithms can easily be executed on a multicomputer with a hypercube interconnection topology. However, in most cases these parallel algorithms only make use of a small fraction of the interconnection bandwidth offered by the multicomputer. In particular, each processor of a hypercube multicomputer is connected to d different neighbors by d different links. Nevertheless, hypercube algorithms usually do not use more than one of these d links at the same time. This paper presents a technique called communication pipelining that enables a more efficient use of the interconnection network and, in consequence, a significant reduction in the execution time. This technique is based on a transformation of the original algorithm. The resulting equivalent algorithm makes use of several links of each node simultaneously. Given a particular problem and a particular architecture, the degree of pipelining to be applied is a design parameter that must be decided when transforming the original algorithm. The paper presents analytical models that allow for an optimal choice of the degree of pipelining for each problem and a given architecture. To illustrate the performance of the communication pipelining technique, its application to the FFT computation is presented as an example. It is shown that an optimal choice of the degree of pipelining can achieve a reduction by a factor of d in the communication overhead of the algorithm.
This paper describes an automatic data distribution method that deals with both the alignment and the distribution problems in a single optimization phase, as opposed to sequentially solving these two inter-dependent problems as done in previous work. The core of this work is the Communication-Parallelism Graph, which describes the relationships, in terms of communication and parallelism, among array dimensions of the same and different array references. The overall data distribution problem is then formulated as a linear 0-1 integer programming problem whose objective function, to be minimized, is the total execution time. The solution is static in the sense that the layout of the arrays does not change during the execution of the program. We also show the feasibility of this approach in terms of compilation time and the quality of the solutions generated.
In vector multiprocessor systems, collisions in the interconnection network and conflicts in the memory modules are the main causes of performance degradation. In this work we use a synchronized interconnection network and propose an interleaved storage scheme, together with out-of-order access to the elements of each stream, that allows conflict-free access. The streams are generated asynchronously by the different processors. The mechanism works for the most common strides found in real programs. The required hardware is also described, and its complexity is shown to be equivalent to that needed when the processor requests the elements in order.
Valero, M.; Lang, T.; Llaberia, J.; Peiron, M.; Navarro, J.; Ayguade, E. Parallel Processing Letters, vol. 1, no. 2, pp. 95-102. DOI: 10.1142/S0129626491000045. Publication date: 1991-12. Journal article.
Address transformation schemes, such as skewing and linear transformations, have been proposed to achieve conflict-free access to one family of strides in vector processors with matched memories. This paper extends these schemes to achieve conflict-free access for several families. The basic idea is to perform an out-of-order access to vectors of fixed length, equal to the length of the vector registers of the processor. The hardware required is similar to that needed for in-order access.