Agent-based modelling and simulation is a promising methodology that can be applied in the study of population dynamics. The main advantage of this technique is that it allows representing the particularities of the individuals that are modeled along with the interactions that take place among them and their environment. Hence, classical numerical simulation approaches are less adequate for reproducing complex dynamics. Nowadays, there is a rise of interest on using distributed computing to perform large-scale simulation of social systems. However, the inherent complexity of this type of applications is challenging and requires the study of possible solutions from the parallel computing perspective (e.g., how to deal with fine grain or irregular workload). In this paper, we discuss the particularities of simulating populating dynamics by using parallel discrete event simulation methodologies. To illustrate our approach, we present a possible solution to make transparent the use of parallel simulation for modeling demographic systems: Yades tool. In Yades, modelers can easily define models that describe different demographic processes with a web user interface and transparently run them on any computer architecture environment thanks to its demographic simulation library and code generator. Therefore, transparency is provided by by two means: the provision of a web user interface where modelers and policy makers can specify their agent-based models with the tools they are familiar with, and the automatic generation of the simulation code that can be executed in any platform (cluster or supercomputer). A study is conducted to evaluate the performance of our solution in a High Performance Computing environment. The main benefit of this outline is that our findings can be generalized to problems with similar characteristics to our demographic simulation model.
Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses to larger remote access operations. A straightforward implementation of the inspector executor transformation results in excessive instrumentation that hinders performance.; This paper addresses this issue and introduces various techniques that aim at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs) [S. Aarseth, Gravitational N-Body Simulations: Tools and Algorithms, Cambridge Monographs on Mathematical Physics, Cambridge University Press, 2003.], the inlining of data locality checks and the usage of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that were propagated through the loop body.; A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase performance of UPC programs up to 1.8 x their UPC hand-optimized counterpart for applications with regular accesses and up to 6.3 x for applications with irregular accesses.
In this work, we analyze the scalability of inexact two-level balancing domain decomposition by constraints (BDDC) preconditioners for Krylov subspace iterative solvers, when using a highly scalable asynchronous parallel implementation where fine and coarse correction computations are overlapped in time. This way, the coarse-grid problem can be fully overlapped by fine-grid computations (which are embarrassingly parallel) in a wide range of cases. Further, we consider inexact solvers to reduce the computational cost/complexity and memory consumption of coarse and local problems and boost the scalability of the solver. Out of our numerical experimentation, we conclude that the BDDC preconditioner is quite insensitive to inexact solvers. In particular, one cycle of algebraic multigrid (AMG) is enough to attain algorithmic scalability. Further, the clear reduction of computing time and memory requirements of inexact solvers compared to sparse direct ones makes possible to scale far beyond state-of-the-art BDDC implementations. Excellent weak scalability results have been obtained with the proposed inexact/overlapped implementation of the two-level BDDC preconditioner, up to 93,312 cores and 20 billion unknowns on JUQUEEN. Further, we have also applied the proposed setting to unstructured meshes and partitions for the pressure Poisson solver in the backward-facing step benchmark domain.
The current trend in development of parallel programming models is to combine different well established models into a single programming, model in order to support efficient implementation of a wide range of real world applications. The dataflow model has particularly managed to recapture the interest of the research community due to its ability to express parallelism efficiently. Thus, a number of recently proposed hybrid parallel programming models combine dataflow and traditional shared memory models. Their findings have influenced the introduction of task dependency in the OpenMP 4.0 standard.; This article presents DaSH - the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs that capture patterns of communication and computation common to a wide range of emerging applications. DaSH also includes sequential and shared-memory implementations based on OpenMP and Intel TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared memory implementations based on work-sharing and/or tasks. Finally, we use DaSH to evaluate three different hybrid dataflow models, identify their advantages and shortcomings, and motivate further research on their characteristics.
The rapid advancement, use of diverse architectural features and introduction of High Level Synthesis (HLS) tools in FPGA technology have enhanced the capacity of data-level parallelism on a chip. A generic FPGA based HLS multi-accelerator system requires a microprocessor (master core) that manages memory and schedules accelerators. In a real environment, such HLS multi-accelerator systems do not give a perfect performance due to memory bandwidth issues. Thus, a system demands a memory manager and a scheduler that improves performance by managing and scheduling the multi-accelerator’s memory access patterns efficiently. In this article, we propose the integration of an intelligent memory system and efficient scheduler in the HLS-based multi-accelerator environment called Advanced Multi-accelerator Controller (AMC). The AMC system is evaluated with memory intensive accelerators, High Performance Computing (HPC) applications and implemented and tested on a Xilinx Virtex-5 ML505 evaluation FPGA board. The performance of the system is compared against the microprocessor-based systems that have been integrated with the operating system. Results show that the AMC based HLS multi-accelerator system achieves 10.4x and 7x of speedup compared to the MicroBlaze and Intel Core based HLS multi-accelerator systems.
One of the main difficulties using multi-point statistical (MPS) simulation based on annealing techniques or genetic algorithms concerns the excessive amount of time and memory that must be spent in order to achieve convergence. In this work we propose code optimizations and parallelization schemes over a genetic-based MPS code with the aim of speeding up the execution time. The code optimizations involve the reduction of cache misses in the array accesses, avoid branching instructions and increase the locality of the accessed data. The hybrid parallelization scheme involves a fine-grain parallelization of loops using a shared-memory programming model (OpenMP) and a coarse-grain distribution of load among several computational nodes using a distributed-memory programming model (MPI). Convergence, execution time and speed-up results are presented using 2D training images of sizes 100 × 100 × 1 and 1000 × 1000 × 1 on a distributed-shared memory supercomputing facility.
In this paper, we investigate how to exploit task-parallelism during the execution of the Cholesky factorization on clusters of multicore processors with the SMPSs programming model. Our analysis reveals that the major difficulties in adapting the code for this operation in ScaLAPACK to SMPSs lie in algorithmic restrictions and the semantics of the SMPSs programming model, but also that they both can be overcome with a limited programming effort. The experimental results report considerable gains in performance and scalability of the routine parallelized with SMPSs when compared with conventional approaches to execute the original ScaLAPACK implementation in parallel as well as two recent message-passing routines for this operation. In summary, our study opens the door to the possibility of reusing message-passing legacy codes/libraries for linear algebra, by introducing up-to-date techniques like dynamic out-of-order scheduling that significantly upgrade their performance, while avoiding a costly rewrite/reimplementation.
Modern supercomputers deliver large computational power, but it is difficult for an application to exploit such power. One factor that limits the application performance is the single node performance. While many performance tools use the microprocessor performance counters to provide insights on serial node performance issues, the complex semantics of these counters pose an obstacle to an inexperienced developer. We present a framework that allows easy identification and qualification of serial node performance bottlenecks in parallel applications. The output of the framework is precise and it is capable of correlating performance inefficiencies with small regions of code within the application. The framework not only points to regions of code but also simplifies the semantics of the performance counters into metrics that refer to processor functional units. With such information the developer can focus on the identified code and improve it by knowing which processor execution unit is degrading the performance. To demonstrate the usefulness of the framework we apply it to three already optimized applications using realistic inputs and, according to the results, modify their source code. By doing modifications that require little effort, we successfully increase the applications' performance from 10% to 30% and thus shorten the time required to reach the solution and/or allow facing increased problem sizes.
Modern supercomputers deliver large computational power, but it is difficult for an application to exploit such power. One factor that limits the application performance is the single node performance. While many performance tools use the microprocessor performance counters to provide insights on serial node performance issues, the complex semantics of these counters pose an obstacle to an inexperienced developer.
We present a framework that allows easy identification and qualification of serial node performance bottlenecks in parallel applications. The output of the framework is precise and it is capable of correlating performance inefficiencies with small regions of code within the application. The framework not only points to regions of code but also simplifies the semantics of the performance counters into metrics that refer to processor functional units. With such information the developer can focus on the identified code and improve it by knowing which processor execution unit is degrading the performance. To demonstrate the usefulness of the framework we apply it to three already optimized applications using realistic inputs and, according to the results, modify their source code. By doing modifications that require little effort, we successfully increase the applications’ performance from 10% to 30% and thus shorten the time required to reach the solution and/or allow facing increased problem sizes.
Power has become the primary constraint in high performance computing. Traditionally, parallel job scheduling policies have been designed to improve certain job performance metrics when scheduling parallel workloads on a system with a given number of processors. The available number of processors is not anymore the only limitation in parallel job scheduling. The recent increase in processor power consumption has resulted in a new limitation: the available power. Given constraints naturally lead to an optimization problem. We proposed MaxJobPerf, a new parallel job scheduling policy based on integer linear programming. Dynamic Voltage Frequency Scaling (DVFS) is a widely used technique that running applications at reduced CPU frequency/voltage trades increased execution time for power reduction. The optimization problem determines which jobs should run and at which frequency. In this paper, we compare the MaxJobPerf policy against other power budgeting policies for different power budgets. It clearly outperforms the other power-budgeting approaches at the parallel job scheduling level. Furthermore, we give a detailed analysis of the policy parameters including a discussion on how to manage job reservations to avoid job starvation.
Alba, E.; Almeida, F.; Blesa, M.; Cotta, C.; Diaz, M.; Dorta, I.; Gabarro, J.; León, C.; Luque, G.; Petit, J.; Rodríguez, M.; Rojas, A.; Xafa, F.; Xhafa, F. Parallel computing Vol. 32, num. 5-6, p. 415-440 Data de publicació: 2006-06 Article en revista
The paper discusses the implementation of a parallel algorithm to compute the eigenvalues and eigenvectors of a real symmetric matrix on a mesh multicomputer. The algorithm uses the one-sided Jacobi method and a two-dimensional organization of the nodes. It is aimed at reducing the communication cost incurred by one-dimensional algorithms found in the literature. The performance of the proposed algorithm on a squared 2D/3D mesh multicomputer is assessed through simple analytical models of execution time. The models show that the performance improvement over one-dimensional algorithms can be very noticeable, specially for a large number of nodes.
Parallel architectures with physically distributed memory providing computing cycles and large amounts of memory are becoming more and more common. To make such architectures truly usable, programming models and support tools are needed to ease the programming effort for these parallel systems. Automatic data distribution tools and techniques play an important role in achieving that goal. This paper discusses state-of-the-art approaches to fully automatic data and computation partitioning. A kernel application is used as a case study to illustrate the main differences of four representative approaches. The paper concludes with a discussion of promising future research directions for automatic data layout.
This paper presents a method to derive efficient algorithms for hypercubes. The method exploits two features of the underlying hardware: a) the parallelism provided by the multiple communication links of each node and b) the possibility of overlapping computations and communications which is a feature of machines supporting an asynchronous communication protocol. The method can be applied to a generic class of hypercube algorithms whose distinguishing features are quite frequent in common algorithms for hypercubes. Many examples of this class of algorithms are found in the literature for different problems. The paper shows the efficiency of the method for two case studies. The results show that the reduction in communication overhead is very significant in many cases. They also show that the algorithms produced by our method are always very close to the optimum in terms of execution time.