We present a new model for the execution of Prolog programs, called MEM (Multipath Execution Model), which aims to improve on the performance of the standard sequential depth-first, left-to-right execution model by reducing the computational cost of traversing the search tree associated with a program. MEM combines depth and breadth exploration of the search tree in a way that avoids the execution of several control instructions. Moreover, MEM can easily exploit a new kind of parallelism, called path parallelism, which allows unify operations corresponding to different paths to be executed in parallel, with the additional benefit that the parallel computations are always independent of one another.
This paper presents a review and a classification of mechanisms for reducing the cost of branches in pipelined processors. We show that the wide spectrum of mechanisms proposed in the literature is based on just a few techniques, which are combined in each case to build the particular mechanism. The basis of each technique is explained, and many examples of real cases are given, drawn from most of the latest commercial and academic processors.
Simultaneous access to several vectors is typical in vector multiprocessors. When these accesses are performed asynchronously, collisions in the network and conflicts in the memory modules produce high latencies that reduce the efficiency of the system. The authors propose a block-interleaved storage scheme for storing streams, together with a synchronized out-of-order access mechanism for the vectors that compose a stream, so that no access conflicts occur for several families of strides.
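To illustrate why the stride of a vector access matters under interleaved storage, the following minimal sketch maps addresses to memory modules. It assumes a generic block-interleaving rule (module = (address // block_size) mod number of modules). The function names `module_of` and `modules_touched` and all parameters are hypothetical, and the sketch does not reproduce the storage scheme or the out-of-order access mechanism proposed by the authors.

```python
def module_of(address, num_modules, block_size=1):
    """Return the memory module holding `address` under block interleaving.

    Assumed rule (not taken from the paper): consecutive blocks of
    `block_size` words go to consecutive modules. block_size=1 gives
    plain low-order interleaving.
    """
    return (address // block_size) % num_modules


def modules_touched(base, stride, length, num_modules, block_size=1):
    """List the module hit by each element of a strided vector access."""
    return [
        module_of(base + i * stride, num_modules, block_size)
        for i in range(length)
    ]


if __name__ == "__main__":
    # Unit stride spreads the elements over distinct modules.
    print(modules_touched(base=0, stride=1, length=4, num_modules=8))
    # A stride equal to the number of modules hits one module repeatedly.
    print(modules_touched(base=0, stride=8, length=4, num_modules=8))
    # The same stride with 4-word blocks is spread over several modules.
    print(modules_touched(base=0, stride=8, length=4, num_modules=8, block_size=4))
```

Repeated module numbers in a result mark elements that would compete for the same module if they were requested back to back.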
External memory communication is one of the most limiting factors for processor performance. In this paper, a new mechanism is presented for RISC architectures and, more generally, for architectures with a fixed pipeline structure in which data references are generated by load/store instructions. The mechanism is based on the restricted use of an on-chip instruction buffer (RIB) to avoid the structural hazards generated by concurrent instruction and data memory accesses. Significant area savings and a performance improvement can be obtained compared with traditional on-chip instruction cache approaches.
The execution of branch instructions causes a loss of performance on pipelined processors. In this paper, a new branch mechanism based on a Branch Target Buffer is presented. It executes branches with zero time cost. An analytical model has been developed and simulations have been performed to evaluate its performance improvement for several pipeline structures. The chip area required for its implementation is also considered. The performance increase and the simplicity of its design make it suitable for inclusion in a RISC-like processor.
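As background on the structure this mechanism builds on, the sketch below models a generic direct-mapped Branch Target Buffer in software. It is not the mechanism proposed in the paper: the class name, the 4-byte instruction size, the indexing rule, and the policy of dropping an entry when a branch is not taken are all assumptions made for illustration.

```python
class BranchTargetBuffer:
    """Generic direct-mapped BTB model (illustrative assumptions only)."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        # Each slot is None or a (branch_pc, target_pc) pair.
        self.entries = [None] * num_entries

    def _index(self, pc):
        # Assumes 4-byte, word-aligned instructions.
        return (pc >> 2) % self.num_entries

    def predict(self, pc):
        """Return the predicted next fetch address for `pc`."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]  # hit: fetch from the stored target
        return pc + 4        # miss: fall through to the next instruction

    def update(self, pc, taken, target):
        """Record the resolved outcome of the branch at `pc`."""
        idx = self._index(pc)
        if taken:
            self.entries[idx] = (pc, target)
        elif self.entries[idx] is not None and self.entries[idx][0] == pc:
            self.entries[idx] = None  # assumed policy: forget not-taken branches


if __name__ == "__main__":
    btb = BranchTargetBuffer(num_entries=4)
    print(hex(btb.predict(0x100)))  # miss on first encounter
    btb.update(0x100, taken=True, target=0x80)
    print(hex(btb.predict(0x100)))  # hit after the branch was taken once
```

In this model a hit lets the fetch stage redirect to the target in the same cycle the branch is fetched, which is the general idea behind removing the branch penalty.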
Execution of branch instructions is one of the main factors that prevent RISC processors from achieving their peak execution rate. A mechanism that attempts to execute branches with zero time cost is proposed. An analytical model that explains the behavior of the mechanism is presented. Simulation results show a significant performance improvement when compared with other schemes widely used in RISC architectures.
The multiple-bus network is attractive for interconnecting processors to memory modules in a multiprocessor system. The use of this network requires an M-user, B-server arbiter. Several designs are shown for such an arbiter with a round-robin policy. The iterative design is simple but relatively slow and does not use all the possibilities of integration. A network with one level of lookahead is faster and more integrated. To increase the speed of arbitration even more, a design with two levels of lookahead is proposed. This network can be used effectively for up to 16 processors, memory modules and busses.
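The round-robin policy that such an arbiter implements can be sketched behaviorally as follows. This is a software model of the policy only, not the iterative or lookahead hardware designs described above. The function name `round_robin_arbitrate` and the rule for advancing the priority pointer (the requester after the last one granted gets highest priority next) are assumptions.

```python
def round_robin_arbitrate(requests, num_buses, start):
    """Grant up to `num_buses` of the M requesters in round-robin order.

    requests: list of booleans, one per requester (M entries).
    start:    index of the requester with highest priority this cycle.
    Returns (grants, next_start), where grants lists the granted indices.
    """
    m = len(requests)
    grants = []
    for offset in range(m):
        i = (start + offset) % m
        if requests[i]:
            grants.append(i)
            if len(grants) == num_buses:
                break
    # Assumed pointer update: resume scanning just after the last grant.
    next_start = (grants[-1] + 1) % m if grants else start
    return grants, next_start


if __name__ == "__main__":
    requests = [True, False, True, True]
    grants, pointer = round_robin_arbitrate(requests, num_buses=2, start=1)
    print(grants, pointer)
    grants, pointer = round_robin_arbitrate(requests, num_buses=2, start=pointer)
    print(grants, pointer)
```

The loop visits requesters one at a time, which mirrors why a purely iterative arbiter is slow: each grant decision depends on the ones before it.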