Gelado, I.; Stone, J.; Cabezas, J.; Patel, S.; Navarro, N.; Hwu, W. W. Computer Architecture News, Vol. 38, num. 1, p. 347-358. DOI: 10.1145/1735970.1736059. Publication date: 2010. Journal article.
State-of-the-art high-performance processors such as the IBM POWER5 and Intel i7 reflect an industry trend toward on-chip multiprocessors (CMP) with Simultaneous Multithreading (SMT) in each core. In these processors, the way applications are assigned to cores plays a key role in the performance of each application and of the overall system. In this paper we show that system throughput depends heavily on the Thread to Core Assignment (TCA), regardless of the SMT Instruction Fetch (IFetch) policy implemented in the cores. Our results indicate that a good TCA can improve the results of any underlying IFetch policy, yielding speedups of up to 28%. Given the relevance of TCA, we propose an algorithm to manage it in CMP+SMT processors. The proposed throughput-oriented TCA algorithm takes into account the workload characteristics and the underlying SMT IFetch policy. Our results show that the TCA algorithm finds thread-to-core assignments within 3% of the optimal assignment in each case, yielding system throughput improvements of up to 21%.
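For illustration only (this is not the paper's algorithm), a brute-force search over 2-way SMT pairings makes concrete what "optimal TCA" means. The additive per-pair IPC model and all names here are our assumptions:

```python
from itertools import permutations

def pair_threads(order):
    # 2-way SMT sketch: consecutive threads in the order share a core.
    # Assumes an even number of threads.
    return [tuple(sorted(order[i:i + 2])) for i in range(0, len(order), 2)]

def best_tca(pair_ipc, n_threads):
    # pair_ipc[(a, b)] is a hypothetical estimate of the combined IPC
    # when threads a and b share one SMT core. Exhaustive search is
    # feasible only for the small thread counts typical of CMP+SMT.
    best, best_tp = None, -1.0
    for order in permutations(range(n_threads)):
        pairs = pair_threads(order)
        tp = sum(pair_ipc[p] for p in pairs)
        if tp > best_tp:
            best, best_tp = pairs, tp
    return best, best_tp
```

A practical TCA algorithm, as the abstract suggests, would replace the exhaustive search with a heuristic informed by workload characteristics and the IFetch policy.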
Pericas, M.; Cristal, A.; Cazorla, F.; González, R.; Veidenbaum, A.; Jiménez, D. A.; Valero, M. Computer Architecture News, Vol. 36, num. 3, p. 25-36. DOI: 10.1145/1394608.1382171. Publication date: 2008-06. Journal article.
Multicore processors have emerged as a powerful platform on which to efficiently exploit thread-level parallelism (TLP). However, due to Amdahl's Law, such designs will be increasingly limited by the remaining sequential components of applications. To overcome this limitation it is necessary to design processors with many lower-performance cores for TLP and some high-performance cores designed to execute sequential algorithms. Such cores will need to address the memory wall by implementing kilo-instruction windows. Large-window processors require large Load/Store Queues that would be too slow if implemented using current CAM-based designs. This paper proposes an Epoch-based Load Store Queue (ELSQ), a new design based on Execution Locality. It is integrated into a large-window processor that has a fast, out-of-order core operating only on L1/L2 cache hits and N slower cores that process L2 misses and their dependent instructions. The large LSQ is coupled with the slow cores and is partitioned into N small, local LSQs, one per core. We evaluate ELSQ in a large-window environment, finding that it enables high performance at low power. By exploiting locality among loads and stores, ELSQ outperforms even an idealized central LSQ when implemented on top of a decoupled processor design.
To alleviate the memory wall problem, current architectural trends suggest implementing large instruction windows able to maintain a high number of in-flight instructions. However, the benefits achieved by these recent proposals may be limited because more instructions are executed down the wrong path of a mispredicted branch. The larger number of misspeculated instructions increases the energy consumed compared to traditional designs with smaller instruction windows. Our analysis shows that, for some SPEC2000 integer benchmarks, up to 2.5x more wrong-path load instructions are executed when the instruction window of a 4-way superscalar processor is increased from 256 to 1024 entries.
This paper describes a simple speculative control technique to prevent wrong-path load instructions from being executed. Our technique extends the functionality of the load-store queue to block those load instructions that depend on a hard-to-predict conditional branch until it is resolved. If the branch is actually mispredicted, unnecessary cache misses can be avoided, saving energy down the wrong path. Furthermore, instructions that depend on a blocked load are not issued because their source values are not available, which also saves dynamic energy. Our results show that the proposed mechanism reduces misspeculated load instructions by up to 26% on average, and wrong-path instructions by 18%, without any performance loss.
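A minimal trace-driven sketch of the idea follows. The trace format, the confidence values, and the 0.7 threshold are hypothetical choices of ours, not taken from the paper; real hardware would release blocked loads once the guarding branch resolves, which this counting sketch omits:

```python
def filter_guarded_loads(trace, confidence, threshold=0.7):
    # Block loads that depend on an unresolved, hard-to-predict branch.
    # trace: list of dicts with "op" in {"branch", "resolve", "load"}.
    # confidence[b]: branch predictor confidence for branch b.
    issued, blocked = [], []
    unresolved = set()
    for ins in trace:
        if ins["op"] == "branch" and confidence[ins["id"]] < threshold:
            unresolved.add(ins["id"])        # low confidence: guard it
        elif ins["op"] == "resolve":
            unresolved.discard(ins["id"])    # branch outcome now known
        elif ins["op"] == "load" and ins.get("dep") in unresolved:
            blocked.append(ins)              # hold in the load-store queue
            continue
        issued.append(ins)
    return issued, blocked
```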
L2 misses are one of the main causes of stalled activity in current and future microprocessors. In this paper we present a mechanism to speculatively execute instructions independent of L2-miss loads, even if no entry in the reorder buffer is available. The proposed mechanism generates future instances of instructions that are expected to be independent of the delinquent load. When these dynamic instructions are later fetched, they use the previously precomputed data and go directly to the commit stage without executing. The mechanism replicates strided loads found above the L2-miss load that produce the data for the target independent instructions. Instructions following the L2-miss load check whether their source operands have been replicated; if so, multiple speculative instances of them are also generated. This mechanism is built on top of a superscalar processor with an aggressive prefetch scheme. Compared to this baseline, the mechanism achieves a 21% performance improvement.
The trend in network processing is to increase the intelligence of routers (e.g. security capabilities). This means that the workload generated per packet is growing, and new types of applications are emerging, such as stateful programs. At the same time, Internet traffic continues to grow vigorously, which raises traffic aggregation levels and overloads the processing capacity of routers. In this paper we show the importance of the traffic aggregation level in networking application studies. We also classify applications according to how packet processing manages data, and present the different impacts on data cache performance for each application category. Our results show that the traffic aggregation level can affect cache performance depending on the networking application category; stateful applications are particularly sensitive to this effect.
This short paper serves to correct the errors contained in the paper entitled "Measuring Experimental Error in Microprocessor Simulation," presented at the 2001 International Symposium on Computer Architecture (ISCA-28). That paper contained a study of a validated microarchitectural simulator called sim-alpha, and included a case study that compared results obtained from a configuration of sim-alpha to those reported by Cruz et al. The comparison showed a disparity in the results between the two studies, which was invalid due to an error in the way the operand bypass network of sim-alpha was modified. In this paper, we present a revised comparison that models the bypass network in sim-alpha consistently with the work of Cruz et al. (henceforth called the reference study). After explaining the bypassing issues in detail, we show that the new comparison between the revised results and the reference study shows similar trends, thus validating the accuracy of the reference study. In addition, we (the authors of the sim-alpha study) would like to state explicitly that we regret any professional harm caused to the authors of the reference study (Cruz, González, Valero, and Topham). The original intent was to show that different simulators modeling similar targets could cause different conclusions to be drawn, thus emphasizing the need for consistency across simulators and research efforts. Since the comparison with the reference study was made with a (modified) validated simulator, and the two sets of results showed a different trend (because of the error in modeling the bypass), it is natural to infer that the reference study was incorrect. We thoughtlessly published the comparison without contacting the authors of the reference study beforehand. We apologize to them both for implying an error in their (correct) methodology and for the discourteous way in which it was presented.
In this paper we explore the possibility of reusing schedules to improve the scalability of numerical codes on shared-memory architectures with non-uniform memory access. The main objective is to implicitly construct affinity links between threads and data accesses and reuse them as much as possible throughout the execution of the application. These links are created through the definition and reuse of iteration schedules, statically defined by the user or dynamically created at run time. The paper does not include a formal proposal of OpenMP extensions, but it includes some experiments showing the usefulness of constructing affinity links in some irregular codes.
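As a rough illustration of the affinity idea (the function name and block policy are ours, not a proposed OpenMP extension): compute an iteration schedule once, then reuse the same object for every loop that sweeps the same data, so each thread keeps touching the pages it first brought close to it:

```python
def block_schedule(n_iters, n_threads):
    # Static block schedule: computed once, then reused by later loops
    # over the same data so thread-data affinity links are preserved.
    chunk = (n_iters + n_threads - 1) // n_threads
    return [list(range(t * chunk, min((t + 1) * chunk, n_iters)))
            for t in range(n_threads)]

schedule = block_schedule(10, 3)   # reuse this object across parallel loops
```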
In this issue, we present a selection of papers from several workshops held in September 2001 in Barcelona, Spain. The workshops were hosted within the PACT (Parallel Architecture and Compilation Techniques) Conference.
The issue logic of a dynamically-scheduled superscalar processor is a complex mechanism devoted to starting the execution of multiple instructions every cycle. Due to its complexity, it is responsible for a significant percentage of the energy consumed by a microprocessor. The energy consumption of the issue logic depends on several architectural parameters, the instruction issue queue size being one of the most important. In this paper we present a technique to reduce the energy consumption of the issue logic of a high-performance superscalar processor. The proposed technique is based on the observation that the conventional issue logic wastes a significant amount of energy on useless activity. In particular, the wake-up of empty entries and of operands that are already ready represents an important source of energy waste. In addition, we propose a mechanism to dynamically reduce the effective size of the instruction queue. We show that on average the effective instruction queue size can be reduced by 26% with minimal impact on performance. This reduction, together with the energy saved for empty and ready entries, results in about a 90.7% reduction in the energy consumed by the wake-up logic, which represents 14.9% of the total energy of the assumed processor.
The register file access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor generations, as they are expected to increase the issue width (which implies more register ports) and the size of the instruction window (which implies more registers), and to use some kind of multithreading. Under this scenario, the register file access time could be a dominant delay and a pipelined implementation would be desirable to allow for high clock rates. However, a multi-stage register file has severe implications for processor performance (e.g. higher branch misprediction penalty) and complexity (more levels of bypass logic). To tackle these two problems, in this paper we propose a register file architecture composed of multiple banks. In particular we focus on a multi-level organization of the register file, which provides low latency and simple bypass logic. We propose several caching policies and prefetching strategies and demonstrate the potential of this multiple-banked organization. For instance, we show that a two-level organization degrades IPC by 10% and 2% with respect to a non-pipelined single-banked register file, for SpecInt95 and SpecFP95 respectively, but it increases performance by 87% and 92% when the register file access time is factored in.
Cache Miss Equations (CME) [GMM97] is a method that accurately describes cache behavior by means of polyhedra. Even though the cost of generating CME is a linear function of the number of references, solving them is a very time-consuming task, so studying a whole program may be infeasible. In this work, we present effective techniques that exploit some properties of the particular polyhedra generated by CME. These techniques reduce the complexity of the algorithm that solves CME, which results in a significant speed-up compared with traditional methods. In particular, the proposed approach does not require computing the vertices of each polyhedron, which has exponential complexity.
The high latency of memory accesses is one of the factors that most limits the performance of current vector supercomputers. Conflicts in the memory modules, together with collisions in the interconnection network in the multiprocessor case, significantly increase the execution time of applications. In this work we propose a memory access method that, for both vector uniprocessors and multiprocessors, performs stream accesses with the smallest possible latency in most cases. The basic idea is to arbitrate the memory access by defining the order in which the memory modules are visited; the stream elements are requested out of order. In addition, the access method also reduces the cost of the interconnection network.
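A toy model of the arbitration idea (the low-order interleaving, the round-robin visiting order, and all names are our assumptions, not the paper's exact scheme): stream elements are grouped by the memory module they map to, and the modules are visited in a fixed order, so elements are requested out of order:

```python
def ordered_stream_access(base, stride, length, n_modules):
    # Group the addresses of a strided stream by memory module
    # (low-order interleaving: module = address mod n_modules).
    pending = {}
    for i in range(length):
        addr = base + i * stride
        pending.setdefault(addr % n_modules, []).append(addr)
    # Visit modules round-robin; at each visit, issue one pending
    # element for that module. The request order follows the module
    # visiting order, not the stream order.
    schedule, module = [], 0
    while any(pending.values()):
        if pending.get(module):
            schedule.append(pending[module].pop(0))
        module = (module + 1) % n_modules
    return schedule
```

For a stride that repeatedly hits the same module, this visiting order spaces out the conflicting requests instead of stalling on back-to-back accesses to one module.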
Valero, M.; Lang, T.; Llaberia, J.; Peiron, M.; Ayguade, E.; Navarro, J. Computer Architecture News, Vol. 20, num. 2, p. 372-381. DOI: 10.1145/146628.140400. Publication date: 1992-05. Journal article.
Address transformation schemes, such as skewing and linear transformations, have been proposed to achieve conflict-free vector access for some strides in vector processors with multi-module memories. In this paper, we extend these schemes to achieve conflict-free access for a larger number of strides. The basic idea is to perform an out-of-order access to vectors of fixed length, equal to that of the vector registers of the processor. Both matched and unmatched memories are considered: we show that the number of strides is even larger in the latter case. The hardware for address calculations and access control is described and shown to be of complexity similar to that required for in-order access.
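A small sketch of the counting argument behind such schemes (the particular skewing function below is illustrative, not the paper's transformation): with out-of-order access, a fixed-length vector access is conflict-free as soon as every element falls in a distinct module, and skewing makes this true for more strides:

```python
def module_of(addr, n_modules, skew=0):
    # Linear skewing (illustrative): rotate the module assignment by
    # `skew` for every row of n_modules consecutive addresses.
    return (addr + skew * (addr // n_modules)) % n_modules

def conflict_free(stride, length, n_modules, skew=0):
    # An out-of-order access of a fixed-length vector is conflict-free
    # when every element falls in a distinct memory module.
    mods = [module_of(i * stride, n_modules, skew) for i in range(length)]
    return len(set(mods)) == len(mods)
```

With 8 modules and vectors of length 8, plain interleaving makes stride 2 conflicting (it touches only the even modules), while a skew of 1 spreads the same stride over all 8 modules.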
In this paper we propose a methodology to adapt Systolic Algorithms to the hardware selected for their implementation. The resulting Systolic Algorithms can be efficiently implemented using Pipelined Functional Units. The methodology is based on two transformation rules, applied to an initial Systolic Algorithm, possibly obtained through one of the design methodologies proposed by other authors. Parameters for these transformations are obtained from the specification of the hardware to be used. The methodology is particularized for the case of one-dimensional Systolic Algorithms with data contraflow.
A methodology to transform dense matrices into band matrices is presented in this paper. The transformation is accomplished by partitioning into triangular blocks, and allows problems of any given size to be solved by means of contraflow systolic arrays, originally proposed by H.T. Kung. Matrix-vector and matrix-matrix multiplications are the operations considered here. The proposed transformations allow optimal utilization of the processing elements (PEs) of the systolic array when dense matrices are processed. Every computation is performed inside the array by using adequate feedback; the feedback delay time depends only on the systolic array size.
Performance issues of a single-bus interconnection network for multiprocessor systems operating in a multiplexed way are presented in this paper. Several models are developed and used for system performance evaluation, and comparisons with equivalent crossbar systems are provided. It is shown how crossbar effective bandwidth (EBW) values can be reached and exceeded when appropriate operating parameters are chosen in a multiplexed single-bus system. A further architectural feature is considered: the use of buffers at the memory modules. With this buffering scheme, memory interference is reduced and system performance improves in practice.
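For intuition only, a simple analytic sketch of the comparison. The crossbar estimate is the classic uniform-random-request approximation; the bus model is a hypothetical bound of ours, not the paper's model:

```python
def crossbar_ebw(p, m):
    # Classic approximation for the effective bandwidth of a crossbar
    # with p processors and m memory modules, each processor issuing a
    # request to a uniformly random module every cycle:
    # EBW = m * (1 - (1 - 1/m)^p) accepted requests per cycle.
    return m * (1.0 - (1.0 - 1.0 / m) ** p)

def multiplexed_bus_ebw(p, m, k):
    # Hypothetical model: a single bus multiplexed k times per memory
    # cycle carries at most k transfers, bounded by the memory-side
    # acceptance rate of the equivalent crossbar.
    return min(float(k), crossbar_ebw(p, m))
```

Under this sketch, once the multiplexing factor k exceeds the crossbar's acceptance rate, the bus matches the crossbar EBW, which is the flavor of result the abstract describes.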