Carrera, D.; Beltran, V.; Torres, J.; Ayguade, E. International Journal of High Performance Computing and Networking, vol. 5, no. 5/6, pp. 323-330. DOI: 10.1504/IJHPCN.2008.025551. Date of publication: 2008-10. Journal article.
Monreal, T.; Viñals, V.; Gonzalez, A.; Valero, M. International Journal of High Performance Computing and Networking, vol. 3, no. 2/3, pp. 83-94. DOI: 10.1504/IJHPCN.2005.008029. Date of publication: 2005. Journal article.
Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register-renaming schemes, register releasing is conservatively done only after the instruction that redefines the same register is committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early releasing hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.
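The early-release idea above can be illustrated with a minimal sketch. All names here are hypothetical and the logic is a simplification, not the paper's hardware mechanism: it contrasts the conventional scheme, where a physical register is freed only when the redefining instruction commits, with releasing it as soon as it is known to have no remaining readers.

```python
# Hypothetical sketch (illustrative names, not the paper's implementation)
# of physical-register release in a register-renaming scheme.

class RenameTable:
    """Minimal rename map: architectural register -> physical register."""

    def __init__(self, num_phys):
        self.free = list(range(num_phys))  # free physical registers
        self.map = {}                      # arch reg -> current phys reg

    def rename_dest(self, arch):
        """Allocate a new physical register; return it and the one it redefines."""
        new_phys = self.free.pop(0)
        old_phys = self.map.get(arch)
        self.map[arch] = new_phys
        return new_phys, old_phys

    def release(self, phys):
        """Return a physical register to the free list."""
        if phys is not None:
            self.free.append(phys)

rt = RenameTable(num_phys=8)
p1, _ = rt.rename_dest("r1")      # first definition of r1
p2, old = rt.rename_dest("r1")    # r1 redefined; `old` is p1

# Conventional scheme: p1 is released only when the redefining instruction
# commits. Early release frees it as soon as the processor knows p1 has no
# further readers, shrinking the effective register-file pressure.
rt.release(old)
assert old == p1 and p1 in rt.free
```

The point of the sketch is only the timing of `release`: moving it earlier than commit is what lets a smaller register file sustain the same number of in-flight instructions.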
Falcon, A.; Santana, O.; Ramirez, A.; Valero, M. International Journal of High Performance Computing and Networking, vol. 2, no. 1, pp. 11-21. DOI: 10.1504/IJHPCN.2004.009264. Date of publication: 2004-04. Journal article.
Executing multiple threads has proved to be an effective solution to partially hide latencies that appear in a processor. When a thread is stalled because a long-latency operation, such as a memory access or a floating-point calculation, is being processed, the processor can switch to another context so that another thread can take advantage of the idle resources. However, fetch stall conditions caused by branch predictor delay are not hidden by current simultaneous multithreading (SMT) fetch designs, causing a performance drop due to the absence of instructions to execute. In this paper, we propose several solutions to reduce the effect of branch predictor delay on the performance of SMT processors. Firstly, we analyse the impact of varying the number of access ports. Secondly, we describe a decoupled implementation of an SMT fetch unit that helps to tolerate the predictor delay. Finally, we present an interthread pipelined branch predictor, based on creating a pipeline of interleaved predictions from different threads. Our results show that, combining all the proposed techniques, the performance obtained is similar to that obtained using an ideal, 1-cycle access branch predictor.
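The interthread pipelining idea can be sketched with a toy latency model. Everything here is an assumption for illustration (the predictor latency, the round-robin interleaving, the single-queue pipeline model), not the paper's design: the point is that when requests from different threads are fed into a multi-cycle predictor every cycle, the predictor still delivers one prediction per cycle in steady state.

```python
# Toy model (hypothetical, not the paper's mechanism) of an interthread
# pipelined branch predictor: per-cycle requests from different threads
# are interleaved into a predictor with multi-cycle latency.

PREDICTOR_LATENCY = 3  # assumed cycles to produce one prediction

def simulate(num_threads, cycles):
    """Round-robin interleave predictor accesses from different threads."""
    pipeline = []   # in-flight (thread, start_cycle) requests, oldest first
    completed = []
    for cycle in range(cycles):
        # The oldest request finishes once its latency has elapsed.
        if pipeline and cycle - pipeline[0][1] >= PREDICTOR_LATENCY:
            completed.append(pipeline.pop(0))
        # A different thread injects a new request each cycle.
        pipeline.append((cycle % num_threads, cycle))
    return completed

done = simulate(num_threads=4, cycles=20)
# After the initial pipeline fill, one prediction completes per cycle
# despite the 3-cycle predictor latency: 20 - 3 = 17 completions.
assert len(done) == 17
```

A single thread accessing the same predictor alone would see a new prediction only every `PREDICTOR_LATENCY` cycles; interleaving threads hides that delay, which is the intuition behind the abstract's claim of near-ideal 1-cycle behaviour.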
Cazorla, F. J.; Ramirez, A.; Valero, M.; Fernández, E. International Journal of High Performance Computing and Networking, vol. 2, no. 1, pp. 45-54. DOI: 10.1504/IJHPCN.2004.009267. Date of publication: 2004-02. Journal article.
Simultaneous multithreading (SMT) processors fetch instructions from several threads, increasing the available instruction level parallelism of each thread exposed to the processor. In an SMT, the fetch engine decides which threads enter the processor and have priority in using resources. Hence, the fetch engine determines how shared resources are allocated, playing a key role in the final performance of the machine. When a thread experiences an L2 cache miss, critical resources can be monopolised for a long time, throttling the execution of the remaining threads. Several approaches have been proposed to cope with this problem. The first contribution of this paper is the evaluation and comparison of the three best published policies addressing the long latency load problem. The second and main contribution of this paper is a set of improved versions of these three policies. Our results show that the improved versions significantly outperform the original ones in both throughput and fairness.
Cristal, A.; Llosa, J.; Valero, M.; Ortega, D. International Journal of High Performance Computing and Networking, vol. 2, no. 1, pp. 1-10. DOI: 10.1504/IJHPCN.2004.009263. Date of publication: 2004. Journal article.
Memory speed is growing more slowly than processor speed. This means that processors must spend more and more time waiting for data to arrive from memory. One of the most effective techniques to deal with this effect is to increase the number of in-flight instructions in the processor, thus allowing more instruction-level parallelism to be exploited while instructions that miss in the cache are serviced. With expected latencies of 500 and 1000 cycles, the number of in-flight instructions needed to sustain performance will have to increase dramatically, and therefore the microarchitectural structures that depend linearly on this parameter, such as the reorder buffer, the number of registers, and the instruction queues, will have to be re-architected to allow such an increased number of in-flight instructions. In this paper, we present several techniques that address the problems caused by thousands of in-flight instructions.
Ramirez, M.; Cristal, A.; Valero, M.; Veidenbaum, A.; Villa, L. International Journal of High Performance Computing and Networking, vol. 1, no. 4, pp. 153-161. DOI: 10.1504/IJHPCN.2004.008344. Date of publication: 2004. Journal article.
Instruction wakeup logic consumes a large amount of energy in out-of-order processors. Existing solutions to the problem require prediction or additional hardware complexity to reduce the energy consumption and, in some cases, may have a negative impact on processor performance. This paper proposes a new mechanism for instruction wakeup, which uses a partitioned instruction queue (IQ). The energy consumption of an IQ partition (block) is proportional to the number of entries in it. All the blocks are turned off until the mechanism determines which blocks to access on wakeup using a simple successor tracking mechanism. The proposed approach is shown to require as little as 1.5 comparisons per committed instruction for SPEC2000 benchmarks. The energy consumption and timing of the partitioned IQ design are evaluated using CACTI 3 models for a 0.07 µm process. The average energy savings observed were 85% and 92%, respectively, for 64-entry integer and floating-point partitioned IQs.
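The partitioned-IQ wakeup idea can be sketched as follows. The names, block size, and bookkeeping structure here are all hypothetical simplifications, not the paper's hardware: the point is that a producer records which IQ blocks hold its consumers, so on wakeup only those blocks are powered up and searched rather than the whole queue.

```python
# Illustrative sketch (hypothetical names and structure, not the paper's
# design) of wakeup with a partitioned instruction queue: only the blocks
# that contain consumers of a completing producer are accessed.

BLOCK_SIZE = 4  # assumed entries per IQ partition (block)

def block_of(entry):
    """IQ block (partition) that a queue entry belongs to."""
    return entry // BLOCK_SIZE

class PartitionedIQ:
    def __init__(self, num_entries):
        self.num_blocks = num_entries // BLOCK_SIZE
        self.successor_blocks = {}  # producer tag -> blocks holding consumers

    def insert(self, entry, source_tags):
        """Dispatch: record, per source operand, which block the consumer is in."""
        for tag in source_tags:
            self.successor_blocks.setdefault(tag, set()).add(block_of(entry))

    def wakeup(self, tag):
        """Producer completes: return only the blocks that must be searched."""
        return self.successor_blocks.pop(tag, set())

iq = PartitionedIQ(num_entries=64)
iq.insert(entry=5, source_tags=["p7"])   # a consumer of p7 sits in block 1
iq.insert(entry=42, source_tags=["p7"])  # another consumer, block 10
woken = iq.wakeup("p7")
assert woken == {1, 10}  # only 2 of the 16 blocks are accessed
```

Since a block's energy cost scales with the entries it contains, touching two blocks instead of sixteen is what yields the large savings the abstract reports.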
Pericas, M.; Ayguade, E.; Zalamea, F.; Llosa, J.; Valero, M. International Journal of High Performance Computing and Networking, vol. 1, no. 4, pp. 171-179. DOI: 10.1504/IJHPCN.2004.008346. Date of publication: 2004. Journal article.
Issue logic is among the worst scaling structures in a modern microprocessor: increasing the issue width increases processor area exponentially. Bigger processors will have inherently larger wire delays. In this scenario, technology scaling will yield smaller performance improvements, as the wire delays do not decrease; instead, they start to dominate the clock cycle. In order to offer higher performance, the wire problem needs to be tackled. This paper discusses two methods that attempt to move the wire problem out of the critical path. The first method is the clustering technique, which directly approaches the wire problem by combining several smaller execution cores in the processor backend to perform the computations. Each core has a smaller issue width and a much smaller area. The second technique we study is the widening technique, which consists of reducing the issue width of the processor but giving the instructions SIMD capabilities. The parallelism here is small (normally two to four) and does not resemble multimedia or vector extensions. Wide processors use wide functional units that compute the same operation on multiple words. The rationale behind this idea is that by reducing the issue width (but not the computational bandwidth), we also reduce the issue logic circuitry and the complexity of structures such as the register file and the cache memory. When compared with a centralised core with 128 registers, 8 FPUs and 4 memory ports, our approach, using an equivalent amount of hardware units, achieves speedups of up to 1.7.
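The widening technique can be illustrated with a toy example. The function name and the degree of widening are assumptions for illustration, not taken from the paper: the point is that halving the issue width while making each functional unit operate on packed pairs of words keeps the computational bandwidth constant.

```python
# Toy sketch (hypothetical, not the paper's implementation) of the widening
# technique: fewer issue slots, but each wide functional unit applies the
# same operation to WIDTH packed words per slot.

WIDTH = 2  # assumed degree of widening (the abstract mentions two to four)

def wide_add(a_words, b_words):
    """One wide FU slot: the same operation applied to WIDTH packed words."""
    assert len(a_words) == len(b_words) == WIDTH
    return [a + b for a, b in zip(a_words, b_words)]

# A 4-issue narrow machine would need 4 issue slots for these four adds;
# a 2-issue wide machine does the same work in 2 wide slots, with half
# the issue logic, register-file ports, and cache ports to manage.
ops = [([1, 2], [10, 20]), ([3, 4], [30, 40])]
results = [wide_add(a, b) for a, b in ops]
assert results == [[11, 22], [33, 44]]
```

Unlike multimedia or vector extensions, the packing here is an issue-logic optimisation: the goal is to shrink the structures that scale badly with issue width, not to expose new data-parallel instructions to the programmer.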