Stephen Felix scite author profile

Tarantula is an aggressive floating point machine targeted at technical, scientific and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor [6,5]. Tarantula adds to the EV8 core a vector unit capable of 32 double-precision flops per cycle. The vector unit fetches data directly from a 16 MByte second level cache with a peak bandwidth of sixty four 64-bit values per cycle. The whole chip is backed by a memory controller capable of delivering over 64 GBytes/s of raw bandwidth. Tarantula extends the Alpha ISA with new vector instructions that operate on new architectural state. Salient features of the architecture and implementation are: (1) it fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, (2) provides high bandwidth for non-unit stride memory accesses, (3) supports gather/scatter instructions efficiently, (4) fully integrates with the EV8 core with a narrow, streamlined interface, rather than acting as a co-processor, (5) can achieve a peak of 104 operations per cycle, and (6) achieves excellent "real-computation" per transistor and per watt ratios. Our detailed simulations show that Tarantula achieves an average speedup of 5X over EV8, out of a peak speedup in terms of flops of 8X. Furthermore, performance on gather/scatter intensive benchmarks such as Radix Sort is also remarkable: a speedup of almost 3X over EV8 and 15 sustained operations per cycle. Several benchmarks exceed 20 operations per cycle.

show abstract

Design tradeoffs for the Alpha EV8 conditional branch predictor

Seznec

Felix

Krishnan³

et al. 2002

SIGARCH Comput. Archit. News

View full text Add to dashboard Cite

This paper presents the Alpha EV8 conditional branch predictor The Alpha EV8 microprocessor project, canceled in June 2001 in a late phase of development, envisioned an aggressive 8-wide issue out-of-order superscalar microarchitecture featuring a very deep pipeline and simultaneous multithreading. Performance of such a processor is highly dependent on the accuracy of its branch predictor and consequently a very large silicon area was devoted to branch prediction on EVS. The Alpha EV8 branch predictor relies on global history and features a total of 352 Kbits.The focus of this paper is on the different trade-offs performed to overcome various implementation constraints for the EV8 branch predictor. One such instance is the pipelining of the predictor on two cycles to facilitate the prediction of up to 16 branches per cycle from any two dynamically successive, 8 instruction fetch blocks. This resulted in the use of three fetch-block oM compressed branch history information for accesing the predictor. Implementation constraints also restricted the composition of the index functions for the predictor and forced the usage of only sing&-ported memory cells.Nevertheless, we show that the Alpha EV8 branch predictor achieves prediction accuracy in the same range as the state-of-the-art academic global history branch predictors that do not consider implementation constraints in great detail I IntroductionThe Alpha EV8 microprocessor [2] features a 8-wide superscalar deeply pipelined microarchitecture. With minimum branch misprediction penalty of 14 cycles, the performance of this microprocessor is very dependent on the branch prediction accuracy. The architecture and technology of the Alpha EV8 are very aggressive and new challenges were confronted in the design of the branch predictor. This paper presents the Alpha EV8 branch predictor in great detail. The paper expounds on different constraints that were * This work was done while the authors were with Compaq during 1999 faced during the definition of the predictor, and on various trade-offs performed that lead to the final design. In particular, we elucidate on the following: (a) use of a global history branch prediction scheme, (b) choice of the prediction scheme derived from the hybrid skewed branch predictor 2Bc-gskew[ 19], (c) redefinition of the information vector used for indexing the predictor that combines compressed branch history and path history, (d) different prediction and hysteresis table sizes: prediction tables and hysteresis tables are accessed at different pipeline stages, and hence can be implemented as physically distinct tables, (e) variable history lengths: the four logical tables in the EV8 predictor are accessed using four different history lengths, (f) guaranteeing conflict free access to the bank-interleaved predictor with single-ported memory cells for up to 16 branch predictions from any two 8-instruction dynamically succesive fetch blocks, and (g) careful definition of index functions for the predictor tables.This work demonstrates that in ...

show abstract

Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading

Preston¹,

Badeau²,

Bailey³

et al.

View full text Add to dashboard Cite

Tarantula

Espasa

Ardanaz

Emer³

et al. 2002

SIGARCH Comput. Archit. News

View full text Add to dashboard Cite

Tarantula is an aggressive floating point machine targeted at technical scientific and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor [6,5]. Tarantula adds to the EV8 core a vector unit capable of 32 double-precision flops per cycle. The vector unit fetches data directly from a 16 MByte second level cache with a peak bandwidth of sixty four 64-bit values per cycle. The whole chip is backed by a memory controller capable of delivering over 64 GBytes/s of raw bandwidth. Tarantula extends the Alpha 1SA with new vector instructions that operate on new architectural state. Salient features of the architecture and implementation are: (1) it fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol (2) provides high bandwidth for non-unit stride memory accesses, (3) supports gather/scatter instructions e3ficiently, (4)fully integrates with the EV8 core with a narrow, streamlined interface, rather than acting as a co-processor, (5) can achieve a peak of 104 operations per cycle, and (6) achieves excellent "real-computation "per transistor and per watt ratios. Our detailed simulations show that Tarantula achieves an average speedup of 5X over EV8, out of a peak speedup in terms of flops of SX. Furthermore, performance on gather/scatter intensive benchmarks such as Radix Sort is also remarkable: a speedup of almost 3X over EV8 and 15 sustained operations per cycle. Several benchmarks exceed 20 operations per cycle.

show abstract

A Fast Hardware Pseudorandom Number Generator Based on xoroshiro128

Hanlon¹,

Felix²

2022

Preprint

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Stephen Felix

Design tradeoffs for the alpha EV8 conditional branch predictor

Tarantula: a vector extension to the alpha architecture

Design tradeoffs for the Alpha EV8 conditional branch predictor

Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading

Tarantula

A Fast Hardware Pseudorandom Number Generator Based on xoroshiro128

Contact Info

Product

Resources

About