Tarantula is an aggressive floating-point machine targeted at technical, scientific, and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor [6,5]. Tarantula adds to the EV8 core a vector unit capable of 32 double-precision flops per cycle. The vector unit fetches data directly from a 16 MByte second-level cache with a peak bandwidth of sixty-four 64-bit values per cycle. The whole chip is backed by a memory controller capable of delivering over 64 GBytes/s of raw bandwidth. Tarantula extends the Alpha ISA with new vector instructions that operate on new architectural state. Salient features of the architecture and implementation are: (1) it fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, (2) it provides high bandwidth for non-unit-stride memory accesses, (3) it supports gather/scatter instructions efficiently, (4) it fully integrates with the EV8 core through a narrow, streamlined interface, rather than acting as a co-processor, (5) it can achieve a peak of 104 operations per cycle, and (6) it achieves excellent "real-computation" per-transistor and per-watt ratios. Our detailed simulations show that Tarantula achieves an average speedup of 5X over EV8, out of a peak flops speedup of 8X. Furthermore, performance on gather/scatter-intensive benchmarks such as Radix Sort is also remarkable: a speedup of almost 3X over EV8 and 15 sustained operations per cycle. Several benchmarks exceed 20 operations per cycle.
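Gather/scatter instructions, highlighted in the abstract as a strength of Tarantula, perform indexed loads and stores. The scalar loops below are a minimal sketch of their semantics only; the function names and data are illustrative and not taken from the paper, and a vector unit would issue each loop as a single instruction:

```c
#include <stddef.h>

/* Gather: load elements from arbitrary indices of src into a dense
   vector dst. A vector unit executes this as one instruction; the
   loop merely illustrates the semantics. */
void gather(const double *src, const size_t *idx, double *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}

/* Scatter: store elements of a dense vector src to arbitrary
   indices of dst. */
void scatter(double *dst, const size_t *idx, const double *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[idx[i]] = src[i];
}
```

Radix Sort, cited in the abstract as a gather/scatter-intensive benchmark, permutes keys with exactly this kind of indexed store.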
The focus of this paper is on adding a vector unit to a superscalar core as a way to scale current state-of-the-art superscalar processors. The proposed architecture has a vector register file that shares functional units with both the integer datapath and the floating-point datapath. A key point in our proposal is the design of a high-performance cache interface that delivers high bandwidth to the vector unit at low cost and low latency. We propose a double-banked cache with alignment circuitry to serve vector accesses, and we study two cache hierarchies: one feeds the vector unit from the L1; the other from the L2. Our results show that large IPC values (higher than 10 in some cases) can be achieved. Moreover, scaling our architecture simply requires adding functional units, without requiring more issue bandwidth. As a consequence, the proposed vector unit achieves high performance for numerical and multimedia codes with minimal impact on the cycle time of the processor or on the performance of integer codes.
The goal of this study is twofold: to analyze in detail the nature of conditional branch mispredictions in correlation-based branch predictors and, based on this analysis, to reduce the impact of branch mispredictions on processor performance by decreasing the branch resolution delay instead of improving the branch prediction accuracy. We classify the conditional branches with the highest numbers of mispredictions according to the nature of their branch-condition analytical expressions. Based on these expressions, we can analyze, and in many cases precisely explain, the origin of mispredictions. Moreover, we find that many such branches belong to small sets of blocks inside loops, and within such sets some of the branch expressions have regularity properties. We show how to exploit this regularity by anticipating the branch outcome, where anticipation is a combination of value prediction and normal dataflow execution. We investigate a hardware mechanism to implement the concept of branch outcome anticipation. This mechanism relies on the separate execution of the normal program flow and a branch flow, which is a subset of the program flow consisting of copies of the instructions needed to compute branch outcomes. The branch flow uses the regularity properties of branch-condition expressions to get ahead of the normal program flow whenever possible. Currently, the mechanism can only target a subset of the conditional branches, but for these branches we experimentally show that the anticipation mechanism reduces the average branch misprediction latency by 60%.
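As a software analogy for outcome anticipation (not the paper's hardware design): when a branch condition is a regular function of a loop induction variable, its outcomes can be computed ahead of the normal program flow. The condition used below (`i % 4 == 0`) is a made-up example of such a regular expression:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative "branch flow": for a condition that depends only on
   the induction variable, outcomes can be precomputed before the
   normal flow reaches each branch. The paper's mechanism does this
   in hardware with copies of the condition-computing instructions;
   this loop is a software sketch of the idea only. */
void anticipate_outcomes(bool *taken, size_t n) {
    for (size_t i = 0; i < n; i++)
        taken[i] = (i % 4 == 0);   /* hypothetical regular condition */
}
```

The normal flow would then consume `taken[i]` instead of waiting for the condition's operands, which is how running ahead shrinks the branch resolution delay.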