Proceedings of the 28th Annual International Symposium on Computer Architecture - ISCA '01 2001
DOI: 10.1145/379240.379250
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

Abstract: Hardly predictable data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures become increasingly popular, one attractive approach is to use idle threads on these machines to perform pre-execution--essentially a combined act of speculative address generation and prefetching--to accelerate the main thread. In this paper, we propose such a p…

Cited by 192 publications (62 citation statements)
References 42 publications
“…Researchers have proposed different techniques to hide the long memory access latency, such as software prefetching [4,5,6,7] and hardware prefetching [8,9,10,11,12]. In software prefetching, the compiler inserts prefetch instructions that bring the indicated block of data into on-chip memory.…”
Section: Related Work
confidence: 99%
“…Since each Panther processor encapsulates two identical cores, helper threads can run on the often under-utilized secondary core to prefetch data into the shared cache so as to improve the main thread performance. Unlike other helper threading schemes discussed for SMT [4], [7], [10], [13], [15], [29], thread communication for Panther has to be conducted through the L2 cache since the L2 cache is the closest level of shared caching. On the other hand, construction of helper threads on Panther is less constrained by resource contention as each core is a complete processing unit with its own functional units, TLB, L1 caches and register files; execution of the helper thread would have less negative impact on the main thread performance.…”
Section: Helper Thread Model For CMP
confidence: 99%
“…When the initial load arrives, the processor resumes execution from the checkpointed state. In software pre-execution (also referred to as helper threads or software scouting) [2], [4], [7], [10], [14], [24], [29], [35], a distilled version of the forward slice starting from the missing load is executed, minimizing the utilization of execution resources. Helper threads utilizing run-time compilation techniques may also be effectively deployed on processors that do not have the necessary hardware support for hardware scouting (such as checkpointing and resuming regular execution).…”
Section: Introduction
confidence: 99%
“…Pre-execution code can be constructed statically [8,15,13] or dynamically [7,23]. Pre-computation code typically runs in a spare hardware context [7,23] or a dedicated hardware engine [17,1,21], in parallel with the main thread.…”
Section: Prefetching Via Pre-execution Threads
confidence: 99%
“…This decreases the observed latency, increases memory level parallelism, and allows cache-hit dominated performance even when the working set is larger than the cache. Software based prefetching [4,18,15,31,14,6,11,20] has been shown to be a promising technique to address this issue, and all modern high-performance instruction set architectures provide support for software prefetching.…”
Section: Introduction
confidence: 99%