2006
DOI: 10.1109/tc.2006.4

Beating in-order stalls with "flea-flicker" two-pass pipelining

Year Published (citing works): 2006–2023


Citations: cited by 23 publications (25 citation statements)
References: 27 publications
Citation statements: 0 supporting, 25 mentioning, 0 contrasting

“…If we reduce EDA to a purely performance enhancing mechanism, it resembles a class of techniques represented by decoupled access/execute architecture [6], Slip-stream [7], [20], dual-core execution (DCE) [9], Fleaflicker [21], Tandem [18], and Paceline [19]. Among these, we compare to DCE as architecturally, it is perhaps the most closely related design: both try to avoid long-latency cache miss-induced stalls to improve performance.…”
Section: 01
Mentioning, confidence: 99%
“…In Section 5.3, we contrasted our approach with a class of designs using two passes to process a thread, including Slip-stream [7], [20], dual-core execution [9], Flea-flicker [21], Tandem [18], and Paceline [19]. Another class of related work is helper-threading (also called speculative precomputation) (e.g., [24]- [32]).…”
Section: Related Work
Mentioning, confidence: 99%
“…The leader core runs a shorter version based on the removal of ineffectual instructions while the checker core runs the unmodified program. Lastly, Flea-Flicker two pass pipelining [4] allows the leader core to return an invalid value on long-latency operations and proceed. In most of these schemes, the checker core takes advantage of program execution on the leader core by receiving preprocessed instruction streams, resolved branches, and L2 cache prefetches.…”
Section: Challenges In Coupling With a Faulty Core
Mentioning, confidence: 99%
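For readers unfamiliar with the two-pass idea this excerpt refers to, the sketch below is a hypothetical Python model (invented names and data structures, not the paper's hardware design): an advance pass writes a poison token instead of stalling on a load miss and keeps issuing in order, and a backup pass later re-executes only the deferred, poison-dependent instructions once the data has arrived.

```python
# Hypothetical illustration of two-pass ("flea-flicker") pipelining.
# Not the paper's implementation; a toy in-order model where the advance
# pass never stalls on a cache miss and the backup pass repairs results.

INVALID = object()  # poison token for miss-dependent values


class ToyCache:
    """Trivially models hits/misses; a prefetch makes the next access hit."""
    def __init__(self, resident):
        self.resident = set(resident)

    def hit(self, addr):
        return addr in self.resident

    def prefetch(self, addr):
        self.resident.add(addr)          # miss data arrives "later"

    def read(self, addr):
        return addr * 10                 # stand-in for memory contents


def alu(op, vals):
    return sum(vals) if op == "add" else vals[0]


def advance_pass(program, regs, cache):
    """First pass: on a load miss, write INVALID and keep going in order."""
    deferred = []
    for op, dst, srcs in program:
        if op == "load":
            addr = srcs[0]
            if cache.hit(addr):
                regs[dst] = cache.read(addr)
            else:
                regs[dst] = INVALID      # do not stall the pipeline
                cache.prefetch(addr)     # the miss still starts early
                deferred.append((op, dst, srcs))
        elif any(regs.get(s) is INVALID for s in srcs):
            regs[dst] = INVALID          # poison propagates to dependents
            deferred.append((op, dst, srcs))
        else:
            regs[dst] = alu(op, [regs[s] for s in srcs])
    return deferred


def backup_pass(deferred, regs, cache):
    """Second pass: re-execute only the instructions that saw poison."""
    for op, dst, srcs in deferred:
        if op == "load":
            regs[dst] = cache.read(srcs[0])   # data has arrived by now
        else:
            regs[dst] = alu(op, [regs[s] for s in srcs])


# Usage: r2 depends on a missing load, so both go to the backup pass,
# while the independent add to r3 completes during the advance pass.
program = [("load", "r1", [100]), ("add", "r2", ["r1", "r1"]),
           ("add", "r3", ["r0", "r0"])]
regs = {"r0": 7}
cache = ToyCache(resident=[])
deferred = advance_pass(program, regs, cache)
backup_pass(deferred, regs, cache)
print(regs)   # {'r0': 7, 'r1': 1000, 'r2': 2000, 'r3': 14}
```

The point of the toy usage is that the independent add to r3 retires in the advance pass instead of waiting behind the miss, which is exactly the in-order stall the two-pass scheme avoids.
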
“…As the memory wall problem has come to overshadow other aspects of processing, various forms of runahead execution have been proposed [21], [12], [7], [3], [4]. Runahead execution attempts to reduce the effect of the long memory latencies by increasing the memory-level parallelism.…”
Section: Introduction
Mentioning, confidence: 99%
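To make the memory-level-parallelism argument in this excerpt concrete, here is a hedged sketch (a hypothetical cost model, not any specific proposal's implementation): on a long-latency miss the core effectively checkpoints, scans ahead only to discover further misses and start them as prefetches, then discards the speculative work, so independent misses overlap instead of serializing.

```python
# Hypothetical sketch of runahead execution: speculative work past a miss is
# discarded, but the extra misses it exposes are issued as prefetches and
# overlap with the original miss (more memory-level parallelism).
# Names, latencies, and structure are illustrative only.

MISS_LATENCY = 200   # assumed cycles for a memory access

def run(program, memory_hits, runahead=True):
    """Estimate cycles for a stream of ('load', addr) / ('alu', None) ops."""
    cycles = 0
    outstanding = set()          # addresses already being fetched
    for i, (kind, addr) in enumerate(program):
        if kind == "load" and addr not in memory_hits and addr not in outstanding:
            outstanding.add(addr)
            if runahead:
                # Checkpoint here (not modeled), then keep scanning ahead
                # purely to find more misses and start them early.
                for kind2, addr2 in program[i + 1:]:
                    if kind2 == "load" and addr2 not in memory_hits:
                        outstanding.add(addr2)        # prefetch under the miss
                cycles += MISS_LATENCY                # one stall covers them all
                memory_hits |= outstanding            # everything has arrived
            else:
                cycles += MISS_LATENCY                # stall for this miss alone
                memory_hits.add(addr)
        else:
            cycles += 1                               # hit or ALU op: 1 cycle
    return cycles

# Usage: two independent misses. Without runahead they serialize;
# with runahead the second miss is prefetched under the first.
prog = [("load", 0xA0), ("alu", None), ("load", 0xB0), ("alu", None)]
print(run(list(prog), set(), runahead=False))  # 402 cycles
print(run(list(prog), set(), runahead=True))   # 203 cycles
```

Under the assumed 200-cycle miss latency, the two runs show the serialized versus overlapped behavior the excerpt describes: roughly one miss latency saved for every additional independent miss uncovered during runahead.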