Truss: A Reliable, Scalable Server Architecture

Gold, Brian T.; Kim, J.; Smolens, Jared C.; Chung, Eric S.; Liaskovitis, Vasileios; Nurvitadhi, Eriko; Falsafi, Babak; Hoe, James C.; Nowatzyk, Andreas

doi:10.1109/mm.2005.122

Cited by 18 publications

(19 citation statements)

References 16 publications


“…TRUSS introduces a distributed shared memory architecture with no single point of failure [10]. To avoid common-mode failure, redundant operations are carried out by cores on different chips; however, this leads to performance losses due to long delays waiting for data to be checked.…”

Section: Related Work (mentioning)

confidence: 99%

Cost-effective safety and fault localization using distributed temporal redundancy

Meyer, Calhoun, Lach et al. 2011

Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems


Cost pressure is driving vendors of safety-critical systems to integrate previously distributed systems. One natural approach we have previously introduced is On-Demand Redundancy (ODR), which allows safety-critical and non-critical tasks, traditionally isolated to limit interference, to execute on shared resources. Our prior work has shown that relaxed dedication (RD), an ODR strategy that allows non-critical tasks (NCTs) to execute on idle critical task resources (CTRs), significantly increases NCT throughput. Unfortunately, there are circumstances under which, in spite of this opportunity, it is difficult to effectively schedule NCTs. In this paper, we introduce distributed temporal redundancy (DTR), which allows critical tasks, which traditionally execute in lockstep, to execute asynchronously. In doing so, DTR increases scheduling flexibility, resulting in systems that achieve much closer to the optimal NCT throughput than with relaxed dedication alone; in one set of experiments, DTR schedules no less than 93% of the theoretical NCT cycles across a variety of synthetic benchmarks, outperforming RD by over 11%, on average. Furthermore, by distributing all redundant tasks across different resources, triple-modular redundancy, and therefore fault localization, can be achieved. We demonstrate that this can be accomplished with little additional cost and complexity: in practice, relatively few DTR tasks are in flight simultaneously, limiting the additional buffering needed to support DTR.


“…Hardware and software combined approaches [24], [25], [29], [26], [30], [27] use the parallel processing capacity of chip multiprocessors (CMPs) and redundant multi threading to detect and recover the problem. Mohamed et al [62] shows Chip Level Redundantly Threaded Multiprocessor with Recovery (CRTR), where the basic idea is to run each program twice, as two identical threads, on a simultaneous multithreaded processor.…”

Section: Related Work (mentioning)

confidence: 99%

An Efficient Approach towards Mitigating Soft Errors Risks

Khan, Uddin, Jürjens 2011

SIPIJ


“…We propose to decouple error checking from the DSM coherence protocol. Unlike our previous design [7], decoupled checking requires no modification to the existing coherence controller. Although the checking latency increases slightly with a decoupled design, the effectiveness of the checking filter is sufficiently high that overall performance overhead not affected.…”

Section: Introduction (mentioning)

confidence: 99%

“…Our prior proposal [7] for a chip-level redundant DSM suffers from unacceptable performance overhead, particularly in commercial workloads, due to frequent, long-latency checking on the critical path of execution. Previously proposed mechanisms [7] that reduce the latency of checking fail to provide adequate improvement and require impractical changes to the cache coherence protocol and its implementation.…”

Section: Introduction (mentioning)

confidence: 99%


Chip-Level Redundancy in Distributed Shared-Memory Multiprocessors

Gold, Falsafi, Hoe 2009

2009 15th IEEE Pacific Rim International Symposium on Dependable Computing


Abstract: Distributed shared-memory (DSM) multiprocessors provide a scalable hardware platform, but lack the necessary redundancy for mainframe-level reliability and availability. Chip-level redundancy in a DSM server faces a key challenge: the increased latency to check results among redundant components. To address performance overheads, we propose a checking filter that reduces the number of checking operations impeding the critical path of execution. Furthermore, we propose to decouple checking operations from the coherence protocol, which simplifies the implementation and permits reuse of existing coherence controller hardware. Our simulation results of commercial workloads indicate average performance overhead is within 4% (9% maximum) of tightly coupled DMR solutions.
