Hypervisor-based fault tolerance

Bressoud, Thomas; Schneider, Fred B.

doi:10.1145/224056.224058

Cited by 191 publications

(80 citation statements)

References 21 publications


“…To effectively utilize the array bandwidth, the server interleaves (i.e., stripes) each media stream among disks in the array. The unit of data interleaving, referred to as a media block or a stripe unit, denotes the maximum amount of logically contiguous data 3 Several techniques have been proposed which scramble media streams prior to network transmission to enable approximate reconstruction in case of packet losses [10,26]. The efficacy of these techniques validates our claim.…”

Section: Parity-based Reconstruction (supporting)

confidence: 65%

“…2 -Since the cause of the data loss is irrelevant to the recovery algorithm, the unscrambling algorithms in LRJ and LRM can be adapted to mask packet losses due to network congestion as well. 3 Thus, the IRAD architecture provides an integrated, scalable, end-to-end solution for failure recovery. In case of a disk failure, a redundant array must (1) perform online reconstruction, and thereby provide uninterrupted service to user requests, and (2) rebuild the failed disk onto a spare disk, so that the array can revert back to the normal operating mode.…”

Section: IRAD (mentioning)

confidence: 99%

“…A truly fault-tolerant design will support redundancies in all the key components of the server, including the CPU, memory, I/O, and network subsystems, as well as the system software. In this paper, we confine our focus to fault-tolerant designs for the I/O subsystem, and assume that existing fault-tolerant techniques will be used for other subsystems (for example, see [3,13,28]). …”

Section: Introduction (mentioning)

confidence: 99%


Failure recovery algorithms for multimedia servers

Shenoy

Vin

2000

Multimedia Systems


In this paper, we present two novel disk failure recovery methods that utilize the inherent characteristics of video streams for efficient recovery. Whereas the first method exploits the inherent redundancy in video streams (rather than error-correcting codes) to approximately reconstruct data stored on failed disks, the second method exploits the sequentiality of video playback to reduce the overhead of online failure recovery in conventional RAID arrays. For the former approach, we present loss-resilient versions of JPEG and MPEG compression algorithms. We present an inherently redundant array of disks (IRAD) architecture that combines these loss-resilient compression algorithms with techniques for efficient placement of video streams on disk arrays to ensure that on-the-fly recovery does not impose any additional load on the array. Together, they enhance the scalability of multimedia servers by (1) integrating the recovery process with the decompression of video streams, and thereby distributing the reconstruction process across the clients; and (2) supporting graceful degradation in the quality of recovered images with increase in the number of disk failures. We present analytical and experimental results to show that both schemes significantly reduce the failure recovery overhead in a multimedia server.


“…When phase 1 completes, both processors write their data back to the directory (7), (8). At the directory P1 is ordered after P0, so P0's writeback applied (9), while the writeback from P1 is nacked (10). P1 retries the writeback (11), which is then accepted by the directory (12).…”

Section: Mist Complexity (mentioning)

confidence: 99%

“…They have shown that addressing the problem has the potential to (1) increase software reliability by enhancing software test coverage before release [43], (2) increase system reliability through replication based fault tolerance [9], (3) aid in multithreaded software engineering [42], and (4) enhance security by providing a tool to analyze an attack [13]. Many of these prior proposals either rely on the ability to replay a previously recorded execution [14,20,27,28,31,41,42], incur a performance overhead that is likely too high for always-on usage [5], require complex speculative hardware [11], or only guarantee determinism in well behaved programs [32].…”

Section: Introduction (mentioning)

confidence: 99%

Calvin: Deterministic or not? Free will to choose

Hower

Dudnik

Hill

et al. 2011

2011 IEEE 17th International Symposium on High Performance Computer Architecture


Pragmatic source code reuse via execution record and replay

Armaly

McMillan

2016

J Software Evolu Process


A key problem during copy-paste source code reuse is that, to reuse even a small section of code from a program as opposed to an API, a programmer must include a huge amount of additional source code from elsewhere in the same program. This additional code is notoriously large and complex, and portions can only be identified at runtime. In this paper, we propose execution record/replay as a solution to this problem. We describe a novel reuse technique that allows programmers to reuse functions from a C or C++ program, by recording the execution of the program and selectively modifying how its functions are replayed. We have implemented our technique and evaluated it in an empirical study in which eight programmers used our tool to complete four tasks over four hours each. The participants found our technique to be easier than manually reusing the code as part of their project. We also found that the resulting code was smaller and less complex than it would have been had the participants manually reused the code.
