2015 IEEE Symposium on Service-Oriented System Engineering
DOI: 10.1109/sose.2015.18

JENERGY: A Fault Tolerant Stateless Architecture for High Performance Computing

Abstract: Large-scale HPC (high performance computing) applications require thousands of nodes to run parallel scientific computations. At this scale, compute nodes frequently experience hardware and software failures, network congestion, and disconnections. This introduces high levels of volatility, which reduces the Mean Time Between Failures (MTBF) of the whole system to hours or minutes. To deal with such failure rates, traditional point-to-point transmission semantics can be ill-fitted …

Citations: cited by 2 publications (1 citation statement)
References: 16 publications (5 reference statements)
“…The MTBF (Mean Time Between Failures) value for devices of critical infrastructure are getting longer and the reliability of a network devices (availability) reaches 99,999%. However this level of reliability, in some cases, is still not sufficient enough [1]. The formula of improving network availability by improving the device production process and improving the reliability parameters of network nodes seems to be reaching its limit.…”
Section: Introduction
Citation type: mentioning
confidence: 99%