Reliability indicators were studied using the example of the “SKIF-GEO” cluster supercomputer configuration (hereafter, the cluster), developed within the scientific and technical program “SKIF-Nedra” (2015–2018, a Program of the Union State of Russia and Belarus). The cluster is a stationary supercomputer configuration designed to run resource-intensive applications in data processing centers (DPC). The computing platforms and other cluster modules are housed in a single 19-inch rack 42U high. The theoretical peak performance of the cluster is 100 Tflop/s. The basic architectural principles implemented in the cluster, its composition, and its structural-functional scheme are given. Methodological support for calculating the reliability of the cluster, based on the authors’ previous studies, is proposed. Drawing on these studies, the structural scheme of reliability (SSR) of the cluster is substantiated; it consists of two parts: the cluster core and the combination of computing facilities (nodes) (CCF). The cluster core includes the component parts (CP) of the cluster whose failure reduces performance to zero. The CCF includes the CP whose failures only reduce cluster performance. The choice of the main reliability indicators for the cluster core and the CCF is justified, and formulas for calculating these indicators are given. The consequences of failures of cluster components are analyzed. Based on this analysis, the SSR of the cluster core is determined, from which a formula for calculating the reliability indicators of the cluster core is derived. A mathematical model of reliability (state graph) of the CCF is proposed, from which formulas are derived for the mean time to failure and the mean time between failures of the CCF.
The reliability of those cluster CP for which no trustworthy reliability data are available is estimated from the SSR of these CP. The reliability of the cluster as a whole is assessed by calculating reliability indicators from reference data on the reliability of components, as well as from operating data of the “SKIF” family of supercomputers. Using this assessment and the calculated relations obtained, the cluster reliability indicators were calculated for two options: with and without a reserve of computing nodes. The high values of the cluster reliability indicators were achieved thanks to the architectural and structural solutions adopted during its development, which are aimed at increasing its survivability.
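The abstract does not reproduce the authors’ formulas, so the following is only a minimal sketch of the kind of calculation described, under assumptions that are not taken from the paper: all components have constant (exponential) failure rates, the cluster core is a series structure, and the CCF is treated as a k-out-of-n group of identical nodes without repair. All function names and numeric values are hypothetical.

```python
import math


def series_mttf(failure_rates):
    """MTTF of a series structure (assumed model for the cluster core).

    With exponential components, the failure rates of a series
    structure add, so MTTF = 1 / sum(lambda_i).
    """
    total = sum(failure_rates)
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return 1.0 / total


def k_out_of_n_mttf(n, k, node_rate):
    """MTTF of n identical nodes when at least k must work (no repair).

    Assumed model for the CCF: the group walks through states with
    n, n-1, ..., k working nodes, staying 1 / (i * lambda) on average
    in the state with i working nodes.
    """
    if not 1 <= k <= n or node_rate <= 0:
        raise ValueError("need 1 <= k <= n and a positive failure rate")
    return sum(1.0 / (i * node_rate) for i in range(k, n + 1))


def survival_probability(failure_rate, hours):
    """Probability of failure-free operation over `hours` (exponential law)."""
    return math.exp(-failure_rate * hours)


# Hypothetical input data: failure rates in 1/hour.
core_rates = [2e-6, 5e-6, 3e-6]   # e.g. switch, head node, power unit
node_rate = 1e-5                  # one computing node
required_nodes = 30               # nodes needed for the task

core_mttf = series_mttf(core_rates)
mttf_without_reserve = k_out_of_n_mttf(30, required_nodes, node_rate)
mttf_with_reserve = k_out_of_n_mttf(32, required_nodes, node_rate)
```

In this simplified model, adding two reserve nodes adds the terms for 31 and 32 working nodes to the sum, which illustrates why the variant with a reserve of computing nodes has the higher mean time to failure.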
The article considers how to increase the informative content of the calculated values of the reliability measure (RM) of objects whose reliability is ensured by redundancy of structural elements. The increase is achieved by using interval estimates of the RM. In a conventional reliability calculation, the calculated RM of the object is a single value, whereas an interval reliability estimate yields a range of values, which can reasonably be regarded as an increase in informative content. On-board equipment of small spacecraft was chosen as the object of research for the following reasons: at present, the vast majority of spacecraft can be classified as small spacecraft; since the reliability requirements for small spacecraft are high, redundancy has to be used; and the Belarusian spacecraft for remote sensing of the Earth belongs to the category of small spacecraft. As a result of the research, formulas for calculating interval estimates are established for both linear and nonlinear dependence of the object’s RM on the RM of its elements. Structural reliability schemes (SSR) are used as the reliability model of the object (system); they include blocks of elements without redundancy (simple blocks) and blocks with different types of redundancy (complex blocks). The object’s RM is the reliability measure determined by its SSR. Therefore, to obtain an interval estimate of the object’s RM, interval estimates of the RM of its blocks must first be obtained. RM interval estimates of simple and complex SSR blocks are obtained in the article. Complex blocks were considered as a set of parallel circuits providing continuous redundancy for all loaded circuits; non-continuous redundancy of loaded and unloaded circuits; standby redundancy; and redundancy by voting.
The article gives formulas for the interval estimation of the RM of an object represented by its SSR, together with an example of applying the methodology to a component part of a real on-board information system. The boundary values of the interval estimates in the example can be taken as the optimistic and pessimistic estimates.
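The article’s own formulas are not reproduced in the abstract. As a rough illustration only, the sketch below propagates interval estimates through three common block types, assuming independent elements and taking the RM to be the probability of failure-free operation. Because each block function is non-decreasing in every element’s RM, the lower and upper bounds of the block are obtained by evaluating it at the elements’ lower and upper bounds. The function names and numbers are hypothetical, and standby and non-continuous redundancy are not covered.

```python
import math


def series_interval(elements):
    """Interval RM of a simple block (elements in series, no redundancy).

    `elements` is a list of (low, high) pairs; the block RM is the
    product of element RMs, so the bounds multiply separately.
    """
    low = math.prod(lo for lo, _ in elements)
    high = math.prod(hi for _, hi in elements)
    return low, high


def parallel_interval(circuits):
    """Interval RM of a block with continuous (loaded) redundancy.

    The block fails only if every parallel circuit fails:
    R = 1 - prod(1 - R_i).
    """
    low = 1.0 - math.prod(1.0 - lo for lo, _ in circuits)
    high = 1.0 - math.prod(1.0 - hi for _, hi in circuits)
    return low, high


def two_of_three_interval(channel):
    """Interval RM of 2-out-of-3 voting over identical channels.

    R = 3p^2 - 2p^3 is increasing on [0, 1]; an ideal voter is assumed.
    """
    def vote(p):
        return 3.0 * p**2 - 2.0 * p**3

    lo, hi = channel
    return vote(lo), vote(hi)


# Hypothetical SSR: a duplicated unit in series with a non-redundant unit.
duplicated = parallel_interval([(0.90, 0.95), (0.90, 0.95)])
system = series_interval([duplicated, (0.98, 0.99)])
pessimistic, optimistic = system
```

The two ends of the resulting interval correspond to the pessimistic and optimistic estimates mentioned above.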