Computing at Massive Scale: Scalability and Dependability Challenges

Yang, Renyu; Xu, Jie

doi:10.1109/sose.2016.73

Cited by 25 publications

(16 citation statements)

References 51 publications


“…With big data processing demands soaring and service decoupling, there is a general manifestation that heterogeneous workloads (in terms of execution durations, resource patterns, etc.) run and operate in the data center cluster [39]. Herein, we make coarse comparisons shown in Table 1.…”

Section: Problem Definition (mentioning)

confidence: 99%

Intelligent Resource Scheduling at Scale: A Machine Learning Perspective

Yang, Ouyang, Chen, et al. 2018

2018 IEEE Symposium on Service-Oriented System Engineering (SOSE)

Self Cite


Resource scheduling refers to the problem of packing tasks with multi-dimensional resource requirements and non-functional constraints. The heterogeneity of workload and server characteristics in Cloud-scale or Internet-scale environments has raised unprecedented challenges for cluster scheduling. Compared with ad-hoc heuristics for the multi-resource cluster scheduling problem, machine learning (ML) approaches can improve the efficiency of resource management. In this paper, we describe and discuss how ML can be used to autonomously explore and understand both workloads and environments, and to learn how to deal efficiently with scheduling problems such as consolidating co-located workloads, handling resource requests, guaranteeing applications' QoS, and mitigating tailed stragglers. The scheduling procedure can be fundamentally learned from experience rather than driven by human subjectivity. Additionally, we present a generalized ML-based solution and demonstrate its effectiveness through a case study of a performance-centric node classification and straggler mitigation method. We believe that rethinking scheduling from an ML perspective can steer architecture optimization and efficiency improvement.
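To make the "packing tasks with multi-dimensional resource requirements" framing concrete, here is a minimal sketch of the kind of ad-hoc heuristic the abstract contrasts with ML approaches: a first-fit-decreasing packer over (CPU, memory) demand vectors. It is not taken from the cited paper. The `Node` class, the `first_fit_decreasing` function, the two resource dimensions, and the size ordering by summed demand are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical server with a fixed capacity per resource dimension."""
    name: str
    capacity: tuple                       # e.g. (cpu_cores, mem_gb)
    used: list = field(init=False)        # running total per dimension
    tasks: list = field(default_factory=list)

    def __post_init__(self):
        self.used = [0] * len(self.capacity)

    def fits(self, demand):
        # A task fits only if every dimension stays within capacity.
        return all(u + d <= c
                   for u, d, c in zip(self.used, demand, self.capacity))

    def place(self, task_name, demand):
        self.used = [u + d for u, d in zip(self.used, demand)]
        self.tasks.append(task_name)


def first_fit_decreasing(tasks, nodes):
    """Place each task on the first node that can hold it.

    tasks: dict mapping task name -> demand tuple.
    Returns a dict mapping task name -> node name, or None if unplaced.
    Tasks are ordered by summed demand, a crude size measure chosen
    only for illustration (it mixes units across dimensions).
    """
    placement = {}
    ordered = sorted(tasks.items(), key=lambda kv: sum(kv[1]), reverse=True)
    for task_name, demand in ordered:
        placement[task_name] = None
        for node in nodes:
            if node.fits(demand):
                node.place(task_name, demand)
                placement[task_name] = node.name
                break
    return placement


if __name__ == "__main__":
    nodes = [Node("n1", (4, 8)), Node("n2", (8, 32))]
    tasks = {"a": (2, 4), "b": (3, 2), "c": (2, 16), "d": (4, 8), "e": (2, 2)}
    print(first_fit_decreasing(tasks, nodes))
```

A heuristic like this ignores the non-functional constraints, interference between co-located workloads, and stragglers that the abstract lists, which is the gap the authors argue learned policies can address.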



“…Specifically these data sets are too big to store on a single machine and so must be distributed. Already the growth of data is exponential [54] and increasing data collection and further cloud services will only accelerate this further [55]. Very quickly this could lead to a situation where we are no longer able to process the vast amount of data being collected.…”

Section: Challenge: Data Explosion (mentioning)

confidence: 99%

Massive-Scale Automation in Cyber-Physical Systems: Vision & Challenges

McKee, Clement, Almutairi, et al. 2017

2017 IEEE 13th International Symposium on Autonomous Decentralized System (ISADS)

Self Cite


Abstract: The next era of computing is the evolution of the Internet of Things (IoT) and Smart Cities with development of the Internet of Simulation (IoS). The existing technologies of Cloud, Edge, and Fog computing as well as HPC being applied to the domains of Big Data and deep learning are not adequate to handle the scale and complexity of the systems required to facilitate a fully integrated and automated smart city. This integration of existing systems will create an explosion of data streams at a scale not yet experienced. The additional data can be combined with simulations as services (SIMaaS) to provide a shared model of reality across all integrated systems, things, devices, and individuals within the city. There are also numerous challenges in managing the security and safety of the integrated systems. This paper presents an overview of the existing state-of-the-art in automating, augmenting, and integrating systems across the domains of smart cities, autonomous vehicles, energy efficiency, smart manufacturing in Industry 4.0, and healthcare. Additionally, the key challenges relating to Big Data, a model of reality, augmentation of systems, computation, and security are examined.


“…Faults may occur simultaneously and in any aspect of system operations ranging from application to hardware, and may have a wide variety of causes, including insufficient memory (OOM), overweight system utilization, performance interference, network congestion, server faults (e.g. disk, middleware software), and applications crash or hanging etc [8].…”

Section: Motivation (mentioning)

confidence: 99%

“…Dependability is a key concern for Cloud resource managers due to increasingly common failures which are now the norm rather than the exception caused by the enlarged system scale and complexity [6] [7] [8], different workload characteristics, and plethora of faults types that can activate. Failures within a resource manager have the potential to cause significant economic consequences to Cloud providers due to loss of service to consumers [9], and could affect services provisioned to millions globally in the event of catastrophic failures.…”

Manuscript received Jun 10, 2015; revised Dec 31, 2015; accepted Mar 3, 2016.

Section: Introduction (mentioning)

confidence: 99%

Reliable Computing Service in Massive-Scale Systems through Rapid Low-Cost Failover

Yang, Zhang, Garraghan, et al. 2017

IEEE Trans. Serv. Comput.

Self Cite


Abstract: Large-scale distributed systems deployed as Cloud datacenters are capable of provisioning service to consumers with diverse business requirements. Providers face pressure to provision uninterrupted reliable services while reducing operational costs due to significant software and hardware failures. A widely adopted means to achieve such a goal is using redundant system components to implement user-transparent failover, yet its effectiveness must be balanced carefully without incurring heavy overhead when deployed, an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer serious limitations, including mandatory restart leading to poor cost-effectiveness, as well as solely focusing on crash failures, omitting other important types, such as timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restart, with minimal system resource overhead. It also copes with different failures, including correlated and simultaneous events. The proposed approach was implemented, deployed and evaluated within the Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that our approach tolerates complex failure scenarios while incurring at worst 228.5 microsecond instance overhead with 1.71% additional CPU usage.
