Increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena substantially affecting overall system performance. This phenomenon is known as the "Long Tail", whereby a small proportion of task stragglers significantly impede job completion time. While existing work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-cause and quantifies its impact upon system operation. Such analysis is critical to ascertain in-depth knowledge of straggler occurrence for focusing developmental and research efforts towards solving the Long Tail challenge. This paper provides an empirical analysis of straggler root-cause within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact stragglers impose, and propose a method for conducting root-cause analysis. Results demonstrate approximately 5% of task stragglers impact 50% of total jobs for batch processes, and 53% of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution patterns modeling and online analytic agents to monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11% into their execution lifecycle with 95% accuracy for short duration jobs.
The next era of computing is the evolution of the Internet of Things (IoT) and Smart Cities with the development of the Internet of Simulation (IoS). The existing technologies of Cloud, Edge, and Fog computing as well as HPC being applied to the domains of Big Data and deep learning are not adequate to handle the scale and complexity of the systems required to facilitate a fully integrated and automated smart city. This integration of existing systems will create an explosion of data streams at a scale not yet experienced. The additional data can be combined with simulations as services (SIMaaS) to provide a shared model of reality across all integrated systems, things, devices, and individuals within the city. There are also numerous challenges in managing the security and safety of the integrated systems. This paper presents an overview of the existing state-of-the-art in automating, augmenting, and integrating systems across the domains of smart cities, autonomous vehicles, energy efficiency, smart manufacturing in Industry 4.0, and healthcare. Additionally, the key challenges relating to Big Data, a model of reality, augmentation of systems, computation, and security are examined.
A trend seen in many industries is the increasing reliance on modelling and simulation to facilitate design, decision making and training. Previously, these models would operate in isolation but now there is a growing need to integrate and connect simulations together for co-simulation. In addition, the 21st century has seen the expansion of the Internet of Things (IoT) enabling the interconnectivity of smart devices across the Internet. In this paper we propose that an important, and often overlooked, domain of IoT is that of modelling and simulation. Expanding IoT to encompass interconnected simulations enables the potential for an Internet of Simulation (IoS) whereby models and simulations are exposed to the wider internet and can be accessed on an "as-a-service" basis. The proposed IoS would need to manage simulation across heterogeneous infrastructures; temporal and causal aspects of simulations; as well as variations in data structures. Via the proposed Simulation as a Service (SIMaaS) and Workflow as a Service (WFaaS) constructs in IoS, highly complex simulation integration could be performed automatically, resulting in high fidelity system level simulations. Additionally, the potential for faster than real-time simulation afforded by IoS opens the possibility of connecting IoS to existing IoT infrastructure via a real-time bridge to facilitate decision making based on live data.
The trend towards turning existing cities into smart cities is growing. Facilitated by advances in computing such as Cloud services and Internet of Things (IoT), smart cities propose to bring integrated, autonomous systems together to improve quality of life for their inhabitants. Systems such as autonomous vehicles, smart grids and intelligent traffic management are in the initial stages of development. However, as yet there is no holistic architecture on which to integrate these systems into a smart city. Additionally, the existing systems and infrastructure of cities are extensive and critical to their operation. We cannot simply replace these systems with smarter versions; instead, the system intelligence must augment the existing systems. In this paper we propose a service-oriented reference architecture for smart cities which can tackle these problems and identify some related open research questions. The abstract architecture encapsulates the way in which different aspects of the service-oriented approach span through the layers of existing city infrastructure. Additionally, the extensible provision of services by individual systems allows for the organic growth of the smart city as required.
Cloud computing systems face the substantial challenge of the Long Tail problem: a small subset of straggling tasks significantly impedes parallel job completion. This behavior results in longer service response times and degraded system utilization. Speculative execution, which creates task replicas at runtime, is a typical method deployed in large-scale distributed systems to tolerate stragglers. This approach defines stragglers by specifying a static threshold value, which calculates the temporal difference between an individual task and the average task progression for a job. However, specifying a static threshold debilitates speculation effectiveness as it fails to consider the intrinsic diversity of job timing constraints within modern day Cloud computing systems. Capturing such heterogeneity makes it possible to impose different levels of strictness for replica creation while achieving specified levels of QoS for different application types. Furthermore, a static threshold also fails to consider system environmental constraints in terms of replication overheads and optimal system resource usage. In this paper we present an algorithm for dynamically calculating a threshold value to identify task stragglers, considering key parameters including job QoS timing constraints, task execution characteristics, and optimal system resource utilization. We study and demonstrate the effectiveness of our algorithm through simulating a number of different operational scenarios based on real production cluster data against state-of-the-art solutions. Results demonstrate that our approach is capable of creating 58.62% fewer replicas under high resource utilization while reducing response time by up to 17.86% for idle periods compared to a static threshold.
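The static-threshold rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `find_stragglers`, the progress scores in `job`, and the threshold value of 0.2 are all assumptions chosen for illustration, and progress is assumed to be a score between 0 and 1.

```python
from statistics import mean


def find_stragglers(progress, threshold=0.2):
    """Flag tasks whose progress lags the job average by more than `threshold`.

    `progress` maps a task ID to a progress score in [0, 1].
    The fixed `threshold` is the static value the abstract criticizes;
    0.2 is an illustrative assumption, not a figure from the paper.
    """
    average = mean(progress.values())
    return sorted(task for task, score in progress.items()
                  if average - score > threshold)


# Hypothetical job with four tasks; t3 is far behind the others.
job = {"t1": 0.90, "t2": 0.85, "t3": 0.30, "t4": 0.95}
stragglers = find_stragglers(job)            # uses the assumed 0.2 threshold
strict = find_stragglers(job, threshold=0.5)  # a laxer threshold flags nothing
print(stragglers, strict)
```

Because the threshold is a fixed constant, the same task is or is not a straggler regardless of the job's timing constraints or the current cluster load, which is the limitation a dynamically calculated threshold is meant to address.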
Modern Cloud computing systems are massive in scale, featuring environments that can execute highly dynamic Internetware applications with huge numbers of interacting tasks. This has led to a substantial challenge: the straggler problem, whereby a small subset of slow tasks significantly impedes parallel job completion. This problem results in longer service responses, degraded system performance, and late timing failures that can easily threaten Quality of Service (QoS) compliance. Speculative execution (or speculation) is the prominent method deployed in Clouds to tolerate stragglers by creating task replicas at runtime. The method detects stragglers by specifying a predefined threshold to calculate the difference between individual tasks and the average task progression within a job. However, such a static threshold debilitates speculation effectiveness as it fails to capture the intrinsic diversity of timing constraints in Internetware applications, as well as dynamic environmental factors such as resource utilization. By considering such characteristics, different levels of strictness for replica creation can be imposed to adaptively achieve specified levels of QoS for different applications. In this paper we present an algorithm to improve the execution efficiency of Internetware applications by dynamically calculating the straggler threshold, considering key parameters including job QoS timing constraints, task execution progress, and optimal system resource utilization. We implement this dynamic straggler threshold into the YARN architecture to evaluate its effectiveness against existing state-of-the-art solutions. Results demonstrate that the proposed approach is capable of reducing parallel job response times by up to 20% compared to the static threshold, and of achieving a higher speculation success rate of up to 66.67%, against 16.67% for the static method.
Simulation is critical when studying real operational behavior of increasingly complex Cyber-Physical Systems (CPSs), forecasting future behavior, and experimenting with hypothetical scenarios. A critical aspect of simulation is the ability to evaluate large-scale systems within a reasonable time frame while modeling complex interactions between millions of components. However, modern simulations face limitations in provisioning this functionality for CPSs in terms of balancing simulation complexity with performance, resulting in substantial operational costs required for completing simulation execution. Moreover, users are required to have expertise in modeling and configuring simulations to infrastructure, which is time-consuming. In this paper we present SEED (Simulation EnvironmEnt Distributor), a novel approach for simulating large-scale CPSs across a loosely-coupled distributed system requiring minimal user configuration. This is achieved through automated simulation partitioning and instantiation while enforcing tight event messaging across the system. SEED operates efficiently within both small and large-scale OTS hardware, agnostic of cluster heterogeneity and the operating system in use, and is capable of simulating the full system and network stack of a CPS. Our approach is validated through experiments conducted in a cluster to simulate CPS operation. Results demonstrate that SEED is capable of simulating CPSs containing 2,000,000 tasks across 2000 nodes with only 6.89x slowdown relative to real time, and executes effectively across distributed infrastructure.
With the evolution of the Internet of things and smart cities, a new trend of the Internet of simulation has emerged to utilise the technologies of cloud, edge, fog computing, and high-performance computing for design and analysis of complex cyber-physical systems using simulation. These technologies, although applied to the domains of big data and deep learning, are not adequate to cope with the scale and complexity of emerging connected, smart, and autonomous systems. This study explores the existing state-of-the-art in automating, augmenting, and integrating systems across the domains of smart cities, autonomous vehicles, energy efficiency, smart manufacturing in Industry 4.0, and healthcare. This is expanded to look at existing computational infrastructure and how it can be used to support these applications. A detailed review is presented of advances in approaches providing and supporting intelligence as a service. Finally, some of the remaining challenges due to the explosion of data streams; issues of safety and security; and others related to big data, a model of reality, augmentation of systems, and computation are examined.
2 Emerging applications
The emergence of the Internet of anything and everything [14] from IoT [26] is driving smarter and more context-aware systems and applications. These concepts augment the technologies related to cloud and edge computing [27] and allow computational power to be balanced against location, which has an impact on both network latencies and security. The ubiquitous management of the computational systems and communication networks is anticipated to augment and penetrate most cyber-physical systems that we interact with on a daily basis within the coming decade. One example domain is that of cooperative robotics, where advances in autonomous systems [28] are enhanced with additional computational capability from the cloud.
The resulting emerging field of cloud robotics combines the two research fields to provide intelligence services to robots from the cloud [29-31],