The wide diffusion of highly parallel shared-memory multi-core machines has led Parallel Discrete Event Simulation (PDES) platforms to shift towards a share-everything model. This model is based on loose coupling between simulation objects and threads, lasting (in the extreme case) no more than the lifetime of an individual event. Concurrent threads can therefore CPU-dispatch events destined to any object at any point in time, thus fully sharing the workload of events to be processed on a fine-grained basis. This demands efficient mechanisms for sharing the overall pool of pending events by enabling parallelism in insertion and extraction operations. In this article we present a lock-free event pool which also provides amortized O(1) time complexity for both insertions and extractions. It can sustain highly concurrent accesses without noticeable performance degradation when scaling up the thread count. Experimental results demonstrate that our solution stands as a core facility capable of further raising the practical impact of the emerging share-everything PDES paradigm.
The share-everything PDES (Parallel Discrete Event Simulation) paradigm is based on fully sharing the possibility to process any individual event across concurrent threads, rather than binding Logical Processes (LPs) and their events to threads. It allows concentrating, at any time, the computing power (the CPU cores on board a shared-memory machine) on the unprocessed events that stand closest to the current commit horizon of the simulation run. This fruitfully biases the delivery of the computing power towards the hot portion of the model execution trajectory. In this article we present an innovative share-everything PDES system that provides (1) fully non-blocking coordination of the threads when accessing shared data structures and (2) fully speculative processing capabilities (Time Warp style processing) of the events. As we show via an experimental study, our proposal can cope with hard workloads where both classical Time Warp systems (based on binding LPs to threads) and previous share-everything proposals (not able to exploit fully speculative processing of the events) tend to fail in delivering adequate performance.
A crucial aspect in software development is understanding how an application's performance scales as a function of its input data. Estimating the empirical cost function of individual routines of a program can help developers predict the runtime on larger workloads and pinpoint asymptotic inefficiencies in the code. While this has been the target of extensive research in performance profiling, a major limitation of state-of-the-art approaches is that the input size is assumed to be determinable from the program's state prior to the invocation of the routine to be profiled, failing to characterize the scenario where routines dynamically receive input values during their activations. This results in missing workloads generated by kernel system calls (e.g., in response to I/O or network operations) or by other threads, which play a crucial role in modern concurrent and interactive applications. Measuring dynamic workloads poses several challenges, requiring shared-memory communication between threads to be efficiently traced. In this paper we present a new metric and an efficient algorithm for automatically estimating the size of the input of each routine activation. We provide examples showing that our metric allows the estimation of the empirical cost functions of complex applications more precisely than previous approaches. An extensive experimental investigation on a variety of benchmarks shows that our metric can be integrated in a Valgrind-based profiler incurring overheads comparable to other prominent heavyweight dynamic analysis tools.
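As a rough illustration of what estimating an empirical cost function involves, the sketch below fits a power law cost ≈ a·n^b to (input size, cost) pairs by least squares on log-log axes. It is only a generic curve fit under stated assumptions: the function name `fit_power_law` and the sample data are hypothetical, and it does not implement the input-size metric or the profiling algorithm the abstract refers to.

```python
import math

def fit_power_law(samples):
    """Least-squares fit of cost ~ a * n**b on log-log axes.

    samples: iterable of (input_size, cost) pairs, both strictly positive.
    """
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(c) for _, c in samples]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # estimated growth exponent
    a = math.exp(mean_y - b * mean_x)   # estimated constant factor
    return a, b

# Hypothetical profile of one routine: cost grows quadratically with input size.
samples = [(n, 3 * n * n) for n in (10, 20, 40, 80, 160)]
a, b = fit_power_law(samples)
```

The quality of such a fit depends entirely on how the input size of each activation is measured: if data received during the activation (from system calls or other threads) is not counted, the size values are too small and the fitted exponent is misleading.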
Emerging share-everything Parallel Discrete Event Simulation (PDES) platforms rely on worker threads fully sharing the workload of events to be processed. These platforms require efficient event pool data structures enabling high concurrency of extraction/insertion operations. Non-blocking event pool algorithms are emerging as promising solutions to this problem. However, the classical non-blocking paradigm leads concurrent conflicting operations, acting on the same portion of the event pool data structure, to abort and then retry. In this article we present a conflict-resilient non-blocking calendar queue that enables conflicting dequeue operations, concurrently attempting to extract the minimum element, to survive, thus improving the scalability of accesses to the hot portion of the data structure, namely the bucket to which the current locality of the events to be processed is bound. We have integrated our solution within an open source share-everything PDES platform and report the results of an experimental analysis of the proposed concurrent data structure compared to solutions from the literature.
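For readers unfamiliar with the underlying data structure, the following is a minimal sequential sketch of a calendar queue. It is an assumption-laden illustration only: it is single-threaded, uses a fixed number of buckets with no resizing, and is not the conflict-resilient non-blocking algorithm described above. The class name `CalendarQueue` and its parameters are hypothetical.

```python
import bisect
import itertools

class CalendarQueue:
    """Sequential calendar queue: timestamps are hashed into time-sliced buckets."""

    def __init__(self, num_buckets=8, width=1.0):
        self.width = width                        # time span covered by one bucket
        self.buckets = [[] for _ in range(num_buckets)]
        self.cur = 0                              # virtual index of the bucket being scanned
        self.size = 0
        self._seq = itertools.count()             # tie-breaker for equal timestamps

    def _vindex(self, ts):
        # Virtual (unwrapped) bucket index of a timestamp.
        return int(ts // self.width)

    def enqueue(self, ts, payload):
        v = self._vindex(ts)
        if self.size == 0 or v < self.cur:
            self.cur = v                          # keep the scan position at or before every event
        bucket = self.buckets[v % len(self.buckets)]
        bisect.insort(bucket, (ts, next(self._seq), payload))
        self.size += 1

    def dequeue(self):
        if self.size == 0:
            raise IndexError("dequeue from empty calendar queue")
        n = len(self.buckets)
        for i in range(n):
            bucket = self.buckets[(self.cur + i) % n]
            # The head is the minimum only if it falls in the slice being scanned.
            if bucket and self._vindex(bucket[0][0]) == self.cur + i:
                self.cur += i
                break
        else:
            # No event in the next n slices: jump directly to the earliest head.
            bucket = min((b for b in self.buckets if b), key=lambda b: b[0])
            self.cur = self._vindex(bucket[0][0])
        ts, _, payload = bucket.pop(0)
        self.size -= 1
        return ts, payload

q = CalendarQueue(num_buckets=4, width=1.0)
timestamps = (5.2, 0.7, 3.1, 0.2, 17.9, 3.0)
for ts in timestamps:
    q.enqueue(ts, f"event@{ts}")
order = [q.dequeue()[0] for _ in timestamps]
```

In this sketch the bucket at position `cur` plays the role of the hot portion of the structure: every dequeue targets it first, which is where concurrent extractions would collide in a multi-threaded setting.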