Summary
Block devices such as magnetic disks are nonvolatile data storage devices that transfer data in fixed-size chunks. They are the main nonvolatile memory that holds the file system, and they are also used in virtual memory mechanisms such as swapping and page fault handling. Investigating storage performance issues requires full insight into the operating system internals. Kernel tracing offers an efficient mechanism to gather information about the storage subsystem at runtime. Still, the tracing output is often huge and difficult to analyze manually. In this paper, we introduce a framework to compute meaningful storage performance metrics from low-level trace events generated by LTTng. A stateful approach is used to model the state of the storage subsystem. Efficient data structures and algorithms are proposed to offer a reasonable response time, allowing the user to navigate through the trace and to retrieve metrics from any time range. The framework includes a visualization system that provides different graphical views that represent the collected information in a convenient way. These views are synchronized, forming a comprehensive perspective that makes storage performance investigation a much more comfortable task. Different use cases are presented to show the usefulness of the framework in real-world applications.
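To make the idea of a stateful analysis more concrete, the following is a minimal sketch, not the framework's actual implementation. It assumes trace events have already been parsed into hypothetical `(timestamp_ns, name, dev, sector)` tuples sorted by time, and it uses the Linux block-layer tracepoint names `block_rq_issue` and `block_rq_complete`. The state kept between events is the set of in-flight requests, from which per-device request latencies are derived.

```python
from collections import defaultdict


def request_latencies(events):
    """Compute per-device I/O latencies from issue/complete event pairs.

    `events` is an iterable of (timestamp_ns, name, dev, sector) tuples,
    sorted by timestamp. This tuple layout is an assumption made for the
    sketch, not a format defined by LTTng or by the paper.
    """
    pending = {}                   # state: (dev, sector) -> issue timestamp
    latencies = defaultdict(list)  # dev -> list of latencies in ns
    for ts, name, dev, sector in events:
        key = (dev, sector)
        if name == "block_rq_issue":
            # Request handed to the device driver: remember when.
            pending[key] = ts
        elif name == "block_rq_complete":
            # Completion: match it with the issue seen earlier, if any.
            issued = pending.pop(key, None)
            if issued is not None:
                latencies[dev].append(ts - issued)
    return dict(latencies)


events = [
    (100, "block_rq_issue", "sda", 2048),
    (150, "block_rq_issue", "sda", 4096),
    (400, "block_rq_complete", "sda", 2048),
    (900, "block_rq_complete", "sda", 4096),
]
print(request_latencies(events))
```

A completion with no matching issue (for example, a request issued before tracing started) is simply ignored in this sketch; a real analysis would likely need to handle such cases explicitly.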
Distributed storage systems are commonly used in modern computing. They are highly scalable and offer data replication and fault tolerance. The complexity of those systems makes them difficult to debug using traditional tools. The existing tools are able to evaluate the overall performance of such systems, but they do not provide enough information to find the root cause of performance issues. In this article, we propose a tracing-based performance analysis framework for storage clusters. We use a tracing strategy that reduces the tracing overhead in production systems. The traces collected from the different storage nodes are correlated and used to generate a data model that represents the cluster. Userspace tracing is used to gather data from the storage daemons, while kernel tracing is used to provide detailed information about operating system internals such as disk queues, network queues, and process scheduling. Efficient data structures are used to store the model and to generate metrics and graphical views. Our tool is used in different real-world scenarios and is able to investigate interesting performance problems, including I/O latencies, data replication, and storage node failures.
Root cause identification of performance degradation within distributed systems is often a difficult and time-consuming task, yet it is crucial for maintaining high performance. In this paper, we present an execution trace-driven solution that reduces the effort required to investigate, debug, and solve performance problems found in multinode distributed systems. The proposed approach employs a unified analysis method to represent trace data collected from the user-space level to the hardware level of involved nodes, allowing for efficient and effective root cause analysis. This solution works by extracting performance metrics and state information from trace data collected at user-space, kernel, and network levels. The multisource trace data is then synchronized and structured in a multidimensional data store, which is designed specifically for this kind of data. A posteriori analysis using a top-down approach is then used to investigate performance problems and detect their root causes. In this paper, we apply this generic framework to analyze trace data collected from the execution of the web server, database server, and application servers in a distributed LAMP (Linux, Apache, MySQL, and PHP) stack. Using industrial-level use cases, we show that the proposed approach is capable of investigating the root cause of performance issues, addressing unusual latency, and improving base latency by 70%. This is achieved with minimal tracing overhead that does not significantly impact performance, as well as O(log n) query response times for efficient analysis.
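As an illustration of how a time-range metric query can run in logarithmic time, here is a small sketch under stated assumptions. It is not the multidimensional data store described in the abstract; it swaps in a much simpler technique, a sorted timestamp array with prefix sums queried by binary search. The class name `RangeMetricIndex` and the sample data are hypothetical.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


class RangeMetricIndex:
    """Answer 'sum of a metric over [start, end]' in O(log n) per query."""

    def __init__(self, samples):
        # samples: list of (timestamp, value) pairs, sorted by timestamp.
        self.timestamps = [t for t, _ in samples]
        # prefix[i] holds the sum of the first i values, so prefix[0] == 0.
        self.prefix = [0] + list(accumulate(v for _, v in samples))

    def total(self, start, end):
        """Sum of values whose timestamp satisfies start <= t <= end."""
        lo = bisect_left(self.timestamps, start)   # first index with t >= start
        hi = bisect_right(self.timestamps, end)    # first index with t > end
        return self.prefix[hi] - self.prefix[lo]


# Hypothetical samples: (timestamp, bytes transferred)
samples = [(10, 4096), (20, 8192), (35, 4096), (50, 512)]
index = RangeMetricIndex(samples)
print(index.total(15, 40))
```

Building the index takes linear time once; each later query costs two binary searches and one subtraction, which is what keeps interactive navigation over a large trace responsive in this simplified setting.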