Even though graphics processors (GPUs) are becoming increasingly popular for general purpose computing, current (and likely near future) generations of GPUs do not provide hardware support for detecting soft/hard errors in computation logic or memory storage cells since graphics applications are inherently fault tolerant. As a result, if an error occurs in GPUs during program execution, the results could be silently corrupted, which is not acceptable for general purpose computations. To improve the fidelity of general purpose computation on GPUs (GPGPU), we investigate software approaches to perform redundant execution. In particular, we propose and study three different, application-level techniques. The first technique simply executes the GPU kernel program twice, and thus achieves roughly half of the throughput of a non-redundant execution. The next two techniques interleave redundant execution with the original code in different ways to take advantage of the parallelism between the original code and its redundant copy. Furthermore, we evaluate the benefits of providing hardware support, including ECC/parity protection to on-chip and off-chip memories, for each of the software techniques. Interestingly, our findings, based on six commonly used applications, indicate that the benefits of complex software approaches are both application and architecture dependent. The simple approach, which executes the kernel twice, is often sufficient and may even outperform the complex ones. Moreover, we argue that the cost is not justified to protect memories with ECC/parity bits.
In this paper, we present our effort in developing an opensource GPU (graphics processing units) code library for the MATLAB Image Processing Toolbox (IPT). We ported a dozen of representative functions from IPT and based on their inherent characteristics, we grouped these functions into four categories: data independent, data sharing, algorithm dependent and data dependent. For each category, we present a detailed case study, which reveals interesting insights on how to efficiently optimize the code for GPUs and highlight performance-critical hardware features, some of which have not been well explored in existing literature. Our results show drastic speedups for the functions in the data-independent or data-sharing category by leveraging hardware support judiciously; and moderate speedups for those in the algorithm-dependent category by careful algorithm selection and parallelization. For the functions in the last category, finegrain synchronization and data-dependency requirements are the main obstacles to an efficient implementation on GPUs.
In this paper we propose a unified architectural support that can be used flexibly for either soft-error protection or software bug detection. Our approach is based on dynamically detecting and enforcing instructionlevel invariants. A hardware table is designed to keep track of run-time invariant information. During program execution, instructions access this table and compare their produced results against the stored invariants. Any violation of the predicted invariant suggests a potential abnormal behavior, which could be a result of a soft error or a latent software bug.In case of a soft error, monitoring invariant violations provides opportunistic soft-error protection to multiple structures in processor pipelines. Our experimental results show that invariant violations detect soft errors promptly and as a result, simple pipeline squashing is able to fix most of the detected soft errors. Meanwhile, the same approach can be easily adapted for software bug detection. The proposed architectural support eliminates the substantial performance overhead associated with software-based bug-detection approaches and enables continuous monitoring of production code.
Software defects, commonly known as bugs, present a serious challenge for system reliability and dependability. Once a program failure is observed, the debugging activities to locate the defects are typically nontrivial and time consuming. In this paper, we propose a novel automated approach to pin-point the root-causes of software failures. Our proposed approach consists of three steps. The first step is bug prediction, which leverages the existing work on anomaly-based bug detection as exceptional behavior during program execution has been shown to frequently point to the root cause of a software failure. The second step is bug isolation, which eliminates false-positive bug predictions by checking whether the dynamic forward slices of bug predictions lead to the observed program failure. The last step is bug validation, in which the isolated anomalies are validated by dynamically nullifying their effects and observing if the program still fails. The whole bug prediction, isolation and validation process is fully automated and can be implemented with efficient architectural support. Our experiments with 6 programs and 7 bugs, including a real bug in the gcc 2.95.2 compiler, show that our approach is highly effective at isolating only the relevant anomalies. Compared to state-of-art debugging techniques, our proposed approach pinpoints the defect locations more accurately and presents the user with a much smaller code set to analyze.
Two recent trends that have emerged include (1) Rapid growth in big data technologies with new types of computing models to handle unstructured data, such as mapreduce and noSQL (2) A growing focus on the memory subsystem for performance and power optimizations, particularly with emerging memory technologies offering different characteristics from conventional DRAM (bandwidths, read/write asymmetries).This paper examines how these trends may intersect by characterizing the memory access patterns of various Hadoop and noSQL big data workloads. Using memory DIMM traces collected using special hardware, we analyze the spatial and temporal reference patterns to bring out several insights related to memory and platform usages, such as memory footprints, read-write ratios, bandwidths, latencies, etc. We develop an analysis methodology to understand how conventional optimizations such as caching, prediction, and prefetching may apply to these workloads, and discuss the implications on software and system design.
Software defects, commonly known as bugs, present a serious challenge for system reliability and dependability. Once a program failure is observed, the debugging activities to locate the defects are typically nontrivial and time consuming. In this paper, we propose a novel automated approach to pin-point the root-causes of software failures.Our proposed approach consists of three steps. The first step is bug prediction, which leverages the existing work on anomaly-based bug detection as exceptional behavior during program execution has been shown to frequently point to the root cause of a software failure. The second step is bug isolation, which eliminates false-positive bug predictions by checking whether the dynamic forward slices of bug predictions lead to the observed program failure. The last step is bug validation, in which the isolated anomalies are validated by dynamically nullifying their effects and observing if the program still fails. The whole bug prediction, isolation and validation process is fully automated and can be implemented with efficient architectural support. Our experiments with 6 programs and 7 bugs, including a real bug in the gcc 2.95.2 compiler, show that our approach is highly effective at isolating only the relevant anomalies. Compared to state-of-art debugging techniques, our proposed approach pinpoints the defect locations more accurately and presents the user with a much smaller code set to analyze.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.