Past research has proposed numerous hardware prefetching techniques, most of which rely on exploiting one specific type of program context information (e.g., program counter, cacheline address, or delta between cacheline addresses) to predict future memory accesses. These techniques either completely neglect a prefetcher's undesirable effects (e.g., memory bandwidth usage) on the overall system, or incorporate system-level feedback as an afterthought to a system-unaware prefetch algorithm. We show that prior prefetchers often lose their performance benefit over a wide range of workloads and system configurations due to their inherent inability to take multiple different types of program context and system-level feedback information into account while prefetching. In this paper, we make a case for designing a holistic prefetch algorithm that learns to prefetch using multiple different types of program context and system-level feedback information inherent to its design. To this end, we propose Pythia, which formulates the prefetcher as a reinforcement learning agent. For every demand request, Pythia observes multiple different types of program context information to make a prefetch decision. For every prefetch decision, Pythia receives a numerical reward that evaluates prefetch quality under the current memory bandwidth usage. Pythia uses this reward to reinforce the correlation between program context information and prefetch decision to generate highly accurate, timely, and system-aware prefetch requests in the future. Our extensive evaluations using simulation and hardware synthesis show that Pythia outperforms two state-of-the-art prefetchers (MLOP and Bingo) by 3.4% and 3.8% in single-core, 7.7% and 9.6% in twelve-core, and 16.9% and 20.2% in bandwidth-constrained core configurations, while incurring only 1.03% area overhead over a desktop-class processor and no software changes in workloads.
The source code of Pythia can be freely downloaded from https://github.com/CMU-SAFARI/Pythia.
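The observe, decide, and reward loop described in the abstract can be illustrated with a minimal sketch. This is not Pythia's implementation: it assumes a generic tabular Q-learning update (the abstract does not name the learning rule), a single hypothetical state made of a program counter tag and the last address delta, and made-up reward values. The names `reward`, `PrefetchAgent`, and `run_constant_stride` are illustrative only.

```python
import random
from collections import defaultdict


def reward(accurate, bandwidth_high):
    """Hypothetical reward: a wrong prefetch costs more when bandwidth is scarce."""
    if accurate:
        return 1.0
    return -2.0 if bandwidth_high else -1.0


class PrefetchAgent:
    """Tabular agent mapping (state, prefetch offset) pairs to learned values."""

    def __init__(self, actions, alpha=0.5, gamma=0.5, epsilon=0.0, seed=0):
        self.actions = list(actions)      # candidate prefetch offsets (assumed)
        self.alpha = alpha                # learning rate
        self.gamma = gamma                # discount factor
        self.epsilon = epsilon            # exploration probability
        self.q = defaultdict(float)       # (state, action) -> value, default 0.0
        self.rng = random.Random(seed)

    def select(self, state):
        # Explore with probability epsilon, otherwise pick the best-valued offset.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, r, next_state):
        # Q-learning update: move the value toward reward + discounted best next value.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = r + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


def run_constant_stride(agent, stride, steps, bandwidth_high=False):
    """Toy access stream: every demand request is `stride` lines after the last."""
    state = ("pc0", stride)               # hypothetical (PC tag, last delta) feature
    for _ in range(steps):
        action = agent.select(state)
        r = reward(action == stride, bandwidth_high)
        agent.update(state, action, r, state)
    return state
```

In this toy setting, calling `run_constant_stride(PrefetchAgent([1, 2, 3]), stride=2, steps=20)` lets the agent discard the wrong offset after one penalty and then keep reinforcing the offset that matches the stride. Passing `bandwidth_high=True` only scales the penalty here; a real design would fold bandwidth usage into the reward in a more refined way.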
A vertically loaded floating pile in clay affects a neighbouring pile by increasing the latter's displacement due to its own load. As a result, a group of rigidly capped piles exhibits a force/settlement ratio (‘vertical stiffness’) that is smaller than the sum of the individual stiffnesses of each pile – ‘efficiency’ in static stiffness less than 1. However, under dynamic steady-state loading the response of the pile group is an oscillatory function of frequency, and at certain frequencies a complete reversal of the static trend occurs, with the elastic dynamic group ‘efficiency’ exceeding not only the static ‘efficiency’, but also unity. To assess the realism of such behaviour, finite-element inelastic soil models were utilised to explore the influence of soil non-linearity on pile-to-pile interaction factors, under both static and dynamic loading. It is found that, with realistically inelastic undrained clay behaviour, the influence of a loaded pile on its neighbour diminishes radically with increasing amplitude of imposed displacement. The presence of a number of in-between piles, as well as the neighbouring pile's own rigidity, has no substantial effect on the interaction. The observed trends are explained by recourse to simple physical arguments. The diagrams provided for the pile-to-pile interaction factor are utilised to obtain the vertical dynamic impedance (i.e. stiffness and damping) of a 2 × 2 and a 3 × 3 rigidly capped pile group. It is found that these impedances are in accord with those resulting from three-dimensional analysis of the complete pile group. The difference between elastic and inelastic efficiency factors is shown to be substantial. The validity of the numerical results is strictly limited to piles in soft clays, whose resisting stress on the pile shaft equals their undrained shear strength.
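To make the notion of group 'efficiency' concrete, the following is a minimal sketch for a 2 × 2 rigidly capped group. It assumes the classical superposition of pile-to-pile interaction factors, which the abstract does not spell out, and the interaction factor values in the usage note are illustrative numbers, not results from the paper. The function name `group_efficiency_2x2` is hypothetical.

```python
def group_efficiency_2x2(alpha_side, alpha_diag):
    """Efficiency of a 2 x 2 rigidly capped pile group under superposition.

    alpha_side: interaction factor between piles one spacing s apart (assumed input)
    alpha_diag: interaction factor between diagonal piles, sqrt(2) * s apart

    By symmetry each pile carries P / 4, so the cap settlement is
        w = (P / 4) / K_single * (1 + 2 * alpha_side + alpha_diag)
    and the efficiency K_group / (4 * K_single) reduces to the expression below.
    Complex interaction factors may be passed for steady-state dynamic loading.
    """
    return 1.0 / (1.0 + 2.0 * alpha_side + alpha_diag)
```

With positive (static-like) factors such as `group_efficiency_2x2(0.5, 0.4)` the efficiency is below 1, whereas negative real parts, which can arise at some frequencies because the interaction is oscillatory, give values above 1, for example `group_efficiency_2x2(-0.2, -0.1)`.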
Processing-using-memory (PuM) techniques leverage the analog operation of memory cells to perform computation. Several recent works have demonstrated PuM techniques (e.g., copy and initialization operations, bitwise operations, random number generation) in off-the-shelf DRAM devices. Since DRAM is the dominant memory technology as main memory in current computing systems, these PuM techniques represent an opportunity for alleviating the data movement bottleneck at very low cost. However, system integration of PuM techniques imposes non-trivial challenges (e.g., related to data allocation and alignment, memory coherence management) that are yet to be solved. Design space exploration of potential solutions to the PuM integration challenges requires appropriate tools to develop necessary hardware and software components. Unfortunately, current proprietary computing systems, specialized DRAM-testing platforms, or system simulators do not provide the flexibility and/or the holistic system view that is necessary to deal with PuM integration challenges. We design and develop PiDRAM, the first flexible end-to-end framework that enables system integration studies and evaluation of real PuM techniques. PiDRAM provides software and hardware components to rapidly integrate PuM techniques across the whole system software and hardware stack (e.g., necessary modifications in the operating system, memory controller). We implement PiDRAM on an FPGA-based platform along with an open-source RISC-V system. To demonstrate the flexibility and ease of use of PiDRAM, we implement and evaluate two state-of-the-art PuM techniques. First, we implement in-memory copy and initialization. We propose solutions to integration challenges (e.g., memory coherence) and conduct a detailed end-to-end implementation study. Second, we implement a true random number generator in DRAM.
Our results show that the in-memory copy and initialization techniques can improve the performance of bulk copy operations by 12.6× and bulk initialization operations by 14.6× on a real system. Implementing the true random number generator requires only 190 lines of Verilog and 74 lines of C code using PiDRAM's software and hardware components. PiDRAM is available on GitHub.
Motivation: Identifying sequence similarity is a fundamental step in genomic analyses, which is typically performed by first matching short subsequences of each genomic sequence, called seeds, and then verifying the similarity between sequences with a sufficient number of matching seeds. The length and number of seed matches between sequences directly impact the accuracy and performance of identifying sequence similarity. Existing attempts to optimize seed matches suffer from performing either 1) the costly similarity verification for too many sequence pairs due to finding a large number of exact-matching seeds or 2) costly calculations to find fewer fuzzy (i.e., approximate) seed matches. Our goal is to efficiently find fuzzy seed matches to improve the performance, memory efficiency, and accuracy of identifying sequence similarity. To this end, we introduce BLEND, a fast, memory-efficient, and accurate mechanism to find fuzzy seed matches. BLEND 1) generates hash values for seeds so that similar seeds may have the same hash value, and 2) uses these hash values to efficiently find fuzzy seed matches between sequences. Results: We show the benefits of BLEND when used in two important genomics applications: finding overlapping reads and read mapping. For finding overlapping reads, BLEND enables a 0.9×-22.4× (on average 8.6×) faster and 1.8×-6.9× (on average 5.43×) more memory-efficient implementation than the state-of-the-art tool, Minimap2. We observe that BLEND finds better quality overlaps that lead to more accurate de novo assemblies compared to Minimap2. When mapping high coverage and accurate long reads, BLEND on average provides 1.2× speedup compared to Minimap2.
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds either 1) increases the use of the costly sequence alignment or 2) limits sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds, called fuzzy seed matches, with a single lookup of their hash values. BLEND 1) utilizes a technique called SimHash, which can generate the same hash value for similar sets, and 2) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4x-83.9x (on average 19.3x), has a lower memory footprint by 0.9x-14.1x (on average 3.8x), and finds higher quality overlaps leading to more accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8x-4.1x (on average 1.7x) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
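The idea that similar seeds can receive the same hash value can be sketched with a textbook SimHash over the k-mers of a seed. This is an illustration under stated assumptions, not BLEND's code: the choice of BLAKE2b as the per-item hash, the 32-bit width, treating a seed as a list of its overlapping k-mers, and the helper names `item_hash`, `kmers`, and `simhash` are all hypothetical.

```python
import hashlib


def item_hash(item, bits=32):
    """Deterministic hash of one item, truncated to the lowest `bits` bits."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def kmers(seq, k):
    """All overlapping substrings of length k, in order (repeats are kept)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def simhash(items, bits=32):
    """Textbook SimHash: each item votes +1 or -1 on every bit position."""
    votes = [0] * bits
    for item in items:
        h = item_hash(item, bits)
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    # A bit of the result is set only where the positive votes win.
    out = 0
    for b in range(bits):
        if votes[b] > 0:
            out |= 1 << b
    return out


# Two seeds that differ only in their last base share four of five 3-mers,
# so the shared 3-mer "AAA" wins every vote and both seeds hash identically.
seed_a = simhash(kmers("AAAAAAT", 3))
seed_b = simhash(kmers("AAAAAAC", 3))
```

Because the output is decided by a per-bit majority, a small number of differing items often leaves the hash unchanged, which is what allows a single hash-table lookup to return approximate as well as exact matches. Seeds that share few items will usually differ in many bits.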