Many studies point to the difficulty of scaling existing computer architectures to meet the needs of an exascale system (i.e., one capable of executing 10^18 floating-point operations per second while consuming no more than 20 MW of power) by around the year 2020. This paper outlines a new architecture, the Active Memory Cube, which significantly reduces the energy of computation by performing computation in the memory module rather than moving data through large memory hierarchies to the processor core. The architecture leverages a commercially demonstrated 3D memory stack called the Hybrid Memory Cube, placing sophisticated computational elements on the logic layer below its stack of dynamic random-access memory (DRAM) dies. The paper also describes an Active Memory Cube tuned to the requirements of a scientific exascale system. The computational elements have a vector architecture and are capable of performing a comprehensive set of floating-point and integer instructions, predicated operations, and gather-scatter accesses across memory in the Cube. The paper outlines the software infrastructure used to develop applications and to evaluate the architecture, and describes results of experiments on application kernels, along with performance and power projections.
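To make the in-memory vector model concrete, the following C sketch shows the kind of predicated gather-scatter loop such computational elements are intended to execute close to DRAM. It is an illustration only, not AMC code, and the function and array names are hypothetical.

#include <stddef.h>

/* Illustrative sketch (not AMC code): a predicated gather-scatter update.
 * An in-memory vector processor would execute the gather (x[idx[i]]),
 * the predicated multiply-add, and the scatter back to memory as wide
 * vector operations close to DRAM, instead of streaming the data through
 * a deep cache hierarchy to a host core.
 */
void gather_scatter_axpy(double *y, const double *x, const size_t *idx,
                         const unsigned char *pred, double a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (pred[i]) {                        /* predicated operation     */
            double xi = x[idx[i]];            /* gather: indexed load     */
            y[idx[i]] = a * xi + y[idx[i]];   /* compute and scatter back */
        }
    }
}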
Biosequence similarity search is an important application in modern molecular biology. Search algorithms aim to identify sets of sequences whose extensional similarity suggests a common evolutionary origin or function. The most widely used similarity search tool for biosequences is BLAST, a program designed to compare query sequences to a database. Here, we present the design of BLASTN, the version of BLAST that searches DNA sequences, on the Mercury system, an architecture that supports high-volume, high-throughput data movement off a data store and into reconfigurable hardware. An important component of application deployment on the Mercury system is the functional decomposition of the application onto both the reconfigurable hardware and the traditional processor. Both the Mercury BLASTN application design and its performance analysis are described.

1: Introduction

Computational search through large databases of DNA and protein sequence is a fundamental tool of modern molecular biology. Rapid advances in the speed and cost-effectiveness of DNA sequencing have led to an explosion in the rate at which new sequences, including entire mammalian genomes [35], are being generated. To understand the function and evolutionary history of an organism, biologists now seek to identify discrete, biologically meaningful features in its genome sequence. A powerful approach to identifying such features is comparative annotation, in which a query sequence, such as a new genome, is compared to a large database of known biosequences. Database sequences exhibiting high similarity to the query, as measured by string edit distance [31], are hypothesized to derive from the same ancestral sequence as the query and, in many cases, to have the same biological function.

BLAST, the Basic Local Alignment Search Tool [1], is the most widely used software for rapidly comparing a query sequence to a biosequence database. Although BLAST's algorithms are highly optimized for efficient similarity search, growth in the databases it uses is outpacing speed improvements in general-purpose computing hardware. For example, the National Center for Biotechnology Information (NCBI) GenBank database grew exponentially between 1992 and 2003, with a doubling time of 12-16 months [24]. The problem is particularly acute for BLASTN, the BLAST variant used to compare DNA sequences, because each new genome sequenced from animals or higher plants produces between 10^8 and 10^10 bytes of new DNA sequence.

One response to runaway growth in biosequence databases has been to distribute BLAST searches across multiple computers, each responsible for searching only part of a database. This approach requires both a substantial hardware investment and the ability to coordinate a search across processors. An alternate approach that makes more parsimonious use of hardware is to build a specialized BLAST accelerator. By using an applic...
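As background for the similarity measure referred to above, the sketch below gives the classical dynamic-programming recurrence for string edit distance in plain C. It is a generic textbook illustration for exposition only, not BLAST's seeded heuristic or the Mercury hardware pipeline, which avoid computing the full O(m*n) table over an entire database; the function name is chosen here for illustration.

#include <stdlib.h>
#include <string.h>

/* Illustrative sketch: string edit distance (Levenshtein distance) by
 * row-by-row dynamic programming, using two rolling rows of the table.
 */
size_t edit_distance(const char *a, const char *b)
{
    size_t m = strlen(a), n = strlen(b);
    size_t *prev = malloc((n + 1) * sizeof *prev);
    size_t *curr = malloc((n + 1) * sizeof *curr);
    size_t result = (size_t)-1;            /* returned on allocation failure */

    if (!prev || !curr)
        goto done;

    /* Transforming the empty prefix of a into b[0..j) costs j insertions. */
    for (size_t j = 0; j <= n; j++)
        prev[j] = j;

    for (size_t i = 1; i <= m; i++) {
        curr[0] = i;                       /* i deletions                    */
        for (size_t j = 1; j <= n; j++) {
            size_t sub = prev[j - 1] + (a[i - 1] != b[j - 1]); /* match/substitute */
            size_t del = prev[j] + 1;      /* delete a[i-1]                  */
            size_t ins = curr[j - 1] + 1;  /* insert b[j-1]                  */
            size_t best = sub < del ? sub : del;
            curr[j] = best < ins ? best : ins;
        }
        /* Roll the rows: the current row becomes "previous" for the next i. */
        size_t *tmp = prev; prev = curr; curr = tmp;
    }
    result = prev[n];

done:
    free(prev);
    free(curr);
    return result;
}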
The Active Memory Cube (AMC) system is a novel heterogeneous computing system concept designed to provide high performance and power efficiency across a range of applications. The AMC architecture includes general-purpose host processors and specially designed in-memory processors (processing lanes) that would be integrated in a logic layer within 3D DRAM memory. The processing lanes have large vector register files but no power-hungry caches or local memory buffers. Performance therefore depends on how well the resulting higher effective memory latency within the AMC can be managed. In this paper, we describe a combination of programming language features, compiler techniques, operating system interfaces, and hardware design that can effectively hide memory latency for the processing lanes in an AMC system. We present experimental data showing how this approach improves the performance of a set of representative benchmarks important in high-performance computing applications. As a result, we are able to achieve high performance together with power efficiency using the AMC architecture.
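As a concrete illustration of one latency-hiding technique of the kind described above, the C sketch below software-pipelines a simple loop so that loads for the next iteration are issued before the current iteration's computation. It is a generic scalar illustration of the idea, not AMC lane code or the paper's compiler output, and all names are hypothetical.

#include <stddef.h>

/* Illustrative sketch: hiding memory latency by software pipelining.
 * Loads for iteration i+1 are issued before the computation of iteration
 * i, so long-latency loads overlap useful work; a compiler targeting wide
 * vector registers would apply similar scheduling at vector granularity.
 */
void scaled_sum_pipelined(double *out, const double *a, const double *b,
                          double alpha, size_t n)
{
    if (n == 0)
        return;

    /* Prologue: issue the first loads. */
    double a_cur = a[0];
    double b_cur = b[0];

    for (size_t i = 0; i + 1 < n; i++) {
        /* Issue the next iteration's loads early (latency hiding). */
        double a_next = a[i + 1];
        double b_next = b[i + 1];

        /* Compute with the values loaded in the previous iteration. */
        out[i] = alpha * a_cur + b_cur;

        a_cur = a_next;
        b_cur = b_next;
    }

    /* Epilogue: final iteration uses the last pair of loaded values. */
    out[n - 1] = alpha * a_cur + b_cur;
}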
As the amount of scientific and social data continues to grow, researchers in a multitude of domains face challenges associated with storing, indexing, retrieving, assimilating, and synthesizing raw data into actionable information. Combining techniques from computer science, statistics, and applied math, data-intensive computing involves developing and optimizing algorithms and systems that interact closely with large volumes of data.

Scientific applications that read and write large data sets often perform poorly and don't scale well on present-day computing systems. Many data-intensive applications are data-path-oriented, making little use of branch prediction and speculation hardware in the CPU. These applications are well suited to streaming data access and can't effectively use the sophisticated on-chip cache hierarchy. Their ability to process large data sets is hampered by orders-of-magnitude mismatches between disk, memory, and CPU bandwidths.

Emerging technologies can improve data-intensive algorithms' performance, at reasonable cost in development time, by an order of magnitude over the state of the art. Coprocessors such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) can significantly speed up some application classes in which data-path-oriented computing is dominant. Additionally, these coprocessors interact with application-controlled on-chip memory rather than a traditional cache.

To alleviate the 10-to-100 factor mismatch in bandwidth between disk and memory, we investigated an I/O system built from a large, parallel array of solid-state storage devices. While containing the same NAND flash chips as USB drives, such I/O arrays achieve significantly higher bandwidth and lower latency than USB drives through parallel access to an array of devices.

To quantify these technologies' merits, we've created a small collection of data-intensive benchmarks selected from applications in data analysis and science. These benchmarks draw from three data types: scientific imagery, unstructured text, and semantic graphs representing networks of relationships. Our results demonstrate that augmenting commodity processors to exploit these technologies can improve performance 2 to 17 times.

COPROCESSORS

Coprocessors designed for data-oriented computing can deliver orders-of-magnitude better performance than general-purpose microprocessors on data-path-centric compute kernels. We evaluated the benefits of two coprocessor architectures: graphics processors and reconfigurable hardware.

Data-intensive problems challenge conventional computing architectures with demanding CPU, memory, and I/O requirements. Experiments with three benchmarks suggest that emerging hardware technologies can significantly boost performance of a wide range of applications by increasing compute cycles and bandwidth and reducing latency.
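To illustrate what a data-path-oriented streaming kernel looks like, the C sketch below reads a large data set in fixed-size blocks and touches each byte exactly once. It is a simplified illustration, not one of the benchmarks described above, and the block size and function name are hypothetical tuning choices.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch: a streaming, data-path-oriented kernel.  It reads
 * a large file in fixed-size blocks and applies a simple per-byte
 * operation (a histogram), so throughput is bounded by storage and
 * memory bandwidth rather than by cache reuse or branch prediction.
 */
#define BLOCK_SIZE (4u * 1024u * 1024u)   /* 4 MiB read granularity */

int stream_histogram(const char *path, unsigned long long hist[256])
{
    FILE *f = fopen(path, "rb");
    unsigned char *buf;
    size_t got;

    if (!f)
        return -1;
    buf = malloc(BLOCK_SIZE);
    if (!buf) {
        fclose(f);
        return -1;
    }
    for (int i = 0; i < 256; i++)
        hist[i] = 0;

    /* Stream the data set block by block; each byte is touched once. */
    while ((got = fread(buf, 1, BLOCK_SIZE, f)) > 0) {
        for (size_t i = 0; i < got; i++)
            hist[buf[i]]++;
    }

    free(buf);
    fclose(f);
    return 0;
}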