Abstract-Fast bottleneck detection and elimination is an important component of any design flow that aims at producing high-throughput systems. Bottlenecks can be difficult to find and correct, because their causes are diverse and often subtle. In this paper, we build on our recent method for performance analysis to develop a method for bottleneck identification and alleviation for pipelined asynchronous systems.More specifically, this paper makes two contributions. First, we introduce a method that, given a throughput goal, identifies which parts of the pipelined system constrain its throughput. Each such bottleneck is categorized based on the type of structural transformation that could potentially alleviate it: increase degree of pipelining (stage splitting, stage duplication, and loop unrolling); decrease forward latency (stage merging and parallelization); and perform slack matching. The second contribution is a method that guides the user to systematically apply these modifications to alleviate the bottlenecks and reach a target throughput goal.We have validated the bottleneck analysis method on several examples and were able to attain the desired throughput goal in each case through iterative application of our bottleneck alleviation method. Runtimes were negligible in all cases (less than 50 ms).
We present a technique for increasing the throughput of stream processing architectures by removing the bottlenecks caused by loop structures. We implement loops as self-timed pipelined rings that can operate on multiple data sets concurrently. Our contribution includes a transformation algorithm which takes as input a high-level program and gives as output the structure of an optimized pipeline ring. Our technique handles nested loops and is further enhanced by loop unrolling. Simulations run on benchmark examples show a 1.3 to 4.9x speedup without unrolling and a 2.6 to 9.7x speedup with twofold loop unrolling.
Abstract-This paper presents a systematic approach for microarchitectural exploration in pipelined asynchronous systems, with the goal of achieving a specified throughput target while minimizing a given cost function (based on energy, area, etc.). The method includes a general framework that (i) allows for a rich extensible set of microarchitectural transformations for improving throughput; and (ii) can handle a variety of cost functions, such as area, energy, Eτ 2 and the energy-area product. In general, the space of transformations that can be applied to a given circuit is potentially infinite because an arbitrarily long sequence of transformations may be applicable. To compound the challenge, the value of the given cost function can change non-monotonically as successive transformations are applied (e.g., some transformations increase area, while others decrease area), thereby making it difficult to apply a typical branch-and-bound approach to prune the search space. Our method employs simple but effective heuristic search strategies (including greedy, lookahead, and breadth-first). A key contribution is to identify commutativity of certain transformations, thereby pruning the design space significantly. The approach was automated and applied to a number of examples. Various throughput targets were assumed: from 50% to 20x throughput improvement. In each example, the approach was successful in meeting the throughput target.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.