Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs consist of an array of function units and register files often organized as a two dimensional grid. The most difficult challenge in deploying CGRAs is compiler scheduling technology that can efficiently map software implementations of compute intensive loops onto the array. Traditional schedulers focus on the placement of operations in time and space. With CGRAs, the challenge of placement is compounded by the need to explicitly route operands from producers to consumers. To systematically attack this problem, we take an edge-centric approach to modulo scheduling that focuses on the routing problem as its primary objective. With edge-centric modulo scheduling (EMS), placement is a by-product of the routing process, and the schedule is developed by routing each edge in the dataflow graph. Routing cost metrics provide the scheduler with a global perspective to guide selection. Experiments on a wide variety of compute-intensive loops from the multimedia domain show that EMS improves throughput by 25% over traditional iterative modulo scheduling, and achieves 98% of the throughput of simulated annealing techniques at a fraction of the compilation time.
New media and signal processing applications demand ever higher performance while operating within the tight power constraints of mobile devices. A range of hardware implementations is available to deliver computation with varying degrees of area and power efficiency, from general-purpose processors to application-specific integrated circuits (ASICs). The tradeoff of moving towards more efficient customized solutions such as ASICs is the lack of flexibility in terms of hardware reusability and programmability. In this paper, we propose a customized semi-programmable loop accelerator architecture that exploits the efficiency gains available through high levels of customization, while maintaining sufficient flexibility to execute multiple similar loops. A customized instance of the loop accelerator architecture is generated for a particular loop and then the data and control paths are proactively generalized in an efficient manner to increase flexibility. A compiler mapping phase is then able to map other loops onto the same hardware. The efficiency of the programmable accelerator is compared with non-programmable accelerators and with the OpenRISC 1200 general purpose processor. The programmable accelerator is able to achieve up to 34x better power efficiency and 30x better area efficiency than a simple general purpose processor, while trading off as little as 2x power and area efficiency to the non-programmable accelerator.
Clustered architectures are a solution to the bottleneck of centralized register files in superscalar and VLIW processors. The main challenge associated with clustered architectures is compiler support to effectively partition operations across the available resources on each cluster. In this work, we present a novel technique for clustering operations based on graph partitioning methods. Our approach incorporates new methods of assigning weights to nodes and edges within the dataflow graph to guide the partitioner. Nodes are assigned weights to reflect their resource usage within a cluster, while a slack distribution method intelligently assigns weights to edges to reflect the cost of inserting moves across clusters. A multilevel graph partitioning algorithm, which globally divides a dataflow graph into multiple parts in a hierarchical manner, uses these weights to efficiently generate estimates for the quality of partitions. We found that our algorithm was able to achieve an average of 20% improvement in DSP kernels and 5% improvement in SPECint2000 for a four-cluster architecture.
Register bypass provides additional datapaths to eliminate data hazards in processor pipelines. The difficulty with register bypass is that the cost of the bypass network is substantial and grows substantially as processor width or pipeline depth are increased. For a single application, many of the bypass paths have extremely low utilization. Thus, there is an important opportunity in the design of application-specific processors to remove a large fraction of the bypass cost while maintaining performance comparable to a processor with full bypass. To this end, we propose a systematic design customization process along with a bypass-cognizant compiler scheduler. For the former, we employ iterative design space exploration wherein successive processor designs are selected based on bypass utilization statistics combined with the availability of redundant bypass paths. Compiler scheduling for sparse bypass processors is accomplished by prioritizing function unit choices for each operation prior to scheduling using global information. Results show that for a 5-issue customized VLIW processor, 70% of the bypass cost is eliminated while sacrificing only 10% performance.
In the embedded domain, custom hardware in the form of ASICs is often used to implement critical parts of applications when performance and energy efficiency goals cannot be met with software implementations on a general purpose processor or DSP. The downsides of using ASICs include high non-recurring engineering costs, inability to accommodate changes in the application after production, and inability to reuse hardware for new applications. However, by allowing a degree of post-programmability, the hardware can retain high performance and energy efficiency while increasing flexibility and reusability. The difficulty with programmable custom hardware lies in mapping new applications onto an existing datapath that is both sparse and irregular. This paper proposes a constraint-driven modulo scheduler that maps softwarepipelineable loops onto programmable loop accelerator hardware. The scheduler is able to target accelerators with widely varying levels of datapath functional capability and connectivity, and thus, varying degrees of programmability. The paper investigates the ability of the scheduler to map new loops onto existing hardware, which depends on both the degree of programmability of the hardware as well as the similarity of the new loop to the original loop for which the hardware was designed.
To meet the conflicting goals of high-performance low-cost embedded systems, critical application loop nests are commonly executed on specialized hardware accelerators. These loop accelerators are traditionally designed in a single-function manner, wherein each loop nest is implemented as a dedicated hardware block. This paper focuses on hardware sharing across loop nests by creating multifunction loop accelerators, or accelerators capable of executing multiple algorithms. A compiler-based system for automatically synthesizing multifunction loop accelerator architectures from C code is presented. We compare the effectiveness of three architecture synthesis approaches with varying levels of complexity: sum of individual accelerators, union of individual accelerators, and joint accelerator synthesis. Experiments show that multifunction accelerators achieve substantial hardware savings over combinations of singlefunction designs. In addition, the union approach to multifunction synthesis is shown to be effective at creating low-cost hardware by exploiting hardware sharing, while remaining computationally tractable.
Clustered architectures are a solution to the bottleneck of centralized register files in superscalar and VLIW processors. The main challenge associated with clustered architectures is compiler support to effectively partition operations across the available resources on each cluster. In this work, we present a novel technique for clustering operations based on graph partitioning methods. Our approach incorporates new methods of assigning weights to nodes and edges within the dataflow graph to guide the partitioner. Nodes are assigned weights to reflect their resource usage within a cluster, while a slack distribution method intelligently assigns weights to edges to reflect the cost of inserting moves across clusters. A multilevel graph partitioning algorithm, which globally divides a dataflow graph into multiple parts in a hierarchical manner, uses these weights to efficiently generate estimates for the quality of partitions. We found that our algorithm was able to achieve an average of 20% improvement in DSP kernels and 5% improvement in SPECint2000 for a four-cluster architecture.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.