We present a framework that unifies iteration-reordering transformations such as loop interchange, loop distribution, skewing, tiling, index set splitting, and statement reordering. The framework is based on the idea that a transformation can be represented as a mapping from the original iteration space to a new iteration space, giving a uniform way to represent and reason about transformations. We also provide algorithms to test the legality of mappings and to generate optimized code for them.
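As a concrete illustration of the mapping idea (a minimal sketch, not the framework's actual notation or code generator), loop interchange on a two-deep nest can be expressed as the mapping [i, j] → [j, i]: the transformed nest visits the same iteration set, but in lexicographic order of the mapped points. All function names below are hypothetical.

```python
# Sketch: representing loop interchange as a mapping from the original
# iteration space to a new iteration space (illustrative only).

def original_iterations(n, m):
    """Original loop nest: for i in 0..n-1: for j in 0..m-1."""
    return [(i, j) for i in range(n) for j in range(m)]

def interchange(point):
    """Affine mapping [i, j] -> [j, i] representing loop interchange."""
    i, j = point
    return (j, i)

def transformed_order(n, m):
    """Visit the original iterations in lexicographic order of their
    images under the mapping -- the order the interchanged loop nest
    would execute them."""
    return sorted(original_iterations(n, m), key=interchange)

print(transformed_order(2, 3))
# -> [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
```

A real code generator would of course emit a new loop nest rather than sort points, but the visit order is the same, which is what legality testing must preserve.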
There has been a great amount of recent work toward unifying iteration-reordering transformations. Many of these approaches represent transformations as affine mappings from the original iteration space to a new iteration space. These approaches show a great deal of promise, but they all rely on the ability to generate code that iterates over the points in these new iteration spaces in the appropriate order. This problem has been fairly well studied in the case where all statements use the same mapping. We have developed an algorithm for the less well-studied case where each statement uses a potentially different mapping. Unlike many other approaches, our algorithm can also generate code from mappings corresponding to loop blocking. We address the important trade-off between reducing control overhead and duplicating code.
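The per-statement-mapping problem can be made concrete with a small sketch (illustrative only; the mappings and names below are hypothetical, not taken from the paper). Two statements are given different affine mappings, and correct code must interleave their instances in lexicographic order of the mapped points:

```python
# Sketch: executing two statements whose iterations use *different*
# mappings, by visiting the union of their images in lexicographic
# order. Real code generators emit loop nests rather than sorting,
# but the resulting execution order is the same.

def codegen_order(n):
    # Hypothetical mappings: statement S1 at iteration i maps to
    # [2*i]; statement S2 at iteration i maps to [2*i + 1].
    events = [((2 * i,), "S1", i) for i in range(n)]
    events += [((2 * i + 1,), "S2", i) for i in range(n)]
    events.sort()  # lexicographic order of the new iteration space
    return [(stmt, i) for _, stmt, i in events]

print(codegen_order(2))
# -> [('S1', 0), ('S2', 0), ('S1', 1), ('S2', 1)]
```

Here the two mappings interleave the statements perfectly, which a single shared mapping could not express; generating loop code for such interleavings (without sorting at run time) is the problem the paper addresses.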
Integer tuple relations can concisely summarize many types of information gathered from analysis of scientific codes. For example, they can be used to describe precisely which iterations of a statement are data dependent on which other iterations. It is generally not possible to represent these tuple relations by enumerating the related pairs of tuples. For example, it is impossible to enumerate the related pairs of tuples in the relation { [i] → [i+2] | 1 ≤ i ≤ n−2 }. Even when it is possible to enumerate the related pairs of tuples, such as for the relation { [i,j] → [i′,j′] | 1 ≤ i,j,i′,j′ ≤ 100 }, it is often not practical to do so. We instead use a closed-form description, specifying a predicate consisting of affine constraints on the related pairs of tuples. As we just saw, these affine constraints can be parameterized, so what we are really describing are infinite families of relations (or graphs). Many of our applications of tuple relations rely heavily on an operation called transitive closure. Computing the transitive closure of these "infinite graphs" is very different from the traditional problem of computing the transitive closure of a graph whose edges can be enumerated. For example, the transitive closure of the first relation above is the relation { [i] → [i′] | ∃β s.t. i′−i = 2β ∧ 1 ≤ i ≤ i′ ≤ n }. As we will prove, transitive closure is not computable in the general case. We have developed algorithms that produce exact results in most commonly occurring cases and produce upper or lower bounds (as necessary) in the other cases. This paper describes our algorithms for computing transitive closure and some of its applications, such as determining which inter-processor synchronizations are redundant.
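The closed-form closure above can be sanity-checked against brute force for a concrete value of n. The sketch below (illustrative only; the paper's algorithms work symbolically on the affine constraints, not on enumerated edges) enumerates R = { [i] → [i+2] | 1 ≤ i ≤ n−2 } for a fixed n, computes its classical transitive closure, and compares it with the closed form:

```python
# Sketch: checking the closed-form transitive closure of
# R = { [i] -> [i+2] | 1 <= i <= n-2 } by brute force for a small n.

def step_pairs(n):
    """Enumerate R for a concrete n."""
    return {(i, i + 2) for i in range(1, n - 1)}

def transitive_closure(pairs):
    """Classical closure of an enumerable relation (iterate to a
    fixed point; fine for small edge sets)."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def closed_form(n):
    """Closed form: { [i] -> [i'] | exists beta >= 1 s.t.
    i' - i = 2*beta, with 1 <= i < i' <= n }."""
    return {(i, ip) for i in range(1, n + 1)
                    for ip in range(i + 2, n + 1, 2)}

assert transitive_closure(step_pairs(10)) == closed_form(10)
```

The point of the paper is that this check is the best an enumeration-based method can do: for symbolic n the graph is infinite, so the closure must be computed directly on the constraint representation.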
Abstract—MPSoCs with hierarchical communication infrastructures are promising architectures for low-power embedded systems. Multiple CPU clusters are coupled using a Network-on-Chip (NoC). Our CoreVA-MPSoC targets streaming applications in embedded systems, like signal and video processing. In this work we introduce a tightly coupled shared data memory in each CPU cluster, which can be accessed by all CPUs of a cluster and by the NoC with low latency. The main focus is the comparison of different memory architectures and their connection to the NoC. We analyze memory architectures with local data memory only, shared data memory only, and a hybrid architecture integrating both. Implementation results are presented for a 28 nm FD-SOI standard-cell technology. A CPU cluster with shared memory shows similar area requirements compared to the local-memory architecture. We use post-place-and-route simulations for a precise analysis of energy consumption on both cluster and NoC level for the different memory architectures. An architecture with shared data memory shows the best performance results in combination with a high resource efficiency. On average, the use of shared memory shows a 17.2% higher throughput for a benchmark suite of 10 applications compared to the use of local memory only.

Hand in hand with the communication infrastructure goes the on-chip memory architecture, which also has a huge impact on performance and energy efficiency. The main focus of this paper is the comparison of different memory architectures and their interaction with the NoC for many-core systems. Compared to traditional processor systems, many many-core architectures feature a different memory management, which changes the requirements on memory and NoC infrastructure. Traditional processor systems use a memory hierarchy with several (private and shared) on-chip caches, external DRAM, and a unified address space. This allows for easy programming, but results in unpredictable memory access times.
Additionally, the cache logic and the coherence handling require a large amount of chip area and power. Therefore, many many-core systems omit data caches and use software-managed scratchpad memories instead, which provide a resource-efficient alternative [1]. For performance reasons, the scratchpad memories are tightly attached to each CPU, and communication between CPUs is initiated by software. In [2] we showed that area and power consumption of a single CoreVA CPU's data memory increase by 10% when using a cache instead of a scratchpad memory. Due to cache coherence issues it can be expected that these values will increase further for a cache-based many-core system. Additionally, software-managed scratchpad memories give full control of data communication to the programmer or an automatic partitioning tool (cf. Section III-E) and allow for a more accurate performance estimation.

The many-core architecture considered in this work is our CoreVA-MPSoC, which targets streaming applications in embedded and energy-limited systems. Examples for streaming applications are signal pr...
Energy-efficient embedded computing enables new application scenarios in mobile devices like software-defined radio and video processing. The hierarchical multiprocessor considered in this work may contain dozens or hundreds of resource-efficient VLIW CPUs. Programming this number of CPU cores is a complex task requiring compiler support. The stream programming paradigm provides beneficial properties that help to support automatic partitioning. This work describes a compiler for streaming applications targeting the self-built hierarchical CoreVA-MPSoC multiprocessor platform. The compiler is supported by a programming model that is tailored to fit the stream programming paradigm. We present a novel simulated-annealing (SA) based partitioning algorithm, called Smart SA. The overall speedup of Smart SA is 12.84 for an MPSoC with 16 CPU cores compared to a single-CPU implementation. Comparison with a state-of-the-art partitioning algorithm shows an average performance improvement of 34.07%.

I. INTRODUCTION

The decreasing feature size of microelectronic circuits allows for the integration of more and more processing cores on a single chip. A Multiprocessor System-on-Chip (MPSoC) may consist of dozens of processing elements such as CPU cores or specialized hardware accelerators connected by a high-speed communication infrastructure, i.e., a Network-on-Chip (NoC). However, mapping general-purpose applications to a large number of MPSoC processing elements remains a nontrivial task. Manually writing low-level code for each core makes it difficult to experiment with different decompositions and mappings of computation to processors. Alternatively, higher-level programming frameworks allow the compiler to evaluate a larger design space when mapping the application to different hardware configurations. Efficient mapping algorithms are important for finding optimized solutions.
The streaming paradigm provides regular and repeating computation and independent filters with explicit communication. This allows compilers to more easily exploit the task, data, and pipeline parallelism commonly found in signal processing, multimedia, network processing, cryptology, and similar application domains. A popular stream-based programming language is StreamIt [1], [2]. The key principle of this language is to provide information about the inherent parallelism of the program by using a structured data flow graph. This graph consists of filters, pipelines, split-joins, and feedback loops.

In this paper we present a compiler for the StreamIt language targeting the self-built CoreVA-MPSoC architecture. The CoreVA-MPSoC is a highly scalable multiprocessor system based on a hierarchical communication infrastructure and the configurable VLIW processor CoreVA. This paper is organized as follows: Section II describes our CoreVA-MPSoC hardware architecture. In Section III we discuss our StreamIt compiler with a focus on our novel simulated-annealing partitioning algorithm (Smart SA). The communication model proposed in this work is presented in S...
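The simulated-annealing partitioning idea can be sketched generically. The following is a minimal illustration, not the paper's Smart SA: it assumes hypothetical per-filter workloads and a simple makespan cost (load of the busiest CPU), whereas the actual algorithm evaluates candidates with the CoreVA-MPSoC communication model.

```python
import math
import random

# Sketch: simulated-annealing assignment of stream filters to CPUs,
# minimizing the load of the busiest CPU (illustrative only).

def makespan(assignment, work):
    """Cost: load of the most heavily loaded CPU (lower is better)."""
    loads = {}
    for filt, cpu in enumerate(assignment):
        loads[cpu] = loads.get(cpu, 0) + work[filt]
    return max(loads.values())

def anneal(work, num_cpus, steps=20000, t0=10.0, seed=0):
    rng = random.Random(seed)
    assignment = [rng.randrange(num_cpus) for _ in work]
    cost = makespan(assignment, work)
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9  # linear cooling
        filt = rng.randrange(len(work))
        old = assignment[filt]
        assignment[filt] = rng.randrange(num_cpus)  # random move
        new_cost = makespan(assignment, work)
        # Always accept improvements; accept regressions with a
        # probability that shrinks as the temperature falls.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
        else:
            assignment[filt] = old  # revert the move
    return assignment, cost

work = [5, 3, 8, 2, 7, 4, 6, 1]   # hypothetical per-filter workloads
assignment, cost = anneal(work, num_cpus=4)
print(cost)  # lower bound for this instance: ceil(sum(work)/4) = 9
```

The acceptance rule is what distinguishes annealing from greedy local search: early on, cost-increasing moves are accepted often enough to escape poor initial assignments.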