This paper presents accurate area, time, power estimation models for implementations using FPGAs from the Xilinx Virtex-2Pro family (Deng et al. 2008). These models are designed to facilitate efficient design space exploration in an automated algorithmarchitecture codesign framework. Detailed models for estimating the number of slices, block RAMs and 18 × 18-bit multipliers for fixed point and floating point IP cores have been developed. These models are also utilized to develop power models that consider the effect of logic power, signal power, clock power and I/O power. Timing models have been developed to predict the latency of the fixed point and floating point IP cores. In all cases, the model coefficients have been derived by using curve fitting or regression analysis. The modeling error is quite small for single IP cores; the error for the area estimate, for instance, is on the This paper is an extension of the ICASSP'08 paper "Accurate Models for Estimating Area and Power of FPGA Implementations". average 0.95%. The error for fairly large examples such as floating point implementation of 8-point FFTs is also quite small; it is 1.87% for estimation of number of slices and 3.48% for estimation of power consumption. The proposed models have also been integrated into a hardware-software partitioning tool to facilitate design space exploration under area and time constraints.
One of the critical problems associated with emerging chip multiprocessors (CMPs) is the management of on-chip shared cache space. Unfortunately, single processor centric data locality optimization schemes may not work well in the CMP case as data accesses from multiple cores can create conflicts in the shared cache space. The main contribution of this paper is a compiler directed code restructuring scheme for enhancing locality of shared data in CMPs. The proposed scheme targets the last level shared cache that exist in many commercial CMPs and has two components, namely, allocation, which determines the set of loop iterations assigned to each core, and scheduling, which determines the order in which the iterations assigned to a core are executed. Our scheme restructures the application code such that the different cores operate on shared data blocks at the same time, to the extent allowed by data dependencies. This helps to reduce reuse distances for the shared data and improves onchip cache performance. We evaluated our approach using the Splash-2 and Parsec applications through both simulations and experiments on two commercial multi-core machines. Our experimental evaluation indicates that the proposed data locality optimization scheme improves inter-core conflict misses in the shared cache by 67% on average when both allocation and scheduling are used. Also, the execution time improvements we achieve (29% on average) are very close to the optimal savings that could be achieved using a hypothetical scheme.
Future multicore architectures are likely to include a large number of cores connected using an on-chip network with Non-uniform Cache Access (NUCA). In such architectures, whether a data request is satisfied from a local cache or a remote cache can make an important difference. To exploit this NUCA property, prior research explored both architectural enhancements as well as compiler-based code optimization strategies. In this work, we take an alternate view, and explore data layout optimizations to improve locality of data accesses in a NUCA-based system. Our proposed approach includes three steps: array tiling, computation-to-core mapping, and layout customization. The first of these tries to identify the affinity between data and computation taking into account parallelization information, with the goal of minimizing remote accesses. The second step maps computations (and their associated data) to cores with the goal of minimizing average distance-to-data, and the last step further customizes the memory layout taking into account the data placement policy adopted by the underlying architecture. We evaluated the success of this three-step approach in enhancing on-chip cache behavior using all application programs from the SPECOMP suite on a full-system simulator. Our results show that the proposed approach improves on average data access latency and execution time by 24.7% and 18.4%, respectively, in the case of static NUCA, and 18.1% and 12.7%, respectively, in the case of dynamic NUCA.
Chip multiprocessors (CMPs) present a unique scenario for software data prefetching with subtle tradeoffs between memory bandwidth and performance. In a shared L2 based CMP, multiple cores compete for the shared on-chip cache space and limited off-chip pin bandwidth. Purely software based prefetching techniques tend to increase this contention, leading to degradation in performance. In some cases, prefetches can become harmful by kicking out useful data from the shared cache whose next usage is earlier than the prefetched data, and the fraction of such harmful prefetches usually increases when we increase the number of cores used for executing a multi-threaded application code. In this paper, we propose two complementary techniques to address the problem of harmful prefetches in the context of shared L2 based CMPs. These techniques, namely, suppressing select data prefetches (if they are found to be harmful) and pinning select data in the L2 cache (if they are found to be frequent victim of harmful prefetches), are evaluated in this paper using two embedded application codes. Our experiments demonstrate that these two techniques are very effective in mitigating the impact of harmful prefetches, and as a result, we extract significant benefits from software prefetching even with large core counts.
This paper introduces parallelization strategies for the Non-Uniform FFT (NUFFT) data translation on multicore architectures. The NUFFT enables the use of the celebrated FFT with un-equally spaced data in numerous situations in signal and image processing as well as in scientific computing. The critical extension lies at the translation of non-equally spaced or non-uniformly sampled data onto an equally spaced Cartesian grid or vice versa. The data translation can be made sufficiently accurate, with the arithmetic complexity linearly proportional to the size of the data ensemble. For large NUFFTs, however, the data translation is found substantially dominant in computation time on modern computers while it is expected to be dominated by the FFT. In order to match the FFT performance achieved by FFTW, data locality and parallelism in the data translation must be explored and exploited as well. We are concerned with two fundamental issues. First, the data translation can be described as a matrix-vector multiplication with a matrix of irregular sparsity. This is beyond the effective scope of the conventional tiling and parallelization schemes applied by a compiler for performance improvement on computation with dense matrices. Secondly, multicore processors exist and emerge in many different configurations, and are expected to evolve further in architectural variety. This may mean the end of performance tuning on a single type of architecture. In this paper, we introduce an automation tool that takes two specifications as input, one on an applicationspecific data translation algorithm, the other on a target multicore processor architecture. The tool generates a parallel code that explores the data locality and parallelism by utilizing both geometric structures in data translation and the processor-memory configurations in the target architecPermission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ture. We present preliminary experimental results on both a simulator and a commercial multicore machine. The results show that our parallelization strategy brings significant performance improvement for the NUFFT data translation by efficiently exploiting the data locality and concurrency in the application.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
334 Leonard St
Brooklyn, NY 11211
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.