Loop tiling and unrolling are two important program transformations for exploiting locality and exposing instruction-level parallelism, respectively. In this paper, we address the problem of selecting tile sizes and unroll factors simultaneously. We approach this problem in an architecturally adaptive manner by means of iterative compilation, where we generate many versions of a program and decide upon the best by actually executing them and measuring their execution times. We evaluate several iterative strategies. We compare the levels of optimization obtained by iterative compilation to several well-known static techniques and show that we outperform each of them on a range of benchmarks across a variety of architectures. Finally, we show how to quantitatively trade off the number of profiles needed against the level of optimization that can be reached.
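The iterative-compilation search described above can be sketched as a simple profile-and-pick loop. This is an illustrative sketch only: the candidate grid, the function names, and the synthetic cost surface standing in for real compile-and-execute timing are all assumptions, not the paper's actual implementation.

```python
import itertools

def iterative_search(candidates, run_version):
    """Profile every candidate (tile_size, unroll_factor) pair and
    return the pair with the lowest measured execution time."""
    best, best_time = None, float("inf")
    for params in candidates:
        t = run_version(params)  # execute this program version, record its runtime
        if t < best_time:
            best, best_time = params, t
    return best, best_time

# Hypothetical stand-in for "compile and execute one version": a
# synthetic cost surface replaces real wall-clock measurement here.
def synthetic_runtime(params):
    tile, unroll = params
    return abs(tile - 64) / 64 + abs(unroll - 4) / 4  # minimum at (64, 4)

candidates = list(itertools.product([16, 32, 64, 128], [1, 2, 4, 8]))
best, best_time = iterative_search(candidates, synthetic_runtime)
```

In a real setting `run_version` would invoke the compiler with the chosen tile size and unroll factor, run the resulting binary, and return the measured wall-clock time; the exhaustive loop is what makes the approach architecture-adaptive but also expensive.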
SUMMARY

Iterative compilation, where we search for the best program transformation based on profile feedback, has been highly successful in determining good program optimizations, typically outperforming all static approaches. However, this benefit has come at a price, namely a large number of executions of the target program, which in many cases is unacceptable. This paper is concerned with reducing the number of program executions needed by iterative compilation by incorporating static models. We examine how static models may be incorporated into a purely iterative approach by developing several characterized heuristics. We empirically explore the trade-off between reducing the number of executions required and the ultimate performance achieved, for differing parameters or slack factors. We focus on tiling and unrolling as transformations and on cache models as a test case for the feasibility of this approach. We show that a highly accurate model, used as a filter before profiling and appropriately parameterized, can reduce the number of executions by 50%. We also show that using a simple model to rank transformations and profiling only those with the highest ranking can reduce the number of executions even further, by up to 70%, in cases where only a limited number of profiles is available. We conclude that a production compiler might perform best using the latter approach, gaining the benefit of iterative compilation at a much reduced cost.
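The rank-then-profile heuristic described above can be sketched as follows. The toy model, the toy profiler, the candidate tile sizes, and all function names are hypothetical assumptions chosen to illustrate the idea: a cheap static model shortlists candidates, and only the shortlist is actually executed.

```python
def model_filtered_search(candidates, model_cost, profile, budget):
    """Rank all candidates by a cheap static model, profile only the
    `budget` best-ranked ones, and return the fastest of those."""
    shortlist = sorted(candidates, key=model_cost)[:budget]
    return min(shortlist, key=profile)

# Hypothetical stand-ins: the static model mis-ranks slightly, while
# the profiler reports the "true" runtime (synthetic, not wall-clock).
def toy_model(tile):
    return abs(tile - 48)   # model believes tile size 48 is best

def toy_profile(tile):
    return abs(tile - 64)   # truth: tile size 64 is best

tiles = [8, 16, 32, 48, 64, 96, 128]
best = model_filtered_search(tiles, toy_model, toy_profile, budget=3)
```

Here only 3 of the 7 candidates are ever profiled, yet the true optimum still survives the shortlist; the `budget` parameter plays the role of the paper's profile budget, trading executions against the risk of the model filtering out the best version.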
Shared cache structures and snoop cache structures for single-chip multiprocessors are evaluated and compared using an instruction-level simulator. Simulation results show that a 1-port large shared cache achieves the best performance if there is no delay penalty for arbitration and bus access. However, if a 1-clock delay is assumed for accessing the shared cache, a snoop cache with an internal wide bus and the invalidate-style NewKeio protocol outperforms the shared caches.