We have developed a performance bounding methodology that explains the performance of loop-dominated scientific applications on particular systems. We model the throughput of key hardware units that are common bottlenecks in concurrent machines. The four units currently used are memory interface, floating-point, instruction issue, and a "dependence unit", which is used to model the effects of performance-limiting recurrences. We propose a workload characterization and derive upper bounds on the performance of specific machine-workload pairs. Comparing delivered performance with bounds focuses attention on areas for improvement and indicates how much improvement might be attainable.

A detailed analysis and performance improvement effort for the IBM RS/6000, using the Livermore Fortran Kernels 1-12 to represent the target workload, produces an average lower bound of 1.27 clocks per floating-point operation (CPF), whereas machine peak performance is 0.5 CPF and the V2.01 Fortran compiler attains only 2.43 CPF. Code improvements in this study have achieved 1.36 CPF, increasing the harmonic mean steady-state inner loop performance to 97.6% of the MFLOPS bound. Subsequently the V2.02 compiler achieved 1.75 CPF, and 1.60 CPF with carefully chosen preprocessing. A goal-directed compiler with bound knowledge could produce higher performance code more efficiently and automatically.

In general, achieved performance is also affected by cache misses and register spill code. Simple calibration loops are used to characterize cache performance. The register requirements are characterized as a function of the latency and bandwidth of memory and function units for application kernels that have tree-structured dependence graphs.

1 The authors would like to acknowledge the support of the Hewlett Packard Corporation and the assistance of many people at IBM.
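
The bounding idea summarized above can be illustrated with a minimal sketch. It assumes that each of the four modeled units needs some minimum number of clocks per loop iteration and that the slowest unit sets the bound; this is an illustrative reading, not the authors' actual model. The class names, field names, and the memory and issue rates are hypothetical. Only the peak floating-point rate of 2 operations per clock (0.5 CPF) is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class LoopWorkload:
    """Per-iteration demands of an inner loop (hypothetical characterization)."""
    flops: int                # floating-point operations per iteration
    mem_ops: int              # loads plus stores per iteration
    instructions: int         # instructions issued per iteration
    recurrence_clocks: float  # clocks per iteration forced by loop-carried dependences


@dataclass
class Machine:
    """Peak throughput of each modeled unit, in operations per clock."""
    flops_per_clock: float
    mem_ops_per_clock: float
    issue_per_clock: float


def unit_clocks(w: LoopWorkload, m: Machine) -> dict:
    """Minimum clocks per iteration demanded by each of the four units."""
    return {
        "memory": w.mem_ops / m.mem_ops_per_clock,
        "floating-point": w.flops / m.flops_per_clock,
        "issue": w.instructions / m.issue_per_clock,
        "dependence": w.recurrence_clocks,
    }


def bottleneck(w: LoopWorkload, m: Machine) -> str:
    """Name of the unit that needs the most clocks per iteration."""
    clocks = unit_clocks(w, m)
    return max(clocks, key=clocks.get)


def cpf_bound(w: LoopWorkload, m: Machine) -> float:
    """Lower bound on clocks per floating-point operation for this pair."""
    return max(unit_clocks(w, m).values()) / w.flops


# Assumed machine: 2 flops/clock matches the 0.5 CPF peak quoted in the text;
# the memory and issue rates are placeholders chosen for illustration.
example_machine = Machine(flops_per_clock=2.0, mem_ops_per_clock=1.0, issue_per_clock=4.0)
# Assumed loop: 2 flops, 3 memory operations, 6 instructions, no recurrence.
example_loop = LoopWorkload(flops=2, mem_ops=3, instructions=6, recurrence_clocks=0.0)

print(bottleneck(example_loop, example_machine))  # memory
print(cpf_bound(example_loop, example_machine))   # 1.5
```

In this sketch the memory interface needs 3 clocks per iteration, more than any other unit, so the loop cannot run faster than 1.5 CPF even though the machine peak is 0.5 CPF. Comparing a measured CPF against such a bound indicates which unit to relieve and how much headroom remains.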