Squeezing more CPU performance out of a Cray-2 by vector block scheduling

Eisenbeis, Jalby, Lichnewsky

doi:10.1109/superc.1988.44659

Cited by 6 publications

(4 citation statements)

References 5 publications


“…If CRQ is less than or equal to 8, the register allocation procedure is directly applied. Otherwise, data spilling and delay insertion is performed (generating a new functional unit scheduling, FU-sched*), in order to reduce CRQ down to 8 (cf. [3]). Then the register allocation is applied.…”

Section: Spilling and Register Allocation (mentioning)

confidence: 99%

“…Conversely VAS72 has scheduled the operations corresponding to the same triple in nonconsecutive macrocycles, resulting in a better utilization of the memory bandwidth (cf. Tables 2 and 4). For VAS72, when no spilling is required, the model is very close to the apparent chime because the critical path of code produced by VAS72 is mostly comprised of operations all of which last 72 cycles. As described elsewhere [3], the spilling procedure we used introduces macrocycles during which no operations are scheduled: they correspond to waiting for the liberation of result registers. In such cases, the model is considerably inaccurate because such macrocycles are accounted for a full macrocycle, while in reality, they last only about 10 cycles (82 - 72).…”

Section: Codes (mentioning)

confidence: 99%

“…• a register allocation procedure that attempts to make efficient use of the limited size of the register set. This overall approach is described in our previous paper [3], and we will focus here on the choice of the model used to present an abstract and simplified description of the machine.…”

Section: Introduction (mentioning)

confidence: 99%

“…Abstract: In a previous work [3], a cyclic scheduling method was shown efficient to generate vector code for the Cray-2 architecture, and compared to existing compilers. This method was using the framework of microcode compaction through a simplified model of the Cray-2 vector instruction stream.…”

mentioning

confidence: 99%


Modeling the Memory of the Cray2 for Compile Time Optimization

1990

Self Cite


In a previous work [3], a cyclic scheduling method was shown efficient to generate vector code for the Cray-2 architecture, and compared to existing compilers. This method was using the framework of microcode compaction through a simplified model of the Cray-2 vector instruction stream. In this paper, we further elaborate on how to model the machine architecture within the underlying cyclic scheduling method. The impact of the choice of the model on the code generated is analyzed and performance results are presented.


Minimizing Register Requirements of a Modulo Schedule via Optimum Stage Scheduling

Eichenberger, Davidson, Abraham

1996

Int J Parallel Prog


Approaching a machine-application bound in delivered performance on scientific code

Mangione-Smith, Shih, Abraham, et al.

1993

Proc. IEEE


We have developed a performance bounding methodology that explains the performance of loop-dominated scientific applications on particular systems. We model the throughput of key hardware units that are common bottlenecks in concurrent machines. The four units currently used are: memory interface, floating-point, instruction issue, and a “dependence unit” which is used to model the effects of performance-limiting recurrences. We propose a workload characterization, and derive upper bounds on the performance of specific machine-workload pairs. Comparing delivered performance with bounds focuses attention on areas for improvement and indicates how much improvement might be attainable.

A detailed analysis and performance improvement effort for the IBM RS/6000, using the Livermore Fortran Kernels 1-12 to represent the target workload, produces a lower bound of average 1.27 clocks per floating-point operation (CPF), whereas machine peak performance is 0.5 CPF and the V2.01 Fortran compiler attains only 2.43 CPF. Code improvements in this study have achieved 1.36 CPF, increasing the harmonic mean steady-state inner loop performance to 97.6% of the MFLOPS bound. Subsequently the V2.02 compiler achieved 1.75 CPF, and 1.60 with carefully chosen preprocessing. A goal-directed compiler with bound knowledge could produce higher performance code more efficiently and automatically.

In general, achieved performance is also affected by cache misses and register spill code. Simple calibration loops are used to characterize cache performance. The register requirements are characterized as a function of the latency and bandwidth of memory and function units for application kernels that have tree structured dependence graphs.

The authors would like to acknowledge the support of the Hewlett Packard Corporation and the assistance of many people at IBM.
