1989
DOI: 10.1145/68182.68207

Available instruction-level parallelism for superscalar and superpipelined machines

Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the pr…
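The rough equivalence claimed in the abstract can be illustrated with a back-of-the-envelope throughput model. This is a sketch under simplifying assumptions (no stalls, no dependences), not the paper's simulation system; the function names and the base cycle time are hypothetical.

```python
import math

def superscalar_time(instructions: int, issue_width: int, cycle_time: float) -> float:
    """Time to issue all instructions on a superscalar machine
    that issues `issue_width` instructions per cycle (no stalls)."""
    cycles = math.ceil(instructions / issue_width)
    return cycles * cycle_time

def superpipelined_time(instructions: int, degree: int, cycle_time: float) -> float:
    """Time to issue all instructions on a superpipelined machine
    issuing one instruction per cycle, with cycle time shortened
    by the superpipelining degree."""
    return instructions * (cycle_time / degree)

base_cycle = 10.0  # ns, hypothetical base machine cycle time
n = 4              # issue width / superpipelining degree
print(superscalar_time(1000, n, base_cycle))     # 2500.0
print(superpipelined_time(1000, n, base_cycle))  # 2500.0
```

With ideal parallelism, issuing n instructions per long cycle and one instruction per cycle that is n times shorter yield the same total issue time, which is the sense in which the two techniques are "roughly equivalent."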

Cited by 117 publications (13 citation statements)
References 13 publications
“…To maximize the overall performance, the respective code has to be scheduled in a way to take maximum advantage of the pipelines provided by the architecture [40], [47]. Instruction scheduling is an optimization technique that rearranges the micro-operations executed in a processor's pipeline, attempting to maximize the number of functional units operating in parallel and to minimize the time they spend waiting for each other [30].…”
Section: Feedback Driven Optimizations
confidence: 99%
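The instruction scheduling described in this citation statement is commonly realized as list scheduling over a dependence graph. The following is a minimal sketch, not code from any cited work: the dependence graph, latencies, and the longest-latency-first priority heuristic are all illustrative assumptions.

```python
def list_schedule(deps, latency):
    """Greedy list scheduler for a single-issue pipeline.

    deps:    {instr: set of instrs it depends on}
    latency: {instr: cycles until its result is available}
    Returns  {instr: issue cycle}.
    """
    finish = {}    # instr -> cycle its result becomes available
    schedule = {}  # instr -> cycle it was issued
    remaining = set(deps)
    cycle = 0
    while remaining:
        # An instruction is ready when all its predecessors have finished.
        ready = [i for i in remaining
                 if all(p in finish and finish[p] <= cycle for p in deps[i])]
        if ready:
            # Issue a long-latency instruction first to hide its delay
            # behind later, independent work.
            instr = max(ready, key=lambda i: latency[i])
            schedule[instr] = cycle
            finish[instr] = cycle + latency[instr]
            remaining.remove(instr)
        cycle += 1  # otherwise the pipeline stalls for this cycle
    return schedule

deps = {"load": set(), "mul": set(), "add": {"load"}, "store": {"add", "mul"}}
latency = {"load": 3, "mul": 2, "add": 1, "store": 1}
print(list_schedule(deps, latency))
# {'load': 0, 'mul': 1, 'add': 3, 'store': 4}
```

Issuing the independent `mul` while the `load` is still in flight is exactly the rearrangement the statement describes: functional units overlap instead of waiting for each other.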
“…First, we defined the instruction set as well as the addressing modes we wanted to support (see Appendix II for details). Next, we started to design the internal structure of the CPU using superscalar and superpipeline concepts [9]. Based on this, we divided the CPU pipeline operation into the following stages: Instruction Fetch (IF), Instruction Dispatch (ID), Instruction Decode (D), Address Generation (AG), Operand Fetch (OF), Execution (EX), and Write Back (WB).…”
Section: Layout Of Architecture
confidence: 99%
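The seven-stage pipeline named in this statement (IF, ID, D, AG, OF, EX, WB) can be modeled, under the simplifying assumption of an in-order, stall-free pipeline with one instruction entering per cycle, as follows. This is an illustrative sketch, not the cited CPU design.

```python
# Pipeline stages from the citation statement, in order.
STAGES = ["IF", "ID", "D", "AG", "OF", "EX", "WB"]

def stage_of(instr_index: int, cycle: int):
    """Stage occupied by instruction `instr_index` at `cycle`,
    assuming one instruction enters IF per cycle and no stalls.
    Returns None if the instruction is not in the pipeline yet
    (or has already written back)."""
    stage = cycle - instr_index
    if 0 <= stage < len(STAGES):
        return STAGES[stage]
    return None

# Instruction 0 enters IF at cycle 0 and writes back at cycle 6;
# instruction 1 trails it by exactly one stage.
print(stage_of(0, 6))  # WB
print(stage_of(1, 6))  # EX
```

The diagonal relationship (`stage = cycle - instruction index`) is what lets seven instructions be in flight at once, one per stage.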
“…A second column has been inserted after each line number to indicate which processing element(s) process this RTL line. For instance, the first line ([1]) initializes the variable q to zero. Since the only use of q is in code allocated to PE3, the initialization of q is allocated to PE3.…”
Section: Code Separation
confidence: 99%
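The allocation rule in this statement (assign an initialization to the processing element that is the sole user of the variable) can be sketched as a small lookup. This is a hypothetical illustration; the function name, the "shared" fallback, and the use map are assumptions, not the cited tool's interface.

```python
def allocate_init(var, uses):
    """Allocate a variable's initialization to a processing element (PE).

    uses: {var: set of PEs whose code reads this variable}
    Returns the single using PE, or "shared" when several PEs
    (or none) use the variable and the rule does not apply.
    """
    pes = uses.get(var, set())
    if len(pes) == 1:
        # Sole user: place the initialization on that PE, as in the
        # example where q is only used by code on PE3.
        return next(iter(pes))
    return "shared"

uses = {"q": {"PE3"}, "k": {"PE1", "PE2"}}
print(allocate_init("q", uses))  # PE3
print(allocate_init("k", uses))  # shared
```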