Hardware Support For Large Atomic Units in Dynamically Scheduled Machines

Melvin, Stephen; Shebanow, Michael; Patt, Yale N.

doi:10.1109/micro.1988.639255

Cited by 42 publications

(13 citation statements)

References 3 publications

“…Second, DIF does not use an out-of-order engine but only a simple "primary engine" alongside its VLIW engine, and envisions that most code should execute on the VLIW engine. Atomic block-based cores: Melvin et al [39,38], and later Hao et al [24] and Sprangle et al [51], propose a core design in which the compiler provides atomic blocks at the ISA level. These works note multiple advantages of using atomic blocks: the core has a higher instruction fetch rate, and can also use a small local register file to reduce register pressure on a global register file [51].…”

Section: Related Work (mentioning)

confidence: 99%

The heterogeneous block architecture

Fallin

Wilkerson

Mutlu

2014

2014 IEEE 32nd International Conference on Computer Design (ICCD)

This paper makes two new observations that lead to a new heterogeneous core design. First, we observe that most serial code exhibits fine-grained heterogeneity: at the scale of tens or hundreds of instructions, regions of code fit different microarchitectures better (at the same point or at different points in time). Second, we observe that by grouping contiguous regions of instructions into blocks that are executed atomically, a core can exploit this heterogeneity: atomicity allows each block to be executed independently on its own execution backend that fits its characteristics best. Based on these observations, we propose a fine-grained heterogeneous design that combines heterogeneous execution backends into one core. Our core design, the heterogeneous block architecture (HBA), breaks the program into blocks of code, determines the best backend for each block, and specializes the block for that backend. As an initial, concrete design, we combine out-of-order, VLIW, and in-order backends, using simple heuristics to choose backends. We compare HBA to multiple baseline core designs (including monolithic out-of-order, clustered out-of-order, in-order and a state-of-the-art heterogeneous core design) and show that HBA can provide significantly better energy efficiency than all designs at similar performance. Averaged across 184 traces from a wide variety of workloads, HBA reduces core power by 36.4% and energy per instruction by 31.9% compared to a 4-wide out-of-order core. We conclude that HBA provides a flexible substrate for exploiting fine-grained heterogeneity, enabling new energy-performance tradeoff points in core design.

“…In addition, it requires dynamic scheduling hardware in the main data path of the machine, which can have a negative effect on the clock cycle time. Franklin and Smotherman [13] proposed the use of a fill unit [25] to compact a dynamic stream of scalar instructions. Their fill unit accepts decoded instructions from the machine decoder, compacts them into a long instruction (the term used in the rest of this paper to refer to VLIW instructions), and saves this into a shadow cache.…”

Section: Tackling the VLIW Object Code Compatibility Problem (mentioning)

confidence: 99%

Dynamically Scheduling VLIW Instructions

Souza

Rounce

2000

Journal of Parallel and Distributed Computing

Very long instruction word (VLIW) machines potentially provide the most direct way to exploit instruction-level parallelism; however, they cannot be used to emulate current general-purpose instruction set architectures. In addition, programs scheduled for a particular implementation of a VLIW model cannot be guaranteed to be binary compatible with other implementations of the same machine model with a different number of functional units or functional units with different latencies. This paper describes an architecture, named dynamically trace scheduled VLIW (DTSVLIW), that can be used to implement machines that execute code of current RISC or CISC instruction set architectures in a VLIW fashion, with backward code compatibility. Preliminary measurements of the DTSVLIW performance, obtained with an execution-driven simulator running the SPECint95 benchmark suite, are also presented. Academic Press
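The fill unit mentioned in the citation statement above (accept decoded scalar instructions, compact them into a long instruction, save the result in a shadow cache) can be illustrated with a minimal sketch. This is not the design from any of the cited papers: the names `Instr` and `FillUnit`, the greedy packing policy, the register-conflict check, and the choice to key the shadow cache by the address of the first instruction are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded scalar instruction: address, destination, sources."""
    pc: int
    dest: Optional[str]
    srcs: Tuple[str, ...] = ()


@dataclass
class FillUnit:
    """Toy fill unit: packs independent instructions into one long instruction."""
    width: int = 4  # assumed number of slots per long instruction
    shadow_cache: Dict[int, Tuple[Instr, ...]] = field(default_factory=dict)
    _current: List[Instr] = field(default_factory=list)

    def _conflicts(self, ins: Instr) -> bool:
        # An instruction cannot share a long instruction with an earlier one
        # if there is a RAW, WAW, or WAR register dependence between them.
        for prev in self._current:
            raw = prev.dest is not None and prev.dest in ins.srcs
            waw = prev.dest is not None and prev.dest == ins.dest
            war = ins.dest is not None and ins.dest in prev.srcs
            if raw or waw or war:
                return True
        return False

    def accept(self, ins: Instr) -> None:
        # Close the current long instruction when it is full or when the
        # incoming instruction depends on something already packed.
        if len(self._current) >= self.width or self._conflicts(ins):
            self.flush()
        self._current.append(ins)

    def flush(self) -> None:
        # Save the finished long instruction, keyed by its first address.
        if self._current:
            self.shadow_cache[self._current[0].pc] = tuple(self._current)
            self._current = []


if __name__ == "__main__":
    fu = FillUnit(width=2)
    for ins in [Instr(0, "r1", ("r2",)), Instr(4, "r3", ("r4",)),
                Instr(8, "r5", ("r1",)), Instr(12, "r6", ("r5",))]:
        fu.accept(ins)
    fu.flush()
    for pc, long_instr in sorted(fu.shadow_cache.items()):
        print(pc, [i.pc for i in long_instr])
```

A real fill unit would also have to handle branches, memory dependences, and functional-unit constraints, none of which this sketch models.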

“…Blocks of instructions are preprocessed before being put in the trace cache, which greatly simplifies processing after they are fetched. Preprocessing can include capturing data dependence relationships, combining and reordering instructions, or determining instruction resource requirements [5], all of which can be reused. To support precise interrupts, information about the original instruction order must also be saved with the trace.…”

Section: Instruction Preprocessing (mentioning)

confidence: 99%

Trace processors: moving to fourth-generation microarchitectures

Smith

Vajapeyam

1997

Computer

Fundamentally new generations of microarchitectures have been occurring approximately every two decades since the 1940s. Each generation has been driven by advances in underlying hardware technologies, and by attempts to extract and realize higher degrees of instruction-level parallelism. Given this pattern and the continued push for higher performance, we are midway through the third generation and are currently laying the groundwork for the fourth.

Technology trends are clear. By the end of the next decade a single IC chip will contain several hundred million, if not a billion, transistors. The Semiconductor Industry Association's road map [1] projects processors with 350 million transistors in 2007 and with 800 million by 2010. These large numbers of transistors result from greatly reduced feature sizes and lead to higher wiring densities. Thus, a major challenge is to use these transistors effectively and to accommodate the dramatic shifts in design constraints that will result from these changes.

As the sidebar "Why Large Uniprocessors?" describes, there are primarily three ways to respond to this challenge: build a multiprocessor on chip, integrate more of the computer system on a chip, or build a large uniprocessor, which would realize the fourth generation of microarchitectures. We have chosen to explore building large uniprocessors, specifically trace processors. A trace processor can execute ordinary serial programs written in standard languages at much higher speeds than are currently possible. It replicates superscalar pipelines (characteristic of the current microarchitecture generation) to form a set of connected processing elements. To this is added a level of hierarchy for control and data. A high-level control unit partitions the instruction stream into segments, or traces. A specially organized cache holds traces, and the processor fetches and executes traces as a unit. Another important feature is the heavy use of prediction for both control and data, which increases the exploitable parallelism in ordinary programs.

Although we describe features of the trace processor's architecture, our goal in this article is to focus on trace processors as a vehicle for describing the requirements of a fourth-generation microarchitecture. In this we include the technology trends that drive those requirements and the underlying features to support the microarchitecture. Figure 1 diagrams the four generations of microarchitectures. The first generation (top), serial processors, began in the 1940s with the first electronic digital computers and ended in the early 1960s. Serial processors fetch and execute each instruction before going to the next. The second generation was distinguished by pipelining and similar methods for overlapping instruction execution. IBM Stretch was a precursor of this generation, and the CDC 6600 was probably the first to achieve commercial success. The 6600 was followed shortly by pipelined processors in IBM mainframes. Second-generation microarchitectures using pipelining were the norm fo...
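One of the preprocessing steps named in the citation statement above, capturing data dependence relationships within a trace, can be sketched as follows. This is an illustration under stated assumptions rather than the mechanism of any cited design: the function name `preprocess_trace` and the `(dest, srcs)` tuple encoding of an instruction are hypothetical, and only register dependences are tracked.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical encoding: (destination register or None, source registers).
TraceInstr = Tuple[Optional[str], Tuple[str, ...]]


def preprocess_trace(trace: Sequence[TraceInstr]) -> List[Tuple[Optional[int], ...]]:
    """For every source operand, record the index of the instruction in the
    same trace that produces it, or None if the value is live-in to the trace.

    Indices refer to the original program order, so that order is kept
    alongside the dependence information.
    """
    last_writer: Dict[str, int] = {}
    deps: List[Tuple[Optional[int], ...]] = []
    for idx, (dest, srcs) in enumerate(trace):
        # Look up producers before recording this instruction's own write,
        # so an instruction never appears to depend on itself.
        deps.append(tuple(last_writer.get(s) for s in srcs))
        if dest is not None:
            last_writer[dest] = idx
    return deps


if __name__ == "__main__":
    trace = [
        ("r1", ("r2",)),        # 0: r1 <- f(r2)
        ("r3", ("r1", "r4")),   # 1: r3 <- f(r1, r4)
        ("r1", ("r3",)),        # 2: r1 <- f(r3)
        (None, ("r1",)),        # 3: uses r1, writes no register
    ]
    print(preprocess_trace(trace))
```

Because the result is computed once when the trace is built, it can be stored with the trace and reused each time the trace is fetched, which is the reuse benefit the statement describes.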
