Area-Performance Trade-offs in Tiled Dataflow Architectures

Swanson, Steven; Putnam, Andrew; Mercaldi, Martha; Michelson, Ken; Petersen, Andrew; Schwerin, Andrew; Oskin, Mark; Eggers, Susan J.

doi:10.1145/1150019.1136513

Cited by 19 publications

(20 citation statements)

References 28 publications


“…The TRIPS architecture [21], [20] is an instantiation of an EDGE ISA which utilizes large cores consisting of a matrix of execution units. In [22], the authors explore the area-performance trade-offs of a tiled data-flow architecture. A tiled architecture promises to address several issues facing conventional processors, such as complexity and performance.…”

Section: Related Work (mentioning)

confidence: 99%

Verilog-based simulation of hardware support for data-flow concurrency on multicore systems

Matheou

Evripidou

2013

2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)


Data-Driven Multithreading (DDM) is a threaded data-flow model that schedules threads for execution based on data availability. DDM uses a Thread Scheduling Unit (TSU) to manage threads on sequential processors. In this work we present a hardware implementation of the TSU in synthesizable Verilog HDL and evaluate it using the ISim simulator. The evaluation results show that the TSU can run at a maximum frequency of 180 MHz and consumes only 5% of the Xilinx Virtex-6 FPGA resources. These initial results will enable us to design an FPGA-based DDM multicore chip consisting of several MicroBlaze cores driven by the TSU. We will thus be able to evaluate the performance of the novel threaded data-flow model and compare it directly with the sequential model on the same hardware.


“…The WaveScalar processor has a similar philosophy and execution model as TRIPS, but uses a hierarchy of interconnection networks to pass operands between processing elements [9]. Operands are broadcast within the eight processing elements making up one domain.…”

Section: Related Work (mentioning)

confidence: 99%

Implementation and Evaluation of a Dynamically Routed Processor Operand Network

Gratz

Sankaralingam

Hanson

et al. 2007

First International Symposium on Networks-on-Chip (NOCS'07)


Microarchitecturally integrated on-chip networks, or micronets, are candidates to replace busses for processor component interconnect in future processor designs. For micronets, tight coupling between processor microarchitecture and network architecture is one of the keys to improving processor performance. This paper presents the design, implementation and evaluation of the TRIPS operand network (OPN). The TRIPS OPN is a 5x5, dynamically routed, 2D mesh micronet that is integrated into the TRIPS microprocessor core. The TRIPS OPN is used for operand passing, register file I/O, and primary memory system I/O. We discuss in detail the OPN design, including the unique features that arise from its integration with the processor core, such as its connection to the execution unit's wakeup pipeline and its in-flight mis-speculated traffic removal. We then evaluate the performance of the network under synthetic and realistic loads. Finally, we assess the processor performance implications of OPN design decisions with respect to the end-to-end latency of OPN packets and the OPN's bandwidth.


“…We scheduled nine sample applications from the Spec2000 [35] and Splash2 [4] benchmark suites (art, equake, gzip, mcf, radix, twolf and fft, lu, ocean, respectively). The cycle-level simulator used for this study is tuned to match the latencies, resources, and restrictions of an RTL implementation [37] of the architecture. Table 2 shows the average performance of each of these nine schedules.…”

Section: Experimental Evaluation (mentioning)

confidence: 99%

“…A simple PE decreases both design and verification time; PE replication provides robustness in the face of fabrication errors; and the combination reduces wire delay for both data and control signal transmission. The result is a scalable architecture that allows a chip designer to target different levels of performance, with different area budgets [37].…”

Section: Introduction (mentioning)

confidence: 99%

Instruction scheduling for a tiled dataflow architecture

et al. 2006

Self Cite


This paper explores hierarchical instruction scheduling for a tiled processor. Our results show that at the top level of the hierarchy, a simple profile-driven algorithm effectively minimizes operand latency. After this schedule has been partitioned into large sections, the bottom-level algorithm must more carefully analyze program structure when producing the final schedule. Our analysis reveals that at this bottom level, good scheduling depends upon carefully balancing instruction contention for processing elements and operand latency between producer and consumer instructions. We develop a parameterizable instruction scheduler that more effectively optimizes this trade-off. We use this scheduler to determine the contention-latency sweet spot that generates the best instruction schedule for each application. To avoid this application-specific tuning, we also determine the parameters that produce the best performance across all applications. The result is a contention-latency setting that generates instruction schedules for all applications in our workload that come within 17% of the best schedule for each.

