TILT: A multithreaded VLIW soft processor family

Ovtcharov, Kalin; Tili, Ilian; Steffan, J. Gregory

doi:10.1109/fpl.2013.6645553

Cited by 10 publications

(7 citation statements)

References 8 publications

(7 reference statements)

“…Most successful TM overlays are based on soft processors. The more performance oriented ones include, SIMD Octavo [13], VectorBlox MXP [24] and VLIW TILT [19]. A massively parallel overlay, called GRVI Phalanx [7], based on the RISC-V processor and the Hoplite NOC [11] mapped 1680 RISC-V cores onto an UltraScale+ VU9P.…”

Section: Related Work (mentioning)

confidence: 99%

“…Similarly for the 2nd cluster, scheduling as: 14,26,21,10,16,11,27,22, resolves dependencies 14-11, 26-27, and 21-22, for all overlay versions. In cluster three, scheduling as: 18,24,28,23,19,30,8, resolves all dependencies for the V4 and V5 overlays, but not for the V3 overlay, which with an IWP of 5 requires 4 operations between dependant nodes. Hence, a single NOP must be added between 23 and 19 which then resolves all 4 sets of dependant instructions.…”

Section: Compiling To the Overlay (mentioning)

confidence: 99%
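The NOP-padding step described in the excerpt above can be illustrated with a minimal Python sketch. It is not the citing paper's compiler. It assumes that each dependency pair such as 14-11 means the first node produces a value the second consumes, that producers always precede consumers in the given order, and that "4 operations between dependent nodes" means four intervening issue slots. The names `pad_schedule`, `min_gap`, and `NOP` are hypothetical. Only the second-cluster order and its three dependencies are taken from the excerpt.

```python
NOP = "NOP"  # hypothetical placeholder for an empty issue slot


def pad_schedule(order, deps, min_gap):
    """Insert NOPs so every consumer is issued at least `min_gap`
    slots after each of its producers.

    order   -- node ids in issue order (producers must come first)
    deps    -- (producer, consumer) pairs
    min_gap -- required number of slots between producer and consumer
    """
    producers = {}
    for prod, cons in deps:
        producers.setdefault(cons, []).append(prod)

    slot_of = {}  # node id -> slot index in the padded schedule
    padded = []
    for node in order:
        # Earliest slot that leaves min_gap slots after every producer.
        earliest = 0
        for prod in producers.get(node, []):
            earliest = max(earliest, slot_of[prod] + min_gap + 1)
        # Fill with NOPs until that slot is reached.
        while len(padded) < earliest:
            padded.append(NOP)
        slot_of[node] = len(padded)
        padded.append(node)
    return padded


# Second cluster from the excerpt: order and dependency pairs.
cluster2 = [14, 26, 21, 10, 16, 11, 27, 22]
cluster2_deps = [(14, 11), (26, 27), (21, 22)]

# With a gap of 4 the order already satisfies every dependency.
unpadded = pad_schedule(cluster2, cluster2_deps, min_gap=4)

# A hypothetical stricter gap of 5 (not a value from the excerpt) forces padding.
stricter = pad_schedule(cluster2, cluster2_deps, min_gap=5)
```

With a gap of 4 the second-cluster order comes back unchanged, which matches the excerpt's statement that this order resolves the dependencies for all overlay versions. The third cluster is not reproduced here because the excerpt does not list its dependency pairs.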

A time-multiplexed FPGA overlay with linear interconnect

Jain, Maskell, et al., 2018

2018 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Coarse-grained overlays improve FPGA design productivity by providing fast compilation and software like programmability. Soft processor based overlays with well-defined ISAs are attractive to application developers due to their ease of use. However, these overlays have significant FPGA resource overheads. Time multiplexed (TM) CGRA-like overlays represent an interesting alternative as they are able to change their behavior on a cycle by cycle basis while the compute kernel executes. This reduces the FPGA resource needed, but at the cost of a higher initiation interval (II) and hence reduced throughput. The fully flexible routing network of current CGRA-like overlays results in high FPGA resource usage. However, many application kernels are acyclic and can be implemented using a much simpler linear feed-forward routing network. This paper examines a DSP block based TM overlay with linear interconnect where the overlay architecture takes account of the application kernels' characteristics and the underlying FPGA architecture, so as to minimize the II and the FPGA resource usage. We examine a number of architectural extensions to the DSP block based functional unit to improve the II, throughput and latency. The results show an average 70% reduction in II, with corresponding improvements in throughput and latency.

“…To improve power consumption and throughput, smaller and faster processor architectures, such as the iDEA processor [22], have been proposed. Examples of multi-threaded and parallel processors include: CUSTARD [23], Octavo [24] and SIMD-Octavo [25], The VectorBlox MXP soft vector processor [26] and the TILT VLIW processor [27].…”

Section: B Time-multiplexed Overlays (mentioning)

confidence: 99%

FPGA Overlays Hardware based Computing for the Masses

Phung, Maskell, Li, 2018

Eighth International Conference on Advances in Computing, Electronics and Electrical Technology - CEET 2018

The hardware acceleration of compute intensive applications has definite advantages, particularly in terms of energy and application latency. Heterogeneous programmable system-on-chip (SoCs) FPGA devices, which combine general purpose processors with reconfigurable fabrics, provide a compelling platform for IoT applications. However, FPGA devices are constrained due to significant design productivity issues and a lack of suitable hardware abstraction. For FPGAs to compete as general purpose computing platforms they must be better virtualized, as eliminating the need to work with platform-specific details would make them more accessible to application developers who are accustomed to software API abstractions and fast development cycles. In this paper, we discuss the role of overlay architectures for enabling general purpose FPGA application acceleration.

“…In the area of instruction programmable FPGA overlays, active academic research on vector processors [36,37] is going on in the area of embedded computing devices as throughput-optimized alternatives to scalar soft processors. Ovtcharov et al. [38] add the concept of GPU-like multithreading to hide latencies of functional units and memory access by pipelining the execution of different threads. As proposed by Kingyens and Steffan [39] and brought forward by Convey with CHOMP [40] as successor to the vector processor utilized in this work, such a GPU-like architecture may be a promising architecture template for acceleration of server- and datacenter-scale computing tasks.…”

Section: Related Work (mentioning)

confidence: 99%

Exploring Trade-Offs between Specialized Dataflow Kernels and a Reusable Overlay in a Stereo Matching Case Study

Kenter, Schmitz, Plessl, 2015

International Journal of Reconfigurable Computing

FPGAs are known to permit huge gains in performance and efficiency for suitable applications but still require reduced design efforts and shorter development cycles for wider adoption. In this work, we compare the resulting performance of two design concepts that in different ways promise such increased productivity. As common starting point, we employ a kernel-centric design approach, where computational hotspots in an application are identified and individually accelerated on FPGA. By means of a complex stereo matching application, we evaluate two fundamentally different design philosophies and approaches for implementing the required kernels on FPGAs. In the first implementation approach, we designed individually specialized data flow kernels in a spatial programming language for a Maxeler FPGA platform; in the alternative design approach, we target a vector coprocessor with large vector lengths, which is implemented as a form of programmable overlay on the application FPGAs of a Convey HC-1. We assess both approaches in terms of overall system performance, raw kernel performance, and performance relative to invested resources. After compensating for the effects of the underlying hardware platforms, the specialized dataflow kernels on the Maxeler platform are around 3x faster than kernels executing on the Convey vector coprocessor. In our concrete scenario, due to trade-offs between reconfiguration overheads and exposed parallelism, the advantage of specialized dataflow kernels is reduced to around 2.5x.
