Self-hosted placement for massively parallel processor arrays

Smecher, G.; Wilton, Steven J. E.; Lemieux, Guy

doi:10.1109/fpt.2009.5377668

Cited by 3 publications

(4 citation statements)

References 8 publications

(20 reference statements)


“…More recently, work by Smecher, Wilton and Lemieux [38] demonstrated that the algorithm used in [37] could be applied to placement of communicating tasks for a Massively Parallel Processor Array (MPPA). Since MPPAs contain reasonably powerful CPUs, they can "self-host" or place themselves.…”

Section: Previous Work (mentioning)

confidence: 99%


Scalable and deterministic timing-driven parallel placement for FPGAs

Wang

Lemieux

2011

Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays


This thesis describes a parallel implementation of the timing-driven VPR 5.0 simulated-annealing placement engine. By partitioning the grid into regions and allowing distant data to grow stale, it is possible to consider a large number of nonconflicting moves in parallel and achieve a deterministic result. The full timing-driven placement algorithm is parallelized, including swap evaluation, bounding-box calculation and the detailed timing-analysis updates. The partitioned-region approach slightly degrades the placement quality, but this is necessary to expose greater parallelism. We also suggest a method to recover the lost quality. In simulated annealing, runtime can be shortened at the expense of quality. Using this method, the serial placer can achieve a maximum speedup of 100X while quality metrics degrade by as much as 100%. In contrast, the parallel placer can scale beyond 500X with all quality metrics degrading by less than 30%. Specifically, at the point where the parallel placer begins to dominate over the serial placer, the post-routing minimum channel width, wirelength and critical-path delay degrade by 13%, 10% and 7% respectively on average compared to VPR's original algorithm, while achieving a 140X to 200X speedup with 25 threads. Finally, it is shown that the amount of degradation in the parallel placer is independent of the number of threads used.
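The region-partitioning idea in this abstract can be illustrated with a toy sketch. It is not the thesis's implementation (which parallelizes VPR 5.0's timing-driven placer); it assumes a wirelength-only cost, vertical strips as regions, and hypothetical names (`hpwl`, `anneal_region`, `parallel_place`). Each region swaps only its own blocks against a stale snapshot, and each region draws from its own seeded random stream, so the result does not depend on how many workers run or how they are scheduled.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def hpwl(net, pos):
    """Half-perimeter wirelength of one net (a list of block ids)."""
    xs = [pos[b][0] for b in net]
    ys = [pos[b][1] for b in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def anneal_region(blocks, snapshot, nets, nets_of, temp, moves, seed):
    """Try swaps among one region's blocks against a stale snapshot.

    Blocks outside the region are read at their snapshot positions and
    never moved, so regions cannot conflict with each other.
    """
    if len(blocks) < 2:
        return {}
    rng = random.Random(seed)      # private stream, independent of scheduling
    local = dict(snapshot)         # private copy; the shared snapshot is read-only
    for _ in range(moves):
        a, b = rng.sample(blocks, 2)
        affected = nets_of[a] | nets_of[b]
        before = sum(hpwl(nets[i], local) for i in affected)
        local[a], local[b] = local[b], local[a]
        delta = sum(hpwl(nets[i], local) for i in affected) - before
        # Standard annealing acceptance: always take improvements,
        # take uphill moves with probability exp(-delta / temp).
        if delta > 0 and rng.random() >= math.exp(-delta / temp):
            local[a], local[b] = local[b], local[a]   # reject: undo swap
    return {blk: local[blk] for blk in blocks}


def parallel_place(pos, nets, width, num_regions=4, passes=5,
                   moves=50, temp=1.0, seed=0, workers=4):
    """Run several passes; each pass anneals all regions concurrently."""
    nets_of = {b: set() for b in pos}
    for i, net in enumerate(nets):
        for b in net:
            nets_of[b].add(i)
    pos = dict(pos)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for p in range(passes):
            snapshot = dict(pos)   # data that may go stale during the pass
            regions = [[] for _ in range(num_regions)]
            for blk in sorted(snapshot):
                regions[snapshot[blk][0] * num_regions // width].append(blk)
            futures = [
                pool.submit(anneal_region, regions[r], snapshot, nets,
                            nets_of, temp, moves,
                            seed * 1_000_003 + p * 1009 + r)
                for r in range(num_regions)
            ]
            for f in futures:      # merge in fixed region order
                pos.update(f.result())
            temp *= 0.8            # simple geometric cooling (an assumption)
    return pos
```

Because swaps only permute positions inside a region, merging the per-region results always yields a legal placement. In CPython the threads here mainly illustrate the structure; the global interpreter lock limits actual speedup for this kind of pure-Python work.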



“…Smecher et al [38] to the full timing-driven placement algorithm from VPR 5.0 using Pthreads, allowing it to run on readily available shared-memory multicore computers.…”

Section: Previous Work (mentioning)

confidence: 99%


“…
Casotto et al [1987] | No | Sequent Balance 8,000 (8-processors) | 6.4× on 8 processors
Kravitz and Rutenbar [1987] | No | VAX 11/784 (4-processors) | 2.3× on 4 processors
Rose et al [1988] | No | 6 Nat.Sem. 32,016 processors | 4× with 5 processors
Witte et al [1991] | No | Hypercube multiprocessors | 8× with 16 processors
Sun and Sechen [1994] | No | Networks of machines | 3.3× on 16 processors
Wrighton and Dehon [2003] | No | FPGAs | 500×-2,500× over CPUs
Ludwin et al [2008] | Yes | Multiprocessors | 2.1× on 4 processors
Smecher et al [2009] | No | MPPAs | 1/256 less swaps with 1,024 cores
Choong et al [2010] | No | GPU | 10× on NVIDIA GTX280
Ludwin and Betz [2011] | Yes | Multiprocessors | 2.4× on 8 processors
Wang and Lemieux [2011] | Yes | Multiprocessors | 161× using 25 processors
…”

Section: Related Work (mentioning)

confidence: 99%

“…This leads to mentionable speedups; however, it is not widely accepted as it cannot follow the exponential growth in FPGA logic cell counts. - Develop novel parallel algorithms to take advantage of the existing and upcoming multicore processors [Casotto et al 1987; Choong et al 2010; Rose et al 1988; Ludwin et al 2008; Kravitz and Rutenbar 1987; Wrighton and Dehon 2003; Witte et al 1991; Smecher et al 2009]. With the current market trend of increasing the number of CPU cores rather than designing faster CPU cores [ITRS 2012], the usage of parallel CAD algorithms promises to alleviate the runtime crisis.…”

Section: Introduction (mentioning)

confidence: 99%

Genesis

Diamantopoulos

Siozios

Xydis

et al. 2015

ACM Trans. Embed. Comput. Syst.


Placement is regarded as one of the most time-consuming processes in physical implementation flows for reconfigurable architectures, and it strongly affects the quality of the resulting application implementation because it has an impact on the maximum operating frequency. In this article, we propose a novel placer, based on a genetic algorithm, targeting FPGAs. Unlike related approaches, which execute sequentially, the new placer exhibits inherent parallelism and can benefit from multicore processors. Experimental results demonstrate the effectiveness of this solution: it achieves an average reduction of execution runtime and application delay by 67× and 16%, respectively.


Rapid Synthesis and Simulation of Computational Circuits in an MPPA

Grant

Smecher

Lemieux

et al. 2010

J Sign Process Syst

Self Cite


A computational circuit is custom-designed hardware which promises to offer maximum speedup of computationally intensive software algorithms. However, the practical need to manage development cost and many low-level physical design details erodes much of the potential speedup by distracting attention away from high-level architectural design. Instead, designers need an inexpensive, processor-like platform where computational circuits can be rapidly synthesized and simulated. This enables rapid architectural evolution and mitigates the risk of producing custom hardware. In this paper we present a tool flow (RVETool) for compiling computational circuits into a massively parallel processor array (MPPA). We demonstrate that the CAD runtime is on average 70x faster than FPGA tools, with a circuit speed 6.4x slower than FPGA devices. Unlike the fixed logic capacity of FPGAs, RVETool can trade area for simulation performance by targeting a wide range of processor cores.

