Architectural Considerations for Efficient Software Execution on Parallel Microprocessors

In this paper, we describe the application of two parallelization strategies to the Quartus II FPGA placer. The first uses a pipelining approach and achieves speedups of 1.3x on two processing cores. The second uses a parallel moves approach and achieves speedups of 2.2x on four cores. Unlike all previous parallel moves algorithms, ours is deterministic and always gives the same answer as the serial version of the algorithm, without any significant reduction in performance. We also describe a process to quantify multi-core performance effects, such as memory subsystem limitations and explicit synchronization overhead, and fully describe these effects on a CAD tool for the first time. Memory limitations alone are found to cost up to 35% of total runtime. Unlike previous algorithms, our algorithms have negligible explicit synchronization overhead. These results are relevant to both CAD designers and to any developers seeking to parallelize existing software.
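The determinism property claimed in the abstract (a parallel moves scheme that returns exactly the serial answer) can be illustrated with a small sketch. This is not the Quartus II algorithm: the one-dimensional chain cost model, the greedy accept rule, and the names `local_cost`, `swap_delta`, `serial_place`, and `parallel_place` are all assumptions made for illustration. Moves in a batch are evaluated in parallel against a snapshot, then committed in their original order; any move whose inputs were changed by an earlier commit in the same batch is re-evaluated first.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def local_cost(pos, cells):
    """Wirelength of the chain edges (i, i+1) that touch any cell in `cells`."""
    edges = set()
    for c in cells:
        if c > 0:
            edges.add((c - 1, c))
        if c < len(pos) - 1:
            edges.add((c, c + 1))
    return sum(abs(pos[i] - pos[j]) for i, j in edges)


def swap_delta(pos, a, b):
    """Change in cost if cells a and b exchange locations."""
    trial = list(pos)
    trial[a], trial[b] = trial[b], trial[a]
    return local_cost(trial, (a, b)) - local_cost(pos, (a, b))


def serial_place(pos, moves):
    """Reference version: evaluate and commit one move at a time."""
    pos = list(pos)
    for a, b in moves:
        if swap_delta(pos, a, b) < 0:  # greedy accept (assumed rule)
            pos[a], pos[b] = pos[b], pos[a]
    return pos


def parallel_place(pos, moves, batch_size=4, workers=4):
    """Evaluate each batch in parallel, then commit in the serial order."""
    pos = list(pos)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for first in range(0, len(moves), batch_size):
            batch = moves[first:first + batch_size]
            snapshot = tuple(pos)  # read-only view shared by the workers
            deltas = list(pool.map(lambda m: swap_delta(snapshot, *m), batch))
            dirty = set()  # cells written by earlier commits in this batch
            for (a, b), delta in zip(batch, deltas):
                reads = {a - 1, a, a + 1, b - 1, b, b + 1}
                if reads & dirty:
                    # The snapshot result is stale; redo it on current state.
                    delta = swap_delta(pos, a, b)
                if delta < 0:
                    pos[a], pos[b] = pos[b], pos[a]
                    dirty.update((a, b))
    return pos


rng = random.Random(0)
initial = rng.sample(range(16), 16)
moves = [(rng.randrange(16), rng.randrange(16)) for _ in range(200)]
assert serial_place(initial, moves) == parallel_place(initial, moves)
```

Because commits happen in the original move order and stale evaluations are redone, the result does not depend on thread scheduling. In CPython the global interpreter lock prevents a real speedup here; the sketch shows only the ordering discipline, not the performance figures quoted above.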


“…Similar conclusions for these types of configurations have also been reported for other fine-grained algorithms [22].…”

Section: Further Study Of Memory Inefficiencies (supporting)

confidence: 87%

“…Note that the other cores are not stalled and are doing useful work. The concept of having a thread change roles was described in [22] to improve cache efficiency, but we use it mainly to avoid stalls. This change alone improved the performance of our algorithm by approximately 30% at two cores.…”

Section: Methods (mentioning)

confidence: 99%

High-quality, deterministic parallel placement for FPGAs on commodity hardware

Ludwin, Betz, Padalia. 2008. Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays


“…It is shown in [18] that communicating through the cache coherence mechanism is slower than communicating through memory for some common CMPs. Using Register-Based Synchronization (RBS) will reduce the spin waiting of the threads and cut cache contention and overhead.…”

Section: Introduction (mentioning)

confidence: 99%

Architecture optimizations for synchronization and communication on chip multiprocessors

Fide, Jenks. 2008. 2008 IEEE International Symposium on Parallel and Distributed Processing

Self Cite


Chip multiprocessors (CMPs) enable concurrent execution of multiple threads using several cores on a die. Current CMPs behave much like symmetric multiprocessors and do not take advantage of the proximity between cores to improve synchronization and communication between concurrent threads. Thread synchronization and communication instead use memory/cache interactions. We propose two architectural enhancements to support fine-grained synchronization and communication between threads that reduce overhead and memory/cache contention. Register-Based Synchronization exploits the proximity between cores to provide low-latency shared registers for synchronization. This approach can save significant power over spin waiting when blocking events that suspend the core are used. Prepushing provides software-controlled data forwarding between caches to reduce coherence traffic and improve cache latency and hit rates. We explore the behavior of these approaches, and evaluate their effectiveness at improving synchronization and communication performance on CMPs with private caches. Our simulation results show significant reduction in inter-core traffic, latencies, and miss rates.
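The contrast the abstract draws between spin waiting and blocking events can be shown with a software analogy. Register-Based Synchronization itself is a hardware proposal and is not modeled here; the sketch below only assumes ordinary Python threads, and the function names `spin_wait_consumer` and `blocking_consumer` are hypothetical. The first consumer polls a shared flag in a busy loop, while the second suspends on a `threading.Event` until the producer signals it.

```python
import threading
import time


def spin_wait_consumer(producer_delay=0.01):
    """Consumer busy-polls a shared flag until the producer sets it."""
    shared = {"ready": False, "value": None}

    def producer():
        time.sleep(producer_delay)
        shared["value"] = 42
        shared["ready"] = True  # set last, after the value is written

    worker = threading.Thread(target=producer)
    worker.start()
    polls = 0
    while not shared["ready"]:  # burns CPU time on every iteration
        polls += 1
    worker.join()
    return shared["value"], polls


def blocking_consumer(producer_delay=0.01):
    """Consumer sleeps on an event and is woken when the producer signals."""
    ready = threading.Event()
    box = {}

    def producer():
        time.sleep(producer_delay)
        box["value"] = 42
        ready.set()

    worker = threading.Thread(target=producer)
    worker.start()
    ready.wait()  # the thread is suspended instead of polling
    worker.join()
    return box["value"]


value, polls = spin_wait_consumer()
print("spin wait got", value, "after", polls, "polls")
print("blocking wait got", blocking_consumer())
```

Both consumers receive the same value; the difference is that the spinning version executes many wasted loop iterations while it waits, which is the software counterpart of the overhead and power cost the abstract attributes to spin waiting.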


Synchronization Mechanisms on Modern Multi-core Architectures

Liu, Gaudiot. 2007. Advances in Computer Systems Architecture


Architectural Considerations for Efficient Software Execution on Parallel Microprocessors

Cited by 6 publications

References 27 publications
