On-chip communication and synchronization mechanisms with cache-integrated network interfaces

Kavadias, Stamatis; Katevenis, Manolis; Zampetakis, Michail; Nikolopoulos, Dimitrios S.

doi:10.1145/1787275.1787328

Cited by 30 publications

(30 citation statements)

References 30 publications


“…This paper extends on our previous work in [17]. Here, we elaborate on the architecture of cache-integrated network interfaces and the technique of event responses that enables their efficient implementation, and also measure the logic overhead of NI integration inside a cache.…”

Section: Introduction (mentioning)

confidence: 86%

Cache-Integrated Network Interfaces: Flexible On-Chip Communication and Synchronization for Large-Scale CMPs

Kavadias, Katevenis, Zampetakis et al. 2011

Int J Parallel Prog

Self Cite


Per-core scratchpad memories (or local stores) allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces, appropriate for scalable multicores, that combine the best of two worlds, the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized network interface (NI) functions. This paper presents our architecture, which provides local and remote scratchpad access, to either individual words or multiword blocks through RDMA copy. Furthermore, we introduce event responses, as a technique that enables software-configurable communication and synchronization primitives. We present three event response mechanisms that expose NI functionality to software, for multiword transfer initiation, completion notifications for software-selected sets of arbitrary-size transfers, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and measure the logic overhead over a cache-only design for basic NI functionality to be less than 20%. We also evaluate the on-chip communication performance on the prototype, as well as the performance of synchronization functions with simulation of CMPs with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks, and efficient exploitation of the hardware bandwidth.
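The completion notification for a software-selected set of transfers, mentioned in the abstract above, can be pictured with a small software model. This is only a sketch under stated assumptions: the class `CompletionCounter`, its `acknowledge` method, and the helper `run_transfers` are hypothetical names invented for illustration, and the byte-counting scheme is an assumed simplification rather than the hardware mechanism the paper describes.

```python
from typing import Callable, List


class CompletionCounter:
    """Toy software model of a completion counter (hypothetical, not the paper's hardware)."""

    def __init__(self, expected_bytes: int, on_complete: Callable[[], None]) -> None:
        if expected_bytes <= 0:
            raise ValueError("expected_bytes must be positive")
        self.remaining = expected_bytes   # bytes still outstanding for the whole set
        self.on_complete = on_complete    # assumed notification hook

    def acknowledge(self, nbytes: int) -> bool:
        """Record that nbytes arrived; return True if this completed the set."""
        if nbytes <= 0 or nbytes > self.remaining:
            raise ValueError("acknowledgment does not match outstanding bytes")
        self.remaining -= nbytes
        if self.remaining == 0:
            self.on_complete()            # fire exactly once, when the set is done
            return True
        return False


def run_transfers(sizes: List[int]) -> List[str]:
    """Acknowledge transfers of the given sizes and return an event log."""
    log: List[str] = []
    counter = CompletionCounter(sum(sizes), lambda: log.append("all transfers done"))
    for size in sizes:
        log.append(f"transfer of {size} bytes arrived")
        counter.acknowledge(size)
    return log
```

Because the counter tracks bytes rather than transfer count, the set may mix transfers of arbitrary sizes, and the notification is raised only when the last one is acknowledged.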



“…Interconnect design is one of the two open issues along with programming model in multicore system design [Rutzig, 2013]. Although data communication is a primary anticipated bottleneck for system performance [Dally and Towles, 2007; Kavadias et al, 2010; Orduña et al, 2004], the interconnect design for data communication among the accelerator kernels has not been well addressed in hardware accelerator systems. A simple bus or shared memory is usually used for data communication between the host and the kernels as well as among the kernels.…”

Section: Problem Overview (mentioning)

confidence: 99%

“…In data intensive applications, a large amount of data needs to be transferred from core to core. Therefore, data communication is usually a primary anticipated bottleneck for system performance [Altera, 2008; Becker et al, 2007; Donchev et al, 2006; Kavadias et al, 2010]. One important method to improve the performance of such systems is reducing data communication overhead.…”

Section: Introduction (mentioning)

confidence: 99%

Hybrid Interconnect Design for Heterogeneous Hardware Accelerators

Pham-Quoc, Heisswolf, Werner et al. 2013

Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013


Heterogeneous multicore systems are becoming increasingly important as the need for computation power grows, especially as we enter the big data era. As one of the main trends in heterogeneous multicore, hardware accelerator systems provide application-specific hardware circuits and are thus more energy efficient and have higher performance than general purpose processors, while still providing a large degree of flexibility. However, system performance does not scale when increasing the number of processing cores, because the communication overhead increases greatly with the number of cores. Although data communication is a primary anticipated bottleneck for system performance, the interconnect design for data communication among the accelerator kernels has not been well addressed in hardware accelerator systems. A simple bus or shared memory is usually used for data communication between the accelerator kernels. In this dissertation, we address the issue of interconnect design for heterogeneous hardware accelerator systems. Evidently, there are dependencies among computations, since data produced by one kernel may be needed by another kernel. Data communication patterns can be specific for each application and could lead to different types of interconnect. In this dissertation, we use detailed data communication profiling to design an optimized hybrid interconnect that provides the most appropriate support for the communication pattern inside an application while keeping the hardware resource usage for the interconnect minimal. Firstly, we propose a heuristic-based approach that takes application data communication profiling into account to design a hardware accelerator system with a custom interconnect. A number of solutions are considered including crossbar-based shared local memory, direct memory access (DMA) supporting parallel processing, local buffers, and hardware duplication. This approach is mainly useful for embedded systems where the hardware resources are limited. Secondly, we propose an automated hybrid interconnect design using data communication profiling to define an optimized interconnect for accelerator kernels of a generic hardware accelerator system. The hybrid interconnect consists of a network-on-chip (NoC), shared local memory, or both. To minimize hardware resource usage for the hybrid interconnect, we also propose an adaptive mapping algorithm to connect the computing kernels and their local memories to the proposed hybrid interconnect. Thirdly, we propose a hardware accelerator architecture to support streaming image processing. In all presented approaches, we implement the approach using a number of benchmarks on relevant reconfigurable platforms to show their effectiveness. The experimental results show that our approaches not only improve system performance but also reduce overall energy consumption compared to the baseline systems.


“…Our work on explicit communication and synchronization for the SARC architecture includes an FPGA prototype described in [2] and a longer description of the architecture, with performance measurements collected on the FPGA prototype [13].…”

Section: Related Work (mentioning)

confidence: 99%

Explicit Communication and Synchronization in SARC

et al. 2010


SARC merges cache controller and network interface functions by relying on a single hardware primitive: each access checks the tag and the state of the addressed line for possible occurrence of events that may trigger responses like coherence actions, RDMA, synchronization, or configurable event notifications. The fully virtualized and protected user-level API is based on specially marked lines in the scratchpad space that respond as command buffers, counters, or queues. The runtime system maps communication abstractions of the programming model to data transfers among local memories using remote write or read DMA and into task synchronization and scheduling using notifications, counters, and queues. The on-chip network provides efficient communication among these configurable memories, using advanced topologies and routing algorithms, and providing for process variability in NoC links. We simulate benchmark kernels on a full-system simulator to compare speedup and network traffic against cache-only systems with directory-based coherence and prefetchers. Explicit communication provides 10 to 40% higher speedup on 64 cores, and reduces network traffic by factors of 2 to 4, thus economizing on energy and power; lock and barrier latency is reduced by factors of 3 to 5.

EXPLICIT COMMUNICATION AND NETWORK INTERFACE EVOLUTION

Interprocessor communication (IPC) is the basis of parallel processing. IPC can be implicit, when the addresses supplied by the software do not identify physical data locations or (time of) movement, or it can be explicit, when software (the application, or compiler, or runtime system) is able to also indicate physical placement or transfers, besides specifying computation on data. The SARC architecture [1] supports both implicit IPC, through cache coherence, for ease of programming, and explicit IPC, through scratchpad memories and remote store instructions or remote DMA operations, to be used by software whenever possible for achieving scalable performance.

In order to hide IPC latency, when using implicit communication, we need large issue windows in out-of-order-execution processors, or sophisticated data prefetchers, or both. Explicit communication has the potential to better hide IPC latency, in those cases when software knows better than hardware what transfers need to take place and when. Remote store instructions, to addresses that indicate proximity to the consumer, when that is known at production time, will transfer data at the earliest possible time; hardware should coalesce writes to adjacent targets into few network packets, and the processor should not wait for the arrival acknowledgments. Remote direct memory access (RDMA) is the other method for explicit communication, in cases that require either reads (when the consumer is unknown or unavailable at production time) or multi-word writes (to achieve good coalescence).

Traditional systems viewed networks as external (slow) devices, provided DMA in the network interface (NI), and interacted with it through (slow) input/output (I...
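The remark above that hardware should coalesce writes to adjacent targets into few network packets can be illustrated with a short software model. It is a sketch only: the function `coalesce_writes`, the 4-byte word size, and the 8-word packet limit are assumptions chosen for illustration, not parameters taken from SARC.

```python
from typing import List, Tuple

Write = Tuple[int, int]           # (byte address, word value)
Packet = Tuple[int, List[int]]    # (base byte address, consecutive word values)


def coalesce_writes(writes: List[Write], word_size: int = 4,
                    max_words: int = 8) -> List[Packet]:
    """Merge remote stores to consecutive addresses into packets, in issue order.

    A store joins the current packet only if it targets the word right after
    the packet's last word and the packet is not yet full (assumed policy).
    """
    packets: List[Packet] = []
    for addr, value in writes:
        if packets:
            base, values = packets[-1]
            next_addr = base + len(values) * word_size
            if addr == next_addr and len(values) < max_words:
                values.append(value)   # extend the open packet
                continue
        packets.append((addr, [value]))  # start a new packet
    return packets
```

For example, under these assumptions, `coalesce_writes([(0x100, 1), (0x104, 2), (0x108, 3), (0x200, 9)])` yields two packets instead of four separate messages: one carrying three words starting at `0x100` and one carrying a single word at `0x200`.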
