Proceedings of the 9th International Conference on Supercomputing (ICS '95), 1995
DOI: 10.1145/224538.224569

Data forwarding in scalable shared-memory multiprocessors

Abstract: Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache, it sends a copy of the datum to the caches of the processors that the compiler identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches. This paper addresses two m…
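To make the mechanism concrete, here is a minimal C sketch of a producer whose stores are forwarded to a compiler-identified consumer set. The store_and_forward() primitive and the bitmask encoding of consumers are assumptions for illustration, not an interface from the paper; the stub performs only the local store, with the hardware forwarding step left as a comment.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical primitive (not from the paper): write 'value' through the
 * local cache, then push a copy of the line to the caches of the
 * processors named in the 'consumers' bitmask. Stubbed here as a plain
 * store; on a forwarding machine the push would happen in hardware. */
static void store_and_forward(volatile int64_t *addr, int64_t value,
                              uint32_t consumers)
{
    *addr = value;       /* update the producer's own cache     */
    (void)consumers;     /* hardware would forward the line here */
}

/* Producer loop: suppose the compiler has identified processors 2 and 5
 * as the consumers of a[], so every store is forwarded to their caches. */
void produce(volatile int64_t *a, int n)
{
    const uint32_t consumers = (1u << 2) | (1u << 5);
    for (int i = 0; i < n; i++)
        store_and_forward(&a[i], (int64_t)i * i, consumers);
}

int main(void)
{
    int64_t a[8] = {0};
    produce(a, 8);
    printf("a[3] = %lld\n", (long long)a[3]);
}
```

On a forwarding machine, the consumer processors would later hit on a[i] in their own caches instead of taking long-latency misses.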

Cited by 41 publications (13 citation statements), published 1996–2017
References 18 publications (14 reference statements)
“…A smart buffer compiler exploits the fact that the input data on consecutive calls to a given co-processor frequently share items with previous calls; these items do not need to be copied. Similar techniques for propagating values in shared memory multiprocessors, such as data forwarding [15], can be used. CUBA allows data to be hosted by the co-processor local storage and uses a hybrid write-through/write-back L2 cache policy.…”
Section: Data Transfer (mentioning)
confidence: 99%
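The residency test implied by the excerpt above can be sketched in a few lines of C. The slot table and the copy_to_coproc() primitive are hypothetical, not CUBA's actual interface; the point is only that an item shared with the previous call is detected and the copy skipped.

```c
#include <stdbool.h>
#include <stddef.h>

#define SLOTS 8

/* Illustrative bookkeeping (names are ours): remember which host items
 * already sit in the co-processor's local storage, so a consecutive call
 * copies only the items it does not share with the previous call. */
static const void *resident[SLOTS];

static bool is_resident(const void *item)
{
    for (size_t i = 0; i < SLOTS; i++)
        if (resident[i] == item)
            return true;
    return false;
}

void stage_argument(size_t slot, const void *item)
{
    if (is_resident(item))
        return;                  /* shared with a previous call: no copy */
    /* copy_to_coproc(slot, item);  hypothetical transfer primitive */
    resident[slot] = item;
}

int main(void)
{
    int x = 1, y = 2;
    stage_argument(0, &x);       /* first call: x must be copied */
    stage_argument(1, &y);       /* first call: y must be copied */
    stage_argument(0, &x);       /* next call shares x: no copy  */
}
```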
“…The decoupling of correctness and performance provides an opportunity to reduce the number of cache misses by predictively pushing data between system components. This predictive transfer of data can be triggered by a coherence protocol predictor [1,21,35], by software (e.g., the KSR1's "poststore" [37] and DASH's "deliver" [24]), or by allowing the memory to push data into processor caches. Since Token Coherence allows data and tokens to be transferred between system components without affecting correctness, these schemes are easily implemented correctly as part of a performance protocol.…”
Section: Other Performance Protocol Opportunities (mentioning)
confidence: 99%
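One simple instance of the coherence-protocol-predictor trigger mentioned in this excerpt is a last-consumer predictor: remember who last read each line, and push the line there after the next write completes. A toy C sketch, with every name and structure assumed rather than taken from the Token Coherence work:

```c
#define LINES 1024

static int last_reader[LINES];          /* -1 = no prediction yet */

void init_predictor(void)
{
    for (int i = 0; i < LINES; i++)
        last_reader[i] = -1;
}

void record_read_miss(int line, int reader)  /* called on a remote read */
{
    last_reader[line] = reader;
}

void on_write_complete(int line, int writer)
{
    int target = last_reader[line];
    if (target >= 0 && target != writer) {
        /* push_line(line, target);  hypothetical transfer primitive.
         * Under Token Coherence a mispredicted push cannot violate
         * correctness; it only costs bandwidth. */
    }
}

int main(void)
{
    init_predictor();
    record_read_miss(7, /*reader=*/3);   /* CPU 3 misses on line 7      */
    on_write_complete(7, /*writer=*/0);  /* CPU 0's write would push    */
                                         /* line 7 toward CPU 3's cache */
}
```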
“…In distributed DDM applications, remote memory accesses are introduced, resulting from producer and consumer DThreads running on different nodes. The distributed FREDDO implementation provides implicit data forwarding [36] to the node where the consumer DThread is scheduled to run. In particular, a consumer DThread can be scheduled for execution only when all of its input data are available in the main memory.…”
(mentioning)
confidence: 99%
“…In particular, a consumer DThread can be scheduled for execution only when all of its input data are available in the main memory. This helps to reduce memory latencies [36]. FREDDO is publicly available for download in [42]. Distributed FREDDO provides implicit data forwarding through a distributed shared memory (DSM) implementation [54] with a shared global address space (GAS).…”
(mentioning)
confidence: 99%
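The scheduling rule in these excerpts, that a consumer DThread fires only when all of its inputs have arrived, is essentially a dataflow ready count. A minimal C11 sketch under that reading (the dthread type and the function names are ours, not FREDDO's API):

```c
#include <stdatomic.h>
#include <stdio.h>

/* A consumer becomes runnable only after every producer has delivered
 * its input, i.e. when the pending-input count reaches zero. */
typedef struct {
    atomic_int pending_inputs;   /* producers that have not delivered */
    void (*body)(void);          /* work to run once all inputs arrive */
} dthread;

static void input_delivered(dthread *t)
{
    if (atomic_fetch_sub(&t->pending_inputs, 1) == 1)
        t->body();               /* last input arrived: fire the thread */
}

static void consumer_body(void) { puts("consumer DThread runs"); }

int main(void)
{
    dthread c;
    c.body = consumer_body;
    atomic_init(&c.pending_inputs, 2);
    input_delivered(&c);         /* producer 1 delivered its data        */
    input_delivered(&c);         /* producer 2 delivered: consumer fires */
}
```

In distributed FREDDO the delivery step would also imply the data has been forwarded into the consumer node's memory; here it is only counted.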