2018
DOI: 10.1109/tc.2017.2777863
On-Chip Communication Network for Efficient Training of Deep Convolutional Networks on Heterogeneous Manycore Systems

Abstract: Convolutional Neural Networks (CNNs) have shown a great deal of success in diverse application domains including computer vision, speech recognition, and natural language processing. However, as the size of datasets and the depth of neural network architectures continue to grow, it is imperative to design high-performance and energy-efficient computing hardware for training CNNs. In this paper, we consider the problem of designing specialized CPU-GPU based heterogeneous manycore systems for energy-efficient tr…

Cited by 73 publications (37 citation statements); references 42 publications.
“…Due to their heterogeneity, CPU-GPU based systems exhibit several interesting traffic characteristics; for instance, GPUs typically communicate only with a few shared last-level caches (LLCs), which results in many-to-few traffic patterns (i.e., many GPUs communicate with a few LLCs) with negligible inter-GPU communication [7], [13], [14]. This can cause the LLCs to become bandwidth bottlenecks under heavy network loads and lead to significant performance degradation [7]. In addition, since heterogeneous systems share memory resources, the GPUs can monopolize the memory and cause high CPU memory access latency [15].…”
Section: D. Heterogeneous NoCs
confidence: 99%
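The many-to-few pattern described in this citation statement can be illustrated with a small synthetic traffic generator. This is a minimal sketch: the node counts, the uniform destination choice, and the name `many_to_few_traffic` are assumptions for illustration, not details taken from the cited papers.

```python
import random

def many_to_few_traffic(num_gpus, num_llcs, num_packets, seed=0):
    """Generate a synthetic many-to-few traffic trace: every packet
    originates at one of many GPU nodes and targets one of a few shared
    last-level cache (LLC) nodes, with no inter-GPU packets.
    (Illustrative sketch; uniform selection is an assumption.)"""
    rng = random.Random(seed)
    trace = []
    for _ in range(num_packets):
        src = rng.randrange(num_gpus)   # many possible sources (GPUs)
        dst = rng.randrange(num_llcs)   # few possible destinations (LLCs)
        trace.append((f"gpu{src}", f"llc{dst}"))
    return trace

# Each LLC aggregates traffic from many GPUs, so per-LLC load grows with
# the GPU count and the LLC links are the first to saturate under load.
trace = many_to_few_traffic(num_gpus=32, num_llcs=4, num_packets=10_000)
per_llc = {}
for _, dst in trace:
    per_llc[dst] = per_llc.get(dst, 0) + 1
print(per_llc)  # roughly 2,500 packets per LLC across the 4 LLCs
```

Counting packets per destination makes the bottleneck visible: with 32 sources funneling into 4 sinks, each LLC sees about eight times the per-GPU injection load.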
“…Due to the heterogeneity of the cores integrated on a single chip, the communication requirements for each core can vary significantly. For example, in a CPU-GPU based heterogeneous system, CPUs require low memory latency while GPUs need high-throughput data transfers [7]. In addition to the individual core requirements, 3D ICs allow dense circuit integration but have much higher power density than their 2D counterparts.…”
Section: Introduction
confidence: 99%
“…This does not invalidate the potential of the WNoC paradigm, but it leads to erroneous assumptions about the achievable speed and power. For instance, many WNoC architectures assume rates over 10 Gb/s [12], [27], [28], which may not be achievable due to multipath effects. Other works obtain power consumption estimates by assuming path losses between 25 and 30 dB [36]–[39], values that are far from realistic in standard chip packages.…”
Section: Introduction
confidence: 99%
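The sensitivity to the path-loss assumption criticized in this citation statement can be checked with a simple Shannon-capacity link budget. This is an illustrative sketch: the transmit power, bandwidth, noise figure, and the pessimistic 70 dB loss are assumed values for comparison, not figures from the cited works.

```python
import math

def achievable_rate_gbps(pt_dbm, path_loss_db, bandwidth_hz, noise_figure_db=10.0):
    """Shannon-capacity estimate for an on-chip wireless link.
    pt_dbm: transmit power; path_loss_db: assumed channel loss.
    All numeric inputs are illustrative assumptions, not measurements."""
    pr_dbm = pt_dbm - path_loss_db                                 # received power
    noise_dbm = -174 + 10 * math.log10(bandwidth_hz) + noise_figure_db  # thermal noise floor
    snr = 10 ** ((pr_dbm - noise_dbm) / 10)                        # linear SNR
    return bandwidth_hz * math.log2(1 + snr) / 1e9                 # capacity in Gb/s

# With an optimistic 30 dB path loss, a 20 GHz channel clears 10 Gb/s easily...
print(achievable_rate_gbps(pt_dbm=0, path_loss_db=30, bandwidth_hz=20e9))
# ...but a far lossier in-package channel drops below the 10 Gb/s target.
print(achievable_rate_gbps(pt_dbm=0, path_loss_db=70, bandwidth_hz=20e9))
```

The point of the sketch is the gap between the two calls: the same radio that looks comfortable under a 25-30 dB loss assumption falls short of 10 Gb/s once the loss grows by a few tens of dB, which is the kind of error the quoted passage warns about.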
“…Figures 12, 13, and 14 show the achieved network throughput for all three traffic patterns. As with latency, the network throughput also tracks the packet injection rate.…”
confidence: 99%
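The throughput-tracks-injection behavior noted in this citation statement can be captured by a toy saturation model: below the network's capacity, accepted throughput equals the injection rate; beyond it, throughput flattens. The `capacity` value here is a hypothetical per-node limit, not a number from the paper.

```python
def accepted_throughput(injection_rate, capacity):
    """Toy model of NoC throughput in flits/node/cycle: delivery tracks
    injection until the (assumed) saturation point, then plateaus."""
    return min(injection_rate, capacity)

# Sweep injection rates past an assumed saturation point of 0.25.
for rate in [0.05, 0.10, 0.20, 0.40]:
    print(rate, accepted_throughput(rate, capacity=0.25))
```

Real networks degrade more gracefully near saturation (and can even lose throughput past it due to congestion), but the linear-then-flat shape is the baseline against which the figures' curves are read.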