James Balfour scite author profile

We develop detailed area and energy models for on-chip interconnection networks and describe tradeoffs in the design of efficient networks for tiled chip multiprocessors. Using these detailed models, we investigate how aspects of the network architecture, including topology, channel width, routing strategy, and buffer size, affect performance and impact area and energy efficiency. We simulate the performance of a variety of on-chip networks designed for tiled chip multiprocessors implemented in an advanced VLSI process and compare area and energy efficiencies estimated from our models. We demonstrate that the introduction of a second parallel network can increase performance while improving efficiency, and evaluate different strategies for distributing traffic over the subnetworks. Drawing on insights from our analysis, we present a concentrated mesh topology with replicated subnetworks and express channels that provides a 24% improvement in area efficiency and a 48% improvement in energy efficiency over other networks evaluated in this study.


Flattened Butterfly Topology for On-Chip Networks

Kim, Balfour, Dally

2007

106

196


With the trend towards an increasing number of cores in chip multiprocessors, the on-chip interconnect that connects the cores needs to scale efficiently. In this work, we propose the use of high-radix networks in on-chip interconnection networks and describe how the flattened butterfly topology can be mapped to on-chip networks. By using high-radix routers to reduce the diameter of the network, the flattened butterfly offers lower latency and energy consumption than conventional on-chip topologies. In addition, by exploiting the two-dimensional planar VLSI layout, the on-chip flattened butterfly can exploit bypass channels such that non-minimal routing can be used with minimal impact on latency and energy consumption. We evaluate the flattened butterfly and compare it to alternate on-chip topologies using synthetic traffic patterns and traces, and show that the flattened butterfly can increase throughput by up to 50% compared to a concentrated mesh and reduce latency by 28% while reducing power consumption by 38% compared to a mesh network.
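The diameter reduction mentioned in this abstract can be illustrated with a small hop-count comparison. The sketch below is not taken from the paper: it assumes a hypothetical 4×4 grid of routers, minimal routing, and a two-dimensional flattened butterfly in which every router has a direct channel to every other router in its row and in its column. Concentration, channel length, and router pipeline depth are ignored, so the numbers describe hop counts only, not latency or energy.

```python
from itertools import product

def mesh_hops(src, dst):
    """Minimal hop count in a 2D mesh: one hop per grid step (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def fbfly_hops(src, dst):
    """Minimal hop count in a 2D flattened butterfly (assumed fully connected
    rows and columns): at most one hop per dimension that differs."""
    return int(src[0] != dst[0]) + int(src[1] != dst[1])

def hop_stats(k, hops):
    """Return (diameter, average hops) over all ordered pairs of distinct routers."""
    routers = list(product(range(k), repeat=2))
    counts = [hops(s, d) for s in routers for d in routers if s != d]
    return max(counts), sum(counts) / len(counts)

K = 4  # illustrative grid size, an assumption rather than a value from the abstract
mesh_diameter, mesh_avg = hop_stats(K, mesh_hops)
fbfly_diameter, fbfly_avg = hop_stats(K, fbfly_hops)

print(f"mesh:               diameter={mesh_diameter}, average hops={mesh_avg:.2f}")
print(f"flattened butterfly: diameter={fbfly_diameter}, average hops={fbfly_avg:.2f}")
```

Under these assumptions the flattened butterfly never needs more than two router-to-router hops, whereas the mesh diameter grows with the grid size. The price is a higher router radix and longer channels, which this sketch does not model.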


A detailed and flexible cycle-accurate Network-on-Chip simulator

et al. 2013


Network-on-Chips (NoCs) are becoming integral parts of modern microprocessors as the number of cores and modules integrated on a single chip continues to increase. Research and development of future NoC technology relies on accurate modeling and simulations to evaluate the performance impact and analyze the cost of novel NoC architectures. In this work, we present BookSim, a cycle-accurate simulator for NoCs. The simulator is designed for simulation flexibility and accurate modeling of network components. It features a modular design and offers a large set of configurable network parameters in terms of topology, routing algorithm, flow control, and router microarchitecture, including buffer management and allocation schemes. BookSim furthermore emphasizes detailed implementations of network components that accurately model the behavior of actual hardware. We have validated the accuracy of the simulator against RTL implementations of NoC routers.


Efficient Embedded Computing

Dally, Balfour, Black-Shaffer

et al. 2008

Computer

134


Embedded computing applications demand both efficiency and flexibility: The bulk of computation today happens not in desktops, laptops, or data centers, but rather in embedded media devices. More than one billion cell phones are sold each year, and a 3G cell phone performs more operations per second than a typical desktop CPU. Media devices like cell phones, video cameras, and digital televisions perform more computations than all but the fastest supercomputers at power levels orders of magnitude lower than general-purpose desktop and laptop machines. For example, a 3G mobile phone receiver requires 35 to 40 giga operations per second (GOPS) of performance to handle a 14.4-Mbps channel, and researchers estimate the requirements for a 100-Mbps orthogonal frequency-division multiplexing (OFDM) channel at between 210 and 290 GOPS. In contrast, a typical desktop computer system has a peak performance of a few GOPS and sustains far less on most applications. A cell phone's computing challenges are even more impressive when we consider that these performance levels must be achieved in a small handheld package with a maximum power dissipation of about 1 W. Simple arithmetic gives a required efficiency of 25 mW/GOPS or 25 pJ/op for the 3G receiver and 3-5 pJ/op for the OFDM receiver. Demanding performance and efficiency requirements drive most media devices to perform their computations with hardwired logic in the form of an application-specific integrated circuit (ASIC). A carefully designed ASIC can achieve an efficiency of 5 pJ/op in a 90-nm CMOS technology [2]. In contrast, very efficient embedded processors and DSPs require about 250 pJ/op [3] (50X more energy than an ASIC), and a popular laptop processor requires 20 nJ/op [4] (4,000X more energy than an ASIC). The efficiency of these programmable processors is simply inadequate for demanding embedded applications, forcing designers to use hardwired logic to keep energy demands within limits. While ASICs meet the energy-efficiency demands of embedded applications, they are difficult to design and inflexible. It takes two years to design a typical ASIC, and the cost is $20 million or more. This high cost places ASIC efficiency out of reach for all but the highest-volume applications. The long design cycle causes ASICs to lag far behind the latest developments in algorithms, modems, and codecs. Inflexibility increases an ASIC's area and complexity. If a system must support several air interfaces, for example, an ASIC implementation instantiates separate hardwired modems for each interface, even though only one will be used at any time. If it meets the efficiency requirement, a programmable processor can use a single hardware resource to implement all the interfaces by running different software. As media applications evolve and become more complex, the problems of ASICs become larger. Hardwired ASICs, 50X more efficient than programmable processors, sacrifice programmability to meet the efficiency requirements of demanding embedded systems. Programmable processors use energy mostly to ...
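The "simple arithmetic" in this abstract can be reproduced in a few lines. The sketch below uses only figures quoted above (a power budget of about 1 W, 40 GOPS for the 3G receiver, 210 to 290 GOPS for the OFDM receiver, and the 5 pJ/op, 250 pJ/op, and 20 nJ/op efficiencies); the function and variable names are illustrative and do not come from the article.

```python
def pj_per_op(power_watts: float, gops: float) -> float:
    """Energy available per operation, in picojoules, for a given power
    budget (watts) and sustained throughput (giga operations per second)."""
    joules_per_op = power_watts / (gops * 1e9)
    return joules_per_op * 1e12

POWER_BUDGET_W = 1.0  # handheld power limit quoted in the abstract

# 3G receiver at the upper end of its 35-40 GOPS requirement.
receiver_3g = pj_per_op(POWER_BUDGET_W, 40)

# OFDM receiver: the 210-290 GOPS range gives a range of energy budgets.
ofdm_low = pj_per_op(POWER_BUDGET_W, 290)
ofdm_high = pj_per_op(POWER_BUDGET_W, 210)

# Efficiencies quoted in the abstract, all expressed in pJ/op.
ASIC_PJ = 5.0
DSP_PJ = 250.0
LAPTOP_PJ = 20_000.0  # 20 nJ/op

print(f"3G receiver budget:   {receiver_3g:.1f} pJ/op")
print(f"OFDM receiver budget: {ofdm_low:.1f} to {ofdm_high:.1f} pJ/op")
print(f"DSP vs ASIC:          {DSP_PJ / ASIC_PJ:.0f}X")
print(f"Laptop CPU vs ASIC:   {LAPTOP_PJ / ASIC_PJ:.0f}X")
```

The unit identity behind the calculation is that 1 mW per GOPS equals 1 pJ per operation, which is why the abstract can quote the 3G budget in either form.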


Design tradeoffs for tiled CMP on-chip networks

Balfour, Dally

2014


