2016
DOI: 10.1145/3007787.3001155

Energy efficient architecture for graph analytics accelerators

Abstract: Specialized hardware accelerators can significantly improve the performance and power efficiency of compute systems. In this paper, we focus on hardware accelerators for graph analytics applications and propose a configurable architecture template that is specifically optimized for iterative vertex-centric graph applications with irregular access patterns and asymmetric convergence. The proposed architecture addresses the limitations of the existing multi-core CPU and GPU architectures for these types of appli…
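To make the targeted workload class concrete, the following is a minimal, illustrative sketch (not code from the paper) of an iterative vertex-centric graph kernel with irregular access patterns and asymmetric convergence: a worklist-based single-source shortest paths in C++, where only vertices whose values changed in the previous iteration are revisited, so per-iteration work shrinks as parts of the graph converge. The CSR layout, the worklist, and the relaxation rule are illustrative assumptions, not the paper's architecture.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Minimal CSR (compressed sparse row) graph: row_ptr indexes into col_idx/weight.
struct CsrGraph {
    std::vector<uint32_t> row_ptr;   // size = num_vertices + 1
    std::vector<uint32_t> col_idx;   // destination vertex of each edge
    std::vector<uint32_t> weight;    // edge weights
};

// Worklist-based single-source shortest paths: an iterative, vertex-centric
// kernel with asymmetric convergence -- only vertices whose distance changed
// in the previous iteration are re-examined, so the active set (and the work
// per iteration) shrinks as regions of the graph converge.
std::vector<uint32_t> sssp(const CsrGraph& g, uint32_t source) {
    const uint32_t n = static_cast<uint32_t>(g.row_ptr.size()) - 1;
    const uint32_t INF = std::numeric_limits<uint32_t>::max();
    std::vector<uint32_t> dist(n, INF);
    dist[source] = 0;

    std::vector<uint32_t> active = {source};      // current frontier
    while (!active.empty()) {
        std::vector<uint32_t> next;               // vertices updated this round
        for (uint32_t u : active) {
            // Irregular access pattern: neighbor lists are scattered in memory.
            for (uint32_t e = g.row_ptr[u]; e < g.row_ptr[u + 1]; ++e) {
                uint32_t v = g.col_idx[e];
                uint32_t cand = dist[u] + g.weight[e];
                if (cand < dist[v]) {             // asymmetric convergence:
                    dist[v] = cand;               // only changed vertices stay active
                    next.push_back(v);
                }
            }
        }
        active.swap(next);
    }
    return dist;
}
```

The scattered neighbor reads in the inner loop dominate the runtime of such kernels, which is the access pattern the proposed accelerator template is built around.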

Citations: cited by 63 publications (67 citation statements)
References: 31 publications (38 reference statements)
“…Google's Tensor Processing Unit described above is an example that is targeted for neural network applications. Other workloads of interest that may justify ASIC accelerators include cryptography [26], compression [27], machine learning [28], database [29], and large-scale graph processing [3,30,31].…”
Section: Google's Tensor Processing Unit (mentioning)
Confidence: 99%
“…This leads to high power consumption by 10+ superscalar cores while not doing useful work. It was shown that custom architectures that target such communication patterns have the potential to improve power efficiency by a factor of 50x or more compared to general-purpose CPUs [3].…”
(mentioning)
Confidence: 99%
“…Instead, an accelerator is expected to generate many concurrent DRAM requests to be able to hide long (typically hundreds of cycles) latencies and fully utilize the available DRAM bandwidth. It has been shown that hardware accelerators can operate at power levels that are much lower than the state-of-the-art multi-core CPUs [26].…”
Section: Introduction (mentioning)
Confidence: 99%
“…For this reason, the application-specific accelerators designed using the HLS methodology of this section will be used as a baseline in our experiments (Section VI). IV. PROPOSED ARCHITECTURE: The preliminary version of this paper proposed several microarchitectural features to achieve both high throughput and high work efficiency for asynchronous execution of graph applications [26]. The basic idea is to process tens to hundreds of vertices/edges concurrently in order to hide long access latencies to main memory.…”
Section: Introduction (mentioning)
Confidence: 99%
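The two excerpts above make the same point: throughput comes from keeping many independent memory requests in flight rather than from a fast single thread. The sketch below is an illustrative software analogue (not from the paper, and relying on the GCC/Clang-specific __builtin_prefetch builtin): a frontier is processed in fixed-size batches, with all prefetches for a batch issued before any element is consumed, so several DRAM accesses overlap. The batch size and data layout are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: process a frontier of vertices in fixed-size batches,
// issuing prefetches for the whole batch before touching any element, so that
// many independent memory requests are outstanding at once.
void process_frontier(const std::vector<uint32_t>& frontier,
                      const std::vector<uint64_t>& vertex_data,
                      std::vector<uint64_t>& out) {
    constexpr std::size_t kBatch = 64;  // assumed number of in-flight requests
    for (std::size_t base = 0; base < frontier.size(); base += kBatch) {
        const std::size_t end = std::min(base + kBatch, frontier.size());
        // Phase 1: issue all requests for the batch (no dependent use yet).
        // __builtin_prefetch is a GCC/Clang builtin; it only hints the hardware.
        for (std::size_t i = base; i < end; ++i)
            __builtin_prefetch(&vertex_data[frontier[i]]);
        // Phase 2: consume the data; by now several loads overlap in the
        // memory system instead of being serviced one after another.
        for (std::size_t i = base; i < end; ++i)
            out.push_back(vertex_data[frontier[i]]);
    }
}
```

A hardware accelerator takes this further by tracking tens to hundreds of in-flight vertices/edges explicitly rather than relying on prefetch hints, which is the mechanism the excerpted passages attribute to the proposed architecture.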