Wookeun Jung scite author profile

State-of-the-art ARM processors provide multiple cores and SIMD instructions. OpenCL is a promising programming model for exploiting this parallel processing capability because of its SPMD programming model and built-in vector support. Moreover, it provides portability between multicore ARM processors and accelerators in embedded systems. In this paper, we introduce the design and implementation of an efficient OpenCL framework for multicore ARM processors. Our framework runs the computational tasks of a program, implemented as OpenCL kernels, on all CPU cores in parallel. Vector operations and built-in functions in OpenCL kernels are optimized using the NEON SIMD instruction set. We evaluate our OpenCL framework using 37 benchmark applications. The results show that our approach is effective and promising.
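The SPMD execution model mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the framework described in the paper and does not use any OpenCL API: the names `vec_add_kernel`, `run_work_group`, and `enqueue_ndrange` are hypothetical, and a thread pool stands in for CPU cores. The sketch only shows the idea that every work-item runs the same kernel body with its own global ID, and that work-groups are handed out to workers.

```python
from concurrent.futures import ThreadPoolExecutor


def vec_add_kernel(gid, a, b, out):
    """Kernel body: every work-item runs this with its own global ID."""
    out[gid] = a[gid] + b[gid]


def run_work_group(kernel, group_id, local_size, global_size, args):
    """Run all work-items of one work-group sequentially on one worker."""
    start = group_id * local_size
    for gid in range(start, min(start + local_size, global_size)):
        kernel(gid, *args)


def enqueue_ndrange(kernel, global_size, local_size, args, workers=4):
    """Hand out work-groups to a pool of workers (a stand-in for CPU cores)."""
    num_groups = (global_size + local_size - 1) // local_size
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_work_group, kernel, g, local_size, global_size, args)
            for g in range(num_groups)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker


a = list(range(10))
b = [10] * 10
out = [0] * 10
enqueue_ndrange(vec_add_kernel, global_size=10, local_size=4, args=(a, b, out))
```

In CPython the threads do not run the kernel body truly in parallel; the sketch shows the decomposition into work-groups and work-items, not the performance behavior.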


Fa3c

Cho, Park, et al. 2019


Versatile and scalable parallel histogram construction

Jung, Park, Lee 2014


Histograms are used in various fields to quickly profile the distribution of a large amount of data. However, it is challenging to efficiently utilize abundant parallel resources in modern processors for histogram construction. To make matters worse, the most efficient implementation varies depending on input parameters (e.g., input distribution, number of bins, and data type) or architecture parameters (e.g., cache capacity and SIMD width). This paper presents versatile histogram methods that achieve competitive performance across a wide range of input types and target architectures. Our open source implementations are highly optimized for various cases and are scalable for more threads and wider SIMD units. We also show that histogram construction can be significantly accelerated by Intel® Xeon Phi™ coprocessors for common input data sets because of their compute power from many cores and instructions for efficient vectorization, such as gather-scatter. For histograms with 256 fixed-width bins, a dual-socket 8-core Intel® Xeon® E5-2690 achieves 13 billion bin updates per second (GUPS), while a 60-core Intel® Xeon Phi™ 5110P coprocessor achieves 18 GUPS for a skewed input. For histograms with 256 variable-width bins, the Xeon processor achieves 4.7 GUPS, while the Xeon Phi coprocessor achieves 9.7 GUPS for a skewed input. For text histogram, or word count, the Xeon processor achieves 342.4 million words per second (MWPS). This is 4.12× and 3.46× faster than Phoenix and TBB, respectively. The Xeon Phi coprocessor achieves 401.4 MWPS, which is 1.17× faster than the Xeon processor. Since histogram construction captures essential characteristics of more general reduction-heavy operations, our approach can be extended to other settings.
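As a rough illustration of the problem the abstract describes, the following Python sketch builds a histogram with per-worker private copies that are reduced at the end, for both fixed-width and variable-width bins. This is a generic privatization scheme chosen for illustration, not the paper's implementation; all function names and the `workers` parameter are assumptions.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def fixed_width_bin(value, lo, hi, num_bins):
    """Map a value in [lo, hi) to the index of an equal-width bin."""
    idx = int((value - lo) * num_bins / (hi - lo))
    return min(max(idx, 0), num_bins - 1)  # clamp out-of-range values


def variable_width_bin(value, edges):
    """Map a value to a bin by binary search over sorted bin edges."""
    idx = bisect_right(edges, value) - 1
    return min(max(idx, 0), len(edges) - 2)  # clamp out-of-range values


def partial_histogram(chunk, num_bins, bin_of):
    """Build a private histogram for one chunk; nothing is shared."""
    hist = [0] * num_bins
    for value in chunk:
        hist[bin_of(value)] += 1
    return hist


def parallel_histogram(data, num_bins, bin_of, workers=4):
    """Split the input, histogram each chunk privately, then reduce."""
    size = max(1, (len(data) + workers - 1) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(
            pool.map(lambda c: partial_histogram(c, num_bins, bin_of), chunks)
        )
    total = [0] * num_bins
    for hist in partials:  # reduction step: sum the private histograms
        for i, count in enumerate(hist):
            total[i] += count
    return total


fixed = parallel_histogram(
    list(range(256)) * 3, 256, lambda v: fixed_width_bin(v, 0, 256, 256)
)
edges = [0, 10, 100, 1000]  # three variable-width bins
variable = parallel_histogram(
    [5, 10, 50, 999, 0], 3, lambda v: variable_width_bin(v, edges), workers=2
)
```

Private histograms avoid contended updates to shared bins at the cost of extra memory and a final reduction, which is one reason the best strategy can depend on the number of bins and the cache capacity. In CPython the threads here do not give a real speedup; the sketch shows the structure only.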


Transparent GPU memory management for DNNs

Park, Cho, Jung, et al. 2018



Wookeun Jung

Performance analysis of CNN frameworks for GPUs

OpenCL framework for ARM processors with NEON support

Fa3c

Versatile and scalable parallel histogram construction

Transparent GPU memory management for DNNs
