Langshi Chen scite author profile

In the age of Big Data, parallel graph processing has become a critical technique for analyzing and understanding connected data. Meanwhile, Moore's Law continues by integrating more cores into a single chip in the deep-nano regime. Many-Integrated-Core (MIC) processors have emerged as a promising solution for processing large graphs. In this paper, we empirically evaluate various computing platforms, including an Intel Xeon E5 CPU, an Nvidia Tesla P40 GPU, and a Xeon Phi 7210 MIC processor codenamed Knights Landing (KNL), in the domain of parallel graph processing. We show that the KNL achieves encouraging performance and power efficiency when processing graphs, making it a promising alternative to traditional CPUs and GPUs. We further characterize the impact of KNL architectural enhancements on the performance of a state-of-the-art graph framework. We have four key observations: ❶ Different graph applications require different numbers of threads to reach peak performance. Even for the same application, different datasets need different numbers of threads to achieve the best performance. ❷ Not all graph applications benefit from high-bandwidth MCDRAM; some of them favor low-latency DDR4 DRAM. ❸ Vector processing units executing AVX512 SIMD instructions on KNLs are underutilized when running the state-of-the-art graph framework. ❹ The sub-NUMA cache clustering mode, which offers the lowest local memory access latency, hurts the performance of graph benchmarks that lack NUMA awareness. Finally, we suggest future work, including system auto-tuning tools and graph framework optimizations, to fully exploit the potential of KNL for parallel graph processing.


PICASSO: Unleashing the Potential of GPU-centric Training for Wide-and-deep Recommender Systems

Zhang, Chen, Yang et al. 2022. Preprint


The development of personalized recommendation has significantly improved the accuracy of information matching and the revenue of e-commerce platforms. Two trends have recently emerged: 1) recommender systems must be trained in a timely manner to cope with ever-growing new products and ever-changing user interests from online marketing and social networks; 2) state-of-the-art recommendation models introduce deep neural network (DNN) modules to improve prediction accuracy. Traditional CPU-based recommender systems cannot keep up with these two trends, and GPU-centric training has become a popular approach. However, we observe that GPU devices are underutilized when training recommender systems, and they do not attain the throughput improvement that GPU-centric training has achieved in the Computer Vision (CV) and Natural Language Processing (NLP) areas. This issue can be explained by two characteristics of these recommendation models: first, they contain up to a thousand input feature fields, introducing fragmentary and memory-intensive operations; second, the multiple constituent feature interaction submodules introduce a large number of small compute kernels. To remove this roadblock to the development of recommender systems, we propose a novel framework named PICASSO to accelerate the training of recommendation models on commodity hardware. Specifically, we conduct a systematic analysis to reveal the bottlenecks encountered in training recommendation models. We leverage the model structure and data distribution to unleash the potential of the hardware through our packing, interleaving, and caching optimizations. Experiments show that PICASSO increases hardware utilization by an order of magnitude over state-of-the-art baselines and brings up to 6× throughput improvement for a variety of industrial recommendation models. Using the same hardware budget in production, PICASSO on average shortens the walltime of daily training tasks by 7 hours, significantly reducing the delay of continuous delivery.



Langshi Chen

Benchmarking Harp-DAAL: High Performance Hadoop on KNL Clusters

HarpGBDT: Optimizing Gradient Boosting Decision Tree for Parallel Efficiency

Finding and Counting Tree-Like Subgraphs Using MapReduce

Performance Characterization of Multi-threaded Graph Processing Applications on Many-Integrated-Core Architecture

PICASSO: Unleashing the Potential of GPU-centric Training for Wide-and-deep Recommender Systems
