Distributed deep learning training usually adopts All-Reduce as the synchronization mechanism for data-parallel algorithms due to its high performance in homogeneous environments. However, its performance is bounded by the slowest worker, making it significantly slower in heterogeneous settings. AD-PSGD, a recently proposed synchronization method that offers fast numerical convergence and heterogeneity tolerance, suffers from deadlock issues and high synchronization overhead. Is it possible to get the best of both worlds: a distributed training method with both the high performance of All-Reduce in homogeneous environments and the heterogeneity tolerance of AD-PSGD? In this paper, we propose Ripples, a high-performance heterogeneity-aware asynchronous decentralized training approach. We achieve this goal through intensive synchronization optimization, emphasizing the interplay between the algorithm and the system implementation. To reduce synchronization cost, we propose a novel communication primitive, Partial All-Reduce, that allows a large group of workers to synchronize quickly. To reduce synchronization conflicts, we propose static group scheduling for homogeneous environments and simple techniques (Group Buffer and Group Division) that avoid conflicts at the cost of slightly reduced randomness. Our experiments show that in a homogeneous environment, Ripples is 1.1× faster than the state-of-the-art implementation of All-Reduce, 5.1× faster than Parameter Server, and 4.3× faster than AD-PSGD. In a heterogeneous setting, Ripples shows a 2× speedup over All-Reduce and still obtains a 3× speedup over the Parameter Server baseline.
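To make the idea of a group-wise synchronization primitive concrete, here is a minimal sketch in the style of PyTorch's distributed API. The helper names (`make_group`, `partial_all_reduce`) are illustrative assumptions; the paper's actual primitive and its group scheduler are not described in detail in the abstract.

```python
# Minimal sketch of a Partial-All-Reduce-style step (hypothetical helpers;
# not the paper's implementation).
import torch
import torch.distributed as dist

def make_group(ranks):
    # dist.new_group is itself collective: every process must call it with
    # the same rank list, typically once during setup.
    return dist.new_group(ranks=ranks)

def partial_all_reduce(params, group, group_size):
    # Average parameters only among the workers in `group`, instead of
    # across the whole world as a global All-Reduce would.
    for p in params:
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM, group=group)
        p.data /= group_size
```

Because only the members of a group block on one another, a straggler delays its own group rather than the entire job, which is how a group-wise primitive can tolerate heterogeneity while still synchronizing many workers at once.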
RealNews. RealNews is a large corpus of news articles restricted to the news domains indexed by Google News. It contains 31 million documents with an average length of 793 BPE tokens. Like C4, it excludes examples with duplicate URLs. News dumps from December 2016 through March 2019 were used as training data, and articles from the April 2019 dump were used for evaluation.

OpenWebText2 (OWT2). OWT2 is an enhanced version of the original OpenWebTextCorpus, adding content in multiple languages, document metadata, multiple dataset versions, and open-source replication code. It covers all Reddit submissions from 2005 up until April 2020.

PubMed Central (PMC). PMC is a free full-text archive of biomedical and life-sciences journal literature maintained by the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). The dataset is updated daily. In addition to full-text articles, it contains corrections, retractions, and expressions of concern, as well as file lists with per-article metadata. The PMC data available through open registration on Amazon Web Services (AWS) comprises the PMC Open Access Subset and the Author Manuscript Dataset. The PMC Open Access Subset includes all articles and preprints in PMC with a machine-readable Creative Commons license that allows reuse. The Author Manuscript Dataset includes accepted author manuscripts collected under a funder policy in PMC and made available in machine-readable formats for text mining.

ArXiv. ArXiv is a repository of 1.7 million articles, with features such as article titles, authors, categories, abstracts, and full-text PDFs. It provides open access to academic articles spanning many subdisciplines, from physics to computer science and everything in between, including mathematics, statistics, electrical engineering, quantitative biology, and economics, which makes it useful for a wide range of downstream research applications. In addition, because most articles are written in LaTeX, the corpus is also valuable for studying how language models handle markup.

Colossal Clean Crawled Corpus (C4). C4 is a colossal, cleaned version of Common Crawl's web-crawl corpus. It is based on the Common Crawl dataset and was used to train the T5 text-to-text Transformer models. The cleaned English version of C4 has 364,868,901 training examples and 364,608 validation examples, while the uncleaned English version has 1,063,805,324 training examples and 1,065,029 validation examples; the realnewslike version has 13,799,838 training examples and 13,863 validation examples, and the webtextlike version has 4,500,788 training examples and 4,493 validation examples.

Wiki-40B. Wikipedia (Wiki-40B) is a cleaned text collection covering more than 40 Wikipedia language editions, with pages corresponding to entities. The dataset is split into train/validation/test sets for each language; the training set has 2,926,536 examples, the validation set 163,597 examples, and the test set 162,274 examples. Wiki-40B is cleaned by a page filter that removes ambiguous, redirected, deleted, and non-entity pages.

CLUECorpus2020. CLUECorpus2020 ...
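For orientation, the sketch below shows one common way to access corpora like these with the Hugging Face `datasets` library. The hub identifiers and configuration names used here ("allenai/c4", "wiki40b", "en") are assumptions and may not match the exact releases described above.

```python
# Illustrative only: streaming C4 and loading Wiki-40B via Hugging Face
# `datasets`. Identifiers/configs are assumptions, not canonical names.
from datasets import load_dataset

# Streaming avoids downloading the multi-terabyte corpus up front.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
wiki = load_dataset("wiki40b", "en", split="validation")

for example in c4.take(3):
    print(example["text"][:80])  # peek at the first few documents
```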
In 2015, China put forward the "Internet Plus" strategy, which has intensified the competition for computer-science graduates across all walks of life. From the perspective of colleges and universities, this paper discusses how to advance this national strategy through the employment of computer-science students.
Domain-Specific Accelerators (DSAs) are being rapidly developed to support high-performance domain-specific computation. Although DSAs provide massive computation capability, they often support only a limited set of native data types. To mitigate this problem, previous work has explored software emulation for certain data types, which partially compensates for the hardware's limitations. However, how to efficiently design more emulated data types, and how to choose a high-performance one for a given application without hurting correctness or precision, remain open problems. To address these challenges, we present Ape, which can 1) provide different strategies for emulating high-bitwidth data types using native data types, with in-depth error analysis; and 2) dynamically and automatically select proper data types and generate efficient code for a given computation at fine granularity, achieving higher performance while maintaining both correctness and precision without human effort. We implement Ape on both NVIDIA Tensor Core and Huawei Ascend. Results show that Ape can boost General Matrix Multiplication and convolution by up to 3.12× and 1.86× on Tensor Core over CUDA Core, and accelerate various applications by up to 1.78× (1.65× on average).
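The abstract does not spell out Ape's emulation schemes, but a well-known strategy in this space splits each high-bitwidth value into a high part plus a residual in the native low-bitwidth type and recombines several low-precision products. The NumPy sketch below illustrates that idea for FP32 emulated via FP16; it is a simplification, and Ape's actual strategies and error analysis are more involved.

```python
# Sketch of split-based emulation of FP32 multiply-accumulate using FP16
# operands (one common strategy, not necessarily Ape's exact scheme).
import numpy as np

def split_fp32(x):
    hi = x.astype(np.float16)                            # coarse part
    lo = (x - hi.astype(np.float32)).astype(np.float16)  # residual
    return hi, lo

def emulated_matmul(a, b):
    a_hi, a_lo = split_fp32(a)
    b_hi, b_lo = split_fp32(b)
    f32 = np.float32
    # Accumulate in FP32, as a Tensor-Core-style unit would; the tiny
    # lo*lo cross term is dropped to save one low-precision multiply.
    return (a_hi.astype(f32) @ b_hi.astype(f32)
            + a_hi.astype(f32) @ b_lo.astype(f32)
            + a_lo.astype(f32) @ b_hi.astype(f32))

a = np.random.rand(64, 64).astype(np.float32)
b = np.random.rand(64, 64).astype(np.float32)
err = np.abs(emulated_matmul(a, b) - a @ b).max()
print(f"max abs error vs. native FP32: {err:.2e}")
```

Dropping the lo*lo term trades a little precision for one fewer low-precision multiply; choosing among such performance/precision trade-offs per computation is exactly the kind of decision Ape automates.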