Xiaoda Zhang scite author profile

The scaling of large language models has greatly improved natural language understanding, generation, and reasoning. In this work, we develop a system that trained a trillion-parameter language model on a cluster of Ascend 910 AI processors with the MindSpore framework, and present the resulting language model, named PanGu-Σ, with 1.085T parameters. With parameters inherited from PanGu-α [1], we extend the dense Transformer model to a sparse one with Random Routed Experts (RRE), and efficiently train the model over 329B tokens by using Expert Computation and Storage Separation (ECSS). This resulted in a 6.3x increase in training throughput through heterogeneous computing. Our experimental findings show that PanGu-Σ provides state-of-the-art performance in zero-shot learning on various Chinese NLP downstream tasks. Moreover, it demonstrates strong abilities when fine-tuned on application data for open-domain dialogue, question answering, machine translation and code generation.


A Survey on Auto-Parallelism of Neural Networks Training

Liang, Tao, Zhang et al. 2022

Preprint


Efficient scheduling for multi-stage coflows

et al. 2019


Towards Reliable (and Efficient) Job Executions in a Practical Geo-distributed Data Analytics System

Zhang, Qian, Zhang et al. 2018

Preprint


Geo-distributed data analytics is increasingly common for deriving useful information in large organisations. Naively extending existing cluster-scale data analytics systems to geo-distributed data centers faces unique challenges, including WAN bandwidth limits, regulatory constraints, changeable and unreliable runtime environments, and high monetary costs. Our goal in this work is to develop a practical geo-distributed data analytics system that (1) employs an intelligent mechanism that lets jobs use resources across data centers efficiently and adapt to the changing environment; (2) guarantees the reliability of jobs in the face of possible failures; and (3) is generic and flexible enough to run a wide range of data analytics jobs without requiring any changes. To this end, we present a new, general geo-distributed data analytics system, HOUTU, that is composed of multiple autonomous systems, each operating in a sovereign data center. HOUTU maintains a job manager (JM) for a geo-distributed job in each data center, so that these replicated JMs can individually and cooperatively manage resources and assign tasks. Our experiments on the prototype of HOUTU running across four Alibaba Cloud regions show that HOUTU provides job performance as efficient as that of the existing centralized architecture, and guarantees reliable job execution in the face of failures.



COBRA: Toward Provably Efficient Semi-Clairvoyant Scheduling in Data Analytics Systems

PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing
