The scaling of large language models has greatly improved natural language understanding, generation, and reasoning. In this work, we develop a system that trains a trillion-parameter language model on a cluster of Ascend 910 AI processors with the MindSpore framework, and present the resulting language model, named PanGu-Σ, with 1.085T parameters. With parameters inherited from PanGu-α [1], we extend the dense Transformer model to a sparse one with Random Routed Experts (RRE), and efficiently train the model over 329B tokens by using Expert Computation and Storage Separation (ECSS). This resulted in a 6.3x increase in training throughput through heterogeneous computing. Our experimental findings show that PanGu-Σ provides state-of-the-art performance in zero-shot learning on various Chinese NLP downstream tasks. Moreover, it demonstrates strong abilities when fine-tuned on application data for open-domain dialogue, question answering, machine translation, and code generation.
Geo-distributed data analytics is increasingly used to derive useful information in large organisations. Naively extending existing cluster-scale data analytics systems to the scale of geo-distributed data centers faces unique challenges, including WAN bandwidth limits, regulatory constraints, changeable and unreliable runtime environments, and high monetary costs. Our goal in this work is to develop a practical geo-distributed data analytics system that (1) employs an intelligent mechanism that lets jobs efficiently utilize resources across data centers and adjust to the changeable environment; (2) guarantees the reliability of jobs in the face of possible failures; and (3) is generic and flexible enough to run a wide range of data analytics jobs without requiring any changes. To this end, we present a new, general geo-distributed data analytics system, HOUTU, that is composed of multiple autonomous systems, each operating in a sovereign data center. HOUTU maintains a job manager (JM) for a geo-distributed job in each data center, so that these replicated JMs can individually and cooperatively manage resources and assign tasks. Our experiments on a prototype of HOUTU running across four Alibaba Cloud regions show that HOUTU provides job performance as efficient as that of the existing centralized architecture, and guarantees reliable job execution when facing failures.