High performance multi-GPU computing becomes an inevitable trend due to the ever-increasing demand on computation capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of deep understanding on how modern GPUs can be connected and the real impact of state-of-the-art interconnect technology on multi-GPU application performance become a hurdle. In this paper, we fill the gap by conducting a thorough evaluation on five latest types of modern GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI and NVSwitch, from six high-end servers and HPC platforms: NVIDIA P100-DGX-1, V100-DGX-1, DGX-2, OLCF's SummitDev and Summit supercomputers, as well as an SLI-linked system with two NVIDIA Turing RTX-2080 GPUs. Based on the empirical evaluation, we have observed four new types of GPU communication network NUMA effects: three are triggered by NVLink's topology, connectivity and routing, while one is caused by PCIe chipset design issue. These observations indicate that, for an application running in a multi-GPU node, choosing the right GPU combination can impose considerable impact on GPU communication efficiency, as well as the application's overall performance. Our evaluation can be leveraged in building practical multi-GPU performance models, which are vital for GPU task allocation, scheduling and migration in a shared environment (e.g., AI cloud and HPC centers), as well as communication-oriented performance tuning.
Joint inversion of multiple observation models has important applications in many disciplines including geoscience, image processing and computational biology. One of the methodologies for joint inversion of ill-posed observation equations naturally leads to multi-parameter regularization, which has been intensively studied over the last several years. However, problems such as the choice of multiple regularization parameters remain unsolved. In the present study, we discuss a rather general approach to the regularization of multiple observation models, based on the idea of the linear aggregation of approximations corresponding to different values of the regularization parameters. We show how the well-known linear functional strategy can be used for such an aggregation and prove that the error of a constructive aggregator differs from the ideal error value by a quantity of an order higher than the best guaranteed accuracy from the most trustable observation model. The theoretical analysis is illustrated by numerical experiments with simulated data.
In natural language processing (NLP), the "Transformer" architecture was proposed as the first transduction model replying entirely on self-attention mechanisms without using sequence-aligned recurrent neural networks (RNNs) or convolution, and it achieved significant improvements for sequence to sequence tasks. The introduced intensive computation and storage of these pre-trained language representations has impeded their popularity into computation and memory constrained devices. The field-programmable gate array (FPGA) is widely used to accelerate deep learning algorithms for its high parallelism and low latency. However, the trained models are still too large to accommodate to an FPGA fabric. In this paper, we propose an efficient acceleration framework, Ftrans, for transformer-based large scale language representations. Our framework includes enhanced block-circulant matrix (BCM)-based weight representation to enable model compression on large-scale language representations at the algorithm level with few accuracy degradation, and an acceleration design at the architecture level. Experimental results show that our proposed framework significantly reduce the model size of NLP models by up to 16 times. Our FPGA design achieves 27.07× and 81 × improvement in performance and energy efficiency compared to CPU, and up to 8.80× improvement in energy efficiency compared to GPU.
While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect so errors o ine in the fast Fourier transform (FFT) a er computation nishes, none of the existing ABFT schemes detect so errors online before the computation nishes. is paper presents an online ABFT scheme for FFT so that so errors can be detected online and the corrupted computation can be terminated in a much more timely manner. We also extend our scheme to tolerate both arithmetic errors and memory errors, develop strategies to reduce its fault tolerance overhead and improve its numerical stability and fault coverage, and nally incorporate it into the widely used FFTW library-one of the today's fastest FFT so ware implementations. Experimental results demonstrate that: (1) the proposed online ABFT scheme introduces much lower overhead than the existing o ine ABFT schemes; (2) it detects errors in a much more timely manner; and (3) it also has higher numerical stability and be er fault coverage.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.