2019
DOI: 10.1007/978-3-030-16621-2_61

Batch Size Influence on Performance of Graphic and Tensor Processing Units During Training and Inference Phases

Abstract: The impact of the maximally possible batch size (for the best runtime) on the performance of graphics processing units (GPU) and tensor processing units (TPU) during the training and inference phases is investigated. Numerous runs of the selected deep neural network (DNN) were performed on the standard MNIST and Fashion-MNIST datasets. A significant speedup was obtained even for extremely low-scale usage of Google TPUv2 units (8 cores only) in comparison to the quite powerful NVIDIA Tesla K80 GPU card with the …
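The abstract describes sweeping the batch size of a small DNN on MNIST-class data and comparing training and inference throughput across accelerators. As a rough illustration of that kind of experiment (not the paper's exact network, batch-size grid, or measurement protocol — all of those are assumptions here), a minimal Keras timing loop could look like this:

```python
# Minimal sketch (not the paper's exact protocol): time training and inference
# of a small DNN on MNIST for a sweep of batch sizes. The architecture and the
# batch-size grid below are illustrative assumptions.
import time
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

for batch_size in [64, 128, 256, 512, 1024, 2048]:
    model = build_model()
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    start = time.time()
    model.fit(x_train, y_train, batch_size=batch_size, epochs=1, verbose=0)
    train_time = time.time() - start

    start = time.time()
    model.predict(x_test, batch_size=batch_size, verbose=0)
    infer_time = time.time() - start

    print(f"batch={batch_size:5d}  "
          f"train: {len(x_train) / train_time:8.0f} samples/s  "
          f"inference: {len(x_test) / infer_time:8.0f} samples/s")
```

On a GPU or TPU backend, the throughput reported by such a loop typically grows with batch size until the device saturates, which is the effect the paper quantifies.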

Cited by 24 publications (12 citation statements)
References 22 publications (33 reference statements)
“…The computation of Eq. (7) therefore operates on different sizes of tensors, which is sub-optimal for efficient batching in GPU and TPU [20]. For batching purpose, we perform the attention over all codewords in each codebook, fixing the "context size" of the attention to 𝑊:…”
Section: Linear-time Self-Attention
confidence: 99%
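The statement above cites the paper for the observation that variable-size tensors batch poorly on GPU/TPU, and therefore fixes the attention "context size" to 𝑊 codewords per codebook so that all tensors share one shape. A purely illustrative sketch of that idea (not the citing paper's model; the shapes B, H, W, d and the one-query-per-codebook layout are assumptions) is:

```python
# Illustrative only: scaled dot-product attention over a fixed number W of
# codewords per codebook, so every example produces tensors of identical shape
# and batches efficiently on GPU/TPU. Shapes and names (B, H, W, d) are assumed.
import tensorflow as tf

B, H, W, d = 32, 8, 64, 128                  # batch, codebooks, codewords, dim

queries   = tf.random.normal((B, H, 1, d))   # one query per codebook (assumed)
codewords = tf.random.normal((B, H, W, d))   # fixed "context size" W per codebook

scores  = tf.matmul(queries, codewords, transpose_b=True) / d ** 0.5  # (B, H, 1, W)
weights = tf.nn.softmax(scores, axis=-1)                              # attention weights
output  = tf.matmul(weights, codewords)                               # (B, H, 1, d)
print(output.shape)                                                   # (32, 8, 1, 128)
```

Because W is the same for every codebook and every example, a single batched matrix multiply covers the whole attention step, which is the batching efficiency the quotation refers to.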
“…It offers cloud services specialized for ML (Amazon EC2 P3 instances) and is equipped with an NVIDIA Tesla V100 Graphics Processing Unit (GPU). GPU computational capacity values for ML can be found in [18], where we select an average value of 6000 training samples/sec. Finally, the computational tasks for the cloud server include training (in the CML case) and model parameter aggregation (FML, EML cases).…”
Section: Training Process and Entities Computational Characteristics
confidence: 99%
“…For the former we select an average value of 40,000 training samples/sec, assuming a Data Center is equipped with a Tensor Processing Unit (TPU). For the latter (model parameter aggregation), no reference values can be found in the literature, thus we rely on an empirical approach; we measure the average capacity for training and aggregation tasks in our personal computer (PC) setup (i.e., 6250 training samples/sec and 1.56 model aggregations/sec respectively) and compare against the training capacity reference value of 40,000 training samples/sec that was selected, according to [18]. Assuming a linear relation, the average cloud aggregation capacity is calculated as 10 model aggregations/sec.…”
Section: Training Process and Entities Computational Characteristics
confidence: 99%
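The "linear relation" step quoted above can be verified with simple arithmetic (the figures are those cited in the statement; the variable names below are my own):

```python
# Back-of-the-envelope check of the linear-scaling argument quoted above.
# The numbers come from the citation statement; variable names are assumptions.
pc_train_rate    = 6250    # training samples/sec measured on the PC setup
pc_agg_rate      = 1.56    # model aggregations/sec measured on the PC setup
cloud_train_rate = 40000   # training samples/sec selected for the TPU-equipped cloud

# If aggregation capacity scales linearly with training capacity:
cloud_agg_rate = pc_agg_rate * cloud_train_rate / pc_train_rate
print(round(cloud_agg_rate, 2))   # 9.98, i.e. roughly the 10 aggregations/sec used in the text
```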
“…It offers cloud services specialized for ML and is equipped with an NVIDIA Tesla V100 Graphics Processing Unit (GPU). Thus, an edge node's computational capacity equals that of a GPU's computational capacity, whose values for ML tasks can be found in [19], from where we select an average value of 6000 training samples/sec.…”
Section: UE and Servers' Computational Capacity
confidence: 99%