Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes-that load entire models to a single server-are unable to support this scale. One approach to support this scale is with distributed serving, or distributed inference, which divides the memory requirements of a single large model across multiple servers.This work is a first-step for the systems research community to develop novel model-serving solutions, given the huge system design space. Large-scale deep recommender systems are a novel workload and vital to study, as they consume up to 79% of all inference cycles in the data center. To that end, this work describes and characterizes scale-out deep learning recommendation inference using data-center serving infrastructure. This work specifically explores latency-bounded inference systems, compared to the throughput-oriented training systems of other recent works. We find that the latency and compute overheads of distributed inference are largely a result of a model's static embedding table distribution and sparsity of input inference requests. We further evaluate three embedding table mapping strategies of three DLRM-like models and specify challenging design trade-offs in terms of end-to-end latency, compute overhead, and resource efficiency. Overall, we observe only a marginal latency overhead when the data-center scale recommendation models are served with the distributed inference manner-P99 latency is increased by only 1% in the best case configuration. The latency overheads are largely a result of the commodity infrastructure used and the sparsity of embedding tables. Even more encouragingly, we also show how distributed inference can account for efficiency improvements in data-center scale recommendation serving.
We present a general transformation for combining a constant number of binary search tree data structures (BSTs) into a single BST whose running time is within a constant factor of the minimum of any "well-behaved" bound on the running time of the given BSTs, for any online access sequence. (A BST has a well-behaved bound with f (n) overhead if it spends at most O(f (n)) time per access and its bound satisfies a weak sense of closure under subsequences.) In particular, we obtain a BST data structure that is O(log log n) competitive, satisfies the working set bound (and thus satisfies the static finger bound and the static optimality bound), satisfies the dynamic finger bound, satisfies the unified bound with an additive O(log log n) factor, and performs each access in worst-case O(log n) time.
A lower bound is presented which shows that a class of heap algorithms in the pointer model with only heap pointers must spend Ω log log n log log log n amortized time on the Decrease-Key operation (given O(log n) amortized-time Extract-Min). Intuitively, this bound shows the key to having O(1)-time Decrease-Key is the ability to sort O(log n) items in O(log n) time; Fibonacci heaps [M. .L. Fredman and R. E. Tarjan. J. ACM 34(3):596-615 (1987)] do this through the use of bucket sort. Our lower bound also holds no matter how much data is augmented; this is in contrast to the lower bound of Fredman [J. ACM 46(4):473-501 (1999)] who showed a tradeoff between the number of augmented bits and the amortized cost of Decrease-Key. A new heap data structure, the sort heap, is presented. This heap is a simplification of the heap of Elmasry [SODA 2009: 471-476] and shares with it a O(log log n) amortized-time Decrease-Key, but with a straightforward implementation such that our lower bound holds. Thus a natural model is presented for a pointer-based heap such that the amortized runtime of a self-adjusting structure and amortized lower asymptotic bounds for Decrease-Key differ by but a O(log log log n) factor.
The order type of a point set in R d maps each (d+1)-tuple of points to its orientation (e.g., clockwise or counterclockwise in R 2 ). Two point sets X and Y have the same order type if there exists a mapping f from X to Y for which every (d+1)-tuple (a1, a2, . . . , a d+1 ) of X and the corresponding tuple (f (a1), f (a2), . . . , f (a d+1 )) in Y have the same orientation.In this paper we investigate the complexity of determining whether two point sets have the same order type. We provide an O(n d ) algorithm for this task, thereby improving upon the O(n 3d/2 ) algorithm of Goodman and Pollack (1983).The algorithm uses only order type queries and also works for abstract order types (or acyclic oriented matroids). Our algorithm is optimal, both in the abstract setting and for realizable points sets if the algorithm only uses order type queries.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.