Computer architectures are increasingly based on multi-core CPUs and large memories. Memory bandwidth, which has not kept pace with the increasing number of cores, has become the primary processing bottleneck, replacing disk I/O as the limiting factor. To address this challenge, we provide novel algorithms for increasing the throughput of Business Intelligence (BI) queries, as well as for ensuring fairness and avoiding starvation among a concurrent set of such queries. To maximize throughput, we propose a novel FullSharing scheme that allows all concurrent queries, when performing base-table I/O, to share the cache belonging to a given core. We then generalize this approach to a BatchSharing scheme that avoids thrashing on "agg-tables"-hash tables that are used for aggregation processing-caused by execution of too many queries on a core. This scheme partitions queries into batches such that the working-set of agg-table entries for each batch can fit into a cache; an efficient sampling technique is used to estimate selectivities and working-set sizes for purposes of query partitioning. Finally, we use lottery-scheduling techniques to ensure fairness and impose a hard upper bound on staging time to avoid starvation. On our 8-core testbed, we were able to completely remove the memory I/O bottleneck, increasing throughput by a factor of 2 to 2.5, while also maintaining fairness and avoiding starvation.
In this paper, a novel adaptive dynamic programming (ADP) algorithm, called "iterative zero-sum ADP algorithm," is developed to solve infinite-horizon discrete-time two-player zero-sum games of nonlinear systems. The present iterative zero-sum ADP algorithm permits arbitrary positive semidefinite functions to initialize the upper and lower iterations. A novel convergence analysis is developed to guarantee the upper and lower iterative value functions to converge to the upper and lower optimums, respectively. When the saddle-point equilibrium exists, it is emphasized that both the upper and lower iterative value functions are proved to converge to the optimal solution of the zero-sum game, where the existence criteria of the saddle-point equilibrium are not required. If the saddle-point equilibrium does not exist, the upper and lower optimal performance index functions are obtained, respectively, where the upper and lower performance index functions are proved to be not equivalent. Finally, simulation results and comparisons are shown to illustrate the performance of the present method.
In this paper, a discrete-time optimal control scheme is developed via a novel local policy iteration adaptive dynamic programming algorithm. In the discrete-time local policy iteration algorithm, the iterative value function and iterative control law can be updated in a subset of the state space, where the computational burden is relaxed compared with the traditional policy iteration algorithm. Convergence properties of the local policy iteration algorithm are presented to show that the iterative value function is monotonically nonincreasing and converges to the optimum under some mild conditions. The admissibility of the iterative control law is proven, which shows that the control system can be stabilized under any of the iterative control laws, even if the iterative control law is updated in a subset of the state space. Finally, two simulation examples are given to illustrate the performance of the developed method.
Deep learning recommendation models (DLRMs) have been used across many business-critical services at Meta and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper, we present Neo, a software-hardware co-designed system for high-performance distributed training of large-scale DLRMs. Neo employs a novel 4D parallelism strategy that combines table-wise, row-wise, column-wise, and data parallelism for training massive embedding operators in DLRMs. In addition, Neo enables extremely high-performance and memoryefficient embedding computations using a variety of critical systems optimizations, including hybrid kernel fusion, software-managed caching, and quality-preserving compression. Finally, Neo is paired with ZionEX , a new hardware platform co-designed with Neo's 4D parallelism for optimizing communications for large-scale DLRM training. Our evaluation on 128 GPUs using 16 ZionEX nodes shows that Neo outperforms existing systems by up to 40× for training 12-trillion-parameter DLRM models deployed in production.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.