2015 IEEE International Conference on Big Data (Big Data)
DOI: 10.1109/bigdata.2015.7363760
Machine learning at the limit

Cited by 22 publications (24 citation statements)
References 4 publications
“…In the case of word2vec, however, not all word vectors are updated at the same frequency: update frequency is proportional to word unigram frequency, so the vectors associated with popular words are updated more often than those of rare words. We therefore strive to match model-update frequency to word frequency, and a sub-model (instead of full-model) synchronization scheme, similar to the one exploited in BIDMach [10], is used.…”
Section: E. Distributed Memory Parallelization
Citation type: mentioning
confidence: 99%
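The sub-model synchronization idea in this quote can be illustrated with a minimal sketch. The class, its method names, and the row-averaging merge rule below are illustrative assumptions, not the actual scheme used in BIDMach [10] or by the citing paper; the point is only that each worker exchanges the rows (word vectors) it touched since the last synchronization rather than the full embedding table, so popular words are synchronized more often than rare ones.

```python
import numpy as np

# Hypothetical sketch of sub-model synchronization for a distributed
# word2vec-style embedding table: only rows updated since the last
# sync are exchanged, so sync frequency tracks word frequency.
class SubModelSync:
    def __init__(self, vocab_size, dim, rng=None):
        rng = rng or np.random.default_rng(0)
        self.emb = rng.normal(scale=0.1, size=(vocab_size, dim))
        self.touched = set()            # rows updated since the last sync

    def local_update(self, word_id, grad, lr=0.025):
        # Plain SGD step on one word vector; record that the row changed.
        self.emb[word_id] -= lr * grad
        self.touched.add(word_id)

    def pack_delta(self):
        # Sub-model to send: only the touched rows, not the full table.
        rows = sorted(self.touched)
        return rows, self.emb[rows].copy()

    def merge(self, rows, peer_rows):
        # Merge a peer's sub-model by averaging the shared rows
        # (averaging is an assumed merge rule, chosen for simplicity).
        self.emb[rows] = 0.5 * (self.emb[rows] + peer_rows)
        self.touched.clear()
```

In use, two workers would each call `local_update` on the words in their data shard, exchange the results of `pack_delta()`, and call `merge()`; rows never touched on either worker are never transmitted.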
“…For the purpose of comparison, we also include in Fig. 4 BIDMach's performance on N = 1, 4 NVidia Titan-X GPUs as reported by [10], which represents the state-of-the-art performance achieved on multi-GPU systems. Again, good scalability is only meaningful when similar or better accuracy is achieved.…”
Section: Distributed Multi-node Systems
Citation type: mentioning
confidence: 99%
“…[20] further accelerates CCD++ on GPUs using loop fusion and tiling. The resulting algorithm is shown to be faster than CCD++ on CPUs [36] as well as GPU-ALS [31], which is implemented without memory optimization or approximate computing.…”
Taxonomy of parallel matrix-factorization methods, recovered from the figure interleaved with the quote:
SGD: [22]; multi-node: FactorBird [30], Petuum [5]; blocking (workers pick non-overlapping blocks), with blockDim = #workers: DSGD [9]; blockDim > #workers: LIBMF [39], NOMAD [37], DSGD++ [32]; nested blocking: dcMF [21], MLGF-MF [27]; single and multiple GPUs: GPU-SGD, i.e. SGD with lock-free and blocking updates [35].
ALS: replicate all features: PALS [38], DALS [32]; partially replicate features: SparkALS [18], GraphLab [17], Sparkler [16]; rotate features: Facebook [13]; approximate ALS: [29]; single GPU: BIDMach [2], HPC-ALS [8]; single and multiple GPUs: GPU-ALS [31] and cuMF_ALS.
CCD: multi-core and multi-node: CCD++ [36]; single GPU: parallel CCD++ [20].
Section: A. Parallel SGD
Citation type: mentioning
confidence: 99%
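The blocking scheme in the taxonomy above (DSGD [9], where workers pick non-overlapping blocks) can be sketched briefly. The function name, rank, learning rate, and regularization below are illustrative assumptions, and the B independent workers are only simulated sequentially; this is not the implementation of any of the cited systems.

```python
import numpy as np

# Minimal sketch of DSGD-style blocked SGD for matrix factorization:
# rows and columns are split into B stripes, and in sub-epoch s worker k
# processes block (k, (k + s) % B), so blocks updated in parallel never
# share row or column factors.
def blocked_sgd(R, rank=8, B=4, epochs=10, lr=0.01, reg=0.05, seed=0):
    rng = np.random.default_rng(seed)
    m, n = R.shape
    P = rng.normal(scale=0.1, size=(m, rank))   # row factors
    Q = rng.normal(scale=0.1, size=(n, rank))   # column factors
    row_parts = np.array_split(np.arange(m), B)
    col_parts = np.array_split(np.arange(n), B)

    obs = np.argwhere(R != 0)                   # observed (i, j) entries
    for _ in range(epochs):
        for s in range(B):                      # one sub-epoch per shift
            for k in range(B):                  # the B blocks of a shift are
                rows = set(row_parts[k])        # independent -> could run on
                cols = set(col_parts[(k + s) % B])  # B workers concurrently
                for i, j in obs:
                    if i in rows and j in cols:
                        p_i = P[i].copy()
                        err = R[i, j] - p_i @ Q[j]
                        P[i] += lr * (err * Q[j] - reg * p_i)
                        Q[j] += lr * (err * p_i - reg * Q[j])
    return P, Q
```

Because each of the B blocks in a shift touches disjoint slices of P and Q, the inner loop over k can be distributed across workers without locks, which is the property the blocking-based systems above exploit.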