Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1482

Word Mover’s Embedding: From Word2Vec to Document Embedding

Abstract: While the celebrated Word2Vec technique yields semantically rich representations for individual words, there has been relatively less success in extending it to generate unsupervised sentence or document embeddings. Recent work has demonstrated that Word Mover's Distance (WMD), a distance measure between documents that aligns semantically similar words, yields unprecedented KNN classification accuracy. However, WMD is expensive to compute, and it is hard to extend its use beyond a KNN classifier. In this …
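To make the contrast in the abstract concrete, the following is a minimal sketch of exact WMD between two tokenized documents, posed as an optimal-transport linear program over pretrained word vectors and solved with SciPy. The word-vector lookup "embed" (a dict mapping each word to a vector) and the example sentences in the comments are assumed placeholders; this is the expensive exact distance the abstract refers to, not the paper's Word Mover's Embedding approximation.

import numpy as np
from collections import Counter
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def nbow(tokens):
    # Normalized bag-of-words weights over the document's unique tokens.
    counts = Counter(tokens)
    words = sorted(counts)
    weights = np.array([counts[w] for w in words], dtype=float)
    return words, weights / weights.sum()

def wmd(tokens_a, tokens_b, embed):
    # Exact Word Mover's Distance: the cheapest flow of word mass from
    # document A to document B under Euclidean distances between embeddings.
    words_a, wa = nbow(tokens_a)
    words_b, wb = nbow(tokens_b)
    Xa = np.stack([embed[w] for w in words_a])
    Xb = np.stack([embed[w] for w in words_b])
    C = cdist(Xa, Xb)  # ground cost: pairwise word-to-word distances
    na, nb = len(wa), len(wb)
    # Transport plan T (na x nb), flattened row-major; rows must sum to wa,
    # columns to wb, and all entries must be non-negative.
    A_eq = np.zeros((na + nb, na * nb))
    for i in range(na):
        A_eq[i, i * nb:(i + 1) * nb] = 1.0
    for j in range(nb):
        A_eq[na + j, j::nb] = 1.0
    b_eq = np.concatenate([wa, wb])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Hypothetical usage, assuming "embed" holds pretrained Word2Vec vectors:
# d = wmd("obama speaks to the media".split(),
#         "the president greets the press".split(), embed)

Solving this linear program for every document pair is what makes exact WMD costly (roughly cubic in the number of unique words per document), which is the bottleneck the paper's embedding approach is designed to avoid.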

Cited by 85 publications (88 citation statements) · References 46 publications

Citation statements (ordered by relevance):
“…It also reveals that there is probably still room to improve ranking performance, especially for cases in which a larger number of seed websites is available. Given recent advances in text representation using dense vector representations and document distance computation in the embedding space [17,33], this is a promising area for future work. While this is a limitation of our ranking approach, it is still suitable for our context, where we assume to have only a small list of seeds as input.…”
Section: Website Ranking Evaluation (mentioning)
confidence: 99%
“…In this work, instead of using Random Features to approximate a pre-defined kernel function, we overcome all these aforementioned issues by generalizing Random Features to develop a new family of efficient and effective string kernels that not only are positive-definite but also reduce the computational complexity from quadratic to linear in both the number and the length of strings. Note that, our approach is different from a recent work [45] on distance kernel learning that mainly focuses on theoretical analysis of these kernels on structured data like time-series [46] and text [44]. Instead, we focus on developing empirical methods that could often outperform or are highly competitive to other state-of-the-art approaches, including kernel based and Recurrent Neural Networks based methods, as we will show in our experiments.…”
Section: Conventional Random Features For Scaling Up Kernel Machines (mentioning)
confidence: 99%
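For context on the conventional Random Features the excerpt above contrasts against, here is a minimal sketch of random Fourier features (Rahimi and Recht) approximating a Gaussian RBF kernel; the feature dimension D and bandwidth gamma are illustrative values, not taken from the cited work.

import numpy as np

def rff_map(X, D=512, gamma=1.0, seed=None):
    # Map inputs X (n x d) to D random features whose inner products
    # approximate the RBF kernel exp(-gamma * ||x - y||^2).
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Sanity check: feature inner products should approach the exact kernel.
X = np.random.default_rng(0).normal(size=(5, 10))
Z = rff_map(X, D=4096, gamma=0.5, seed=0)
approx = Z @ Z.T
exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

A linear model trained on such features approximates the corresponding kernel machine at linear rather than quadratic cost in the number of examples, which is the efficiency argument the excerpt extends from vectors to strings.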
“…Le and Mikolov (2014); Li et al (2015); Dai et al (2015) explored Paragraph Vector with various lengths (sentence, paragraph, document) trained on next word/n-gram prediction given context sampled from the paragraph. The work from Roy et al (2016); Chen (2017); Wu et al (2018) obtained document embeddings from word-level embeddings. More recent work has been focused on learning document embeddings through hierarchical training.…”
Section: Related Work (mentioning)
confidence: 99%
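As one generic illustration of deriving a document embedding from word-level embeddings, as mentioned in the excerpt above, the sketch below averages pretrained word vectors, optionally with per-word weights such as IDF. It is a common baseline, not the specific method of Roy et al., Chen, or Wu et al., and the lookup "embed" is an assumed placeholder.

import numpy as np

def average_document_embedding(tokens, embed, weights=None):
    # Average the embeddings of in-vocabulary tokens; "weights" may be a
    # dict of per-word weights (e.g. IDF values) and defaults to uniform.
    vecs, ws = [], []
    for tok in tokens:
        if tok in embed:
            vecs.append(embed[tok])
            ws.append(1.0 if weights is None else weights.get(tok, 1.0))
    if not vecs:
        return None  # no in-vocabulary words; caller decides how to back off
    V = np.stack(vecs)
    w = np.asarray(ws)
    return (w[:, None] * V).sum(axis=0) / w.sum()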