Leonid Boytsov scite author profile

We show that it is possible to reliably discriminate whether a syntactic construction is meant literally or metaphorically using lexical semantic features of the words that participate in the construction. Our model is constructed using English resources, and we obtain state-of-the-art performance relative to previous work in this language. Using a model transfer approach by pivoting through a bilingual dictionary, we show our model can identify metaphoric expressions in other languages. We provide results on three new test sets in Spanish, Farsi, and Russian. The results support the hypothesis that metaphors are conceptual, rather than lexical, in nature.

show abstract

Decoding billions of integers per second through vectorization

Lemire

Boytsov

2013

Softw. Pract. Exper.

214

183

View full text Add to dashboard Cite

SUMMARYIn many important applications-such as search engines and relational database systems-data are stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and single-instruction, multiple-data (SIMD) instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD-BP128? that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128? saves up to 2 bits/int. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression ratio within 10% of a state-of-the-art scheme (Simple8b) while being two times faster during decoding.

show abstract

SIMD compression and the intersection of sorted integers

Lemire

Boytsov

Kurz³

2015

Softw Pract Exp

View full text Add to dashboard Cite

Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the single-instruction, multiple data (SIMD) instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded 32-bit integer while still providing state-of-the-art compression. However, if the subsequent processing of the integers is slow, the effort spent on optimizing decompression speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD GALLOPING algorithm. We exploit the fact that one SIMD instruction can compare four pairs of 32-bit integers at once. We experiment with two Text REtrieval Conference (TREC) text collections, GOV2 and ClueWeb09 (category B), using logs from the TREC million-query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state-of-the-art approach.

show abstract

Engineering Efficient and Effective Non-metric Space Library

Boytsov

Naidan

2013

View full text Add to dashboard Cite

Indexing methods for approximate dictionary searching

Boytsov¹

2011

ACM J. Exp. Algorithmics

View full text Add to dashboard Cite

The primary goal of this article is to survey state-of-the-art indexing methods for approximate dictionary searching. To improve understanding of the field, we introduce a taxonomy that classifies all methods into direct methods and sequence-based filtering methods. We focus on infrequently updated dictionaries, which are used primarily for retrieval. Therefore, we consider indices that are optimized for retrieval rather than for update. The indices are assumed to be associative, that is, capable of storing and retrieving auxiliary information, such as string identifiers. All solutions are lossless and guarantee retrieval of strings within a specified edit distance k . Benchmark results are presented for the practically important cases of k =1, 2, and 3. We concentrate on natural language datasets, which include synthetic English and Russian dictionaries, as well as dictionaries of frequent words extracted from the ClueWeb09 collection. In addition, we carry out experiments with dictionaries containing DNA sequences. The article is concluded with a discussion of benchmark results and directions for future research.

show abstract

Permutation search methods are efficient, yet faster search is possible

2015

View full text Add to dashboard Cite

We survey permutation-based methods for approximate knearest neighbor search. In these methods, every data point is represented by a ranked list of pivots sorted by the distance to this point. Such ranked lists are called permutations. The underpinning assumption is that, for both metric and non-metric spaces, the distance between permutations is a good proxy for the distance between original points. Thus, it should be possible to efficiently retrieve most true nearest neighbors by examining only a tiny subset of data points whose permutations are similar to the permutation of a query. We further test this assumption by carrying out an extensive experimental evaluation where permutation methods are pitted against state-of-the art benchmarks (the multi-probe LSH, the VP-tree, and proximity-graph based retrieval) on a variety of realistically large data set from the image and textual domain. The focus is on the highaccuracy retrieval methods for generic spaces. Additionally, we assume that both data and indices are stored in main memory. We find permutation methods to be reasonably efficient and describe a setup where these methods are most useful. To ease reproducibility, we make our software and data sets publicly available.

show abstract

SberQuAD – Russian Reading Comprehension Dataset: Description and Analysis

Efimov

Chertok

Boytsov

et al. 2020

View full text Add to dashboard Cite

Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants

Kamphuis

Vries

Boytsov

et al. 2020

View full text Add to dashboard Cite

When researchers speak of BM25, it is not entirely clear which variant they mean, since many tweaks to Robertson et al.'s original formulation have been proposed. When practitioners speak of BM25, they most likely refer to the implementation in the Lucene open-source search library. Does this ambiguity "matter"? We attempt to answer this question with a large-scale reproducibility study of BM25, considering eight variants. Experiments on three newswire collections show that there are no significant effectiveness differences between them, including Lucene's often maligned approximation of document length. As an added benefit, our empirical approach takes advantage of databases for rapid IR prototyping, which validates both the feasibility and methodological advantages claimed in previous work.

show abstract

12 3 4

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Leonid Boytsov

Metaphor Detection with Cross-Lingual Model Transfer

Decoding billions of integers per second through vectorization

SIMD compression and the intersection of sorted integers

Engineering Efficient and Effective Non-metric Space Library

Indexing methods for approximate dictionary searching

Permutation search methods are efficient, yet faster search is possible

SberQuAD – Russian Reading Comprehension Dataset: Description and Analysis

Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants

Contact Info

Product

Resources

About