Proceedings of the 2013 Conference on the Theory of Information Retrieval
DOI: 10.1145/2499178.2499189

Efficient Nearest-Neighbor Search in the Probability Simplex

Abstract: Document similarity tasks arise in many areas of information retrieval and natural language processing. A fundamental question when comparing documents is which representation to use. Topic models, which have served as versatile tools for exploratory data analysis and visualization, represent documents as probability distributions over latent topics. Systems comparing topic distributions thus use measures of probability divergence such as Kullback-Leibler, Jensen-Shannon, or Hellinger. This paper presents novel…
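The divergences named in the abstract can be computed directly from two points on the probability simplex. A minimal NumPy sketch of the standard definitions (the function names are mine, not from the paper):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the midpoint m."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p, q):
    """Hellinger distance between two distributions on the simplex."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
print(jensen_shannon(p, q), hellinger(p, q))
```

Unlike KL, the Jensen-Shannon and Hellinger measures are symmetric and bounded (by ln 2 and 1 respectively), which is what makes them candidates for nearest-neighbor indexing.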

Cited by 12 publications (8 citation statements)
References 32 publications
“…But only a few methods handle density metrics in a simplex space. A first approach transformed the Hellinger divergence into a Euclidean distance so that existing ANN techniques, such as LSH and k-d trees, could be applied [25]. But this solution does not account for the special properties of probability distributions, namely non-negativity and summing to one.…”
Section: Hashing Topic Distributions
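The transformation described in the statement above can be made concrete: mapping each distribution to its elementwise square root turns Hellinger distance into (scaled) Euclidean distance, so off-the-shelf Euclidean ANN structures apply. A sketch under that reading (not code from either paper):

```python
import numpy as np

def sqrt_embed(p):
    # Elementwise square root: for distributions p, q on the simplex,
    # ||sqrt(p) - sqrt(q)||_2 = sqrt(2) * He(p, q), and ||sqrt(p)||_2 = 1.
    return np.sqrt(p)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.2, 0.6])

he = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
euclidean = np.linalg.norm(sqrt_embed(p) - sqrt_embed(q))
print(euclidean, np.sqrt(2) * he)  # identical up to floating point
```

The embedded points all lie on the unit sphere, but a k-d tree or LSH index built over them has no notion of the simplex constraints, which is exactly the limitation the quoted statement raises.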
“…if the corpus size is equal to 1000 elements, only the top 5 most similar documents are considered relevant for a given document). This value was chosen after reviewing the datasets used in similar experiments [25,31]. In those experiments, the reference data comes from existing categories, and the minimum ratio of categorized documents to corpus size is around 0.5%.…”
Section: Retrieving Similar Documents
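The 0.5% relevance rule in the statement above amounts to a one-line computation (the helper name is mine, for illustration):

```python
def relevant_cutoff(corpus_size, fraction=0.005):
    # Number of top-ranked neighbors treated as relevant: 0.5% of the
    # corpus size, floored at one document.
    return max(1, round(corpus_size * fraction))

print(relevant_cutoff(1000))  # -> 5
```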
“…The Jensen divergence works with an exponential kernel, which is a convex function, and it has a close relationship with the Hellinger distance and triangle measures [29][30][31]. The Jensen divergence (Js), Hellinger (He), and triangle (T) measures are given in Eqs.…”
Section: The Discrete Wavelet Transform for Feature Extraction
“…Therefore the relationship between the Jensen-Shannon, Hellinger, and triangle measures [32] can be expressed by: …”
Section: The Discrete Wavelet Transform for Feature Extraction
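The three quantities these statements relate can be compared numerically. The sketch below uses the standard definitions (Jensen-Shannon divergence, squared Hellinger distance, triangular discrimination); the cited papers may use different scalings:

```python
import numpy as np

def js(p, q):
    # Jensen-Shannon divergence via the midpoint distribution m.
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger_sq(p, q):
    # Squared Hellinger distance.
    return float(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def triangle(p, q):
    # Triangular discrimination: sum of (p_i - q_i)^2 / (p_i + q_i).
    s = p + q
    mask = s > 0
    return float(np.sum((p[mask] - q[mask]) ** 2 / s[mask]))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
print(js(p, q), hellinger_sq(p, q), triangle(p, q))
```

All three vanish exactly when p = q and grow together as the distributions separate, which is the "close relationship" the quoted statements appeal to.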
“…Additionally, several hashing schemes have been defined for alternative measures on probability distributions, such as the chi-squared distance (12) and the Hellinger distance (13). Also, Mu and Yan (25) proposed a family of LSH functions for dealing with non-metric distances, but they considered a symmetric version of the KL divergence.…”
Section: Locality Sensitive Hashing for Related Distances
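For intuition, a Hellinger-oriented LSH family can be sketched by composing the square-root embedding with sign-random-projection (cosine) hashing. This is an illustrative assumption on my part, not the specific scheme of (12), (13), or Mu and Yan (25):

```python
import numpy as np

rng = np.random.default_rng(0)

def srp_key(x, planes):
    # Sign-random-projection hash: one bit per random hyperplane.
    return tuple(bool(b) for b in (planes @ x) > 0)

def hellinger_lsh_key(p, planes):
    # Hash the square-root embedding of p; angular distance between
    # sqrt-embedded distributions tracks their Hellinger distance.
    return srp_key(np.sqrt(p), planes)

dim, n_bits = 8, 16
planes = rng.normal(size=(n_bits, dim))

p = rng.dirichlet(np.ones(dim))
# A slightly perturbed copy of p, renormalized back onto the simplex.
near = p + 0.02 * rng.dirichlet(np.ones(dim))
near /= near.sum()

agree = sum(a == b for a, b in zip(hellinger_lsh_key(p, planes),
                                   hellinger_lsh_key(near, planes)))
print(f"{agree}/{n_bits} hash bits agree for nearby distributions")
```

Nearby distributions tend to share most hash bits, so bucketing on (prefixes of) these keys yields candidate sets for approximate Hellinger nearest-neighbor search.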