Ping Li scite author profile

Efficient (approximate) computation of set similarity in very large datasets is a common task with many applications in information retrieval and data management. One common approach for this task is minwise hashing. This paper describes b-bit minwise hashing, which can provide order-of-magnitude improvements in storage requirements and computational overhead over the original scheme in practice.

We give both theoretical characterizations of the performance of the new algorithm and a practical evaluation on large real-life datasets, and show that these match very closely. Moreover, we provide a detailed comparison with other important alternative techniques proposed for estimating set similarities. Our technique yields a very simple algorithm and can be realized with only minor modifications to the original minwise hashing scheme.

Introduction

With the advent of the Internet, many applications are faced with very large and inherently high-dimensional datasets. A common task on these is similarity search, that is, given a high-dimensional data point, the retrieval of data points that are close under a given distance function. In many scenarios, the storage and computational requirements for computing exact distances between all data points are prohibitive, making data representations that allow compact storage and efficient approximate distance computation necessary.

In this paper, we describe b-bit minwise hashing, which leverages properties common to many application scenarios to obtain order-of-magnitude improvements in the storage space and computational overhead required for a given level of accuracy over existing techniques. Moreover, while the theoretical analysis of these gains is technically challenging, the resulting algorithm is simple and easy to implement.

To describe our approach, we first consider the concrete task of Web page duplicate detection, which is of critical importance in the context of Web search and was one of the motivations for the development of the original minwise hashing algorithm by Broder et al. [2, 4]. Here, the task is to identify pairs of pages that are textually very similar. For this purpose, Web pages are modeled as "a set of shingles," where a shingle corresponds to a string of w contiguous words occurring on the page. Now, given two such sets S_1, S_2 ⊆ Ω, |Ω| = D, the normalized similarity known as resemblance or Jaccard similarity, denoted by R, is
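The ideas in the excerpt above (shingling, resemblance, minwise hashing, and keeping only the lowest b bits of each minimum) can be sketched in a few lines of Python. This is a minimal illustration and not the paper's implementation. It assumes the standard definition of Jaccard similarity, |S_1 ∩ S_2| / |S_1 ∪ S_2|. It uses random linear hash functions as a stand-in for random permutations. It also assumes an idealized correction in which two different minima agree on their lowest b bits with probability 2^-b; the estimator analyzed in the paper may use different correction terms. All function names and parameters here are hypothetical.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # large prime modulus for the linear hash family (assumption)


def shingles(text, w=3):
    """Return the set of strings of w contiguous words occurring in text."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(s1, s2):
    """Exact resemblance |S1 ∩ S2| / |S1 ∪ S2| (0.0 if both sets are empty)."""
    s1, s2 = set(s1), set(s2)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def stable_key(item):
    """Map an item to a 64-bit integer that is stable across Python runs."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hash_params(k, seed=0):
    """Draw k random (a, c) pairs; each defines h(x) = (a*x + c) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]


def minhash_signature(items, params, b=None):
    """Minimum hash value per function; keep only the lowest b bits if b is set.

    The input set must be non-empty.
    """
    keys = [stable_key(x) for x in items]
    mask = None if b is None else (1 << b) - 1
    signature = []
    for a, c in params:
        m = min((a * x + c) % PRIME for x in keys)
        signature.append(m if mask is None else m & mask)
    return signature


def estimate_resemblance(sig1, sig2, b=None):
    """Estimate R from the fraction of matching signature positions.

    With b-bit values, subtract the assumed chance-collision rate 2**-b
    and rescale (idealized correction, see the note above the code).
    """
    matches = sum(x == y for x, y in zip(sig1, sig2))
    p_hat = matches / len(sig1)
    if b is None:
        return p_hat
    chance = 2.0 ** -b
    return (p_hat - chance) / (1.0 - chance)


if __name__ == "__main__":
    a = shingles("the quick brown fox jumps over the lazy dog near the river bank")
    c = shingles("the quick brown fox jumps over the lazy cat near the river bank")
    params = make_hash_params(k=500, seed=1)
    print("exact :", jaccard(a, c))
    print("full  :", estimate_resemblance(minhash_signature(a, params),
                                          minhash_signature(c, params)))
    print("b = 2 :", estimate_resemblance(minhash_signature(a, params, b=2),
                                          minhash_signature(c, params, b=2), b=2))
```

Storing b bits per minimum instead of a full hash value is where the storage saving comes from. The price is a noisier estimate for the same number of hash functions, so more functions are typically needed to reach a given accuracy.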

Using social media to strengthen public awareness of wildlife conservation

Xie, Huang et al. 2018

Ocean & Coastal Management

117

Applicability of resonant two-photon ionization in supersonic beam mass spectrometry to halogenated aromatic hydrocarbons

Tembreull, Sin, Li et al. 1985

Anal. Chem.

110

Ping Li

Bioleaching of copper from waste printed circuit boards by bacterial consortium enriched from acid mine drainage

Theory and applications of b-bit minwise hashing
