Computing the pairwise semantic similarity between all words on the Web is a computationally challenging task. Parallelization and optimizations are necessary. We propose a highly scalable implementation based on distributional similarity, implemented in the MapReduce framework and deployed over a 200 billion word crawl of the Web. The pairwise similarity between 500 million terms is computed in 50 hours using 200 quad-core nodes. We apply the learned similarity matrix to the task of automatic set expansion and present a large empirical study to quantify the effect on expansion performance of corpus size, corpus quality, seed composition and seed size. We make public an experimental testbed for set expansion analysis that includes a large collection of diverse entity sets extracted from Wikipedia.
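As a toy illustration of the pairwise-similarity computation (not the paper's actual implementation), the MapReduce-style trick of inverting term-to-feature vectors, so that only terms sharing a context feature contribute to a dot product, can be sketched in a few lines of Python. The corpus, feature weights, and function names below are invented for illustration:

```python
from collections import defaultdict
from math import sqrt

# Toy data: term -> context-feature counts (assumed pre-extracted)
vectors = {
    "apple":  {"eat": 3, "fruit": 5, "tech": 2},
    "banana": {"eat": 4, "fruit": 6},
    "google": {"tech": 7, "search": 3},
}

def pairwise_cosine(vectors):
    """Compute all nonzero pairwise cosines via an inverted index,
    mirroring a map phase (emit feature -> (term, weight)) and a
    reduce phase (accumulate dot products per term pair)."""
    norms = {t: sqrt(sum(w * w for w in feats.values()))
             for t, feats in vectors.items()}
    # "Map": invert term->feature into feature->(term, weight) postings
    index = defaultdict(list)
    for term, feats in vectors.items():
        for f, w in feats.items():
            index[f].append((term, w))
    # "Reduce": only terms sharing a feature contribute to a dot product
    dots = defaultdict(float)
    for postings in index.values():
        for i, (t1, w1) in enumerate(postings):
            for t2, w2 in postings[i + 1:]:
                dots[tuple(sorted((t1, t2)))] += w1 * w2
    return {pair: d / (norms[pair[0]] * norms[pair[1]])
            for pair, d in dots.items()}

sims = pairwise_cosine(vectors)
```

The inversion is what makes the computation scale: term pairs that share no context feature are never compared at all, which is exactly the sparsity that web-scale term-feature matrices exhibit.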
We report on a census of the types of HTML tables on the Web according to a fine-grained classification taxonomy describing the semantics that they express. For each relational table type, we describe the open challenges in extracting semantic triples, i.e., knowledge, from it. We also present TabEx, a supervised framework for web-scale HTML table classification, and apply it to the task of classifying HTML tables into our taxonomy. Through a large-scale experimental analysis over a crawl of the Web, we show empirical evidence that TabEx significantly outperforms several baselines in classification accuracy. We present a detailed feature analysis and outline the most salient features for each table type.
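To give a flavor of the shallow layout signals such a table classifier can exploit (the feature names here are invented for illustration and are not TabEx's actual feature set), a table rendered as rows of cell strings can be summarized like this:

```python
import re

def table_features(rows, has_th=False):
    """Illustrative layout features for distinguishing relational
    tables from layout tables; not TabEx's actual feature set."""
    cells = [c for row in rows for c in row]
    n = len(cells) or 1  # avoid division by zero on empty tables
    numeric = sum(bool(re.fullmatch(r"[\d.,%$]+", c.strip())) for c in cells)
    return {
        "n_rows": len(rows),
        "n_cols": max((len(r) for r in rows), default=0),
        "numeric_ratio": numeric / n,   # relational tables tend to be data-heavy
        "empty_ratio": sum(not c.strip() for c in cells) / n,
        "has_header": has_th,           # presence of <th> cells in the markup
    }

feats = table_features([["City", "Pop."], ["Oslo", "700,000"]], has_th=True)
```

Features like these would then feed a standard supervised classifier trained on labeled tables.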
Sets of named entities are used heavily at commercial search engines such as Google, Yahoo and Bing. Acquiring sets of entities typically consists of combining semi-supervised expansion algorithms with manual cleaning of the resulting expanded sets. In this paper, we study the effects of different seed sets in a state-of-the-art semi-supervised expansion system and show a tremendous variation in expansion performance depending on the choice of seeds. We further show that human editors, in general, provide very poor seed sets, which perform well below the average random seed set. We identify three factors of seed set composition, namely prototypicality, ambiguity and coverage, and we investigate their effects on expansion performance. Finally, we propose several automatic systems for improving editor-generated seed sets, which seek to remove ambiguous and other error-prone seed instances. An extensive experimental analysis shows that expansion quality, measured in R-precision, can be improved by up to 46% on average by removing the right seeds from a seed set. Our automatic methods outperform the human editors' seed sets and on average improve expansion performance by up to 34% over the original seed sets.
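The seed-pruning idea can be sketched abstractly. The following is a greedy leave-one-out sketch under our own assumptions, not the paper's actual systems: given a black-box expander and a gold set, drop any seed whose removal improves R-precision. The `toy_expand` function and its entity lists are invented purely for illustration:

```python
def r_precision(ranked, gold):
    """R-precision: precision at rank R, where R = |gold|."""
    r = len(gold)
    return sum(1 for item in ranked[:r] if item in gold) / r

def prune_seeds(seeds, expand, gold):
    """Greedily drop a seed whenever removing it improves
    expansion R-precision (illustrative, not the paper's method)."""
    best = list(seeds)
    best_score = r_precision(expand(best), gold)
    improved = True
    while improved and len(best) > 1:
        improved = False
        for s in list(best):
            trial = [x for x in best if x != s]
            score = r_precision(expand(trial), gold)
            if score > best_score:
                best, best_score, improved = trial, score, True
                break
    return best, best_score

# Toy black-box expander: "turkey" is ambiguous (country vs. bird)
# and pollutes the expansion with error-prone instances.
def toy_expand(seeds):
    if "turkey" in seeds:
        return ["france", "ankara", "thanksgiving", "germany"]
    return ["france", "germany", "spain", "italy"]

gold = {"france", "germany", "spain", "italy"}
pruned, score = prune_seeds(["france", "turkey"], toy_expand, gold)
```

In practice no gold set is available at pruning time, so the real difficulty, which the automatic systems in the paper address, is estimating which seeds are ambiguous or error-prone without reference data.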
In this paper, the search engine Intuition is described. It allows the user to navigate through the documents retrieved for a given query. The engine provides several "browse help" functions, described here: conceptualisation, named entities, similar documents and entity visualization. These features are intended to save the user's time. To evaluate how much time they can save, a user study was conducted involving 6 users and 18 queries over a corpus of 16 years of the newspaper Le Monde. The results show that, with these features, users reach the needed information faster: fewer non-relevant documents are read (filtering) and more relevant documents are retrieved in less time.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and indicate whether the citing article provides supporting or contrasting evidence. scite is used by students and researchers around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.