Bootstrapping semantics from text is one of the greatest challenges in natural language learning. We first define a word similarity measure based on the distributional patterns of words. This measure allows us to construct a thesaurus from a parsed corpus. We then present a new evaluation methodology for the automatically constructed thesaurus. The evaluation results show that the thesaurus is significantly closer to WordNet than Roget's Thesaurus is.
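A distributional similarity measure of this kind might be sketched as follows. This is a minimal illustration, not the paper's exact formula: the toy co-occurrence counts and dependency-feature names are invented, and the measure shown is a Lin-style ratio of shared to total pointwise-mutual-information feature weight.

```python
from collections import Counter
import math

def pmi_vector(word, cooc, word_totals, feat_totals, total):
    """Pointwise mutual information of each context feature for `word`,
    keeping only positively associated features."""
    vec = {}
    for feat, n in cooc[word].items():
        pmi = math.log((n * total) / (word_totals[word] * feat_totals[feat]))
        if pmi > 0:
            vec[feat] = pmi
    return vec

def lin_similarity(w1, w2, cooc, word_totals, feat_totals, total):
    """Lin-style similarity: weight of shared features over total feature weight."""
    v1 = pmi_vector(w1, cooc, word_totals, feat_totals, total)
    v2 = pmi_vector(w2, cooc, word_totals, feat_totals, total)
    shared = sum(v1[f] + v2[f] for f in v1.keys() & v2.keys())
    denom = sum(v1.values()) + sum(v2.values())
    return shared / denom if denom else 0.0

# Toy co-occurrence counts: word -> {dependency feature: count}
cooc = {
    "duty":       Counter({"obj-of:assume": 3, "mod:fiduciary": 2, "obj-of:avert": 1}),
    "obligation": Counter({"obj-of:assume": 2, "mod:fiduciary": 1, "obj-of:incur": 2}),
    "banana":     Counter({"mod:ripe": 4, "obj-of:peel": 3}),
}
word_totals = {w: sum(c.values()) for w, c in cooc.items()}
feat_totals = Counter()
for c in cooc.values():
    feat_totals.update(c)
total = sum(word_totals.values())

print(lin_similarity("duty", "obligation", cooc, word_totals, feat_totals, total))
print(lin_similarity("duty", "banana", cooc, word_totals, feat_totals, total))
```

Words sharing informative dependency contexts ("duty"/"obligation") score well above words that share none ("duty"/"banana"), which is the property a distributional thesaurus builds on.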
One of the main challenges in question answering is the potential mismatch between the expressions in questions and the expressions in texts. While humans appear to use inference rules such as ‘X writes Y’ implies ‘X is the author of Y’ in answering questions, such rules are generally unavailable to question-answering systems due to the inherent difficulty of constructing them. In this paper, we present an unsupervised algorithm for discovering inference rules from text. Our algorithm is based on an extended version of Harris’ Distributional Hypothesis, which states that words that occur in the same contexts tend to be similar. Instead of applying this hypothesis to words, we apply it to paths in the dependency trees of a parsed corpus. Essentially, if two paths tend to link the same sets of words, we hypothesize that their meanings are similar. We use examples to show that our system discovers many inference rules easily missed by humans.
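The path-based hypothesis above can be sketched as comparing the fillers that occupy the X and Y slots of each dependency path. The filler sets below are invented for illustration, and Jaccard overlap combined by a geometric mean is a deliberate simplification of the PMI-weighted slot similarity actually used.

```python
# Hypothetical slot-filler sets observed for three dependency paths.
paths = {
    "X writes Y": {
        "X": {"Austen", "Tolstoy", "Hemingway"},
        "Y": {"Emma", "War and Peace"},
    },
    "X is the author of Y": {
        "X": {"Austen", "Tolstoy", "Rowling"},
        "Y": {"Emma", "Harry Potter"},
    },
    "X eats Y": {
        "X": {"cat", "dog"},
        "Y": {"fish", "bone"},
    },
}

def jaccard(a, b):
    """Overlap of two filler sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def path_similarity(p1, p2):
    """Geometric mean of the per-slot filler overlap: two paths are
    similar when they tend to link the same Xs to the same Ys."""
    sx = jaccard(paths[p1]["X"], paths[p2]["X"])
    sy = jaccard(paths[p1]["Y"], paths[p2]["Y"])
    return (sx * sy) ** 0.5

print(round(path_similarity("X writes Y", "X is the author of Y"), 3))  # → 0.408
print(path_similarity("X writes Y", "X eats Y"))                        # → 0.0
```

Paths expressing the same relation share fillers in both slots and so score high together, which is exactly the signal that lets an inference rule like ‘X writes Y’ ⇒ ‘X is the author of Y’ be discovered without supervision.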
Non-compositional expressions present a special challenge to NLP applications. We present a method for the automatic identification of non-compositional expressions using their statistical properties in a text corpus. Our method is based on the hypothesis that when a phrase is non-compositional, its mutual information differs significantly from the mutual information of phrases obtained by substituting one of the words in the phrase with a similar word.
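The substitution test could be sketched as below. The corpus counts, the substitute lists, and the fixed log-unit margin are all illustrative assumptions; the point is only the shape of the hypothesis: a phrase whose mutual information towers over that of its near-synonym variants is a candidate non-compositional expression.

```python
import math
from collections import Counter

# Toy corpus statistics (invented for illustration).
unigrams = Counter({"red": 100, "hot": 80, "scarlet": 20,
                    "tape": 50, "ribbon": 40})
bigrams = Counter({("red", "tape"): 30, ("scarlet", "tape"): 0,
                   ("red", "ribbon"): 8})
N = 10000  # assumed corpus size

def mutual_information(w1, w2):
    """Pointwise mutual information of the bigram (w1, w2)."""
    n = bigrams[(w1, w2)]
    if n == 0:
        return float("-inf")
    return math.log((n * N) / (unigrams[w1] * unigrams[w2]))

def looks_non_compositional(w1, w2, substitutes, margin=1.0):
    """Flag (w1, w2) when its mutual information exceeds that of every
    phrase formed by swapping in a similar word, by more than `margin`."""
    mi = mutual_information(w1, w2)
    return all(mi - mutual_information(a, b) > margin for a, b in substitutes)

# "red tape" (idiomatic) vs. substitutions with distributionally similar words:
print(looks_non_compositional("red", "tape",
                              [("scarlet", "tape"), ("red", "ribbon")]))  # → True
```

Replacing "red" with "scarlet" destroys the phrase entirely, and "red ribbon" is far less associated than "red tape"; a compositional phrase would degrade gracefully under such substitutions instead.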
Manually compiled dictionaries usually serve as the source of word sense inventories. However, they often include many rare senses while missing corpus- or domain-specific senses. We present a clustering algorithm called CBC (Clustering By Committee) that automatically discovers word senses from text. It initially discovers a set of tight clusters, called committees, that are well scattered in the similarity space. The centroid of a committee's members is used as the feature vector of the cluster. We then assign words to their most similar clusters. After assigning an element to a cluster, we remove the features it shares with that cluster from the element. This allows CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses. Each cluster that a word belongs to represents one of its senses. We also present an evaluation methodology for automatically measuring the precision and recall of discovered senses.
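The assignment-with-feature-removal step is the distinctive part, and might be sketched as follows. The committees, feature names, and the example word "plant" with its two senses are invented for illustration; committee discovery itself is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    num = sum(u[f] * v[f] for f in set(u) & set(v))
    du = math.sqrt(sum(x * x for x in u.values()))
    dv = math.sqrt(sum(x * x for x in v.values()))
    return num / (du * dv) if du and dv else 0.0

def centroid(members):
    """Average feature vector of a committee's members."""
    c = {}
    for vec in members:
        for f, w in vec.items():
            c[f] = c.get(f, 0.0) + w / len(members)
    return c

def assign_senses(word_vec, committees, threshold=0.1):
    """Repeatedly assign the word to its most similar committee, then
    strip the committee's features so less frequent senses can surface."""
    vec = dict(word_vec)
    senses = []
    while True:
        score, best = max((cosine(vec, c), name) for name, c in committees.items())
        if score < threshold:
            break
        senses.append(best)
        for f in committees[best]:
            vec.pop(f, None)  # remove the overlapping features
    return senses

# Toy committees (centroids of tight clusters) relevant to "plant":
committees = {
    "organization": centroid([{"obj-of:close": 1.0, "mod:manufacturing": 1.0},
                              {"obj-of:close": 0.8, "mod:chemical": 1.0}]),
    "life-form":    centroid([{"mod:flowering": 1.0, "obj-of:water": 1.0},
                              {"mod:green": 1.0, "obj-of:water": 0.7}]),
}
plant = {"obj-of:close": 0.9, "mod:manufacturing": 0.5,
         "mod:flowering": 0.6, "obj-of:water": 0.4}
print(assign_senses(plant, committees))  # → ['organization', 'life-form']
```

Without the removal step, "plant" would simply land in its single most similar cluster; stripping the factory-sense features after the first assignment is what lets the rarer botanical sense emerge as a second cluster membership.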