Clustering is a powerful technique for large-scale topic discovery from text. It involves two phases: first, feature extraction maps each document or record to a point in a high-dimensional space; then, clustering algorithms automatically group the points into a hierarchy of clusters. We describe an unsupervised, near-linear-time text clustering system that offers a number of algorithm choices for each phase. We introduce a methodology for measuring the quality of a cluster hierarchy in terms of F-Measure, and present the results of experiments comparing different algorithms. The evaluation considers some feature-selection parameters (tf·idf and feature vector length) but focuses on the clustering algorithms, namely techniques from Scatter/Gather (buckshot, fractionation, and split/join) and k-means. Our experiments suggest that continuous center adjustment contributes more to cluster quality than seed selection does. It follows that using a simpler seed-selection algorithm gives a better time/quality tradeoff. We describe a refinement to center adjustment, "vector average damping," that further improves cluster quality. We also compare the near-linear-time algorithms to a group-average greedy agglomerative clustering algorithm to demonstrate the time/quality tradeoff quantitatively.
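The abstract does not spell out the damping formula, but one plausible reading of "vector average damping" is that each center update interpolates between the old center and the mean of its newly assigned points, rather than jumping to the mean directly. A minimal sketch of k-means with such a damped update (the `damping` parameter and interpolation rule are assumptions, not the paper's stated method):

```python
import numpy as np

def kmeans_damped(X, k, damping=0.5, iters=20, seed=0):
    """k-means with a damped center adjustment: each new center is a
    weighted average of the old center and the mean of its assigned
    points, instead of the mean alone."""
    rng = np.random.default_rng(seed)
    # random seed selection (fancy indexing copies, so X is not mutated)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                # damped update: interpolate old center toward the new mean
                centers[j] = (1 - damping) * centers[j] + damping * members.mean(axis=0)
    return centers, labels
```

With `damping=1.0` this reduces to standard k-means center adjustment, which makes the refinement easy to ablate against the baseline.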
We describe an approach to building an automatically trainable anaphora resolution system. In this approach, we use Japanese newspaper articles tagged with discourse information as training examples for a machine learning algorithm that employs Quinlan's C4.5 decision tree algorithm (Quinlan, 1993). We then evaluate and compare the results of several variants of the machine-learning-based approach with those of our existing anaphora resolution system, which uses manually designed knowledge sources. Finally, we compare our algorithms with existing theories of anaphora, in particular theories of Japanese zero pronouns.
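The core of C4.5's training procedure is choosing the feature whose split yields the highest gain ratio (information gain normalized by the split's own entropy). A minimal sketch of that criterion, applied to hypothetical antecedent-candidate features (the feature names and toy data are assumptions for illustration, not the paper's feature set):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(examples, labels, feature):
    """C4.5's split criterion for one categorical feature:
    information gain divided by the entropy of the split itself."""
    n = len(examples)
    by_value = {}
    for ex, y in zip(examples, labels):
        by_value.setdefault(ex[feature], []).append(y)
    gain = entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    split_info = entropy([ex[feature] for ex in examples])
    return gain / split_info if split_info else 0.0
```

At each tree node, C4.5 selects the feature with the highest gain ratio and recurses on each resulting partition, which is what lets the learner discover which discourse features actually predict antecedents.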
We describe a trainable and scalable summarization system which utilizes features derived from information retrieval, information extraction, and NLP techniques, as well as on-line resources. The system combines these features using a trainable feature combiner learned from summary examples through a machine learning algorithm. We demonstrate system scalability by reporting results on the best combination of summarization features for different document sources. We also present preliminary results from a task-based evaluation of summarization output usability.

Introduction

Frequency-based (Edmundson, 1969; Kupiec, Pedersen, and Chen, 1995; Brandow, Mitze, and Rau, 1995), knowledge-based (Reimer and Hahn, 1988; McKeown and Radev, 1995), and discourse-based (Johnson et al., 1993; Miike et al., 1994; Jones, 1995) approaches to automated summarization correspond to a continuum of increasing understanding of the text and increasing complexity in text processing. Given the goal of machine-generated summaries, these approaches attempt to answer three central questions:

• How does the system count words to calculate worthiness for summarization?
• How does the system incorporate the knowledge of the domain represented in the text?
• How does the system create a coherent and cohesive summary?

Our work leverages research in these three approaches and attempts to remedy some of the difficulties encountered in each by applying a combination of information retrieval, information extraction, and NLP techniques and on-line resources with machine learning to generate summaries.* Our DimSum system follows a common paradigm of sentence extraction, but automates the acquisition of candidate knowledge and learns what knowledge is necessary to summarize. We present how we automatically acquire candidate features in Section 2.

* We would like to thank Jamie Callan for his help with the INQUERY experiments.
Section 3 describes our training methodology for combining features to generate summaries, and discusses evaluation results of both batch and machine learning methods. Section 4 reports our task-based evaluation.

Extracting Features

In this section, we describe how the system counts linguistically motivated, automatically derived words and multi-words in calculating worthiness for summarization. We show how the system uses an external corpus to incorporate domain knowledge, in contrast to text-only statistics. Finally, we explain how we attempt to increase the cohesiveness of our summaries by using name aliasing, WordNet synonyms, and morphological variants.

Defining Single and Multi-word Terms

Frequency-based summarization systems typically use a single word string as the unit for counting frequency. Though robust, such a method ignores the semantic content of words and their potential membership in multi-word phrases, and may introduce noise in frequency counting by treating the same strings uniformly regardless of context. Our approach, similar to (Tzoukermann, Klavans, and Jacquemin, 1997), is to apply NLP tools to extract multi-word phrases autom...
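To make the frequency-based baseline concrete: the simplest form of sentence extraction scores each sentence by the corpus frequency of its content words and keeps the top scorers in document order. A minimal sketch under that assumption (the stopword list and regex tokenizer are illustrative simplifications, not DimSum's linguistically motivated term extraction):

```python
import re
from collections import Counter

def extract_summary(text, n_sentences=2):
    """Minimal frequency-based sentence extraction: score each sentence
    by the summed document frequency of its content words, then return
    the top-scoring sentences in their original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'[a-z]+', text.lower())
    stop = {'the', 'a', 'an', 'of', 'to', 'in', 'and', 'is', 'are', 'for', 'on', 'with'}
    freq = Counter(w for w in words if w not in stop)
    # rank sentence indices by descending total content-word frequency
    scored = sorted(range(len(sentences)),
                    key=lambda i: -sum(freq[w] for w in re.findall(r'[a-z]+', sentences[i].lower())))
    keep = sorted(scored[:n_sentences])  # restore document order
    return [sentences[i] for i in keep]
```

The multi-word terms, name aliasing, and synonym features described above can be seen as progressively replacing the naive single-string counts in `freq` with linguistically richer units.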