This paper presents an unsupervised learning algorithm for sense disambiguation that, when trained on unannotated English text, rivals the performance of supervised techniques that require time-consuming hand annotations. The algorithm is based on two powerful constraints-that words tend to have one sense per discourse and one sense per collocation-exploited in an iterative bootstrapping procedure. Tested accuracy exceeds 96%.
Social media outlets such as Twitter have become an important forum for peer interaction. Thus the ability to classify latent user attributes, including gender, age, regional origin, and political orientation solely from Twitter user language or similar highly informal content has important applications in advertising, personalization, and recommendation. This paper includes a novel investigation of stacked-SVM-based classification algorithms over a rich set of original features, applied to classifying these four user attributes. It also includes extensive analysis of features and approaches that are effective and not effective in classifying user attributes in Twitter-style informal written genres as distinct from the other primarily spoken genres previously studied in the userproperty classification literature. Our models, singly and in ensemble, significantly outperform baseline models in all cases. A detailed analysis of model components and features provides an often entertaining insight into distinctive language-usage variation across gender, age, regional origin and political orientation in modern informal communication.
This paper describes a system and set of algorithms for automatically inducing stand-alone monolingual part-of-speech taggers, base noun-phrase bracketers, named-entity taggers and morphological analyzers for an arbitrary foreign language. Case studies include French, Chinese, Czech and Spanish.Existing text analysis tools for English are applied to bilingual text corpora and their output projected onto the second language via statistically derived word alignments. Simple direct annotation projection is quite noisy, however, even with optimal alignments. Thus this paper presents noise-robust tagger, bracketer and lemmatizer training procedures capable of accurate system bootstrapping from noisy and incomplete initial projections.Performance of the induced stand-alone part-of-speech tagger applied to French achieves 96% core part-of-speech (POS) tag accuracy, and the corresponding induced noun-phrase bracketer exceeds 91% F-measure. The induced morphological analyzer achieves over 99% lemmatization accuracy on the complete French verbal system. This achievement is particularly noteworthy in that it required absolutely no hand-annotated training data in the given language, and virtually no language-specific knowledge or resources beyond raw text. Performance also significantly exceeds that obtained by direct annotation projection.
This paper describes a program that disambignates English word senses in unrestricted text using statistical models of the major Roget's Thesaurus categories. Roget's categories serve as approximations of conceptual classes. The categories listed for a word in Roger's index tend to correspond to sense distinctions; thus selecting the most likely category provides a useful level of sense disambiguatiou. The selection of categories is accomplished by identifying and weighting words that are indicative of each category when seen in context, using a Bayesian theoretical framework.
The 2016 SIGMORPHON Shared Task was devoted to the problem of morphological reinflection. It introduced morphological datasets for 10 languages with diverse typological characteristics. The shared task drew submissions from 9 teams representing 11 institutions reflecting a variety of approaches to addressing supervised learning of reinflection. For the simplest task, inflection generation from lemmas, the best system averaged 95.56% exact-match accuracy across all languages, ranging from Maltese (88.99%) to Hungarian (99.30%). With the relatively large training datasets provided, recurrent neural network architectures consistently performed best-in fact, there was a significant margin between neural and non-neural approaches. The best neural approach, averaged over all tasks and languages, outperformed the best nonneural one by 13.76% absolute; on individual tasks and languages the gap in accuracy sometimes exceeded 60%. Overall, the results show a strong state of the art, and serve as encouragement for future shared tasks that explore morphological analysis and generation with varying degrees of supervision.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.