Conditional maximum entropy (ME) models provide a general-purpose machine learning technique that has been successfully applied to fields as diverse as computer vision and econometrics, and that is used for a wide variety of classification problems in natural language processing. However, the flexibility of ME models is not without cost. While parameter estimation for ME models is conceptually straightforward, in practice ME models for typical natural language tasks are very large and may well contain many thousands of free parameters. In this paper, we consider a number of algorithms for estimating the parameters of ME models, including iterative scaling, gradient ascent, conjugate gradient, and variable metric methods. Surprisingly, the standardly used iterative scaling algorithms perform quite poorly in comparison to the others, and on all of the test problems a limited-memory variable metric algorithm outperformed the other choices.
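The estimation problem the abstract describes can be made concrete with a minimal sketch. The code below trains a tiny conditional maxent (multinomial logistic) model by plain gradient ascent, one of the simpler methods the paper compares; the data, feature vectors, learning rate, and iteration count are all invented for illustration and are not from the paper.

```python
import math

# Toy conditional maxent model: two classes, two binary features,
# trained by gradient ascent on the conditional log-likelihood.
# All data and hyperparameters here are illustrative assumptions.

# Training data: (feature vector, class label). Feature 0 perfectly
# predicts class 0; feature 1 is uninformative.
data = [((1.0, 0.0), 0), ((1.0, 1.0), 0), ((0.0, 1.0), 1), ((0.0, 0.0), 1)]
n_classes, n_feats = 2, 2
w = [[0.0] * n_feats for _ in range(n_classes)]  # one weight vector per class

def probs(x):
    """p(y | x) under the current weights (softmax over class scores)."""
    scores = [sum(wy[i] * x[i] for i in range(n_feats)) for wy in w]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# Gradient of the conditional log-likelihood:
# observed feature counts minus expected feature counts under the model.
lr = 0.5
for _ in range(200):
    grad = [[0.0] * n_feats for _ in range(n_classes)]
    for x, y in data:
        p = probs(x)
        for c in range(n_classes):
            for i in range(n_feats):
                grad[c][i] += ((1.0 if c == y else 0.0) - p[c]) * x[i]
    for c in range(n_classes):
        for i in range(n_feats):
            w[c][i] += lr * grad[c][i]

print(probs((1.0, 0.0))[0])  # close to 1: feature 0 now signals class 0
```

The iterative scaling and limited-memory variable metric (L-BFGS) methods the paper evaluates optimize the same objective; only the update rule in the training loop differs.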
Crosslinguistically, inflectional morphology exhibits a spectacular range of complexity in both the structure of individual words and the organization of the systems that words participate in. We distinguish two dimensions in the analysis of morphological complexity. Enumerative complexity (E-complexity) reflects the number of morphosyntactic distinctions that languages make and the strategies employed to encode them, concerning either the internal composition of words or the arrangement of classes of words into inflection classes. This, we argue, is constrained by integrative complexity (I-complexity). The I-complexity of an inflectional system reflects the difficulty that a paradigmatic system poses for language users (rather than lexicographers) in information-theoretic terms. This becomes clear by distinguishing average paradigm entropy from average conditional entropy. The average entropy of a paradigm is the uncertainty in guessing the realization for a particular cell of the paradigm of a particular lexeme (given knowledge of the possible exponents). This gives one a measure of the complexity of a morphological system—systems with more exponents and more inflection classes will in general have higher average paradigm entropy—but it presupposes a problem that adult native speakers will never encounter. In order to know that a lexeme exists, the speaker must have heard at least one word form, so in the worst case a speaker will be faced with predicting a word form based on knowledge of one other word form of that lexeme. Thus, a better measure of morphological complexity is the average conditional entropy: the average uncertainty in guessing the realization of one randomly selected cell in the paradigm of a lexeme given the realization of one other randomly selected cell. This is the I-complexity of paradigm organization.
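The contrast between the two measures can be sketched on a toy system. The three inflection classes and suffixes below are invented for illustration (they are not from the paper); the point is that knowing one cell of a paradigm reduces the uncertainty about another, so the average conditional entropy comes out lower than the average paradigm entropy.

```python
import math
from collections import Counter, defaultdict

# Toy inflectional system: three equiprobable inflection classes,
# each realizing two paradigm cells (sg, pl) with a suffix.
# Classes and suffixes are invented for illustration.
classes = {
    "I":   {"sg": "-a", "pl": "-i"},
    "II":  {"sg": "-o", "pl": "-i"},
    "III": {"sg": "-o", "pl": "-e"},
}
p_class = 1.0 / len(classes)
cells = ["sg", "pl"]

def entropy(dist):
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Average paradigm entropy: mean uncertainty of each cell in isolation,
# knowing only the possible exponents and their frequencies.
cell_H = []
for c in cells:
    dist = Counter()
    for forms in classes.values():
        dist[forms[c]] += p_class
    cell_H.append(entropy(dist))
avg_paradigm_H = sum(cell_H) / len(cell_H)

# Average conditional entropy: uncertainty about one cell given the
# realization of another, averaged over ordered cell pairs.
pair_H = []
for known in cells:
    for target in cells:
        if known == target:
            continue
        joint = defaultdict(Counter)
        for forms in classes.values():
            joint[forms[known]][forms[target]] += p_class
        h = 0.0
        for targets in joint.values():
            p_known = sum(targets.values())
            cond = {t: p / p_known for t, p in targets.items()}
            h += p_known * entropy(cond)
        pair_H.append(h)
avg_cond_H = sum(pair_H) / len(pair_H)

print(avg_paradigm_H, avg_cond_H)  # conditional entropy is lower
```

In this toy system, hearing the singular "-a" fully determines the plural, while "-o" leaves a two-way choice, so the conditional measure (about 0.67 bits) is lower than the cell-by-cell measure (about 0.92 bits), mirroring the paper's distinction between E-complexity and I-complexity.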
Viewed from this information-theoretic perspective, languages that appear to differ greatly in their E-complexity—the number of exponents, inflectional classes, and principal parts—can actually be quite similar in terms of the challenge they pose for a language user who already knows how the system works. We adduce evidence for this hypothesis from three sources: a comparison between languages of varying degrees of E-complexity, a case study from the particularly challenging conjugational system of Chiquihuitlán Mazatec, and a Monte Carlo simulation modeling the encoding of morphosyntactic properties into formal expressions. The results of these analyses provide evidence for the crucial status of words and paradigms for understanding morphological organization.
Humans show an amazing ability to produce novel words based on previous experience. What analogical processes are at work in this process, and how do analogical generalizations emerge from complex morphological systems? This chapter addresses these questions with new quantitative measures. Words are construed as recombinant gestalts. The predictive value of particular words in relation to others is calculated in terms of measures of conditional entropy. When applied to Tundra Nenets nominal paradigms, the model captures central aspects of morphological organization and learning.
(Research paper)
Purpose: To evaluate and extend existing natural language processing techniques into the domain of informal online political discussions.
Design/methodology/approach: A database of postings from a U.S. political discussion site was collected, along with self-reported political orientation data for the users. A variety of sentiment analysis, text classification, and social network analysis methods were applied to the postings and evaluated against the users' self-descriptions.
Findings: Purely text-based methods performed poorly, but could be improved using techniques which took into account the users' position in the online community.
Research limitations: The techniques we applied here are fairly simple, and more sophisticated learning algorithms may yield better results for text-based classification.
Practical implications: This work suggests that social network analysis is an important tool for performing natural language processing tasks with informal web texts.