Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (2005), the use of the 2 and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (2001) and Paquot & Bestgen (2009), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.
This work is a case study of applying nonparametric statistical methods to corpus data. We show how to use ideas from permutation testing to answer linguistic questions related to morphological productivity and type richness. In particular, we study the use of the suffixes-ity and-ness in the 17 th-century part of the Corpus of Early English Correspondence within the framework of historical sociolinguistics. Our hypothesis is that the productivity of-ity, as measured by type counts, is significantly low in letters written by women. To test such hypotheses, and to facilitate exploratory data analysis, we take the approach of computing accumulation curves for types and hapax legomena. We have developed an open source computer program which uses Monte Carlo sampling to compute the upper and lower bounds of these curves for one or more levels of statistical significance. By comparing the type accumulation from women's letters with the bounds, we are able to confirm our hypothesis.
Many corpus linguists make the tacit assumption that part-of-speech frequencies remain constant during the period of observation. In this article, we will consider two related issues: (1) the reliability of part-of-speech tagging in a diachronic corpus, and (2) shifts in tag ratios over time. The purpose is both to serve the users of the corpus by making them aware of potential problems, and to obtain linguistically interesting results. We use noun and pronoun ratios as diagnostics indicative of opposing stylistic tendencies, but we are also interested in testing whether any observed variation in the ratios could be accounted for in sociolinguistic terms. The material for our study is provided by the Parsed Corpus of Early English Correspondence (PCEEC), which consists of 2.2 million running words covering the period 1415-1681. The part-of-speech tagging of the PCEEC has its problems, which we test by reannotating the corpus according to our own principles and comparing the two annotations. While there are quite a few changes, the mean percentage of change is very small for both nouns and pronouns. As for variation over time, the mean frequency of nouns declines somewhat, while the mean frequency of pronouns fluctuates with no clear diachronic trend. However, women consistently use more pronouns than men, while men use more nouns than women. More fine-grained distinctions are needed to uncover further regularities and possible reasons for this variation.
This paper studies the use of NMT (neural machine translation) as a normalization method for an early English letter corpus. The corpus has previously been normalized so that only less frequent deviant forms are left out without normalization. This paper discusses different methods for improving the normalization of these deviant forms by using different approaches. Adding features to the training data is found to be unhelpful, but using a lexicographical resource to filter the top candidates produced by the NMT model together with lemmatization improves results.
The paper focuses on three important themes in historical sociolinguistics: (1) the emergence of national language planning in the Netherlands around 1800, (2) the influence of historical prescriptivism on usage, and (3) genre as a crucial factor in explaining variation and change. The case study deals with relativisation, particularly the neuter relative pronoun in eighteenth-and nineteenth-century Dutch. Analysing both internal and external factors, we show that the definiteness of the antecedent does not explain the variation, contrary to what is assumed in the research literature. Likewise, a strong effect of language norms on usage patterns cannot be established. The crucial factor turns out to be genre.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.