Tanja Säily scite author profile

Significance testing of word frequencies in corpora

Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (2005), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (2001) and Paquot & Bestgen (2009), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.
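The text-level representation described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the token-list representation of texts, and the choice of a normal approximation are assumptions made for illustration. Each text contributes one relative frequency, and the two samples of frequencies are compared with a Wilcoxon rank-sum test.

```python
import math
from collections import Counter


def relative_frequencies(texts, word):
    """Per-text relative frequency of `word`; each text is a list of tokens."""
    return [text.count(word) / len(text) for text in texts if text]


def rank_sum_test(xs, ys):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected,
    no continuity correction). Returns (U statistic for xs, p-value)."""
    n1, n2 = len(xs), len(ys)
    n = n1 + n2
    pooled = sorted(xs + ys)
    # Assign each distinct value the mean of the ranks it occupies.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    r1 = sum(rank_of[x] for x in xs)
    u1 = r1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    ties = sum(t**3 - t for t in Counter(pooled).values())
    var_u = n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1)))
    if var_u == 0:  # every observation identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u1, p


# Hypothetical toy data: each inner list is one text.
corpus_a = [["she", "said", "she", "would"], ["she", "came", "home", "late"]]
corpus_b = [["he", "said", "no", "more"], ["the", "rain", "fell", "fast"]]
u, p = rank_sum_test(relative_frequencies(corpus_a, "she"),
                     relative_frequencies(corpus_b, "she"))
```

Because the unit of observation is the text, a word that is very frequent in a single text affects only one data point, which is the motivation the abstract gives for moving away from word-level independence. With samples as small as the toy data the normal approximation is unreliable; an exact test would be the safer choice there.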


Comparing type counts: The case of women, men and -ity in early English letters

Säily, Suomela (2009)


This work is a case study of applying nonparametric statistical methods to corpus data. We show how to use ideas from permutation testing to answer linguistic questions related to morphological productivity and type richness. In particular, we study the use of the suffixes -ity and -ness in the 17th-century part of the Corpus of Early English Correspondence within the framework of historical sociolinguistics. Our hypothesis is that the productivity of -ity, as measured by type counts, is significantly low in letters written by women. To test such hypotheses, and to facilitate exploratory data analysis, we take the approach of computing accumulation curves for types and hapax legomena. We have developed an open-source computer program which uses Monte Carlo sampling to compute the upper and lower bounds of these curves for one or more levels of statistical significance. By comparing the type accumulation from women's letters with the bounds, we are able to confirm our hypothesis.
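The idea of comparing a subcorpus against randomly accumulated texts can be sketched roughly as follows. This is a simplified illustration, not the authors' open-source program: the function names, the token-list representation of texts, and the rule of stopping before the token budget is exceeded are all assumptions. It shuffles the whole corpus many times, accumulates whole texts up to the size of the subcorpus, and counts how often the random selection has at most as many types as the subcorpus.

```python
import random


def type_count(texts):
    """Number of distinct types in a list of texts (each a list of tokens)."""
    return len({tok for text in texts for tok in text})


def low_type_count_p_value(subcorpus, corpus, n_perm=2000, seed=0):
    """Monte Carlo estimate of how often a random run of whole texts,
    no larger than `subcorpus` in tokens, has at most as many types."""
    rng = random.Random(seed)
    target_tokens = sum(len(text) for text in subcorpus)
    observed = type_count(subcorpus)
    order = list(corpus)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        seen, tokens = set(), 0
        for text in order:
            # Stop before overshooting, so the random sample is never larger
            # than the subcorpus (an assumption that errs towards caution
            # when testing for a low type count).
            if tokens + len(text) > target_tokens:
                break
            seen.update(text)
            tokens += len(text)
        if len(seen) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: two repetitive texts among ten varied ones.
repetitive = [["a", "a", "a"], ["a", "a", "a"]]
varied = [[f"w{i}{j}" for j in range(3)] for i in range(10)]
p = low_type_count_p_value(repetitive, repetitive + varied)
```

Shuffling whole texts rather than individual words keeps each writer's vocabulary together, which matches the permutation-testing motivation in the abstract. A small value of `p` suggests that the subcorpus has fewer types than would be expected from randomly chosen texts of comparable total size.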


Variation in morphological productivity in the BNC: Sociolinguistic and methodological considerations

Säily (2011)


Variation in noun and pronoun frequencies in a sociohistorical corpus of English

Säily, Nevalainen, Siirtola (2011). Literary and Linguistic Computing


Many corpus linguists make the tacit assumption that part-of-speech frequencies remain constant during the period of observation. In this article, we will consider two related issues: (1) the reliability of part-of-speech tagging in a diachronic corpus, and (2) shifts in tag ratios over time. The purpose is both to serve the users of the corpus by making them aware of potential problems, and to obtain linguistically interesting results. We use noun and pronoun ratios as diagnostics indicative of opposing stylistic tendencies, but we are also interested in testing whether any observed variation in the ratios could be accounted for in sociolinguistic terms. The material for our study is provided by the Parsed Corpus of Early English Correspondence (PCEEC), which consists of 2.2 million running words covering the period 1415-1681. The part-of-speech tagging of the PCEEC has its problems, which we test by reannotating the corpus according to our own principles and comparing the two annotations. While there are quite a few changes, the mean percentage of change is very small for both nouns and pronouns. As for variation over time, the mean frequency of nouns declines somewhat, while the mean frequency of pronouns fluctuates with no clear diachronic trend. However, women consistently use more pronouns than men, while men use more nouns than women. More fine-grained distinctions are needed to uncover further regularities and possible reasons for this variation.


Revisiting NMT for Normalization of Early English Letters

Hämäläinen, Säily, Rueter et al. (2019)


This paper studies the use of NMT (neural machine translation) as a normalization method for an early English letter corpus. The corpus has previously been normalized so that only less frequent deviant forms are left out without normalization. This paper discusses different methods for improving the normalization of these deviant forms by using different approaches. Adding features to the training data is found to be unhelpful, but using a lexicographical resource to filter the top candidates produced by the NMT model together with lemmatization improves results.


Relativisation in Dutch diaries, private letters and newspapers (1770–1840)

Krogull, Rutten, Wal et al. (2017)


The paper focuses on three important themes in historical sociolinguistics: (1) the emergence of national language planning in the Netherlands around 1800, (2) the influence of historical prescriptivism on usage, and (3) genre as a crucial factor in explaining variation and change. The case study deals with relativisation, particularly the neuter relative pronoun in eighteenth- and nineteenth-century Dutch. Analysing both internal and external factors, we show that the definiteness of the antecedent does not explain the variation, contrary to what is assumed in the research literature. Likewise, a strong effect of language norms on usage patterns cannot be established. The crucial factor turns out to be genre.


Sociolinguistic variation in morphological productivity in eighteenth-century English

Säily (2016)


Chapter 12. Change or variation? Productivity of the suffixes ‑ness and ‑ity

Säily (2018)


Significance testing of word frequencies in corpora
