This paper introduces a project to develop a reliable, cost-effective method for classifying Internet texts into register categories, and to apply that approach to the analysis of a large corpus of web documents. To date, the project has proceeded in two key phases. First, we developed a bottom-up method for web register classification, asking end users of the web to use a decision-tree survey to code relevant situational characteristics of web documents, resulting in a bottom-up identification of register and subregister categories. We present details regarding the development and testing of this method through a series of ten pilot studies. In the second phase of the project, we applied this procedure to a corpus of 53,000 web documents. An analysis of the results demonstrates the effectiveness of these methods for web register classification and provides a preliminary description of the types and distribution of registers on the web.

Literature Review: Registers and Genres

Over the past three decades, register has emerged as one of the most important predictors of linguistic variation, and a wide range of registers have been described and compared
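As an illustration of how such a decision-tree survey might be operationalized, the sketch below walks a rater through a hierarchy of situational questions until a register label is reached. The questions, categories, and flow are hypothetical simplifications, not the actual instrument developed in the project.

```python
# Hypothetical sketch of a decision-tree register survey: each node asks one
# situational question; answers either lead to another question or terminate
# in a register category. Questions and categories here are illustrative only.
TREE = {
    "question": "Is the text primarily interactive (e.g., a discussion)?",
    "answers": {
        "yes": {"register": "Interactive discussion"},
        "no": {
            "question": "Is the main purpose to narrate events?",
            "answers": {
                "yes": {"register": "Narrative"},
                "no": {"register": "Informational description"},
            },
        },
    },
}

def code_document(node, answer_fn):
    """Walk the tree, asking questions until a register label is reached."""
    while "register" not in node:
        node = node["answers"][answer_fn(node["question"])]
    return node["register"]

# Example: a rater who answers "no" then "yes" codes the text as Narrative.
answers = iter(["no", "yes"])
print(code_document(TREE, lambda q: next(answers)))  # -> Narrative
```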
Keyword analysis has become an indispensable tool for discourse analysts, used to identify the words that are especially characteristic of the texts in a target discourse domain. Surprisingly, however, the statistical computation of keyness makes no reference to those texts: once a corpus has been constructed, it is treated as a homogeneous whole for the computation of keyness. As a result, the keywords in such lists are relatively frequent in the corpus, but they are often not widely dispersed across the texts of that corpus and are thus not truly representative of the target discourse domain. The purpose of this study is to propose a new method for keyword analysis, text dispersion keyness, that is based on text dispersion rather than corpus frequency. We compare the effectiveness of this measure to that of four other methods for computing keyness, carrying out a series of case studies to identify the keywords that are typical of online travel blogs. A variety of quantitative and qualitative analyses comparing these methods on their content-generalisability and content-distinctiveness demonstrate that text dispersion keyness is a superior measure for generating keyword lists.
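To make the contrast concrete, the sketch below scores keyness from document frequencies (the number of texts containing each word) rather than token frequencies, feeding those dispersion counts into a standard Dunning-style log-likelihood statistic. This is a minimal illustration of the idea described above; the exact computation used in the study may differ.

```python
import math
from collections import Counter

def doc_freq(texts):
    """Count, for each word, the number of texts it occurs in (dispersion)."""
    df = Counter()
    for text in texts:
        df.update(set(text.lower().split()))
    return df

def log_likelihood(a, b, n1, n2):
    """Dunning-style log-likelihood for counts a (target) vs b (reference)."""
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def dispersion_keywords(target_texts, reference_texts, top=20):
    """Rank words by keyness computed over text dispersion counts."""
    df_t, df_r = doc_freq(target_texts), doc_freq(reference_texts)
    n1, n2 = len(target_texts), len(reference_texts)
    scores = {w: log_likelihood(df_t[w], df_r.get(w, 0), n1, n2)
              for w in df_t}
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

Because the inputs are counts of texts rather than counts of tokens, a word that occurs many times in a single document cannot dominate the list, which is precisely the failure mode of frequency-based keyness that the abstract describes.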
Parametric analyses such as t tests and ANOVAs are the norm, if not the default, statistical tests found in quantitative applied linguistics research (Gass 2009). Applied statisticians and one applied linguist (Larson-Hall 2010, 2012; Larson-Hall and Herrington 2010), however, have argued that this approach may not be appropriate for small samples and/or non-normally distributed data (e.g., Wilcox 2003), both common in second language (L2) research. They recommend instead 'robust statistics' such as bootstrapping, a non-parametric procedure that randomly resamples from an observed data set to produce a simulated but more stable and statistically accurate outcome. The present study tests the usefulness of bootstrapping by reanalyzing raw data from 26 studies of applied linguistics research. Our reanalysis found no evidence of Type II error (false negatives). However, four out of 16 statistically significant results were not replicated (i.e., a Type I error 'misfit' rate five times higher than an alpha of .05). We discuss empirically justified suggestions for the use of bootstrapping in the context of broader methodological issues and reforms in applied linguistics (see Author, in press).
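For readers unfamiliar with the procedure, the sketch below implements a basic percentile bootstrap of a mean difference between two groups. It illustrates the general resampling technique only; it is not the reanalysis pipeline used in the study.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_mean_diff(group1, group2, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a difference in means.

    Each iteration resamples both groups with replacement and records the
    resulting mean difference; the CI is read off the empirical distribution.
    """
    g1, g2 = np.asarray(group1, float), np.asarray(group2, float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(g1, size=g1.size, replace=True).mean()
                    - rng.choice(g2, size=g2.size, replace=True).mean())
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi  # "significant" at alpha if the interval excludes zero

# Example with small, non-normal samples of the kind common in L2 research.
treatment = [12, 15, 9, 22, 30, 11, 14]
control = [10, 8, 13, 9, 12, 7, 11]
print(bootstrap_mean_diff(treatment, control))
```

Because no distributional assumptions are made, the interval remains interpretable for the small, skewed samples that motivate the recommendation discussed above.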
Most previous linguistic investigations of the web have focused on special linguistic features associated with Internet language (e.g., the use of emoticons, abbreviations, contractions, and acronyms) and the "new" Internet registers that are especially salient to observers (e.g., blogs, Internet forums, instant messages, tweets). Multi-Dimensional (MD) analysis has also been used to analyze Internet registers, focusing on core grammatical features (e.g., nouns, verbs, prepositional phrases). MD research differs theoretically and methodologically from most other research approaches in linguistics in that it is built on the notion of linguistic co-occurrence, with the claim that register differences are best described in terms of sets of co-occurring linguistic features that have a functional underpinning. At the same time, though, most previous MD studies are similar to other previous research in their focus on new Internet registers, such as blogs, Facebook/Twitter posts, and email messages. These are the registers that we immediately think of in association with the Internet, and thus it makes sense that they have been the focus of most previous research. However, that emphasis means that we know surprisingly little at present about the full range of registers found on the web and the patterns of linguistic variation among those registers. Filling that gap is the goal of the present study. Rather than beginning with a focus on new registers that are assumed to be interesting, we analyze a representative sample of the entire searchable web. End users coded the situational and communicative characteristics of each document in our corpus, leading to a much wider range of register categories than that used in any previous linguistic study: eight general categories, several hybrid register categories, and twenty-seven specific register categories. This approach thus yields a much more inclusive and diverse sample of web registers than that found in any previous study of English Internet language. Using MD analysis, we document the patterns of linguistic variation among those registers, exploring the dimensions of linguistic variation on the searchable web and the similarities and differences among web registers with respect to those dimensions.
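The computational core of an MD analysis can be sketched as follows: normalized feature rates per text are factor-analyzed, and each text receives a score on each resulting dimension. The snippet below is a schematic stand-in using scikit-learn on toy data; published MD studies typically involve additional steps (feature selection, scree tests, promax rotation) not shown here.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Toy matrix: rows are texts, columns are normalized rates (per 1,000 words)
# of grammatical features, e.g. nouns, verbs, prepositions, pronouns.
rng = np.random.default_rng(0)
X = rng.poisson(lam=[120, 90, 60, 40], size=(200, 4)).astype(float)

# Standardize so features with large rates do not dominate the factors.
Z = StandardScaler().fit_transform(X)

# Extract co-occurrence dimensions: each factor groups features that tend
# to rise and fall together across texts.
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(Z)   # per-text score on each dimension
loadings = fa.components_      # feature loadings on each dimension

print(loadings.round(2))  # inspect which features co-occur on each dimension
```

The loadings are what give each dimension its functional interpretation: features that load together are read as serving a shared communicative purpose, and registers are then compared by their mean dimension scores.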
Frequency is often the only variable considered when researchers or teachers develop vocabulary materials for second language (L2) learners. However, researchers have also found that many other variables affect vocabulary acquisition. In this study, we explored the relationship between L2 vocabulary acquisition and a variety of lexical characteristics using vocabulary recognition test data from L2 English learners. Conducting best subsets multiple regression analysis to explore all possible combinations of variables, we produced a best-fitting model of vocabulary difficulty consisting of six variables (R² = .37). The fact that many variables contributed significantly to the regression model, and that a large amount of variance remained unexplained by the frequency variable considered in this study, indicates that much more than frequency alone affects the likelihood that learners will learn certain L2 words.
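As a sketch of the model-selection procedure the abstract names, the snippet below fits an ordinary least-squares model for every combination of candidate predictors and keeps the subset with the highest adjusted R². The predictor names are hypothetical placeholders, not the lexical variables from the study.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n, k):
    """Penalize R^2 for the number of predictors k given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def best_subsets(X, y, names):
    """Exhaustively fit all predictor subsets; return the best by adj. R^2."""
    n = len(y)
    best = (-np.inf, None)
    for k in range(1, len(names) + 1):
        for cols in combinations(range(len(names)), k):
            idx = list(cols)
            r2 = LinearRegression().fit(X[:, idx], y).score(X[:, idx], y)
            adj = adjusted_r2(r2, n, k)
            if adj > best[0]:
                best = (adj, [names[c] for c in cols])
    return best

# Placeholder lexical variables (hypothetical, for illustration only).
names = ["frequency", "word_length", "concreteness", "polysemy"]
rng = np.random.default_rng(1)
X = rng.normal(size=(150, len(names)))
y = 0.5 * X[:, 0] + 0.3 * X[:, 2] + rng.normal(size=150)
print(best_subsets(X, y, names))
```

Unlike stepwise selection, this exhaustive search cannot miss a better-fitting subset, though its cost grows exponentially with the number of candidate variables, which is why it is usually reserved for modest predictor sets like the one described above.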