Dawn Knight scite author profile

This paper reports on the construction of CANELC: the Cambridge and Nottingham e-language Corpus. CANELC is a one million word corpus of digital communication in English, taken from online discussion boards, blogs, tweets, emails and SMS messages. The paper outlines the approaches used when planning the corpus: obtaining consent; collecting the data and compiling the corpus database. This is followed by a detailed analysis of some of the patterns of language used in the corpus. The analysis includes a discussion of the key words and phrases used as well as the common themes and semantic associations connected with the data. These discussions form the basis of an investigation of how e-language operates in both similar and different ways to spoken and written records of communication (as evidenced by the BNC, the British National Corpus).

The future of multimodal corpora

Knight

2011

Rev. bras. linguist. apl.

This paper takes stock of the current state-of-the-art in multimodal corpus linguistics, and proposes some projections of future developments in this field. It provides a critical overview of key multimodal corpora that have been constructed over the past decade and presents a wish-list of future technological and methodological advancements that may help to increase the availability, utility and functionality of such corpora for linguistic research.

Developing computational infrastructure for the CorCenCC corpus: The National Corpus of Contemporary Welsh

Knight

Loizides

Neale

et al. 2020

Lang Resources & Evaluation

CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes, the National Corpus of Contemporary Welsh) is the first comprehensive corpus of Welsh designed to be reflective of language use across communication types, genres, speakers, language varieties (regional and social) and contexts. This article focuses on the computational infrastructure that we have designed to support data collection for CorCenCC, and the subsequent uses of the corpus which include lexicography, pedagogical research and corpus analysis. A grassroots approach to design has been adopted, that has adapted and extended previous corpus-building and introduced new features as required for this specific context and language. The key pillars of the infrastructure include a framework that supports metadata collection, an innovative mobile application designed to collect spoken data (utilising a crowdsourcing approach), a backend database that stores curated data and a web-based interface that allows users to query the data online. A usability study was conducted to evaluate the user-facing tools and to suggest directions for future improvements. Though the infrastructure was developed for Welsh language collection, its design can be re-used to support corpus development in other minority or major language contexts, broadening the potential utility and impact of this work.

Keywords: Language resources · Natural language processing · Data modelling · Information retrieval · Web interfaces · Usability testing

The reception of public health messages during the COVID-19 pandemic

McClaughlin

Vilar-Lluch

Parnell

et al. 2023

Applied Corpus Linguistics

How can a corpus be used to explore patterns?

Adolphs

Knight

2010

Formality in Digital Discourse: A Study of Hedging in CANELC

Knight

Adolphs

Carter

2013

This chapter provides a corpus-based analysis of formality in e-language. It examines how levels of formality differ from one 'mode' of e-language to the next, and how these collectively compare to spoken and written discourse, providing the foundations for enhancing our descriptions and understanding of e-language use. The chapter focuses on common indicators of formality in discourse with particular reference to the use of hedging. It profiles the use of specific varieties of this phenomenon, paying particular attention to how the frequency and use of hedges compare from one mode of e-language and text topic to the next, and, more generally, how they compare to one-million-word samples of data taken from the written and spoken BNC. The analyses are based on the newly constructed one-million-word CANELC corpus of digital English. CANELC stands for the Cambridge and Nottingham e-language Corpus. It contains data from online discussion boards, blogs, tweets, emails and SMS messages. The data covers a range of different discursive topics, from the more public concerns of 'news, media and current affairs', through to 'teaching, academia and education', 'hobbies and pastimes', 'music', 'celebrity news and gossip' to 'personal and daily life'.

Building a spoken corpus

Adolphs

Knight

Dawn Knight

Chest physiotherapy and porencephalic brain lesions in very preterm infants

CANELC: constructing an e-language corpus
