Using web data for linguistic purposes

Lüdeling, Anke; Evert, Stefan; Baroni, Marco

doi:10.1163/9789401203791_003

Cited by 49 publications

(15 citation statements)

References 14 publications


“…Bergh, Seppänen, & Trotta, 1998; Fletcher, 2007; Kilgarriff & Grefenstette, 2003; Lüdeling, Evert, & Baroni, 2007). Nevertheless, the large size and diversity of online textual resources present a considerable challenge to anyone who wants to explore these resources for linguistic purposes.…”

Section: Google Scholar As a Linguistic Search Engine

mentioning

confidence: 99%

Use of Google Scholar in corpus-driven EAP research

Březina

2012

Journal of English for Academic Purposes


“…Each of these may respond with different results for the same search query. In addition, web sites that update or change their web content also add to the instability of the retrieved data and sometimes it is impossible to replicate a linguistic experiment in an exact way at a later time (Lüdeling et al., 2005). However, the concordances and collocates which Google provides are reliable as far as DDL is concerned.…”

Section: Analysis and Comments

mentioning

confidence: 99%

“…Whatever the controversy over the web may be, early this century some corpus linguists' attention was attracted to the Internet and the search engine (Kilgarriff, 2001; Kilgarriff & Grefenstette, 2003; Lüdeling, Evert, & Baroni, 2005; Resnik, Elkiss, Lau, & Taylor, 2005; Resnik & Smith, 2003), as Johns (2002) discovers "the potential of the Web in defining and supporting a 'worldwide data-driven learning (DDL) community'", which may reach billions of English words in the web pages out there and accessible to everyone with an Internet connection, and save the trouble of gathering authentic texts in a machine-readable format as those who try to build a DIY corpus with WordSmith Tools 4.0 (corpus software developed by M. Scott, published by Oxford University Press, 2003). Strictly speaking, a search engine is not a corpus.…”

mentioning

confidence: 99%

Using Google as a super corpus to drive written language learning: a comparison with the British National Corpus

Guo-quan

2010

Computer Assisted Language Learning


“…Finally, it is important to acknowledge that the method introduced here appears to be one of the most successful applications of commercial search engines for the collection of linguistic data, a practice that has recently been criticized in the literature (Kilgarriff, 2006; Lüdeling, Evert & Baroni, 2006; Baroni and Kilgarriff, 2006; Fletcher, 2012). Among other issues, mining Google hit counts has been criticized on the grounds that register variation cannot be controlled, that webpages can be repeated and thus counted more than once, that the number of searches that can be made per day is limited, that webpages are not annotated for grammatical information, and that search engines count pages containing particular strings rather than the strings themselves.…”

Section: Discussion

mentioning

confidence: 99%

Site-Restricted Web Searches for Data Collection in Regional Dialectology

2013


This paper presents a new method for data collection in regional dialectology based on site-restricted web searches. The method allows for the values of many lexical alternation variables to be measured across a region of interest using common search engines such as Google or Bing. The method involves estimating the proportions of the variants of a lexical alternation variable over a series of cities by counting the number of webpages that contain these variants on newspaper websites originating from these cities through site-restricted web searches. The method is evaluated by mapping the 26 variants of 10 content word alternation variables with known distributions in American English. In almost all cases, the maps based on site-restricted web searches align closely with traditional dialect maps based on data gathered through questionnaires, demonstrating the accuracy of this method for the observation of regional linguistic variation. However, unlike collecting dialect data using traditional methods, which is a relatively slow process, the use of site-restricted web searches allows for dialect data to be collected from across a region as large as the United States in a matter of days.
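The proportion estimate described in the abstract can be illustrated with a minimal sketch. It assumes the page counts per variant and per city have already been collected by hand from site-restricted searches; the function name `variant_proportions`, the city keys, the variant labels, and all counts below are hypothetical placeholders, not data or code from the paper.

```python
from typing import Dict


def variant_proportions(hit_counts: Dict[str, int]) -> Dict[str, float]:
    """Convert page counts for the variants of one alternation variable
    into proportions that sum to 1 (or all zeros if nothing was found)."""
    total = sum(hit_counts.values())
    if total == 0:
        # No pages matched any variant, so no proportion can be estimated.
        return {variant: 0.0 for variant in hit_counts}
    return {variant: count / total for variant, count in hit_counts.items()}


# Hypothetical page counts for one two-variant variable, one entry per city.
# In the method described above, each count would be the number of pages
# returned by a search restricted to a newspaper website from that city.
city_counts = {
    "city_a": {"variant_a": 30, "variant_b": 10},
    "city_b": {"variant_a": 5, "variant_b": 15},
}

proportions = {
    city: variant_proportions(counts) for city, counts in city_counts.items()
}

for city, props in proportions.items():
    print(city, props)
```

Normalizing within each city keeps the comparison independent of how large each newspaper site is, which is why proportions rather than raw page counts would be mapped.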
