Corpus Linguistics and the Web 2007
DOI: 10.1163/9789401203791_003

Using web data for linguistic purposes

Abstract: The world wide web is a mine of language data of unprecedented richness and ease of access (Kilgarriff and Grefenstette 2003). A growing body of studies has shown that simple algorithms using web-based evidence are successful at many linguistic tasks, often outperforming sophisticated methods based on smaller but more controlled data sources (cf. Turney 2001; Keller and Lapata 2003). Most current internet-based linguistic studies access the web through a commercial search engine. For example, some researchers r…

Cited by 49 publications (15 citation statements)
References 14 publications
“…Bergh, Seppänen, & Trotta, 1998; Fletcher, 2007; Kilgarriff & Grefenstette, 2003; Lüdeling, Evert, & Baroni, 2007). Nevertheless, the large size and diversity of online textual resources present a considerable challenge to anyone who wants to explore these resources for linguistic purposes.…”
Section: Google Scholar As a Linguistic Search Engine
mentioning
confidence: 99%
“…Each of these may respond with different results for the same search query. In addition, web sites that update or change their web content also add to the instability of the retrieved data and sometimes it is impossible to replicate a linguistic experiment in an exact way at a later time (Lüdeling et al, 2005). However, the concordances and collocates which Google provides are reliable as far as DDL is concerned.…”
Section: Analysis and Comments
mentioning
confidence: 99%
“…Whatever the controversy over the web may be, early this century some corpus linguists' attention was attracted to the Internet and the search engine (Kilgarriff, 2001; Kilgarriff & Grefenstette, 2003; Lüdeling, Evert, & Baroni, 2005; Resnik, Elkiss, Lau, & Taylor, 2005; Resnik & Smith, 2003), as Johns (2002) discovers "the potential of the Web in defining and supporting a 'worldwide data-driven learning (DDL) community'", which may reach billions of English words in the web pages out there and accessible to everyone with an Internet connection, and save the trouble of gathering authentic texts in a machine-readable format as those who try to build a DIY corpus with WordSmith Tools 4.0 (corpus software developed by M. Scott, published by Oxford University Press, 2003). Strictly speaking, a search engine is not a corpus.…”
mentioning
confidence: 99%
“…Finally, it is important to acknowledge that the method introduced here appears to be one of the most successful applications of commercial search engines for the collection of linguistic data, a practice that has recently been criticized in the literature (Kilgarriff, 2006; Lüdeling, Evert & Baroni, 2006; Baroni and Kilgarriff, 2006; Fletcher, 2012). Among other issues, mining Google hit counts has been criticized on the grounds that register variation cannot be controlled, that webpages can be repeated and thus counted more than once, that the number of searches that can be made per day is limited, that webpages are not annotated for grammatical information, and that search engines count pages containing particular strings rather than the strings themselves.…”
Section: Discussion
mentioning
confidence: 99%