2006
DOI: 10.1162/coli.2006.32.3.295

Orthographic Errors in Web Pages: Toward Cleaner Web Corpora

Abstract: Since the Web by far represents the largest public repository of natural language texts, recent experiments, methods, and tools in the area of corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, the problem has to be faced that a non-negligible number of orthographic and grammatical errors occur in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a by-product, methods are developed for efficien…

citations
Cited by 27 publications
(23 citation statements)
references
References 16 publications
“…Testing on lists containing only erroneous word forms, they could not perform an evaluation in terms of precision, although this is not mentioned. [16] apply the system to classify English and German webpages according to the level of lexical errors they contain. The paper discusses in full how the error dictionaries that are used for this purpose are built.…”
Section: Large Dictionary Word Variant Retrieval: Prior Work (mentioning)
confidence: 99%
“…Tavosanis (2007) looks at classifying different types of spelling errors, particularly in blogs, Varnhagen et al (2009) performs a detailed categorisation of spellings in Instant Messaging, Myslin & Gries (2010) carry out an exploratory and descriptive study of Spanish internet orthography, and Driscoll (2002) in her study of "Gamer chat" notes a series of such features, including shortenings, acronyms, alternative spellings and new meanings for standard words. There has also been some research into normalising spelling in CMC data (Clark 2003, Ringlstetter et al 2006), and specifically for SMS (Aw et al 2006, Choudhury et al 2007, Kobus et al 2008, Acharyya et al 2009, Cook & Stevenson 2009, Yvon 2010, Beaufort et al 2010), chat (Wong et al 2006, Wong et al 2008), emails (Sproat et al 2001, Agarwal et al 2007) and newsgroups (Agarwal et al 2007, Zhu et al 2007). Furthermore, recent NLP and Information Retrieval workshops have focussed on research in the area of CMC language (Karlgren 2006, Lopresti et al 2008).…”
Section: Computer Mediated Communication (CMC) (mentioning)
confidence: 99%
“…Working in this mainstream Finite State Automata (FSA) paradigm, mainly on German corpora, the group around professor Schultz at the University of Munich work on large-scale corpus clean-up. In [11], the focus is on post-correction of OCR-ed corpora, while in [12], on the cleaning of web-derived corpora. The paper describes in detail how the typical error types that are observed in collections of typographical, spelling and OCR errors are modeled.…”
Section: Related Work (mentioning)
confidence: 99%