Applications of n‐grams in textual information systems

Robertson, Anne M.; Willett, Peter

doi:10.1108/eum0000000007161

Cited by 70 publications

(40 citation statements)

References 90 publications


“…We can consider two bases for the characterisation and manipulation of text (Robertson and Willett, 1998): on the one hand, the individual characters that form the basis for the byte-level operations available to computers, and on the other, the individual words that are used by people, in this work represented by the spelling correction approaches previously discussed. These basic units can then be assembled into larger text segments such as sentences, paragraphs, etc.…”

Section: The N-gram Based Approach (mentioning)

confidence: 99%

“…Formally, an n-gram is a sub-sequence of n characters from a given word (Robertson and Willett, 1998). So, for example, we can split the word "potato" into four overlapping character 3-grams: -pot-, -ota-, -tat- and -ato-.…”

Section: The N-gram Based Approach (mentioning)

confidence: 99%
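
The decomposition described in the statement above can be illustrated with a minimal Python sketch. The function name `char_ngrams` is hypothetical and is not taken from the cited papers; the sketch uses no padding characters, matching the "potato" example as quoted.

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of `word`, left to right."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # A word of length m yields m - n + 1 n-grams (none if m < n).
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams("potato", 3))  # ['pot', 'ota', 'tat', 'ato']
```

A word shorter than n yields an empty list here; some systems pad words with boundary markers instead, which this sketch does not attempt.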

“…The second strategy is to consider a technique based on the use of character n-grams (McNamee and Mayfield, 2004a; Robertson and Willett, 1998). This technique is applicable to the case of isolated-word error correction and is independent of the extent of linguistic knowledge.…”

Section: Introduction (mentioning)

confidence: 99%


Managing misspelled queries in IR applications

Vilares

Otero

2011

Information Processing & Management



“…In n-gram matching words are decomposed into n-grams, i.e., into substrings of length n (Pfeifer et al, 1996; Pirkola et al, 2002; Robertson and Willett, 1998; Salton, 1989). N-gram matching has been reported to be an effective technique among various approximate matching techniques in name searching (Pfeifer et al, 1996; Zobel and Dart, 1995) and cross-lingual spelling variant matching and is an appropriate fuzzy matching technique for use with TRT.…”

Section: N-gram Matching (mentioning)

confidence: 99%

“…Approximate matching techniques involve Soundex and Phonix, which compare words on the basis of their phonetic similarity (Gadd, 1990), edit distance (Zobel and Dart, 1996), and n-gram based matching (Robertson and Willett, 1998). In n-gram matching text strings are decomposed into n-grams, i.e., substrings of length n, which usually consist of the adjacent characters of the text strings.…”

Section: Introduction (mentioning)

confidence: 99%
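
The n-gram matching described in the two statements above can be sketched as follows. The statements do not say which similarity measure is used, so the Dice coefficient over sets of adjacent-character bigrams is an assumption made here for illustration; the names `ngram_set`, `dice_similarity` and `rank_candidates` are likewise hypothetical.

```python
def ngram_set(text: str, n: int = 2) -> set[str]:
    """Set of adjacent-character n-grams of `text` (no padding)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient between the n-gram sets of two strings.

    Assumption: Dice is one common choice of measure; the quoted
    passages do not specify the measure actually used.
    """
    grams_a, grams_b = ngram_set(a, n), ngram_set(b, n)
    if not grams_a and not grams_b:
        # Both strings are shorter than n; fall back to exact comparison.
        return 1.0 if a == b else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def rank_candidates(query: str, vocabulary: list[str], n: int = 2) -> list[str]:
    """Sort index words by decreasing n-gram similarity to the query."""
    return sorted(vocabulary, key=lambda w: dice_similarity(query, w, n), reverse=True)


print(rank_candidates("potato", ["tomato", "pot", "potatoe"]))
```

With bigrams, the misspelling "potatoe" shares five of its six bigrams with "potato" and therefore ranks first, ahead of "pot" and "tomato".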

Translating cross-lingual spelling variants using transformation rules

Toivonen

Pirkola

Keskustalo

et al. 2005

Information Processing & Management


Information retrieval from historical newspaper collections in highly inflectional languages: A query expansion approach

Järvelin

Keskustalo

Sormunen

et al. 2015

Asso for Info Science & Tech


The aim of the study was to test whether query expansion by approximate string matching methods is beneficial in retrieval from historical newspaper collections in a language rich with compounds and inflectional forms (Finnish). First, approximate string matching methods were used to generate lists of index words most similar to contemporary query terms in a digitized newspaper collection from the 1800s. Top index word variants were categorized to estimate the appropriate query expansion ranges in the retrieval test. Second, the effectiveness of approximate string matching methods, automatically generated inflectional forms, and their combinations were measured in a Cranfield-style test. Finally, a detailed topic-level analysis of test results was conducted. In the index of historical newspaper collection the occurrences of a word typically spread to many linguistic and historical variants along with optical character recognition (OCR) errors. All query expansion methods improved the baseline results. Extensive expansion of around 30 variants for each query word was required to achieve the highest performance improvement. Query expansion based on approximate string matching was superior to using the inflectional forms of the query words, showing that coverage of the different types of variation is more important than precision in handling one type of variation.

Introduction

Digitization is a good way to preserve cultural heritage documents and make them widely accessible for researchers and the general public. Cultural institutions are aware of this potential and often consider digitization of their cultural heritage collections as an obligation. Consequently, the quantity of digitized historical documents available is constantly growing.
Transforming print cultural heritage collections into digital resources accessible and searchable through modern information and communication technologies requires that the digitized document images are transformed into digital text through optical character recognition (OCR). While OCR can currently reach over 99% accuracy in recognition of characters from high-quality images of original documents with a simple book layout, the accuracy for historical newspapers is lower than that. OCR quality is dependent on the environment and the condition of the original documents: print and paper quality, typefaces, and layout complexity affect the accuracy of the result. Generally, the older the newspaper is, the lower the accuracy rate is likely to be. Holley (2009) reported raw character recognition accuracy rates varying from 71% to 98% in a sample of digitized newspapers from 1803-1954, the lowest rate indicating almost every third character being erroneously recognized and virtually all words containing errors. Even a 98% accuracy rate results in an error in, on average, every sixth word in Finnish text (with an average word length of around eight characters), if the errors are evenly distributed. Such error rates may lead to a quadrupling of the number of unique index words and sign...
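
The word-level figures quoted in the passage above follow from simple arithmetic, sketched below in Python. The sketch assumes that character errors are independent and occur at a uniform rate, which is one reading of "evenly distributed"; the function name `word_error_probability` is hypothetical.

```python
def word_error_probability(char_accuracy: float, word_length: int) -> float:
    """Probability that a word contains at least one misrecognized character.

    Assumption: each character is recognized correctly with probability
    `char_accuracy`, independently of the others.
    """
    return 1.0 - char_accuracy ** word_length


# Accuracy rates reported by Holley (2009), applied to 8-character words.
for accuracy in (0.98, 0.71):
    p = word_error_probability(accuracy, 8)
    print(f"character accuracy {accuracy:.0%}: {p:.1%} of 8-character words affected")
```

Under these assumptions, 98% character accuracy leaves roughly 15% of eight-character words with an error, in line with the "every sixth word" estimate, while 71% accuracy affects over 93% of words, in line with "virtually all words containing errors".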

