State-of-the-Art in Weighted Finite-State Spell-Checking

Pirinen, Tommi A.; Lindén, Krister

doi:10.1007/978-3-642-54903-8_43

Cited by 24 publications

(24 citation statements)

References 15 publications


“…Traditional spelling correction techniques rely on the fact that most spelling errors are within a short edit-distance of their correct form Kukich (1992); Max & Wisniewski (2010). That is why spelling correction needs special treatment in case of MRLs, which by nature consist of longer words resulting in errors in longer edit distances and mostly due to the wrong spellings outside the word lemma (within the affixes) Ingason et al (2009); Pirinen & Lindén (2014); Pirinen et al (2010). In case of languages having rather shorter words than MRLs, the complete omission of diacritics and vowels would not be a severe problem and could be jointly solved with spelling correction.…”

Section: The Proposed Architecture (mentioning)

confidence: 99%

“…Figure 5 gives the general flow of the employed system. SC#4 is inspired by Linden and Pirinen 2014, in that it uses a language and an error model together in order to generate candidates. Candidates which are generated by the error model are validated using the language model and the best proposal is the candidate with minimum rule cost and maximum unigram probability.…”

Section: The Proposed Architecture (mentioning)

confidence: 99%
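The candidate generation and ranking described in the statement above can be illustrated with a minimal Python sketch. It is not the cited system, and it uses a plain dictionary lookup rather than the weighted finite-state transducers of Pirinen and Lindén. The toy vocabulary, the uniform edit cost of 1.0, and the names `edits1` and `suggest` are all assumptions made for illustration.

```python
import math

# Hypothetical toy unigram counts standing in for a real language model.
UNIGRAM_COUNTS = {"cat": 50, "cart": 10, "coat": 20, "at": 80}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Toy error model: map each string one edit away from `word` to a cost.

    Every edit costs 1.0 here (an assumption); a real error model would
    assign different costs to different rules.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return {c: 1.0 for c in deletes + swaps + replaces + inserts if c != word}


def suggest(word, counts=UNIGRAM_COUNTS):
    """Return validated candidates, best first.

    Candidates come from the error model, are kept only if the language
    model knows them, and are ordered by lowest edit cost, then by highest
    unigram probability (lowest negative log probability).
    """
    if word in counts:
        return [word]  # already a known word, nothing to correct
    total = sum(counts.values())
    scored = []
    for cand, cost in edits1(word).items():
        if cand in counts:
            lm_weight = -math.log(counts[cand] / total)
            scored.append((cost, lm_weight, cand))
    scored.sort()
    return [cand for _, _, cand in scored]
```

For example, `suggest("caat")` generates "cat", "coat" and "cart" from the error model and orders them by unigram frequency, since all three have the same edit cost in this sketch.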


Social media text normalization for Turkish

Eryiğit

Torunoğlu-Selamet

2017

Nat. Lang. Eng.


Text normalization is an indispensable stage in processing noncanonical language from natural sources, such as speech, social media or short text messages. Research in this field is very recent and mostly on English. As is known from different areas of natural language processing, morphologically rich languages (MRLs) pose many different challenges when compared to English. Turkish is a strong representative of MRLs and has particular normalization problems that may not be easily solved by a single-stage pure statistical model. This article introduces the first work on the social media text normalization of an MRL and presents the first complete social media text normalization system for Turkish. The article conducts an in-depth analysis of the error types encountered in Web 2.0 Turkish texts, categorizes them into seven groups and provides solutions for each of them by dividing the candidate generation task into separate modules working in a cascaded architecture. For the first time in the literature, two manually normalized Web 2.0 datasets are introduced for Turkish normalization studies. The exact match scores of the overall system on the provided datasets are 70.40 per cent and 67.37 per cent (77.07 per cent with a case insensitive evaluation).




“…Recently, there has been a surge of interest in solving the spelling error correction problem via the web (e.g., Whitelaw et al, 2009; Sun et al, 2010) and to correct query strings for search engines (e.g., Duan and Hsu, 2011, and many others). Further approaches to spelling correction include finite state techniques (e.g., Pirinen and Lindén, 2014) and deep graphical models (e.g., Raaijmakers, 2013). Kukich (1992) summarizes many of the earlier approaches to spell checking such as based on trie-based edit distances.…”

Section: Related Work (mentioning)

confidence: 99%

A Comparison of Four Character-Level String-to-String Translation Models for (OCR) Spelling Error Correction

Eger

Brück

Mehler

2016

The Prague Bulletin of Mathematical Linguistics


We consider the isolated spelling error correction problem as a specific subproblem of the more general string-to-string translation problem. In this context, we investigate four general string-to-string transformation models that have been suggested in recent years and apply them within the spelling error correction paradigm. In particular, we investigate how a simple ‘k-best decoding plus dictionary lookup’ strategy performs in this context and find that such an approach can significantly outdo baselines such as edit distance, weighted edit distance, and the noisy channel Brill and Moore model to spelling error correction. We also consider elementary combination techniques for our models such as language model weighted majority voting and center string combination. Finally, we consider real-world OCR post-correction for a dataset sampled from medieval Latin texts.


“…This means that the language experts can collect and curate data, while the engineers improve and add NLP systems, and when a new or improved system for a specific NLP application is finalised, it can be applied to all languages providing language data in the infrastructure. In practice for example, this has in past meant, that when new research was published making weighted finite-state spell-checking and correction end-user usable [9], all languages in the infrastructure could have an additional (albeit basic) spell-checker and corrector. Both in GiellaLT infra and Apertium system this is implemented at low level by simply applying the necessary changes to all of the language repositories.…”

Section: Infrastructures and Resources (mentioning)

confidence: 99%