An evaluation to detect and correct erroneous characters wrongly substituted, deleted and inserted in Japanese and English sentences using Markov models
Abstract: In optical character recognition and continuous speech recognition of a natural language, it has been difficult to detect error characters which are wrongly deleted and inserted. In order to judge three types of the errors, which are characters wrongly substituted, deleted or inserted in a Japanese "bunsetsu" and an English word, and to correct these errors, this paper proposes new methods using m-th order Markov chain model for Japanese "kanji-kana" characters and English alphabets, assuming that Markov …
“…Up to now, the methods to detect and correct erroneous characters wrongly substituted, deleted, or inserted at the inner position in Japanese sentences using m th‐order Markov chain model for Japanese ‘kanji‐kana’ characters, have been known to be useful to detect and correct these erroneous characters [11–18]. For an example, the value of the second‐order Markov probability for each character of the erroneous chain $\Gamma^{(2)}_S$ or $\Gamma^{(2)}_I$ remains smaller value than the critical value T just four times.…”
Section: A New Methods Of Error Detection Using Cmcp and Smcp (mentioning)
confidence: 99%
“…In order to solve this problem, by using the relation between the types of errors and the length of a chain in which the values of Markov joint probability remain small, a new method has been proposed to judge the three types of the errors, which are characters wrongly substituted, deleted, or inserted in Japanese sentences and ‘bunsetsu’s; to find the locations and the lengths of these erroneous characters; and to correct these errors in Japanese ‘kanji‐kana’ chains using m th‐order Markov chain model [11–18].…”
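The idea in the two statements above, that an error shows up as a run of consecutive characters whose Markov probability stays below a critical value T, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy corpus, the additive smoothing (`alpha`, `vocab_size`), the threshold value, and the function names are hypothetical, and the run lengths this toy produces need not match the counts reported in the cited papers. The mapping from run length to error type (substitution, deletion, insertion) is defined in those papers and is not reproduced here.

```python
from collections import Counter


def train_trigram_counts(corpus):
    """Count character trigrams and their two-character contexts.

    A second-order Markov model predicts each character from the two
    characters before it, so trigram counts are all that is needed.
    """
    tri, bi = Counter(), Counter()
    for sentence in corpus:
        padded = "^^" + sentence  # '^' is an assumed start-of-text pad symbol
        for i in range(2, len(padded)):
            tri[padded[i - 2:i + 1]] += 1
            bi[padded[i - 2:i]] += 1
    return tri, bi


def cond_probs(text, tri, bi, alpha=0.01, vocab_size=30):
    """Return P(c_i | c_{i-2}, c_{i-1}) for every character of `text`.

    Additive smoothing (alpha, vocab_size) is an assumption of this
    sketch so that unseen trigrams get a small non-zero probability.
    """
    padded = "^^" + text
    probs = []
    for i in range(2, len(padded)):
        context = padded[i - 2:i]
        numerator = tri[padded[i - 2:i + 1]] + alpha
        denominator = bi[context] + alpha * vocab_size
        probs.append(numerator / denominator)
    return probs


def low_runs(probs, threshold):
    """Return (start_index, length) for each maximal run below threshold."""
    runs, start = [], None
    for i, p in enumerate(probs):
        if p < threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(probs) - start))
    return runs


if __name__ == "__main__":
    tri, bi = train_trigram_counts(["the cat sat on the mat"])
    T = 0.1  # assumed critical value
    for text in ["the cat sat on the mat", "the cxt sat on the mat"]:
        print(text, low_runs(cond_probs(text, tri, bi), T))
```

The location and length of each run are the raw signals; a real system would train on a large corpus and choose T empirically.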
“…An example of 2nd-order Markov chain models to skip one character, is shown in Fig.3. The precise definitions of the error types, the "Relevance Factor" P and the "Recall Factor" R are given in [2].…”
“…Araki et al [1] tried to correct not only substitution errors but also insertion and deletion errors in Japanese text using character trigram statistics. In insertion errors, some wrong characters are inserted into the original text, and in deletion errors, some characters are lost from the original text.…”
Section: Introduction (mentioning)
confidence: 99%
“…Instead, it has been shown that character n-gram statistics are effective in detecting and correcting erroneous Japanese text [1,7,8].…”
While the accuracy of current OCR systems is getting very high, they are still error-prone. In this paper, we clarify how many of the recognition errors in text can be corrected using linguistic information from on-line texts. We present an OCR error correction method which uses character trigram, stochastic morphological analysis and word trigram models. These models are trained on a large untagged text. The proposed method does not use any graphical information about characters. Therefore the method can be applied to any domain that has a large on-line text corpus. When our method is applied to text that includes random character substitutions, it raises the correct character rate from 90% to 94.3%, and from 95% to 96.9%.
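The accuracy figures quoted above are easier to compare when expressed as the share of remaining errors that the correction removes. The short sketch below does that arithmetic; the function name `relative_error_reduction` is hypothetical, and the metric itself is an interpretation added here rather than one stated in the abstract.

```python
def relative_error_reduction(before_pct, after_pct):
    """Fraction of character errors removed, given correct-rate percentages."""
    errors_before = 100.0 - before_pct
    errors_after = 100.0 - after_pct
    return (errors_before - errors_after) / errors_before


if __name__ == "__main__":
    # Figures taken from the abstract: 90% -> 94.3% and 95% -> 96.9%.
    for before, after in [(90.0, 94.3), (95.0, 96.9)]:
        reduction = relative_error_reduction(before, after)
        print(f"{before}% -> {after}%: {reduction:.0%} of errors removed")
```

On these figures, roughly 43% of the errors are removed at the 90% starting point and roughly 38% at the 95% starting point.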