2020
DOI: 10.48550/arxiv.2005.07421
Preprint

Spelling Error Correction with Soft-Masked BERT

Abstract: Spelling error correction is an important yet challenging task because a satisfactory solution of it essentially needs human-level language understanding ability. Without loss of generality we consider Chinese spelling error correction (CSC) in this paper. A state-of-the-art method for the task selects a character from a list of candidates for correction (including non-correction) at each position of the sentence on the basis of BERT, the language representation model. The accuracy of the method can be sub-opti…

Cited by 11 publications (20 citation statements)
References 16 publications
“…In the end, a context-sensitive model is used to score all candidates and pick the best one. In [12], a Bi-GRU network is used to assign an error-occurrence probability to each input token. The vector representation of each token is then interpolated with a special mask-token representation according to those calculated probabilities.…”
Section: Pipeline Text-based Correction Methods
confidence: 99%
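The interpolation described in this citation statement can be sketched as follows. This is a minimal illustration, not the cited implementation: the function name and tensor shapes are assumptions, and the Bi-GRU detector is taken as given (only its per-token error probabilities appear here).

```python
import numpy as np

def soft_mask(token_embeds, error_probs, mask_embed):
    """Soft-masking sketch: blend each token embedding with the [MASK]
    embedding, weighted by the detector's error probability for that token.

    token_embeds: (seq_len, dim) input-token embeddings
    error_probs:  (seq_len,) per-token error probabilities from the detector
    mask_embed:   (dim,) embedding of the special mask token
    """
    p = error_probs[:, None]  # broadcast to (seq_len, 1)
    # p close to 1 -> token looks wrong, push it toward [MASK];
    # p close to 0 -> token looks right, keep its own embedding.
    return p * mask_embed[None, :] + (1.0 - p) * token_embeds
```

A token the detector is sure about (probability 0) keeps its own embedding unchanged, while a token flagged with probability 1 is fully replaced by the mask embedding; intermediate probabilities mix the two.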
“…CSC Data Augmentation: To compensate for the lack of labeled data, previous studies usually build additional pseudo data to increase performance. The mainstream method is based on the confusion set (Zhang et al., 2020); pseudo data constructed in this way is extensive in size but relatively low in quality because of the large gap from the true error distribution. Another, relatively high-quality, construction method is based on ASR or OCR (Wang et al., 2018).…”
Section: Related Work
confidence: 99%
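The confusion-set construction mentioned above can be sketched as a simple corruption procedure over clean sentences. The confusion set below is a two-entry toy example for illustration, not the one used in any cited work, and the replacement rate is an arbitrary assumption:

```python
import random

# Toy confusion set (assumption): each character maps to visually or
# phonetically similar characters it is commonly confused with.
CONFUSION = {"的": ["地", "得"], "在": ["再"]}

def corrupt(sentence, rate=0.15, rng=random):
    """Build a pseudo-errorful sentence from a clean one by replacing
    each character in the confusion set with one of its confusables
    with probability `rate`."""
    out = []
    for ch in sentence:
        if ch in CONFUSION and rng.random() < rate:
            out.append(rng.choice(CONFUSION[ch]))
        else:
            out.append(ch)
    return "".join(out)
```

Pairing each corrupted output with its clean source yields (error, correction) training pairs; as the statement notes, such data is plentiful but only approximates the true error distribution.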
“…Various GEC studies modified and utilized the mask mechanism of BERT [7] to detect errors and further correct them [8,9]. Asano [35] incorporated BERT to detect sentences with grammatical errors.…”
Section: Related Work
confidence: 99%
“…Prior GEC and GED studies have obtained outstanding achievements in this area. Most of them employed n-grams [2,3], confusion sets [4,5], or language models [6], including BERT [7][8][9], to diagnose errors.…”
Section: Introduction
confidence: 99%