Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.82

Spelling Error Correction with Soft-Masked BERT

Abstract: Spelling error correction is an important yet challenging task because a satisfactory solution of it essentially needs human-level language understanding ability. Without loss of generality we consider Chinese spelling error correction (CSC) in this paper. A state-of-the-art method for the task selects a character from a list of candidates for correction (including non-correction) at each position of the sentence on the basis of BERT, the language representation model. The accuracy of the method can be sub-optimal…
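The abstract is cut off before it reaches the paper's core mechanism: a soft-masking connection between an error-detection network and a BERT-based correction network. Below is a minimal PyTorch-style sketch of that connection, assuming the bidirectional-GRU detector and the mixing formula e'_i = p_i · e_mask + (1 − p_i) · e_i described in the paper; the class and attribute names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class SoftMaskedConnection(nn.Module):
    """Sketch of the soft-masking step in Soft-Masked BERT (Zhang et al., 2020).

    A detection network predicts an error probability p_i for each input
    character; the character embedding is then replaced by a soft mix of
    itself and the [MASK] embedding, e'_i = p_i * e_mask + (1 - p_i) * e_i,
    and the soft-masked sequence is fed to BERT for correction.
    """

    def __init__(self, embed_dim: int, mask_embedding: torch.Tensor):
        super().__init__()
        # The paper uses a bidirectional GRU as the detection network.
        self.detector = nn.GRU(embed_dim, embed_dim // 2,
                               batch_first=True, bidirectional=True)
        self.error_prob = nn.Linear(embed_dim, 1)
        # Embedding of BERT's [MASK] token, taken from its embedding table.
        self.register_buffer("mask_embedding", mask_embedding)

    def forward(self, embeddings: torch.Tensor):
        # embeddings: (batch, seq_len, embed_dim) character embeddings
        hidden, _ = self.detector(embeddings)
        p = torch.sigmoid(self.error_prob(hidden))   # (batch, seq_len, 1)
        # Soft mask: interpolate between the [MASK] embedding and the input.
        soft_masked = p * self.mask_embedding + (1.0 - p) * embeddings
        return soft_masked, p.squeeze(-1)
```

The correction network then runs BERT over the soft-masked embeddings and selects a character from the vocabulary at every position, with non-correction amounting to predicting the input character itself.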

Cited by 130 publications (104 citation statements). References: 14 publications.

“…In the experiment, our method outperforms Hong et al. (2019) by a large margin, which indicates the effectiveness of the globally optimized chunk-based decoding. Zhang et al. (2020) propose to train a detection network and a correction network jointly. In the experiment, although they employ 5 million pseudo-data sentences for extra pretraining, the proposed method still obtains improved performance at the correction level.…”
Section: Experiment Results on the CSC Datasets
Citation type: mentioning; confidence: 99%
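The joint training referred to above optimizes a weighted sum of the two networks' objectives. A hedged sketch, assuming the linear combination of a correction cross-entropy and a detection binary cross-entropy used by Zhang et al. (2020); the function name and the default weight are illustrative:

```python
import torch.nn.functional as F

def joint_loss(correction_logits, target_ids, error_probs, error_labels,
               lam=0.8):
    """L = lam * L_correction + (1 - lam) * L_detection.

    correction_logits: (batch, seq_len, vocab) scores from the BERT corrector
    target_ids:        (batch, seq_len) gold characters (LongTensor)
    error_probs:       (batch, seq_len) detector outputs in [0, 1]
    error_labels:      (batch, seq_len) 1.0 where the input character is wrong
    lam: tunable weight; 0.8 is an illustrative default, not a value
    quoted from the paper.
    """
    l_correction = F.cross_entropy(correction_logits.transpose(1, 2), target_ids)
    l_detection = F.binary_cross_entropy(error_probs, error_labels)
    return lam * l_correction + (1.0 - lam) * l_detection
```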
“…generate pseudo data by replacing characters in the training sentences with characters from the confusion set. Similarly, Zhang et al. (2020) generate homophonous pseudo data to pretrain the detection and correction networks jointly. Web texts exist in large quantities and contain more errors than published articles.…”
Section: Related Work
Citation type: mentioning; confidence: 99%
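The confusion-set replacement described in this statement is simple to sketch. The entries in CONFUSION_SET below are toy examples, and the 15% replacement rate is an assumption, not a value from the cited papers:

```python
import random

# Toy confusion set: each character maps to phonologically or visually
# similar characters it is commonly confused with (illustrative entries).
CONFUSION_SET = {
    "的": ["地", "得"],
    "在": ["再"],
    "做": ["作"],
}

def corrupt(sentence: str, rate: float = 0.15) -> str:
    """Generate a pseudo training example by randomly replacing characters
    with members of their confusion set; the original sentence serves as
    the correction target."""
    out = []
    for ch in sentence:
        if ch in CONFUSION_SET and random.random() < rate:
            out.append(random.choice(CONFUSION_SET[ch]))
        else:
            out.append(ch)
    return "".join(out)
```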
“…Traditional methods of CSC first detect misspelled characters and generate candidates via a language model, and then use a phonetic model or rules to filter out wrong candidates (Chang, 1995; Chen et al., 2013; Dong et al., 2016). To improve CSC performance, studies mainly focus on two issues: 1) how to improve the language model (Wu et al., 2010; Dong et al., 2016; Zhang et al., 2020) and 2) how to utilize external knowledge of phonological similarity (Jia et al., 2013; Yu and Li, 2014; Cheng et al., 2020). The language model is used to generate fluent sentences, and the phonetic features prevent the model from producing predictions whose pronunciation deviates from that of the original word.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
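The traditional pipeline summarized in the statement above (detect with a language model, generate candidates, filter phonetically) can be outlined as follows. Here score_with_lm and phonetic_similarity are hypothetical helpers standing in for an n-gram language model and a pinyin-based comparison, and the thresholds are placeholders:

```python
def correct_traditional(sentence, confusion_set, score_with_lm,
                        phonetic_similarity, lm_threshold=0.001,
                        phon_threshold=0.5):
    """Sketch of a detect-generate-filter CSC pipeline:
    1. flag characters the language model finds improbable in context,
    2. propose replacements from the confusion set,
    3. keep only phonetically close candidates that raise the LM score."""
    chars = list(sentence)
    for i, ch in enumerate(chars):
        base = score_with_lm(chars, i)
        if base >= lm_threshold:
            continue  # character looks fine in context; nothing detected
        best, best_score = ch, base
        for cand in confusion_set.get(ch, []):
            if phonetic_similarity(ch, cand) < phon_threshold:
                continue  # phonetic model / rules filter out this candidate
            trial = chars[:i] + [cand] + chars[i + 1:]
            score = score_with_lm(trial, i)
            if score > best_score:
                best, best_score = cand, score
        chars[i] = best
    return "".join(chars)
```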
“…These methods take phonetic information as external knowledge, but the discrete candidate selection prevents the language model from learning directly via backpropagation. Zhang et al. (2020) proposed an end-to-end CSC model by modifying the mask mechanism of BERT. However, they did not use any phonological information, which is important for modeling word similarity.…”
Section: Introduction
Citation type: mentioning; confidence: 99%