Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.81

SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check

Abstract: Chinese Spelling Check (CSC) is the task of detecting and correcting spelling errors in Chinese natural language. Existing methods have attempted to incorporate similarity knowledge between Chinese characters, but they treat this knowledge either as an external input resource or as heuristic rules. This paper proposes to incorporate phonological and visual similarity knowledge into language models for CSC via a specialized graph convolutional network (SpellGCN). The model builds a graph over th…
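
To make the abstract's graph-convolution idea concrete, here is a minimal sketch of one GCN layer propagating character embeddings over a similarity graph. This is an illustrative assumption, not the authors' released implementation: the toy vocabulary size, tensor names, and random weights are all made up.

import torch

num_chars, dim = 5, 8                    # toy vocabulary of 5 characters
H = torch.randn(num_chars, dim)          # character embeddings (e.g. from a language model)
A = torch.eye(num_chars)                 # similarity adjacency with self-loops
A[0, 1] = A[1, 0] = 1.0                  # characters 0 and 1 are confusable

# Symmetric normalization: D^{-1/2} A D^{-1/2}
deg = A.sum(dim=1)
A_hat = torch.diag(deg.pow(-0.5)) @ A @ torch.diag(deg.pow(-0.5))

W = torch.randn(dim, dim)                # learnable layer weight
H_next = torch.relu(A_hat @ H @ W)       # one graph-convolution step
print(H_next.shape)                      # torch.Size([5, 8])

In the paper's setup, representations propagated this way are combined with the language model's output layer so that confusable characters share information when predicting corrections.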

Cited by 78 publications (95 citation statements). References 15 publications.
“…In the experiment, although they employ 5 million pseudo data for extra pre-training, the proposed method still obtains improved performance at the correction level. Cheng et al. (2020) propose to incorporate phonological and visual confusion sets into CSC models through a graph convolutional network. As the performance reported in their paper is obtained with external training data, we reproduced their results on the standard CSC datasets by rerunning their released code and evaluation scripts.…”
Section: Experiment Results on the CSC Datasets
confidence: 99%
“…Zhao et al. (2017) use conditional random fields (CRFs) to handle two types of misspelled single-character words. Cheng et al. (2020) propose to incorporate phonological and visual similarity knowledge into CSC models via a graph convolutional network.…”
Section: Related Work
confidence: 99%
“…For the pre-training corpus, we collect a variety of data, such as encyclopedia articles, news, scientific papers, and movie subtitles from a search engine. The CSC training data used in our experiments is the same as in Wang et al. (2019) and Cheng et al. (2020), including three human-annotated training datasets (Tseng et al., 2015) and an automatically generated dataset built with a previously proposed approach. Table 1 reports performance on the SIGHAN13, SIGHAN14, and SIGHAN15 test sets. Soft-Masked BERT* is our reproduction of Soft-Masked BERT using the same training data as in our method, whereas the original Soft-Masked BERT was trained on an in-house dataset containing 5 million sentences and their counterparts with automatically generated errors, as reported in Zhang et al. (2020), where the authors only provided results on SIGHAN15.…”
Section: Data Processing
confidence: 99%
“…• SpellGCN (Cheng et al., 2020) incorporates two similarity graphs into a pre-trained sequence-labeling model via a graph convolutional network. The two graphs are derived from a confusion set and correspond to pronunciation and shape similarities.…”
Section: Model Settings
confidence: 99%
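
As a rough illustration of how a similarity graph could be derived from a confusion set, as the excerpt above describes, the sketch below builds a symmetric adjacency matrix from a hypothetical, made-up confusion dictionary. The actual confusion resource and the authors' construction details may differ.

import numpy as np

# Hypothetical confusion set (made up for illustration):
# character -> characters it is commonly confused with.
confusions = {"他": ["她", "它"], "往": ["住"]}

vocab = sorted({c for k, v in confusions.items() for c in [k, *v]})
idx = {c: i for i, c in enumerate(vocab)}

A = np.eye(len(vocab))                   # self-loops keep each character's own signal
for src, sims in confusions.items():
    for tgt in sims:
        A[idx[src], idx[tgt]] = A[idx[tgt], idx[src]] = 1.0

print(vocab)                             # ['他', '住', '她', '它', '往']
print(A)

A matrix like A (normalized, one graph per similarity type) is what a graph convolutional layer would consume, as in the earlier sketch.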