A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check

Wang, Dingmin; Yan, Shuicheng; Li, Jing; Han, Jialong; Zhang, Haisong

doi:10.18653/v1/d18-1273

Cited by 93 publications

(92 citation statements)

References 26 publications

“…As shown in Table 6, FASPell achieves state-of-the-art F1 performance on both detection level and correction level. It is better in precision than the model by Wang et al (2018) and better in recall than the model by Zhang et al (2015). In comparison with Zhao et al (2017), it is better by any metric.…”

Section: Performance

mentioning

confidence: 85%

“…2. insufficiency in utilizing character similarity. Since a cut-off threshold of quantified character similarity (Liu et al, 2010; Wang et al, 2018) is used to produce the confusion set, similar characters are actually treated indiscriminately in terms of their similarity. This means the information of character similarity is not sufficiently utilized.…”

Section: Related Work and Bottlenecks

mentioning

confidence: 99%
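
The point made in the statement above can be illustrated with a minimal sketch. It is not code from any of the cited papers: the similarity scores, the threshold value, and the function names `confusion_set` and `ranked_candidates` are all hypothetical. It only shows that a cut-off turns graded similarity into flat set membership, so that a very similar character and a barely similar one are treated alike.

```python
# Hypothetical similarity scores between the character "己" and some candidates.
# The values are illustrative assumptions, not taken from any cited paper.
similarity = {"已": 0.92, "巳": 0.88, "乙": 0.61, "记": 0.40}


def confusion_set(scores, threshold):
    """Keep every candidate whose score reaches the cut-off; the scores themselves are discarded."""
    return {char for char, score in scores.items() if score >= threshold}


def ranked_candidates(scores):
    """Alternative view: keep the graded scores so that closer characters rank higher."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


members = confusion_set(similarity, threshold=0.6)
ranking = ranked_candidates(similarity)
```

Here `members` contains "已" and "乙" on equal footing even though their scores differ widely, while `ranking` preserves that difference.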

“…Since Chinese spell checking data require tedious professional manual work, they have always been underresourced. To prevent the filter from overfitting, Wang et al (2018) propose an automatic method to generate pseudo spell checking data. However, the precision of their spell checking model ceases to improve when the generated data reaches 40k sentences.…”

Section: Related Work and Bottlenecks

mentioning

confidence: 99%

FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm

Hong, Yu, He et al. 2019

Proceedings of the 5th Workshop on Noisy User-Generated Text (W-Nut 2019)

We propose a Chinese spell checker, FASPell, based on a new paradigm which consists of a denoising autoencoder (DAE) and a decoder. In comparison with previous state-of-the-art models, the new paradigm allows our spell checker to be Faster in computation, readily Adaptable to both simplified and traditional Chinese texts produced by either humans or machines, and to require much Simpler structure to be as much Powerful in both error detection and correction. These four achievements are made possible because the new paradigm circumvents two bottlenecks. First, the DAE curtails the amount of Chinese spell checking data needed for supervised learning (to <10k sentences) by leveraging the power of unsupervisedly pre-trained masked language model as in BERT, XLNet, MASS etc. Second, the decoder helps to eliminate the use of confusion set that is deficient in flexibility and sufficiency of utilizing the salient feature of Chinese character similarity.

“…Hsieh et al (2015) propose to extract spelling error samples from the Google web 1T corpus. Wang et al (2018) propose the OCR-based and ASR-based methods to mimic human errors. They further proposed a pointer network to model the CSC task under the framework of a seq2seq model.…”

Section: Related Work

mentioning

confidence: 99%

“…The results with ‡ are reproduced by rerunning the released code and evaluation scripts on the standard CSC datasets. TheWang et al (2018) and calculate the performance on the character-level, which makes their results incomparable with other works.…”

mentioning

confidence: 92%

Chunk-based Chinese Spelling Check with Global Optimization

Bao, Li, Wang 2020

Findings of the Association for Computational Linguistics: EMNLP 2020

Chinese spelling check is a challenging task due to the characteristics of the Chinese language, such as the large character set, no word boundary, and short word length. On the one hand, most of the previous works only consider corrections with similar character pronunciation or shape, failing to correct visually and phonologically irrelevant typos. On the other hand, pipeline-style architectures are widely adopted to deal with different types of spelling errors in individual modules, which is difficult to optimize. In order to handle these issues, in this work, 1) we extend the traditional confusion sets with semantical candidates to cover different types of errors; 2) we propose a chunk-based framework to correct single-character and multi-character word errors uniformly; and 3) we adopt a global optimization strategy to enable a sentence-level correction selection. The experimental results show that the proposed approach achieves a new state-of-the-art performance on three benchmark datasets, as well as an optical character recognition dataset.
