Truecasing German user-generated conversational text

Grishina, Yulia; Gueudré, Thomas; Winkler, Ralf

doi:10.18653/v1/2020.wnut-1.19

Cited by 4 publications

(4 citation statements)

References 10 publications


“…[11] first introduced character-based LSTM for this task and completely solved the mixed case word problem. Recently, [2] compared character-based n-gram (n up to 15) language models with the character LSTM of [11]. [12] advanced the state of the art with a character-based CNN-LSTM-CRF model.…”

Section: Related Work (mentioning)

confidence: 99%

“…The vast amount of online text powers language models for speech recognition, typing suggestions and many other language generation tasks. However user-generated texts, especially those from mobile applications such as Twitter Tweets [1], often violate the grammatical rules of casing in English and other western languages [2]. The process of restoring the proper case, often known as tRuEcasIng [3], provides a factorized solution with a dedicated model for case normalization.…”

Section: Introduction (mentioning)

confidence: 99%


Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model

Zhang, Cheng, Kumar et al. 2022. Preprint.


Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model. We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling. A case-aware language model trained on this normalized text achieves the same perplexity as a model trained on text with gold capitalization. In a real user A/B experiment, we demonstrate that the improvement translates to reduced prediction error rates in a virtual keyboard application. Similarly, in an ASR language model fusion experiment, we show reduction in uppercase character error rate and word error rate.




“…Susanto et al (2016) first introduced character-based LSTM for this task and completely solved the mixed case word problem. Recently, Grishina et al (2020) compared character-based n-gram (n up to 15) language models with the character LSTM of Susanto et al (2016). Ramena et al (2020) advanced the state of the art with a character-based CNN-LSTM-CRF model which introduced local output label dependencies.…”

Section: Related Work (mentioning)

confidence: 99%

“…Automatically generated texts such as speech recognition (ASR) transcripts as well as user-generated texts from mobile applications such as Twitter Tweets (Nebhi et al, 2015) often violate the grammatical rules of casing in English and other western languages (Grishina et al, 2020). The process of restoring the proper case, often known as tRuEcasIng (Lita et al, 2003), is not only important for the ease of consumption by end-users (e.g.…”

Section: Introduction (mentioning)

confidence: 99%

Position-Invariant Truecasing with a Word-and-Character Hierarchical Recurrent Neural Network

Zhang, Cheng, Kumar et al. 2021. Preprint.


Truecasing is the task of restoring the correct case (uppercase or lowercase) of noisy text generated either by an automatic system for speech recognition or machine translation or by humans. It improves the performance of downstream NLP tasks such as named entity recognition and language modeling. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model, the first of its kind for this problem. Using sequence distillation, we also address the problem of truecasing while ignoring token positions in the sentence, i.e. in a position-invariant manner.
