Upcycle Your

Krishna, Amrith; Majumder, Bodhisattwa Prasad; Bhat, Rajesh Shreedhar; Goyal, Pawan

doi:10.18653/v1/k18-1034

Cited by 6 publications

(5 citation statements)

References 21 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…• COPY: This system is the base architecture with a copy mechanism as described in Section 5.2. The single-source variant of this model is used for OCR post-correction on Romanized Sanskrit in Krishna et al (2018).…”

Section: Methodsmentioning

confidence: 99%

“…There has been little work on lower-resourced languages. Kolak and Resnik (2005) present a probabilistic edit distance based post-correction model applied to Cebuano and Igbo, and Krishna et al (2018) show improvements on Romanized Sanksrit OCR by adding a copy mechanism to a neural sequence-to-sequence model.…”

Section: Related Workmentioning

confidence: 99%

“…Although OCR post-correction is relatively wellstudied, most existing methods rely on considerable resources in the target language, including a substantial amount of textual data to train a language model (Schnober et al, 2016;Dong and Smith, 2018;Rigaud et al, 2019) or to create synthetic data (Krishna et al, 2018). While readily available for high-resource languages, these resources are severely limited in endangered languages, preventing the direct application of existing post-correction methods in our setting.…”

Section: Introductionmentioning

confidence: 99%

“…AlthoughKrishna et al (2018) use BPE tokenization, preliminary experiments showed that character-level models result in much better performance on our dataset, likely due to the limited data available for training the BPE model.…”

mentioning

confidence: 99%

See 3 more Smart Citations

OCR Post Correction for Endangered Language Texts

Rijhwani¹,

Anastasopoulos²,

Neubig³

2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

View full text Add to dashboard Cite

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR postcorrection method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.

show abstract

Section: Methodsmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

mentioning

confidence: 99%

See 2 more Smart Citations

OCR Post Correction for Endangered Language Texts

Rijhwani¹,

Anastasopoulos²,

Neubig³

2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

View full text Add to dashboard Cite

show abstract

“…ITN is a monotone sequence transduction task where the input and output sequences typically have considerable lexical overlap and generally follow monotonicity in their alignments (Schnober et al, 2016;Krishna et al, 2018). Here, we formulate the task in three different setups.…”

Section: Itn Modelsmentioning

confidence: 99%

Scaling Neural ITN for Numbers and Temporal Expressions in Tamil: Findings for an Agglutinative Low-resource Language

Singhal,

Gopalan,

Krishna

et al. 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

View full text Add to dashboard Cite

Inverse Text Normalization (ITN) involves rewriting the verbalised form of text from spoken transcripts to its corresponding written form. The task inherently expects challenges in identifying ITN entries due to spelling variations in words arising out of dialects, transcription errors etc. Additionally, in Tamil, word boundaries between adjacent words in a sentence often get obscured due to Punarchi, i.e. phonetic transformation of these boundaries. Being morphologically rich, the words in Tamil show a high degree of agglutination due to inflection and clitics. The combination of such factors leads to a high degree of surfaceform variations, making scalability with pure rule-based approaches difficult. Instead, we experiment with fine-tuning three pre-trained neural LMs, consisting of a seq2seq model (s2s), a non-autoregressive text editor (NAR) and a sequence tagger + rules combination (tagger). While the tagger approach works best in a fully-supervised setting, s2s performs the best (98.05 F-Score) when augmented with additional data, via bootstrapping and data augmentation (DA&B). S2S reports a cumulative percentage improvement of 20.1 %, and statistically significant gains for all our models with DA&B. Compared to a fully supervised setup, bootstrapping alone reports a percentage improvement as high as 14.12 %, even with a small seed set of 324 ITN entries.

show abstract

Toward a Period-specific Optimized Neural Network for OCR Error Correction of Historical Hebrew Texts

Suissa

Zhitomirsky‐Geffet

Elmalech

2022

J. Comput. Cult. Herit.

View full text Add to dashboard Cite

Over the past few decades, large archives of paper-based historical documents, such as books and newspapers, have been digitized using the Optical Character Recognition (OCR) technology. Unfortunately, this broadly used technology is error-prone, especially when an OCRed document was written hundreds of years ago. Neural networks have shown great success in solving various text processing tasks, including OCR post-correction. The main disadvantage of using neural networks for historical corpora is the lack of sufficiently large training datasets they require to learn from, especially for morphologically-rich languages like Hebrew. Moreover, it is not clear what are the optimal structure and values of hyperparameters (predefined parameters) of neural networks for OCR error correction in Hebrew due to its unique features. Furthermore, languages change across genres and periods. These changes may affect the accuracy of OCR post-correction neural network models. To overcome these challenges, we developed a new multi-phase method for generating artificial training datasets with OCR errors and hyperparameters’ optimization for building an effective neural network for OCR post-correction in Hebrew. To evaluate the proposed approach, a series of experiments using several literary Hebrew corpora from various periods and genres were conducted. The obtained results demonstrate that (1) training a network on texts from a similar period dramatically improves the network's ability to fix OCR errors, (2) using the proposed error injection algorithm, based on character-level period-specific errors, minimizes the need for manually corrected data and improves the network accuracy by 9%, (3) the optimized network design improves the accuracy by 3% compared to the state-of-the-art network, and (4) the constructed optimized network outperforms neural machine translation models and industry-leading spellcheckers. The proposed methodology may have practical implications for digital humanities projects that aim to search and analyze OCRed documents in Hebrew and potentially other morphologically-rich languages.

show abstract

Upcycle Your

Cited by 6 publications

References 21 publications

OCR Post Correction for Endangered Language Texts

OCR Post Correction for Endangered Language Texts

Scaling Neural ITN for Numbers and Temporal Expressions in Tamil: Findings for an Agglutinative Low-resource Language

Toward a Period-specific Optimized Neural Network for OCR Error Correction of Historical Hebrew Texts

Contact Info

Product

Resources

About