A Multiplexed Network for End-to-End, Multilingual OCR

Huang, Jing; Pang, Guangchang; Kovvuri, Rama; Toh, Mandy; Liang, Kevin J; Krishnan, Praveen; Yin, Xi; Hassner, Tal

doi:10.1109/cvpr46437.2021.00452

Cited by 40 publications

(14 citation statements)

References 60 publications

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…In this section, we experimentally validate our proposed TANGER by comparing the performance with the state-of-theart methods on several public datasets as well as one newly collected multilingual dataset TsiText. First, we examine the performance of TANGER for multilingual scene text recognition in comparison with two end-to-end methods [10,22] and one dictionary-guided method [34]. Then, we compare our model with the vision transformer ViTSTR [5] in three variants, i.e., tiny, small, and base versions for monolingual scene text recognition.…”

Section: Resultsmentioning

confidence: 99%

“…Busta et al [10] design probably the first method, called E2E-MLT, for multilingual scene text recognition on the basis of a single fully convolutional network, which is demonstrated to be competitive even on rotated and vertical text instances. An E2E approach to script identification with different recognition heads, called Multiplexed Multilingual Mask TextSpotter, was proposed in [22], which can support the removal of existing or inclusion of new languages.…”

Section: A Scene Text Recognitionmentioning

confidence: 99%

See 1 more Smart Citation

Augmented Transformers with Adaptive n-grams Embedding for Multilingual Scene Text Recognition

Yan¹,

Fang²,

Jin³

2023

Preprint

View full text Add to dashboard Cite

While vision transformers have been highly successful in improving the performance in image-based tasks, not much work has been reported on applying transformers to multilingual scene text recognition due to the complexities in the visual appearance of multilingual texts. To fill the gap, this paper proposes an augmented transformer architecture with ngrams embedding and cross-language rectification (TANGER). TANGER consists of a primary transformer with single patch embeddings of visual images, and a supplementary transformer with adaptive n-grams embeddings that aims to flexibly explore the potential correlations between neighbouring visual patches, which is essential for feature extraction from multilingual scene texts. Cross-language rectification is achieved with a loss function that takes into account both language identification and contextual coherence scoring. Extensive comparative studies are conducted on four widely used benchmark datasets as well as a new multilingual scene text dataset containing Indonesian, English, and Chinese collected from tourism scenes in Indonesia. Our experimental results demonstrate that TANGER is considerably better compared to the state-of-the-art, especially in handling complex multilingual scene texts.

show abstract

Section: Resultsmentioning

confidence: 99%

Section: A Scene Text Recognitionmentioning

confidence: 99%

Augmented Transformers with Adaptive n-grams Embedding for Multilingual Scene Text Recognition

Yan¹,

Fang²,

Jin³

2023

Preprint

View full text Add to dashboard Cite

show abstract

“…Our method aims at use cases involving creative self expression and augmented reality (e.g., photo-realistic translation, leveraging multi-lingual OCR technologies [59]). Our method can be used for data generation and augmentation for training future OCR systems, as successfully done by others [49], [60] and in other domains [61], [62].…”

Section: Discussionmentioning

confidence: 99%

TextStyleBrush: Transfer of Text Aesthetics from a Single Example

Krishnan¹,

Kovvuri²,

Pang³

et al. 2021

Preprint

Self Cite

View full text Add to dashboard Cite

We present a novel approach for disentangling the content of a text image from all aspects of its appearance. The appearance representation we derive can then be applied to new content, for one-shot transfer of the source style to new content. We learn this disentanglement in a self-supervised manner. Our method processes entire word boxes, without requiring segmentation of text from background, per-character processing, or making assumptions on string lengths. We show results in different text domains which were previously handled by specialized methods, e.g., scene text, handwritten text. To these ends, we make a number of technical contributions: (1) We disentangle the style and content of a textual image into a non-parametric, fixed-dimensional vector. (2) We propose a novel approach inspired by StyleGAN but conditioned over the example style at different resolution and content. (3) We present novel self-supervised training criteria which preserve both source style and target content using a pre-trained font classifier and text recognizer. Finally, (4) we also introduce Imgur5K, a new challenging dataset for handwritten word images. We offer numerous qualitative photo-realistic results of our method. We further show that our method surpasses previous work in quantitative tests on scene text and handwriting datasets, as well as in a user study.

show abstract

“…Also, seq2seq was used in aspect term extraction, where the source sequence and target sequence is composed of words and labels ,respectively Ma et al (2019). Huang et al Huang et al (2019) proposed an end-to-end approach which can recognize multiple languages in images considering data imbalance between languages. Lewis et al Lewis et al (2020) proposed a denoising autoencoder named BART, which combines Bidirectional and Auto-Regressive Transformers for pretraining sequence-to-sequence models, and BART is effective for text generation after fine-tuning.…”

Section: Sequence To Sequencementioning

confidence: 99%

SSR-TA: Sequence to Sequence based expert recurrent recommendation for ticket automation

Cao¹,

Fang²,

Luo³

et al. 2023

Preprint

View full text Add to dashboard Cite

The ticket automation provides crucial support for the normal operation of IT software systems. An essential task of ticket automation is to assign experts to solve upcoming tickets. However, facing thousands of tickets, inappropriate assignments will make tickets transfer frequently among experts, which causes time delays and wasted resources. Effectively and efficiently finding an appropriate expert in fewer steps is vital to ticket automation. In this paper, we proposed a sequence to sequence based translation model combined with a recurrent recommendation network to recommend appropriate experts for tickets. The sequence to sequence model transforms the ticket description into the corresponding resolution for capturing the potential and useful features of representing tickets. The recurrent recommendation network recommends the appropriate expert based on the assumption that the previous expert in the recommendation sequence cannot solve the expert. To evaluate the performance, we conducted experiments to compare several baselines with SSR-TA on two real-world datasets, and the experimental results show that our proposed model outperforms the baselines. The comparative experiment results also show that SSR-TA has a better performance of expert recommendations for user-generated tickets.

show abstract

A Multiplexed Network for End-to-End, Multilingual OCR

Cited by 40 publications

References 60 publications

Augmented Transformers with Adaptive n-grams Embedding for Multilingual Scene Text Recognition

Augmented Transformers with Adaptive n-grams Embedding for Multilingual Scene Text Recognition

TextStyleBrush: Transfer of Text Aesthetics from a Single Example

SSR-TA: Sequence to Sequence based expert recurrent recommendation for ticket automation

Contact Info

Product

Resources

About