2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01505

Sequence-to-Sequence Contrastive Learning for Text Recognition

Cited by 87 publications (46 citation statements)
References 38 publications

“…Inspired by the success of contrastive learning in other domains, [7] proposed contrastive predictive coding, which learned representations by predicting the future in the latent space and showed great advances in various speech recognition tasks. Also, [16] extended the SimCLR model [10] to EEG data. More recently, [19] proposed a multitask contrastive learning approach that captures temporal and contextual information from time-series.…”
Section: Self-supervised Learning For Time-series
confidence: 99%
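The contrastive-predictive-coding idea quoted above (predicting a future latent from a context vector and scoring it against negative latents with an InfoNCE-style loss) can be sketched roughly as below. The bilinear predictor `W`, the latent dimensionality, and the toy inputs are illustrative assumptions, not details from the cited work:

```python
import numpy as np

def cpc_infonce(context, future, negatives, W):
    """InfoNCE loss for a CPC-style step: a bilinear predictor maps the
    context vector to a predicted future latent, which must score higher
    than a set of negative latents."""
    pred = context @ W                          # predicted future latent, shape (D,)
    pos = pred @ future                         # similarity with the true future
    neg = negatives @ pred                      # (K,) similarities with negatives
    logits = np.concatenate(([pos], neg))
    logits -= logits.max()                      # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

rng = np.random.default_rng(0)
D = 16
context = rng.normal(size=D)
future = context.copy()                         # toy case: the true future matches the context
negatives = rng.normal(size=(8, D))             # unrelated latents act as negatives
loss = cpc_infonce(context, future, negatives, np.eye(D))
```

The loss is low when the predicted latent is closer to the true future than to any negative, which is what drives the representation learning described in the quote.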
“…(2) They may not be able to capture low- and high-frequency time-varying features, which is important given the characteristics of time-series data [12], [13]. More recently, some work on contrastive learning for EEG, ECG, and time-series has been done, but it is mostly data- or application-specific [14], [15], [16].…”
Section: Introduction
confidence: 99%
“…Most of these methods follow the "pre-training and fine-tuning" paradigm. However, this learning paradigm has rarely been studied in the STR field, and only a few methods (Aberdam et al. 2021; Chen et al. 2020) have been proposed.…”
Section: Introduction
confidence: 99%
“…Scripts of the recognized textual content are detected by a CNN, whereas for detection they proposed UrduNet, an integration of CNN and LSTM networks. Aberdam et al. [13] proposed a SeqCLR architecture for sequence-to-sequence contrastive learning of visual representations, which they employ for text recognition. To realize the sequence-to-sequence framework, every feature map is separated into different instances over which the contrastive loss is computed.…”
Section: Introduction
confidence: 99%
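The instance-splitting idea described in the last quote (dividing each sequential feature map into sub-sequence instances and contrasting corresponding instances across two augmented views) can be sketched as follows. The average-pooling scheme, instance count, and temperature are illustrative assumptions rather than the paper's exact design:

```python
import numpy as np

def split_into_instances(feature_map, num_instances):
    """Average-pool a (T, D) frame sequence into (num_instances, D) instances."""
    chunks = np.array_split(feature_map, num_instances, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])

def contrastive_loss(view_a, view_b, temperature=0.1):
    """NT-Xent style loss: instance i of view_a is positive with instance i
    of view_b; all other instances in view_b serve as negatives."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                   # (N, N) cosine-similarity logits
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # positives lie on the diagonal

rng = np.random.default_rng(0)
fmap = rng.normal(size=(25, 64))                     # frame sequence from a text-image encoder
view_a = split_into_instances(fmap, 5)
view_b = split_into_instances(fmap + 0.01 * rng.normal(size=fmap.shape), 5)
loss = contrastive_loss(view_a, view_b)
```

Treating each sub-sequence, rather than the whole image, as a contrastive instance is what adapts image-level contrastive learning to sequence prediction tasks such as text recognition.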