2018
DOI: 10.1007/978-3-030-01261-8_10

Visual Text Correction

Abstract: This paper introduces a new problem, called Visual Text Correction (VTC), i.e., finding and replacing an inaccurate word in the textual description of a video. We propose a deep network that can simultaneously detect an inaccuracy in a sentence and fix it by replacing the inaccurate word(s). Our method leverages the semantic interdependence of videos and words, as well as the short-term and long-term relations of the words in a sentence. Our proposed formulation can solve the VTC problem employing an End-to-End…
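The abstract describes a network that jointly detects an inaccurate word and proposes its replacement, combining sentence context with video evidence. Below is a minimal PyTorch sketch of that two-stage idea; the Bi-LSTM encoder, the gated fusion of video features, and all dimensions are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of the VTC idea from the abstract: (1) score each word
# for inaccuracy using sentence context fused with a video feature, and
# (2) predict a replacement word for the most suspicious position.
# Module choices and dimensions are assumptions, not the paper's design.
import torch
import torch.nn as nn

class VTCSketch(nn.Module):
    def __init__(self, vocab_size, word_dim=300, video_dim=512, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        # Bi-LSTM captures short- and long-term relations between words.
        self.context = nn.LSTM(word_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.video_proj = nn.Linear(video_dim, 2 * hidden)
        self.detector = nn.Linear(2 * hidden, 1)            # per-word inaccuracy score
        self.corrector = nn.Linear(2 * hidden, vocab_size)  # replacement word logits

    def forward(self, words, video_feat):
        # words: (B, T) word indices; video_feat: (B, video_dim) pooled clip feature
        h, _ = self.context(self.embed(words))           # (B, T, 2*hidden)
        v = self.video_proj(video_feat).unsqueeze(1)     # (B, 1, 2*hidden)
        fused = h * torch.sigmoid(v)                     # gate text by video evidence
        inaccuracy = self.detector(fused).squeeze(-1)    # (B, T)
        replacement = self.corrector(fused)              # (B, T, vocab)
        return inaccuracy, replacement

model = VTCSketch(vocab_size=10_000)
words = torch.randint(0, 10_000, (2, 12))
video = torch.randn(2, 512)
inacc, repl = model(words, video)
pos = inacc.argmax(dim=1)                             # most suspicious word per sentence
new_word = repl[torch.arange(2), pos].argmax(dim=-1)  # its proposed replacement
print(pos.shape, new_word.shape)                      # torch.Size([2]) torch.Size([2])
```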

Cited by 9 publications (7 citation statements)
References 46 publications (65 reference statements)
“…Real datasets are not easy to collect. Therefore, similar to [26,27], we also created a synthetic dataset. We took 1,000 random background images from the Places dataset [28] and 1,000 random foreground images from the Caltech-UCSD Birds 200 dataset to draw our input from.…”
Section: Methods
confidence: 99%
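The statement above describes a simple synthetic-data recipe: composite a random bird foreground onto a random scene background. A small illustrative sketch of that recipe follows; the directory paths, file formats, and use of an alpha mask are assumptions, and the cited work may composite differently.

```python
# Illustrative synthetic-sample generator (assumed details): paste a
# random Caltech-UCSD Birds foreground onto a random Places background.
import random
from pathlib import Path
from PIL import Image

backgrounds = list(Path("places/").glob("*.jpg"))   # assumed local copies
foregrounds = list(Path("cub200/").glob("*.png"))   # assumed RGBA crops with masks

def make_sample(out_path):
    bg = Image.open(random.choice(backgrounds)).convert("RGB").resize((256, 256))
    fg = Image.open(random.choice(foregrounds)).convert("RGBA").resize((128, 128))
    x, y = random.randint(0, 128), random.randint(0, 128)
    bg.paste(fg, (x, y), mask=fg)   # alpha channel keeps only the bird pixels
    bg.save(out_path)

make_sample("synthetic_0000.jpg")
```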
“…introduced a video-based QA dataset along with a two-stream model that processes both video and subtitles to pick the correct answer among candidate answers. Related studies include grounding of spatiotemporal features to answer questions (Lei et al., 2019); a video fill-in-the-blank version of VQA (Mazaheri et al., 2017); and other examples (Kim et al., 2019b,a; Zadeh et al., 2019; Yi et al., 2019; Mazaheri and Shah, 2018).…”
Section: Related Work
confidence: 99%
“…Other approaches have leveraged reinforcement learning, either by providing entailment rewards (Pasunuru & Bansal, 2017b), or to address description generation for multiple fine-grained actions (Wang et al., 2018b). Further, Mazaheri and Shah (2018) proposed a deep network designed to detect inaccuracies in a sentence and fix them by replacing the inaccurate word(s) with the help of a Visual Text Correction system. Recently, Zhang et al. introduced an object relational graph (ORG) based encoder, which encapsulates the relations among visual objects to build a richer representation, and a decoder that integrates an external language model to capture abundant linguistic knowledge for efficient video description generation.…”
Section: Video Description Generation - Introduction
confidence: 99%
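As a rough illustration of the relation-encoding idea in the last statement, the sketch below lets detected-object features attend over one another so that each object's representation encodes its relations to the rest. The single attention head and all dimensions are assumptions; this is not Zhang et al.'s actual ORG implementation.

```python
# Sketch of a relation-aware object encoder: scaled dot-product attention
# over per-object features, so each output mixes in related objects.
import torch
import torch.nn as nn

class RelationalObjectEncoder(nn.Module):
    def __init__(self, obj_dim=2048, hidden=512):
        super().__init__()
        self.q = nn.Linear(obj_dim, hidden)
        self.k = nn.Linear(obj_dim, hidden)
        self.v = nn.Linear(obj_dim, hidden)
        self.scale = hidden ** -0.5

    def forward(self, objs):
        # objs: (B, N, obj_dim) detected-object features for one video clip
        attn = torch.softmax(self.q(objs) @ self.k(objs).transpose(1, 2)
                             * self.scale, dim=-1)   # (B, N, N) relation weights
        return attn @ self.v(objs)                   # (B, N, hidden) relation-aware features

enc = RelationalObjectEncoder()
feats = enc(torch.randn(2, 10, 2048))   # 10 objects per clip
print(feats.shape)                      # torch.Size([2, 10, 512])
```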