2018
DOI: 10.1007/978-3-030-01261-8_10

Visual Text Correction

Abstract: This paper introduces a new problem, called Visual Text Correction (VTC), i.e., finding and replacing an inaccurate word in the textual description of a video. We propose a deep network that can simultaneously detect an inaccuracy in a sentence and fix it by replacing the inaccurate word(s). Our method leverages the semantic interdependence of videos and words, as well as the short-term and long-term relations of the words in a sentence. Our proposed formulation can solve the VTC problem employing an End-to-End…
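The abstract describes a network that jointly detects an inaccurate word and proposes its replacement, combining sentence context with video evidence. Below is a minimal PyTorch sketch of that two-stage idea; the Bi-LSTM encoder, the gated fusion of video features, and all dimensions are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of the VTC idea from the abstract: (1) score each word
# for inaccuracy using sentence context fused with a video feature, and
# (2) predict a replacement word for the most suspicious position.
# Module choices and dimensions are assumptions, not the paper's design.
import torch
import torch.nn as nn

class VTCSketch(nn.Module):
    def __init__(self, vocab_size, word_dim=300, video_dim=512, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        # Bi-LSTM captures short- and long-term relations between words.
        self.context = nn.LSTM(word_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.video_proj = nn.Linear(video_dim, 2 * hidden)
        self.detector = nn.Linear(2 * hidden, 1)            # per-word inaccuracy score
        self.corrector = nn.Linear(2 * hidden, vocab_size)  # replacement word logits

    def forward(self, words, video_feat):
        # words: (B, T) word indices; video_feat: (B, video_dim) pooled clip feature
        h, _ = self.context(self.embed(words))           # (B, T, 2*hidden)
        v = self.video_proj(video_feat).unsqueeze(1)     # (B, 1, 2*hidden)
        fused = h * torch.sigmoid(v)                     # gate text by video evidence
        inaccuracy = self.detector(fused).squeeze(-1)    # (B, T)
        replacement = self.corrector(fused)              # (B, T, vocab)
        return inaccuracy, replacement

model = VTCSketch(vocab_size=10_000)
words = torch.randint(0, 10_000, (2, 12))
video = torch.randn(2, 512)
inacc, repl = model(words, video)
pos = inacc.argmax(dim=1)                             # most suspicious word per sentence
new_word = repl[torch.arange(2), pos].argmax(dim=-1)  # its proposed replacement
print(pos.shape, new_word.shape)                      # torch.Size([2]) torch.Size([2])
```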

Cited by 9 publications (7 citation statements)
References 46 publications (65 reference statements)
“…Real datasets are not easy to collect. Therefore, similar to [26,27], we also created a synthetic dataset. We took 1,000 random background images from the Places dataset [28] and 1,000 random foreground images from the Caltech-UCSD Birds 200 dataset to draw our input from.…”
Section: Methods
confidence: 99%
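The statement above describes a simple synthetic-data recipe: composite a random bird foreground onto a random scene background. A small illustrative sketch of that recipe follows; the directory paths, file formats, and use of an alpha mask are assumptions, and the cited work may composite differently.

```python
# Illustrative synthetic-sample generator (assumed details): paste a
# random Caltech-UCSD Birds foreground onto a random Places background.
import random
from pathlib import Path
from PIL import Image

backgrounds = list(Path("places/").glob("*.jpg"))   # assumed local copies
foregrounds = list(Path("cub200/").glob("*.png"))   # assumed RGBA crops with masks

def make_sample(out_path):
    bg = Image.open(random.choice(backgrounds)).convert("RGB").resize((256, 256))
    fg = Image.open(random.choice(foregrounds)).convert("RGBA").resize((128, 128))
    x, y = random.randint(0, 128), random.randint(0, 128)
    bg.paste(fg, (x, y), mask=fg)   # alpha channel keeps only the bird pixels
    bg.save(out_path)

make_sample("synthetic_0000.jpg")
```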
“…introduced a video-based QA dataset along with a two-stream model that processes both video and subtitles to pick the correct answer among candidate answers. Related studies include grounding of spatiotemporal features to answer questions (Lei et al., 2019); a video fill-in-the-blank version of VQA (Mazaheri et al., 2017); and other examples (Kim et al., 2019b,a; Zadeh et al., 2019; Yi et al., 2019; Mazaheri and Shah, 2018).…”
Section: Related Work
confidence: 99%
“…Other approaches have leveraged reinforcement learning, either by providing entailment rewards (Pasunuru & Bansal, 2017b), or to address description generation for multiple fine-grained actions (Wang et al., 2018b). Further, Mazaheri and Shah (2018) proposed a deep network designed to detect inaccuracies in a sentence and fix them by replacing the inaccurate word(s) with the help of a Visual Text Correction system. Recently, Zhang et al. introduced an object relational graph (ORG) based encoder, which encapsulates the relations among visual objects to build a richer representation, and a decoder that integrates an external language model to capture abundant linguistic knowledge for efficient video description generation.…”
Section: Video Description Generation - Introduction
confidence: 99%
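As a rough illustration of the relation-encoding idea in the last statement, the sketch below lets detected-object features attend over one another so that each object's representation encodes its relations to the rest. The single attention head and all dimensions are assumptions; this is not Zhang et al.'s actual ORG implementation.

```python
# Sketch of a relation-aware object encoder: scaled dot-product attention
# over per-object features, so each output mixes in related objects.
import torch
import torch.nn as nn

class RelationalObjectEncoder(nn.Module):
    def __init__(self, obj_dim=2048, hidden=512):
        super().__init__()
        self.q = nn.Linear(obj_dim, hidden)
        self.k = nn.Linear(obj_dim, hidden)
        self.v = nn.Linear(obj_dim, hidden)
        self.scale = hidden ** -0.5

    def forward(self, objs):
        # objs: (B, N, obj_dim) detected-object features for one video clip
        attn = torch.softmax(self.q(objs) @ self.k(objs).transpose(1, 2)
                             * self.scale, dim=-1)   # (B, N, N) relation weights
        return attn @ self.v(objs)                   # (B, N, hidden) relation-aware features

enc = RelationalObjectEncoder()
feats = enc(torch.randn(2, 10, 2048))   # 10 objects per clip
print(feats.shape)                      # torch.Size([2, 10, 512])
```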