Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network

Wang, Bairui; Ma, Lin; Zhang, Wei; Jiang, Wenhao; Wang, Jingwen; Liu, Wei

doi:10.1109/iccv.2019.00273

Cited by 151 publications

(92 citation statements)

References 50 publications

“…We propose the DL network framework shown in Fig. 3 composed of ResNet [2], 3D ResNext [8], a feature-fusion module (FFM) [9], and predictive network.…”

Section: B Framework Architecture and Methods (mentioning)

confidence: 99%

Applying Deep-Learning-Based Computer Vision to Wireless Communications: Methodologies, Opportunities, and Challenges

Tian, Pan, Alouini

2020

Preprint

Deep learning (DL) has seen great success in the computer vision (CV) field, and related techniques have been used in security, healthcare, remote sensing, and many other fields. As a parallel development, visual data has become universal in daily life, easily generated by ubiquitous low-cost cameras. Therefore, exploring DL-based CV may yield useful information about objects, such as their number, locations, distribution, motion, etc. Intuitively, DL-based CV can also facilitate and improve the designs of wireless communications, especially in dynamic network scenarios. However, so far, such work is rare in the literature. The primary purpose of this article, then, is to introduce ideas about applying DL-based CV in wireless communications to bring some novel degrees of freedom to both theoretical research and engineering applications. To illustrate how DL-based CV can be applied in wireless communications, an example of using a DL-based CV with a millimeter-wave (mmWave) system is given to realize optimal mmWave multiple-input and multiple-output (MIMO) beamforming in mobile scenarios. In this example, we propose a framework to predict future beam indices from previously observed beam indices and images of street views using ResNet, 3-dimensional ResNext, and a long short-term memory network. The experimental results show that our frameworks achieve much higher accuracy than the baseline method, and that visual data can significantly improve the performance of the MIMO beamforming system. Finally, we discuss the opportunities and challenges of applying DL-based CV in wireless communications.

“…It can replace the 2D LSTM network. Much CV research has shown that if these techniques are jointly applied to make full use of the visual data, better results can be obtained [9], [11].…”

Section: B Selecting CV Techniques (mentioning)

confidence: 99%

“…3. It is composed of ResNet [2], 3D ResNext [8], feature fusion module (FFM) [9] and predictive network which will be elaborated as below.…”

Section: B Framework Architecture and Methods (mentioning)

confidence: 99%

“…It can be used to replace the 2D LSTM network. Many CV pieces of research have shown that if these techniques can be jointly applied to make full use of the visual data, better results can be obtained [9], [11]. So, a single proper CV technique or an adequate combination of several CV techniques are required to deal with a specific problem in wireless systems.…”

Section: B The Selection of CV Techniques (mentioning)

confidence: 99%

Applying Deep-Learning-Based Computer Vision to Wireless Communications: Methodologies, Opportunities, and Challenges

Tian, Pan, Alouini

2020

Preprint

Deep learning (DL) has achieved great success in the computer vision (CV) field, and the related techniques have been widely used in security, healthcare, remote sensing, etc. Meanwhile, visual data is universal in our daily life and is easily generated by prevalent, low-cost cameras. Therefore, DL-based CV can be explored to obtain and forecast useful information about objects, e.g., their number, locations, distribution, motion, etc. Intuitively, DL-based CV can facilitate and improve the designs of wireless communications, especially in dynamic network scenarios. However, such work is so far rare in the existing literature. The primary purpose of this article is therefore to introduce ideas for applying DL-based CV in wireless communications, bringing some novel degrees of freedom to both theoretical research and engineering applications. To illustrate how DL-based CV can be applied in wireless communications, an example of applying DL-based CV to a millimeter-wave (mmWave) system is given to realize optimal mmWave multiple-input and multiple-output (MIMO) beamforming in mobile scenarios. In this example, we propose a framework to predict future beam indices from previously observed beam indices and images of street views using ResNet, 3-dimensional ResNext, and a long short-term memory network. Experimental results show that our frameworks can achieve much higher accuracy than the baseline method, and that visual data can significantly improve the performance of the MIMO beamforming system. Finally, we discuss the opportunities and challenges of applying DL-based CV in wireless communications.

“…Besides, to make the generated captions more diverse and accurate, Deshpande et al. leveraged the quantized Part-of-Speech (POS) tag sequence sampled from a given benchmark to condition word prediction at the decoding recurrent model [6]. Wang et al. tried to predict the POS sequence tag by tag from the input video, and then embedded them as a global POS representation to gate the inputs of the sentence decoder for syntax control [32]. With manually altering the predicted POS tag sequence, Wang et al. showed that they can obtain captions with different syntaxes.…”

Section: Controllable Captioning With Auxiliary Information Guidance (mentioning)

confidence: 99%

Controllable Video Captioning with an Exemplar Sentence

Yuan, Ma, Wang, et al.

2020

Proceedings of the 28th ACM International Conference on Multimedia

Self Cite

In this paper, we investigate a novel and challenging task, namely controllable video captioning with an exemplar sentence. Formally, given a video and a syntactically valid exemplar sentence, the task aims to generate one caption which not only describes the semantic contents of the video, but also follows the syntactic form of the given exemplar sentence. In order to tackle such an exemplar-based video captioning task, we propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture. The proposed SMCG takes video semantic representation as an input, and conditionally modulates the gates and cells of long short-term memory network with respect to the encoded syntactic information of the given exemplar sentence. Therefore, SMCG is able to control the states for word prediction and achieve the syntax customized caption generation. We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets. Extensive experimental results demonstrate the effectiveness of our approach on generating syntax controllable and semantic preserved video captions. By providing different exemplar sentences, our approach is capable of producing different captions with various syntactic structures, thus indicating a promising way to strengthen the diversity of video captioning.
