Zineng Tang scite author profile

Zineng Tang

5 Publications

71 Citation Statements Received

195 Citation Statements Given

Affiliations

University of North Carolina at Chapel Hill, University of North Carolina Health Care

Publications

DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization

Tang, Lei, Bansal (2021)

Leveraging large-scale unlabeled web videos such as instructional videos for pre-training followed by task-specific finetuning has become the de facto approach for many video-and-language tasks. However, these instructional videos are very noisy: the accompanying ASR narrations are often incomplete, and can be irrelevant to or temporally misaligned with the visual content, limiting the performance of the models trained on such data. To address these issues, we propose an improved video-and-language pre-training method that first adds automatically-extracted dense region captions from the video frames as auxiliary text input, to provide informative visual cues for learning better video and language associations. Second, to alleviate the temporal misalignment issue, our method incorporates an entropy minimization-based constrained attention loss, to encourage the model to automatically focus on the correct caption from a pool of candidate ASR captions. Our overall approach is named DECEMBERT (Dense Captions and Entropy Minimization). Comprehensive experiments on three video-and-language tasks (text-to-video retrieval, video captioning, and video question answering) across five datasets demonstrate that our approach outperforms previous state-of-the-art methods. Ablation studies on pre-training and downstream tasks show that adding dense captions and the constrained attention loss helps improve model performance. Lastly, we also provide attention visualization to show the effect of applying the proposed constrained attention loss.
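The entropy-minimization idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `softmax` and `attention_entropy` and the example scores are hypothetical, and the abstract does not spell out the exact form of the constrained attention loss. The sketch only shows that the entropy of an attention distribution over candidate captions is lower when the attention concentrates on one candidate, which is the behavior an entropy-minimizing loss would reward.

```python
import math

def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(scores):
    """Shannon entropy (in nats) of the attention over candidate captions.

    Hypothetical helper: a training loss that minimizes this value pushes
    the model to commit to a single candidate caption.
    """
    probs = softmax(scores)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Three candidate ASR captions (illustrative scores, not real data).
uniform = attention_entropy([0.0, 0.0, 0.0])  # attention spread evenly
peaked = attention_entropy([5.0, 0.0, 0.0])   # attention mostly on one caption
```

In this sketch, `peaked` is smaller than `uniform`, so gradient descent on the entropy term would favor the peaked attention pattern.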

Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

Kim, Tang, Bansal (2020)

Videos convey rich information. Dynamic spatio-temporal relationships between people/objects, and diverse multimodal events are present in a video clip. Hence, it is important to develop automated models that can accurately extract such information from videos. Answering questions on videos is one of the tasks which can evaluate such AI abilities. In this paper, we propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions. Specifically, we first employ dense image captions to help identify objects and their detailed salient regions and actions, and hence give the model useful extra information (in explicit textual format to allow easier matching) for answering questions. Moreover, our model is also comprised of dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass more relevant information to the classifier. Finally, we also cast the frame selection problem as a multi-label classification task and introduce two loss functions, In-and-Out Frame Score Margin (IOFSM) and Balanced Binary Cross-Entropy (BBCE), to better supervise the model with human importance annotations. We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin (74.09% versus 70.52%). We also present several word, object, and frame level visualization studies.
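As a rough illustration of the frame-selection loss named in the abstract above, the sketch below shows one plausible reading of a "balanced" binary cross-entropy: the log-loss is averaged separately over frames inside and outside the annotated span, so that the usually more numerous outside frames do not dominate. The function name `balanced_bce`, the `eps` parameter, and the example scores are assumptions for illustration; the paper's exact formulation may differ.

```python
import math

def balanced_bce(scores, labels, eps=1e-12):
    """Binary cross-entropy averaged separately over positives and negatives.

    scores: per-frame probabilities in (0, 1) that a frame is relevant.
    labels: 1 for frames inside the annotated span, 0 for frames outside.
    eps:    small constant (an assumption) to avoid log(0).
    """
    pos = [math.log(s + eps) for s, y in zip(scores, labels) if y == 1]
    neg = [math.log(1.0 - s + eps) for s, y in zip(scores, labels) if y == 0]
    # Average each group on its own so class imbalance does not skew the loss.
    pos_term = sum(pos) / len(pos) if pos else 0.0
    neg_term = sum(neg) / len(neg) if neg else 0.0
    return -(pos_term + neg_term)

# One relevant frame out of four (illustrative values, not real data).
labels = [1, 0, 0, 0]
good = balanced_bce([0.9, 0.1, 0.1, 0.1], labels)  # scores agree with labels
bad = balanced_bce([0.1, 0.9, 0.9, 0.9], labels)   # scores contradict labels
```

With this averaging, the single inside frame contributes as much to the loss as the three outside frames combined.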

Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

Kim, Tang, Bansal (2020), Preprint

PERCEIVER-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention

Tang, Cho, Lei, et al. (2023)

TVLT: Textless Vision-Language Transformer

Tang, Cho, Nie, et al. (2022), Preprint

In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do not use text-specific modules such as tokenization or automatic speech recognition (ASR). TVLT is trained by reconstructing masked patches of continuous video frames and audio spectrograms (masked autoencoding) and contrastive modeling to align video and audio. TVLT attains performance comparable to its text-based counterpart, on various multimodal tasks, such as visual question answering, image retrieval, video retrieval, and multimodal sentiment analysis, with 28x faster inference speed and only 1/3 of the parameters. Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals without assuming the prior existence of text. Code and checkpoints are available at: https://github.com/zinengtang/TVLT. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).
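The masked-autoencoding objective mentioned in the abstract above can be sketched in a few lines. This is a simplified illustration, not the TVLT code: each "patch" is reduced to a single number, the 0.75 mask ratio is an assumed value, and the names `choose_masked` and `masked_mse` are hypothetical. The point is only that the reconstruction error is computed on the masked patches.

```python
import random

def choose_masked(num_patches, mask_ratio, rng):
    """Pick which patch indices to hide from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))

def masked_mse(pred, target, masked):
    """Mean squared reconstruction error over the masked patches only."""
    if not masked:
        return 0.0
    return sum((pred[i] - target[i]) ** 2 for i in masked) / len(masked)

rng = random.Random(0)                    # fixed seed for repeatability
target = [0.1 * i for i in range(8)]      # stand-in for 8 patch values
masked = choose_masked(len(target), 0.75, rng)  # 0.75 is an assumed ratio
pred = [0.0] * len(target)                # an untrained "reconstruction"
loss = masked_mse(pred, target, masked)
```

A perfect reconstruction gives a loss of zero, while the all-zero prediction here gives a positive loss on the six masked patches.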
