Images are an important carrier of emotional expression. Humans can understand the emotions in an image easily and quickly, whereas extracting accurate emotions is a very challenging task for machines. In this study, we propose a novel spatial and channel-wise attention-based emotion prediction model, SCEP, to help computers recognize the emotions of images more accurately. SCEP integrates both spatial attention and channel-wise weight mechanisms into a classical convolutional neural network (CNN) layer structure to predict image emotions, on the grounds that the spatial attention mechanism can enhance the contrast between salient regions and potentially irrelevant regions, and that the channel-wise weight mechanism can emphasize informative features while suppressing less useful ones. The SCEP model outputs emotion values in a continuous 2-D valence and arousal space, so that it can express more emotions than a discrete classification into emotion categories. To validate the effectiveness of our model, we test it on an existing image dataset with a widespread emotion distribution. Extensive experiments show that, compared to base models (i.e., VGG and ResNet) without spatial attention or channel-wise mechanisms, SCEP improves the accuracy of emotion prediction (evaluated by the concordance correlation coefficient) by ~3%-5% in the arousal domain and by ~3%-6% in the valence domain. We therefore conclude that SCEP yields higher accuracy in emotion prediction.
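For concreteness, below is a minimal PyTorch sketch of how spatial and channel-wise attention can be combined inside a CNN block, in the spirit of SE/CBAM-style gating. The class name, layer sizes, and reduction ratio are illustrative assumptions, not the exact SCEP configuration described in the paper.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Illustrative channel-wise + spatial attention block (hypothetical,
    not the exact SCEP layer)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel-wise weighting: squeeze global context, then re-weight channels
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: a 2-D map that boosts salient regions
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)              # emphasize informative channels
        avg_map = x.mean(dim=1, keepdim=True)     # per-pixel channel average
        max_map, _ = x.max(dim=1, keepdim=True)   # per-pixel channel max
        attn = self.spatial_gate(torch.cat([avg_map, max_map], dim=1))
        return x * attn                           # enhance salient regions
```

In a valence-arousal setup of the kind the abstract describes, such a block would sit inside the CNN backbone, with a final 2-unit regression head producing the continuous (valence, arousal) output.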
Shots are key narrative elements of various videos, e.g., movies, TV series, and the user-generated videos thriving on the Internet. The type of a shot greatly influences how underlying ideas, emotions, and messages are expressed, so analyzing shot types is important for video understanding and is in increasing demand in real-world applications. Classifying shot type is challenging because it requires information beyond the raw video content, such as the spatial composition of a frame and the camera movement. To address these issues, we propose a learning framework, Subject Guidance Network (SGNet), for shot type recognition. SGNet separates the subject and background of a shot into two streams, which serve as separate guidance maps for scale and movement type classification, respectively. To facilitate shot type analysis and model evaluation, we build a large-scale dataset, MovieShots, which contains 46K shots from 7K movie trailers annotated with their scale and movement types. Experiments show that our framework recognizes these two shot attributes accurately, outperforming all previous methods. The dataset and related code are publicly released.
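As a rough illustration of the two-stream idea, the hypothetical sketch below feeds subject-masked and background-masked frames through a shared encoder and routes them to separate scale and movement heads. All names, the placeholder encoder, and the default class counts are assumptions for illustration, not the actual SGNet architecture.

```python
import torch
import torch.nn as nn

class TwoStreamShotClassifier(nn.Module):
    """Hypothetical two-stream sketch: a subject map guides shot-scale
    classification, a background map guides movement classification."""
    def __init__(self, num_scale: int = 5, num_movement: int = 4):
        super().__init__()
        # Shared frame encoder (placeholder for a real backbone)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.scale_head = nn.Linear(128, num_scale)
        self.movement_head = nn.Linear(128, num_movement)

    def forward(self, frames, subject_mask, background_mask):
        # frames: (B, 3, H, W); masks: (B, 1, H, W) in [0, 1]
        subject_feat = self.backbone(frames * subject_mask)
        background_feat = self.backbone(frames * background_mask)
        return self.scale_head(subject_feat), self.movement_head(background_feat)
```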
The ability to choose an appropriate camera view among multiple cameras plays a vital role in TV show delivery. However, the lack of high-quality training data makes it hard to model the statistical patterns of view selection and to apply intelligent processing. To address this issue, we first collect a novel benchmark for this setting covering four diverse scenarios, namely concerts, sports games, gala shows, and contests, where each scenario contains 6 synchronized tracks recorded by different cameras. The benchmark contains 88 hours of raw video contributing to 14 hours of edited video. (A shot is a series of continuous frames recorded by a camera, and a track refers to the video recorded by one camera from a specific view.) Based on this benchmark, we further propose a new approach, a temporal and contextual transformer, that utilizes clues from historical shots and other views to make shot transition decisions and predict which view should be used next. Extensive experiments show that our method outperforms existing methods on the proposed multi-camera editing benchmark.
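A hypothetical sketch of how such a temporal-and-contextual selector could be organized: a transformer encoder attends jointly over embeddings of the historical shots and the current candidate views, then scores each of the 6 tracks. All module names and dimensions here are assumptions for illustration; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class ViewSelector(nn.Module):
    """Hypothetical sketch: attend over historical-shot embeddings and
    candidate-view embeddings, then score each camera track."""
    def __init__(self, dim: int = 256, num_views: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score = nn.Linear(dim, 1)
        self.num_views = num_views

    def forward(self, history, candidates):
        # history:    (B, T, dim) embeddings of previously selected shots
        # candidates: (B, num_views, dim) embeddings of the current camera views
        tokens = torch.cat([history, candidates], dim=1)
        encoded = self.encoder(tokens)
        view_tokens = encoded[:, -self.num_views:, :]
        return self.score(view_tokens).squeeze(-1)  # (B, num_views) logits
```

At inference, taking an argmax (or a thresholded comparison against the currently active view) over the logits would yield the shot transition decision.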
Figure 1. We present Virtual Dynamic Storyboard (VDS), which takes user-input story scripts (e.g., "Jane and Jack are arguing in the living room") and camera scripts (e.g., push, medium, eye-level; dolly, full, high-angle) and automatically composes dynamic storyboards in an engine-based virtual environment for pre-visualization. Shown are two results produced by VDS with top-ranked quality scores. Video demos can be found in the supplementary material.