Joint audio-visual patterns often exist in videos and provide strong multi-modal cues for detecting multimedia events. However, conventional methods generally fuse the visual and audio information only at a superficial level, without adequately exploring deep intrinsic joint patterns. In this paper, we propose a joint audio-visual bi-modal representation, called bi-modal words. We first build a bipartite graph to model the relations between the quantized words extracted from the visual and audio modalities. Partitioning over the bipartite graph is then applied to construct the bi-modal words that reveal the joint patterns across modalities. Finally, different pooling strategies are employed to re-quantize the visual and audio words into the bi-modal words and form bi-modal Bag-of-Words representations that are fed to subsequent multimedia event classifiers. We experimentally show that the proposed multi-modal feature achieves statistically significant performance gains over methods using individual visual and audio features alone, as well as over alternative multi-modal fusion methods. Moreover, we find that average pooling is the most suitable strategy for bi-modal feature generation.
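The pipeline above (bipartite partitioning of visual and audio words, then average pooling into bi-modal bins) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the co-occurrence counts are random stand-ins, the vocabulary sizes and cluster count are arbitrary, and spectral co-clustering is used as one concrete way to partition the bipartite graph.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
# Hypothetical co-occurrence counts: rows = 20 visual words, cols = 15 audio words.
cooc = rng.poisson(1.0, size=(20, 15)).astype(float) + 1e-6  # keep rows/cols nonzero

# Partition the bipartite graph; each bicluster plays the role of one "bi-modal word".
model = SpectralCoclustering(n_clusters=5, random_state=0)
model.fit(cooc)
visual_to_bimodal = model.row_labels_    # maps each visual word to a bi-modal word
audio_to_bimodal = model.column_labels_  # maps each audio word to a bi-modal word

def avg_pool(hist, assignment, n_bins=5):
    """Average pooling: re-quantize a Bag-of-Words histogram into bi-modal bins."""
    pooled = np.zeros(n_bins)
    for b in range(n_bins):
        members = hist[assignment == b]
        if members.size:
            pooled[b] = members.mean()
    return pooled

# Per-video histograms (random placeholders for illustration).
visual_hist = rng.random(20)
audio_hist = rng.random(15)
bimodal_feat = avg_pool(visual_hist, visual_to_bimodal) + avg_pool(audio_hist, audio_to_bimodal)
print(bimodal_feat.shape)
```

The pooled vector `bimodal_feat` would then serve as the joint feature fed to an event classifier.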
Trajectory prediction is a highly desirable capability for the safe navigation of autonomous vehicles in complex traffic. In this paper, we consider the practical problem of predicting trajectories in heterogeneous traffic environments. The proposed method has various applications in trajectory prediction and in applied fields beyond tracking. One challenge stands out in trajectory prediction for heterogeneous environments: many factors must be considered, including multiple types of road agents, social interactions, and terrain. This information is complex and voluminous, which can lead to inaccurate trajectory prediction. We propose social and visual attention modules to address this problem, together with a variant of the Info-GAN architecture to predict trajectories with multi-modal behaviors. Experimental results show that the proposed method significantly outperforms state-of-the-art methods in both heterogeneous and homogeneous real-world environments. CCS CONCEPTS • Computing methodologies → Computer vision; • Computer systems organization → Robotic autonomy.
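The social attention idea above (weighting neighboring road agents by their relevance to the ego agent) can be sketched as a small dot-product attention module. This is a hypothetical illustration under assumed dimensions, not the paper's architecture; the class name, hidden size, and input shapes are all placeholders.

```python
import torch
import torch.nn as nn

class SocialAttention(nn.Module):
    """Minimal sketch: attend over neighboring road agents' hidden states."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, ego, neighbours):
        # ego: (B, dim) ego-agent state; neighbours: (B, N, dim) neighbor states.
        q = self.q(ego).unsqueeze(1)                 # (B, 1, dim)
        k, v = self.k(neighbours), self.v(neighbours)
        # Scaled dot-product weights over the N neighbors.
        w = torch.softmax((q * k).sum(-1) / k.size(-1) ** 0.5, dim=-1)  # (B, N)
        return (w.unsqueeze(-1) * v).sum(1)          # (B, dim) social context

ego = torch.randn(4, 32)       # batch of 4 ego agents
nbrs = torch.randn(4, 6, 32)   # 6 neighbors each
ctx = SocialAttention()(ego, nbrs)
print(ctx.shape)
```

In a full system, such a context vector would be concatenated with visual features and a latent code before the generator of the Info-GAN-style predictor.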
Visual domain adaptation addresses the problem of adapting the sample distribution of the source domain to the target domain, where the recognition task is intended but the data distributions differ. In this paper, we present a low-rank reconstruction method to reduce the domain distribution disparity. Specifically, we transform the visual samples in the source domain into an intermediate representation such that each transformed source sample can be linearly reconstructed by the samples of the target domain. Unlike existing work, our method captures the intrinsic relatedness of the source samples during the adaptation process while uncovering the noises and outliers in the source domain that cannot be adapted, making it more robust than previous methods. We formulate our problem as a constrained minimization of a nuclear norm plus an ℓ2,1-norm and adopt the Augmented Lagrange Multiplier (ALM) method for the optimization. Extensive experiments on various visual adaptation tasks show that the proposed method consistently and significantly beats state-of-the-art domain adaptation methods.
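Each ALM iteration for a nuclear-norm plus ℓ2,1-norm objective of this kind reduces to two closed-form proximal steps: singular value thresholding for the nuclear norm and column-wise shrinkage for the ℓ2,1 norm. The sketch below shows only these two building blocks, not the paper's full algorithm; the test matrix and threshold value are arbitrary.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def l21_shrink(M, tau):
    """Column-wise shrinkage: proximal operator of the ell-2,1 norm."""
    norms = np.linalg.norm(M, axis=0)
    scale = np.maximum(norms - tau, 0.0) / np.where(norms > 0, norms, 1.0)
    return M * scale  # each column scaled toward zero; small columns vanish

A = np.random.default_rng(0).normal(size=(6, 8))
Z = svt(A, 1.0)        # low-rank part (reconstruction coefficients)
E = l21_shrink(A, 1.0) # column-sparse part (noises/outliers)
print(Z.shape, E.shape)
```

Within ALM, these operators would be applied alternately while a Lagrange multiplier enforces the reconstruction constraint between the transformed source samples and the target samples.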