Person re-identification (re-ID) aims to recognize the same person across images taken by different cameras. Cross-dataset/domain re-ID further requires leveraging labeled image data from a source domain for a target domain whose training data are unlabeled. To introduce discriminative ability and to generalize the re-ID model to the unsupervised target domain, our proposed Pose Disentanglement and Adaptation Network (PDA-Net) learns deep image representations with pose and domain information properly disentangled. Our model enables pose-guided image recovery and translation for images from either domain, without predefined pose categories or identity supervision. Our qualitative and quantitative results on two benchmark datasets confirm the effectiveness of our approach and its superiority over state-of-the-art cross-dataset re-ID approaches.
Audio-visual event localization requires identifying an event that is both visible and audible in a video (at either the frame or video level). To address this task, we propose a deep neural network named the Audio-Visual Sequence-to-Sequence Dual Network (AVSDN). By jointly taking both audio and visual features at each time segment as inputs, our proposed model learns global and local event information in a sequence-to-sequence manner, and can be trained in either fully supervised or weakly supervised settings. Empirical results confirm that our proposed method performs favorably against recent deep learning approaches in both settings.