Accurate temporal action proposals play an important role in detecting actions in untrimmed videos. Existing approaches have difficulty capturing global contextual information while simultaneously localizing actions of different durations. To this end, we propose a Relation-aware Pyramid Network (RapNet) to generate highly accurate temporal action proposals. In RapNet, a novel relation-aware module is introduced to exploit bi-directional long-range relations between local features for context distilling. This embedded module strengthens RapNet's ability to generate temporal proposals at multiple granularities over predefined anchor boxes. We further introduce a two-stage adjustment scheme to refine the proposal boundaries and to estimate their confidence of containing an action using snippet-level actionness. Extensive experiments on the challenging ActivityNet and THUMOS14 benchmarks demonstrate that RapNet generates more accurate proposals than existing state-of-the-art methods.
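A minimal sketch of the kind of relation-aware context aggregation described above, written as a non-local (self-attention) block over a 1-D sequence of snippet features. The layer names, hidden size, and residual design are illustrative assumptions, not the authors' exact architecture, and the bi-directional detail is not reproduced.

```python
import torch
import torch.nn as nn


class RelationAwareModule(nn.Module):
    """Aggregates long-range pairwise context between snippet features."""

    def __init__(self, channels: int, hidden: int = 128):
        super().__init__()
        self.query = nn.Conv1d(channels, hidden, kernel_size=1)
        self.key = nn.Conv1d(channels, hidden, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, temporal_length)
        q = self.query(x)                                   # (B, H, T)
        k = self.key(x)                                     # (B, H, T)
        v = self.value(x)                                   # (B, C, T)
        # pairwise relations between all snippet positions
        attn = torch.softmax(q.transpose(1, 2) @ k / q.size(1) ** 0.5, dim=-1)  # (B, T, T)
        context = v @ attn.transpose(1, 2)                  # (B, C, T)
        return x + context                                  # residual connection


features = torch.randn(2, 256, 100)                        # 100 snippets, 256-d features
out = RelationAwareModule(256)(features)
print(out.shape)                                            # torch.Size([2, 256, 100])
```

Because every position attends to every other position, each local feature is enriched with global context before the pyramid of anchors is scored, which is the role the relation-aware module plays in the abstract.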
Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it either as a boundary regression task or as a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. The framework selects a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match visual and textual information simultaneously at the sentence-moment and token-moment levels, yielding a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced that leverages graph convolution to capture the dependencies among video moment choices for selecting the best choice. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. Code will be available at https://github.com/Huntersxsx/RaNet.
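A hedged sketch of the multi-choice relation idea: moment candidates are treated as graph nodes and one graph-convolution step lets them exchange information before the best choice is selected. The adjacency construction, layer size, and mean-aggregation rule here are illustrative assumptions rather than the paper's exact constructor.

```python
import torch
import torch.nn as nn


class MultiChoiceRelationLayer(nn.Module):
    """One graph-convolution step over moment-choice features."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, choices: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # choices: (num_choices, dim), adj: (num_choices, num_choices)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        neighborhood = (adj / deg) @ choices              # mean aggregation over neighbors
        return torch.relu(self.proj(neighborhood) + choices)


num_choices, dim = 16, 512
choices = torch.randn(num_choices, dim)                   # fused choice-query features
# e.g. connect choices that overlap temporally (here: a random symmetric graph)
adj = (torch.rand(num_choices, num_choices) > 0.5).float()
adj = ((adj + adj.t()) > 0).float()
relational = MultiChoiceRelationLayer(dim)(choices, adj)
print(relational.shape)                                    # torch.Size([16, 512])
```

Letting candidate moments see one another in this way encodes the "which answer is best relative to the others" reasoning that the reading-comprehension formulation relies on.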
The detection of rail surface defects is an important means of ensuring the safe operation of rail transit. Because track surface defect features are complex and diverse and the defect areas are small, traditional machine vision methods struggle to obtain satisfactory detection results, while existing deep learning-based methods suffer from large model sizes, excessive parameters, low accuracy, and slow speed. Therefore, this paper proposes a new method for railway surface defect detection based on an improved YOLOv4 (You Only Look Once, YOLO). In this method, MobileNetv3 is used as the backbone network of YOLOv4 to extract image features, and depthwise separable convolution is applied to the PANet layers of YOLOv4, which makes the network lightweight and enables real-time detection of railway surface defects. The test results show that, compared with YOLOv4, the proposed method reduces the number of parameters by 78.04%, speeds up detection by 10.36 frames per second, and decreases the model size by 78%. Compared with other methods, the proposed method achieves higher detection accuracy, making it suitable for the fast and accurate detection of railway surface defects.
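A minimal sketch of the substitution described above: a standard 3x3 convolution replaced by a depthwise separable one (3x3 depthwise followed by 1x1 pointwise), which is where the parameter and model-size reduction comes from. The channel sizes and block layout are illustrative assumptions; the exact YOLOv4/PANet configuration is not reproduced.

```python
import torch
import torch.nn as nn


def depthwise_separable_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


standard = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
separable = depthwise_separable_conv(256, 256)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))   # ~590k vs ~69k parameters for this single layer

x = torch.randn(1, 256, 52, 52)
print(separable(x).shape)                   # torch.Size([1, 256, 52, 52])
```

Applying this swap across the PANet neck, together with a MobileNetv3 backbone, is consistent with the roughly 78% reduction in parameters and model size reported in the abstract.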