Employing the large-scale pre-trained model CLIP for the video-text retrieval (VTR) task has become a new trend that surpasses previous VTR methods. However, due to the heterogeneity of structure and content between video and text, previous CLIP-based models are prone to overfitting during training, resulting in relatively poor retrieval performance. In this paper, we propose a multi-stream Corpus Alignment network with a single-gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to address these two forms of heterogeneity. CAMoE employs a Mixture-of-Experts (MoE) to extract multi-perspective video representations, covering actions, entities, scenes, etc., and then aligns them with the corresponding parts of the text. In this stage, we conduct extensive explorations of the feature extraction and feature alignment modules and distill them into an efficient VTR framework. DSL is proposed to avoid the one-way optimal match that occurs in previous contrastive methods. By introducing the intrinsic prior of each pair in a batch, DSL serves as a reviser that corrects the similarity matrix and achieves a dual optimal match. DSL requires only one line of code to implement yet yields significant improvements. The results show that the proposed CAMoE and DSL are highly effective, and each of them individually achieves state-of-the-art (SOTA) performance on various benchmarks such as MSR-VTT, MSVD, and LSMDC. Combined, they advance performance considerably, surpassing the previous SOTA methods by around 4.6% R@1 on MSR-VTT. The code will be available soon at https://github.com/starmemda/CAMoE/
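Below is a minimal PyTorch sketch of a DSL-style revision of the similarity matrix; the temperature value and the symmetric cross-entropy wrapper are illustrative assumptions rather than the authors' exact implementation.

import torch
import torch.nn.functional as F

def dual_softmax_loss(sim: torch.Tensor, temperature: float = 100.0) -> torch.Tensor:
    # sim: (B, B) video-to-text similarity matrix for a batch of B pairs.
    # The "one-line" revision: weight each similarity by a softmax prior taken
    # over the opposite axis, so a pair must match well in both directions.
    sim = sim * F.softmax(sim * temperature, dim=0)
    labels = torch.arange(sim.size(0), device=sim.device)
    # Symmetric contrastive objective over the revised matrix.
    return 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))

Because the prior comes from the batch itself, the reviser needs no extra parameters and can also rescale the similarity matrix before ranking at inference time.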
The development of convolutional neural networks has driven progress in computer-aided diagnostic systems. Details in medical images, such as texture and tissue structure, are crucial features for diagnosis. Therefore, recent research on chest X-ray diagnosis adopts large input images combined with deep convolutional neural networks to boost performance. Meanwhile, because thoracic diseases vary in size, many researchers have introduced additional modules to capture multi-scale image features in CNNs. However, these efforts rarely consider the computational cost of the large inputs and the added modules. This paper aims to diagnose diseases on chest X-ray images automatically, quickly, and effectively. We propose multi-kernel depthwise convolution (MD-Conv), which contains depthwise convolution kernels of different filter sizes within a single depthwise convolution layer. MD-Conv is computationally efficient and has few parameters. Because of its ability to learn multi-scale features through its multi-size kernels, it is well suited to medical image diagnosis tasks in which abnormalities vary in size. In addition, MD-Conv adopts larger depthwise convolution kernels to obtain a larger receptive field efficiently, ensuring a sufficient receptive field for high-resolution inputs. MD-Conv can easily replace the standard depthwise convolution layer in modern lightweight networks. We conduct experiments on the ChestX-ray14 dataset, the largest available chest X-ray dataset, and obtain competitive results. We also evaluate MD-Conv on the newly released dataset for pediatric pneumonia diagnosis, obtaining a better AUC of 98.3% than the original paper (96.8%) for recognizing pneumonia versus normal. We also compare the FLOPs and parameters of different models to show their efficiency for chest X-ray recognition. Index terms: chest X-ray recognition, lightweight networks, multi-kernel depthwise convolution.
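Below is a minimal PyTorch sketch of a multi-kernel depthwise convolution layer in this spirit; the even channel split and the 3/5/7 kernel sizes are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class MDConv(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # Split the channels across the kernel sizes; each split gets its own
        # depthwise convolution (groups equals that split's channel count).
        splits = [channels // len(kernel_sizes)] * len(kernel_sizes)
        splits[0] += channels - sum(splits)  # absorb any remainder
        self.splits = splits
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False)
            for c, k in zip(splits, kernel_sizes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = torch.split(x, self.splits, dim=1)
        # Each chunk sees a different receptive field before being rejoined,
        # so one layer captures multi-scale features at depthwise cost.
        return torch.cat([conv(c) for conv, c in zip(self.convs, chunks)], dim=1)

Since every branch is depthwise, the layer is a drop-in replacement for an existing depthwise convolution in a lightweight backbone, e.g. MDConv(64) in place of a 3x3 depthwise nn.Conv2d with 64 channels.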
Since Transformer found widespread use in NLP, its potential in computer vision has been recognized and has inspired many new approaches. However, the computation required when image patches replace word tokens after tokenizing the image is vast (e.g., in ViT), which bottlenecks model training and inference. In this paper, we propose a new attention mechanism for Transformer, termed Cross Attention, which alternates attention within each image patch, rather than over the whole image, to capture local information, and attention between image patches, divided from single-channel feature maps, to capture global information. Both operations require less computation than standard self-attention in Transformer. By alternately applying attention within patches and between patches, we implement cross attention to maintain performance at lower computational cost and build a hierarchical network called Cross Attention Transformer (CAT) for vision tasks. Our base model achieves state-of-the-art results on ImageNet-1K and improves the performance of other methods on COCO and ADE20K, illustrating that our network has the potential to serve as a general backbone. The code and models are available at https://github.com/linhezheng19/CAT.
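To make the computational argument concrete, below is a minimal PyTorch sketch of the two token layouts such a scheme alternates between; the 7x7 patch size and helper names are illustrative assumptions, and a standard multi-head self-attention would be applied to each returned sequence.

import torch

def inner_patch_tokens(x: torch.Tensor, p: int = 7) -> torch.Tensor:
    # x: (B, H, W, C) with H and W divisible by p.
    # Returns (B * num_patches, p*p, C): attention runs within each patch.
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, p * p, C)

def cross_patch_tokens(x: torch.Tensor, p: int = 7) -> torch.Tensor:
    # Returns (B * C, num_patches, p*p): one token per patch, built per
    # single-channel feature map, so patches attend to each other.
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C).permute(0, 5, 1, 3, 2, 4)
    return x.reshape(B * C, (H // p) * (W // p), p * p)

Each inner-patch sequence has length p*p and each cross-patch sequence has length (H/p)*(W/p), both far shorter than the H*W tokens of global self-attention, which is where the saving over standard self-attention comes from.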
The task of multi-label image classification is to recognize all object labels present in an image. Despite years of progress, small objects, similar objects, and objects with high conditional probability remain the main bottlenecks of previous convolutional neural network (CNN) based models, limited by the representational capacity of convolutional kernels. Recent vision transformer networks use the self-attention mechanism to extract pixel-level features that express richer local semantic information, but this is insufficient for mining global spatial dependence. In this paper, we point out three crucial problems that CNN-based methods encounter and explore the possibility of designing specific transformer modules to settle them. We put forward a Multi-label Transformer architecture (MlTr) built with window partitioning, in-window pixel attention, and cross-window attention, which particularly improves the performance of multi-label image classification tasks. The proposed MlTr achieves state-of-the-art results on prevalent multi-label datasets such as MS-COCO, Pascal-VOC, and NUS-WIDE, with 88.5%, 95.8%, and 65.5% respectively. The code will be available soon at https://github.com/starmemda/MlTr/
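As a concrete illustration of the objective such multi-label models optimize, below is a minimal PyTorch sketch of a per-class sigmoid classification head; MultiLabelHead, its single linear layer, and the pooled-feature input are illustrative assumptions, not the paper's exact head design.

import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)
        # Each label is an independent binary decision, so multi-label models
        # use a per-class sigmoid rather than one softmax over all classes.
        self.criterion = nn.BCEWithLogitsLoss()

    def forward(self, feats: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # feats: (B, feat_dim) pooled features; targets: (B, num_classes) in {0, 1}.
        logits = self.fc(feats)
        return self.criterion(logits, targets.float())

Unlike the single softmax used for single-label classification, the per-class sigmoid lets several labels be active at once, which is the essential requirement of the multi-label setting.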