2022
DOI: 10.48550/arXiv.2211.04772
Preprint
Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation

Abstract: Audio Spectrogram Transformer models rule the field of Audio Tagging, outrunning previously dominating Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as AudioSet. However, Transformers are demanding in terms of model size and computational requirements compared to CNNs. We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers. The proposed traini…
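Since the abstract's core idea is offline KD for multi-label audio tagging, a minimal sketch of such an objective follows, assuming teacher logits have been precomputed and stored. The BCE-on-soft-targets form and the weighting factor `lam` are common choices for this setting and are assumptions here, not taken verbatim from the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, lam=0.1):
    """Offline KD objective (illustrative sketch, not the paper's exact loss)."""
    # Hard-label term: multi-label BCE against the ground-truth tags.
    label_loss = F.binary_cross_entropy_with_logits(student_logits, labels)
    # Distillation term: match the teacher's sigmoid activations.
    # Teacher logits are precomputed offline, so no teacher forward
    # pass is needed during student training.
    distill_loss = F.binary_cross_entropy_with_logits(
        student_logits, torch.sigmoid(teacher_logits))
    # lam balances the two terms; 0.1 is an illustrative default.
    return lam * label_loss + (1.0 - lam) * distill_loss
```

Because the teacher runs only once over the dataset, the per-step training cost is that of the efficient CNN student alone.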

Cited by 2 publications (3 citation statements)
References 14 publications
“…In conclusion, the Transformer model had the following advantages over the convolutional models: first, it constructed long-distance feature relationships, using the attention mechanism to obtain contextual information across the feature map, which compensated for the slow expansion from local to global features through layer-by-layer downsampling [60, 61, 62]. Second, it had the capability of multimodal fusion.…”
Section: Related Work (mentioning)
Confidence: 99%
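To make the quoted contrast concrete, here is a minimal PyTorch sketch (dimensions are illustrative) of how a single self-attention layer relates every position of a flattened spectrogram feature map to every other position, where a CNN would need many downsampling layers to reach the same receptive field:

```python
import torch
import torch.nn as nn

# Every position attends to every other position in one layer, so
# global context is available immediately rather than being built up
# through layer-by-layer downsampling as in a CNN.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
feat = torch.randn(1, 250, 64)         # (batch, time-freq patches, dim)
out, weights = attn(feat, feat, feat)  # weights: (1, 250, 250) pairwise
```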
“…We extract sound tags to provide more context. We use an audio tagging model (Schmid et al., 2022) to classify the entire audio stream. We select the top 3 predicted tags whose confidence exceeds the threshold (0.3).…”
Section: Visual Descriptions and Utterances (Chronologically) (mentioning)
Confidence: 99%
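The selection rule quoted above (top 3 tags, each above a 0.3 confidence threshold) is straightforward to sketch; the function and variable names below are illustrative, not from the cited paper:

```python
import torch

def select_tags(logits, class_names, top_k=3, threshold=0.3):
    """Return up to top_k (tag, confidence) pairs above `threshold`."""
    probs = torch.sigmoid(logits)   # per-class tagging confidences
    conf, idx = probs.topk(top_k)   # top-k most confident classes
    return [(class_names[i], c) for i, c in
            zip(idx.tolist(), conf.tolist()) if c > threshold]
```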
“…Video-to-Text Prompting. During the prompting stage, we use BLIP-2 (Li et al., 2023a), InternVideo (Wang et al., 2022a), Whisper, ChatGPT (OpenAI, 2023), and an audio-tagging model from Schmid et al. (2022). We use the COCO-pretrained BLIP-2 model with nucleus sampling.…”
Section: A Experimental Details (mentioning)
Confidence: 99%
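As a side note on the nucleus sampling mentioned in this excerpt, here is a generic top-p sketch; `p=0.9` is an illustrative default, not a value reported in either paper:

```python
import torch

def nucleus_sample(logits, p=0.9):
    """Sample from the smallest token set whose cumulative mass exceeds p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    # Keep tokens while the mass *before* them is < p (first always kept).
    keep = (cum - sorted_probs) < p
    sorted_probs[~keep] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, 1)
    return sorted_idx[choice]
```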