2020
DOI: 10.1109/taslp.2019.2957889

Weakly Supervised Representation Learning for Audio-Visual Scene Analysis

Abstract: Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characteristic audio-visual elements. The system is trained using only video-level event labels without any timing information. An important feature of our method is its capa…
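As a rough, hypothetical sketch of the multiple-instance-learning setup the abstract describes (not the authors' implementation; the feature sizes and the max-pooling aggregator are assumptions), each video contributes a bag of instance-level features, and only the video-level label supervises the pooled prediction:

```python
import torch
import torch.nn as nn

class MILPooling(nn.Module):
    """Pool instance-level event scores into a video-level prediction.

    Hypothetical sketch of weakly supervised MIL training: only the
    video-level label is observed, never the timing of the event.
    """
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, num_classes)

    def forward(self, instances: torch.Tensor) -> torch.Tensor:
        # instances: (batch, num_instances, feat_dim), e.g. per-segment
        # audio-visual features extracted from a video.
        scores = torch.sigmoid(self.scorer(instances))  # per-instance scores
        return scores.max(dim=1).values                 # bag-level score

# Video-level labels supervise the pooled prediction; the per-instance
# scores are what make weakly supervised localization possible.
model = MILPooling(feat_dim=128, num_classes=10)
bag = torch.randn(4, 20, 128)                 # 4 videos, 20 instances each
labels = torch.randint(0, 2, (4, 10)).float() # video-level event labels
loss = nn.BCELoss()(model(bag), labels)
```

Max pooling is only one common MIL aggregator; attention-based pooling (sketched further below) is a frequent alternative.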

Cited by 27 publications (26 citation statements). References 60 publications.
“…Further post-challenge research was carried out by other authors. For example, in [44] the authors used an audio-visual approach that matches co-occurrences of images and sounds to locate the target sound from the weak labels. The authors in [45] used a multi-level attention model to focus on the target sound indicated by the weak labels.…”
Section: System
confidence: 99%
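A single-level variant of the attention pooling credited to [45] above might look like the following sketch (hypothetical layer sizes; the actual model in [45] stacks multiple attention levels). Instead of max-pooling, a learned per-segment weight aggregates frame scores toward the weakly labeled target sound:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Single-level attention pooling over temporal segments: a simplified
    sketch of the multi-level attention idea (layer sizes are made up)."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.cla = nn.Linear(feat_dim, num_classes)  # per-segment class scores
        self.att = nn.Linear(feat_dim, num_classes)  # per-segment attention logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) frame-level embeddings
        scores = torch.sigmoid(self.cla(x))          # (batch, time, classes)
        weights = torch.softmax(self.att(x), dim=1)  # normalized over time
        return (weights * scores).sum(dim=1)         # clip-level prediction
```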
“…• A (TSP): temporal segment proposals,
• A (NCP): NMF component proposals,
• A (TSP, NCP): all TSPs and NCPs are put together into the same bag and fed to the audio network.
While systems using only TSP give state-of-the-art results [1], they serve as a strong baseline for establishing the usefulness of NCPs in classification. For source enhancement we compare with the following NMF-related methods:…”
Section: Setup
confidence: 99%
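The NMF component proposals (NCPs) named in the statement above can be illustrated roughly as follows (a sketch assuming a magnitude spectrogram as input; the paper's actual proposal pipeline may differ): NMF factorizes the spectrogram into spectral templates and temporal activations, and each rank-1 reconstruction becomes one instance in the MIL bag alongside the temporal segment proposals.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical input: a magnitude spectrogram (freq_bins x time_frames).
rng = np.random.default_rng(0)
spectrogram = np.abs(rng.standard_normal((513, 200)))

# Factorize into K nonnegative components: S ~ W @ H.
K = 8
nmf = NMF(n_components=K, init="nndsvda", max_iter=300, random_state=0)
W = nmf.fit_transform(spectrogram)  # (513, K) spectral templates
H = nmf.components_                 # (K, 200) temporal activations

# Each rank-1 reconstruction is one NMF component proposal (NCP),
# i.e. one candidate instance for the audio network's bag.
ncps = [np.outer(W[:, k], H[k]) for k in range(K)]
```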
“…Unfortunately, due to availability issues, videos could not be obtained for all of the audio files in the original data set. Ultimately, we ended up creating a slightly reduced collection of audio and video files consisting of 48883 training, 465 validation and 1050 test samples; about 5% of the samples in the original data set were discarded in the process [14].…”
Section: Data Set
confidence: 99%
“…The micro-averaged F1 score achieved by this model is reported in Table 3 (F1 score of the prior audiovisual classification model: two-stream audiovisual neural network [14], 64.2%). The multimodal audiovisual transformers strongly improve upon the two-stream audiovisual neural network. As both of these types of models use nearly the same data and employ similar externally pretrained embeddings, this truly illustrates the power of the transformer architecture for the task at hand.…”
Section: 14
confidence: 99%
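For reference, the micro-averaged F1 used in this comparison pools true positives, false positives and false negatives across all classes before computing a single F1. A minimal illustration with scikit-learn (the labels below are made up, not the evaluation data):

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative multi-label predictions (4 clips, 3 event classes).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

# Micro-averaging counts TP/FP/FN over all classes jointly,
# then computes one F1 from the pooled counts.
print(f1_score(y_true, y_pred, average="micro"))
```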