2021
DOI: 10.1016/j.neunet.2020.10.003

DMMAN: A two-stage audio–visual fusion framework for sound separation and event localization

Cited by 14 publications (3 citation statements)
References 26 publications
“…Such an addition could introduce new multi-modal possibilities for improvements in detection, localisation and classification. This is similar to the DMMAN network described by Hu et al. [66], which would not only improve the performance of ORCA-SPY, but would also help with target differentiation for context-dependent analysis with towed and stationary observation. ORCA-SPY generalizes in a way that allows researchers to simulate and verify various array geometries and setups under assumed realistic real-world noise conditions, which is important not just in the field, but also in preparation for any fieldwork studies.…”
Section: Discussion
confidence: 70%
“…Eq. (5) shows that the number of units in the convolution layer is defined as half the size of the full connection for each layer. Through several levels of the cascade architecture, the fusion feature a_t^f finally passes through a convolution layer as the output layer to calculate the predicted density map D_pred.…”
Section: Multi-modal Fusion Module
confidence: 99%
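The cascade described in this statement can be sketched roughly as follows. This is a minimal, hypothetical PyTorch rendition assuming 2-D feature maps and a halving of the channel count at each cascade level; the class name CascadeFusionHead, the layer sizes, and the variable names are illustrative assumptions, not the citing paper's actual implementation.

# Hypothetical sketch of the cascade fusion head quoted above: each level halves
# the number of units (channels), and a final convolution acts as the output
# layer that maps the fused audio-visual feature a_t^f to the predicted density
# map D_pred. Sizes and names are assumptions for illustration only.
import torch
import torch.nn as nn

class CascadeFusionHead(nn.Module):
    def __init__(self, in_channels: int = 256, levels: int = 3):
        super().__init__()
        layers = []
        ch = in_channels
        for _ in range(levels):
            # Each cascade level halves the number of channels.
            layers += [nn.Conv2d(ch, ch // 2, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            ch //= 2
        self.cascade = nn.Sequential(*layers)
        # Output convolution producing the single-channel density map D_pred.
        self.out_conv = nn.Conv2d(ch, 1, kernel_size=1)

    def forward(self, fused_feature: torch.Tensor) -> torch.Tensor:
        # fused_feature: the fusion feature a_t^f, shape (B, C, H, W).
        x = self.cascade(fused_feature)
        return self.out_conv(x)  # D_pred, shape (B, 1, H, W)

# Example usage with a dummy fused feature map.
if __name__ == "__main__":
    a_f_t = torch.randn(2, 256, 32, 32)
    d_pred = CascadeFusionHead()(a_f_t)
    print(d_pred.shape)  # torch.Size([2, 1, 32, 32])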
“…Crowd counting is a computer-vision task used in various fields such as intelligent transportation [1], industrial manufacturing [2] and security systems [3]. Unlike other computer-vision tasks such as image classification [4] and scene understanding [5], crowd-counting models built on convolutional neural networks (CNNs) must recognize arbitrarily sized people in varied situations, including scenes with extreme conditions such as high-level noise, low-level illumination and high-level occlusion. Consequently, the performance of a vision-driven model can easily break down, and such a model may not be well suited to the crowd-counting problem under extreme conditions.…”
Section: Introduction
confidence: 99%