2023
DOI: 10.1109/access.2023.3294476

DeepLabV3+ Vision Transformer for Visual Bird Sound Denoising

Abstract: Audio denoising is a task to improve the perceptual quality of noisy audio signals. There is still residual noise after the denoising of noisy signals, which will affect the quality of audio data. Traditional and deep learning-based methods are still limited to the manual addition of artificial noise or low-frequency noise. Recently, audio denoising has been transformed into an image segmentation problem, and deep neural networks have been applied to solve this problem. However, its performance is limited to s…

Cited by 4 publications (1 citation statement)
References 60 publications
“…Among them, the encoder module is the cornerstone of the whole structure, which is responsible for extracting deep semantic features from images. The main body of the encoder is a Deep Convolutional Neural Network (DCNN) [36, 37], which can employ a variety of backbone networks such as ResNet [38], Xception [39], or MobileNetV2 [40], which provide the basic feature extraction capabilities for the model. In order to capture a wider range of contextual information without losing spatial resolution, the encoder uses inflated convolution instead of traditional convolution.…”
Section: Methods
Citation type: mentioning
confidence: 99%
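
The "inflated convolution" in the statement above most likely refers to the dilated (atrous) convolution used in the DeepLab family. The idea it describes, a wider context window with no loss of resolution, can be illustrated with a minimal sketch. This is a pure-Python, one-dimensional toy and not code from the cited paper; the function names `dilated_conv1d` and `receptive_field` are hypothetical, and zero padding with an odd kernel length is an assumption made to keep the example short.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Zero-padded 1D dilated convolution (cross-correlation form).

    Assumption for this sketch: the kernel length is odd, so the padding
    is symmetric and the output has the same length as the input.
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel length")
    span = dilation * (k - 1)   # distance between the first and last tap
    pad = span // 2             # symmetric zero padding on each side
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    # Each output sample reads k inputs spaced `dilation` steps apart.
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


def receptive_field(kernel_size, dilation):
    """Number of input positions covered by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 1, 1]

standard = dilated_conv1d(signal, kernel, dilation=1)
dilated = dilated_conv1d(signal, kernel, dilation=2)

# Same number of weights, same output length, wider context per output.
print(len(standard), len(dilated))
print(receptive_field(3, 1), receptive_field(3, 2), receptive_field(3, 4))
```

With three weights in both cases, raising the dilation from 1 to 2 widens the window each output sees from 3 to 5 input positions while the output keeps the input's length. A strided or pooled layer would gain context only by shrinking the output.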