L3DAS22 Challenge: Learning 3D Audio Sources in a Real Office Environment

Guizzo, Eric; Marinoni, C.; Pennese, Marco; Ren, Xinlei; Zheng, Xiguang; Zhang, Chen; Masiero, Bruno; Uncini, Aurelio; Comminiello, Danilo

doi:10.1109/icassp43922.2022.9746872

Cited by 31 publications

(6 citation statements)

References 26 publications

“…We demonstrate the improved performance, low data bias, and environment bias of the proposed model through various simulated datasets. Performance comparison with state-of-the-art models on three datasets (spatialized WSJCAM0 [8], spatialized DNS challenge [5], and L3DAS22 [52]) confirms that our proposed model has lower computational complexity and higher performance enhancement. To further investigate the real-world applicability and scalability of our model, we conduct experiments on real noisy and reverberant speech recorded in an office environment.…”

Section: Introduction (mentioning)

confidence: 56%

“…Since the original DNS challenge dataset contains single-channel data, we spatialized both speeches and noises using a similar procedure as in the spatialized WSJCAM0 dataset described in [5]. 3) L3DAS22 Challenge dataset: The last dataset used for evaluation is the L3DAS22 challenge [52] dataset proposed as part of the recent ICASSP 2022 challenges. This dataset includes speech recordings simulated in 3D office environments with varying speaker positions.…”

Section: A Datasets (mentioning)

confidence: 99%

DeFT-AN: Dense Frequency-Time Attentive Network for Multichannel Speech Enhancement

Lee; Choi. 2023. IEEE Signal Process. Lett.

In this work, we present DeFTAN-II, an efficient multichannel speech enhancement model based on transformer architecture and subgroup processing. Despite the success of transformers in speech enhancement, they face challenges in capturing local relations, reducing the high computational complexity, and lowering memory usage. To address these limitations, we introduce subgroup processing in our model, combining subgroups of locally emphasized features with other subgroups containing original features. The subgroup processing is implemented in several blocks of the proposed network. In the proposed split dense blocks extracting spatial features, a pair of subgroups is sequentially concatenated and processed by convolution layers to effectively reduce the computational complexity and memory usage. For the F- and T-transformers extracting temporal and spectral relations, we introduce cross-attention between subgroups to identify relationships between locally emphasized and non-emphasized features. The dual-path feedforward network then aggregates attended features in terms of the gating of local features processed by dilated convolutions. Through extensive comparisons with state-of-the-art multichannel speech enhancement models, we demonstrate that DeFTAN-II with subgroup processing outperforms existing methods at significantly lower computational complexity. Moreover, we evaluate the model's generalization capability on real-world data without fine-tuning, which further demonstrates its effectiveness in practical scenarios.

“…We believe this shortage of studies to be at least in part due to the lack of an architecture capable of incorporating the scene's metadata, which is addressed by our proposed DI-NN. We also refer to the recent L3DAS22 challenge [24], where practitioners were invited to develop 3D PSSL algorithms for a realistic office environment containing a pair of microphone arrays.…”

Section: Neural-based Methods (mentioning)

confidence: 99%

Graph Neural Networks for Sound Source Localization on Distributed Microphone Networks

Eric; Brookes; Naylor. 2023. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In many signal processing applications, metadata may be advantageously used in conjunction with a high-dimensional signal to produce a desired output. In the case of classical Sound Source Localization (SSL) algorithms, information from high-dimensional, multichannel audio signals received by many distributed microphones is combined with information describing acoustic properties of the scene, such as the microphones' coordinates in space, to estimate the position of a sound source. We introduce Dual Input Neural Networks (DI-NNs) as a simple and effective way to model these two data types in a neural network. We train and evaluate our proposed DI-NN on scenarios of varying difficulty and realism and compare it against an alternative architecture, a classical Least-Squares (LS) method, as well as a classical Convolutional Recurrent Neural Network (CRNN). Our results show that the DI-NN significantly outperforms the baselines, achieving a five times lower localization error than the LS method and two times lower than the CRNN in a test dataset of real recordings.

“…For the second edition of this project, L3DAS22 [6], we maintained a similar setting to that proposed in L3DAS21 but with some substantial improvements. Firstly, we generated a new dataset containing an augmented number of datapoints, increasing the total length of the dataset from 65 to more than 94 hours.…”

Section: Introduction (mentioning)

confidence: 99%

L3DAS23: Learning 3D Audio Sources for Audio-Visual Extended Reality

Gramaccioni; Marinoni; Chen; et al. 2024. IEEE Open J. Signal Process.

The primary goal of the L3DAS (Learning 3D Audio Sources) project is to stimulate and support collaborative research studies concerning machine learning techniques applied to 3D audio signal processing. To this end, the L3DAS23 Challenge, presented at IEEE ICASSP 2023, focuses on two spatial audio tasks of paramount interest for practical uses: 3D speech enhancement (3DSE) and 3D sound event localization and detection (3DSELD). Both tasks are evaluated within augmented reality applications. The aim of this paper is to describe the main results obtained from this challenge. We provide the L3DAS23 dataset, which comprises a collection of first-order Ambisonics recordings in reverberant simulated environments. Indeed, we maintain some general characteristics of the previous L3DAS challenges, featuring a pair of first-order Ambisonics microphones to capture the audio signals and involving multiple-source and multiple-perspective Ambisonics recordings. However, in this new edition, we introduce audio-visual scenarios by including images that depict the frontal view of the environments as captured from the perspective of the microphones. This addition aims to enrich the challenge experience, giving participants tools for exploring a combination of audio and images for solving the 3DSE and 3DSELD tasks. In addition to a brand-new dataset, we provide updated baseline models designed to take advantage of audio-image pairs. To ensure accessibility and reproducibility, we also supply supporting API for an effortless replication of our results. We support the dataset download and the use of the baseline models via extensive instructions provided on the official GitHub repository at https://github.com/l3das/L3DAS23. Lastly, we present the results achieved by the participants of the L3DAS23 Challenge. For more comprehensive information and in-depth details about the challenge, we invite the reader to visit the L3DAS Project website at http://www.l3das.com/icassp2023.
