This paper proposes using low-level spatial features extracted from multichannel audio for sound event detection. We extend the convolutional recurrent neural network to handle more than one type of these multichannel features by learning from each of them separately in the initial stages. We show that, instead of concatenating the features of each channel into a single feature vector, the network learns sound events in multichannel audio better when the features are presented as separate layers of a volume. Using the proposed spatial features instead of monaural features on the same network gives an absolute F-score improvement of 6.1% on the publicly available TUT-SED 2016 dataset and 2.7% on the TUT-SED 2009 dataset, which is fifteen times larger.
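The channel-stacking idea in this abstract can be illustrated with a toy NumPy sketch (the shapes and feature type are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Toy per-channel features for a stereo recording: T frames x F bands per channel.
T, F = 100, 40
left = np.random.randn(T, F)
right = np.random.randn(T, F)

# Option A: concatenate the channels into one long feature vector per frame -> (T, 2F).
concat = np.concatenate([left, right], axis=1)

# Option B (as the abstract describes): keep the channels as separate layers of a
# volume -> (channels, T, F), analogous to the RGB planes of an image, so that
# convolutional layers can learn inter-channel cues locally.
volume = np.stack([left, right], axis=0)

print(concat.shape)  # (100, 80)
print(volume.shape)  # (2, 100, 40)
```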
The use of distributed microphone arrays in many speech processing applications, such as beamforming and speaker localization, relies on precise knowledge of the microphone locations. Several self-localization approaches have been presented in the literature, but a simple, accurate, and robust method for asynchronous devices is still lacking. This work presents an analytical solution for estimating the positions and rotations of asynchronous, loudspeaker-equipped microphone arrays or devices. The method is based on emitting and receiving calibration signals from each device and extracting the time of arrival (TOA) values. Utilizing knowledge of the array geometry in the TOA estimation is proposed to improve the accuracy of the translation estimates. Measurements using four devices on a table surface demonstrate a mean translation error of 11 mm with a standard deviation of 6 mm, and a mean z-axis rotation error of 0.11 rad with a standard deviation of 0.14 rad, compared against computer vision annotations over 200 rotation and translation estimates.
Speech separation algorithms face the difficult task of producing a high degree of separation without introducing unwanted artifacts. The time-frequency (T-F) masking technique applies a real-valued (or binary) mask on top of the signal's spectrum to filter out unwanted components. The practical difficulty lies in the mask estimation. Often, using efficient masks engineered for separation performance leads to the presence of musical noise artifacts in the separated signal, which lowers the perceptual quality and intelligibility of the output. Microphone arrays have long been studied for the processing of distant speech. This work uses a feed-forward neural network for mapping a microphone array's spatial features into a T-F mask. A Wiener filter is used as the desired mask for training the neural network on speech examples in a simulated setting. The T-F masks predicted by the neural network are combined to obtain an enhanced separation mask that exploits the information regarding interference between all sources. The final mask is applied to the delay-and-sum beamformer (DSB) output. The algorithm's objective separation capability, in conjunction with the intelligibility of the separated speech, is tested with speech recorded from distant talkers in two rooms at two distances. The results show an improvement in an instrumental intelligibility measure and in frequency-weighted SNR over a complex-valued non-negative matrix factorization (CNMF) source separation approach, spatial sound source separation, and conventional beamforming methods such as the DSB and minimum variance distortionless response (MVDR).
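The T-F masking step and the Wiener training target can be sketched minimally with NumPy (toy power spectra instead of a real STFT; all shapes and names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy power spectra of the target speech and the interference: T frames x F bins.
T, F = 50, 129
speech_psd = rng.random((T, F)) + 0.1
noise_psd = rng.random((T, F)) + 0.1

# Oracle Wiener mask: ratio of target power to total power, real-valued in [0, 1].
# A mask of this kind serves as the regression target when training a network.
wiener_mask = speech_psd / (speech_psd + noise_psd)

# Applying the real-valued mask to a mixture spectrogram attenuates the T-F
# cells dominated by interference; the mixture phase is kept unchanged.
mixture_stft = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))
separated_stft = wiener_mask * mixture_stft

print(float(wiener_mask.min()), float(wiener_mask.max()))  # both within [0, 1]
```

In the paper's pipeline the mask is predicted from spatial features by a feed-forward network and applied to the beamformer output rather than computed from oracle spectra as here.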