Mice emit ultrasonic vocalizations (USVs) to transmit socially relevant information. To detect and classify these USVs, here we describe the development of VocalMat. VocalMat is software that uses image-processing and differential geometry approaches to detect USVs in audio files, eliminating the need for user-defined parameter tuning. VocalMat also uses computational vision and machine learning methods to classify USVs into distinct categories. In a dataset of >4,000 USVs emitted by mice, VocalMat detected more than 98% of the USVs and accurately classified ≈86% of them when considering the most likely label out of 11 different USV types. We then used Diffusion Maps and Manifold Alignment to analyze the probability distribution of USV classification among different experimental groups, providing a robust method to quantify and qualify the vocal repertoire of mice. Thus, VocalMat allows accurate and highly quantitative analysis of USVs, opening the opportunity for detailed, high-throughput analysis of this behavior.

* Present address: Interdepartmental Neuroscience Program, Yale School of Medicine, New Haven, CT, United States of America

Results
Detection of mouse USVs using image processing

VocalMat uses multiple steps to detect and analyze USVs from audio recordings of vocalizing mice (see Figure 1A for the general workflow). Initially, the audio recordings are converted into high-resolution spectrograms through a short-time Fourier transform (see Methods and Materials). The resulting spectrogram is a matrix in which each element corresponds to an intensity value (power spectrum, represented in decibels) for each time-frequency component. The spectrogram is then analyzed in its time-frequency plane, where high-intensity values are represented by brighter pixels in a gray-scale image (Figure 1B). The gray-scale image undergoes contrast enhancement and adaptive thresholding for binarization (see Methods and Materials). The segmented objects are further refined via morphological operations (Figure 1C and Figure S1), resulting in a list of segmented blobs (hereafter referred to as USV candidates) with their corresponding spectral features (Figure 1D).

This list of USV candidates may contain noise (i.e., detected particles that are not part of any USV) and multiple candidates that correspond to the same USV. To address this, a minimum interval of 10 ms between two successive, distinct syllables is assumed, based on experimental observations [9]. To reduce the amount of data stored for each USV, the features extracted from detected candidates are represented by a mean frequency and intensity every 0.5 ms. These means are calculated for all individual candidates, including those overlapping in time, thereby preserving relevant features such as duration, frequency, intensity, and harmonic components (Figure 1D).

Harmonic components are also referred to as nonlinear or composite components [31, 30]. Here, we did not consider harmonic components as separate syllables, but rather as an extra fea...
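The first step, converting audio into a decibel-scaled spectrogram via a short-time Fourier transform, can be sketched in Python as follows. This is an illustrative sketch, not VocalMat's implementation: the window length, overlap, and FFT size below are arbitrary example values, and the 250 kHz sampling rate and 70 kHz test tone are hypothetical numbers chosen only to resemble a typical USV recording.

```python
import numpy as np
from scipy.signal import spectrogram

def audio_to_spectrogram(audio, fs, nfft=1024, overlap=0.5):
    """Short-time Fourier transform -> power spectrogram in dB.

    Window length, overlap, and FFT size are illustrative defaults,
    not VocalMat's actual parameters.
    """
    noverlap = int(nfft * overlap)
    f, t, Sxx = spectrogram(audio, fs=fs, nperseg=nfft,
                            noverlap=noverlap, nfft=nfft)
    # Convert power to decibels; a small epsilon avoids log(0)
    Sxx_db = 10 * np.log10(Sxx + 1e-12)
    return f, t, Sxx_db

# Synthetic 70 kHz tone sampled at 250 kHz (hypothetical values)
fs = 250_000
tt = np.arange(0, 0.05, 1 / fs)
audio = np.sin(2 * np.pi * 70_000 * tt)
f, t, S = audio_to_spectrogram(audio, fs)
peak_freq = f[np.argmax(S.mean(axis=1))]  # frequency bin with most energy
```

Treating the resulting `Sxx_db` matrix as a gray-scale image (brighter pixels = higher intensity) is what allows the subsequent steps to borrow standard image-processing machinery.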
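The segmentation stage (adaptive thresholding followed by morphological cleanup and connected-component extraction) can be approximated with `scipy.ndimage`. Again, this is only a sketch under stated assumptions: the neighborhood size, dB offset, and minimum blob size are invented example parameters, and a local-mean threshold plus binary opening stands in for the exact operations the paper describes in its Methods.

```python
import numpy as np
from scipy import ndimage

def segment_usv_candidates(spec_db, block=31, offset=3.0, min_pixels=20):
    """Adaptive threshold + morphological cleanup -> labeled blobs.

    `block`, `offset`, and `min_pixels` are illustrative values,
    not VocalMat's published parameters.
    """
    # Adaptive threshold: a pixel is foreground if it exceeds the
    # local mean intensity by `offset` dB
    local_mean = ndimage.uniform_filter(spec_db, size=block)
    binary = spec_db > (local_mean + offset)
    # Morphological opening removes isolated noise pixels
    binary = ndimage.binary_opening(binary, structure=np.ones((3, 3)))
    # Label connected components and drop tiny blobs
    labels, n = ndimage.label(binary)
    sizes = ndimage.sum(binary, labels, index=np.arange(1, n + 1))
    keep = np.flatnonzero(sizes >= min_pixels) + 1
    labels, n = ndimage.label(np.isin(labels, keep))
    return labels, n
```

Each surviving labeled blob corresponds to one USV candidate whose spectral features (time extent, frequency extent, intensity) can then be measured.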
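The post-processing described above, merging candidates separated by less than 10 ms into a single syllable and summarizing each USV as mean frequency and intensity per 0.5 ms, can be sketched as below. The 10 ms gap and 0.5 ms bin follow the text; the helper functions themselves are hypothetical and not part of VocalMat's API.

```python
import numpy as np

def merge_candidates(intervals, min_gap=0.010):
    """Merge (start, end) time intervals in seconds that are closer
    than `min_gap`; candidates separated by <10 ms are treated as
    parts of the same syllable (sketch, not VocalMat's code)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < min_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def summarize_usv(times, freqs, intensities, bin_ms=0.5):
    """Mean frequency and intensity per 0.5 ms bin (hypothetical helper).

    Returns a list of (bin_start_s, mean_freq, mean_intensity)."""
    bins = np.floor(np.asarray(times) * 1000 / bin_ms).astype(int)
    out = []
    for b in np.unique(bins):
        sel = bins == b
        out.append((b * bin_ms / 1000,
                    float(np.mean(np.asarray(freqs)[sel])),
                    float(np.mean(np.asarray(intensities)[sel]))))
    return out
```

Because the 0.5 ms means are computed for every candidate, including candidates that overlap in time, harmonic components remain attached to the syllable they belong to rather than being counted as separate USVs.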