librosa/librosa: 0.8.1rc2

McFee, Brian; Metsai, Alexandros; McVicar, Matt; Balke, Stefan; Thomé, Carl; Raffel, Colin; Zalkow, Frank; Ayoub, Malek; Dana; Lee, Kyungyun; Nieto, Oriol; Ellis, Dan; Mason, Jack; Battenberg, Eric; Seyfarth, Scott; Yamamoto, Ryōichi; viktorandreevichmorozov; Choi, Keunwoo; Moore, Josh; Bittner, Rachel M.; Hidaka, Shunsuke; Wei, Ziyao; nullmightybofo; Hereñú, Darío; Stöter, Fabian-Robert; Friesch, Pius; Weiss, Adam; Vollrath, Matt; Kim, Tae-Woon; Thassilo

doi:10.5281/zenodo.4792298

Cited by 18 publications

(14 citation statements)

References 0 publications


“…The lossy communication channel has Gaussian white noise with a signal-to-noise ratio (SNR) of 30 dB, unless otherwise stated. During training, the channel applies Gaussian-sampled time stretch and pitch shift using Librosa (McFee et al, 2021), with variance 0.4 and 0.3, respectively. The channel also masks up to 15% of the mel-spectrogram time-axis during training.…”

Section: Methods (mentioning)

confidence: 99%

Towards Learning to Speak and Hear Through Multi-Agent Communication over a Continuous Acoustic Channel

Eloff, Pretorius, Räsänen et al. 2021

Preprint


While multi-agent reinforcement learning has been used as an effective means to study emergent communication between agents, existing work has focused almost exclusively on communication with discrete symbols. Human communication often takes place (and emerged) over a continuous acoustic channel; human infants acquire language in large part through continuous signalling with their caregivers. We therefore ask: Are we able to observe emergent language between agents with a continuous communication channel trained through reinforcement learning? And if so, what is the impact of channel characteristics on the emerging language? We propose an environment and training methodology to serve as a means to carry out an initial exploration of these questions. We use a simple messaging environment where a "speaker" agent needs to convey a concept to a "listener". The Speaker is equipped with a vocoder that maps symbols to a continuous waveform, this is passed over a lossy continuous channel, and the Listener needs to map the continuous signal to the concept. Using deep Q-learning, we show that basic compositionality emerges in the learned language representations. We find that noise is essential in the communication channel when conveying unseen concept combinations. And we show that we can ground the emergent communication by introducing a caregiver predisposed to "hearing" or "speaking" English. Finally, we describe how our platform serves as a starting point for future work that uses a combination of deep reinforcement learning and multi-agent systems to study our questions of continuous signalling in language learning and emergence.
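The citing statement for this paper describes a channel that adds Gaussian white noise at an SNR of 30 dB. As a minimal sketch of that single step, the noise standard deviation can be derived from the signal power and the target SNR. The 440 Hz test tone, the 16 kHz sampling rate, and the helper names `add_white_noise` and `measured_snr_db` are assumptions made for illustration and do not come from the paper. The time stretch and pitch shift are not reproduced here (librosa exposes such effects as `librosa.effects.time_stretch` and `librosa.effects.pitch_shift`, but the statement does not say which calls the authors used).

```python
import math
import random


def add_white_noise(signal, snr_db, rng):
    """Return signal plus Gaussian white noise at the requested SNR (in dB)."""
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for P_noise
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR (in dB) of a noisy signal given its clean version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


rng = random.Random(0)   # fixed seed so the sketch is repeatable
sr = 16000               # assumed sampling rate, not from the paper
clean = [math.sin(2.0 * math.pi * 440.0 * n / sr) for n in range(sr)]
noisy = add_white_noise(clean, snr_db=30.0, rng=rng)
print(round(measured_snr_db(clean, noisy), 1))
```

The measured value should land close to the 30 dB target, with a small deviation because the noise is a finite random sample.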


“…In doing so, we reduce the size of the data and training time without loss of essential information as the characteristic sounds of a cough are associated with frequencies below 10 kHz [26]. The anti-aliasing filter is a precomputed low-pass filter provided by librosa [27], a python package for music and audio analysis. The filter is designed using a Kaiser window with β = 8.56 and a roll-off frequency of 0.85 * f_nyquist, where f_nyquist = 11.025 kHz is the Nyquist frequency, i.e.…”

Section: B Data Pre-processing (mentioning)

confidence: 99%

“…Furthermore, we established a human baseline with which we compared our model's performance. We instructed each of our 19 human raters (13 male, 6 female, age 22–31) to perform the same verification tests as our model. More precisely, for each of the ten coughers in the test set, the rater was first allowed to listen to the same enrollment samples presented to our model as often as desired.…”

Section: Id (mentioning)

confidence: 99%

TripletCough: Cougher Identification and Verification From Contact-Free Smartphone-Based Audio Recordings Using Metric Learning

Jokic, Cleres, Rassouli et al. 2022

IEEE J. Biomed. Health Inform.


Cough, a symptom associated with many prevalent respiratory diseases, can serve as a potential biomarker for diagnosis and disease progression. Consequently, the development of cough monitoring systems and, in particular, automatic cough detection algorithms have been studied since the early 2000s. Recently, there has been an increased focus on the efficiency of such algorithms, as implementation on consumer-centric devices such as smartphones would provide a scalable and affordable solution for monitoring cough with contact-free sensors. Current algorithms, however, are incapable of discerning between coughs of different individuals and, thus, cannot function reliably in situations where potentially multiple individuals have to be monitored in shared environments. Therefore, we propose a weakly supervised metric learning approach for cougher recognition based on smartphone audio recordings of coughs. Our approach involves a triplet network architecture, which employs convolutional neural networks (CNNs). The CNNs of the triplet network learn an embedding function, which maps Mel spectrograms of cough recordings to an embedding space where they are more easily distinguishable. Using audio recordings of nocturnal coughs from asthmatic patients captured with a smartphone, our approach achieved a mean accuracy of 88% (± 10% SD) on two-way identification tests with 12 enrollment samples and accuracy of 80% and an equal error rate (EER) of 20% on verification tests. Furthermore, our approach outperformed human raters with re-… The first and last author contributed equally to this work. Asterisk indicates corresponding author.
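The pre-processing statement for this paper mentions a low-pass anti-aliasing filter built from a Kaiser window with β = 8.56 and a roll-off at 0.85 of an 11.025 kHz Nyquist frequency. The sketch below illustrates that kind of design as a plain Kaiser-windowed sinc filter. It is not the precomputed filter shipped with librosa. The 44.1 kHz input rate, the 101-tap length, and all function names are assumptions chosen for illustration.

```python
import math


def bessel_i0(x, terms=40):
    """Zeroth-order modified Bessel function of the first kind (power series)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total


def kaiser_window(n_taps, beta):
    """Kaiser window of length n_taps with shape parameter beta."""
    mid = (n_taps - 1) / 2.0
    norm = bessel_i0(beta)
    window = []
    for n in range(n_taps):
        r = (n - mid) / mid
        window.append(bessel_i0(beta * math.sqrt(max(0.0, 1.0 - r * r))) / norm)
    return window


def lowpass_taps(n_taps, cutoff, beta):
    """Windowed-sinc low-pass taps; cutoff is a fraction of the input Nyquist."""
    mid = (n_taps - 1) / 2.0
    window = kaiser_window(n_taps, beta)
    taps = []
    for n in range(n_taps):
        t = n - mid
        x = math.pi * cutoff * t
        ideal = cutoff if t == 0 else cutoff * math.sin(x) / x
        taps.append(ideal * window[n])
    gain = sum(taps)  # normalise to unity gain at 0 Hz
    return [h / gain for h in taps]


def magnitude_at(taps, freq_hz, sr):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2.0 * math.pi * freq_hz / sr
    re = sum(h * math.cos(w * n) for n, h in enumerate(taps))
    im = sum(h * math.sin(w * n) for n, h in enumerate(taps))
    return math.hypot(re, im)


SR_IN = 44100                    # assumed input rate, not stated in the excerpt
F_NYQUIST = 11025.0              # Nyquist frequency quoted in the excerpt
ROLLOFF_HZ = 0.85 * F_NYQUIST    # about 9371 Hz
BETA = 8.56                      # Kaiser beta quoted in the excerpt
N_TAPS = 101                     # assumed filter length

taps = lowpass_taps(N_TAPS, ROLLOFF_HZ / (SR_IN / 2.0), BETA)
print(magnitude_at(taps, 5000.0, SR_IN), magnitude_at(taps, 15000.0, SR_IN))
```

Under these assumptions the filter passes content around 5 kHz almost unchanged and strongly attenuates content around 15 kHz, which is the behaviour an anti-aliasing stage needs before the sampling rate is halved.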


“…Following our former work in [7], we process music as barwise spectrograms, with a fixed number of frames per bar. Practically, spectrograms are computed using librosa [16] with a low hop length of 32 frames at a sampling rate of 44.1kHz, and downbeats are estimated with the madmom toolbox [14]. This allows us to split the original spectrogram in b barwise spectrograms (b being the number of bars in this song) each containing n b frames.…”

Section: Barwise Music Processing (mentioning)

confidence: 99%

Barwise Compression Schemes for Audio-Based Music Structure Analysis

Marmoret, Cohen, Bimbot 2022

Preprint


Music Structure Analysis (MSA) consists in segmenting a music piece in several distinct sections. We approach MSA within a compression framework, under the hypothesis that the structure is more easily revealed by a simplified representation of the original content of the song. More specifically, under the hypothesis that MSA is correlated with similarities occurring at the bar scale, linear and non-linear compression schemes can be applied to barwise audio signals. Compressed representations capture the most salient components of the different bars in the song and are then used to infer the song structure using a dynamic programming algorithm. This work explores both low-rank approximation models such as Principal Component Analysis or Nonnegative Matrix Factorization and "piece-specific" Auto-Encoding Neural Networks, with the objective to learn latent representations specific to a given song. Such approaches do not rely on supervision nor annotations, which are well-known to be tedious to collect and possibly ambiguous in MSA description. In our experiments, several unsupervised compression schemes achieve a level of performance comparable to that of state-of-the-art supervised methods (for 3s tolerance) on the RWC-Pop dataset, showcasing the importance of the barwise compression processing for MSA.
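The citing statement for this paper describes splitting a spectrogram, computed with a hop length of 32 at 44.1 kHz, into one spectrogram per bar using estimated downbeats. A rough sketch of that splitting step follows. The downbeat times, the stand-in frame list, and the helper names `time_to_frame` and `split_barwise` are hypothetical, and the conversion is written by hand instead of calling librosa or madmom. The paper's step that gives every bar the same number of frames is not shown.

```python
SR = 44100   # sampling rate quoted in the excerpt
HOP = 32     # hop length quoted in the excerpt


def time_to_frame(t, sr=SR, hop=HOP):
    """Map a time in seconds to the index of the frame that contains it."""
    return int(t * sr) // hop


def split_barwise(frames, downbeats, sr=SR, hop=HOP):
    """Cut a time-major list of frames at consecutive downbeat times."""
    bounds = [time_to_frame(t, sr, hop) for t in downbeats]
    return [frames[a:b] for a, b in zip(bounds[:-1], bounds[1:])]


# Stand-in for a spectrogram: one integer per frame instead of a spectrum.
frames = list(range(6000))
downbeats = [0.0, 2.0, 4.0]   # hypothetical downbeat times in seconds

bars = split_barwise(frames, downbeats)
print([len(bar) for bar in bars])
```

With three downbeats the sketch yields two bars. At this hop length one second spans roughly 1378 frames, so bars of equal duration can still differ by a frame after rounding.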
