2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2014.6853861
An interactive audio source separation framework based on non-negative matrix factorization

Abstract: Though audio source separation offers a wide range of applications in audio enhancement and post-production, its performance has yet to reach a satisfactory level, especially for single-channel mixtures with limited training data. In this paper we present a novel interactive source separation framework that allows end-users to provide feedback at each separation step so as to gradually improve the result. For this purpose, a prototype graphical user interface (GUI) is developed to help users annotate time-frequen…
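As a rough illustration of the kind of NMF pipeline such a framework builds on, the sketch below factors a single-channel magnitude spectrogram into two sets of components and rebuilds each source with a Wiener-like soft mask. This is a minimal sketch under stated assumptions, not the paper's implementation: the synthetic input, component counts, and Euclidean-cost multiplicative updates are all illustrative choices.

```python
# Minimal NMF source-separation sketch (assumptions: synthetic input,
# 8 components per source, Euclidean-cost multiplicative updates).
import numpy as np

rng = np.random.default_rng(0)
V = np.abs(rng.standard_normal((257, 400)))  # stand-in magnitude spectrogram

K1, K2 = 8, 8                    # components assumed per source
K = K1 + K2
W = rng.random((V.shape[0], K)) + 1e-9   # spectral templates
H = rng.random((K, V.shape[1])) + 1e-9   # temporal activations

eps = 1e-12
for _ in range(100):             # multiplicative updates, Euclidean cost
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

V1 = W[:, :K1] @ H[:K1, :]       # source-1 model
V2 = W[:, K1:] @ H[K1:, :]       # source-2 model
mask1 = V1 / (V1 + V2 + eps)     # Wiener-like soft mask
S1_mag = mask1 * V               # masked magnitude of source 1
S2_mag = (1.0 - mask1) * V       # masked magnitude of source 2
```

In an interactive setting, the user feedback would drive further iterations of this loop; the masked magnitudes would then be combined with the mixture phase to resynthesize each source.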

Cited by 20 publications (18 citation statements)
References 13 publications
“…Such information can be, e.g., user-"hummed" sounds that mimic the sources in the mixture [13], or source activity annotations along time [14] or in the time-frequency plane [15]; the annotation information is then used, instead of training data, to guide the separation process. Furthermore, recent publications describe an interactive strategy [16], [17] in which the user annotates the spectrograms of intermediate separation results to gradually correct the remaining errors. Note, however, that most existing approaches need prior information that may not be easy to acquire in advance (e.g., a musical score or text transcript), is difficult to produce (e.g., user-hummed examples), or simply requires very experienced users while being very time consuming (e.g., time-frequency annotations).…”
Section: Introduction
confidence: 99%
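One simple way such annotations can stand in for training data is sketched below: frames the user marks as "source 1 silent" clamp the corresponding NMF activations to zero, so those components cannot explain energy in the annotated regions. This is a hedged sketch, not any cited paper's exact algorithm; the function name and the boolean annotation format are illustrative assumptions.

```python
# Annotation-guided NMF sketch: user-marked silent frames for source 1
# force its activations to zero there (assumed interface, not from [14]-[17]).
import numpy as np

def fit_annotated_nmf(V, K1, K2, src1_silent, n_iter=100, seed=0, eps=1e-12):
    """V: magnitude spectrogram; src1_silent: boolean vector over frames."""
    rng = np.random.default_rng(seed)
    K = K1 + K2
    W = rng.random((V.shape[0], K)) + eps
    H = rng.random((K, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        H[:K1, src1_silent] = 0.0   # enforce the user's time annotation
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Example: the user marks the first 50 of 200 frames as silent for source 1.
V = np.abs(np.random.default_rng(1).standard_normal((257, 200)))
silent = np.zeros(200, dtype=bool)
silent[:50] = True
W, H = fit_annotated_nmf(V, K1=8, K2=8, src1_silent=silent)
```

Because multiplicative updates cannot revive an entry once it is zero, the clamped activations stay at zero across iterations, so the constraint holds throughout the optimization.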
“…In summary, this strategy consists in first locally updating all the entries of one matrix using the corresponding update among (12), (13) and (15), and then choosing, in each column, the one entry yielding the highest likelihood while setting all the other entries to zero (see [20] for more details). This strategy locally optimizes the cost (11), in the sense that the cost is guaranteed to remain non-increasing after each update.…”
Section: Updates With Structural Constraints
confidence: 99%
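The excerpt does not reproduce equations (11)-(15), so the sketch below substitutes a plain Euclidean reconstruction cost for the paper's likelihood; it only illustrates the selection step: after the local updates, each column of the activation matrix keeps its single best-scoring entry and zeroes the rest.

```python
# One-active-entry-per-column selection (Euclidean cost as a stand-in for
# the likelihood of (11); function name and scoring rule are assumptions).
import numpy as np

def keep_best_entry_per_column(V, W, H):
    """Zero all but one entry per column of H, keeping the entry whose
    single-component reconstruction ||v_t - w_k * h_kt||^2 is cheapest."""
    K, T = H.shape
    H_out = np.zeros_like(H)
    for t in range(T):
        costs = [np.sum((V[:, t] - W[:, k] * H[k, t]) ** 2) for k in range(K)]
        k_best = int(np.argmin(costs))
        H_out[k_best, t] = H[k_best, t]
    return H_out
```

Since every candidate entry is scored after its own locally optimal update, keeping the best-scoring one per column cannot increase the cost relative to the previous sparse solution, which matches the non-increase property the excerpt describes.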
“…It consists in using auxiliary information about the sources and/or the mixing process to guide the separation. For example, score-informed approaches rely on a musical score to guide the separation of music recordings [3][4][5][6], separation-by-humming (SbH) algorithms exploit a sound "hummed" by the user to mimic the source of interest [7,8], and user-guided approaches take into account knowledge such as a user-selected F0 track [9] or user-annotated source activity patterns along the spectrogram of the mixture [10,11] and/or that of the estimated sources [12,13]. In line with this direction, there are also speech separation systems informed, e.g., by speaker gender [14], by corresponding video [15], or by natural language structure [16].…”
Section: Introduction
confidence: 99%
“…It is well adapted to scenarios where the original sources are not available but high separation quality is nevertheless required. The additional information can be of different types: spatial and spectral information about the sources [5], [6], language structure [7], visual information [8], information about the recording/mixing conditions [9], musical scores [10]-[13], or user input [14]-[21]. For instance, the user can provide relevant information by drawing the fundamental frequency curve [18], by uttering the same sentence [16], by humming the melody [14], or even by selecting specific areas in the spectrogram of the mixture [17].…”
Section: Introduction
confidence: 99%