Speech and music separation approaches - a survey

Mirbeygi, Mohaddeseh; Mahabadi, Aminollah; Ranjbar, Akbar

doi:10.1007/s11042-022-11994-1

Cited by 5 publications

(5 citation statements)

References 77 publications


“…The possible speed and correct computing of big data, careful control of data propagation, and proper monitoring of information diffusion in various large complex networks for modern text, visible image, and acoustic data types naturally increase the time complexity and memory usage for big text data [78], big image data [6], big acoustic data [69], and the possible combination of exclusive video and audio for multimedia applications in uncertain and high-risk environments [14], different topology streaming [15], and various channel utilization [16] for automatic decision-making. This potential problem routinely requires modern definitions and economic modeling for the subsequent definition of big image data generation in a reliable form suitable for various online interactions of intelligent multimedia applications, from camera imaging to knowledge extraction of sequential images [17,5].…”

Section: Big Data Oceans (mentioning)

confidence: 99%

“…Big image data streams have become ubiquitous because a considerable number of online multimedia applications naturally generate massive amounts of various types of data at an incredible velocity in 2D and 3D forms. Multimedia applications combine different data types in text, speech, sound, music, image, and video formats [5]. This work has to be directly managed by new devices in big data streams because of the built-in dynamic characteristics of different types of data, with an incredible speed of presented mining tools, applied technologies, designed methods, heterogeneous hardware, and hybrid techniques from starting data construction to ending useful information production for reasonable speed of knowledge extraction in decision-making at various data stream network levels [6,7].…”

Section: Introduction (mentioning)

confidence: 99%


Big Image Data 15Vs Model for Intelligent Data Ocean of Multimedia Things

Mahabadi, Abdolazimi. 2024. Preprint. Self-citation.

The classification of large image data using traditional methods has been based on changes in image gray features, extraction of edge and contour feature information, or conversion between image coordinate sets. However, due to the growth of new image size and big image data (BID) in real-time multimedia communication and future online big data (BD) applications, these methods have become increasingly complex and result in poor real-time performance. These complex methods suffer from complex algorithms, massive data communication, slow processing speed, unintelligent predictive modeling, weak data classification, limited accuracy, uncleaned usage data, combined destructive artificial and natural noise, exchanging data values over time, and exhausting updating data for accurate operation of media storage. In the era of the Internet of Multimedia Things (IoMT), most modern devices can accurately capture vast amounts of observational data in the form of valuable images, text, and acoustic recordings. This reliable data must be kept valid for local data processing at different times and for globally extracting knowledge from information at various data levels to address forthcoming challenges related to big image data. The challenges of secure communication require accurate, reliable, fast, and effective information gathering for local decision-making and global knowledge mining of BID to support cyber-physical systems for speedy data virtualization of smart city devices. Complex and uncontrollable problems persist on the outskirts of grids, fogs, and cloud networks regarding data cleansing and privacy protection for intelligent data ocean management. This study developed a new 15Vs model empirically to examine distributed big image data processing based on a modern BID connectivity approach. The model accurately extracts textured features from visible data images and overcomes a unique set of key challenges.
The model introduces a strategic analyzer, uses intelligent local agents, and recruits clever global bots to provide a suitable platform for generous support of bot-oriented BD processing. This instantly forms a hierarchical data level for better data acquisition and cleansing, safe privacy protection, suitable information diffusion, and speedy knowledge extraction. Our study presents a novel definition of BID to address the ultimate challenges of data management in a high-risk environment for further big image data modeling research needs.


“…The recent separation techniques, however, fall well short of the capabilities of human hearing. It is challenging to resolve the existing SVS because of the instruments utilized and the spectral overlap between the speech and background music [11, 18, 19, 20, 21]. In daily life, human listeners generally have the remarkable ability to distinguish sound streams from a mixture of sounds, but this continues to be a difficult task for machines, particularly in the monaural case because it lacks the spatial cues that can be learned when two or more microphones are used.…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised Single-Channel Singing Voice Separation with Weighted Robust Principal Component Analysis Based on Gammatone Auditory Filterbank and Vocal Activity Detection

Wang. 2023. Sensors.

Singing-voice separation is a separation task that involves a singing voice and musical accompaniment. In this paper, we propose a novel, unsupervised methodology for extracting a singing voice from the background in a musical mixture. This method is a modification of robust principal component analysis (RPCA) that separates a singing voice by using weighting based on gammatone filterbank and vocal activity detection. Although RPCA is a helpful method for separating voices from the music mixture, it fails when one single value, such as drums, is much larger than others (e.g., the accompanying instruments). As a result, the proposed approach takes advantage of varying values between low-rank (background) and sparse matrices (singing voice). Additionally, we propose an expanded RPCA on the cochleagram by utilizing coalescent masking on the gammatone. Finally, we utilize vocal activity detection to enhance the separation outcomes by eliminating the lingering music signal. Evaluation results reveal that the proposed approach provides better separation outcomes than RPCA on ccMixter and DSD100 datasets.


“…Separation of speech, music and environmental sounds is an important task for many speech applications and automatic machine hearing, such as the automatic speech recognition and music applications in edge devices [1]. Its quality has been significantly improved with the introduction of deep learning.…”

Section: Introduction (mentioning)

confidence: 99%

A 1.6-mW Sparse Deep Learning Accelerator for Speech Separation

Yang, Chang. 2023. IEEE Trans. VLSI Syst.

Low power deep learning accelerators on the speech processing enable real-time applications on edge devices. However, most of the existing accelerators suffer from high power consumption and focus on image applications only. This paper presents a low power accelerator for speech separation through algorithm and hardware optimizations. At the algorithm level, the model is compressed with structured sensitivity as well as unstructured pruning, and further quantized to the shifted 8-bit floating-point format instead of the 32-bit floating-point format. The computations with the zero kernel and zero activation values are skipped by decomposition of the dilated and transposed convolutions. At the hardware level, the compressed model is then supported by an architecture with eight independent multipliers and accumulators (MACs) with a simple zero-skipping hardware to take advantage of the activation sparsity and low power processing. The proposed approach reduces the model size by 95.44% and computation complexity by 93.88%. The final implementation with the TSMC 40 nm process can achieve real-time speech separation and consumes 1.6 mW power when operated at 150 MHz. The normalized energy efficiency and area efficiency are 2.344 TOPS/W and 14.42 GOPS/mm², respectively.
