“…Fig. 3 reveals that the estimation of negative α is more accurate than that of positive α, which is consistent with the conclusion in [6,7] that lowering pitch is easier to detect than raising pitch. Besides, a tiny α is prone to be estimated as zero, resulting in a linear deviation in the neighbourhood of zero.…”
Section: Evaluation Of Estimation Accuracy (supporting)
confidence: 86%
“…Prior works and limitation: Early works [5][6][7] typically estimate the approximate range of pitch shifting rather than the precise degree of disguise, rendering them incapable of accurately restoring the pitch-shifted voice. Later, Pilia et al. proposed a method achieving more accurate estimation results than previous work [8].…”
Pitch scaling algorithms have a significant impact on the security of Automatic Speaker Verification (ASV) systems. Although numerous anti-spoofing algorithms have been proposed to identify pitch-shifted voice and even restore it to the original version, they either perform poorly or require the original voice as a reference, which limits their applicability. In this paper, we propose a no-reference approach termed PSVRF for high-quality restoration of pitch-shifted voice. Experiments on AISHELL-1 and AISHELL-3 demonstrate that PSVRF can restore voice disguised by various pitch-scaling techniques, which clearly enhances the robustness of ASV systems to pitch-scaling attacks. Furthermore, the performance of PSVRF even surpasses that of the state-of-the-art reference-based approach.
“…(1) Frequency domain features are used to identify audio post-processing operations. Wang et al. [25] used the features of audio after STFT transformation as the input of a convolutional neural network (CNN) to identify the post-processing operation of audio pitch transformation. (2) ENF is applied for audio recapture detection.…”
Section: Detection Methods Based On Deep Features (mentioning)
Digital audio tampering detection can be used to verify the authenticity of digital audio. However, most current methods use standard electronic network frequency (ENF) databases for visual comparison analysis of ENF continuity of digital audio or perform feature extraction for classification by machine learning methods. ENF databases are usually tricky to obtain, visual methods have weak feature representation, and machine learning methods have more information loss in features, resulting in low detection accuracy. This paper proposes a fusion method of shallow and deep features to fully use ENF information by exploiting the complementary nature of features at different levels to more accurately describe the changes in inconsistency produced by tampering operations to raw digital audio. Firstly, the audio signal is band-pass filtered to obtain the ENF component. Then, the discrete Fourier transform (DFT) and Hilbert transform are performed to obtain the phase and instantaneous frequency of the ENF component. Secondly, the mean value of the sequence variation is used as the shallow feature; the feature matrix obtained by framing and reshaping of the ENF sequence is used as the input of the convolutional neural network; the characteristics of the fitted coefficients are obtained by curve fitting. Then, the local details of ENF are obtained from the feature matrix by the convolutional neural network, and the global information of ENF is obtained by fitting coefficient features through deep neural network (DNN). The depth features of ENF are composed of ENF global information and local information together. The shallow and deep features are fused using an attention mechanism to give greater weights to features useful for classification and suppress invalid features. Finally, the tampered audio is detected by downscaling and fitting with a DNN containing two fully connected layers, and classification is performed using a Softmax layer. 
The method achieves 97.03% accuracy on three classic databases: Carioca 1, Carioca 2, and New Spanish. In addition, we have achieved an accuracy of 88.31% on the newly constructed database GAUDI-DI. Experimental results show that the proposed method is superior to the state-of-the-art method.
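The first steps of the pipeline described in this abstract (band-pass filtering to isolate the ENF component, then obtaining its phase and instantaneous frequency) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 50 Hz nominal frequency, the 1 Hz half-width, and the ideal FFT-mask filter are all assumptions, and the phase here is taken from the analytic signal rather than from the DFT-based estimate the abstract mentions. The "mean variation" feature at the end is one possible reading of the shallow feature, labeled as such.

```python
import numpy as np

def bandpass_fft(x, fs, f_lo, f_hi):
    """Ideal band-pass: zero every FFT bin outside [f_lo, f_hi] Hz."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def analytic_signal(x):
    """Analytic signal built in the frequency domain (Hilbert-transform based)."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep DC once
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep Nyquist once
        weights[1:n // 2] = 2.0           # double positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights)

def enf_phase_and_frequency(x, fs, nominal_hz=50.0, half_width_hz=1.0):
    """Return the unwrapped phase (rad) and instantaneous frequency (Hz)
    of the ENF component. nominal_hz and half_width_hz are assumed values."""
    enf = bandpass_fft(x, fs, nominal_hz - half_width_hz, nominal_hz + half_width_hz)
    phase = np.unwrap(np.angle(analytic_signal(enf)))
    inst_freq = np.diff(phase) * fs / (2.0 * np.pi)
    return phase, inst_freq

# Synthetic example: a 50 Hz hum buried in broadband noise (hypothetical data).
fs = 1000.0
t = np.arange(4000) / fs
audio = np.sin(2 * np.pi * 50.0 * t) + 0.5 * np.random.default_rng(0).standard_normal(t.size)
phase, inst_freq = enf_phase_and_frequency(audio, fs)

# One possible "shallow" feature: mean absolute sample-to-sample variation.
mean_variation = float(np.mean(np.abs(np.diff(inst_freq))))
```

On an untampered recording the instantaneous frequency should stay close to the nominal value, whereas a splice tends to introduce a phase discontinuity that shows up as a spike in `inst_freq`; a real system would use a properly designed filter and framing rather than a single whole-signal FFT mask.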
“…Frequency domain features are used to identify audio post-processing operations. Wang et al. [25] used the features of audio after STFT and CQT transformation as the input of a convolutional neural network (CNN) to identify the post-processing operation of audio pitch transformation. 2).…”
Section: Detection Methods Based On Deep Features (mentioning)
Digital audio tampering detection can be applied to verify the authenticity of digital audio. However, current methods are mostly based on visual comparison of the continuity of the electronic network frequency (ENF) of digital audio against a standard ENF database. The ENF database is usually difficult to obtain, and the feature expression of the visualization method is weak, which leads to low detection accuracy. To solve this problem, this paper proposes an audio tampering detection method based on the fusion of shallow and deep features. Firstly, band-pass filtering is performed on the audio signal to obtain the ENF components, and then the discrete Fourier transform and Hilbert transform are applied to obtain the phase and instantaneous frequency of the ENF components. Secondly, the shallow features are extracted by performing framing and fitting operations on the estimated phase and instantaneous frequency. Then, the designed convolutional neural network is used to obtain deep features, and the attention mechanism is applied to fuse shallow features and deep features. Finally, after dimensionality reduction through the fully connected layer, the Softmax layer is used for classification to detect tampered audio. The method achieves 97.03% accuracy on three classic databases: Carioca 1, Carioca 2, and New Spanish. In addition, we have achieved an accuracy of 88.31% on the newly constructed database GAUDI-DI. Experimental results show that the proposed method is superior to the state-of-the-art method.