Mean opinion score ratings of reproduced sound quality typically pool all contributing perceptual factors into a single rating of basic audio quality. In order to improve understanding of the trade-offs between selected sound quality degradations that might arise in systems for the delivery of high quality multichannel audio, it was necessary to evaluate the influence of timbral and spatial fidelity changes on basic audio quality grades. The relationship between listener ratings of degraded multichannel audio quality on one timbral and two spatial fidelity scales was exploited to predict basic audio quality ratings of the same material using a regression model. It was found that timbral fidelity ratings dominated but that spatial fidelity predicted a substantial proportion of the basic audio quality.
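The regression idea in the abstract above can be illustrated with a minimal sketch: an ordinary least-squares fit that predicts a basic audio quality grade from one timbral and two spatial fidelity ratings. Everything here is an assumption made for illustration. The variable names (`timbral`, `spatial_a`, `spatial_b`), the 0 to 100 rating range, and the generating weights are hypothetical, and the data are synthetic; none of it is taken from the study.

```python
import random


def fit_linear(X, y):
    """Ordinary least squares via the normal equations.

    X is a list of predictor rows (no intercept column); an intercept
    term is added internally. Returns [intercept, w1, w2, ...].
    """
    rows = [[1.0] + list(r) for r in X]
    n = len(rows[0])
    # Build A = X^T X and b = X^T y.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for k in range(col + 1, n):
            f = A[k][col] / A[col][col]
            for j in range(col, n):
                A[k][j] -= f * A[col][j]
            b[k] -= f * b[col]
    # Back substitution.
    coef = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (b[i] - s) / A[i][i]
    return coef


# Synthetic, noise-free ratings on a hypothetical 0-100 scale.
rng = random.Random(0)
X = [[rng.uniform(0, 100) for _ in range(3)] for _ in range(50)]
# Hypothetical generating weights (NOT results from the paper):
# timbral dominates, the two spatial scales contribute less.
TRUE = [5.0, 0.7, 0.2, 0.1]
y = [TRUE[0] + TRUE[1] * t + TRUE[2] * sa + TRUE[3] * sb for t, sa, sb in X]

coef = fit_linear(X, y)
intercept, w_timbral, w_spatial_a, w_spatial_b = coef
```

With real listener data the fitted weights would indicate the relative contribution of each fidelity scale to the overall grade; here the fit simply recovers the weights used to generate the synthetic ratings.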
The purpose of this paper is to compare the performance of human listeners against selected machine learning algorithms in the task of classifying spatial audio scenes in binaural recordings of music under practical conditions. Three scenes were subject to classification: (1) music ensemble (a group of musical sources) located in the front, (2) music ensemble located at the back, and (3) music ensemble distributed around a listener. In the listening test, undertaken remotely over the Internet, human listeners reached a classification accuracy of 42.5%. For the listeners who passed the post-screening test, the accuracy was greater, approaching 60%. The same classification task was also undertaken automatically using four machine learning algorithms: convolutional neural network, support vector machines, extreme gradient boosting framework, and logistic regression. The machine learning algorithms substantially outperformed human listeners, with classification accuracy reaching 84% when tested under binaural-room-impulse-response (BRIR) matched conditions. However, when the algorithms were tested under the BRIR mismatched scenario, their accuracy was comparable to that of the listeners who passed the post-screening test, implying that the machine learning algorithms' capability to perform in unknown electro-acoustic conditions needs to be further improved.
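The distinction between BRIR matched and BRIR mismatched testing can be sketched as two ways of splitting a dataset. This is only an illustration under stated assumptions: the record fields (`excerpt`, `brir`, `scene`), the helper names, and the toy dataset are hypothetical, and the sketch assumes that "matched" means the test recordings use BRIRs also seen in training while "mismatched" means the test BRIRs are withheld from training entirely.

```python
def matched_split(samples, test_fraction=0.25):
    """Hold out some excerpts; every BRIR still appears in train and test."""
    excerpts = sorted({s["excerpt"] for s in samples})
    n_test = max(1, int(len(excerpts) * test_fraction))
    held = set(excerpts[-n_test:])
    train = [s for s in samples if s["excerpt"] not in held]
    test = [s for s in samples if s["excerpt"] in held]
    return train, test


def mismatched_split(samples, held_out_brirs):
    """Hold out whole BRIR sets; the test BRIRs are never seen in training."""
    held = set(held_out_brirs)
    train = [s for s in samples if s["brir"] not in held]
    test = [s for s in samples if s["brir"] in held]
    return train, test


# Toy dataset (hypothetical): 4 excerpts, each rendered with 3 BRIR sets.
SCENES = ["front", "back", "around"]
samples = [
    {"excerpt": f"e{i}", "brir": f"b{j}", "scene": SCENES[i % 3]}
    for i in range(4)
    for j in range(3)
]

m_train, m_test = matched_split(samples)
x_train, x_test = mismatched_split(samples, held_out_brirs=["b2"])

# BRIR sets seen on each side of each split.
m_train_brirs = {s["brir"] for s in m_train}
m_test_brirs = {s["brir"] for s in m_test}
x_train_brirs = {s["brir"] for s in x_train}
x_test_brirs = {s["brir"] for s in x_test}
```

Under the mismatched split a classifier cannot rely on room-specific cues learned during training, which is one plausible reading of why accuracy drops in that scenario.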
The preferences of a large number of naïve listeners were elicited in response to a selection of multichannel audio items that had been degraded in quality by using band-limiting and down-mixing algorithms. Relationships were sought between these preference ratings and the quality judgements of experienced listeners in an attempt to determine whether one could be predicted from the other. Results suggest that a simple regression model can be used to do this with adequate results, but that a better prediction can be successfully based on experienced listener ratings of timbral and spatial fidelity. There is a difference between naïve and experienced listeners in the weightings of the fidelities and their relationship to overall quality.
The aim of the study was to develop a method for the automatic classification of three spatial audio scenes, differing in the horizontal distribution of foreground and background audio content around a listener, in binaurally rendered recordings of music. For the purpose of the study, audio recordings were synthesized using thirteen sets of binaural-room-impulse-responses (BRIRs), representing the room acoustics of both semi-anechoic and reverberant venues. Head movements were not considered in the study. The proposed method was assumption-free with regard to the number and characteristics of the audio sources. A least absolute shrinkage and selection operator was employed as a classifier. According to the results, it is possible to automatically identify the spatial scenes using a combination of binaural and spectro-temporal features. The method exhibits satisfactory classification accuracy when it is trained and then tested on different stimuli synthesized using the same BRIRs (accuracy ranging from 74% to 98%), even in highly reverberant conditions. However, the generalizability of the method needs to be further improved. This study demonstrates that, in addition to the binaural cues, the Mel-frequency cepstral coefficients constitute an important carrier of spatial information, imperative for the classification of spatial audio scenes.
Binaural technology is becoming increasingly popular in multimedia systems. This paper identifies a set of features of binaural recordings suitable for the automatic classification of four basic spatial audio scenes representing the most typical patterns of audio content distribution around a listener. Moreover, it compares five artificial-intelligence-based methods applied to the classification of binaural recordings. The results show that both the spatial and the spectro-temporal features are essential to accurate classification of binaurally rendered acoustic scenes. The spectro-temporal features appear to have a stronger influence on the classification results than the spatial metrics. According to the obtained results, the method based on the support vector machine, exploiting the features identified in the study, yields a classification accuracy approaching 84%.
Spatial audio processes (SAPs) commonly encountered in consumer audio reproduction systems are known to generate a range of impairments to spatial quality. Two listening tests (involving two listening positions, six 5-channel audio recordings, and 48 SAPs) indicate that the degree of quality degradation is determined largely by the nature of the SAP but that the effect of a particular SAP can depend on program material and on listening position. Combining off-center listening with another SAP can reduce spatial quality significantly compared to auditioning that SAP centrally. These findings, and the associated listening test data, can guide the development of an artificial-listener-based spatial audio quality evaluation system.