With the growing popularity of social networks, cloud services and online applications, people are becoming increasingly concerned about how companies store their data and the ways in which that data can be used. Privacy with voice-operated devices and services is of particular interest. To enable studies of privacy, this paper presents a database that quantifies the experience of privacy users have in spoken communication. We focus on the effect of the acoustic environment on that perception of privacy. Speech signals are recorded in scenarios simulating real-life situations, where the acoustic environment affects the experience of privacy. The acoustic data is complemented with measures of the speakers' experience of privacy, collected using a questionnaire. The presented corpus enables studies of how acoustic environments affect people's experience of privacy, which, in turn, can be used to develop speech-operated applications that respect their users' right to privacy.
Voice user interfaces can offer intuitive interaction with our devices, but usability and audio quality could be further improved if multiple devices collaborated to provide a distributed voice user interface. To ensure that users' voices are not shared with unauthorized devices, it is, however, necessary to design an access management system that adapts to users' needs. Prior work has demonstrated that a combination of audio fingerprinting and fuzzy cryptography yields robust pairing of devices without sharing the information that they record. However, the robustness of these systems rests partly on the long recordings required to obtain the fingerprint. This paper analyzes methods for robust generation of acoustic fingerprints over short periods of time, enabling responsive pairing of devices as the acoustic scene changes; the methods can be integrated into other typical speech processing tools.
Voice user interfaces have increased in popularity, as they enable natural interaction with different applications using one's voice. To improve their usability and audio quality, several devices could cooperate to provide a unified voice user interface. However, with devices cooperating and sharing voice-related information, user privacy may be at risk. Therefore, access management rules that preserve user privacy are important. State-of-the-art methods for acoustic pairing of devices provide fingerprinting based on the time-frequency representation of the acoustic signal and error correction. We propose to use such acoustic fingerprinting to authorise devices which are acoustically close. We aim to obtain fingerprints of ambient audio adapted to the requirements of voice user interfaces. Our experiments show that responsiveness and robustness are improved by combining overlapping windows and decorrelating transforms.
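As an illustration of the kind of time-frequency fingerprinting described above, the following sketch frames ambient audio with overlapping windows, sums spectral energy in coarse bands, and encodes the signs of energy differences across time and frequency as bits. This is a common ambient-audio fingerprinting scheme, not necessarily the authors' exact method; frame length, hop, and band count are illustrative parameters.

```python
import numpy as np

def ambient_fingerprint(signal, frame_len=1024, hop=512, n_bands=16):
    """Illustrative energy-difference fingerprint (hypothetical parameters).

    Frames the signal with overlapping windows, sums spectral energy in
    coarse bands, and encodes the sign of second-order energy differences
    (across adjacent frames and bands) as bits.
    """
    frames = [signal[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(signal) - frame_len, hop)]
    energies = []
    for f in frames:
        spec = np.abs(np.fft.rfft(f)) ** 2
        bands = np.array_split(spec, n_bands)
        energies.append([b.sum() for b in bands])
    e = np.array(energies)
    # Bit = 1 if the band-to-band energy difference increases over time.
    diff = (e[1:, 1:] - e[1:, :-1]) - (e[:-1, 1:] - e[:-1, :-1])
    return (diff > 0).astype(np.uint8).ravel()
```

Two devices recording the same ambient audio with small independent noise should obtain fingerprints with a low Hamming distance, which is what makes such bits usable for pairing.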
Existing hardware with microphones can potentially be used as sensor networks to capture speech and audio signals, offering better signal quality than is possible with a single microphone. A central prerequisite for such ad-hoc acoustic wireless sensor networks (ASWNs) is an efficient communication protocol with which to transmit audio data between nodes. For that purpose, we present the first speech and audio codec especially designed for ASWNs, which has competitive quality also in single-channel operation. To ensure quality in the single-channel scenario, it closely resembles conventional codecs of the TCX type, extended with features that facilitate multi-device operation, including dithered quantization, delay estimation and compensation, and multi-channel postfiltering. The codec is intended to become a baseline for future research, and we therefore provide it as an open-access library. Our experiments confirm that its performance is in the same range as recent commercial single-channel codecs and that added devices improve quality.
Index Terms: speech and audio coding, ad-hoc acoustic sensor networks, time difference of arrival estimation, delay compensation, multi-channel post-filtering
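The delay estimation and compensation mentioned above is commonly done with the generalized cross-correlation with phase transform (GCC-PHAT); the sketch below estimates the integer-sample delay between two devices' recordings. This is a standard technique, not necessarily the codec's exact estimator.

```python
import numpy as np

def gcc_phat_delay(x, y, max_lag=None):
    """Estimate the integer-sample delay of y relative to x via GCC-PHAT.

    Illustrative of the delay-estimation step in multi-device capture:
    the phase transform whitens the cross spectrum so the correlation
    peak depends on phase (delay) only, not on spectral magnitude.
    """
    n = len(x) + len(y)
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    cross = Y * np.conj(X)
    # Phase transform: discard magnitude, keep phase only.
    cross /= np.maximum(np.abs(cross), 1e-12)
    corr = np.fft.irfft(cross, n)
    if max_lag is None:
        max_lag = n // 2
    # Rearrange so index 0 corresponds to lag -max_lag.
    corr = np.concatenate((corr[-max_lag:], corr[:max_lag + 1]))
    return int(np.argmax(corr)) - max_lag
```

The estimated delay can then be compensated by shifting one signal before multi-channel postfiltering.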
Voice-based devices and virtual assistants are widely integrated into our daily lives, but their growing popularity has also raised concerns about data privacy in processing and storage. While improvements in technology and data-protection regulations have given users a more secure experience, the concept of privacy continues to face substantial challenges. People intuitively adjust their way of talking in a human-to-human conversation, an intuition that devices could benefit from to increase their level of privacy. To enable devices to quantify privacy in an acoustic scenario, this paper focuses on how people perceive privacy with respect to environmental noise. We measured privacy scores on a crowdsourcing platform with a paired-comparison listening test and obtained reliable and consistent results. Our measurements show that the experience of privacy varies depending on the acoustic features of the ambient noise. Furthermore, multiple probabilistic choice models were fitted to the data to obtain a meaningful ordering of noise scenarios conveying listeners' preferences. A preference tree model was found to fit best, indicating that subjects change their decision strategy depending on the scenarios under test.
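The simplest of the probabilistic choice models mentioned above is the Bradley-Terry model, which assigns each scenario a worth parameter such that the probability of preferring scenario i over j is p_i / (p_i + p_j). The sketch below fits it to a pairwise win-count matrix with the classic minorization-maximization (MM) update; the preference-tree model that fit best in the paper is more involved and is not reproduced here.

```python
import numpy as np

def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry worth parameters from a pairwise win-count matrix.

    wins[i, j] = number of times scenario i was preferred over scenario j.
    Uses the MM update p_i <- w_i / sum_j n_ij / (p_i + p_j), where w_i is
    the total number of wins of i and n_ij the number of comparisons.
    """
    n = wins.shape[0]
    p = np.ones(n)
    total = wins + wins.T          # comparisons between each pair (diag 0)
    w = wins.sum(axis=1)           # total wins per scenario
    for _ in range(n_iter):
        denom = (total / (p[:, None] + p[None, :])).sum(axis=1)
        p = w / denom
        p /= p.sum()               # fix the arbitrary scale
    return p
```

Sorting the scenarios by the fitted worths yields the kind of preference ordering of noise scenarios the paper reports.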
The effect that advances in voice interface technologies have on privacy has not yet received the attention it deserves. Systems in which multiple devices collaborate to provide a unified user interface amplify these concerns. We discuss ethical implications of voice-enabled devices on privacy in typical scenarios at home, in the office, in a car and in public. From our findings, it follows that the reach of voice can be exploited as a feature to intuitively define the extent of privacy. In particular, the acoustic reach of speech signals can serve as a feature for designing privacy-gentle voice user interfaces which are intuitive to use. We argue that this approach poses reasonable technological requirements and establishes a natural experience of privacy that matches users' intuitive perception.
In scenarios such as remote work, open offices and call centers, multiple people may simultaneously have independent spoken interactions with their devices in the same room. The speech of competing speakers will, however, be picked up by all microphones, both reducing audio quality and exposing speakers to privacy breaches. We propose a cooperative cross-talk cancellation solution that breaks the single-active-speaker assumption employed by most telecommunication systems. The proposed method applies source separation to the microphone signals of independent devices to extract the dominant speaker at each device. It is realized using a localization estimator based on a deep neural network, followed by a time-frequency mask that separates the target speech from the interfering speech at each time-frequency unit according to its estimated orientation. Experimental evaluation confirms that the proposed method effectively reduces crosstalk and exceeds the baseline expectation-maximization method by 10 dB in terms of interference rejection. This performance makes the proposed method a viable solution for cross-talk cancellation in near-field conditions, thus protecting the privacy of external speakers in the same acoustic space.
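The masking step described above can be illustrated as follows: given a per-unit direction-of-arrival estimate (produced in the paper by a DNN localizer, here simply an input array), keep only the time-frequency units whose estimated direction lies close to the target speaker. The function name and tolerance parameter are hypothetical.

```python
import numpy as np

def orientation_mask(stft, angles, target_angle, tolerance=20.0):
    """Apply a binary time-frequency mask that keeps units whose estimated
    direction of arrival (degrees) is within `tolerance` of the target.

    `stft` and `angles` have the same shape (frames x frequency bins);
    `angles` is assumed to come from an upstream localizer.
    """
    # Smallest angular difference on the circle, in [0, 180].
    diff = np.abs((angles - target_angle + 180.0) % 360.0 - 180.0)
    mask = (diff <= tolerance).astype(stft.real.dtype)
    return stft * mask
```

The masked spectrogram is then inverted back to the time domain to obtain the separated target speech.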
The proliferation of acoustic human-computer interaction raises privacy concerns, since it allows Voice User Interfaces (VUIs) to overhear human speech and to analyze and share the content of overheard conversations in cloud datacenters and with third parties. This process is non-transparent regarding when and which audio is recorded, the reach of the speech recording, the information extracted from a recording, and the purpose for which it is used. To return control over the use of audio content to the individual who generated it, we promote intuitive privacy for VUIs, featuring a lightweight consent mechanism as well as means of secure verification (proof of consent) for any recorded piece of audio. In particular, through audio fingerprinting and fuzzy cryptography, we establish a trust zone whose area is implicitly controlled by voice loudness with respect to environmental noise (Signal-to-Noise Ratio, SNR). Secure keys are exchanged to verify consent on the use of an audio sequence via digital signatures. We performed experiments with different levels of human voice, corresponding to various trust situations (e.g. whispering and group discussion). A second scenario was investigated in which a VUI outside the trust zone could not obtain the shared secret key.
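A common fuzzy-cryptography primitive behind such key exchange is the fuzzy commitment: a key is redundantly encoded with an error-correcting code and XORed with the local acoustic fingerprint, so a device with a similar enough fingerprint (i.e. inside the trust zone) can recover it. The sketch below uses a toy repetition code for clarity; real systems use stronger codes such as BCH or Reed-Solomon, and this is not necessarily the paper's exact construction.

```python
import numpy as np

def commit(key_bits, fingerprint, rep=3):
    """Fuzzy commitment: repetition-encode the key and XOR it with the
    fingerprint. `key_bits` and `fingerprint` are uint8 arrays of 0/1."""
    code = np.repeat(key_bits, rep)
    return code ^ fingerprint[:len(code)]

def reveal(commitment, fingerprint, rep=3):
    """Recover the key: XOR with the local fingerprint, then majority-vote
    decode each repetition block to correct fingerprint mismatches."""
    noisy = commitment ^ fingerprint[:len(commitment)]
    return (noisy.reshape(-1, rep).sum(axis=1) > rep // 2).astype(np.uint8)
```

A device whose fingerprint differs in only a few scattered bits recovers the key exactly, while a device with an unrelated fingerprint (outside the trust zone) obtains essentially random bits.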