2002
DOI: 10.1109/tsa.2002.804538
Distributed speech processing in miPad's multimodal user interface

Cited by 42 publications (28 citation statements)
References 5 publications
“…Before deep learning methods were adopted, there had already been numerous efforts in multimodal and multitask learning. For example, a prototype called MiPad for multimodal interactions, involving capturing, learning, coordinating, and rendering a mix of speech, touch, and visual information, was developed and reported in [113,164]. In [165,166], mixed sources of information from multi-sensory microphones with separate bone-conductive and air-borne paths were exploited to denoise speech.…”
Section: B) A Selected Review on Deep Learning for Multimodal Processing (mentioning)
confidence: 99%
“…MiPad is a prototype wireless Personal Digital Assistant (PDA) that enables users to accomplish many common tasks through a multimodal spoken-language interface (speech + pen + display). MiPad, as a case study for speech-centric multimodal HCI applications, has been described in detail in our recent publication [2]. In this paper, we present a second case study based on a new system built more recently within our research group, called MapPointS.…”
Section: Introduction (mentioning)
confidence: 99%
“…Many prototype systems have also been built around the use of multiple modalities [1,2,7,9,14], most of which focus on the special advantage of speech input for mobile or wireless computing, as in multimodal PDAs. Both of our prototype systems, MiPad and MapPointS, were designed with such mobile computing as a special consideration.…”
Section: Introduction (mentioning)
confidence: 99%
“…The total cost of a cluster system, including maintenance, is clearly lower than that of a DSP-based system at the research and development stage. From the viewpoint of acoustic applications using network communication, distributed speech processing for a personal digital assistant (PDA) is discussed in [6]. That work proposes transmitting the speech signal from the PDA to a remote server for automatic speech recognition.…”
Section: Introduction (mentioning)
confidence: 99%
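The last citation statement describes the client/server split at the heart of distributed speech processing: the PDA computes a compact front-end representation locally and ships only that to a remote recognition server. The sketch below is a hypothetical illustration of that split, not MiPad's actual protocol; the feature (per-frame log-energy), the JSON-over-TCP framing, and all function names (`extract_features`, `serve_once`, `send_utterance`) are assumptions made for this example — a real system would send cepstral features to an ASR decoder.

```python
# Hypothetical sketch of the PDA-to-server split described above.
# Client side: featurize raw audio locally. Server side: receive the
# features and (in a real system) run automatic speech recognition.
import json
import math
import socket
import struct
import threading


def extract_features(samples, frame_len=160):
    """PDA side: reduce raw samples to per-frame log-energies
    (a stand-in for the cepstral features a real front end computes)."""
    feats = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        feats.append(math.log(energy + 1e-10))
    return feats


def serve_once(srv):
    """Server side: accept one connection, read a length-prefixed JSON
    feature payload, and reply with a (dummy) recognition result."""
    conn, _ = srv.accept()
    with conn:
        (n,) = struct.unpack("!I", conn.recv(4))
        buf = b""
        while len(buf) < n:
            buf += conn.recv(4096)
        feats = json.loads(buf.decode())
        # A real server would decode `feats` with an ASR engine here.
        reply = json.dumps({"frames": len(feats)}).encode()
        conn.sendall(struct.pack("!I", len(reply)) + reply)


def send_utterance(host, port, samples):
    """Client side: featurize locally, ship only the features upstream."""
    payload = json.dumps(extract_features(samples)).encode()
    with socket.create_connection((host, port)) as conn:
        conn.sendall(struct.pack("!I", len(payload)) + payload)
        (n,) = struct.unpack("!I", conn.recv(4))
        return json.loads(conn.recv(n).decode())


if __name__ == "__main__":
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]
    t = threading.Thread(target=serve_once, args=(srv,))
    t.start()
    # ~0.1 s of constant-amplitude "audio" at 16 kHz -> 10 frames of 10 ms
    result = send_utterance("127.0.0.1", port, [0.1] * 1600)
    t.join()
    srv.close()
    print(result)  # {'frames': 10}
```

Shipping features rather than raw waveforms is the design point the cited paper motivates: it cuts uplink bandwidth and keeps the compute-heavy decoding on the server.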