Harnessing GANs for Zero-Shot Learning of New Classes in Visual Speech Recognition

Kumar, Yaman; Sahrawat, Dhruva; Maheshwari, Sachin; Mahata, Debanjan; Stent, Amanda; Yin, Yifang; Shah, Rajiv Ratn; Zimmermann, Roger

doi:10.1609/aaai.v34i03.5649

Cited by 12 publications

(14 citation statements)

References 24 publications


“…In 2020, the long-tail item recommendation method began to the deep learning method. For example, Bai et al [36] use stacked denoising autoencoders (SDAE) to realize online long-tail item recommendation, and Kumar et al [12] proposed to realize longtail item recommendation using few shot learning. Bai et al [36] proposed a deep learning framework for long-tail item recommendation (DLTSR).…”

Section: Multiobjective Optimization-based Long-tail Item (mentioning)

confidence: 99%

“…The content-based recommendation method and collaborative filtering recommendation method [2,3] are classic methods in the recommender system. Machine learning and deep learning have great advantages in learning the inherent laws and representation levels of sample data and have made many research achievements in image classification [4][5][6][7], object detection [8][9][10][11], speech recognition [12,13], and emotion recognition [14]. Therefore, researchers combine machine learning, deep learning, knowledge graph, and other technologies in these basic methods, allowing recommender systems to be widely used in news, tourism, e-commerce, and other fields.…”

Section: Introduction (mentioning)

confidence: 99%


A Survey of Long‐Tail Item Recommendation Methods

Qin, 2021

Wireless Communications and Mobile Computing


Recommender systems represent a critical field of AI technology applications. The core function of a recommender system is to recommend items of interest to users, but if it is only user history-based (purchasing or browsing data), it can only recommend similar products to a user, which makes the user feel fatigued (creating so-called “Information Cocoons”). Besides, transaction data (purchasing or browsing data) in various fields usually follow Pareto distributions. Accordingly, 20% of products are purchased or viewed a greater number of times (short-head items), while the remaining 80% of products are purchased or viewed less frequently (long-tail items). Using the traditional recommendation method, considering only the accuracy of recommendations, the coverage rate is relatively low, and most of the recommended items are short-head items. The long-tail item recommendation method not only considers the recommendation of short-head items but also considers recommending more long-tail items to users, thus improving the coverage and diversity of the recommendation results. Long-tail item recommendation research has become a frontier issue in recommendation systems in recent years. While the current research paper is still scarce, there have been related research achievements in top-level conferences in the field of computers, such as VLDB and IJCAI. Due to the fact that there is no review literature in this field, to allow readers to better understand the research status of the long-tail item recommendation method, this paper summarizes the progress of the research on long-tail item recommendation methods (from clustering-based, which began in 2008, to deep learning-based methods, which began in 2020) and the future directions associated with this research.


“…A related task to video frame interpolation is talking face generation. Here, given an audio waveform, the task is to synthesize a talking face [4,5,6]. In recent times, these approaches have become popular for both academic and non-academic purposes [7].…”

Section: Introduction (mentioning)

confidence: 99%

“…In recent times, these approaches have become popular for both academic and non-academic purposes [7]. While, on the one hand, they are being used to extend speechreading models to low resource languages [6], on the other, many of them are also used to generate fake news and paid content as well.…”

Section: Introduction (mentioning)

confidence: 99%

“…Speech as a natural signal is composed of three parts [10]: visual modality, audio modality, and the context in which it was spoken (crudely, the role played by language). Correspondingly, there are three tasks for modeling speech: speech-reading (or popularly known as lipreading) [6,11,12], speech recognition (or ASR) [13] and language modeling [14]. The part of speech which is closest to the speech video generation task is the visual modality of speech; and visemes are the fundamental units of this part of speech.…”

Section: Introduction (mentioning)

confidence: 99%


LIFI: Towards Linguistically Informed Frame Interpolation

Mathur, Batra, Kumar, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Here we explore the problem of speech video interpolation. With close to 70% of web traffic, such content today forms the primary form of online communication and entertainment. Despite high performance on conventional metrics like MSE, PSNR, and SSIM, we find that the state-of-the-art frame interpolation models fail to produce faithful speech interpolation. For instance, we observe the lips stay static while the person is still speaking for most interpolated frames. With this motivation, using the information of words, subwords, and visemes, we provide a new set of linguistically informed metrics targeted explicitly to the problem of speech video interpolation. We release several datasets to test video interpolation models of their speech understanding. We also design linguistically informed deep learning video interpolation algorithms to generate the missing frames.


A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

et al. 2021


The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.
