Interspeech 2019
DOI: 10.21437/interspeech.2019-3130

Multiview Shared Subspace Learning Across Speakers and Speech Commands

Abstract: In many speech processing applications, the objective is to model different modes of variability to obtain robust speech features. In this paper, we learn speech representations in a multiview paradigm by constraining the views to known modes of variability such as speakers or spoken words. We use deep multiset canonical correlation (dMCCA) because it can model more than two views in parallel to learn a shared subspace across them. In order to model thousands of views (e.g., speakers), we demonstrate that stoc…
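To make the shared-subspace idea concrete, below is a minimal NumPy sketch of a multiset-correlation-style score over N views. It is an illustrative paraphrase of the general MCCA objective that dMCCA builds on, not the paper's exact loss; the function multiset_correlation and all names in it are my own assumptions.

```python
# Minimal sketch of a multiset-correlation-style score, assuming N views
# (e.g., one view per speaker) of the same n utterances, each embedded
# to d dimensions. A paraphrase of the general MCCA idea that dMCCA
# builds on, not the authors' exact formulation.
import numpy as np
from scipy.linalg import eigh

def multiset_correlation(views, reg=1e-4):
    """views: list of N arrays, each shaped (n_samples, d).

    Returns the mean generalized eigenvalue of the between-view vs.
    within-view covariance: higher means more variance is shared
    across views, i.e., a stronger common subspace.
    """
    n, d = views[0].shape
    centered = [v - v.mean(axis=0, keepdims=True) for v in views]
    mean_view = np.mean(centered, axis=0)  # (n, d), average over views

    # Between-view covariance: covariance of the view-averaged samples.
    C_b = mean_view.T @ mean_view / (n - 1)

    # Within-view covariance: average covariance of per-view residuals.
    C_w = sum((v - mean_view).T @ (v - mean_view) for v in centered)
    C_w = C_w / (len(views) * (n - 1)) + reg * np.eye(d)  # regularized

    # Generalized eigenvalues of C_b w.r.t. C_w (symmetric, C_w is PD).
    return eigh(C_b, C_w, eigvals_only=True).mean()

# Toy check: views sharing a latent signal should score higher than
# independent-noise views.
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 8))
corr_views = [shared + 0.5 * rng.normal(size=(500, 8)) for _ in range(3)]
noise_views = [rng.normal(size=(500, 8)) for _ in range(3)]
assert multiset_correlation(corr_views) > multiset_correlation(noise_views)
```

In the paper's setting the views are outputs of networks trained end to end, and scaling this kind of objective to thousands of views, presumably by stochastic optimization over sampled mini-batches of views, is the point at which the abstract above is truncated.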

Cited by 5 publications (6 citation statements)
References 23 publications (24 reference statements)
“…Schroff et al. [19] introduced FaceNet and the triplet loss for projecting images onto a latent space that quantifies similarity in a supervised-learning manner. Recently, Somandepalli et al. [20] used tracking of faces in photo-realistic video, followed by clustering and verification with MvCorr [21] and Improved Triplet [22], to adapt available face representation data to perform better on racially diverse images, following [23]. Aneja et al. [24] proposed the DeepExpr model for facial expression recognition across multiple styles.…”
Section: Related Work (mentioning)
confidence: 99%
“…We first review the multi-view correlation (mv-corr) objective developed by Somandepalli et al. (2019a; 2019b).…”
Section: Proposed Approach (mentioning)
confidence: 99%
“…recordings from over 1800 speakers saying one or more of 30 commands such as "On" and "Off". Somandepalli et al. (2019a) studied the application of mv-corr to spoken-word recognition and text-dependent speaker recognition on SCD, comparing against the state of the art in speaker recognition (Snyder et al., 2017). Building on their work, in this paper we analyze spoken-word recognition on SCD in greater detail.…”
Section: Speech Commands Dataset (mentioning)
confidence: 99%