AI Choreographer: Music Conditioned 3D Dance Generation with AIST++

Li, Ruilong; Yang, Shan; Ross, David A.; Kanazawa, Angjoo

doi:10.1109/iccv48922.2021.01315

Cited by 288 publications

(270 citation statements)

References 61 publications

Supporting

Mentioning

270

Contrasting


“…(4) FID, which is an extension of the original Fréchet Inception Distance that calculates the distribution distance between estimated motions and the GT. FID is a standard metric in motion generation literature to evaluate the quality of generated motions [30,54,55,91]. Following prior work [55], we compute FID using the well-designed kinetic motion feature extractor in the fairmotion library [20].…”

Section: Methods (mentioning)

confidence: 99%


GLAMR: Global Occlusion-Aware Human Mesh Recovery with Dynamic Cameras

Yuan, Iqbal, Molchanov, et al. 2021

Preprint
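The Fréchet distance that the quoted statement relies on can be sketched as follows. This is a minimal illustration, assuming motion features have already been extracted into arrays of shape (num_samples, feature_dim); it does not reproduce the fairmotion kinetic feature extractor, and the function name `frechet_distance` is hypothetical. It fits a Gaussian to each feature set and evaluates the closed-form distance ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)).

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input is assumed to have shape (num_samples, feature_dim).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices. Small numerical noise is clipped away.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Stand-in features (random values, not real motion features).
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))
shifted_feats = real_feats + np.array([1.0, 2.0, 0.0, 0.0])

same = frechet_distance(real_feats, real_feats)
shifted = frechet_distance(real_feats, shifted_feats)
```

Because the second set is only a translated copy of the first, the covariance terms cancel and the distance reduces to the squared length of the shift, which makes the sketch easy to sanity-check.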


“…The self attention mechanism of transformers provides a natural bridge to connect multimodal signals. Applications include audio enhancement [17,63], speech recognition [26], image segmentation [63,73], cross-modal sequence generation [21,37,38], video retrieval [20] and image/video captioning/classification [28,29,36,44,60,61]. A common paradigm (which we also adapt) is to use the output representations of single modality convolutional networks as inputs to the transformer [20,35].…”

Section: Related Work (mentioning)

confidence: 99%

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

Kazakos, Huh, Nagrani, et al. 2021

Preprint


In egocentric videos, actions occur in quick succession. We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action sequence context to enhance the predictions. We test our approach on EPIC-KITCHENS and EGTEA datasets reporting state-of-the-art performance. Our ablations showcase the advantage of utilising temporal context as well as incorporating audio input modality and language model to rescore predictions. Code and models at: https://github.com/ekazakos/MTCN.

Introduction

Action recognition in egocentric video streams from sources like EPIC-KITCHENS poses a number of challenges that differ substantially from those of conventional third-person action recognition, where training and evaluation is on 10 second video clips and classes are quite high-level [31]. Actions are fine-grained (e.g. 'open bottle') and noticeably short, often one second or shorter. Along with the challenge, the footage offers an under-explored opportunity, as actions are captured in long untrimmed videos of well-defined and at-times predictable sequences. For example the action 'wash aubergine' can be part of the following sequence: you first 'take the aubergine', 'turn on the tap', 'wash the aubergine' and finally 'turn off the tap' (Fig. 1). Furthermore, the objects (the aubergine and tap in this case) are persistent over some of the neighbouring actions.
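The paradigm described in the quoted statement for this record, where per-modality feature sequences are fed jointly to a transformer, can be illustrated with a single self-attention step. This is a rough sketch under stated assumptions, not the architecture of any cited paper: the feature arrays stand in for outputs of single-modality networks, the projection matrices are random rather than learned, and the names `softmax` and `fuse_with_self_attention` are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def fuse_with_self_attention(video_feats, audio_feats, w_q, w_k, w_v):
    """One scaled dot-product self-attention step over both modalities.

    video_feats: (T_video, d) and audio_feats: (T_audio, d) are assumed to be
    precomputed per-modality features with a shared dimension d.
    """
    # Concatenate along the sequence axis so every token can attend to
    # tokens from either modality.
    tokens = np.concatenate([video_feats, audio_feats], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Stand-in inputs (random values, not real video or audio features).
rng = np.random.default_rng(0)
d = 8
video = rng.normal(size=(5, d))
audio = rng.normal(size=(3, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

fused, attn = fuse_with_self_attention(video, audio, w_q, w_k, w_v)
```

Each row of `attn` is a distribution over all eight tokens, so a video token's output can draw on audio tokens and vice versa, which is the cross-modal bridging the statement refers to.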


“…by conditioning network weights on phase, but they focus on cyclic motions. More recent methods [34,39,47,56] use attention [57].…”

Section: Related Work (mentioning)

confidence: 99%