2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00509
Generating Diverse and Natural 3D Human Motions from Text

Cited by 154 publications (105 citation statements) | References: 30 publications
“…Other studies learned a joint embedding projection for both modalities [Ahuja and Morency 2019; Ghosh et al. 2021] and generated motions using a decoder. Some research applied auto-regressive methods [Guo et al. 2022a], encoding text and generating motion frames sequentially. Recent approaches, such as [Petrovich et al. 2022], use stochastic models to generate diverse motions.…”
Section: Human Motion Generation
confidence: 99%
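To make the auto-regressive approach quoted above concrete, here is a minimal PyTorch sketch: a GRU decoder that consumes the previous pose frame together with a text embedding and emits motion frames one at a time. All module names and dimensions are hypothetical (the 263-dimensional pose vector follows the HumanML3D convention mentioned later on this page); this is not the implementation of any cited paper.

```python
# Minimal sketch of auto-regressive text-to-motion generation.
# Hypothetical module names and dimensions, not any cited paper's code.
import torch
import torch.nn as nn

class AutoRegressiveMotionDecoder(nn.Module):
    def __init__(self, text_dim=512, pose_dim=263, hidden_dim=1024):
        super().__init__()
        # The GRU consumes the previous pose concatenated with the text embedding.
        self.gru = nn.GRU(pose_dim + text_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, text_emb, num_frames):
        batch = text_emb.shape[0]
        pose = torch.zeros(batch, 1, self.out.out_features)  # neutral start pose
        hidden, frames = None, []
        for _ in range(num_frames):
            step_in = torch.cat([pose, text_emb.unsqueeze(1)], dim=-1)
            out, hidden = self.gru(step_in, hidden)
            pose = self.out(out)            # predict the next pose frame
            frames.append(pose)
        return torch.cat(frames, dim=1)     # (batch, num_frames, pose_dim)

decoder = AutoRegressiveMotionDecoder()
motion = decoder(torch.randn(2, 512), num_frames=60)  # (2, 60, 263)
```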
“…Our objective is to identify the best automated metric for evaluating language-conditioned human motion generations, with "best" referring to the metric most closely correlated with human judgments. While various automated metrics have been proposed [Ahuja and Morency 2019; Ghosh et al. 2021; Guo et al. 2022a] and some works have conducted comparative human evaluations [Guo et al. 2022a; Petrovich et al. 2022], none have directly addressed this question. Developing automated metrics that correlate with human judgments has been vital in fields such as machine translation [Papineni et al. 2002; Zhang et al. 2019], and we believe it is essential for advancing text-to-motion methods.…”
Section: Introduction
confidence: 99%
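The protocol this excerpt describes, picking the metric that best correlates with human judgments, can be sketched in a few lines: score each candidate system with both human raters and each automated metric, then rank the metrics by rank correlation. The numbers below are illustrative placeholders, not real evaluation data.

```python
# Sketch of the metric-selection protocol: rank candidate automated metrics
# by how well they correlate with human judgments across systems.
from scipy.stats import spearmanr

# Per-system human preference rates and automated metric scores (hypothetical).
human_scores  = [0.62, 0.48, 0.71, 0.55]        # human judgments per method
metric_scores = {
    "FID_inverted": [0.60, 0.45, 0.70, 0.50],   # lower FID -> higher score
    "R_precision":  [0.55, 0.50, 0.65, 0.58],
}

for name, scores in metric_scores.items():
    rho, pval = spearmanr(human_scores, scores)
    print(f"{name}: Spearman rho={rho:.2f} (p={pval:.3f})")
```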
“…Early works on translating text descriptions to human motion adopt a deterministic encoder-decoder architecture [2, 16]. Since motion is inherently stochastic, recent works have instead used deep generative models such as GANs, VAEs [18, 42], or diffusion models [47, 61, 77] to generate motions. Note that these motion generation methods are trained on large motion datasets and are typically limited to human motion generation.…”
Section: Motion Generation
confidence: 99%
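As a rough illustration of why stochastic models yield diversity where a deterministic encoder-decoder cannot, the sketch below decodes two different latent samples for the same text embedding and obtains two different motions. This is a hypothetical VAE-style decoder, not the architecture of any cited work.

```python
# Minimal sketch of stochastic (VAE-style) motion decoding: sampling different
# latent codes z for the same text yields diverse motions. Hypothetical names.
import torch
import torch.nn as nn

class TextConditionedVAEDecoder(nn.Module):
    def __init__(self, text_dim=512, latent_dim=256, pose_dim=263, num_frames=60):
        super().__init__()
        self.num_frames, self.pose_dim = num_frames, pose_dim
        self.net = nn.Sequential(
            nn.Linear(text_dim + latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, num_frames * pose_dim),
        )

    def forward(self, text_emb, z):
        out = self.net(torch.cat([text_emb, z], dim=-1))
        return out.view(-1, self.num_frames, self.pose_dim)

decoder = TextConditionedVAEDecoder()
text_emb = torch.randn(1, 512)
# Two different latent samples -> two different motions for the same text.
m1 = decoder(text_emb, torch.randn(1, 256))
m2 = decoder(text_emb, torch.randn(1, 256))
```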
“…We use the official implementation of the Human Motion Generation Model [61] to generate motion sequences. It uses the CLIP model to encode natural-language descriptions and a transformer denoiser over the 263-dimensional feature space proposed in [18]. The generated features are then rendered into an SMPL mesh [30].…”
Section: A. Implementation Details
confidence: 99%
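The pipeline this excerpt describes can be sketched schematically: a text embedding conditions a transformer denoiser operating on 263-dimensional motion features. The sketch below substitutes a random tensor for the CLIP embedding and shows a single denoising step; the official implementation uses a full diffusion noise schedule, and all module names here are hypothetical.

```python
# Schematic sketch of a text-conditioned transformer denoiser over 263-d
# motion features. Not the official code; CLIP is replaced by a placeholder.
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    def __init__(self, feat_dim=263, model_dim=512, num_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, model_dim)
        layer = nn.TransformerEncoderLayer(model_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out_proj = nn.Linear(model_dim, feat_dim)

    def forward(self, noisy_motion, text_emb):
        # Prepend the text embedding as a conditioning token.
        tokens = torch.cat([text_emb.unsqueeze(1),
                            self.in_proj(noisy_motion)], dim=1)
        denoised = self.encoder(tokens)[:, 1:]   # drop the conditioning token
        return self.out_proj(denoised)           # predict the clean features

denoiser = MotionDenoiser()
text_emb = torch.randn(1, 512)           # stand-in for a CLIP text embedding
noisy = torch.randn(1, 60, 263)          # pure noise at the start of sampling
clean_pred = denoiser(noisy, text_emb)   # one denoising step; (1, 60, 263)
# A real sampler repeats this over a diffusion noise schedule, then converts
# the final 263-d features to joint positions and fits an SMPL mesh.
```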
“…
Dataset              Year  Size  Modalities                      Language  Availability
AV Letters [116]     2002  ∼19k  Video-Text                      English   Link
AV Digits [117]      2002  ∼5k   Video-Text                      English   Link
Aoyama Gakuin [118]  2017  ∼1k   Video-Text-Audio-Skeleton (2D)  Japanese  Not Available
P2PSTORY [119]       2018  ∼13k  Video-Text-Audio                Multiple  Link
AMASS [120]          2019  ∼18k  Video-Text-Skeleton (3D)        English   Link
BoLD [121]           2020  ∼10k  Video-Text-Audio-Skeleton (3D)  English   Link
PATS [122]           2020  ∼84k  Video-Text-Audio-Skeleton (2D)  English   Link
BABEL [123]          2021  ∼28k  Video-Text-Skeleton (3D)        English   Link
HumanML3D [124]      2022  ∼15k  Video-Text-Skeleton             English   Link
BEAT [125]           2023  ∼3k   Video-Text-Audio-Skeleton (3D)  Multiple  Link

…graphics, and HCI. It is usually an animated character that appears on a screen and can simulate various facial expressions, head actions, and speech with synchronized lip movements [17], [126], [127].…”
Section: Others
confidence: 99%