Cross-modal content generation has advanced rapidly in recent years, and a variety of methods have been proposed to produce high-quality, realistic content. Among these approaches, visual content generation has attracted significant attention from both academia and industry owing to its vast potential across applications. This survey provides an overview of recent advances in visual content generation conditioned on other modalities, such as text, audio, speech, and music, with a focus on their key contributions to the community. In addition, we summarize the publicly available datasets that can be used for training and benchmarking cross-modal visual content generation models, and we provide an in-depth exploration of the datasets used for audio-to-visual content generation, filling a gap in the existing literature. Commonly used evaluation metrics are introduced alongside the datasets. Furthermore, we discuss the challenges and limitations encountered in this area, such as modality alignment and semantic coherence. Finally, we outline possible future directions for synthesizing visual content from other modalities, including the exploration of new modalities and the development of multi-task, multi-modal networks. This survey serves as a resource for researchers seeking to quickly gain insight into this burgeoning field.