Rethinking Zero-Shot Learning: A Conditional Visual Classification Perspective

Li, Kai; Min, Martin Renqiang; Fu, Yun

doi:10.1109/iccv.2019.00368

Cited by 110 publications

(66 citation statements)

References 40 publications

(57 reference statements)


“…Originating from non-realistic rendering [18], image style transfer is closely related to texture synthesis [5,7,6]. Gatys et al [8] were the first to formulate style transfer as the matching of multi-level deep features extracted from a pre-trained deep neural network, which has been widely used in various tasks [20,21,22]. Lots of improvements have been proposed based on the works of Gatys et al [8].…”

Section: Related Work (mentioning)

confidence: 99%

Multimodal Style Transfer via Graph Cuts

Zhang, Chen, Wang, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite


Figure 1: Gram-matrix-based style transfer methods (AdaIN [11], WCT [24], and LST [23]) may fail to distinguish style patterns (1st and 2nd rows). Patch-swap-based methods (CNNMRF [19], DFR [10], and AvatarNet [36]) may copy some less desired style patterns (labeled with red arrows) to the results (3rd and 4th rows). Our MST alleviates all these limitations.

Abstract: An assumption widely used in recent neural style transfer methods is that image styles can be described by global statistics of deep features, such as Gram or covariance matrices. Alternative approaches have represented styles by decomposing them into local pixel or neural patches. Despite the recent progress, most existing methods treat the semantic patterns of the style image uniformly, resulting in unpleasing results on complex styles. In this paper, we introduce a more flexible and general universal style transfer technique: multimodal style transfer (MST). MST explicitly considers the matching of semantic patterns in content and style images. Specifically, the style image features are clustered into sub-style components, which are matched with local content features under a graph cut formulation. A reconstruction network is trained to transfer each sub-style and render the final stylized result. We also generalize MST to improve some existing methods. Extensive experiments demonstrate the superior effectiveness, robustness, and flexibility of MST.


“…In other words, it addresses multi-class learning problems when some classes do not have sufficient training data. However, during the learning process, additional visual and semantic features such as word embeddings [132], visual attributes [133], or descriptions [134] can be assigned to both seen and unseen classes. In the context of multimodality, a multimodal mapping scheme typically combines visual and semantic attributes using only data related to the seen classes.…”

Section: Zero-shot Learning (mentioning)

confidence: 99%

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

et al. 2021


The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.


“…The application of zero-shot learning to super-resolution, i.e., the ZSSR prescription [23], is among the most widely used models in super-resolution and has gained increasing interest recently. In addition, the majority of zero-shot methods that have provided a large number of excellent results in recent years are mainly based on segmentation [39], emotion recognition [40], object detection [41], image retrieval [42][43][44][45], image classification [46][47][48] and intelligent learning in machines or robots [49]. In the ZSSR formalism, LR images are downsampled to generate many lower-resolution images (I = I_0, I_1, I_2, ..., I_n), which serve as the HR supervision information called "HR fathers"; then, each HR father is downscaled by the required scale factor s to obtain the corresponding "LR sons.…”

Section: Zero-shot (mentioning)

confidence: 99%

A Single Historical Painting Super-Resolution via a Reference-Based Zero-Shot Network

Shi, Xu, Zhang, et al. 2021

IJCIS


As an important part of human cultural heritage, many ancient paintings have suffered from various deteriorations that have led to texture blurring, color fading, etc. Single image super-resolution (SISR), which aims to recover a high-resolution (HR) version from a low-resolution (LR) input, is actively engaged in the digital preservation of cultural relics. Currently, only traditional super-resolution is widely studied and used in cultural heritage, and it is difficult to apply learning-based SISR to unique historical paintings because of the absence of both ground truth and datasets. Fortunately, the recently proposed ZSSR method suggests that it is feasible to generate a small dataset and extract self-supervised information from a single image. However, when applied to the preservation of historical paintings, the performance of ZSSR is highly limited due to the lack of image knowledge. To address the above issues and to unleash the great potential of learning-based methods in heritage conservation, we present Ref-ZSSR, which is the first attempt to combine zero-shot and reference-based methods to achieve SISR. In our model, both global and local multi-scale similar information is fully exploited from the painting itself. In an end-to-end manner, this information provides consistent artistic style image knowledge and helps synthesize SR images with sharp texture details. Compared with the ZSSR method, our approach shows improvement in both precision (approximately 4.68 dB for scale ×2) and visual perception. It is worth mentioning that all image knowledge required in our method can be extracted from the painting itself, i.e., external examples are not required. Therefore, this approach can be easily generalized to any damaged historical paintings, broken murals, noisy old photos, incomplete artworks, etc.
