2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00510
MUSIQ: Multi-scale Image Quality Transformer

Cited by 231 publications (118 citation statements)
References 33 publications
“…As the input resolution increases, the performance improves, benefiting from its strong non-local capacity. Also, MaxViT shows better linear correlation compared to the SOTA method [41] which uses multi-resolution inputs.…”
Section: Image Aesthetic Assessment
confidence: 96%
“…Each image in the dataset has a histogram of scores associated with it, which we use as the ground truth label. Similar to [41,75], we split the dataset into train and test sets, such that 20% of the data is used for testing. We train MaxViT for three different input resolutions: 224 × 224, 384 × 384 and 512 × 512.…”
Section: B3 Image Aesthetics Assessment
confidence: 99%
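The evaluation protocol described in this excerpt (20% of the data held out for testing, with training runs at three input resolutions) can be sketched as follows. The function name, the use of Python's `random` module, and the seed are illustrative assumptions, not the cited authors' code.

```python
import random

def split_dataset(image_ids, test_fraction=0.2, seed=0):
    """Hypothetical helper mirroring the 80/20 train/test split
    described above; names and structure are assumptions."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]

# The three training resolutions mentioned for MaxViT.
RESOLUTIONS = [(224, 224), (384, 384), (512, 512)]

train_ids, test_ids = split_dataset(range(1000))
```

A fixed seed keeps the split reproducible across the three per-resolution training runs, so results at 224, 384, and 512 pixels are compared on the same held-out images.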
“…DBCNN [45] provided a dual bilinear network for NR-IQA. MUSIQ [46] also developed a transformer-based NR-IQA metric that exploits multi-scale information. Although numerous NR-IQA methods feature well-designed extractors and regressors, they largely neglect the specific textural and structural degradation caused by image SR.…”
Section: B General Image Quality Assessment
confidence: 99%
“…Then, they borrowed the vision transformer (ViT) architecture to further extract the ResNet output features. J. Ke et al. directly applied the ViT module as the backbone for blind IQA [14]. They kept the image aspect ratio and used multi-scale images as the input.…”
Section: A Blind Image Quality Assessment
confidence: 99%
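The aspect-ratio-preserving, multi-scale input preparation this excerpt attributes to MUSIQ can be illustrated with a minimal sketch. The function name and the target sizes are hypothetical; this is not MUSIQ's actual implementation.

```python
def multiscale_sizes(width, height, target_longer_sides=(224, 384)):
    """Compute aspect-ratio-preserving (width, height) pairs, one per
    target longer side. Hypothetical sketch of multi-scale input prep."""
    sizes = []
    for target in target_longer_sides:
        # Scale so the longer side matches the target; the shorter side
        # follows proportionally, preserving the original aspect ratio.
        scale = target / max(width, height)
        sizes.append((max(1, round(width * scale)),
                      max(1, round(height * scale))))
    return sizes
```

For example, a 640 x 480 image resized so its longer side is 224 keeps its 4:3 ratio (224 x 168), rather than being squashed into a square crop as in a standard fixed-resolution ViT pipeline.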