BTS: A Bi-lingual Benchmark for Text Segmentation in the Wild

Xu, Xiping; Qi, Zhongang; Ma, Jianqi; Zhang, Honglun; Shan, Ying; Qie, Xiaohu

doi:10.1109/cvpr52688.2022.01856

Cited by 10 publications

(9 citation statements)

References 27 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In Ramesh et al [33] the authors demonstrate that applying the Diffusion Prior and conditioning over the resulting image embeddings attains improved diversity while enabling image variations, interpolations, and editing. Several works have adopted the use of a Diffusion Prior for text-guided video synthesis [13,43] and 3D generation and texturing [26,51]. The use of Diffusion Prior for text-guided synthesis is further analyzed in [1,53].…”

Section: Related Workmentioning

confidence: 99%

Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation

Richardson

Alaluf

Patashnik

et al. 2021

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

694

512

View full text Add to dashboard Cite

show abstract

Section: Related Workmentioning

confidence: 99%

Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation

Richardson

Alaluf

Patashnik

et al. 2021

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

694

512

View full text Add to dashboard Cite

show abstract

“…In [1], the authors put forward TextRnet and a new text segmentation dataset, where TextRnet utilizes the unique text prior such as texture diversity and non-convex contours to achieve state-of-the-art performance on text segmentation benchmarks. The authors of [7] mainly focus on Bi-Lingual text segmentation, they propose a Bi-Lingual text dataset along with PGTSNet which contains a plug-in text-highlighting module and a text perceptual module to help distinguish between text languages. Most recently, Textformer [8] leverages the similarities between text components by proposing a multi-level transformer framework to enhance the interaction between text components and image features at different granularities.…”

Section: Text Segmentation Methodsmentioning

confidence: 99%

Weakly-Supervised Text Instance Segmentation

Zu¹,

Yu²,

Li³

et al. 2023

Preprint

View full text Add to dashboard Cite

Text segmentation is a challenging vision task with many downstream applications. Current text segmentation methods require pixel-level annotations, which are expensive in the cost of human labor and limited in application scenarios. In this paper, we take the first attempt to perform weaklysupervised text instance segmentation by bridging text recognition and text segmentation. The insight is that text recognition methods provide precise attention position of each text instance, and the attention location can feed to both a text adaptive refinement head (TAR) and a text segmentation head. Specifically, the proposed TAR generates pseudo labels by performing two-stage iterative refinement operations on the attention location to fit the accurate boundaries of the corresponding text instance. Meanwhile, the text segmentation head takes the rough attention location to predict segmentation masks which are supervised by the aforementioned pseudo labels. In addition, we design a mask-augmented contrastive learning by treating our segmentation result as an augmented version of the input text image, thus improving the visual representation and further enhancing the performance of both recognition and segmentation. The experimental results demonstrate that the proposed method significantly outperforms weakly-supervised instance segmentation methods on ICDAR13-FST (18.95% improvement) and TextSeg (17.80% improvement) benchmarks.

show abstract

“…Diffusion model has emerged the most advanced deep generation model and has been applied in a wide range of fields, including image super resolution [RBL*22, SHC*22, DMH21], image inpainting [LDR*22, XZL*23], image editing [MHS*21, KZL*23, YGZ*23, ZHG*23], semantic segmentation [HAZ*22b, BRV*21, BKC*22, GMJS22], video generation [HNM*22, HCS*22, ZCP*22, QCZ*23], natural language processing [AJH*21, GLF*22, HKT22, LTG*22], point cloud completion [LWYL22, LH21,VWG*22, ZDW21] and multi‐modal generation [RLJ*23, TRG*22, PVG*21, SPH*22, ALF22, BNX*23, GCB*22, NDR*21, PJBM22, XWC*23, LGT*23], as well as interdisciplinary applications in fields such as and medical image reconstruction [CSY22, CY22, PGZ*22a, PGZ*22b]. Notably, in the area of high‐resolution image generation, the impact of diffusion models has surpassed that of GANs.…”

Section: Related Workmentioning

confidence: 99%

“…This is because diffusion models are much easier to train and offer better diversity. In addition, it can be observed that diffusion models deliver promising performance on multimodal data [RLJ*23, TRG*22, PVG*21, SPH*22, ALF22, BNX*23, GCB*22, NDR*21, PJBM22, XWC*23, LGT*23] and conditional generations [NSL*23, ZLW*23, BNX*23, GCB*22, NDR*21]. Moreover, several studies have introduced diffusion models into 3D geometry generations such as point cloud [VWG*22], while extending them to directly generate neural implicit representation remains difficult.…”

Section: Introductionmentioning

confidence: 99%

Learning to Generate and Manipulate 3D Radiance Field by a Hierarchical Diffusion Framework with CLIP Latent

Wang,

Zhang,

2023

Computer Graphics Forum

View full text Add to dashboard Cite

Abstract3D‐aware generative adversarial networks (GAN) are widely adopted in generating and editing neural radiance fields (NeRF). However, these methods still suffer from GAN‐related issues including degraded diversity and training instability. Moreover, 3D‐aware GANs consider NeRF pipeline as regularizers and do not directly operate with 3D assets, leading to imperfect 3D consistencies. Besides, the independent changes in disentangled editing cannot be ensured due to the sharing of some shallow hidden features in generators. To address these challenges, we propose the first purely diffusion‐based three‐stage framework for generative and editing tasks, with a series of well‐designed loss functions that can directly handle 3D models. In addition, we present a generalizable neural point field as our 3D representation, which explicitly disentangles geometry and appearance in feature spaces. For 3D data conversion, it simplifies the preparation pipeline of datasets. Assisted by the representation, our diffusion model can separately manipulate the shape and appearance in a hierarchical manner by image/text prompts that are provided by the CLIP encoder. Moreover, it can generate new samples by adding a simple generative head. Experiments show that our approach outperforms the SOTA work in the generative tasks of direct generation of 3D representations and novel image synthesis, and completely disentangles the manipulation of shape and appearance with correct semantic correspondence in the editing tasks.

show abstract

BTS: A Bi-lingual Benchmark for Text Segmentation in the Wild

Cited by 10 publications

References 27 publications

Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation

Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation

Weakly-Supervised Text Instance Segmentation

Learning to Generate and Manipulate 3D Radiance Field by a Hierarchical Diffusion Framework with CLIP Latent

Contact Info

Product

Resources

About