Transport-based distances, such as the Wasserstein distance and the earth mover's distance, have been shown to be effective tools in signal and image analysis. The success of transport-based distances is due in part to their Lagrangian nature, which allows them to capture the important variations in many signal classes. However, these distances require the signal to be nonnegative and normalized. Furthermore, the signals are considered as measures and compared by redistributing (transporting) them, which does not directly take into account the signal intensity. Here we study a transport-based distance, called the TLp distance, that combines Lagrangian and intensity modelling and is directly applicable to general, non-positive and multi-channelled signals. The distance can be computed by existing numerical methods. We give an overview of the basic properties of this distance and of its applications to classification, with multi-channelled non-positive one-dimensional signals and two-dimensional images, and to color transfer.
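For orientation, the TLp distance described above is usually written as an optimal transport problem whose cost couples spatial displacement with intensity mismatch. The formula below is a sketch recalled from the TLp literature, not taken from this abstract; the symbols f, g, \mu, \nu, \Omega and \Pi(\mu,\nu) are assumed notation.

```latex
% Assumed notation (not from the abstract): signals f, g : \Omega \to \mathbb{R}^m,
% base measures \mu, \nu on \Omega, and \Pi(\mu,\nu) the set of couplings of \mu and \nu.
d_{TL^p}\big((f,\mu),(g,\nu)\big)^p
  = \min_{\pi \in \Pi(\mu,\nu)} \int_{\Omega \times \Omega}
    \Big( |x-y|^p + |f(x)-g(y)|^p \Big) \, \mathrm{d}\pi(x,y)
```

Under this form, the first term penalizes moving mass in space (the Lagrangian part) and the second penalizes intensity differences. Because f and g enter only through the difference f(x) - g(y), they need not be nonnegative or normalized. Taking \mu = \nu and the identity coupling gives cost \|f-g\|_{L^p(\mu)}^p, so the distance is bounded above by the L^p distance in that case.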
Transport-based techniques for signal and data analysis have received increased attention recently. Given their ability to provide accurate generative models for signal intensities and other data distributions, they have been used in a variety of applications, including content-based retrieval, cancer detection, image super-resolution, and statistical machine learning, to name a few, and have been shown to produce state-of-the-art results in several of them. Moreover, the geometric characteristics of transport-related metrics have inspired new kinds of algorithms for interpreting the meaning of data distributions. Here we provide an overview of the mathematical underpinnings of mass transport-related methods, including numerical implementation, as well as a review, with demonstrations, of several applications. Software accompanying this tutorial is available at [134].
In this paper, we propose a generative model in the space of diffeomorphic deformation maps. More precisely, we utilize the Kantorovich-Wasserstein metric and its accompanying geometry to represent an image as a deformation from templates. Moreover, we incorporate a probabilistic viewpoint by assuming that each image is locally generated from a reference image. We capture the local structure by modelling the tangent planes at reference images. Once basis vectors for each tangent plane are learned via probabilistic PCA, we can sample a local coordinate that can be inverted back to image space exactly. With experiments on four different datasets, we show that the generative tangent plane model in the optimal transport (OT) manifold can be learned from small numbers of images and can be used to create infinitely many 'unseen' images. In addition, Bayesian classification combined with the probabilistic modeling of the tangent planes shows improved accuracy over classification done in the image space. Taken together, the results of our experiments support our claim that certain datasets can be better represented with the Kantorovich-Wasserstein metric. We envision that the proposed method could be a practical solution to learning and representing data that is generated from templates in situations where only limited numbers of data points are available.
Social networks, such as Twitter, form a heterogeneous information network (HIN) where nodes represent domain entities (e.g., user, content, advertiser, etc.) and edges represent one of many entity interactions (e.g., a user re-sharing content or "following" another). Interactions from multiple relation types can encode valuable information about social network entities not fully captured by a single relation; for instance, a user's preference for accounts to follow may depend on both user-content engagement interactions and the other users they follow. In this work, we investigate knowledge-graph embeddings for entities in the Twitter HIN (TwHIN); we show that these pretrained representations yield significant offline and online improvement for a diverse range of downstream recommendation and classification tasks: personalized ads ranking, account follow recommendation, offensive content detection, and search ranking. We discuss design choices and practical challenges of deploying industry-scale HIN embeddings, including compressing them to reduce end-to-end model latency and handling parameter drift across versions.
We present TwHIN-BERT, a multilingual language model trained on in-domain data from the popular social network Twitter. TwHIN-BERT differs from prior pre-trained language models in that it is trained not only with text-based self-supervision, but also with a social objective based on the rich social engagements within a Twitter heterogeneous information network (TwHIN). Our model is trained on 7 billion tweets covering over 100 distinct languages, providing a valuable representation to model short, noisy, user-generated text. We evaluate our model on a variety of multilingual social recommendation and semantic understanding tasks and demonstrate significant metric improvement over established pretrained language models. We will freely open-source TwHIN-BERT and our curated hashtag prediction and social engagement benchmark datasets to the research community.
We present our HABERTOR model for detecting hatespeech in large-scale user-generated content. Inspired by the recent success of the BERT model, we propose several modifications to BERT to enhance the performance on the downstream hatespeech classification task. HABERTOR inherits BERT's architecture, but differs in four aspects: (i) it generates its own vocabularies and is pre-trained from scratch using the largest-scale hatespeech dataset; (ii) it consists of Quaternion-based factorized components, resulting in a much smaller number of parameters, faster training and inference, as well as less memory usage; (iii) it uses our proposed multi-source ensemble heads with a pooling layer for separate input sources, to further enhance its effectiveness; and (iv) it uses a regularized adversarial training with our proposed fine-grained and adaptive noise magnitude to enhance its robustness. Through experiments on a large-scale real-world hatespeech dataset with 1.4M annotated comments, we show that HABERTOR works better than 15 state-of-the-art hatespeech detection methods, including fine-tuned language models. In particular, compared with BERT, our HABERTOR is 4∼5 times faster in the training/inference phase, uses less than 1/3 of the memory, and has better performance, even though we pretrain it using less than 1% of the number of words. Our generalizability analysis shows that HABERTOR transfers well to other unseen hatespeech datasets and is a more efficient and effective alternative to BERT for hatespeech classification.
Phage Vardy is a lytic siphovirus isolated from creek soil in Cullowhee, NC, using Gordonia rubripertincta NRRL B-16540. Vardy’s 60,144-bp genome contains 90 predicted genes and five copies of a 50-bp motif that may regulate gene expression. Based on gene content similarity, Vardy is assigned to cluster DJ.