Learning Universal Authorship Representations

Rivera-Soto, Rafael A.; Miano, Olivia Elizabeth; Ordonez, Juanita; Chen, Barry Y.; Khan, Aleem I.; Bishop, Marcus; Andrews, Nicholas

doi:10.18653/v1/2021.emnlp-main.70

Cited by 11 publications

(8 citation statements)

References 13 publications

“…To evaluate authorship style transfer, we adopt the Confusion metric from the evaluation framework defined by Patel, Andrews, and Callison-Burch (2022), where the authors utilize pretrained style embedders (Wegmann, Schraagen, and Nguyen 2022; Rivera-Soto et al 2021) to measure style transfer success. Confusion, which is similar to style transfer accuracy, is the percentage of the time that the style transfer output is closer to the target author than the source author in representational embedding space.…”

Section: Authorship Style Transfer (mentioning)

confidence: 99%
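
The Confusion metric described in the statement above can be illustrated with a minimal sketch. It assumes the style embeddings have already been computed by some embedder and are passed in as plain lists of floats. The use of cosine distance, and the names `cosine_distance` and `confusion`, are assumptions for illustration; the quoted text only says the output should be "closer" to the target author than to the source author.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed distance measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def confusion(output_embs, source_embs, target_embs):
    """Fraction of transfer outputs that lie closer to the target author
    than to the source author in the embedding space.

    Each argument is a list of vectors; item i of each list belongs to the
    same transfer example (hypothetical input layout).
    """
    if not output_embs:
        raise ValueError("need at least one example")
    if not (len(output_embs) == len(source_embs) == len(target_embs)):
        raise ValueError("all three lists must have the same length")

    hits = 0
    for out, src, tgt in zip(output_embs, source_embs, target_embs):
        # Count the example when the output sits nearer the target author.
        if cosine_distance(out, tgt) < cosine_distance(out, src):
            hits += 1
    return hits / len(output_embs)
```

Multiplying the returned fraction by 100 gives the percentage form used in the quoted description.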

“…We compute the above metrics for both Style Embeddings (Wegmann, Schraagen, and Nguyen 2022) and Universal Authorship Representations (UAR) (Rivera-Soto et al 2021). Similar to our external style classifier for attribute transfer, UAR provides a holdout embedding space that PARAGUIDE does not directly optimize at inference time.…”

Section: Authorship Style Transfer (mentioning)

confidence: 99%

ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer

Horvitz, Patel, Callison-Burch et al. 2024

AAAI

Textual style transfer is the task of transforming stylistic properties of text while preserving meaning. Target "styles" can be defined in numerous ways, ranging from single attributes (e.g. formality) to authorship (e.g. Shakespeare). Previous unsupervised style-transfer approaches generally rely on significant amounts of labeled data for only a fixed set of styles or require large language models. In contrast, we introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles at inference time. Our parameter-efficient approach, ParaGuide, leverages paraphrase-conditioned diffusion models alongside gradient-based guidance from both off-the-shelf classifiers and strong existing style embedders to transform the style of text while preserving semantic information. We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer.

“…The AAVC is a widely recognized corpus specifically designed for authorship verification studies, as evidenced by its utilization in various studies (Boenninghoff et al 2019; Halvani et al 2020; Ishihara 2023; Rivera-Soto et al 2021). Certain aspects of the data, such as genre and document length, are well-controlled.…”

Section: Database (mentioning)

confidence: 99%

Validation in Forensic Text Comparison: Issues and Opportunities

Ishihara, Kulkarni, Carne et al. 2024

Languages

It has been argued in forensic science that the empirical validation of a forensic inference system or methodology should be performed by replicating the conditions of the case under investigation and using data relevant to the case. This study demonstrates that the above requirement for validation is also critical in forensic text comparison (FTC); otherwise, the trier-of-fact may be misled for their final decision. Two sets of simulated experiments are performed: one fulfilling the above validation requirement and the other overlooking it, using mismatch in topics as a case study. Likelihood ratios (LRs) are calculated via a Dirichlet-multinomial model, followed by logistic-regression calibration. The derived LRs are assessed by means of the log-likelihood-ratio cost, and they are visualized using Tippett plots. Following the experimental results, this paper also attempts to describe some of the essential research required in FTC by highlighting some central issues and challenges unique to textual evidence. Any deliberations on these issues and challenges will contribute to making a scientifically defensible and demonstrably reliable FTC available.
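
The log-likelihood-ratio cost mentioned in the abstract above is commonly defined as the average of two penalty terms, one over same-author comparisons and one over different-author comparisons. The following is a minimal sketch of that common definition, not the paper's own code. The function name `cllr` and the input layout (two lists of positive likelihood ratios) are assumptions for illustration.

```python
import math


def cllr(same_author_lrs, different_author_lrs):
    """Log-likelihood-ratio cost for a set of likelihood ratios (LRs).

    same_author_lrs: LRs from comparisons where the authors truly match.
    different_author_lrs: LRs from comparisons where they truly differ.
    All LRs must be strictly positive.
    """
    if not same_author_lrs or not different_author_lrs:
        raise ValueError("both LR lists must be non-empty")

    # Same-author comparisons are penalized when the LR is small.
    same_penalty = sum(math.log2(1.0 + 1.0 / lr) for lr in same_author_lrs)
    same_penalty /= len(same_author_lrs)

    # Different-author comparisons are penalized when the LR is large.
    diff_penalty = sum(math.log2(1.0 + lr) for lr in different_author_lrs)
    diff_penalty /= len(different_author_lrs)

    return 0.5 * (same_penalty + diff_penalty)
```

Under this definition a system that always outputs LR = 1, and so carries no information, scores exactly 1, and lower values indicate better performance.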

“…These approaches encompass TF-IDF-based clustering and classification techniques (Agarwal et al, 2019; İzzet Bozkurt et al, 2007), conventional convolutional neural networks (CNNs) (Rhodes, 2015; Shrestha et al, 2017), recurrent neural networks (RNNs) (Zhao et al, 2018; Jafariakinabad et al, 2019; Gupta et al, 2019), and contextualized transformers (Fabien et al, 2020a; Ordoñez et al, 2020; Uchendu et al, 2020; Barlas and Stamatatos, 2021). Moreover, researchers have recently demonstrated the effectiveness of contrastive learning approaches (Gao et al, 2022) for authorship tasks (Rivera-Soto et al, 2021; Ai et al, 2022). These advancements have led to applications in style representational approaches (Hay et al, 2020; Zhu and Jurgens, 2021; Wegmann et al, 2022), which currently represent the state-of-the-art (SOTA) for authorship tasks.…”

Section: Authorship Attribution In NLP (mentioning)

confidence: 99%

IDTraffickers: An Authorship Attribution Dataset to link and connect Potential Human-Trafficking Operations on Text Escort Advertisements

Saxena, Ashpole, van Dijck et al. 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Human trafficking (HT) is a pervasive global issue affecting vulnerable individuals, violating their fundamental human rights. Investigations reveal that many HT cases are associated with online advertisements (ads), particularly in escort markets. Consequently, identifying and connecting HT vendors has become increasingly challenging for Law Enforcement Agencies (LEAs). To address this issue, we introduce IDTraffickers, an extensive dataset consisting of 87,595 text ads and 5,244 vendor labels to enable the verification and identification of potential HT vendors on online escort markets. To establish a benchmark for authorship identification, we train a DeCLUTR-small model, achieving a macro-F1 score of 0.8656 in a closed-set classification environment. Next, we leverage the style representations extracted from the trained classifier to conduct authorship verification, resulting in a mean r-precision score of 0.8852 in an open-set ranking environment. Finally, to encourage further research and ensure responsible data sharing, we plan to release IDTraffickers for the authorship attribution task to researchers under specific conditions, considering the sensitive nature of the data. We believe that the availability of our dataset and benchmarks will empower future researchers to utilize our findings, thereby facilitating the effective linkage of escort ads and the development of more robust approaches for identifying HT indicators.
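
The mean r-precision reported in the abstract above can be illustrated with the standard information-retrieval definition: for a query with R relevant items in the candidate pool, R-precision is the share of relevant items among the top R ranked results. This is a minimal sketch under that definition and may differ from the paper's exact evaluation protocol. The function names and the input layout (candidate vendor labels already sorted from most to least similar) are assumptions for illustration.

```python
def r_precision(ranked_labels, query_label):
    """R-precision for one query.

    ranked_labels: labels of the candidates, sorted from most to least
        similar to the query (hypothetical input layout).
    query_label: the true label of the query.
    """
    # R is the number of candidates that share the query's label.
    r = sum(1 for label in ranked_labels if label == query_label)
    if r == 0:
        return 0.0
    # Count how many of the top R candidates are relevant.
    relevant_in_top_r = sum(
        1 for label in ranked_labels[:r] if label == query_label
    )
    return relevant_in_top_r / r


def mean_r_precision(rankings):
    """Average R-precision over (ranked_labels, query_label) pairs."""
    if not rankings:
        raise ValueError("need at least one query")
    scores = [r_precision(ranked, label) for ranked, label in rankings]
    return sum(scores) / len(scores)
```

For example, a query with two relevant candidates ranked first and third has one relevant item in its top two results, so its R-precision is 0.5.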
