Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.341

Progressive Generation of Long Text with Pretrained Language Models

Abstract: Large-scale language models (LMs) pretrained on massive corpora of text, such as GPT-2, are powerful open-domain text generators. However, as our systematic examination reveals, it is still challenging for such models to generate coherent long passages of text (e.g., 1000 tokens), especially when the models are fine-tuned to the target domain on a small corpus. Previous planning-then-generation methods also fall short of producing such long text in various domains. To overcome the limitations, we propose a sim…
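The long-generation setting the abstract describes can be reproduced with off-the-shelf tooling. The sketch below is not the paper's progressive method; it is a minimal baseline illustration, assuming the HuggingFace transformers library and the public gpt2 checkpoint, that samples a roughly 1000-token passage from a (possibly domain fine-tuned) GPT-2, which is the regime where the abstract reports coherence breaking down.

```python
# Minimal sketch (assumption: HuggingFace `transformers` with the public
# `gpt2` checkpoint; swap in a domain fine-tuned checkpoint to match the
# small-corpus setting the abstract describes).
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The committee released its findings on"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample up to GPT-2's 1024-token context window, i.e. the ~1000-token
# passages the abstract refers to; nucleus sampling reduces looping.
output = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    max_length=1024,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```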

Cited by 42 publications (29 citation statements)
References 28 publications

“…Sequence length is one of the characteristics of hexons that added difficulties to its modeling. As reported before, generating long (∼1000s tokens) and coherent texts in a specific small domain is challenging even for fine-tuned large language models like GPT2 (Holtzman et al, 2019; Tan et al, 2020), and generated texts typically suffer from degenerate repetition. To evaluate if the generated sequences can avoid the degenerate repetition artifacts while capturing certain local repetitive patterns observed in natural sequences (Jorda et al, 2010), number of repeated amino acids was calculated in a fixed-length window sliding across all possible positions in each sequence (Figure 2c).…”
Section: Results (mentioning)
confidence: 99%
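The repetition check quoted above is straightforward to approximate. The sketch below is an assumption-laden reconstruction, not the citing paper's exact procedure: the window length (20) and the counting rule (count of the most frequent residue per window) are illustrative choices, since the quote only specifies a fixed-length sliding window. Generated and natural sequences can then be compared by plotting these per-window scores.

```python
from collections import Counter

# Hypothetical window length; the quoted passage does not state the value used.
WINDOW = 20

def max_repeat_per_window(seq: str, window: int = WINDOW) -> list[int]:
    """Slide a fixed-length window across the sequence and record, for each
    start position, how often the most frequent residue occurs inside it.
    Values near the window length flag degenerate repetition; small values
    are compatible with natural local repeats."""
    return [
        max(Counter(seq[start:start + window]).values())
        for start in range(len(seq) - window + 1)
    ]

# A collapsed, repetitive sample vs. a varied one of the same length.
print(max(max_repeat_per_window("A" * 30)))                          # 20
print(max(max_repeat_per_window("ACDEFGHIKLMNPQRSTVWYACDEFGHIKL")))  # 1
```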
“…Early approaches to automatic story generation relied on graph-based planning and hand-crafted rules to structure narratives (Meehan, 1977; Callaway and Lester, 2002; Riedl and Young, 2004; Li et al, 2013). More recent works generate stories by finetuning on large-scale PLMs (See et al, 2019) to improve its fluency and incorporating structured knowledge such as planned events (Fang et al, 2021; Li et al, 2022), summaries (Yao et al, 2019; Tan et al, 2021; Sun et al, 2020), and external knowledge (Guan et al, 2019; Xu et al, 2020b; Guan et al, 2020) to enhance its coherence and consistency. Our story generation models are also finetuned on the large-scale PLMs to generate text following the given summaries.…”
Section: Related Work (mentioning)
confidence: 99%
“…Large-scale pre-trained language models (PLMs) have demonstrated a remarkable aptitude for generating text with an exceptional degree of fluency and structure (Tan et al, 2021), sparking renewed efforts to utilize them for the purpose of generating narrative fiction. Recent work has explored various ways of controlling PLMs, using sentiment (Luo et al, 2019), style (Kong et al, 2021a), and even character information (Liu et al, 2020a), in an attempt to cater the generated text to an author's intentions.…”
Section: Introduction (mentioning)
confidence: 99%
“…Krishna et al (2021) shows that this approach performs well for text classification and protects against membership inference attacks. However, in narrow domains such as legal contracts, maintaining internal coherency is important for information retrieval tasks, but generating long coherent texts is still a challenging NLP task (Tan et al, 2021).…”
Section: Preserving Privacy In Texts (mentioning)
confidence: 99%