We present CoTexT, a pretrained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pretrained on large programming language corpora to learn a general understanding of language and code. CoTexT supports downstream NL-PL tasks such as code summarization/documentation, code generation, defect detection, and code debugging. We train CoTexT on different combinations of available PL corpora, including both "bimodal" and "unimodal" data. Here, bimodal data is the combination of text and corresponding code snippets, whereas unimodal data is merely code snippets. We first evaluate CoTexT with multi-task learning: we perform Code Summarization on 6 different programming languages and Code Refinement on both the small and medium-sized datasets featured in CodeXGLUE. We further conduct extensive experiments to investigate CoTexT on other tasks within the CodeXGLUE benchmark, including Code Generation and Defect Detection. We consistently achieve state-of-the-art results on these tasks, demonstrating the versatility of our models.
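As a rough illustration of how such a text-to-text model can be queried for code summarization, the sketch below uses the HuggingFace transformers API; the checkpoint name and task prefix are placeholders assumed for illustration only, not the released CoTexT recipe.

```python
# Illustrative sketch (not the authors' released code): querying a T5-style
# encoder-decoder model for code-to-text summarization via HuggingFace.
from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "cotext-checkpoint-name"  # assumed placeholder, not a verified model id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# Multi-task text-to-text framing: the task/language is signalled in the input
# text (prefix assumed here), and the summary is generated as plain text.
code_snippet = "def add(a, b):\n    return a + b"
inputs = tokenizer("python: " + code_snippet, return_tensors="pt")
summary_ids = model.generate(**inputs, max_length=48)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```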
Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages other than English, such as Vietnamese. In this paper, we make use of a state-of-the-art English-Vietnamese translation model to translate and produce both pretrained and supervised data in the biomedical domain. Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained encoder-decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubmedT5 demonstrates state-of-the-art results on two different biomedical benchmarks in summarization and acronym disambiguation. Further, we release ViMedNLI, a new Vietnamese NLP benchmark translated from MedNLI using the recently released English-Vietnamese translation model and carefully refined by human experts, together with evaluations of existing methods against ViPubmedT5.
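One plausible way to evaluate an encoder-decoder model such as ViPubmedT5 on ViMedNLI is to cast natural language inference as text-to-text generation, as sketched below; the checkpoint name and input formatting are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch, assuming a text-to-text framing of NLI for a seq2seq model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "vipubmedt5-checkpoint"  # assumed placeholder, not a verified model id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

premise = "Bệnh nhân không sốt."        # "The patient has no fever."
hypothesis = "Bệnh nhân đang sốt cao."  # "The patient has a high fever."
text = f"mednli: premise: {premise} hypothesis: {hypothesis}"  # assumed format

inputs = tokenizer(text, return_tensors="pt")
label_ids = model.generate(**inputs, max_length=8)
# The model would emit a label such as "entailment", "neutral", or
# "contradiction" as plain text.
print(tokenizer.decode(label_ids[0], skip_special_tokens=True))
```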
We present ViT5, a pretrained Transformer-based encoder-decoder model for the Vietnamese language. With T5-style self-supervised pretraining, ViT5 is trained on a large corpus of high-quality and diverse Vietnamese texts. We benchmark ViT5 on two downstream text generation tasks, Abstractive Text Summarization and Named Entity Recognition. Although Abstractive Text Summarization has been widely studied for the English language thanks to its rich and large source of data, there has been minimal research into the same task in Vietnamese, a much lower-resource language. In this work, we perform exhaustive experiments on both Vietnamese Abstractive Summarization and Named Entity Recognition, validating the performance of ViT5 against many other pretrained Transformer-based encoder-decoder models. Our experiments show that ViT5 significantly outperforms existing models and achieves state-of-the-art results on Vietnamese Text Summarization. On the task of Named Entity Recognition, ViT5 is competitive against previous best results from pretrained encoder-based Transformer models. Further analysis shows the importance of context length during self-supervised pretraining for downstream performance across different settings.
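A minimal inference sketch for Vietnamese abstractive summarization with a ViT5-style checkpoint is shown below; the model identifier and generation settings are assumptions and should be checked against the released artifacts.

```python
# Minimal sketch: abstractive summarization with a ViT5-style checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "VietAI/vit5-base"  # assumed identifier, verify before use
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

document = "..."  # a Vietnamese article or long document to summarize
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```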
Data clustering plays a significant role in the biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying levels of cluster specificity. For example, there are many subtypes of leukocytes (e.g., T cells) whose individual preponderance and phenotype must be assessed for statistical/functional significance. In this report, we introduce a novel hierarchical density clustering algorithm (HAL-x) that uses supervised linkage methods to build a cluster hierarchy on raw single-cell data. With this new approach, HAL-x can quickly predict multiple sets of labels for very large datasets, achieving a considerable improvement in computational efficiency over existing methods. We also show that cell clusters generated by HAL-x yield near-perfect F1-scores when classifying different clinical statuses based on single-cell profiles. Our hierarchical density clustering algorithm achieves high accuracy in single-cell classification in a scalable, tunable, and rapid manner. We make HAL-x publicly available at: https://pypi.org/project/hal-x/
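The sketch below illustrates the general idea of producing multiple label sets of varying specificity from a single cluster hierarchy (over-cluster the cells, build a hierarchy over the fine clusters, then cut it at several depths); it is a toy example using scikit-learn and SciPy, not the HAL-x implementation itself.

```python
# Illustrative sketch of coarse-to-fine labels from one cluster hierarchy
# (NOT the HAL-x code; see https://pypi.org/project/hal-x/ for the package).
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))  # stand-in for single-cell feature vectors

# Step 1: fine-grained over-clustering of individual cells.
fine_labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(X)

# Step 2: build a hierarchy over the fine-cluster centroids.
centroids = np.vstack([X[fine_labels == k].mean(axis=0) for k in range(50)])
Z = linkage(centroids, method="ward")

# Step 3: cut the hierarchy at several depths to get multiple label sets.
for n_clusters in (5, 10, 25):
    coarse_of_fine = fcluster(Z, t=n_clusters, criterion="maxclust")
    cell_labels = coarse_of_fine[fine_labels]  # map each cell to a coarse cluster
    print(n_clusters, np.unique(cell_labels).size)
```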
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.