Using Document Similarity Methods to create Parallel Datasets for Code Translation

Agarwal, Mayank; Talamadupula, Kartik; Martínez, Fernando J.; Houde, Stephanie; Müller, Michael; Richards, John R.; Ross, Steven I.; Weisz, Justin D.

doi:10.48550/arxiv.2110.05423

Cited by 1 publication

(2 citation statements)

References 22 publications


“…Zhao et al [28] also explored data augmentation in neural machine translation to improve dataset diversification. Notably, Agarwal et al [29] proposed using document similarity methods to create noisy parallel datasets of code, enabling the advancement of machine translation with monolingual datasets.…”

Section: Dataset Synthesis (mentioning)

confidence: 99%


A Transformer-based Approach for Translating Natural Language to Bash Commands

Teng, White, et al. 2021

2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA)


Translating natural language into Bash Commands is an emerging research field that has gained attention in recent years. Most efforts have focused on producing more accurate translation models. To the best of our knowledge, only two datasets are available, with one based on the other. Both datasets involve scraping known data sources (platforms such as Stack Overflow, crowdsourcing, etc.) and hiring experts to validate and correct either the English text or the Bash Commands. This paper provides two contributions to research on synthesizing Bash Commands from scratch. First, we describe a state-of-the-art translation model used to generate Bash Commands from the corresponding English text. Second, we introduce a new NL2CMD dataset that is automatically generated, involves minimal human intervention, and is over six times larger than prior datasets. Since the generation pipeline does not rely on existing Bash Commands, the distribution and types of commands can be custom adjusted. We evaluate the performance of ChatGPT on this task and discuss the potential of using it as a data generator. Our empirical results show how the scale and diversity of our dataset can offer unique opportunities for semantic parsing researchers.


“…With dataset generation, transformer-based models have proven effective for parallel corpus mining in the domain of machine translation [30]. Previous research has tried using classification techniques, such as document similarity [29], to identify translations from pre-existing corpora.…”

Section: Dataset Synthesis (mentioning)

confidence: 99%

