2021
DOI: 10.48550/arxiv.2110.05423
Preprint

Using Document Similarity Methods to create Parallel Datasets for Code Translation

Abstract: Translating source code from one programming language to another is a critical, time-consuming task in modernizing legacy applications and codebases. Recent work in this space has drawn inspiration from the software naturalness hypothesis by applying natural language processing techniques towards automating the code translation task. However, due to the paucity of parallel data in this domain, supervised techniques have only been applied to a limited set of popular programming languages. To bypass this limitat…

Cited by 1 publication (2 citation statements)
References 22 publications
“…Zhao et al [28] also explored data augmentation in neural machine translation to improve dataset diversification. Notably, Agarwal et al [29] proposed using document similarity methods to create noisy parallel datasets of code, enabling the advancement of machine translation with monolingual datasets.…”
Section: Dataset Synthesis (mentioning)
confidence: 99%
“…With dataset generation, transformer-based models have proven effective for parallel corpus mining in the domain of machine translation [30]. Previous research has tried using classification techniques, such as document similarity [29], to identify translations from pre-existing corpora.…”
Section: Dataset Synthesis (mentioning)
confidence: 99%