2021
DOI: 10.48550/arxiv.2110.06773
Preprint
Leveraging Automated Unit Tests for Unsupervised Code Translation

Abstract: With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural language…

Cited by 4 publications (6 citation statements); references 31 publications (44 reference statements).
“…There are many intolerable errors in the results of machine translation, which lead to compilation failures. Roziere et al. [34] proposed using automated unit tests to filter out faulty generated results. In this way, a parallel corpus can be obtained for fine-tuning the unsupervised translation model.…”
Section: Unsupervised Program Translation
confidence: 99%
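The filtering idea described in the statement above can be sketched in a few lines. This is a toy illustration, not the authors' actual pipeline: the candidate functions and the test pairs are hypothetical stand-ins, and the tests play the role of the automatically generated unit tests that the original source program is assumed to pass.

```python
# Sketch of unit-test filtering: keep only machine-generated translations
# that pass every automatically generated unit test, yielding
# (source, translation) pairs usable as a parallel corpus for fine-tuning.

def filter_by_tests(source, candidates, tests):
    """Return (source, candidate) pairs whose candidate passes every test.

    tests is a list of (args, expected_output) pairs, assumed to have been
    produced by running the original source program on generated inputs."""
    parallel_pairs = []
    for cand in candidates:
        try:
            ok = all(cand(*args) == expected for args, expected in tests)
        except Exception:  # a crash counts as a failed translation
            ok = False
        if ok:
            parallel_pairs.append((source, cand))
    return parallel_pairs

# Toy example: two candidate translations of an absolute-value function.
good = lambda x: x if x >= 0 else -x   # faithful translation
bad = lambda x: x                      # drops the negative branch
tests = [((3,), 3), ((-5,), 5), ((0,), 0)]
kept = filter_by_tests("abs_value source", [good, bad], tests)
# Only the faithful translation survives the test filter.
```

The faulty candidate fails on the input `-5`, so only the passing translation is paired with the source and added to the fine-tuning corpus.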
“…Their approach achieves outstanding effectiveness. Later on, they presented DOBF (Rozière et al. 2021a) and TransCoder-ST (Rozière et al. 2021b): the former pretrains a sequence-to-sequence model to revert a code obfuscation function, while the latter uses automatic test generation to select high-quality translation pairs for fine-tuning the pre-trained model. These works use Computational Accuracy (CA), a measure for evaluating translated code based on the ratio of test cases on which the input program and its translation produce similar outputs.…”
Section: Code Translation
confidence: 99%
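The Computational Accuracy metric mentioned above can be illustrated directly: the fraction of test inputs on which the translated program reproduces the original program's output. This is a minimal sketch under that reading of the metric; the functions and inputs are toy stand-ins, not the benchmark's actual programs.

```python
# Computational Accuracy (CA) sketch: the ratio of test inputs on which the
# translation's output matches the original program's output.

def computational_accuracy(original, translated, test_inputs):
    """Fraction of test inputs where both programs agree."""
    matches = 0
    for args in test_inputs:
        try:
            if translated(*args) == original(*args):
                matches += 1
        except Exception:
            pass  # a runtime error in the translation counts as a mismatch
    return matches / len(test_inputs)

original = lambda n: sum(range(n + 1))     # sum of 0..n, the source program
translated = lambda n: n * (n + 1) // 2    # correct closed-form translation
buggy = lambda n: n * (n - 1) // 2         # off-by-one translation

inputs = [(0,), (1,), (5,), (10,)]
computational_accuracy(original, translated, inputs)  # 1.0
computational_accuracy(original, buggy, inputs)       # 0.25 (agrees only at n=0)
```

Unlike exact-match or BLEU-style comparisons of the code text, CA rewards any translation that is behaviorally equivalent on the tests, even if it is written very differently from the source.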
“…For example, the neural code translation task employs machine translation technology to automatically translate Java into C#, reducing manual translation work for programmers [1][2][3]. Some previous approaches concentrate on improving performance by refining code representations, such as the pre-trained model CodeBERT [6].…”
Section: Introduction
confidence: 99%
“…In recent years, we have witnessed a dramatic rise in applying deep source code processing models (a.k.a. code models) to source code processing tasks [1][2][3][4][5].…”
Section: Introduction
confidence: 99%