Ming Gong scite author profile

Design and construction of the BESIII detector

Ablikim

An

Bai

et al. 2010

Nuclear Instruments and Methods in Physics Research Section A:

1,036

509

This paper will discuss the design and construction of BESIII [1], which is designed to study physics in the τ-charm energy region utilizing the new high luminosity BEPCII double ring e⁺e⁻ collider [2]. The expected performance will be given based on Monte Carlo simulations and results of cosmic ray and beam tests. In BESIII, tracking and momentum measurements for charged particles are made by a cylindrical multilayer drift chamber in a 1 T superconducting solenoid. Charged particles are identified with a time-of-flight system based on plastic scintillators in conjunction with dE/dx (energy loss per unit pathlength) measurements in the drift chamber. Energies of electromagnetic showers are measured by a CsI(Tl) crystal calorimeter located inside the solenoid magnet. Muons are identified by arrays of resistive plate chambers in the steel magnetic flux return. The level 1 trigger system, Data Acquisition system and the event filter system based on networked computers will also be described.

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training

Duan

Fang

et al. 2020

AAAI

606

379

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained models, such as XLM (Lample and Conneau 2019) and Unicoder (Huang et al. 2019), we feed both visual and linguistic contents into a multi-layer Transformer (Vaswani et al. 2017) for cross-modal pre-training, where three pre-training tasks are employed: Masked Language Modeling (MLM), Masked Object Classification (MOC), and Visual-linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual contents jointly. The last task tries to predict whether an image and a text describe each other. After pre-training on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer. We achieve state-of-the-art or comparable results on both tasks and show the powerful ability of cross-modal pre-training.

CodeBERT: A Pre-Trained Model for Programming and Natural Languages

Feng

Guo

Tang

et al. 2020

Preprint

185

306

We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both "bimodal" data of NL-PL pairs and "unimodal" data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

Lu

Guo

Ren

et al. 2021

Preprint

162

Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems.
