Learning lexical features of programming languages from imagery using convolutional neural networks

Ott, Jordan; Atchison, Abigail; Harnack, Paul; Best, Natalie; Anderson, Haley; Firmani, Cristiano; Linstead, Erik

doi:10.1145/3196321.3196359

Cited by 28 publications

(18 citation statements)

References 13 publications


“…We are currently working on curating additional labeled data for a variety of programming languages, including C++, and R and have begun an initial exploration into differentiating between Python and Java code samples embedded in digital images through a model that can differentiate between multiple languages while learning lexical features in the process [28]. Using this data we will train an ensemble of classifiers for identifying these languages in video and images.…”

Section: Discussion (mentioning)

confidence: 99%

A deep learning approach to identifying source code in images and video

Ott, Atchison, Harnack, et al. 2018

Proceedings of the 15th International Conference on Mining Software Repositories

Self Cite


While substantial progress has been made in mining code on an Internet scale, efforts to date have been overwhelmingly focused on data sets where source code is represented natively as text. Large volumes of source code available online and embedded in technical videos have remained largely unexplored, due in part to the complexity of extraction when code is represented with images. Existing approaches to code extraction and indexing in this environment rely heavily on computationally intense optical character recognition. To improve the ease and efficiency of identifying this embedded code, as well as identifying similar code examples, we develop a deep learning solution based on convolutional neural networks and autoencoders. Focusing on Java for proof of concept, our technique is able to identify the presence of typeset and handwritten source code in thousands of video images with 85.6%-98.6% accuracy based on syntactic and contextual features learned through deep architectures. When combined with traditional approaches, this provides a more scalable basis for video indexing that can be incorporated into existing software search and mining tools.

CCS Concepts: Information systems → Video search; Computing methodologies → Machine learning approaches; Computer systems organization → Neural networks; Software and its engineering → Software libraries and repositories.


“…Image-based PLI has been attempted by others too. Ott et al have shown how to use CNNs to identify video frames that contain Java code within video programming tutorials ( Ott et al, 2018a ) (versus frames not showing code at all) and to distinguish frames containing Java from frames containing Python ( Ott et al, 2018b ). In the present work we consider a much larger set of languages.…”

Section: Related Work (mentioning)

confidence: 99%

Image-based many-language programming language identification

Bonifro, Gabbrielli, Lategano, et al. 2021

PeerJ Computer Science


Programming language identification (PLI) is a common need in automatic program comprehension as well as a prerequisite for deeper forms of code understanding. Image-based approaches to PLI have recently emerged and are appealing due to their applicability to code screenshots and programming video tutorials. However, they remain limited to the recognition of a small amount of programming languages (up to 10 languages in the literature). We show that it is possible to perform image-based PLI on a large number of programming languages (up to 149 in our experiments) with high (92%) precision and recall, using convolutional neural networks (CNNs) and transfer learning, starting from readily-available pretrained CNNs. Results were obtained on a large real-world dataset of 300,000 code snippets extracted from popular GitHub repositories. By scrambling specific character classes and comparing identification performances we also show that the characters that contribute the most to the visual recognizability of programming languages are symbols (e.g., punctuation, mathematical operators and parentheses), followed by alphabetic characters, with digits and indentation having a negligible impact.
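The character-class scrambling mentioned in the abstract above could be sketched roughly as follows. The abstract does not spell out the procedure, so this is only an illustration under stated assumptions: the function name `scramble_class`, the membership of each class, and the strategy of replacing each character with a random character from the same class are all hypothetical. Indentation scrambling is not covered here.

```python
import random
import string

# Character classes named in the abstract. The exact membership of each
# class is an assumption made for this sketch.
CLASSES = {
    "symbols": string.punctuation,
    "alphabetic": string.ascii_letters,
    "digits": string.digits,
}


def scramble_class(snippet: str, cls: str, seed: int = 0) -> str:
    """Replace every character of the given class with a random character
    drawn from that same class, leaving all other characters untouched."""
    rng = random.Random(seed)  # seeded so the scrambling is reproducible
    pool = CLASSES[cls]
    return "".join(rng.choice(pool) if ch in pool else ch for ch in snippet)


if __name__ == "__main__":
    code = "for (int i = 0; i < 10; i++) { total += i; }"
    # Symbols are randomized; letters, digits and whitespace are preserved.
    print(scramble_class(code, "symbols"))
```

Rendering the scrambled and unscrambled snippets to images and comparing classifier accuracy on each would then indicate how much a given class contributes to recognizability, which is the comparison the abstract describes.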


“…Similar to our work, Ott et al [19] proposed to use a VGG network to identify whether frames in programming tutorial videos contain source code. They also use deep learning techniques to classify images based on programming language [20] and UML diagrams [21]. In our study, we combine deep learning techniques and traditional computer vision techniques to achieve better performance than Ott et al 's approach.…”

Section: Source Code Detection and Extraction In Programming Screencasts (mentioning)

confidence: 99%

psc2code: Denoising Code Extraction from Programming Screencasts

Bao, Xing, Xia, et al. 2021

Preprint


Programming screencasts have become a pervasive resource on the Internet, which help developers learn new programming technologies or skills. The source code in programming screencasts is an important and valuable information for developers. But the streaming nature of programming screencasts (i.e., a sequence of screen-captured images) limits the ways that developers can interact with the source code in the screencasts. Many studies use the Optical Character Recognition (OCR) technique to convert screen images (also referred to as video frames) into textual content, which can then be indexed and searched easily. However, noisy screen images significantly affect the quality of source code extracted by OCR, for example, no-code frames (e.g., PowerPoint slides, web pages of API specification), non-code regions (e.g., Package Explorer view, Console view), and noisy code regions with code in completion suggestion popups. Furthermore, due to the code characteristics (e.g., long compound identifiers like ItemListener), even professional OCR tools cannot extract source code without errors from screen images. The noisy OCRed source code will negatively affect the downstream applications, such as the effective search and navigation of the source code content in programming screencasts. In this paper, we propose an approach named psc2code to denoise the process of extracting source code from programming screencasts. First, psc2code leverages the Convolutional Neural Network (CNN) based image classification to remove non-code and noisy-code frames. Then, psc2code performs edge detection and clustering-based image segmentation to detect sub-windows in a code frame, and based on the detected sub-windows, it identifies and crops the screen region that is most likely to be a code editor. Finally, psc2code calls the API of a professional OCR tool to extract source code from the cropped code regions and leverages the OCRed cross-frame information in the programming screencast and the statistical language model of a large corpus of source code to correct errors in the OCRed source code. We conduct an experiment on 1,142 programming screencasts from YouTube. We find that our CNN-based image classification technique can effectively remove the non-code and noisy-code frames, which achieves a

