How should we measure functional sameness from program source code? An exploratory study on Java methods

Higo, Yoshiki; Kusumoto, Shinji

doi:10.1145/2635868.2635886

Cited by 14 publications

(6 citation statements)

References 50 publications

“…There are a variety of approaches to compute structural similarity of source code [49,50,51]. We used the de-duplication approach explained in [52] for its simplicity and effectiveness for method-level similarity computation.…”

Section: Problem Statement (mentioning)

confidence: 99%

Replacements and Replaceables: Making the Case for Code Variants

Vinayakarao, Sondhi, Keswani et al. 2020

Preprint

There are often multiple ways to implement the same requirement in source code. Different implementation choices can result in code snippets that are similar, and have been defined in multiple ways: code clones, examples, simions and variants. Currently, there is a lack of a consistent and unambiguous definition of such types of code snippets. Here we present a characterization study of code variants, a specific type of code snippets that differ from each other by at least one desired property, within a given code context. We distinguish code variants from other types of redundancies in source code, and demonstrate the significant role that they play: about 25% to 43% of developer discussions (in a set of nine open source projects) were about variants. We characterize different types of variants based on their code context and desired properties. As a demonstration of the possible use of our characterization of code variants, we show how search results can be ranked based on a desired property (e.g., speed of execution).

“…To cope with this problem, they have randomly selected 100 Java projects from SourceForge in what was called a first attempt to run a statistically sound experiment in test data generation. According to them, the resulting benchmark, called SF100, is statistically sound and representative of open source projects. This issue also affects the evaluation of code search techniques: many times the target repositories are not a random sample of software projects, but a specific set of projects, which can introduce bias to the study.…”

Section: A Repository (mentioning)

confidence: 99%

“…Although such a property has received a lot of attention in the past, recently researchers have been looking at other types of replication, such as vocabulary or temporal redundancy. The idea is that sometimes two fragments of code can be similar with respect to other aspects besides text (think about different implementations of sorting algorithms, for instance, which can have the same function but different structure [6]). Vocabulary redundancy appears when different pieces of code share similar words [6], while temporal redundancy is concerned with the amount of code commits that are composed of previous commits [7].…”

Section: Introduction (mentioning)

confidence: 99%

An Exploratory Study of Interface Redundancy in Code Repositories

Paula, Guerra, Lopes et al. 2016

2016 IEEE 16th International Working Conference on Source Code Analysis and Manipulation (SCAM)

An important property of software repositories is their level of cross-project redundancy. For instance, much has been done to assess how much code cloning happens across software corpora. In this paper we study a much less targeted type of replication: Interface Redundancy (IR). IR refers to the level of repetition of whole method interfaces (return type, method name, and parameter types) across a code corpus. This type of redundancy is important because if two non-trivial methods share the same interface, it is very likely that they implement analogous functions, even though their code, structure, or vocabulary might be diverse. A certain level of IR is a requirement for approaches that rely on the recurrence of interfaces to fulfill a given task (e.g., interface-driven code search, IDCS). In this paper we report on an experiment to measure IR in a large-scale Java repository. Our target corpus contains more than 380,000 methods from 99 Java projects extracted randomly from an open source repository. Results are promising, as they show that the chance of an interface from a non-trivial method repeating itself across a large repository is around 25% (i.e., approximately 1/4 of such interfaces are redundant). Also, more than 80% of the target projects contained IR (with the average percentage of redundant interfaces for these projects being above 30%). As additional analyses, we investigated the distribution of the different types of redundant interfaces (e.g., intra- vs. inter-project); characterized the redundant interfaces and showed that such knowledge can help improve IDCS; and provided evidence that only a very small part of IR refers to method cloning (around 0.002%).
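To make the notion of interface redundancy more concrete, here is a minimal Python sketch. It assumes that each method has already been reduced to a (return type, method name, parameter types) triple and that an interface counts as redundant when it occurs more than once in the corpus. The `MethodInterface` type, the `redundant_interface_ratio` function, and the toy corpus are hypothetical illustrations, not the measurement procedure used in the paper.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class MethodInterface(NamedTuple):
    """A method interface: return type, method name, parameter types."""
    return_type: str
    name: str
    param_types: Tuple[str, ...]


def redundant_interface_ratio(methods: Iterable[MethodInterface]) -> float:
    """Fraction of distinct interfaces that occur more than once.

    Assumption: "redundant" means the exact same triple appears in at
    least two methods of the corpus.
    """
    counts = Counter(methods)
    if not counts:
        return 0.0
    redundant = sum(1 for occurrences in counts.values() if occurrences > 1)
    return redundant / len(counts)


# Toy corpus (hypothetical): four distinct interfaces, one of them repeated.
corpus = [
    MethodInterface("int", "size", ()),
    MethodInterface("int", "size", ()),
    MethodInterface("boolean", "isEmpty", ()),
    MethodInterface("String", "join", ("List", "String")),
    MethodInterface("void", "close", ()),
]

ratio = redundant_interface_ratio(corpus)
print(f"redundant interfaces: {ratio:.0%}")
```

In this toy corpus only `int size()` repeats, so one of the four distinct interfaces is redundant.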

“…Beyond this (of course fuzzy) threshold, the diversity and uniqueness of source code appears. Higo and Kusumoto [11] investigate the interplay between structural similarity, vocabulary similarity and method name similarity, to assess functional similarity between methods in Java programs. They show that many contextual factors influence the ability of these similarity measures to spot functional similarity (e.g., the number of methods that share the same name, or the fact that two methods with similar structure are in the same class or not).…”
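As a rough illustration of what a method name or vocabulary similarity measure could look like, the following Python sketch splits camelCase identifiers into lowercase words and compares the resulting word sets with the Jaccard index. The citation statement does not say how Higo and Kusumoto compute their measures, so the tokenization rule, the choice of Jaccard, and the function names are all assumptions made for this example.

```python
import re
from typing import Set


def split_identifier(identifier: str) -> Set[str]:
    """Split a camelCase identifier into a set of lowercase words.

    Assumed rule: acronyms, capitalized words, lowercase runs and digit
    runs each form one word.
    """
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return {part.lower() for part in parts}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def name_similarity(first: str, second: str) -> float:
    """Similarity of two method names based on their shared words."""
    return jaccard(split_identifier(first), split_identifier(second))


score = name_similarity("getUserName", "fetchUserName")
print(score)
```

Here the two names share `user` and `name` out of four distinct words. A measure like this captures only naming; it says nothing about structure, which is why the study looks at how the different measures interact.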

Section: Related Work (mentioning)

confidence: 99%

DSpot: Test Amplification for Automatic Assessment of Computational Diversity

Baudry, Allier, Rodriguez-Cancio et al. 2015

Preprint

Context: Computational diversity, i.e., the presence of a set of programs that all perform compatible services but that exhibit behavioral differences under certain conditions, is essential for fault tolerance and security. Objective: We aim at proposing an approach for automatically assessing the presence of computational diversity. In this work, computationally diverse variants are defined as (i) sharing the same API, (ii) behaving the same according to an input-output based specification (a test-suite) and (iii) exhibiting observable differences when they run outside the specified input space. Method: Our technique relies on test amplification. We propose source code transformations on test cases to explore the input domain and systematically sense the observation domain. We quantify computational diversity as the dissimilarity between observations on inputs that are outside the specified domain. Results: We run our experiments on 472 variants of 7 classes from open-source, large and thoroughly tested Java classes. Our test amplification multiplies by ten the number of input points in the test suite and is effective at detecting software diversity. Conclusion: The key insights of this study are: the systematic exploration of the observable output space of a class provides new insights about its degree of encapsulation; the behavioral diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.).
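The definition of computationally diverse variants can be sketched in a few lines of Python. This example assumes two hypothetical variants of a mean function that agree on the inputs covered by a small test suite but differ on an input outside it. The hand-picked `amplified_inputs` merely stand in for the inputs that test amplification would generate; nothing here reproduces DSpot's actual transformations or its dissimilarity metric.

```python
from typing import Callable, List, Sequence, Tuple


def variant_a(values: List[float]):
    """Mean of a list; returns 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def variant_b(values: List[float]):
    """Mean of a list; returns None for an empty list."""
    return sum(values) / len(values) if values else None


def observe(fn: Callable, x) -> Tuple[str, object]:
    """Record what a caller can observe: a result or an exception type."""
    try:
        return ("ok", fn(x))
    except Exception as exc:
        return ("error", type(exc).__name__)


def diversity(f: Callable, g: Callable, inputs: Sequence) -> float:
    """Fraction of inputs on which the two variants are observably different."""
    differing = sum(1 for x in inputs if observe(f, x) != observe(g, x))
    return differing / len(inputs)


# Inputs covered by the (hypothetical) test suite: the variants agree.
spec_inputs = [[1, 2, 3], [4], [2, 2]]
# Inputs outside the specified domain, chosen by hand for illustration.
amplified_inputs = [[], [0], [-1, 1], [10, 20]]

print(diversity(variant_a, variant_b, spec_inputs))
print(diversity(variant_a, variant_b, amplified_inputs))
```

The variants are indistinguishable on the specified inputs and differ only on the empty list, which matches the three-part definition in the abstract: same API, same behavior under the test suite, observable differences outside it.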
