Developers have considerable freedom in writing comments and in choosing identifiers and method names. These choices are intentional, and they carry information of differing relevance for understanding what a software system implements and, in particular, the role of each source file. In this paper we investigate the effectiveness of exploiting lexical information for software system clustering. In particular, we explore the contribution of the combined use of six different dictionaries, corresponding to the six parts of the source code where programmers introduce lexical information: class, attribute, method, and parameter names, comments, and source code statements. Their relevance is weighted by means of a probabilistic model whose parameters are estimated with the Expectation-Maximization algorithm. To group source files accordingly, we used a hierarchical clustering algorithm. The investigation was conducted on a dataset of 13 open source Java software systems.
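The idea of combining the six dictionaries can be illustrated with a minimal sketch. It is not the method of the paper: the zone weights here are uniform placeholders rather than values estimated by Expectation-Maximization, the average-linkage rule and the stopping threshold are assumptions, and the file names and terms are invented for illustration.

```python
import math
from collections import Counter

# The six lexical zones named in the abstract.
ZONES = ["class", "attribute", "method", "parameter", "comment", "statement"]

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def weighted_similarity(f1, f2, weights):
    """Weighted sum of per-zone similarities (weights are assumed, not learned)."""
    return sum(
        weights[z] * cosine(f1.get(z, Counter()), f2.get(z, Counter()))
        for z in ZONES
    )

def cluster(files, weights, threshold):
    """Average-linkage agglomerative clustering.

    Merging stops when the best pair of clusters is less similar
    than `threshold` (a hypothetical stopping rule).
    """
    clusters = [[name] for name in files]

    def linkage(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(weighted_similarity(files[a], files[b], weights) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        best, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Invented toy input: per-file term counts for some of the zones.
files = {
    "Parser.java": {
        "class": Counter(["parser"]),
        "method": Counter(["parse", "token"]),
        "comment": Counter(["parse", "input", "token"]),
    },
    "Lexer.java": {
        "class": Counter(["lexer"]),
        "method": Counter(["next", "token"]),
        "comment": Counter(["read", "input", "token"]),
    },
    "HttpClient.java": {
        "class": Counter(["http", "client"]),
        "method": Counter(["send", "request"]),
        "comment": Counter(["send", "http", "request"]),
    },
}
weights = {z: 1 / len(ZONES) for z in ZONES}  # uniform placeholder weights
result = cluster(files, weights, threshold=0.1)
print(result)
```

On this toy input the two files that share method and comment terms end up in one cluster, while the unrelated file stays on its own. Replacing the uniform weights with learned ones would change how much each zone contributes to that decision.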
In this paper, we present the results of a manual assessment of the coherence between the comments and the implementation of 3636 methods in three open source software applications implemented in Java (for one of these applications, we considered two subsequent versions). The results of this assessment have been collected in a dataset that we made publicly available on the Web. The creation of this dataset is based on a protocol that is detailed in this paper. We present that protocol to let researchers evaluate the quality of our dataset and to ease possible future extensions. Another contribution of this paper is a preliminary investigation of the effectiveness of adopting a Vector Space Model (VSM) with the tf-idf schema to discriminate between coherent and non-coherent methods. We observed that lexical similarity alone is not sufficient for this distinction, while encouraging results were obtained by applying a Support Vector Machine (SVM) classifier to the whole vector space.
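The lexical-similarity measure mentioned above can be sketched as follows. This is an illustrative approximation only: the tokenizer, the smoothed idf formula, and the two example methods are assumptions, not the protocol or the data of the paper, and the SVM step is not shown.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase words, breaking camelCase identifiers (crude)."""
    return [w.lower() for w in re.findall(r"[A-Z]?[a-z]+", text)]

def tfidf_vectors(docs):
    """Build one tf-idf vector (a dict) per document, using a smoothed idf."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(t for doc in tokenized for t in doc)  # document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented (comment, body) pairs: the second comment does not match its body.
methods = [
    ("Returns the user name", "return this.userName;"),
    ("Closes the database connection", "return this.userName;"),
]
docs = [c for c, _ in methods] + [b for _, b in methods]
vectors = tfidf_vectors(docs)
n = len(methods)
scores = [cosine(vectors[i], vectors[i + n]) for i in range(n)]
print(scores)
```

Here the first pair shares the terms `user` and `name`, so its score is positive, while the second pair shares no terms and scores zero. Real comments and bodies rarely separate this cleanly, which is consistent with the observation that lexical similarity alone is not sufficient.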
The article describes a knowledge-poor approach to the task of extracting Chemical-Disease Relations from PubMed abstracts. A first version of the approach was applied in our participation in BioCreative V track 3, both in Disease Named Entity Recognition and Normalization (DNER) and in Chemical-induced diseases (CID) relation extraction. For both tasks, we adopted a general-purpose approach based on machine learning techniques, integrated with a limited number of domain-specific knowledge resources and using freely available tools for preprocessing data. Crucially, the system only uses the data sets provided by the organizers. The aim is to design an easily portable approach with a limited need for domain-specific knowledge resources. In the BioCreative V task, we ranked 5th out of 16 in DNER and 7th out of 18 in CID. In this article, we present our follow-up study, focusing on CID, in which we perform further experiments, extend our approach, and improve its performance.