Anna Corazza scite author profile

Developers have a lot of freedom in writing comments as well as in choosing identifiers and method names. These are intentional in nature and provide a different relevance of information to understand what a software system implements, and in particular the role of each source file. In this paper we investigate the effectiveness of exploiting lexical information for software system clustering. In particular we explore the contribution of the combined use of six different dictionaries, corresponding to the six parts of the source code where programmers introduce lexical information, namely: class, attribute, method and parameter names, comments, and source code statements. Their relevance has been weighted by means of a probabilistic model, whose parameters have been estimated by the Expectation-Maximization algorithm. To group source files accordingly we used a hierarchical clustering algorithm. The investigation has been conducted on a dataset of 13 open source Java software systems.

A Probabilistic Based Approach towards Software System Clustering

Corazza

Martino

Scanniello

2010

Using tabu search to configure support vector regression for effort estimation

et al. 2011

How effective is Tabu search to configure support vector regression for effort estimation?

Corazza

Martino

Ferrucci

et al. 2010

Coherence of comments and method implementations: a dataset and an empirical investigation

2016

In this paper, we present the results of a manual assessment of the coherence between the comments and the implementation of 3636 methods in three open source software applications (for one of these applications, we considered two different subsequent versions) implemented in Java. The results of this assessment have been collected in a dataset we made publicly available on the Web. The creation of this dataset is based on a protocol that is detailed in this paper. We present that protocol to let researchers evaluate the goodness of our dataset and to ease its future possible extensions. Another contribution of this paper consists in preliminarily investigating the effectiveness of adopting a Vector Space Model (VSM) with the tf-idf schema to discriminate coherent and non-coherent methods. We observed that the lexical similarity alone is not sufficient for this distinction, while encouraging results have been obtained by applying a Support Vector Machine (SVM) classifier on the whole vector space.

LINSEN: An efficient approach to split identifiers and expand abbreviations

Corazza

Martino

Maggio

2012

Unsupervised entity and relation extraction from clinical records in Italian

Alicante

Corazza

Isgrò

et al. 2016

Computers in Biology and Medicine

A knowledge-poor approach to chemical-disease relation extraction

et al. 2016

The article describes a knowledge-poor approach to the task of extracting Chemical-Disease Relations from PubMed abstracts. A first version of the approach was applied during the participation in the BioCreative V track 3, both in Disease Named Entity Recognition and Normalization (DNER) and in Chemical-induced diseases (CID) relation extraction. For both tasks, we have adopted a general-purpose approach based on machine learning techniques integrated with a limited number of domain-specific knowledge resources and using freely available tools for preprocessing data. Crucially, the system only uses the data sets provided by the organizers. The aim is to design an easily portable approach with a limited need of domain-specific knowledge resources. In the participation in the BioCreative V task, we ranked 5 out of 16 in DNER, and 7 out of 18 in CID. In this article, we present our follow-up study in particular on CID by performing further experiments, extending our approach and improving the performance.

Investigating the use of lexical information for software system clustering