Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they study not only bugs, but also other concerns that are irrelevant to the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowdsourcing approach to manual labeling to validate which changes contribute to bug fixes, for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files, this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of the data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptical and assume that unvalidated data is likely very noisy, until proven otherwise.
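The consensus rule described above (four labels per line, consensus when at least three agree) can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' tooling; the function name and the example labels are hypothetical.

```python
from collections import Counter

def consensus_label(labels, quorum=3):
    """Return the consensus label if at least `quorum` of the
    participants agree, otherwise None (an active disagreement)."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= quorum else None

# Three of four participants agree -> consensus reached.
consensus_label(["bugfix", "bugfix", "bugfix", "test"])      # "bugfix"
# No label reaches the quorum -> the line counts as disagreement.
consensus_label(["bugfix", "test", "doc", "refactor"])       # None
```

Lines for which this function returns None correspond to the roughly 11% of hard-to-label lines the abstract reports.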
Fake news has been shown to have a growing negative impact on societies around the world, from influencing elections to spreading misinformation about vaccines. To address this problem, current research has proposed techniques for fake news detection, demonstrating promising results in lab conditions, where models tested on an unseen portion of the same dataset perform well. However, the question of the generalisability of these techniques, and their efficacy in the real world, is less frequently evaluated. Studies that have looked at generalisability argue that models struggle to distinguish between fake and legitimate news across topics and time periods different from those on which they were trained. This prompts the more fundamental question of how well fake news models generalise across news of the same topic and time period. As such, through a series of experiments, this study explores how well popular fake news detection models and features (word-level representations and linguistic cues) generalise across similar news. The first experiment reports high accuracies when these techniques are tested on an unseen portion of the same dataset, replicating the findings in the literature. However, the second experiment reveals that these techniques struggle to generalise, suffering drops in accuracy of around 50% when tested against different datasets of the same topic and time period. Exploring possible reasons behind such poor generalisability, the analysis points to the issue of dataset size, motivating the need for larger, more diverse datasets to become available. It also suggests that word-level representations lead to more biased, less generalisable models. Finally, the findings provide preliminary support for the effectiveness of linguistic and stylistic features, and for the potential of features beyond the word or language level, such as URL redirections and reverse image searches.
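The evaluation protocol behind these two experiments — test on a held-out split of the training dataset, then test on a different dataset of the same topic — can be sketched as follows. This is a toy illustration under assumed data: the datasets, the word-overlap "model", and all labels are invented to show why in-dataset accuracy can look good while cross-dataset accuracy collapses, and do not reproduce the study's actual models or numbers.

```python
def evaluate(model, dataset):
    """Accuracy of `model` (a callable mapping text -> label)
    on a list of (text, label) pairs."""
    correct = sum(model(text) == label for text, label in dataset)
    return correct / len(dataset)

# Hypothetical toy data: a training set, a held-out split drawn from
# the same dataset, and a second dataset on the same topic.
train_set = [("miracle cure revealed", "fake"),
             ("study finds modest effect", "real")]
held_out  = [("miracle diet revealed", "fake"),
             ("trial reports modest benefit", "real")]
other_set = [("shocking secret they hide", "fake"),
             ("peer-reviewed analysis published", "real")]

# A naive word-overlap "classifier" that memorises the fake-news
# vocabulary of the training set.
fake_words = {w for text, label in train_set if label == "fake"
              for w in text.split()}
model = lambda text: "fake" if set(text.split()) & fake_words else "real"

in_dataset_acc = evaluate(model, held_out)   # same-distribution split: 1.0
cross_data_acc = evaluate(model, other_set)  # different dataset, same topic: 0.5
```

Because the toy model keys on surface vocabulary, it scores perfectly on the held-out split but falls to chance on the other dataset — the same pattern of lab success and cross-dataset drop the abstract describes.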
It has been argued that fake news and the spread of false information pose a threat to societies throughout the world, from influencing the results of elections to hindering efforts to manage the COVID-19 pandemic. To combat this threat, a number of Natural Language Processing (NLP) approaches have been developed. These leverage a number of datasets, feature extraction/selection techniques and machine learning (ML) algorithms to detect fake news before it spreads. While these methods are well-documented, there is less evidence regarding their efficacy in this domain. By systematically reviewing the literature, this paper aims to delineate the approaches to fake news detection that are most performant, identify limitations of existing approaches, and suggest ways in which these can be mitigated. The analysis of the results indicates that Ensemble Methods using a combination of news content and socially-based features are currently the most effective. Finally, it is proposed that future research should focus on developing approaches that address generalisability issues (which, in part, arise from limitations of current datasets), explainability and bias.