SOTorrent: Studying the Origin, Evolution, and Usage of Stack Overflow Code Snippets

Baltes, Sebastian; Treude, Christoph; Diehl, Stephan

doi:10.1109/msr.2019.00038

Cited by 61 publications

(26 citation statements)

References 16 publications

“…In this study, we assess SOTorrent [12], a dataset containing historical data from SO, to fill this research gap. Our results show that 93.87% of analysed Python code snippets contain coding style violations, and while there is a correlation between coding style compliance and post score, reputation and coding style compliance seem to be uncorrelated.…”

Section: I (mentioning)

confidence: 99%

Python Coding Style Compliance on Stack Overflow

Bafatakis

Boecker

Boon

et al. 2019

2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)

Software developers all over the world use Stack Overflow (SO) to interact and exchange code snippets. Research also uses SO to harvest code snippets for use with recommendation systems. However, previous work has shown that code on SO may have quality issues, such as security or license problems. We analyse Python code on SO to determine its coding style compliance. From 1,962,535 code snippets tagged with 'python', we extracted 407,097 snippets of at least 6 statements of Python code. Surprisingly, 93.87% of the extracted snippets contain style violations, with an average of 0.7 violations per statement and a huge number of snippets with a considerably higher ratio. Researchers and developers should, therefore, be aware that code snippets on SO may not be representative of good coding style. Furthermore, while user reputation seems to be unrelated to coding style compliance, for posts with vote scores in the range between −10 and 20, we found a strong correlation (r = −0.87, p < 10^−7) between the vote score a post received and the average number of violations per statement for snippets in such posts.
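The two quantities the abstract reports, violations per statement and the correlation between vote score and the average of that ratio, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names and the toy data are hypothetical, the violation and statement counts are assumed to come from some external style checker, and Pearson's r is assumed as the correlation coefficient.

```python
from collections import defaultdict
from math import sqrt


def violations_per_statement(violations: int, statements: int) -> float:
    """Ratio of style violations to statements for one snippet."""
    if statements <= 0:
        raise ValueError("a snippet needs at least one statement")
    return violations / statements


def mean_ratio_by_score(snippets):
    """Average the per-snippet ratio over all snippets sharing a vote score."""
    by_score = defaultdict(list)
    for score, violations, statements in snippets:
        by_score[score].append(violations_per_statement(violations, statements))
    return {score: sum(rs) / len(rs) for score, rs in sorted(by_score.items())}


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: (vote score, violations, statements) per snippet.
toy_snippets = [
    (-2, 9, 6),
    (0, 8, 8),
    (0, 6, 6),
    (5, 6, 10),
    (12, 3, 10),
    (20, 2, 12),
]

averages = mean_ratio_by_score(toy_snippets)
r = pearson_r(list(averages.keys()), list(averages.values()))
print(f"r = {r:.2f}")
```

On this made-up data, r is negative because the higher-scored snippets were given fewer violations per statement; the value says nothing about the real dataset.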

“…SOTorrent [12] is an open dataset based on data from the official SO data dump. The dataset covers all SO posts and user information since the first post in July 2008.…”

Section: B. The SOTorrent Dataset (mentioning)

confidence: 99%

Python Coding Style Compliance on Stack Overflow

Bafatakis

Boecker

Boon

et al. 2019

2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)

“…the Stack Overflow score of the snippet's answer post) or even from other sources (e.g. the reuse rate of Stack Overflow snippets in GitHub [1]) and assess the influence of that information on the effectiveness of the scheme. Furthermore, we could employ different word embedding techniques or even variations of fastText, such as the combination of the In-Out vectors of fastText [19].…”

Section: Results (mentioning)

confidence: 99%

Extracting Semantics from Question-Answering Services for Snippet Reuse

Diamantopoulos

Οικονόμου

Symeonidis

2020

Fundamental Approaches to Software Engineering

Nowadays, software developers typically search online for reusable solutions to common programming problems. However, forming the question appropriately, and locating and integrating the best solution back to the code can be tricky and time consuming. As a result, several mining systems have been proposed to aid developers in the task of locating reusable snippets and integrating them into their source code. Most of these systems, however, do not model the semantics of the snippets in the context of source code provided. In this work, we propose a snippet mining system, named StackSearch, that extracts semantic information from Stack Overflow posts and recommends useful and in-context snippets to the developer. Using a hybrid language model that combines Tf-Idf and fastText, our system effectively understands the meaning of the given query and retrieves semantically similar posts. Moreover, the results are accompanied with useful metadata using a named entity recognition technique. Upon evaluating our system in a set of common programming queries, in a dataset based on post links, and against a similar tool, we argue that our approach can be useful for recommending ready-to-use snippets to the developer.

“…Stack Overflow uses a reputation heuristic to motivate the community of users to engage with the platform in constructive ways. We use the official Stack Overflow reputation formula to recover the reputation of an asker at the time of a post. Unfortunately, the −1 reputation penalty issued for downvoting an answer and the site association bonus +100 reputation on registration could not be factored into our calculation, as that data is omitted from the MSR Challenge dataset to protect user anonymity.…”

Section: B. Measure Calculation (mentioning)

confidence: 99%

“…Further, we measure the similarity between duplicate and root questions and their associated sets of answers. Through analysis of the MSR Challenge dataset [5], we address the following research questions: Fig. 1: An overview of our approach to study the MSR Challenge dataset two-tailed Mann-Whitney U test).…”

Section: Introduction (mentioning)

confidence: 99%

Can Duplicate Questions on Stack Overflow Benefit the Software Development Community?

Abric

Clark

Caminiti

et al. 2019

2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)

Duplicate questions on Stack Overflow are questions that are flagged as being conceptually equivalent to a previously posted question. Stack Overflow suggests that duplicate questions should not be discussed by users, but rather that attention should be redirected to their previously posted counterparts. Roughly 53% of closed Stack Overflow posts are closed due to duplication. Despite their supposed overlapping content, user activity suggests duplicates may generate additional or superior answers. Approximately 9% of duplicates receive more views than their original counterparts despite being closed. In this paper, we analyze duplicate questions from two perspectives. First, we analyze the experience of those who post duplicates using activity and reputation-based heuristics. Second, we compare the content of duplicates both in terms of their questions and answers to determine the degree of similarity between each duplicate pair. Through analysis of the MSR Challenge dataset, we find that although duplicate questions are more likely to be created by inexperienced users, they often receive dissimilar answers to their original counterparts. Indeed, supplementary textual analysis using Natural Language Processing (NLP) techniques suggests duplicate questions provide additional information about the underlying concepts being discussed. We recommend that Stack Overflow's duplication policy be revised to account for the benefits that leaving duplicate questions open may have for the developer community.
