2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR) 2019
DOI: 10.1109/msr.2019.00038

SOTorrent: Studying the Origin, Evolution, and Usage of Stack Overflow Code Snippets

Abstract: Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of copyable code snippets. Like other software artifacts, code on SO evolves over time, for example when bugs are fixed or APIs are updated to the most recent version. To be able to analyze how code and the surrounding text on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole po…

Cited by 61 publications (26 citation statements)
References 16 publications
“…In this study, we assess SOTorrent [12], a dataset containing historical data from SO, to fill this research gap. Our results show that 93.87% of analysed Python code snippets contain coding style violations, and while there is a correlation between coding style compliance and post score, reputation and coding style compliance seem to be uncorrelated.…”
Section: I (mentioning)
confidence: 99%
“…SOTorrent [12] is an open dataset based on data from the official SO data dump. The dataset covers all SO posts and user information since the first post in July 2008.…”
Section: B. The SOTorrent Dataset (mentioning)
confidence: 99%
“…the Stack Overflow score of the snippet's answer post) or even from other sources (e.g. the reuse rate of Stack Overflow snippets in GitHub [1]) and assess the influence of that information on the effectiveness of the scheme. Furthermore, we could employ different word embedding techniques or even variations of fastText, such as the combination of the In-Out vectors of fastText [19].…”
Section: Results (mentioning)
confidence: 99%
“…Stack Overflow uses a reputation heuristic to motivate the community of users to engage with the platform in constructive ways. We use the official Stack Overflow reputation formula to recover the reputation of an asker at the time of a post. Unfortunately, the −1 reputation penalty issued for downvoting an answer and the site association bonus +100 reputation on registration could not be factored into our calculation, as that data is omitted from the MSR Challenge dataset to protect user anonymity.…”
Section: B. Measure Calculation (mentioning)
confidence: 99%
“…Further, we measure the similarity between duplicate and root questions and their associated sets of answers. Through analysis of the MSR Challenge dataset [5], we address the following research questions: Fig. 1: An overview of our approach to study the MSR Challenge dataset two-tailed Mann-Whitney U test).…”
Section: Introduction (mentioning)
confidence: 99%