On the endogenesis of Twitter's Spritzer and Gardenhose sample streams

Kergl, Dennis; Roedler, Robert; Seeber, Sebastian

doi:10.1109/asonam.2014.6921610

Cited by 30 publications

(26 citation statements)

References 14 publications


“…Lazer et al [41] revealed that Google does not store the search term typed by the user but the search term selected based on suggestions, which has tremendous implications for the analysis of human behavior based on those data. Our work focuses on issues resulting from sampling [42,43] of Twitter data. Since Twitter does not reveal how data sampling is performed, the use of Twitter data is generally regarded as highly problematic, especially in the social sciences [42, 44–46].…”

Section: Related Work (mentioning)

confidence: 99%

“…Our work focuses on issues resulting from sampling [42,43] of Twitter data. Since Twitter does not reveal how data sampling is performed, the use of Twitter data is generally regarded as highly problematic, especially in the social sciences [42, 44–46]. Several studies discuss working, compositions and possible biases of data [47,48] and a "reverse-engineered" model has been developed for the Sample API, which indicates that the sampling is based on a millisecond time window and that the timestamp at which the Tweet arrived at Twitter's servers is coded into the Tweet's ID [42,43].…”

Section: Related Work (mentioning)

confidence: 99%

“…Since Twitter does not reveal how data sampling is performed, the use of Twitter data is generally regarded as highly problematic, especially in the social sciences [42, 44–46]. Several studies discuss working, compositions and possible biases of data [47,48] and a "reverse-engineered" model has been developed for the Sample API, which indicates that the sampling is based on a millisecond time window and that the timestamp at which the Tweet arrived at Twitter's servers is coded into the Tweet's ID [42,43]. Although it has been shown that Twitter creates nonrepresentative samples with non-transparent and highly fluctuating sample rates of the overall Twitter activity [49], this has had no effect on its popularity amongst researchers [50].…”

Section: Related Work (mentioning)

confidence: 99%


Tampering with Twitter’s Sample API

Mayer

2018

EPJ Data Sci.

115


Social media data is widely analyzed in computational social science. Twitter, one of the largest social media platforms, is used for research, journalism, business, and government to analyze human behavior at scale. Twitter offers data via three different Application Programming Interfaces (APIs). One of which, Twitter's Sample API, provides a freely available 1% and a costly 10% sample of all Tweets. These data are supposedly random samples of all platform activity. However, we demonstrate that, due to the nature of Twitter's sampling mechanism, it is possible to deliberately influence these samples, the extent and content of any topic, and consequently to manipulate the analyses of researchers, journalists, as well as market and political analysts trusting these data sources. Our analysis also reveals that technical artifacts can accidentally skew Twitter's samples. Samples should therefore not be regarded as random. Our findings illustrate the critical limitations and general issues of big data sampling, especially in the context of proprietary data and undisclosed details about data handling.
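The citation statements above describe a reverse-engineered model in which a Tweet's arrival timestamp is coded into its ID and the Sample API selects Tweets by a millisecond time window. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the publicly documented Snowflake ID layout (timestamp in the bits above the lowest 22, offset by Twitter's epoch) and uses a window of milliseconds 657 to 666 as a default; treat those bounds as an assumption to be checked against the cited papers. The function names and the `make_id` helper are hypothetical and exist only for illustration.

```python
# Offset of Twitter's Snowflake epoch from the Unix epoch, in milliseconds.
TWITTER_EPOCH_MS = 1288834974657


def tweet_timestamp_ms(tweet_id: int) -> int:
    """Return the Unix timestamp (ms) encoded in a Snowflake-style Tweet ID."""
    # The lowest 22 bits carry worker and sequence numbers; the rest is time.
    return (tweet_id >> 22) + TWITTER_EPOCH_MS


def in_sample_window(tweet_id: int, start_ms: int = 657, end_ms: int = 666) -> bool:
    """Check whether the ID's millisecond-of-second falls inside the window.

    The default bounds are an assumption; pass other values to model a
    wider window.
    """
    millisecond = tweet_timestamp_ms(tweet_id) % 1000
    return start_ms <= millisecond <= end_ms


def make_id(unix_ms: int, low_bits: int = 0) -> int:
    """Hypothetical helper: build a synthetic ID for a given timestamp."""
    return ((unix_ms - TWITTER_EPOCH_MS) << 22) | low_bits


if __name__ == "__main__":
    inside = make_id(1500000000660)   # millisecond 660 of its second
    outside = make_id(1500000000100)  # millisecond 100 of its second
    print(in_sample_window(inside), in_sample_window(outside))
```

Under this model, whether a Tweet lands in the sample depends only on the millisecond at which its ID was generated, which is why the abstract above argues that the samples can be influenced and should not be regarded as random.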


“…Figure 1(a) shows a typical Twitter-Day for English content of our data set. We are able to show the amount of total tweets in the firehose (i.e., the stream of all public tweets), taking advantage of the nature of Twitter's sample stream that we described in [19] and also captured for this analysis. Also the proportion of our data set can be derived from the figure.…”

Section: Data Source (mentioning)

confidence: 99%

Towards Internet Scale Quality-of-Experience Measurement with Twitter

Kergl

Roedler

Rodosek

2017

Lecture Notes in Computer Science

Self Cite


Abstract. At present, Quality of Experience (QoE) measurements are accomplished by interrogating users for the perceived quality of a service they just have used. Influenced by many factors and often limited by domain or geographical region, this technique has several drawbacks when a general state of QoE for the internet as a whole is prospected. To achieve such a general metric, we leverage user complaints that we observe in real-time in social media. Such approaches have been successfully applied for the monitoring of specific and single services. We aim to extend existing methods in order to create an overall metric, define an internet wide QoE baseline, monitor changes and hence, provide a context for assessing smaller scale findings against a ground truth. The contribution of this work is to demonstrate the feasibility of using social media analysis for generating a meaningful value for quantifying the actual QoE of the internet.


“…For our study, we accessed tweets archived from a Twitter feed licensed to the University of Sheffield from July 2009 to September 2014 inclusive. These comprise a random 10% sample of all tweets (Kergl et al, 2014) and are kept in hourly or daily files. The sample was searched for terms related to mephedrone by using Aho-Corasick (1975) search first to losslessly reduce the number of records processed in detail.…”

Section: Twitter Data Resource (mentioning)

confidence: 99%

Novel psychoactive substances: An investigation of temporal trends in social media and electronic health records

et al. 2016


Background: Public health monitoring is commonly undertaken in social media but has never been combined with data analysis from electronic health records. This study aimed to investigate the relationship between the emergence of novel psychoactive substances (NPS) in social media and their appearance in a large mental health database. Methods: Insufficient numbers of mentions of other NPS in case records meant that the study focused on mephedrone. Data were extracted on the number of mephedrone (i) references in the clinical record at the South London and Maudsley NHS Trust, London, UK, (ii) mentions in Twitter, (iii) related searches in Google and (iv) visits in Wikipedia. The characteristics of current mephedrone users in the clinical record were also established. Results: Increased activity related to mephedrone searches in Google and visits in Wikipedia preceded a peak in mephedrone-related references in the clinical record followed by a spike in the other 3 data sources in early 2010, when mephedrone was assigned a ‘class B’ status. Features of current mephedrone users widely matched those from community studies. Conclusions: Combined analysis of information from social media and data from mental health records may assist public health and clinical surveillance for certain substance-related events of interest. There exists potential for early warning systems for health-care practitioners.
