Ever more data are being produced by the growing number of electronic devices around us and on the internet. The sheer volume of these data and the high frequency at which they are produced have led to the term 'Big Data'. Because these data reflect many aspects of our daily lives, and because of their abundance and availability, Big Data sources are very interesting from an official statistics point of view. This article explores both the opportunities and the challenges that the application of Big Data poses for official statistics. Experiences gained from analyses of large volumes of Dutch traffic loop detection records and Dutch social media messages illustrate the issues characteristic of the statistical analysis and use of Big Data.
Social and economic scientists are tempted to use emerging data sources like big data to compile information about finite populations as an alternative to traditional survey samples. These data sources generally cover an unknown part of the population of interest, and simply assuming that analyses made on them carry over to larger populations is wrong: the mere volume of data provides no guarantee of valid inference. Tackling this problem with methods originally developed for probability sampling is possible but, as shown here, limited. A wider range of model-based predictive inference methods proposed in the literature are reviewed and evaluated in a simulation study using real-world data on annual mileages driven by vehicles. We propose to extend this predictive inference framework with machine learning methods for inference from samples that are generated through mechanisms other than random sampling from a target population. Describing economies and societies using sensor data, internet search data, social media and voluntary opt-in panels is cost-effective and timely compared with traditional surveys, but requires an extended inference framework such as the one proposed in this article.
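The core idea of model-based predictive inference can be illustrated with a minimal sketch: fit a model relating the study variable to an auxiliary variable known for the whole population, then predict for every population unit instead of averaging the (selectively observed) sample. All variable names and numbers below are made up for illustration (a weight-biased sample of vehicle mileages, loosely echoing the paper's mileage application); this is not the authors' actual method or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: vehicle weight (auxiliary variable, known for all
# N units) and annual mileage (observed only for a non-random sample).
N = 10_000
weight = rng.uniform(800, 2500, size=N)
mileage = 5_000 + 8.0 * weight + rng.normal(0, 2_000, size=N)

# Non-probability sample: heavier vehicles are more likely to be observed,
# so the plain sample mean is biased for the population mean.
p_scaled = (weight - weight.min()) / (weight.max() - weight.min())
in_sample = rng.random(N) < 0.05 + 0.4 * p_scaled
y_s, x_s = mileage[in_sample], weight[in_sample]

naive_mean = y_s.mean()  # biased upward by the selective inclusion

# Model-based prediction: fit y ~ x on the sample, then predict mileage for
# every unit in the population and average the predictions.
b1, b0 = np.polyfit(x_s, y_s, 1)
predicted_mean = (b0 + b1 * weight).mean()

print(round(mileage.mean()), round(naive_mean), round(predicted_mean))
```

The prediction step corrects the selection bias only insofar as the model is right; misspecification is exactly why the article argues for richer (e.g. machine learning) predictors within the same framework.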
In this paper we report the current status of a new research program. The primary goal of the "Vanishing & Appearing Sources during a Century of Observations" (VASCO) project is to search for vanishing and appearing sources using existing survey data, in order to find examples of exceptional astrophysical transients. The implications of finding such objects extend from traditional astrophysics to more exotic searches for evidence of technologically advanced civilizations. In this first paper we present new, deeper observations of the tentative candidate discovered by Villarroel et al. (2016). We then perform the first searches for vanishing objects throughout the sky by comparing 600 million objects from the US Naval Observatory (USNO) B1.0 Catalogue, down to a limiting magnitude of ∼20-21, with the recent Pan-STARRS Data Release 1 (DR1), which has a limiting magnitude of ∼23.4. We find about 150,000 preliminary candidates that have no Pan-STARRS counterpart within a 30 arcsec radius. We show that these objects are redder and have larger proper motions than typical USNO objects. We visually examine the images for a subset of about 24,000 candidates, superseding the 2016 study with a sample ten times larger. We find ∼100 point sources visible in only one epoch in the red band of the USNO which may be of interest in searches for strong M dwarf flares, high-redshift supernovae or other categories of unidentified red transients.
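The counterpart search described above boils down to a positional cross-match: a USNO source becomes a candidate if no Pan-STARRS entry lies within 30 arcsec of it. A minimal sketch of that check, using the standard Vincenty formula for angular separation on the sky (the coordinates below are toy values, not catalogue data, and real pipelines would use an indexed match, e.g. astropy's catalogue matching, rather than a linear scan):

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation between two sky positions (input in degrees,
    output in arcsec), via the Vincenty formula, which stays numerically
    stable at the small separations relevant for cross-matching."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    dra = ra2 - ra1
    num = math.hypot(
        math.cos(dec2) * math.sin(dra),
        math.cos(dec1) * math.sin(dec2)
        - math.sin(dec1) * math.cos(dec2) * math.cos(dra),
    )
    den = (math.sin(dec1) * math.sin(dec2)
           + math.cos(dec1) * math.cos(dec2) * math.cos(dra))
    return math.degrees(math.atan2(num, den)) * 3600.0

def has_counterpart(src, catalogue, radius_arcsec=30.0):
    """True if any catalogue entry lies within radius_arcsec of src."""
    ra, dec = src
    return any(angular_sep_arcsec(ra, dec, r, d) <= radius_arcsec
               for r, d in catalogue)

# Toy catalogue: the first source has a nearby entry (~4 arcsec away),
# the second query position has none.
catalogue = [(150.0010, 2.2005), (180.0, -30.0)]
print(has_counterpart((150.0, 2.2), catalogue))
print(has_counterpart((10.0, 10.0), catalogue))
```

A source for which `has_counterpart` returns False would, in this simplified picture, join the pool of "vanishing" candidates for visual inspection.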
In this paper, the third in a series illustrating the power of generalized linear models (GLMs) for the astronomical community, we elucidate the potential of the class of GLMs that handles count data. The size of a galaxy's globular cluster population (N GC ) is a long-standing puzzle in the astronomical literature. It falls in the category of count data analysis, yet it is usually modelled as if it were a continuous response variable. We have developed a Bayesian negative binomial regression model to study the connection between N GC and the following galaxy properties: central black hole mass, dynamical bulge mass, bulge velocity dispersion, and absolute visual magnitude. The methodology introduced herein naturally accounts for heteroscedasticity, intrinsic scatter and measurement errors in both axes (whether discrete or continuous), and allows the population of globular clusters to be modelled on its natural scale as a non-negative integer variable. Prediction intervals of 99 per cent around the trend for expected N GC comfortably envelop the data, notably including the Milky Way, which has hitherto been considered a problematic outlier. Finally, we demonstrate how random intercept models can incorporate information about each galaxy's morphological type. Bayesian variable selection methodology allows galaxy types with different GC production to be identified automatically, suggesting that on average S0 galaxies have a GC population 35 per cent smaller than other types of similar brightness.
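The key modelling ingredient here is the negative binomial distribution, whose variance exceeds its mean (overdispersion), unlike the Poisson. A minimal sketch of the NB2 parameterization and its log link to predictors follows; the coefficient values and the predictor are made-up illustrations, not fitted results from the paper, and the Bayesian fitting itself (priors, MCMC) is omitted.

```python
import math

def nb2_logpmf(y, mu, alpha):
    """Log-pmf of the NB2 negative binomial with mean mu and
    variance mu + alpha * mu**2; reduces to Poisson as alpha -> 0."""
    r = 1.0 / alpha  # dispersion (size) parameter
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

# In an NB regression the mean is tied to predictors through a log link,
# log(mu) = b0 + b1 * x.  Hypothetical values: x could stand for a scaled
# galaxy property such as log bulge mass.
b0, b1, alpha = 1.0, 0.5, 0.3
x = 4.0
mu = math.exp(b0 + b1 * x)   # expected globular cluster count for this galaxy
var = mu + alpha * mu ** 2   # overdispersed: variance exceeds the mean

print(round(mu, 2), round(var, 2))
```

Treating N GC this way keeps it on its natural non-negative integer scale, and the extra `alpha * mu**2` variance term is what absorbs the intrinsic scatter that a plain Poisson model cannot.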