Temporal update dynamics under blind sampling

Li, Xiaoyong; Cline, Daren B. H.; Loguinov, Dmitri

doi:10.1109/infocom.2015.7218543

Cited by 5 publications

(3 citation statements)

References 38 publications


“…The problem of predicting the time between changes of web pages under blind sampling is addressed by Li et al [16]. A stochastic modeling framework where updates and sampling follow independent point processes is proposed.…”

Section: Related Work (mentioning)

confidence: 99%

A Methodological Approach for Time Series Analysis and Forecasting of Web Dynamics

Calzarossa, Vedova, Massari, et al. 2019

Lecture Notes in Computer Science


The web is a complex information ecosystem that provides a large variety of content changing over time as a consequence of the combined effects of management policies, user interactions and external events. These highly dynamic scenarios challenge technologies dealing with discovery, management and retrieval of web content. In this paper, we address the problem of modeling and predicting web dynamics in the framework of time series analysis and forecasting. We present a general methodological approach that allows the identification of the patterns describing the behavior of the time series, the formulation of suitable models and the use of these models for predicting the future behavior. Moreover, to improve the forecasts, we propose a method for detecting and modeling the spiky patterns that might be present in a time series. To test our methodological approach, we analyze the temporal patterns of page uploads of the Reuters news agency website over one year. We discover that the upload process is characterized by a diurnal behavior and by a much larger number of uploads during weekdays with respect to weekend days. Moreover, we identify several sudden spikes and a daily periodicity. The overall model of the upload process, obtained as a superposition of the models of its individual components, accurately fits the data, including most of the spikes.


“…Since the web crawler cannot continuously monitor every page, there is only partial information available on the change process. Cho and Garcia-Molina (2003b), and more recently Li, Cline, and Loguinov (2017), have proposed estimators of the rate of change given partial observations. However, the problem of learning the refresh rates of items while also trying to optimise the objective of keeping the cache as up-to-date for incoming requests as possible seems very challenging.…”

Section: Introduction (mentioning)

confidence: 99%

Learning to Crawl

Upadhyay, Busa-Fekete, Kotłowski, et al. 2020

AAAI


Web crawling is the problem of keeping a cache of webpages fresh, i.e., having the most recent copy available when a page is requested. This problem is usually coupled with the natural restriction that the bandwidth available to the web crawler is limited. The corresponding optimization problem was solved optimally by Azar et al. (2018) under the assumption that, for each webpage, both the elapsed time between two changes and the elapsed time between two requests follows a Poisson distribution with known parameters. In this paper, we study the same control problem but under the assumption that the change rates are unknown a priori, and thus we need to estimate them in an online fashion using only partial observations (i.e., single-bit signals indicating whether the page has changed since the last refresh). As a point of departure, we characterise the conditions under which one can solve the problem with such partial observability. Next, we propose a practical estimator and compute confidence intervals for it in terms of the elapsed time between the observations. Finally, we show that the explore-and-commit algorithm achieves an O(√T) regret with a carefully chosen exploration horizon. Our simulation study shows that our online policy scales well and achieves close to optimal performance for a wide range of parameters.
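The idea of estimating a change rate from single-bit signals can be illustrated with a minimal sketch. It is not the estimator proposed in the paper above. It assumes that changes follow a Poisson process with an unknown rate, that the page is refreshed at a fixed interval, and that each refresh reveals only whether at least one change occurred since the previous one. Under those assumptions the probability of seeing a change is 1 - exp(-rate * interval), and inverting the observed fraction of changed refreshes gives the maximum-likelihood estimate. The function names `estimate_change_rate` and `simulate_bits` are hypothetical and exist only for this example.

```python
import math
import random


def estimate_change_rate(changed_bits, interval):
    """Maximum-likelihood change rate from single-bit observations.

    Assumes (for illustration only) a Poisson change process observed at a
    fixed refresh interval; each bit says whether the page changed at least
    once since the previous refresh.
    """
    n = len(changed_bits)
    if n == 0:
        raise ValueError("need at least one observation")
    x = sum(1 for bit in changed_bits if bit)
    if x == 0:
        return 0.0
    if x == n:
        # Every refresh saw a change, so the likelihood has no finite maximiser.
        return math.inf
    # P(change seen) = 1 - exp(-rate * interval), solved for rate.
    return -math.log(1.0 - x / n) / interval


def simulate_bits(rate, interval, n, seed=0):
    """Draw n single-bit observations for a Poisson process with the given rate."""
    rng = random.Random(seed)
    p_change = 1.0 - math.exp(-rate * interval)
    return [rng.random() < p_change for _ in range(n)]


if __name__ == "__main__":
    bits = simulate_bits(rate=0.5, interval=1.0, n=20000)
    print(estimate_change_rate(bits, interval=1.0))
```

The estimate degrades when the refresh interval is long relative to the time between changes, because almost every bit is then 1 and the observations carry little information about the rate.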


“…Since the web crawler cannot continuously monitor every page, there is only partial information available on the change process. Cho and Garcia-Molina [2003b], and more recently Li et al [2017], have proposed estimators of the rate of change given partial observations. However, the problem of learning the refresh rates of items while also trying to optimise the objective of keeping the cache as up-to-date for incoming requests as possible seems very challenging.…”

Section: Introduction (mentioning)

confidence: 99%

Learning to Crawl

Upadhyay, Busa-Fekete, Kotłowski, et al. 2019

Preprint


Web crawling is the problem of keeping a cache of webpages fresh, i.e., having the most recent copy available when a page is requested. This problem is usually coupled with the natural restriction that the bandwidth available to the web crawler is limited. The corresponding optimization problem was solved optimally by Azar et al. [2018] under the assumption that, for each webpage, both the elapsed time between two changes and the elapsed time between two requests follows a Poisson distribution with known parameters. In this paper, we study the same control problem but under the assumption that the change rates are unknown a priori, and thus we need to estimate them in an online fashion using only partial observations (i.e., single-bit signals indicating whether the page has changed since the last refresh). As a point of departure, we characterise the conditions under which one can solve the problem with such partial observability. Next, we propose a practical estimator and compute confidence intervals for it in terms of the elapsed time between the observations. Finally, we show that the explore-and-commit algorithm achieves an O(√T) regret with a carefully chosen exploration horizon. Our simulation study shows that our online policy scales well and achieves close to optimal performance for a wide range of the parameters.
