Detecting age of page content

Jatowt, Adam; Kawai, Yukiko; Tanaka, Katsumi

doi:10.1145/1316902.1316925

Cited by 20 publications

(12 citation statements)

References 19 publications

“…In fact, changes in the ODP homepage were often due to the updates in the count of visitors. The above characteristics of the Open Directory Project's homepage are consistent with the results that we have obtained using our age detection tool [10] mentioned in Sect. 3.…”

Section: Comparing Histories Of Multiple Pages (supporting)

confidence: 90%

Page History Explorer: Visualizing and Comparing Page Histories

Jatowt

Kawai

Tanaka

2011

IEICE Trans. Inf. & Syst.

Self Cite

“…In another work [10], we have proposed using Web archive data for determining the age of content on pages. For an arbitrary page, the system approximately estimates the creation dates of the content that a user encounters on a page.…”

Section: Related Work (mentioning)

confidence: 99%

Page History Explorer: Visualizing and Comparing Page Histories

Jatowt

Kawai

Tanaka

2011

IEICE Trans. Inf. & Syst.

Self Cite

“…Despite nearly two decades of Web history, there has not been much research conducted for mining Web archive data. The benefit of utilizing the Web archives for knowledge discovery has been discussed many times [9,6,10]. Below, we outline some of the approaches that have been used for mining the past Web using data in Web archives.…”

Section: Related Work (mentioning)

confidence: 99%

“…This work is a part of the LAWA project (Longitudinal Analytics of Web Archive data), a focused research project for managing Web archive data and performing largescale data analytics on Web archive collections. Jatowt et al [10] also utilized the public archival repositories for automatically detecting the age of Web content through the past snapshots of pages.…”

Section: Related Work (mentioning)

confidence: 99%

Detecting Off-Topic Pages in Web Archives

AlNoamany

Weigle

Nelson

2015

Research and Advanced Technology for Digital Libraries

Abstract. Web archives have become a significant repository of our recent history and cultural heritage. Archival integrity and accuracy is a precondition for future cultural research. Currently, there are no quantitative or content-based tools that allow archivists to judge the quality of the Web archive captures. In this paper, we address the problems of detecting off-topic pages in Web archive collections. We evaluate six different methods to detect when the page has gone off-topic through subsequent captures. Those predicted off-topic pages will be presented to the collection's curator for possible elimination from the collection or cessation of crawling. We created a gold standard data set from three ArchiveIt collections to evaluate the proposed methods at different thresholds. We found that combining cosine similarity at threshold 0.10 and change in size using word count at threshold −0.85 performs the best with accuracy = 0.987, F1 score = 0.906, and AUC = 0.968. We evaluated the performance of the proposed method on several Archive-It collections. The average precision of detecting the off-topic pages is 0.92.

“…This implementation required adapting the method to the Web data, for which it is difficult to correctly detect document timestamps in contrast to news articles. Although some attempts were proposed for estimating the age of page content, they are rather costly and have varying precision [5]. Lastly, one needs to remember that the credibility and relevance of news articles are usually much higher than that of arbitrary Web pages.…”

Section: Future-related Information Analysis (mentioning)

confidence: 99%

Supporting analysis of future-related information in news archives and the web

Jatowt

Kanazawa

Oyama

et al.

2009

Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries

Self Cite

A lot of future-related information is available in news articles or Web pages. This information can however differ to large extent and may fluctuate over time. It is therefore difficult for users to manually compare and aggregate it, and to re-construct the most probable course of future events. In this paper we approach a problem of automatically generating summaries of future events related to queries using data obtained from news archive collections or from the Web. We propose two methods, explicit and implicit future-related information detection. The former is based on analyzing the context of future temporal expressions in documents, while the latter relies on detecting periodical patterns in historical document collections. We present a graph-based visualization of future-related information and demonstrate its usefulness through several examples.
