Proceedings of the 22nd International Conference on World Wide Web 2013
DOI: 10.1145/2487788.2487934

Search the past with the Portuguese web archive

Abstract: The web was invented to quickly exchange data between scientists, but it became a crucial communication tool to connect the world. However, the web is extremely ephemeral. Most of the information published online becomes quickly unavailable and is lost forever. There are several initiatives worldwide that struggle to archive information from the web before it vanishes. However, search mechanisms to access this information are still limited and do not satisfy their users who demand performance similar to live-w…

Cited by 20 publications (12 citation statements)
References 9 publications
“…Obviously, browsing is only useful if one knows the exact URL of the desired content. Since this is often not the case, search is an obvious solution [4,6], but unfortunately, most web archives do not support full-text search. There has been academic work on searching timestamped collections (such as web archives) [12,10,2,8], but these systems have not been deployed in production to our knowledge.…”
Section: Background and Related Work
Citation type: mentioning
confidence: 99%
“…Since this is often not the case, temporal search capabilities are the next most desired feature in web archives [7]. Unfortunately, most web archives do not support full-text search, with a few exceptions such as the Portuguese Web Archive [9], the British Library, and the Internet Archive's Archive-It service. There has been academic work on searching timestamped collections [18,12,3,11,22], but these systems have not been deployed in production at scale.…”
Section: Background and Related Work
Citation type: mentioning
confidence: 99%
“…Similarly, a web page should only receive an inbound link from a contemporaneous page. We solve the problem in a manner similar to Gomes et al [9] by leveraging the fact that most web content is gathered in periodic (e.g., monthly) crawls, and thus we can process the webgraph from each crawl separately. We can easily determine where these "break points" should be with a Hadoop job to build a histogram of document timestamps.…”
Section: Scalable Analytics
Citation type: mentioning
confidence: 99%
“…It adapts Nutch by searching against Web archives rather than crawling the Web. The Portuguese Web Archive information retrieval system [63] adapted NutchWAX to support versions of the same URL across time and to improve the ranking of results.…”
Section: Related Work
Citation type: mentioning
confidence: 99%