Proceedings of the 8th Annual ACM International Workshop on Web Information and Data Management 2006
DOI: 10.1145/1183550.1183560

Efficient, automatic web resource harvesting

Abstract: There are two problems associated with conventional web crawling techniques: a crawler cannot know whether all resources at a non-trivial web site have been discovered and crawled ("the counting problem"), and the human-readable format of the resources is not always suitable for machine processing ("the representation problem"). We introduce an approach that solves these two problems by implementing support for both the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and MPEG-21 Digital Item Dec…

Cited by 15 publications (9 citation statements)
References 30 publications
“…This work was supported by the R&D programme of the Swedish Armed Forces and by Recorded Future AB, and is also a supporting work for the Alert4All research project, which is funded under the European Union Seventh Framework Programme under contract no. 261732.…”
Section: Acknowledgements (mentioning)
confidence: 99%
“…Thus, an item can contain more than one record, each containing metadata in a specific metadata format pertaining to one resource. This hierarchy is further clarified by Figure 1 taken from [46]. Sets are repository-defined collections of these items.…”
Section: II.4.1 "Regular OAI-PMH" (mentioning)
confidence: 99%
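
The item, record, and set hierarchy described in the statement above can be pictured with a small data model. This is only an illustrative sketch: the class names, field names, and the `items_in_set` helper are assumptions made for this example and are not taken from the cited papers or from any OAI-PMH library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Record:
    """Metadata about one resource, expressed in one metadata format."""
    metadata_prefix: str  # e.g. "oai_dc"
    metadata: str         # serialized metadata payload


@dataclass
class Item:
    """A repository item: one identifier, possibly many records."""
    identifier: str
    records: Dict[str, Record] = field(default_factory=dict)
    set_specs: Set[str] = field(default_factory=set)

    def add_record(self, record: Record) -> None:
        # At most one record per metadata format for a given item.
        self.records[record.metadata_prefix] = record


def items_in_set(items: List[Item], set_spec: str) -> List[Item]:
    """Return the items that the repository has placed in the given set."""
    return [item for item in items if set_spec in item.set_specs]
```

Here a set is modeled as a label attached to items, which matches the description of sets as repository-defined collections of items.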
“…It provides selective harvesting through from-until parameters, so that only the desired metadata may be harvested from a data provider. This feature has enhanced efficiency by facilitating harvest of data from the previous repository update instead of a complete repository harvest with redundant copies of records that may already exist in the harvester's data collection [78,46]. This is achieved by set-based harvesting or by date-based harvesting of records.…”
Section: II.4.2 Resource Harvesting Within the OAI-PMH Using MPEG-21 (mentioning)
confidence: 99%
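
The selective harvesting described in the statement above amounts to adding optional `from`, `until`, and `set` arguments to a `ListRecords` request. The sketch below builds such a request URL. The verb and argument names come from the OAI-PMH protocol; the helper name `list_records_url` and the base URL in the usage note are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str,
    from_date: Optional[str] = None,
    until_date: Optional[str] = None,
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords URL with optional selective-harvest arguments."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date    # date-based harvesting: lower bound
    if until_date is not None:
        params["until"] = until_date  # date-based harvesting: upper bound
    if set_spec is not None:
        params["set"] = set_spec      # set-based harvesting
    return base_url + "?" + urlencode(params)
```

For example, `list_records_url("http://example.org/oai", "oai_dc", from_date="2006-01-01")` requests only records changed on or after that date, so a harvester that stores the date of its last run can avoid re-fetching the whole repository.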
“…We can use this format as a container, or CRATE, for our preservation-prepared resource. Since OAI-PMH was already implemented as an Apache web server module [4], mod_oai, we used it as our experimental prototype. Like other Apache modules, mod_oai activity is controlled through the web server configuration file (httpd.conf).…”
Section: The CRATE Model (mentioning)
confidence: 99%