Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries 2005
DOI: 10.1145/1065385.1065407
Downloading textual hidden web content through keyword queries

Cited by 156 publications (116 citation statements) · References 11 publications
“…If queries are not selected properly, most of the retrieved documents may be redundant. Query selection is therefore modelled as a set covering [4] or dominating vertex [5] problem, so that the selected queries return fewer redundant documents. Since both the set covering and the dominating vertex problem are NP-hard, the optimal solution cannot be computed, especially because the problem size is very large, involving thousands or more documents and terms.…”
Section: Related Work
confidence: 99%
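Although exact set cover is intractable at this scale, the standard greedy heuristic gives a ln(n)-approximation and is the usual practical choice. Below is a minimal sketch under assumed toy data; the query names and document IDs are illustrative only, not from the cited systems.

```python
# Greedy set-cover sketch for query selection (hypothetical data).
# Each candidate query maps to the set of document IDs it would return;
# the greedy rule repeatedly picks the query that covers the most
# not-yet-covered documents.

def greedy_query_selection(query_results: dict[str, set[int]],
                           universe: set[int]) -> list[str]:
    covered: set[int] = set()
    selected: list[str] = []
    while covered != universe:
        # Query adding the most uncovered documents.
        best = max(query_results, key=lambda q: len(query_results[q] - covered))
        gain = query_results[best] - covered
        if not gain:  # remaining documents are unreachable by any query
            break
        covered |= gain
        selected.append(best)
    return selected

# Toy example: three candidate queries over a five-document "database".
docs = {1, 2, 3, 4, 5}
results = {"data": {1, 2, 3}, "web": {3, 4}, "query": {4, 5}}
print(greedy_query_selection(results, docs))  # e.g. ['data', 'query']
```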
“…If we regard all the documents in a data source as the universe, and each query as the subset of documents it matches, then query selection is the problem of finding the subsets (the queries) that cover all the documents at minimal cost. Since the entire set of documents is not available, the queries have to be selected from a sample of partially downloaded documents [4,5,6,7]. In particular, [7,8] demonstrate that queries selected from a sample set of documents can also work well for the entire data set.…”
Section: Introduction
confidence: 99%
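A brief sketch of what sample-based selection might look like, reusing the greedy_query_selection sketch above; the inverted index, whitespace tokenization, and the max_queries cutoff are all simplifying assumptions rather than details of the cited methods.

```python
# Select queries from a downloaded sample only, then issue them against
# the full, unseen data source (assumes greedy_query_selection above).

def select_from_sample(sample_docs: dict[int, str],
                       max_queries: int) -> list[str]:
    # Inverted index over the sample: term -> IDs of sample docs containing it.
    index: dict[str, set[int]] = {}
    for doc_id, text in sample_docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    # Cover the sample; per [7,8], such queries tend to generalize to
    # the entire data set.
    return greedy_query_selection(index, set(sample_docs))[:max_queries]
```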
“…The negative effects of this, such as introducing terms into a server's resource description that the server does not contain, or underestimating term frequencies, are offset by the overall improvement in performance. Ntoulas et al. (2005) aim to crawl as much content as possible from a hidden web resource and present an adaptive approach to term selection: they try to identify the terms that are most likely to return the most additional documents. This outperforms other approaches in terms of the number of documents retrieved.…”
Section: Can We Do Better Than Random?
confidence: 99%
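A hedged sketch of the adaptive idea, not the authors' exact estimator (the paper uses more careful frequency modelling): after each download round, re-estimate term statistics from the pages retrieved so far and issue the unissued term expected to return the most new documents. Here db_size_estimate is an assumed external input and the estimator is deliberately simplified.

```python
# Adaptive term selection sketch in the spirit of Ntoulas et al. (2005).

def next_query(downloaded: dict[int, str], issued: set[str],
               db_size_estimate: int) -> str | None:
    # Document frequency of each term within the downloaded sample.
    df: dict[str, int] = {}
    for text in downloaded.values():
        for term in set(text.lower().split()):
            df[term] = df.get(term, 0) + 1

    def expected_new_docs(term: str) -> float:
        # Estimated matches in the whole source, minus documents we
        # already hold that the query would simply re-return.
        est_matches = df[term] / len(downloaded) * db_size_estimate
        return est_matches - df[term]

    candidates = [t for t in df if t not in issued]
    return max(candidates, key=expected_new_docs, default=None)
```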
“…As in the cases of TextRunner and WebTables, our goal was to develop techniques that apply efficiently to large numbers of forms. This is in contrast with much prior work, which has either addressed the problem by constructing mediator systems one domain at a time [12,13,26] or needed site-specific wrappers or extractors to extract documents from text databases [5,22]. As we discuss, the pages we surface contain tables from which additional data can be extracted for the Web knowledge base.…”
Section: Accessing Deep-web Databases
confidence: 99%
“…The set of all candidate keywords can then be pruned to select a smaller subset that ensures diversity of the exposed database contents. Similar iterative probing approaches have been used in the past to extract text documents from specific databases [5,10,18,22].…”
Section: Surfacing Deep-web Databases
confidence: 99%
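A minimal iterative-probing loop, assuming a hypothetical search(q) callable that returns (doc_id, text) pairs for keyword q; the keyword extraction here is a plain frequency ranking, whereas the cited systems also prune the candidate set for diversity.

```python
from collections import Counter

# Iterative probing sketch: seed query -> download results -> extract
# new candidate keywords from the retrieved text -> probe again.

def iterative_probe(search, seed: str, rounds: int = 50) -> dict[int, str]:
    retrieved: dict[int, str] = {}
    issued = {seed}
    query = seed
    for _ in range(rounds):
        for doc_id, text in search(query):
            retrieved.setdefault(doc_id, text)
        # Candidate keywords: frequent terms in retrieved text that have
        # not been issued yet.
        counts = Counter(t for text in retrieved.values()
                         for t in text.lower().split())
        candidates = [t for t, _ in counts.most_common() if t not in issued]
        if not candidates:
            break
        query = candidates[0]
        issued.add(query)
    return retrieved
```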