11th International Database Engineering and Applications Symposium (IDEAS 2007) 2007
DOI: 10.1109/ideas.2007.4318093
CINDI Robot: an Intelligent Web Crawler Based on Multi-level Inspection

Abstract: With the explosion of the Web, focused web crawlers are gaining attention. Focused web crawlers aim to find web pages related to a pre-defined topic. CINDI Robot is a focused web crawler devoted to finding computer science and software engineering academic documents. We propose a multi-level inspection scheme to discover relevant web pages. Through this scheme, the text features of the content contribute to the classification; furthermore, other web characteristics, such as URL patterns…

Cited by 6 publications (4 citation statements)
References 6 publications
“…In the CINDI Robot, a revised context graph [5] is proposed to classify all web pages into three categories: web pages that are directly related to our topic (referred as Layer 1 web pages in following), web pages which are indirectly related to our topic but may lead to relevant web pages (referred as Layer 2 web pages in following) and web pages that are totally useless. In contrast with the classic context graph [6], our strategy gets rid of the strict link distance requirements, reduces classification times, increases classification accuracy rates and significantly increases the opportunities of discovering relevant web regions via indirectly related web pages.…”
Section: Relevancy-based Crawling Process
confidence: 99%
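The three-layer classification described in the statement above can be sketched as a simple frontier-based crawler: Layer 1 pages are harvested and their links prioritised, Layer 2 pages are followed as bridges to relevant regions, and useless pages are pruned. This is a minimal illustration under stated assumptions, not the CINDI Robot's implementation; `classify` and `fetch_links` are hypothetical stand-ins for the paper's context-graph classifier and link extractor.

```python
from collections import deque

# Labels for the three categories described in the citation:
# directly relevant (Layer 1), indirectly relevant but possibly
# leading to relevant pages (Layer 2), and totally useless.
LAYER1, LAYER2, IRRELEVANT = "layer1", "layer2", "irrelevant"

def crawl(seeds, classify, fetch_links, max_pages=100):
    """Harvest Layer 1 pages, follow Layer 2 pages as bridges
    to relevant regions, and prune useless pages."""
    frontier = deque(seeds)
    seen = set(seeds)
    relevant = []
    visited = 0
    while frontier and visited < max_pages:
        url = frontier.popleft()
        visited += 1
        label = classify(url)
        if label == IRRELEVANT:
            continue  # prune: neither relevant nor a bridge
        if label == LAYER1:
            relevant.append(url)
        for link in fetch_links(url):
            if link in seen:
                continue
            seen.add(link)
            if label == LAYER1:
                frontier.appendleft(link)  # expand links from relevant pages first
            else:
                frontier.append(link)      # Layer 2 links wait their turn
    return relevant
```

On a toy graph where a Layer 2 hub links to two relevant pages, the crawler reaches both even though the hub itself is only indirectly relevant; a strict link-distance rule would have to justify visiting the hub at all.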
“…For the second kind of classifier, Naïve Bayes classifier is exploited as the Layer 2 web page classifier. In addition, a novel tunneling technique is implemented based on the Layer 2 web page classifier in order to reach more relevant regions by allowing the CINDI Robot to visit some low relevancy intermediates [5].…”
Section: Relevancy-based Crawling Process
confidence: 99%
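The Layer 2 classifier and tunneling idea mentioned above can be illustrated with a from-scratch multinomial Naïve Bayes over word counts plus a hop budget. This is a toy sketch under obvious assumptions (whitespace tokenisation, Laplace smoothing); it is not the classifier trained in the paper, and `should_expand` only mimics the notion of passing through a bounded number of low-relevancy intermediates.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Toy multinomial Naïve Bayes text classifier."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> training docs
        self.vocab = set()

    def train(self, labelled_docs):
        for text, label in labelled_docs:
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocab.update(words)

    def predict(self, text):
        total = sum(self.doc_counts.values())
        best_label, best_lp = None, -math.inf
        for label, counts in self.word_counts.items():
            lp = math.log(self.doc_counts[label] / total)  # log prior
            n = sum(counts.values())
            for w in text.lower().split():
                # Laplace-smoothed log likelihood; unseen words get count 0
                lp += math.log((counts[w] + 1) / (n + len(self.vocab)))
            if lp > best_lp:
                best_label, best_lp = label, lp
        return best_label

def should_expand(label, hops_since_relevant, max_tunnel=2):
    """Tunneling sketch: keep following links through low-relevancy
    intermediates for a bounded number of hops."""
    return label != "irrelevant" or hops_since_relevant < max_tunnel
```

Trained on a handful of labelled pages, `predict` assigns an unseen link's anchor text to the most probable layer, and `should_expand` lets the crawler survive a short run of low-relevancy intermediates instead of abandoning the path immediately.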
“…As one of the most influential inventions of humanity, text has played an important role in human life. Specifically, rich and precise semantic information carried by text is important in a wide range of vision-based application scenarios, such as image search [1], intelligent inspection [2], industrial automation [3], robot navigation [4], and instant translation [5]. Therefore, text recognition in natural scenes has drawn the attention of researchers and practitioners, as indicated by the emergence of recent "ICDAR Robust Reading Competitions" [6], [7], [8], [9], [10], [11], [12].…”
Section: Introduction
confidence: 99%