11th International Database Engineering and Applications Symposium (IDEAS 2007) 2007
DOI: 10.1109/ideas.2007.4318093
CINDI Robot: an Intelligent Web Crawler Based on Multi-level Inspection

Abstract: With the explosion of the Web, focused web crawlers are gaining attention. Focused web crawlers aim to find web pages related to a pre-defined topic. CINDI Robot is a focused web crawler devoted to finding computer science and software engineering academic documents. We propose a multi-level inspection scheme to discover relevant web pages. Through this scheme, the text features of the content contribute to the classification; furthermore, other web characteristics, such as URL patterns…

Cited by 6 publications (4 citation statements)
References 6 publications
“…In the CINDI Robot, a revised context graph [5] is proposed to classify all web pages into three categories: web pages that are directly related to our topic (referred as Layer 1 web pages in following), web pages which are indirectly related to our topic but may lead to relevant web pages (referred as Layer 2 web pages in following) and web pages that are totally useless. In contrast with the classic context graph [6], our strategy gets rid of the strict link distance requirements, reduces classification times, increases classification accuracy rates and significantly increases the opportunities of discovering relevant web regions via indirectly related web pages.…”
Section: Relevancy-based Crawling Process
confidence: 99%
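The three-layer classification described in the statement above can be sketched as a simple frontier-based crawler: Layer 1 pages are harvested and their links prioritised, Layer 2 pages are followed as bridges to relevant regions, and useless pages are pruned. This is a minimal illustration under stated assumptions, not the CINDI Robot's implementation; `classify` and `fetch_links` are hypothetical stand-ins for the paper's context-graph classifier and link extractor.

```python
from collections import deque

# Labels for the three categories described in the citation:
# directly relevant (Layer 1), indirectly relevant but possibly
# leading to relevant pages (Layer 2), and totally useless.
LAYER1, LAYER2, IRRELEVANT = "layer1", "layer2", "irrelevant"

def crawl(seeds, classify, fetch_links, max_pages=100):
    """Harvest Layer 1 pages, follow Layer 2 pages as bridges
    to relevant regions, and prune useless pages."""
    frontier = deque(seeds)
    seen = set(seeds)
    relevant = []
    visited = 0
    while frontier and visited < max_pages:
        url = frontier.popleft()
        visited += 1
        label = classify(url)
        if label == IRRELEVANT:
            continue  # prune: neither relevant nor a bridge
        if label == LAYER1:
            relevant.append(url)
        for link in fetch_links(url):
            if link in seen:
                continue
            seen.add(link)
            if label == LAYER1:
                frontier.appendleft(link)  # expand links from relevant pages first
            else:
                frontier.append(link)      # Layer 2 links wait their turn
    return relevant
```

On a toy graph where a Layer 2 hub links to two relevant pages, the crawler reaches both even though the hub itself is only indirectly relevant; a strict link-distance rule would have to justify visiting the hub at all.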
“…For the second kind of classifier, Naïve Bayes classifier is exploited as the Layer 2 web page classifier. In addition, a novel tunneling technique is implemented based on the Layer 2 web page classifier in order to reach more relevant regions by allowing the CINDI Robot to visit some low relevancy intermediates [5].…”
Section: Relevancy-based Crawling Process
confidence: 99%
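The Layer 2 classifier and tunneling idea mentioned above can be illustrated with a from-scratch multinomial Naïve Bayes over word counts plus a hop budget. This is a toy sketch under obvious assumptions (whitespace tokenisation, Laplace smoothing); it is not the classifier trained in the paper, and `should_expand` only mimics the notion of passing through a bounded number of low-relevancy intermediates.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Toy multinomial Naïve Bayes text classifier."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> training docs
        self.vocab = set()

    def train(self, labelled_docs):
        for text, label in labelled_docs:
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocab.update(words)

    def predict(self, text):
        total = sum(self.doc_counts.values())
        best_label, best_lp = None, -math.inf
        for label, counts in self.word_counts.items():
            lp = math.log(self.doc_counts[label] / total)  # log prior
            n = sum(counts.values())
            for w in text.lower().split():
                # Laplace-smoothed log likelihood; unseen words get count 0
                lp += math.log((counts[w] + 1) / (n + len(self.vocab)))
            if lp > best_lp:
                best_label, best_lp = label, lp
        return best_label

def should_expand(label, hops_since_relevant, max_tunnel=2):
    """Tunneling sketch: keep following links through low-relevancy
    intermediates for a bounded number of hops."""
    return label != "irrelevant" or hops_since_relevant < max_tunnel
```

Trained on a handful of labelled pages, `predict` assigns an unseen link's anchor text to the most probable layer, and `should_expand` lets the crawler survive a short run of low-relevancy intermediates instead of abandoning the path immediately.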
“…As one of the most influential inventions of humanity, text has played an important role in human life. Specifically, rich and precise semantic information carried by text is important in a wide range of vision-based application scenarios, such as image search [1], intelligent inspection [2], industrial automation [3], robot navigation [4], and instant translation [5]. Therefore, text recognition in natural scenes has drawn the attention of researchers and practitioners, as indicated by the emergence of recent "ICDAR Robust Reading Competitions" [6], [7], [8], [9], [10], [11], [12].…”
Section: Introduction
confidence: 99%