2001
DOI: 10.1145/383034.383035

Searching the Web

Abstract: We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts…
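
The abstract mentions link analysis as a way to boost search performance. As a rough illustration of the kind of link-based scoring the paper surveys, here is a toy PageRank-style power iteration in Python; the example graph, damping factor, and iteration count are illustrative assumptions, not values taken from the paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank over a dict {page: [outgoing links]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank uniformly over all pages
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Example: three pages linking to one another
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```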

Cited by 437 publications (221 citation statements)
References 39 publications
“…A popular approach for focused resource discovery on the web is the best-first search (BSFS) algorithm, where unvisited pages are stored in a priority queue, known as the frontier, and are reordered periodically based on a criterion. So, a typical topic-oriented crawler keeps two queues of URLs: one containing the already visited links (from here on AF) and another holding the references of the first queue, also called the crawl frontier (from here on CF) [5]. The challenging task is ordering the links in the CF efficiently.…”
Section: Web Information Retrieval
Citation type: mentioning, confidence: 99%
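
The two-queue, best-first frontier described in this excerpt can be sketched roughly as follows. This is a minimal illustration, assuming a caller-supplied score_fn for ordering the crawl frontier (CF) and a plain set for the already-visited links (AF); neither detail comes from the cited work.

```python
import heapq

class CrawlFrontier:
    """Minimal best-first search (BSFS) frontier sketch.

    Keeps two collections of URLs: the already-fetched set (AF) and a
    priority queue of unvisited links (CF), ordered by a relevance score.
    """

    def __init__(self, score_fn):
        self.score_fn = score_fn      # placeholder, e.g. topic similarity of the linking page
        self.visited = set()          # AF: already-fetched URLs
        self.heap = []                # CF: (negated score, url) min-heap

    def add(self, url, context):
        if url not in self.visited:
            # heapq is a min-heap, so negate the score to pop best-first
            heapq.heappush(self.heap, (-self.score_fn(url, context), url))

    def next_url(self):
        while self.heap:
            _, url = heapq.heappop(self.heap)
            if url not in self.visited:
                self.visited.add(url)
                return url
        return None
```

A real focused crawler would also periodically re-score and rebuild the heap, which corresponds to the "reordered periodically" step mentioned in the excerpt.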
“…Search engines use many of the techniques developed over the last decades for full-text document retrieval, but are also quite different in many aspects [12]. Users interact with these systems in a very different way: queries tend to be much shorter, and only the first or second results pages are examined in most cases.…”
Section: Web Search Evaluation
Citation type: mentioning, confidence: 99%
“…Perfectly and adeptly determined near duplicates are relied on by different web mining applications, for example, document clustering [3], collaborative filtering [25], detection of replicated web collections [26], discovering large dense graphs [34], detecting plagiarism [31] and community mining in a social network site [32]. The removal of near-duplicate pages [33] helps in reducing storage costs and improving the quality of search indexes, in addition to considerable bandwidth conservation. Above all, the crawled web pages are preprocessed using document parsing, which eliminates the HTML tags and JavaScript present in the web documents; this is followed by the removal of common words or stop words from the crawled pages.…”
Section: Introduction
Citation type: mentioning, confidence: 99%
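
As a rough sketch of the preprocessing step described in this excerpt (removal of tags and scripts followed by stop-word removal), the following Python uses only the standard library; the stop-word list is a small illustrative assumption rather than the one used in the cited work.

```python
import re
from html.parser import HTMLParser

# Small illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}

class TextExtractor(HTMLParser):
    """Collects text content while skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)

def preprocess(html_page):
    """Strip tags and scripts, then drop stop words from the remaining text."""
    parser = TextExtractor()
    parser.feed(html_page)
    tokens = re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Example: preprocess("<html><script>var x=1;</script><p>The Web is large</p></html>")
# returns ['web', 'large']
```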