SIMHAR - Smart Distributed Web Crawler for the Hidden Web Using SIM+Hash and Redis Server

Kaur, Sawroop; Geetha, G.

doi:10.1109/access.2020.3004756

Cited by 10 publications

(8 citation statements)

References 23 publications

“…Along with jasmine directory and amazon, 20 real websites from Alexa's list of top sites are exhaustively crawled to check at which depth most web pages are found. Our observation is similar to [37]. Below the depth of 6, the crawler was not able to find a considerable percentage of forms.…”

Section: Path Learning (supporting)

confidence: 87%

“…The links from W are kept in frontier for seed URLs. Further links are kept in fetched link frontier [37]. The proposed crawler is focused on property, book, flight, hotel, music, premier and product domains.…”

Section: Framework Of Ichw (mentioning)

confidence: 99%

“…L be another document in a class hierarchy. The formula for similarity computation is used similarly as defined in [37]. It is computed between new-found URL and already discovered URLs.…”

Section: Similarity Computation (mentioning)

confidence: 99%

“…To stop crawler from unproductive exhaustive crawling, stopping rules such as maximum crawl depth = 3 and the threshold is designed. While the assumptions are the same as in [37]. The problem with the database-driven web is that the crawlers keep crawling the data under the infinite loop and actual valuable web page are usually skipped by the crawler.…”

Section: Stopping Criteria and Threshold (mentioning)

confidence: 99%

IHWC: intelligent hidden web crawler for harvesting data in urban domains

Kaur, Singh, Geetha, et al. 2021. Complex Intell. Syst. (Self Cite)

Due to the massive size of the hidden web, searching, retrieving and mining rich and high-quality data can be a daunting task. Moreover, with the presence of forms, data cannot be accessed easily. Forms are dynamic, heterogeneous and spread over trillions of web pages. Significant efforts have addressed the problem of tapping into the hidden web to integrate and mine rich data. Effective techniques, as well as their application in special cases, need to be explored to achieve an effective harvest rate. One such special area is atmospheric science, where hidden web crawling is least implemented, and a crawler is required to crawl through the huge web to narrow down the search to specific data. In this study, an intelligent hidden web crawler for harvesting data in urban domains (IHWC) is implemented to address related problems such as classification of domains, prevention of exhaustive searching, and prioritization of URLs. The crawler also performs well in curating pollution-related data. The crawler targets the relevant web pages and discards the irrelevant ones by implementing rejection rules. To achieve more accurate results for a focused crawl, IHWC crawls the websites on priority for a given topic. The crawler has fulfilled the dual objective of developing an effective hidden web crawler that can focus on diverse domains and of checking its integration in searching pollution data in smart cities. One of the objectives of smart cities is to reduce pollution. The resulting crawled data can be used for finding the reason for pollution. The crawler can help the user to search the level of pollution in a specific area. The harvest rate of the crawler is compared with existing pioneering work. With an increase in the size of a dataset, the presented crawler can add significant value to emission accuracy. Our results demonstrate the accuracy and harvest rate of the proposed framework, which efficiently collects hidden web interfaces from large-scale sites and achieves higher rates than other crawlers.

“…A summary of existing research related to distributed crawling is shown in [32]. Other distributed crawling approaches include crawling the hidden web [33], a web crawling solution deployed by a cloud service [34], and a crawler that extracts information only regarding certain topic by classifying the crawled articles [35].…”

Section: Related Work (mentioning)

confidence: 99%

Decentralized News-Retrieval Architecture Using Blockchain Technology

Alexandrescu, Butincu. 2023. Mathematics.

Trust is a critical element when it comes to news articles, and an important problem is how to ensure trust in the published information on news websites. First, this paper describes the inner workings of a proposed news-retrieval and aggregation architecture employed by a blockchain-based solution for fighting disinformation; this includes a comparison between existing information retrieval solutions. The decentralized nature of the solution is achieved by separating the crawling (i.e., extracting the web page links) from the scraping (i.e., extracting the article information) and having third-party actors extract the data. A majority-rule mechanism is used to determine the correctness of the information, and the blockchain network is used for traceability. Second, the steps needed to deploy the distributed components in a cloud environment seamlessly are discussed in detail, with a special focus on the open-source OpenStack cloud solution. Lastly, novel methods for achieving a truly decentralized architecture based on community input and blockchain technology are presented, thus ensuring maximum trust and transparency in the system. The results obtained by testing the proposed news-retrieval system are presented, and the optimizations that can be made are discussed based on the crawling and scraping test results.
