2020
DOI: 10.1109/access.2020.2984503

A Novel Web Scraping Approach Using the Additional Information Obtained From Web Pages

Abstract: Web scraping is a process of extracting valuable and interesting text information from web pages. Most of the current studies targeting this task are about automated web data extraction. In the extraction process, these studies first create a DOM tree and then access the necessary data through this tree. The construction process of this tree increases the time cost depending on the data structure of the DOM tree. In the current web scraping literature, it is observed that time efficiency is ignored. Thi…

Citations: cited by 56 publications (17 citation statements)
References: 53 publications
“…In our study, a web crawler was developed that can easily create a dataset and extract new features from web pages. A crawler offers a lot of information about web data extraction [27]. With this crawler, a large dataset of 20,000 web pages from 200 websites was created.…”
Section: Discussion (mentioning)
confidence: 99%
“…Uzun et al [5] develop an intelligent crawler, namely iCrawler, which automatically pulls content out of various layouts for improving the crawling process. Uzun [27] proposes a novel approach that extracts data quickly using the string functions and additional information including the starting position, the number of the inner tag, and tag repetition obtained from web pages. The data obtained through web crawlers can be used for many different purposes.…”
Section: Related Studies (mentioning)
confidence: 99%
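
The statement above describes extraction with string functions plus stored hints such as a starting position and the number of inner tags. The following Python snippet is a minimal sketch of that general idea, not the cited paper's algorithm: the function name `extract_element`, its parameters, and the sample HTML are all hypothetical, and the matching is deliberately naive (a `<div` prefix would also match a tag such as `<divider>`).

```python
def extract_element(html: str, open_tag: str, tag_name: str, start: int = 0):
    """Return (inner_html, position) of the first element that begins with
    `open_tag`, searching from `start` using only string functions.

    Hypothetical illustration: `start` plays the role of a remembered
    starting position from an earlier page of the same website.
    """
    begin = html.find(open_tag, start)
    if begin == -1:
        return "", -1

    open_prefix = "<" + tag_name          # naive: also matches longer tag names
    close_tag = "</" + tag_name + ">"
    depth = 1                             # we are inside one open element
    pos = begin + len(open_tag)

    # Walk forward, counting nested opening tags of the same name so that
    # the closing tag we stop at belongs to the element we started in.
    while depth > 0:
        next_close = html.find(close_tag, pos)
        if next_close == -1:
            return "", -1                 # unbalanced markup
        next_open = html.find(open_prefix, pos)
        if next_open != -1 and next_open < next_close:
            depth += 1
            pos = next_open + len(open_prefix)
        else:
            depth -= 1
            pos = next_close + len(close_tag)

    inner = html[begin + len(open_tag): pos - len(close_tag)]
    return inner, begin


# Made-up sample page for the sketch.
html = ('<html><body><div id="nav"><div>menu</div></div>'
        '<div class="content"><div>inner</div>text</div></body></html>')

inner, position = extract_element(html, '<div class="content">', "div")
# Reusing `position` as the starting hint skips the scan over earlier markup.
inner_again, _ = extract_element(html, '<div class="content">', "div", start=position)
```

No DOM tree is built here; the trade-off is that the caller must already know the exact opening tag text, and malformed or unusual markup is not handled.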
“…Singrodia et al [10] introduced the concept of web scraping from easy to hard. Uzun [11] presented the concept of a document object model tree to scrape website data. Pujari et al [12] selected the XAMPP platform as the display surface to implement a web scraping scene application.…”
Section: Related Work (mentioning)
confidence: 99%
“…Recent advancements in machine learning and Artificial Intelligence (AI) have unfolded new opportunities, even in extensively studied research programs in numerous domains, including medical imaging (e.g., image recognition), transportation (feature extraction in self-driving cars) [1,2], and traffic scenarios (e.g., object detection) [3,4]. These advancements also encourage the extraction of relevant information from documents (pdf, doc, or txt files), websites, and images that use Optical Character Recognition (OCR) [5], subsequently inspiring the development of automated web data extraction systems through leading edge technology solutions [6,7]. The application of deep learning in web data extraction [8,9] is still in its nascent stage; in addition to extracting data from documents or web pages, this application involves navigating different websites and storing data for analytics and visualization purposes.…”
Section: Introduction (mentioning)
confidence: 99%