A Novel Web Scraping Approach Using the Additional Information Obtained From Web Pages

Uzun, Erdinç

doi:10.1109/access.2020.2984503

Cited by 56 publications

(17 citation statements)

References 53 publications


“…In our study, a web crawler was developed that can easily create a dataset and extract new features from web pages. A crawler offers a lot of information about web data extraction [27]. With this crawler, a large dataset of 20,000 web pages from 200 websites was created.…”

Section: Discussion (mentioning)

confidence: 99%

“…Uzun et al [5] develop an intelligent crawler, namely iCrawler, automatically pulls content out of various layouts for improving the crawling process. Uzun [27] proposes a novel approach that extracts data quickly using the string functions and additional information including the starting position, the number of the inner tag, and tag repetition obtained from web pages. The data obtained through web crawlers can be used for many different purposes.…”

Section: Related Studies (mentioning)

confidence: 99%
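The citation statement above describes extraction with plain string functions, helped by a remembered starting position and a count of inner tags. A minimal sketch of that general idea follows; it is not the cited paper's algorithm, and the function name `extract_element` and the `start_hint` parameter are hypothetical names chosen for illustration.

```python
from typing import Tuple


def extract_element(html: str, open_tag: str, start_hint: int = 0) -> Tuple[str, int]:
    """Return (element_html, start_index) for the element opened by `open_tag`.

    `start_hint` is a position remembered from an earlier page of the same
    site (an assumption of this sketch); if it is stale, search from 0.
    """
    # Derive the tag name, e.g. '<div class="a"' -> 'div'.
    tag_name = open_tag[1:].split()[0].rstrip(">")
    # Simplification: '<div' would also match a tag such as '<divider'.
    open_prefix = "<" + tag_name
    close_tag = "</" + tag_name + ">"

    start = html.find(open_tag, start_hint)
    if start == -1:
        start = html.find(open_tag)  # hint did not help, scan the whole page
    if start == -1:
        return "", -1

    depth = 0  # number of currently open tags with the same name
    pos = start
    while True:
        next_open = html.find(open_prefix, pos)
        next_close = html.find(close_tag, pos)
        if next_close == -1:
            return "", -1  # unbalanced markup, give up
        if next_open != -1 and next_open < next_close:
            depth += 1
            pos = next_open + len(open_prefix)
        else:
            depth -= 1
            pos = next_close + len(close_tag)
            if depth == 0:
                return html[start:pos], start
```

Calling `extract_element(page, '<div class="a"')` on one page and passing the returned index as `start_hint` for the next page of the same site lets `str.find` skip most of the document, which is the kind of saving the statement attributes to the additional information.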

“…Unfortunately, these features are not enough for accurate prediction. In order to resolve this situation, we identified features that can be obtained from the web page by considering other features [5], [10] suggested in the literature. Furthermore, new features are suggested thanks to the additional modules added to our crawler in this study.…”

Section: Introduction (mentioning)

confidence: 99%


Automatically Discovering Relevant Images From Web Pages

et al. 2020

Self Cite


“…Singrodia et al [10] introduced the concept of web scraping from easy to hard. Uzun [11] presented the concept of a document object model tree to scrape website data. Pujari et al [12] selected the XAMPP platform as the display surface to implement a web scraping scene application.…”

Section: Related Work (mentioning)

confidence: 99%

Taiwan Stock Tape Reading Periodically Using Web Scraping Technology with GUI

Lin

Yang

2022

ASI


Stock tape reading involves checking stock prices at intervals and recording them. Stock prices may be observed on television or at the stock exchange. The time step for recording stock prices depends on each stock user’s experience and theory, perhaps 3 min or 2 h and so on. As an example, the Taiwan stock market runs from 9:00 a.m. to 1:30 p.m. It will have a 4 h operating time. Splitting the 4 h operating time for tape reading is the skill of stock users. The stock price sequence generated by tape reading can be used by stock users for prediction, but in the end that depends on the stock user’s experience. Therefore, the meaning of tape reading is to record the stock price, and the concept itself has no prediction purpose. This study used thread technology and proposed a tape-reading method with web scraping. This method can periodically scrape stock prices and write the resulting stock price sequence to an Excel file. This application can satisfy the demand of the stock users known as day trading users. These users want stock price sequences minute by minute, rather than in the day-by-day format of the stock exchange, and want sequences better than those provided by the stock website service, whose sequence format is limited and not normalized; they consider minute-by-minute stock price sequences very clear to forecast. This study implemented the prior scheme and designed the GUI to query a company’s stock price and its stock news, even per second, etc., and how long it took to update the stock price, and the GUI also included a time-up feature to stop scraping stock prices if users just wanted to scrape stock prices periodically.
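The abstract above mentions a worker thread that scrapes prices periodically and a time-up feature that stops it. A minimal sketch of that pattern, using only the Python standard library, might look as follows. It is not the authors' implementation: `scrape_periodically` and `fetch_price` are hypothetical names, the price source is passed in as a plain callable so that no real website is assumed, and the samples are returned as a list rather than written to Excel.

```python
import threading
import time
from typing import Callable, List, Tuple


def scrape_periodically(
    fetch_price: Callable[[], float],
    interval_s: float,
    time_up_s: float,
) -> List[Tuple[float, float]]:
    """Call `fetch_price` every `interval_s` seconds until `time_up_s` elapses.

    Returns a list of (unix_timestamp, price) samples.
    """
    samples: List[Tuple[float, float]] = []
    stop = threading.Event()

    def worker() -> None:
        while not stop.is_set():
            samples.append((time.time(), fetch_price()))
            # Sleep for one interval, but wake immediately if stop is set.
            stop.wait(interval_s)

    thread = threading.Thread(target=worker, daemon=True)
    timer = threading.Timer(time_up_s, stop.set)  # the "time-up" switch
    thread.start()
    timer.start()
    thread.join()   # returns once the worker has seen the stop event
    timer.cancel()  # harmless if the timer has already fired
    return samples
```

Using `Event.wait` instead of `time.sleep` means the worker reacts to the time-up signal without waiting out a full interval, which matters when the interval is minutes long.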


“…Recent advancements in machine learning and Artificial Intelligence (AI) have unfolded new opportunities, even in extensively studied research programs in numerous domains, including medical imaging (e.g., image recognition), transportation (feature extraction in self-driving cars) [1,2], and traffic scenarios (e.g., object detection) [3,4]. These advancements also encourage the extraction of relevant information from documents (pdf, doc, or txt files), websites, and images that use Optical Character Recognition (OCR) [5], subsequently inspiring the development of automated web data extraction systems through leading edge technology solutions [6,7]. The application of deep learning in web data extraction [8,9] is still in its nascent stage; in addition to extracting data from documents or web pages, this application involves navigating different websites and storing data for analytics and visualization purposes.…”

Section: Introduction (mentioning)

confidence: 99%

Intelligent and adaptive web data extraction system using convolutional and long short-term memory deep learning networks

Patnaik

Babu

Bhave

2021

Big Data Min. Anal.


Data, collected using advanced web scraping technologies, are crucial to the growth of e-commerce in today's world of highly demanding, hyper-personalized consumer experiences. However, core data extraction engines fail because they cannot adapt to the dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system with convolutional and Long Short-Term Memory (LSTM) networks to enable automated web page detection using the You Only Look Once (YOLO) algorithm and Tesseract LSTM to extract product details, which are detected as images from web pages. This state-of-the-art system does not need a core data extraction engine, and thus can adapt to dynamic changes in website layout. Experiments conducted on real-world retail cases demonstrate an image detection (precision) and character extraction accuracy (precision) of 97% and 99%, respectively. In addition, a mean average precision of 74%, with an input dataset of 45 objects or images, is obtained.
