Proceedings of the 2018 International Conference on Industrial Enterprise and System Engineering (IcoIESE 2018) 2019
DOI: 10.2991/icoiese-18.2019.50

Comparison of Web Scraping Techniques : Regular Expression, HTML DOM and Xpath

Abstract: Data collection is the initial stage of research. There are various data sources on the internet that can be used in the research process. The process of taking data or information from sites on the internet is called web scraping. Some methods of web scraping include Regular Expression (Regex), HTML DOM and XPath. This study aims to determine the performance of the three methods of web scraping. The comparison is done by testing each method when retrieving data from the target website, then measuring the perf…
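To make the three techniques named in the abstract concrete, here is a minimal Python sketch that extracts the same headings with a regular expression, a DOM API, and an XPath-style query, and records elapsed time and peak memory for each. It is an illustration only, not the authors' implementation: the sample page, the function names, and the choice of standard-library modules (`re`, `xml.dom.minidom`, `xml.etree.ElementTree`, which supports only a subset of XPath) are all assumptions, and the input is assumed to be well-formed XHTML.

```python
import re
import time
import tracemalloc
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML sample (not taken from the paper).
PAGE = (
    "<html><body>"
    "<h2 class='title'>First article</h2>"
    "<h2 class='title'>Second article</h2>"
    "<p>Unrelated text</p>"
    "</body></html>"
)


def scrape_regex(page):
    """Regex: match the markup as plain text, without parsing it."""
    return re.findall(r"<h2 class='title'>(.*?)</h2>", page)


def scrape_dom(page):
    """DOM: parse into a node tree, then walk it through the DOM API."""
    doc = xml.dom.minidom.parseString(page)
    return [
        node.firstChild.data
        for node in doc.getElementsByTagName("h2")
        if node.getAttribute("class") == "title"
    ]


def scrape_xpath(page):
    """XPath: parse into a tree, then select nodes with a path expression."""
    root = ET.fromstring(page)
    return [el.text for el in root.findall(".//h2[@class='title']")]


def measure(fn, page):
    """Return (result, elapsed seconds, peak traced bytes) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(page)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    for fn in (scrape_regex, scrape_dom, scrape_xpath):
        result, elapsed, peak = measure(fn, PAGE)
        print(f"{fn.__name__}: {result} in {elapsed:.6f}s, peak {peak} bytes")
```

All three functions return `['First article', 'Second article']` for this sample. The timing and memory figures depend on the machine, the page size, and the libraries used, so a single run on a toy page says nothing about which method is faster or lighter in general.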

Cited by 32 publications (27 citation statements)
References: 9 publications
“…Gunawan et al 4 compare the three different techniques for web scraping that is, Regular Expression (Regex), HTML DOM, and XPath in terms of performance. The research through experimentation, concludes by terming Regex technique as better in “less consumption of memory” than the other two, while HTML Dom technique performed better, in terms of “time.” Parvez et al 5 discusses about the various data extraction techniques and compare those, in relation to services.…”
Section: Related Work
type: mentioning
confidence: 99%
“…One way to retrieve data from the web is web scraping. Web scraping technology has been widely used, for example in study [5], while study [6], on data retrieval using web scraping and HTML DOM, successfully retrieved data to automatically build a parallel corpus from scraped data in various formats; web scraping will therefore be applied in the LSP system of Universitas Siliwangi.…”
Section: Introduction
type: unclassified
“…To address the shortcomings of previous studies, the research to be carried out will apply the web scraping technique. Web scraping, often known as screen scraping, is a technique for retrieving a semi-structured document in a markup language such as HTML or XHTML and analyzing that document to extract specific data for use in various purposes [12]. According to a previous study [6] comparing web scraping techniques, HTML DOM has the lowest memory usage. HTML DOM is a library for getting, changing, adding, or deleting HTML elements.…”
Section: E-issn:2540-9719
type: unclassified
“…Several studies related to the implementation of web scraping of scientific article or literature from the internet have been carried out beforehand including: web scraping for Indonesian -English parallel corpus using HTML DOM method [4], web-scraping software in searching for gray literature [5], application of web scraping techniques in scientific article search engines [13], the application of web scraping and winnowing web for the detection of plagiarism in the final project title [14], [15]. There are several algorithms that can be used in web scraping such as: regular expressions, HTML DOM, and Xpath [16]. Each algorithm has its own characteristics, so it needs a good understanding before applying it.…”
Section: Introduction
type: mentioning
confidence: 99%