Erdinç Uzun scite author profile

Predictive emission monitoring systems (PEMS) are important tools for validation and backing up of costly continuous emission monitoring systems used in gas-turbine-based power plants. Their implementation relies on the availability of appropriate and ecologically valid data. In this paper, we introduce a novel PEMS dataset collected over five years from a gas turbine for the predictive modeling of the CO and NOx emissions. We analyze the data using a recent machine learning paradigm, and present useful insights about emission predictions. Furthermore, we present a benchmark experimental procedure for comparability of future works on the data.


An effective and efficient Web content extractor for optimizing the crawling process

Uzun, Güner, Kılıçaslan, et al. 2013. Softw. Pract. Exper.


SUMMARY Classical Web crawlers make use of only hyperlink information in the crawling process. However, focused crawlers are intended to download only Web pages that are relevant to a given topic by utilizing word information before downloading the Web page. But, Web pages contain additional information that can be useful for the crawling process. We have developed a crawler, iCrawler (intelligent crawler), the backbone of which is a Web content extractor that automatically pulls content out of seven different blocks: menus, links, main texts, headlines, summaries, additional necessaries, and unnecessary texts from Web pages. The extraction process consists of two steps, which invoke each other to obtain information from the blocks. The first step learns which HTML tags refer to which blocks using the decision tree learning algorithm. Being guided by numerous sources of information, the crawler becomes considerably effective. It achieved a relatively high accuracy of 96.37% in our experiments of block extraction. In the second step, the crawler extracts content from the blocks using string matching functions. These functions along with the mapping between tags and blocks learned in the first step provide iCrawler with considerable time and storage efficiency. More specifically, iCrawler performs 14 times faster in the second step than in the first step. Furthermore, iCrawler significantly decreases storage costs by 57.10% when compared with the texts obtained through classical HTML stripping. Copyright © 2013 John Wiley & Sons, Ltd.


A Novel Web Scraping Approach Using the Additional Information Obtained From Web Pages

Uzun. 2020. IEEE Access


Web scraping is a process of extracting valuable and interesting text information from web pages. Most current studies targeting this task focus on automated web data extraction. In the extraction process, these studies first create a DOM tree and then access the necessary data through this tree. The construction process of this tree increases the time cost depending on the data structure of the DOM Tree. In the current web scraping literature, it is observed that time efficiency is ignored. This study proposes a novel approach, namely UzunExt, which extracts content quickly using the string methods and additional information without creating a DOM Tree. The string methods consist of the following consecutive steps: searching for a given pattern, then calculating the number of closing HTML elements for this pattern, and finally extracting content for the pattern. In the crawling process, our approach collects the additional information, including the starting position for enhancing the searching process, the number of inner tags for improving the extraction process, and tag repetition for terminating the extraction process. The string methods of this novel approach are about 60 times faster than extracting with the DOM-based method. Moreover, using this additional information improves extraction time by 2.35 times compared to using only the string methods. Furthermore, this approach can easily be adapted to other DOM-based studies/parsers in this task to enhance their time efficiencies.

INDEX TERMS Computational efficiency, algorithm design and analysis, web crawling and scraping, document object model.
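The three string-method steps named in this abstract (search for a pattern, count closing elements, extract the content) can be illustrated with a minimal Python sketch. This is not the UzunExt implementation: the function name `extract_element`, its parameters, and the use of a regular expression to scan for tags are assumptions made for illustration only. The optional `start` argument loosely mirrors the "starting position" hint the abstract mentions.

```python
import re
from typing import Optional


def extract_element(html: str, pattern: str, start: int = 0) -> Optional[str]:
    """Return the inner content of the first element opened by `pattern`.

    `pattern` is the literal opening tag, e.g. '<div class="content">'.
    `start` is an optional position hint from which to begin searching.
    Hypothetical helper: an illustration, not the paper's code.
    """
    # Step 1: search for the given pattern with a plain string search.
    begin = html.find(pattern, start)
    if begin == -1:
        return None

    name = re.match(r"<\s*([A-Za-z][A-Za-z0-9]*)", pattern)
    if name is None:
        raise ValueError("pattern must begin with an opening HTML tag")
    tag = name.group(1)
    content_start = begin + len(pattern)

    # Step 2: scan later tags of the same name, tracking nesting depth so
    # that inner elements do not end the extraction too early.
    tag_re = re.compile(r"<(/?)%s\b[^>]*>" % re.escape(tag), re.IGNORECASE)
    depth = 1
    for match in tag_re.finditer(html, content_start):
        if match.group(1):  # a closing tag such as </div>
            depth -= 1
            if depth == 0:
                # Step 3: extract everything between the opening tag and
                # its matching closing tag.
                return html[content_start:match.start()]
        else:  # a nested opening tag of the same name
            depth += 1
    return None  # the element is never closed


if __name__ == "__main__":
    page = (
        '<body><div class="menu"><div>a</div></div>'
        '<div class="content"><p>Hello</p><div>inner</div> world</div></body>'
    )
    print(extract_element(page, '<div class="content">'))
```

No DOM tree is built here; the cost is one string search plus a linear scan over tags of a single name. The sketch does not handle self-closing tags, comments, or tag-like text inside scripts.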


Text classification of web based news articles by using Turkish grammatical features

Tüfekçi, Uzun, Sevinc. 2012




Erdinç Uzun

A hybrid approach for extracting informative content from web pages

Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS
