Proceedings of the 6th International Conference on Web Engineering (ICWE '06), 2006
DOI: 10.1145/1145581.1145634

Catching web crawlers in the act

Abstract: This paper recommends a new approach to the detection and containment of Web crawler traverses based on clickstream data mining. Timely detection prevents abusive crawler consumption of Web server resources and eventual violation of site content privacy or copyright. Clickstream data differentiation ensures focused usage analysis, valuable both for regular user and crawler profiling. Our platform, named ClickTips, sustains a site-specific, updatable detection model that tags Web crawler traverses based on incre…
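
The truncated abstract does not show the detection model itself. As orientation only, here is a minimal sketch of clickstream sessionization, the usual preprocessing step before traverses can be tagged; the record shape (ip, user_agent, timestamp, url, referrer) and the 30-minute inactivity timeout are common heuristics, not details from the paper:

```python
from datetime import timedelta

# Assumed record shape: (ip, user_agent, timestamp, url, referrer).
# The 30-minute gap is a common heuristic, not a value from the paper.
SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(records):
    """Group clickstream records into sessions keyed by (ip, user_agent),
    closing a session after SESSION_TIMEOUT of inactivity."""
    current, last_seen, closed = {}, {}, []
    for ip, ua, ts, url, ref in sorted(records, key=lambda r: r[2]):
        key = (ip, ua)
        if key in current and ts - last_seen[key] > SESSION_TIMEOUT:
            closed.append(current.pop(key))   # inactivity gap: close session
        current.setdefault(key, []).append((ts, url, ref))
        last_seen[key] = ts
    closed.extend(current.values())           # flush still-open sessions
    return closed
```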

Cited by 27 publications (28 citation statements) · References 9 publications · Citing publications: 2009–2022

“…Our analysis reveals that such observation is generally, but not always, true. Among all the logs, an unassigned referrer field was found in about 25% of the requests, which is about half of the number reported in an earlier study (Lourenco and Belo, 2006). Among the rest, about 60% contained URLs external to Microsoft.…”
Section: Referrer
confidence: 75%
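
To make the quoted measurement concrete, here is a small sketch that estimates the share of requests with an unassigned referrer. It assumes Combined Log Format, where an empty referrer is logged as "-"; the log layout and regex are assumptions, not details from either study:

```python
import re

# Combined Log Format tail: "<request>" <status> <bytes> "<referrer>" ...
LOG_RE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referrer>[^"]*)"')

def unassigned_referrer_ratio(log_lines):
    """Fraction of parseable requests whose referrer field is unassigned."""
    total = unassigned = 0
    for line in log_lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        total += 1
        if m.group("referrer") in ("-", ""):
            unassigned += 1
    return unassigned / total if total else 0.0
```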
“…Tan and Kumar (Tan and Kumar, 2000, 2002), Dikaiakos and Stassopoulou (Dikaiakos et al., 2003, 2005; Stassopoulou and Dikaiakos, 2006), Almeida et al. (Almeida et al., 2001), and Lourenco and Belo (Lourenco and Belo, 2006) used most of the information included in the log, including some variations on traffic patterns. In particular, they examined HTTP-traffic characteristics (e.g., methods, error codes) as well as resource referencing behavior (e.g., file type, percentage of distinct requests, resource popularity, and concentration of requests).…”
Section: This Paper
confidence: 99%
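
A sketch of the kind of per-session features these works examine (HTTP methods, error codes, file types, percentage of distinct requests); the exact feature definitions in the cited papers may differ, and the feature names and heuristics below are illustrative:

```python
from collections import Counter
from urllib.parse import urlparse
import os

def session_features(requests):
    """Per-session features of the kind surveyed above.
    `requests` is a list of (method, url, status_code) tuples."""
    n = len(requests)
    methods = Counter(m for m, _, _ in requests)
    errors = sum(1 for _, _, s in requests if s >= 400)
    exts = Counter(os.path.splitext(urlparse(u).path)[1].lower() or "<none>"
                   for _, u, _ in requests)
    distinct = len({u for _, u, _ in requests})
    return {
        "head_ratio": methods["HEAD"] / n,   # crawlers often probe with HEAD
        "error_ratio": errors / n,           # stale links yield 4xx/5xx
        "image_ratio": sum(exts[e] for e in (".gif", ".jpg", ".png")) / n,
        "pct_distinct": distinct / n,        # crawlers rarely re-request pages
    }
```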
“…A crawler may crawl content automatically while ignoring the robots.txt file [17]. In this paper, one lexical pattern is defined under which the robots.txt file is never crawled.…”
Section: Description
confidence: 99%
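
The flip side of this observation, that polite crawlers do fetch robots.txt while interactive browsers essentially never do, is a classic detection signal and reduces to a one-line session check (illustrative only; as the statement notes, a crawler that ignores robots.txt evades it):

```python
from urllib.parse import urlparse

def requested_robots_txt(session_urls):
    """True if any request in the session targeted /robots.txt --
    a strong hint that the session belongs to a (polite) crawler."""
    return any(urlparse(u).path == "/robots.txt" for u in session_urls)
```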
“…The present work used the pattern mining approach introduced in [4,3], involving the semi-automatic labeling of a training set of Web sessions and tree model induction. Besides crawler and regular user sessions, browser-related application sessions were also identified (…
Section: Profile Analysis
confidence: 99%
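
The tree model induction step is not detailed in the quoted statement. As a stand-in, here is a minimal sketch using scikit-learn's DecisionTreeClassifier over hypothetical session feature vectors and the three session classes mentioned above; the actual induction algorithm and features of [4,3] may differ:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: one vector per labeled session, e.g.
# [head_ratio, error_ratio, image_ratio, pct_distinct].
# Labels from the semi-automatic step: 0 = regular user, 1 = crawler,
# 2 = browser-related application.
X = [[0.00, 0.02, 0.45, 0.60],
     [0.30, 0.15, 0.01, 0.98],
     [0.00, 0.00, 0.90, 0.20]]
y = [0, 1, 2]

clf = DecisionTreeClassifier(max_depth=5, random_state=0)
clf.fit(X, y)
print(clf.predict([[0.25, 0.10, 0.02, 0.95]]))  # likely [1], i.e. crawler-like
```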