Proceedings of the 18th International Conference on World Wide Web 2009
DOI: 10.1145/1526709.1526880

Purely URL-based topic classification

Abstract: Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Usually, web pages are classified using their content [7], but a URL-only classifier is preferable (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) web page is downloaded, (iii) when a page's content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser, without fetching the target page, and (v) when a focused crawler wants to infer the t…

Cited by 93 publications (71 citation statements)
References: 6 publications

“…We then use these lists to search for indicative words in the visible text of <a> elements, and in normalized text extracted from the URLs. We use a token extraction technique from [18] to extract the normalized text from URLs. For example, /watch/baseball/foxsports.html would be split into the tokens watch, baseball, foxsports, and html.…”
Section: A Feature Extraction (mentioning)
confidence: 99%
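
The tokenization described in this citation statement can be illustrated with a minimal sketch. It assumes tokens are the maximal alphanumeric runs in the URL path, lowercased; the exact technique of the cited work is not given here, and the function name `url_tokens` is hypothetical.

```python
import re
from urllib.parse import urlparse


def url_tokens(url: str) -> list[str]:
    """Split the path of a URL (or a bare path) into lowercase tokens.

    Assumption: a token is any maximal run of ASCII letters or digits,
    so '/', '.', '-', and '_' all act as separators.
    """
    path = urlparse(url).path
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", path) if t]


# The example path from the quoted statement.
print(url_tokens("/watch/baseball/foxsports.html"))
# ['watch', 'baseball', 'foxsports', 'html']
```

Splitting on every non-alphanumeric character is the simplest choice; it keeps the file extension (`html`) as a token, which matches the example in the statement.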
“…aspx, u2: http://allrecipes.com/RECIPE/Smores/default.aspx}. Rule generated from these URLs with u1 as source URL and u2 as target URL, has context c(k (1,5) …”
Section: Deep Token Components (mentioning)
confidence: 99%
“…In contrast with efforts in literature for extracting semantic features from URLs [16,5], we discuss a technique which extracts syntactic features from URLs. We employ an unsupervised technique to learn custom URL encodings used by webmasters.…”
Section: Deep Tokenization (mentioning)
confidence: 99%
“…When classifying Web documents, another source of information that can be used for classification is their Uniform Resource Locator (URL). Previous research has shown that classifiers built from features based solely on document URLs can achieve surprisingly good results on tasks such as language identification [24] or topic attribution [23]. Intuitively, URLs contain information that can be used to discriminate between local and global pages, such as top level domains or words such as local or regional.…”
Section: Classification Features (mentioning)
confidence: 99%
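
The intuition in this statement (top level domains and words such as local or regional as cues) might be sketched as follows. This is only an illustration under stated assumptions: the feature names, the `LOCAL_WORDS` set, and the function `url_features` are hypothetical and are not the feature set of the citing paper.

```python
import re
from urllib.parse import urlparse

# Assumption: only the two indicative words named in the statement.
LOCAL_WORDS = {"local", "regional"}


def url_features(url: str) -> dict:
    """Derive two simple URL-only features from an absolute URL.

    'tld' is the last dot-separated label of the hostname;
    'has_local_word' is True if any host or path token is in LOCAL_WORDS.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    tokens = {
        t.lower()
        for t in re.split(r"[^A-Za-z0-9]+", host + parsed.path)
        if t
    }
    return {"tld": tld, "has_local_word": bool(tokens & LOCAL_WORDS)}


print(url_features("http://www.example.co.uk/regional/news"))
# {'tld': 'uk', 'has_local_word': True}
```

A classifier would consume such features alongside others; the sketch stops at feature extraction because the statement does not specify the model.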
“…For instance, a document whose URL has a top level domain .uk is more likely to be local than a document with a top level domain such as .com. Taking inspiration on the experiments reported by Baykan et al [23], the following features were considered:…”
Section: Classification Features (mentioning)
confidence: 99%