Automatic Extraction of Logical Web Lists

Lanotte, Pasqua Fabiana; Fumarola, Fabio; Ceci, Michelangelo; Scarpino, Andrea; Torelli, Michele Damiano; Malerba, Donato

doi:10.1007/978-3-319-08326-1_37

Cited by 5 publications

(4 citation statements)

References 13 publications

“…Here, researchers and practitioners have used the hyperlink structure to organize Web pages for many years. The basic idea of Web structure mining algorithms is that if there is a hyperlink between two pages, then some semantic relation may exist between them [5,22,27]. A Web structure mining naïve solution for sitemap generation is the application of the simple breadth search algorithm.…”

Section: Sitemap Extraction (mentioning)

confidence: 99%
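
The naïve breadth-first approach mentioned in the statement above can be illustrated with a minimal sketch. It is not the cited authors' implementation: the function name `bfs_sitemap`, the dictionary-based link graph, and the example URLs are all assumptions made for illustration. Each page is attached to the first page from which it is reached, which yields a tree rooted at the homepage.

```python
from collections import deque


def bfs_sitemap(links, home):
    """Return {page: parent} for every page reachable from `home`.

    `links` is a hypothetical adjacency mapping: page -> list of pages it links to.
    The parent of each page is the page from which breadth-first search first reached it.
    """
    parent = {home: None}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in parent:  # first visit fixes the page's place in the tree
                parent[target] = page
                queue.append(target)
    return parent


# Hypothetical toy website used only to exercise the function.
links = {
    "/": ["/products", "/about"],
    "/products": ["/products/a", "/about"],
    "/about": ["/"],
}
sitemap = bfs_sitemap(links, "/")
```

Because every hyperlink is followed regardless of its role, a sketch like this treats shortcut and ranking-oriented links the same as navigation links, which is the limitation the Web-list-based methods quoted here aim to address.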

“…Several works in the field of Web mining exploit Web pages taking advantage of the structural and visual information embedded in the HTML tags. In [5,22,26] collections of hyperlinks having similar visual and/or structural properties are used to filter noisy links and collect Web pages belonging to same semantic type. In [5,26] the aim is to exploit Web lists for the task of Web page clustering.…”

Section: Automatic Extraction Of Web Lists (mentioning)

confidence: 99%

“…In this way, two Web pages have a similar in-page link-structure if they frequently appear together in link collections. Differently, in [22] the authors define the concept of logical Web lists (i.e. Web lists that collect structured data spanned in multiple pages of the same website) for information extraction purposes.…”

Section: Automatic Extraction Of Web Lists (mentioning)

confidence: 99%

“…This is the case of hyperlinks used to enforce the Web page authority in a link-based ranking scenario, short-cut hyperlinks, etc. The solution we propose, based on the usage of Web lists, has a twofold effect: on the one hand, it guarantees that only hyperlinks which may belong to potential navigation systems are considered; on the other hand, it allows the method to identify hyperlinks by implicitly taking into account the Web page structure codified in the Web lists available in the Web pages [5,22,37,42], even if the hyperlinks do not belong to the navigation system. The crawling algorithm is described in Algorithm 1.…”

Section: Website Crawling (mentioning)

confidence: 99%

Closed sequential pattern mining for sitemap generation

Ceci; Lanotte

2020

World Wide Web

Self Cite

A sitemap represents an explicit specification of the design concept and knowledge organization of a website and is therefore considered as the website’s basic ontology. It not only presents the main usage flows for users, but also hierarchically organizes concepts of the website. Typically, sitemaps are defined by webmasters in the very early stages of the website design. However, during their life websites significantly change their structure, their content and their possible navigation paths. Even if this is not the case, webmasters can fail to either define sitemaps that reflect the actual website content or, vice versa, to define the actual organization of pages and links which do not reflect the intended organization of the content coded in the sitemaps. In this paper we propose an approach which automatically generates sitemaps. Contrary to other approaches proposed in the literature, which mainly generate sitemaps from the textual content of the pages, in this work sitemaps are generated by analyzing the Web graph of a website. This allows us to: i) automatically generate a sitemap on the basis of possible navigation paths, ii) compare the generated sitemaps with either the sitemap provided by the Web designer or with the intended sitemap of the website and, consequently, iii) plan possible website re-organization. The solution we propose is based on closed frequent sequence extraction and only concentrates on hyperlinks organized in “Web lists”, which are logical lists embedded in the pages. These “Web lists” are typically used for supporting users in Web site navigation and they include menus, navbars and content tables. Experiments performed on three real datasets show that the extracted sitemaps are much more similar to those defined by website curators than those obtained by competitor algorithms.

Section: Sitemap Extraction (mentioning)

confidence: 99%