Multimodal Web Page Segmentation Using Self-organized Multi-objective Clustering

Ramesh, Srivatsa; Dias, Gaël; Andrew, Judith Jeyafreeda; Saha, Sriparna; Maurel, Fabrice; Ferrari, Stéphane

doi:10.1145/3480966

Cited by 4 publications

(3 citation statements)

References 67 publications


“…In addition, the HEPS method [20], mentioned in the Webis-WebSeg-20 dataset [16] for comparison, utilizes text nodes and images to identify potential headings, corresponding blocks, and create a hierarchical segmentation. The DOM structure is also a vital component in other segmentation models [14,15], where additional factors like textual and visual cues are integrated to enhance performance.…”

Section: WPS Approaches

mentioning

confidence: 99%

“…Over time, many solutions have been proposed to address the segmentation problem using different approaches and learning strategies. The most commonly used techniques fall into several categories: ad-hoc approaches [7,29,6,18,25] (which rely on manually-tuned heuristics and parameter-dependent methods), theoretically-founded approaches [9,1] (based on graph-theoretic and classical clustering algorithms), computer vision approaches [13,11], and others (as mentioned in [14]). In general, these approaches share three key elements: visual, textual, and structural cues found on web pages.…”

Section: Introduction

mentioning

confidence: 99%


A DOM-structural Cohesion Analysis Approach for Segmentation of Modern Web Pages

Huynh, Le, Nguyen et al. 2024

Preprint

Web page segmentation is a fundamental technique applied in information retrieval systems to enhance web crawling tasks and information extraction. Its purpose is to gain deep insights from crawling results and extract the main content of a webpage by disregarding the irrelevant regions. Over time, several solutions have been proposed to address the segmentation problem using different approaches and learning strategies. Among these, the structural cue, which is a characteristic of the DOM tree, is widely utilized as a primary factor in segmentation models. In this paper, we propose a novel technique for web page segmentation using DOM-structural cohesion analysis. Our approach involves generating blocks that represent groups of DOM subtrees with similar tag structures. By analyzing the cohesion within each generated block and comparing detailed information such as types, attributes, and visual cues of web page elements, we can effectively maintain or reconstruct the segmentation layout. Additionally, we employ the Canny algorithm to optimize the segmentation result by reducing redundant spaces, resulting in a more correct segmentation. We evaluate the effectiveness of our approach using a dataset of 1,969 web pages. The approach achieves 64% on the FB3 score, surpassing existing state-of-the-art methods. The proposed DOM-structural cohesion analysis has the potential for improving web page segmentation and its various applications.


“…In comparison to Andrew Judith et al solution [37], our method defines as many content blocks as there are on the page, not limiting the number of blocks. In comparison to other segment number not fixed solutions [38], this method is faster, as it does not require two stages (to identify the number of clusters and then to divide the web page into this number of blocks) and extracts all possible content blocks from the web page. The blocks are not limited to text containing structured blocks only [39] and extract all, not only structured blocks [40].…”

mentioning

confidence: 99%

Web Page Content Block Identification with Extended Block Properties

Griazev, Ramanauskaitė 2023

Applied Sciences

Web page segmentation is one of the most influential factors for the automated integration of web page content with other systems. Existing solutions are focused on segmentation but do not provide a more detailed description of the segment including its range (minimum and maximum HTML code bounds, covering the segment content) and variants (the same segments with different content). Therefore the paper proposes a novel solution designed to find all web page content blocks and detail them for further usage. It applies text similarity and document object model (DOM) tree analysis methods to indicate the maximum and minimum ranges of each identified HTML block. In addition, it indicates its relation to other blocks, including hierarchical as well as sibling blocks. The evaluation of the method reveals its ability to identify more content blocks in comparison to human labeling (in manual labeling only 24% of blocks were labeled). By using the proposed method, manual labeling effort could be reduced by at least 70%. Better performance was observed in comparison to other analyzed web page segmentation methods, and better recall was achieved due to focus on processing every block present on a page, and providing a more detailed web page division into content block data by presenting block boundary range and block variation data.
