Extracting the main content from a web page is essential in various applications such as web crawlers and browser reader modes. Existing extraction methods using text-based algorithms and features for English text can be ineffective for non-English web pages. This study proposes a main content extraction method that obtains visual and structural features from the rendered web page. Our method uses the first impression area (FIA), a part of a web page that users initially view. In this area, websites have applied many techniques that enable users to find the main content easily. Using the non-textual properties in the FIA, our method selects three points with high content area density and expands the area from each point until it meets several structural and visual-based conditions. We evaluated our method, browsers' (Mozilla Firefox and Google Chrome) reader modes, and existing main content extraction methods on multilingual datasets using two measures: Longest Common Subsequences and matched text blocks. The results showed that our method performed better than other methods in both English (up to 46%, matched text blocks F 0.5 ) and non-English (up to 42%, matched text blocks F 0.5 ) web pages.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.