2011 International Conference on Asian Language Processing
DOI: 10.1109/ialp.2011.23

Using HTML Tags to Improve Parallel Resources Extraction

Abstract: This paper proposes a new approach to extracting parallel resources (including bilingual sentences and bilingual terms) from bilingual web pages, which have a primary language and a secondary language (the secondary-language text is often a translation of the primary-language text). Our method is composed of four tasks: 1) parsing the web page into a DOM tree and segmenting the inner text of each node into a series of monolingual snippets; 2) selecting adjacent snippet pairs that are in different languages and have higher translation scores …
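
The first task in the abstract, and the adjacency part of the second, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Chinese/English page (the abstract does not name the language pair), uses a coarse Unicode-range test in place of real language identification, and omits translation scoring because the abstract is truncated before describing it. The names `script_of`, `segment`, `SnippetParser`, and `adjacent_pairs` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def script_of(ch):
    """Coarse label (assumption): 'zh' for CJK ideographs, 'en' for ASCII letters."""
    if "\u4e00" <= ch <= "\u9fff":
        return "zh"
    if ch.isascii() and ch.isalpha():
        return "en"
    return None  # digits, punctuation, whitespace are treated as neutral


def segment(text):
    """Split text into (lang, snippet) runs; neutral characters join the current run."""
    runs = []
    for ch in text:
        lang = script_of(ch)
        if lang is None or (runs and runs[-1][0] == lang):
            if runs:  # neutral characters before the first run are dropped
                runs[-1][1].append(ch)
        else:
            runs.append([lang, [ch]])
    return [(lang, "".join(chars).strip()) for lang, chars in runs]


class SnippetParser(HTMLParser):
    """Walks the page and records (dom_path, lang, snippet) for every text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.snippets = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching tag, tolerating unclosed elements.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        path = "/".join(self.stack)
        for lang, snippet in segment(data):
            if snippet:
                self.snippets.append((path, lang, snippet))


def adjacent_pairs(snippets):
    """Candidate pairs: consecutive snippets under the same node in different languages."""
    return [
        (a[2], b[2])
        for a, b in zip(snippets, snippets[1:])
        if a[0] == b[0] and a[1] != b[1]
    ]


parser = SnippetParser()
parser.feed("<html><body><p>机器翻译 machine translation</p></body></html>")
print(adjacent_pairs(parser.snippets))
```

In a fuller pipeline each candidate pair would then be ranked by a translation score; that step is left out here because the source does not specify it.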

Cited by 5 publications (1 citation statement)
References 12 publications
“…A crawler is a program for which we specify a seed URL, and keep going based on that URL retrieving connected pages. 1 A page is parsed for additional URLs by URL Normalization, where these URLs are saved in storage for crawling [27], [28], and are used to retrieve the more available Web pages from a Web server. The process of crawling may be divided among multiple distributed crawlers.…”
Section: Methods (mentioning)
confidence: 99%
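
The crawling loop described in the citation statement can be sketched as follows. This is an illustrative reading of that description, not the citing paper's code. It assumes that URL normalization means resolving relative links, dropping fragments, and lowercasing the scheme and host, and it takes a caller-supplied `fetch` function so the example runs without network access. The names `normalize`, `LinkParser`, and `crawl` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit


def normalize(base, href):
    """Resolve href against base, drop the fragment, lowercase scheme and host."""
    absolute, _fragment = urldefrag(urljoin(base, href))
    parts = urlsplit(absolute)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class LinkParser(HTMLParser):
    """Collects the href value of every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(seed, fetch, limit=100):
    """Breadth-first crawl from a seed URL; fetch(url) returns HTML or None."""
    seed = normalize(seed, "")
    frontier = deque([seed])  # URLs saved for later crawling
    seen = {seed}             # normalized URLs already queued
    visited = []
    while frontier and len(visited) < limit:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue  # page unavailable: skip it
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            link = normalize(url, href)
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Usage with an in-memory stand-in for a web server (made-up pages).
pages = {
    "http://example.com/": '<a href="/a.html#top">A</a><a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a><a href="b.html">B</a>',
    "http://example.com/b.html": "",
}
print(crawl("http://example.com/", pages.get))
```

A distributed variant, as the statement mentions, would partition the frontier among several workers; the single queue here keeps the sketch short.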