Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural 2009
DOI: 10.3115/1690219.1690268

Mining bilingual data from the web with adaptively learnt patterns

Abstract: Mining bilingual data (including bilingual sentences and terms) from the Web can benefit many NLP applications, such as machine translation and cross language information retrieval. In this paper, based on the observation that bilingual data in many web pages appear collectively following similar patterns, an adaptive pattern-based bilingual data mining method is proposed. Specifically, given a web page, the method contains four steps: 1) preprocessing: parse the web page into a DOM tree and segment the inn…

Cited by 23 publications (23 citation statements)
References 15 publications
“…But they didn't pay attention to filter out noisy candidates extracted from good wrappers. Compared to [8], we propose to rank all candidates, extracted from all wrappers, by their relevance with seeds, but don't concern on the quality of wrappers. Our method can be applied to bilingual web pages written in any pair of languages indiscriminately, such as Japanese-English, Korean-English and so on, for that our approach is completely character-based and doesn't limit any language and domain.…”
Section: Introduction (mentioning)
confidence: 99%
“…We propose to extract good candidate instances by ranking. In [8], they focus on selecting good wrappers, and all candidate pairs extracted by them are regarded as parallel data. But they didn't pay attention to filter out noisy candidates extracted from good wrappers.…”
Section: Introduction (mentioning)
confidence: 99%
“…Jiang et al (2009) used an adaptive pattern-based method to mine interesting bilingual data based on the observation that bilingual data usually appears collectively following similar patterns. They found that bilingual web pages are a promising source of up-to-date bilingual terms/sentences which cover many domains and application scenarios.…”
Section: Related Work (mentioning)
confidence: 99%
“…In the past decade, there have been extensive studies on parallel resource extraction from the web (e.g., Chen and Nie, 2000; Resnik 2003; Jiang et al, 2009) and many effective Web mining systems have been developed such as STRAND, PTMiner, BITS and WPDE. For most of these mining systems, there is a typical parallel resource mining strategy which involves three steps: (1) locate the bilingual websites (2) identify parallel web pages from these bilingual websites and (3) extract bilingual resources from the parallel web pages.…”
Section: Introduction (mentioning)
confidence: 99%
“…de Souza et al, 2015). Also most of the previous approaches to bilingual data mining/cleaning for statistical MT rely on supervised learning (Resnik and Smith, 2003;Munteanu and Marcu, 2005;Jiang et al, 2009). Unsupervised solutions, like the one proposed by Cui et al (2013) usually rely on redundancy-based approaches that reward parallel segments containing phrase pairs that are frequent in a training corpus.…”
Section: Introduction (mentioning)
confidence: 99%