Aligning sentences in bilingual corpora using lexical information

Chen, Stanley F.

doi:10.3115/981574.981576

Cited by 134 publications

(72 citation statements)

References 7 publications

“…However, these need a large number of previously aligned texts for training, which is a great hurdle for language pairs, such as Russian-German. Moreover, as Braune and Fraser (2010) note, a large number of them are also not completely language independent and not flexible to other language pairs (Chen, 1993; Fattah et al., 2007). Thus, supervised alignment cannot be easily applied to this data and we turn back to unsupervised approaches.…”

Section: Gargantua (mentioning)

confidence: 99%

Sentence-Alignment and Application of Russian-German Multi-Target Parallel Corpora for Linguistic Analysis and Literary Studies

Zhekova

Zangenfeind

Mikhaylova

et al. 2015

MATLIT

Authors: Zhekova, Desislava; Zangenfeind, Robert; Mikhaylova, Alena; Nikolaienko, Tetiana

Abstract: This article presents the application of multi-target parallel corpora, composed of a single source text and multiple target translations of that text, to linguistic analysis. It discusses the alignment, interactive search, and visualization of this type of data using a dedicated tool called ALuDo (Alignment with Lucene for Dostoyevsky). ALuDo is a Java application that uses local grammars, ontological information, bilingual dictionaries, and statistical approaches for alignment and search. The dataset consists of the Russian novel Crime and Punishment by Fyodor Dostoyevsky and three German translations of the novel. With this bilingual corpus, significant research can be carried out in linguistics and literary studies. In addition, we publish part of the resulting parallel corpus.

Keywords: interactive alignment; rule-based alignment; statistical alignment; coreference resolution; paraphrase identification.

“…The algorithm from Brown, Gale and Chen introduced the conception of anchor and divided the whole corpus into several smaller segments when aligning Hansard corpus [3]. It adopted the specific annotation from corpus to serve as anchor, and matched these anchors with dynamic planning algorithm.…”

Section: Related Work (mentioning)

confidence: 99%

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

Qiu

2015

IJDTA

“…So most parallel corpora are aligned in terms of sentences. Reviewing the literature on aligning parallel corpora, we found four main approaches to the problem of alignment at the sentence level: word length-based (Gale and Church 1991), character length-based (Brown et al. 1991), dictionary- or translation-based (Chen 1993, Melamed 1996, Moore 2002), and partial similarity-based (Simard and Plamondon 1998). In this experiment, the alignment of sentences was done entirely manually.…”

Section: Aligning the Parallel Corpus (mentioning)

confidence: 99%

Constructing a Large-Scale English-Persian Parallel Corpus

Miangah

2009

Meta

Abstract: In recent years the exploitation of large text corpora to solve various kinds of linguistic problems, including those of translation, has become commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them.

The project reported here is an attempt to constitute an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is of course open; that is, more material can be added as the need arises.

One of the main activities associated with building such a corpus is to develop software for parallel concordancing, in which a user can enter a search string in one language and see all the citations for that string in that language, together with the corresponding sentences in the target language. Our intention is to construct general translation memory software using the present English-Persian parallel corpus.

The aligned bilingual corpus also proves useful in many other settings, including machine translation, word sense disambiguation, cross-language information retrieval, lexicography, and language learning.
