Reinhard Rapp scite author profile

Algorithms for the alignment of words in translated texts are well established. However, only recently have new approaches been proposed to identify word translations from non-parallel or even unrelated texts. This task is more difficult, because most statistical clues useful in the processing of parallel texts cannot be applied to non-parallel texts. Whereas some studies of parallel texts have shown up to 99% of the word alignments to be correct, the accuracy for non-parallel texts has been around 30% up to now. The current study, which is based on the assumption that there is a correlation between the patterns of word co-occurrences in corpora of different languages, achieves a significant improvement, with about 72% of word translations identified correctly.
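The co-occurrence assumption in this abstract can be illustrated with a toy sketch. It is not the procedure from the paper: the seed lexicon, the window size, the cosine measure, and all function names and sentences below are assumptions chosen for illustration. The idea shown is that a word's context counts in one language, mapped through a small known lexicon, should resemble the context counts of its translation in the other language.

```python
from collections import Counter
import math


def cooccurrence_vector(word, sentences, window=2):
    """Count the tokens that appear within `window` positions of `word`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_translations(source_word, src_sents, tgt_sents, seed, candidates):
    """Rank target-language candidates by context similarity to source_word."""
    src_vec = cooccurrence_vector(source_word, src_sents)
    # Project the source contexts into the target language via the seed lexicon;
    # context words without a seed entry are ignored.
    mapped = Counter()
    for w, c in src_vec.items():
        if w in seed:
            mapped[seed[w]] += c
    scores = {c: cosine(mapped, cooccurrence_vector(c, tgt_sents))
              for c in candidates}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical, unrelated toy corpora (not taken from the study).
english = [["the", "teacher", "works", "at", "the", "school"],
           ["the", "teacher", "reads", "a", "book"]]
german = [["der", "lehrer", "arbeitet", "in", "der", "schule"],
          ["der", "lehrer", "liest", "ein", "buch"],
          ["der", "hund", "frisst", "das", "futter"]]
seed = {"works": "arbeitet", "reads": "liest",
        "school": "schule", "book": "buch"}

ranking = rank_translations("teacher", english, german, seed,
                            ["lehrer", "hund"])
print(ranking)
```

On these toy sentences "lehrer" ranks above "hund", because only "lehrer" shares mapped context words with "teacher". Real corpora would call for a proper association measure and a much larger seed lexicon.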


Identifying word translations in non-parallel texts

Rapp

1995

169

181


Common algorithms for sentence and word alignment allow the automatic identification of word translations from parallel texts. This study suggests that the identification of word translations should also be possible with non-parallel and even unrelated texts. The proposed method is based on the assumption that there is a correlation between the patterns of word co-occurrences in texts of different languages.


Overview of the Second BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora

Zweigenbaum

Sharoff

Rapp

2017


This paper presents the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. It recalls the design of the datasets, presents their final construction and statistics, and describes the methods used to evaluate system results. Thirteen runs were submitted to the shared task by four teams, covering three of the four proposed language pairs: French-English (7 runs), German-English (3 runs), and Chinese-English (3 runs). The best F-scores as measured against the gold standard were 0.84 (German-English), 0.80 (French-English), and 0.43 (Chinese-English). Because of the design of the dataset, in which not all gold parallel sentence pairs are known, these are only minimum values. We manually examined a small sample of the false negative sentence pairs for the most precise French-English runs and estimated the number of parallel sentence pairs not yet in the provided gold standard. Adding them to the gold standard raises the estimated French-English F-scores by at most 1.5 points. This suggests that the BUCC 2017 datasets provide a reasonable approximate evaluation of the parallel sentence spotting task.
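The F-scores quoted here combine precision and recall over sentence pairs. The sketch below shows the standard computation and how the score can change when a pair that was missing from the gold standard is added. It is not the shared task's evaluation script, and the sentence identifiers are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for predicted vs. gold sentence pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical identifiers for (source sentence, target sentence) pairs.
gold = {("fr-1", "en-1"), ("fr-2", "en-2"), ("fr-3", "en-3")}
predicted = {("fr-1", "en-1"), ("fr-2", "en-2"), ("fr-4", "en-4")}

p, r, f1 = precision_recall_f1(predicted, gold)

# Suppose manual inspection finds that ("fr-4", "en-4") is in fact parallel
# and it is added to the gold standard (an assumption for illustration).
revised_gold = gold | {("fr-4", "en-4")}
p2, r2, f1_revised = precision_recall_f1(predicted, revised_gold)

print(round(f1, 3), round(f1_revised, 3))
```

In this toy case the score against the incomplete gold standard understates the system's quality, which is why scores measured against such a gold standard are best read as minimum values.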


Free Word Associations Correspond to Contiguities Between Words in Texts

Wettler

Rapp

Sedlmeier

2005

Journal of Quantitative Linguistics


A free associative response is the first word a person comes up with after perceiving another word, the so-called associative stimulus. People commonly associate hot to cold, church to priest, and hard to work. According to traditional association theory this behaviour is the result of learning by contiguity: "Objects once experienced together tend to become associated in the imagination, so that when any one of them is thought of, the others are likely to be thought of also, in the same order of sequence or coexistence as before" (James, 1890). This explanation has been rejected by cognitive psychologists who explain the production of associations as the result of symbolic processes which make use of complex semantic structures (Clark, 1970). We will show, however, that human associative responses can be predicted from contiguities between words in language use. This finding supports the hypothesis that the behaviour of participants in the free association task can be explained by associative learning.


Overviewing Important Aspects of the Last Twenty Years of Research in Comparable Corpora

Sharoff

Rapp

Zweigenbaum

2013


A freely available morphological analyzer, disambiguator and context sensitive lemmatizer for German

Lezius

Rapp

Wettler

1998


In this paper we present Morphy, an integrated tool for German morphology, part-of-speech tagging and context-sensitive lemmatization. Its large lexicon of more than 320,000 word forms plus its ability to process German compound nouns guarantee a wide morphological coverage. Syntactic ambiguities can be resolved with a standard statistical part-of-speech tagger. By using the output of the tagger, the lemmatizer can determine the correct root even for ambiguous word forms. The complete package is freely available and can be downloaded from the World Wide Web.


The computation of word associations

Rapp

2002


It is shown that basic language processes such as the production of free word associations and the generation of synonyms can be simulated using statistical models that analyze the distribution of words in large text corpora. According to the law of association by contiguity, the acquisition of word associations can be explained by Hebbian learning. The free word associations as produced by subjects on presentation of single stimulus words can thus be predicted by applying first-order statistics to the frequencies of word co-occurrences as observed in texts. The generation of synonyms can also be conducted on co-occurrence data but requires second-order statistics. The reason is that synonyms rarely occur together but appear in similar lexical neighborhoods. Both approaches are systematically compared and are validated on empirical data. It turns out that for both tasks the performance of the statistical system is comparable to the performance of human subjects.
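The distinction between first-order and second-order statistics can be made concrete with a small sketch. It uses raw sentence-level co-occurrence counts and cosine similarity, which are simplifying assumptions and not necessarily the measures used in the paper; the function names and toy sentences are hypothetical. A first-order associate is a word that often occurs together with the stimulus. A second-order neighbour is a word whose co-occurrence profile resembles that of the stimulus, even if the two never occur together.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_counts(sentences):
    """Count, for every word, how many sentences it shares with each other word."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        unique = set(tokens)
        for w in unique:
            for v in unique:
                if w != v:
                    counts[w][v] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def first_order_associate(word, counts):
    """First-order: the word that co-occurs most often with `word`."""
    return max(counts[word], key=counts[word].get)


def second_order_neighbour(word, counts):
    """Second-order: the word whose co-occurrence vector is most similar."""
    others = [w for w in counts if w != word]
    return max(others, key=lambda w: cosine(counts[word], counts[w]))


# Hypothetical toy corpus: "hot" and "warm" never appear in the same sentence.
sentences = [["hot", "cold", "water"],
             ["hot", "cold", "weather"],
             ["hot", "tea"],
             ["warm", "tea"],
             ["warm", "water"],
             ["warm", "weather"]]

counts = cooccurrence_counts(sentences)
associate = first_order_associate("hot", counts)
neighbour = second_order_neighbour("hot", counts)
print(associate, neighbour)
```

Here the first-order statistic picks "cold" for "hot" because the two co-occur most often, while the second-order statistic picks "warm", which shares the contexts "tea", "water" and "weather" with "hot" without ever appearing beside it.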


BUCC Shared Task: Cross-Language Document Similarity

Sharoff

Zweigenbaum

Rapp

2015


Reinhard Rapp

Automatic identification of word translations from unrelated English and German corpora