In the last decade the dominant models of machine translation (MT) have been data-driven or corpus-based. This is in sharp contrast to the dominant framework of the 1980s and earlier decades, which was 'rule-based' (RBMT). In general, a distinction is made between, on the one hand, statistical machine translation (SMT), based primarily on word frequencies and word combinations, and, on the other hand, example-based machine translation (EBMT), based on the extraction and combination of phrases (or other short segments of text). In both cases the corpora comprise bilingual texts (originals and their translations).

The origin of EBMT can be dated precisely to a conference paper presented in 1981 by Makoto Nagao (1984). Research, however, did not begin until the late 1980s, at the same time as the first appearance of translation memory (TM) as a translator's tool and the first research on SMT. The latter in particular gave rise to much dispute in the early 1990s. EBMT was associated with SMT, as both were seen as variants of corpus-based approaches to MT, and during the 1990s both became familiar at MT conferences. In recent years SMT has become the dominant (almost 'mainstream') approach in MT, as witnessed by the proceedings of almost any conference in the field of computational linguistics, and EBMT systems are now less evident than SMT ones (but more prevalent than RBMT).

The overall conception of SMT is now familiar: in essence, virtually all described models derive from the design first formulated in 1988 by the IBM group (Brown et al. 1988). Sentences of the bilingual corpus are first aligned; then individual words or word sequences (called 'phrases' or 'clumps' in the SMT literature) of the source language (SL) and target language (TL) texts are aligned, i.e. brought into correspondence. From these alignments are derived a 'translation model' of SL-TL frequencies and a 'language model' of TL word sequences.
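The interaction of these two models can be summarized in the familiar noisy-channel formulation; the symbols below are generic notation introduced here for illustration (s for an SL sentence, t for a candidate TL sentence), not notation taken from the IBM papers:

```latex
\hat{t} \;=\; \arg\max_{t} P(t \mid s)
        \;=\; \arg\max_{t}\; \underbrace{P(s \mid t)}_{\text{translation model}}
        \;\underbrace{P(t)}_{\text{language model}}
```

That is, the translation model scores how well candidate TL words or phrases correspond to the SL input, while the language model scores how plausible the resulting TL word sequence is.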
Translation involves the selection of the most probable TL output for each input word or phrase and the determination of the most probable sequence(s) of words in the TL.

By contrast, the EBMT model is less clearly defined than the SMT model. Basically (if somewhat superficially), an MT system is an EBMT system if it uses segments (word sequences, or strings, rather than individual words) of SL texts extracted from a text corpus (its example database) to build TL texts with the same meaning. The basic units for EBMT are sequences of words (phrases, or 'fragments'), and the basic techniques are the matching of input strings against SL strings in the database, the extraction of the corresponding TL strings, and the 'recombination' of those strings into acceptable TL sentences. However, there is a multiplicity of techniques, many derived from other approaches, including methods used in RBMT systems, methods found in SMT, techniques used in translation memories (TM), etc., and there seems to be no clear consensus on what the basic 'model' (or design framework) of EBMT is and what it is not.
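The three basic techniques just named (matching, extraction, recombination) can be sketched in a few lines of code. The toy example database, the English-French fragment pairs, and the greedy longest-match strategy are all illustrative assumptions for this sketch, not a description of any particular EBMT system:

```python
# Minimal sketch of the EBMT pipeline: match input fragments against SL
# strings in an example database, extract the aligned TL strings, and
# recombine them into TL output. Toy data; illustrative only.

EXAMPLES = {
    # SL (English) fragment -> TL (French) fragment (assumed alignments)
    ("the", "red", "car"): ("la", "voiture", "rouge"),
    ("is", "fast"): ("est", "rapide"),
    ("the", "red"): ("le", "rouge"),
}

def translate(sl_words):
    """Greedy longest-match over the example database."""
    tl_words, i = [], 0
    while i < len(sl_words):
        # Matching: try the longest SL fragment starting at position i.
        for j in range(len(sl_words), i, -1):
            fragment = tuple(sl_words[i:j])
            if fragment in EXAMPLES:
                # Extraction: take the corresponding TL fragment.
                tl_words.extend(EXAMPLES[fragment])
                i = j
                break
        else:
            # Fallback: pass unknown words through unchanged.
            tl_words.append(sl_words[i])
            i += 1
    # Recombination: here simple concatenation; real systems must also
    # repair boundary phenomena (agreement, word order, etc.).
    return " ".join(tl_words)

print(translate(["the", "red", "car", "is", "fast"]))
# -> la voiture rouge est rapide
```

Even this toy version shows why recombination is the hard step: concatenating extracted fragments gives no guarantee of grammatical agreement or correct word order at fragment boundaries, which is where the techniques borrowed from RBMT and SMT come in.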
Co...