2012
DOI: 10.1108/02640471211221395
Bilingual terminology extraction using multi‐level termhood

Abstract: Purpose - Terminology is the set of technical words or expressions used in specific contexts; it denotes the core concepts of a formal discipline and is commonly applied in fields such as machine translation, information retrieval, information extraction and text categorization. Bilingual terminology extraction plays an important role in bilingual dictionary compilation, bilingual ontology construction, machine translation and cross-language information retrieval. This paper aims to …

Cited by 7 publications (3 citation statements) · References 17 publications
“…Some of these approaches rely on the existence of a seed lexicon (Semmar, 2018; Tsvetkov and Wintner, 2010; Xu et al., 2015) or existing translation memories and phrase tables (Oliver, 2017), while in some cases no additional resources beyond the input corpus are required (Arcan et al., 2017; Bouamor et al., 2012; Garabík and Dimitrova, 2015; Naguib, 2016). Some approaches require parallel sentence-aligned data (Arcan et al., 2017; Bouamor et al., 2012; Garabík and Dimitrova, 2015; Semmar, 2018; Zhang and Wu, 2012), while others perform the extraction on comparable corpora (Hazem and Morin, 2016; Pinnis et al., 2012; Xu et al., 2015). The technique employed in Naguib (2016) used groups of aligned sentences (verses).…”
Section: Related Work
confidence: 99%
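The parallel-corpus approaches cited above typically score candidate translation pairs by how often a source word and a target word co-occur in aligned sentence pairs. A minimal sketch of this idea, using the Dice coefficient over a hypothetical toy English/French corpus (the data and word-level granularity are illustrative assumptions, not taken from any of the cited papers):

```python
from collections import Counter
from itertools import product

# Toy sentence-aligned English/French corpus (hypothetical data for illustration).
parallel = [
    ("terminology extraction is hard", "l extraction terminologique est difficile"),
    ("terminology denotes core concepts", "la terminologie denote des concepts"),
    ("extraction of terms", "extraction de termes"),
]

src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
for src, tgt in parallel:
    s_words, t_words = set(src.split()), set(tgt.split())
    src_freq.update(s_words)
    tgt_freq.update(t_words)
    # Count every source/target word pair that co-occurs in an aligned pair.
    pair_freq.update(product(s_words, t_words))

def dice(s, t):
    # Dice coefficient over sentence-level co-occurrence counts.
    return 2 * pair_freq[(s, t)] / (src_freq[s] + tgt_freq[t])

# Best target candidate for a given source word.
best = max(tgt_freq, key=lambda t: dice("extraction", t))
```

Real systems add tokenisation, multi-word candidates and stronger association measures (e.g. log-likelihood ratio), but the underlying co-occurrence principle is the same.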
“…In Irvine and Callison-Burch (2016), the authors performed two experiments: the first relied on the existence of a bilingual dictionary with no parallel texts, and the second required only a small amount of parallel data. Bilingual lexica were compiled for different language pairs: English/French (Bouamor et al., 2012; Hakami and Bollegala, 2017; Semmar, 2018), English/Spanish (Oliver, 2017), English/Arabic (Naguib, 2016), English/Italian and English/German (Arcan et al., 2017), English/Slovene (Vintar and Fišer, 2008), English/Croatian, Latvian, Lithuanian and Romanian (Pinnis et al., 2012), English/Chinese (Xu et al., 2015; Zhang and Wu, 2012), English/Hebrew (Tsvetkov and Wintner, 2010), Slovak/Bulgarian (Garabík and Dimitrova, 2015), Serbian/English (Krstev et al., 2018) and so on.…”
Section: Related Work
confidence: 99%
“…It is used for lexicon creation, acquisition of novel terms, text classification, text indexing, machine-assisted translation and other NLP tasks. Various approaches to multi-word term (MWT) extraction, linguistics-based, statistics-based or both, have been published in recent years (Cram and Daille, 2016; Sclano and Velardi, 2007; Verberne et al., 2016; Vivaldi and Rodríguez, 2007; Yin et al., 2016; Zhang and Wu, 2012). Most methods used for MWT extraction today are hybrid; that is, they typically combine statistical information, such as frequencies of n-grams and collocations, with linguistic information, such as syntactic patterns of expressions.…”
Section: Related Work
confidence: 99%
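The hybrid MWT extraction described above can be sketched by combining a linguistic filter (POS-tag patterns such as adjective+noun or noun+noun) with a statistical score (here, raw candidate frequency). A toy illustration with hand-tagged input; the sentences, tag set and patterns are assumptions for demonstration, not the method of any cited paper:

```python
from collections import Counter

# Toy pre-tagged corpus: lists of (token, POS) pairs. Tags are hand-assigned
# for illustration; a real system would use a POS tagger.
tagged = [
    [("bilingual", "ADJ"), ("terminology", "NOUN"), ("extraction", "NOUN"),
     ("uses", "VERB"), ("parallel", "ADJ"), ("corpora", "NOUN")],
    [("bilingual", "ADJ"), ("terminology", "NOUN"), ("helps", "VERB"),
     ("machine", "NOUN"), ("translation", "NOUN")],
    [("terminology", "NOUN"), ("extraction", "NOUN"), ("and", "CONJ"),
     ("machine", "NOUN"), ("translation", "NOUN")],
]

# Linguistic component: only bigrams matching these POS patterns are candidates.
PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN")}

candidates = Counter()
for sent in tagged:
    for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
        if (t1, t2) in PATTERNS:
            # Statistical component: frequency of the surviving candidates.
            candidates[(w1, w2)] += 1

ranked = candidates.most_common()
```

In practice the frequency count is replaced or supplemented by association measures (mutual information, log-likelihood) and longer syntactic patterns, which is exactly the statistical/linguistic combination the quoted passage describes.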