2020
DOI: 10.1017/s1351324920000224

Universal Lemmatizer: A sequence-to-sequence model for lemmatizing Universal Dependencies treebanks

Abstract: In this paper, we present a novel lemmatization method based on a sequence-to-sequence neural network architecture and morphosyntactic context representation. In the proposed method, our context-sensitive lemmatizer generates the lemma one character at a time based on the surface form characters and its morphosyntactic features obtained from a morphological tagger. We argue that a sliding window context representation suffers from sparseness, while in the majority of cases the morphosyntactic features of a word br…
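As a concrete illustration of the encoding the abstract describes, here is a minimal Python sketch (ours, not the authors' implementation) of how a word's surface characters and its morphosyntactic features might be combined into one source sequence for a character-level decoder; the tag spellings below are hypothetical examples.

```python
# Illustrative sketch only: one plausible way to build the source sequence
# for a context-sensitive seq2seq lemmatizer, following the abstract.
# The feature names are hypothetical, not the paper's exact vocabulary.

def encode_input(word, upos, feats):
    """Concatenate surface-form characters with morphosyntactic tags."""
    return list(word) + [f"UPOS={upos}"] + [f"FEATS={f}" for f in feats]

source = encode_input("walked", "VERB", ["Tense=Past", "VerbForm=Fin"])
print(source)
# ['w', 'a', 'l', 'k', 'e', 'd', 'UPOS=VERB', 'FEATS=Tense=Past', 'FEATS=VerbForm=Fin']
# A trained encoder-decoder would then emit the lemma one character at a
# time: 'w', 'a', 'l', 'k' -> "walk".
```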

Cited by 31 publications (44 citation statements)
References 27 publications
“…We report precision, recall and F1-score for in-domain senses and out-of-domain senses, except for Lithuanian, where not enough examples are available. Precision and recall are computed as follows: Precision = (# examples with correct target words) / (# examples with either correct or incorrect target words). We used the Turku neural lemmatizer with pretrained models (Kanerva et al., 2019). For Lithuanian, as no pretrained model was available, we trained one using the respective available data from the Universal Dependencies project.…”
Section: WMT 2019 Test Suite Results
Mentioning confidence: 99%
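The quoted precision definition restates directly in code; the sketch below is ours (variable names are hypothetical) and simply mirrors the fraction above.

```python
# Precision as defined in the excerpt above: correct examples divided by
# all examples that had either a correct or an incorrect target word.
def precision(n_correct, n_incorrect):
    total = n_correct + n_incorrect
    return n_correct / total if total else 0.0

print(precision(90, 10))  # 0.9
```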
“…Lemmatization is a text preprocessing step that determines the form of a word and changes it into its root word, i.e., it finds the root of each word based on the context of the sentence [6]. The purpose of lemmatization is to optimize the text mining process.…”
Section: Lemmatization
Mentioning confidence: 99%
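To make the "based on the context of the sentence" point concrete, here is a toy example (ours, not from the cited paper) of why the same surface form can need different lemmas:

```python
# Toy illustration: the surface form "saw" lemmatizes differently
# depending on sentence context, which is why context-sensitive
# lemmatization matters. The expected lemmas are hand-written here.
examples = [
    ("saw", "I saw a bird.", "see"),          # verb reading
    ("saw", "He cut it with a saw.", "saw"),  # noun reading
]
for form, sentence, lemma in examples:
    print(f"{form!r} in {sentence!r} -> lemma {lemma!r}")
```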
“…Lemmatisation has been of interest in NLP for the last few decades [Hann, 1974]. Since then, tools for lemmatisation have been divided into universal lemmatisers [Straka et al., 2017; Bergmanis and Goldwater, 2018; Kanerva et al., 2020] and specific lemmatisers designed for a particular task, for instance for a particular language [Džeroski and Erjavec, 2001; Groenewald, 2007; Tamburini, 2013], for a particular POS [Prinsloo, 2012; Gouws and Prinsloo, 2012; Nthambeleni and Musehane, 2014], for a group of words within a POS [Fernández, 2020], or for a class of words with very specific behaviour, such as words within fixed expressions [Farkas et al., 2008; Mulhall, 2008; Kosch, 2016]. One approach unites both lemmatiser and tagger in a single model [Spyns, 1996; Aduriz et al., 1998].…”
Section: Previous Work
Mentioning confidence: 99%
“…Thus, automatic lemmatisation with this approach may be defined as the learning task of determining a lemmatising rule on the basis of a given word, and then applying that rule to obtain the word's lemma. The second definition relies on the newer approach that emerged during the last decade, in which lemmatisation is done in one step [Kanerva et al., 2020]. A tool takes a given word and auxiliary information, such as the POS tag, morphological data, or left context, and produces a lemma.…”
Section: Introduction
Mentioning confidence: 99%
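A minimal sketch of the one-step interface this excerpt describes, assuming a hypothetical `lemmatise` function (not any specific library's API); a real tool would replace the stand-in rule with a trained model:

```python
# Hypothetical one-step lemmatisation interface: word plus auxiliary
# information (POS tag, morphological features, left context) in,
# lemma out. The body is a trivial stand-in for a trained model.
def lemmatise(word, pos=None, morph=None, left_context=None):
    if pos == "VERB" and word.endswith("ed"):
        return word[:-2]  # crude suffix-stripping stand-in
    return word

print(lemmatise("walked", pos="VERB", morph={"Tense": "Past"}))  # "walk"
```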