Filip Klubička scite author profile

Filip Klubička

5Publications

147Citation Statements Received

98Citation Statements Given

How they've been cited

190

136

How they cite others

Affiliations

Technological University Dublin, University of Zagreb

Publications

Order By: Most citations

{bs,hr,sr}WaC - Web Corpora of Bosnian, Croatian and Serbian

Ljubešić¹,

Klubička²

2014

View full text Add to dashboard Cite

In this paper we present the construction process of top-level-domain web corpora of Bosnian, Croatian and Serbian. For constructing the corpora we use the SpiderLing crawler with its associated tools adapted for simultaneous crawling and processing of text written in two scripts, Latin and Cyrillic. In addition to the modified collection process we focus on two sources of noise in the resulting corpora: 1. they contain documents written in the other, closely related languages that can not be identified with standard language identification methods and 2. as most web corpora, they partially contain low-quality data not suitable for the specific research and application objectives. We approach both problems by using language modeling on the crawled data only, omitting the need for manually validated language samples for training. On the task of discriminating between closely related languages we outperform the state-of-the-art Blacklist classifier reducing its error to a fourth.

show abstract

Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian

Klubička¹,

Toral

Sánchez-Cartagena³

2018

Machine Translation

View full text Add to dashboard Cite

This paper presents a quantitative fine-grained manual evaluation approach to comparing the performance of different machine translation (MT) systems. We build upon the well-established Multidimensional Quality Metrics (MQM) error taxonomy and implement a novel method that assesses whether the differences in performance for MQM error types between different MT systems are statistically significant. We conduct a case study for English-toCroatian, a language direction that involves translating into a morphologically rich language, for which we compare three MT systems belonging to different paradigms: pure phrase-based, factored phrase-based and neural. First, we design an MQM-compliant error taxonomy tailored to the relevant linguistic phenomena of Slavic languages, which made the annotation process feasible and accurate. Errors in MT outputs were then annotated by two annotators following this taxonomy. Subsequently, we carried out a statistical analysis which showed that the best-performing system (neural) reduces the errors produced by the worst system (pure phrase-based) by more than half (54%). Moreover, we conducted an additional analysis of agreement errors in which we distinguished between short (phrase-level) and long distance (sentence-level) errors. We discovered that phrase-based MT approaches are of limited use for long distance agreement phenomena, for which neural MT was found to be especially effective.

show abstract

Fine-Grained Human Evaluation of Neural Versus Phrase-Based Machine Translation

Klubička

Toral

Sánchez-Cartagena³

2017

View full text Add to dashboard Cite

We compare three approaches to statistical machine translation (pure phrase-based, factored phrase-based and neural) by performing a fine-grained manual evaluation via error annotation of the systems' outputs. The error types in our annotation are compliant with the multidimensional quality metrics (MQM), and the annotation is performed by two annotators. Inter-annotator agreement is high for such a task, and results show that the best performing system (neural) reduces the errors produced by the worst system (phrase-based) by 54%.

show abstract

ADAPT at SemEval-2018 Task 9: Skip-Gram Word Embeddings for Unsupervised Hypernym Discovery in Specialised Corpora

Maldonado¹,

Klubička²

2018

View full text Add to dashboard Cite

This paper describes a simple but competitive unsupervised system for hypernym discovery. The system uses skip-gram word embeddings with negative sampling, trained on specialised corpora. Candidate hypernyms for an input word are predicted based on cosine similarity scores. Two sets of word embedding models were trained separately on two specialised corpora: a medical corpus and a music industry corpus. Our system scored highest in the medical domain among the competing unsupervised systems but performed poorly on the music industry domain. Our approach does not depend on any external data other than raw specialised corpora.

show abstract

Shapley Idioms: Analysing BERT Sentence Embeddings for General Idiom Token Identification

Nedumpozhimana

Klubička

Kelleher

2022

Front. Artif. Intell.

View full text Add to dashboard Cite

This article examines the basis of Natural Language Understanding of transformer based language models, such as BERT. It does this through a case study on idiom token classification. We use idiom token identification as a basis for our analysis because of the variety of information types that have previously been explored in the literature for this task, including: topic, lexical, and syntactic features. This variety of relevant information types means that the task of idiom token identification enables us to explore the forms of linguistic information that a BERT language model captures and encodes in its representations. The core of this article presents three experiments. The first experiment analyzes the effectiveness of BERT sentence embeddings for creating a general idiom token identification model and the results indicate that the BERT sentence embeddings outperform Skip-Thought. In the second and third experiment we use the game theory concept of Shapley Values to rank the usefulness of individual idiomatic expressions for model training and use this ranking to analyse the type of information that the model finds useful. We find that a combination of idiom-intrinsic and topic-based properties contribute to an expression's usefulness in idiom token identification. Overall our results indicate that BERT efficiently encodes a variety of information from topic, through lexical and syntactic information. Based on these results we argue that notwithstanding recent criticisms of language model based semantics, the ability of BERT to efficiently encode a variety of linguistic information types does represent a significant step forward in natural language understanding.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Filip Klubička

{bs,hr,sr}WaC - Web Corpora of Bosnian, Croatian and Serbian

Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian

Fine-Grained Human Evaluation of Neural Versus Phrase-Based Machine Translation

ADAPT at SemEval-2018 Task 9: Skip-Gram Word Embeddings for Unsupervised Hypernym Discovery in Specialised Corpora

Shapley Idioms: Analysing BERT Sentence Embeddings for General Idiom Token Identification

Contact Info

Product

Resources

About