2020
DOI: 10.48550/arxiv.2007.01658
Preprint

Playing with Words at the National Library of Sweden -- Making a Swedish BERT

Abstract: This paper introduces the Swedish BERT ("KB-BERT") developed by the KBLab for data-driven research at the National Library of Sweden (KB). Building on recent efforts to create transformer-based BERT models for languages other than English, we explain how we used KB's collections to create and train a new language-specific BERT model for Swedish. We also present the results of our model in comparison with existing models, chiefly that produced by the Swedish Public Employment Service, Arbetsförmedlingen, and Goo…

Cited by 18 publications (18 citation statements) | References 8 publications (10 reference statements)
“…In addition, word-embedding-based and transformer-based models are available in a multitude of languages (for Spanish, see Canete et al. 2020; for English, see Clark et al. 2020; for Swedish, see Malmsten, Börjeson, and Haffenden 2020; and for French, see Martin et al. 2019). Even though this article deals with the classification of discrete emotional language in German, it can serve as a framework to create similar tools for other languages which potentially achieve even better performances.…”
Section: Discussion (mentioning)
confidence: 99%
“…BERT's architecture is a multi-layer Transformer encoder that is based on the original Transformer architecture introduced by Vaswani et al. (2017). We use cased BERT models (TensorFlow versions) through the Huggingface Transformers library (Wolf et al., 2020) with the following language-specific models: the original English BERT, Finnish FinBERT (Virtanen et al., 2019), French FlauBERT (Le et al., 2020) and Swedish KB-BERT (Malmsten et al., 2020). Additionally, we use Multilingual BERT (mBERT) (Devlin et al., 2019), which was pretrained on monolingual Wikipedia corpora from 104 languages with a shared multilingual vocabulary.…”
Section: Methods (mentioning)
confidence: 99%
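A minimal sketch of how such language-specific checkpoints can be loaded through the Huggingface Transformers library, as the quoted passage describes. The Hub identifiers below (e.g. KB/bert-base-swedish-cased, TurkuNLP/bert-base-finnish-cased-v1) are assumptions chosen for illustration, not names taken from the citation statement itself.

```python
from transformers import AutoTokenizer, AutoModel

# Illustrative Hugging Face Hub identifiers for the language-specific cased
# BERT models named in the quoted passage (checkpoint names are assumptions,
# not specified in the citation statement).
MODEL_IDS = {
    "en": "bert-base-cased",                      # original English BERT
    "fi": "TurkuNLP/bert-base-finnish-cased-v1",  # FinBERT
    "fr": "flaubert/flaubert_base_cased",         # FlauBERT
    "sv": "KB/bert-base-swedish-cased",           # KB-BERT
    "multi": "bert-base-multilingual-cased",      # mBERT
}

def load_bert(lang: str):
    """Load the tokenizer and encoder for one language-specific BERT."""
    name = MODEL_IDS[lang]
    tokenizer = AutoTokenizer.from_pretrained(name)
    # AutoModel loads the PyTorch encoder; TFAutoModel would give the
    # TensorFlow variant mentioned in the quote.
    model = AutoModel.from_pretrained(name)
    return tokenizer, model

tokenizer, model = load_bert("sv")  # Swedish KB-BERT
```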
“…The keyword-based labelling system produces annotation labels as outputs, which can be used as supervision signals. To widen the experimental scope, the base Swedish BERT model, KB-BERT (Malmsten et al., 2020), was used on a token level. Thus, an annotation consisting of five tokens resulted in a sequence of five 768-dimensional embeddings.…”
Section: Results (mentioning)
confidence: 99%
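A minimal sketch of extracting token-level 768-dimensional embeddings with KB-BERT, matching the usage described in the quote. The checkpoint name and the example annotation string are assumptions for illustration only.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed Hub identifier for the base Swedish KB-BERT checkpoint.
name = "KB/bert-base-swedish-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()

# Hypothetical annotation text; the quoted work only states that annotations
# are encoded token by token.
annotation = "biblioteket digitaliserar svenska dagstidningar nu"
inputs = tokenizer(annotation, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, num_tokens, 768): one 768-dimensional
# vector per (sub)word token, so a five-token annotation yields a sequence
# of five such embeddings.
token_embeddings = outputs.last_hidden_state[0]
print(token_embeddings.shape)  # e.g. torch.Size([5, 768]) for a five-token input
```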