Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.134
Inexpensive Domain Adaptation of Pretrained Language Models: Case Studies on Biomedical NER and Covid-19 QA

Abstract: Domain adaptation of Pretrained Language Models (PTLMs) is typically achieved by unsupervised pretraining on target-domain text. While successful, this approach is expensive in terms of hardware, runtime and CO2 emissions. Here, we propose a cheaper alternative: We train Word2Vec on target-domain text and align the resulting word vectors with the wordpiece vectors of a general-domain PTLM. We evaluate on eight English biomedical Named Entity Recognition (NER) tasks and compare against the recently proposed Bio…
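
The alignment step described in the abstract can be sketched roughly as follows. This is a minimal illustration only, assuming gensim and Hugging Face transformers; the corpus path, model name, hyperparameters and example term are placeholders, and the paper's actual procedure (for instance, whether the mapping is unconstrained or orthogonality-constrained) may differ.

# Minimal sketch of the idea in the abstract: train Word2Vec on target-domain
# text, then learn a linear map from the Word2Vec space into the wordpiece
# embedding space of a general-domain BERT, using tokens that occur in both
# vocabularies. All paths, names and hyperparameters below are placeholders.
import numpy as np
from gensim.models import Word2Vec
from transformers import AutoModel, AutoTokenizer

# 1) Train target-domain word vectors (one whitespace-tokenized sentence per line).
sentences = [line.split() for line in open("target_domain.txt", encoding="utf-8")]
w2v = Word2Vec(sentences=sentences, vector_size=768, window=5, min_count=5, workers=4)

# 2) Load a general-domain PTLM and its wordpiece embedding matrix.
tok = AutoTokenizer.from_pretrained("bert-base-cased")
bert = AutoModel.from_pretrained("bert-base-cased")
wp_emb = bert.get_input_embeddings().weight.detach().numpy()  # (vocab_size, hidden)

# 3) Tokens shared by both vocabularies serve as anchors for the alignment.
vocab = tok.get_vocab()
shared = [t for t in w2v.wv.index_to_key if t in vocab]
X = np.stack([w2v.wv[t] for t in shared])         # Word2Vec vectors
Y = np.stack([wp_emb[vocab[t]] for t in shared])  # wordpiece vectors

# 4) Least-squares linear map W such that X @ W ≈ Y (the paper may instead
#    constrain the map, e.g. to be orthogonal, as in Procrustes alignment).
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# 5) Project a domain-specific word into the wordpiece space; such vectors can
#    then be used to represent in-domain terms alongside the PTLM's embeddings.
aligned_vec = w2v.wv["ribavirin"] @ W  # hypothetical domain term; must occur in the Word2Vec vocab

The point of the alignment is that the projected domain vectors live in (approximately) the same space as the PTLM's wordpiece embeddings, which is presumably what lets them be used with the general-domain model without any further pretraining.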

Cited by 34 publications (27 citation statements)
References 20 publications
“…Nevertheless, an opposite trend has also been observed, with authors pursuing green research that avoids models that are not environmentally friendly (expensive in terms of hardware, running time, and CO2 footprint). This direction has been investigated by Poerner et al. [18], who proposed a GreenBioBERT model produced by training a Word2vec model on a new target domain (namely, the COVID-19 issue) and aligning its vectors with those of the existing BioBERT model.…”
Section: Current Trends in Biomedical NLP (citation type: mentioning)
confidence: 99%
“…Among the three best papers selected by the Natural Language Processing (NLP) section, the paper by Poerner et al. presented a new energy-efficient transformer model, applied in particular to question answering about COVID-19 [12]. As underlined by Natalia Grabar and Cyril Grouin, the co-editors of the NLP section, much work has been done this year in the NLP field on COVID-19, including the development of a dedicated corpus and the use of patient data, scientific literature, and social networks to predict or analyse COVID-19-related events [13].…”
Section: Highlights of the 30th Edition of the IMIA Yearbook (citation type: mentioning)
confidence: 99%
“…Figure 3 shows that the gap between monolingual and multilingual tokenization quality is indeed larger in the specific texts (green bars) compared to the general texts (brown bars), indicating that in a specific domain it is even harder for a multilingual model to outperform a monolingual model. This suggests that methods for explicitly adding representations of domain-specific words (Poerner et al., 2020; Schick and Schütze, 2020) could be a promising direction for improving our approach. Error analysis on financial sentence classification: To provide better insight into the difference between the mono and multi models, we compare the error predictions on the Danish FINNEWS dataset, since results in Table 4 show that the mono model outperforms all multi models by a large margin on this dataset.…”
Section: Domain-Specific Multilingual Representations (citation type: mentioning)
confidence: 99%
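
The excerpt above contrasts monolingual and multilingual tokenization quality on domain-specific text. A rough sketch of how such a gap can be quantified as subword fertility (average wordpieces per whitespace token) is shown below; the model names and the sample sentence are illustrative and are not taken from the cited paper.

# Rough sketch (not from the cited paper): measure subword fertility, i.e. the
# average number of wordpieces per whitespace token, for a monolingual and a
# multilingual tokenizer on a domain-specific sentence. Higher fertility means
# the tokenizer fragments in-domain words more heavily.
from transformers import AutoTokenizer

def fertility(tokenizer, text):
    words = text.split()
    return sum(len(tokenizer.tokenize(w)) for w in words) / len(words)

domain_text = "EBITDA margin contracted amid goodwill impairment charges"  # placeholder financial text
mono = AutoTokenizer.from_pretrained("bert-base-cased")                # monolingual tokenizer
multi = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")  # multilingual tokenizer

print("monolingual fertility :", round(fertility(mono, domain_text), 2))
print("multilingual fertility:", round(fertility(multi, domain_text), 2))
# A larger multilingual-minus-monolingual gap on in-domain text corresponds to
# the observation quoted above.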