Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis 2015
DOI: 10.18653/v1/w15-2605

An Analysis of Biomedical Tokenization: Problems and Strategies

Abstract: Choosing the right tokenizer is a non-trivial task, especially in the biomedical domain, where it poses additional challenges that, if not resolved, propagate errors through the successive stages of a Natural Language Processing analysis pipeline. This paper aims to identify these problematic cases and analyze the output that a representative and widely used set of tokenizers produces on them. This work will aid the decision-making process of choosing the right strategy according to the downstream application. In add…

Cited by 4 publications (4 citation statements)
References 19 publications (11 reference statements)
“…We not only considered synonyms that exist in the ontologies but also created a rule-based term variant generator (TVG) to cover cases where the same object, UniProt [P01375], might be written as "TNF alpha", "TNFa", or "TNF α" in a paper. The following groups of generation techniques were used: orthographic; abbreviations and acronyms; inflectional variations; morphological variations; structural recombinations [2,3,5]. Table 1 shows the average number of synonyms per original term and how many variants were generated.…”
Section: Design and Methodology (mentioning)
confidence: 99%
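The orthographic part of such a term variant generator can be illustrated with a minimal sketch. It is not the cited authors' TVG: the function name `orthographic_variants`, the `GREEK` table, and the choice of separators are all assumptions made for illustration, and only terms ending in a Greek-letter token are handled.

```python
# Hypothetical table: Greek symbol -> (spelled-out name, Latin shorthand).
GREEK = {"α": ("alpha", "a"), "β": ("beta", "b"), "γ": ("gamma", "g")}


def orthographic_variants(term):
    """Return simple orthographic variants of a term whose last token is a
    Greek letter (as a symbol or spelled out). Illustrative only."""
    variants = {term}
    # Split off the last whitespace-separated token, e.g. "TNF" / "alpha".
    head, _, tail = term.rpartition(" ")
    for symbol, (name, latin) in GREEK.items():
        if head and tail in (symbol, name):
            # Combine every written form of the letter with every separator.
            for suffix in (symbol, name, latin):
                for sep in (" ", "-", ""):
                    variants.add(f"{head}{sep}{suffix}")
    return variants


if __name__ == "__main__":
    for variant in sorted(orthographic_variants("TNF alpha")):
        print(variant)
```

Under these assumptions, "TNF alpha" expands to a set that includes the three spellings quoted above ("TNF alpha", "TNFa", "TNF α"). A real generator would also need the abbreviation, inflectional, morphological, and recombination rules, which this sketch does not attempt.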
“…orthographic; abbreviations and acronyms; inflectional variations; morphological variations; structural recombinations [4,5,6]. Table 1 shows the average number of synonyms per original term and how many variants were generated.…”
Section: Design and Methodology (mentioning)
confidence: 99%
“…Tokenization. Biomedical text data poses additional challenges to the problem of tokenization [24]. DNA sequences, chemical substances and mathematical formulas appear frequently in this domain, but are not easily captured by simple tokenizers.…”
Section: Corpus (mentioning)
confidence: 99%
“…Tokenization. Biomedical text data poses additional challenges to the problem of tokenization [46]. DNA sequences, chemical substances and mathematical formulas appear frequently in this domain, but are not easily captured by simple tokenizers.…”
Section: Corpus (mentioning)
confidence: 99%
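The point made in these statements, that simple tokenizers split domain-specific strings, can be seen in a small sketch. Both functions and both regular expressions below are assumptions chosen for illustration; neither is one of the tokenizers evaluated in the paper.

```python
import re


def naive_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def hyphen_aware_tokenize(text):
    """Keep hyphens, commas, slashes and primes that sit inside a token,
    so strings such as IL-2 or 5'-ATGC-3' stay whole. Illustrative only:
    it also glues ordinary text like "cells,and" and keeps quote marks."""
    return re.findall(r"[\w']+(?:[-,/][\w']+)*|[^\w\s]", text)


if __name__ == "__main__":
    sentence = "TNF, IL-2 and 5'-ATGC-3'."
    print(naive_tokenize(sentence))
    print(hyphen_aware_tokenize(sentence))
```

The naive pattern breaks "IL-2" into three tokens, while the second pattern keeps it and the DNA fragment intact. Which behaviour is preferable depends on the downstream application.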