An Analysis of Biomedical Tokenization: Problems and Strategies

Díaz, Noa P. Cruz; López, Mercedes

doi:10.18653/v1/w15-2605

Cited by 4 publications (4 citation statements)

References 19 publications (11 reference statements)


“…We had not only considered synonyms that exist in the ontologies but also created a rules-based term variant generator (TVG) to cover a case when the same object, Uniprot [P01375], might be written as "TNF alpha", "TNFa", or "TNF α" in a paper. Next generating techniques groups were utilized: -orthographic; -abbreviations and acronyms; -inflectional variations; -morphological variations; -structural recombinations [2,3,5]. Table 1 shows average number of original terms' synonyms and how much variants were generated.…”

Section: Design and Methodology (mentioning)

confidence: 99%


Increasing papers’ discoverability with precise semantic labeling: The sci.AI publishing platform

Gurinovich, Pashuk, Petrovskiy et al. 2018. ISU

Abstract. The number of published findings in biomedicine increases continually. At the same time, specifics of the domain's terminology complicates the task of relevant publications retrieval. In the current research, we investigate influence of terms' variability and ambiguity on a paper's likelihood of being retrieved. We obtained statistics that demonstrate significance of the issue and its challenges, followed by presenting the sci.AI platform, which allows precise terms labeling as a resolution.
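The orthographic part of the term variant generator mentioned in the citation statement above can be illustrated with a minimal sketch. The cited paper does not publish its rules, so the `GREEK` table and the function `orthographic_variants` are hypothetical names introduced only for illustration; the sketch covers just the "TNF alpha" / "TNFa" / "TNF α" pattern from the quote, not abbreviations, inflectional, or morphological variants.

```python
import re

# Hypothetical rule table: spelled-out Greek letter -> (Greek symbol, Latin letter).
# These rules are an assumption; the cited paper does not list its own.
GREEK = {"alpha": ("α", "a"), "beta": ("β", "b"), "gamma": ("γ", "g")}

def orthographic_variants(term):
    """Return a set of spelling variants for a term such as 'TNF alpha'."""
    variants = {term}
    for word, (symbol, letter) in GREEK.items():
        # Match the spelled-out letter when it follows a space or a hyphen.
        pattern = re.compile(r"[\s-]" + word + r"\b", re.IGNORECASE)
        if pattern.search(term):
            variants.add(pattern.sub(" " + symbol, term))  # Greek symbol form
            variants.add(pattern.sub(letter, term))        # fused Latin-letter form
            variants.add(pattern.sub("-" + word, term))    # hyphenated form
    return variants

print(sorted(orthographic_variants("TNF alpha")))
```

A rule table keeps each variation type separate, so further groups named in the quote could be added as additional tables without changing the matching loop.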



“…-orthographic; -abbreviations and acronyms; -inflectional variations; -morphological variations; -structural recombinations [4,5,6]. Table 1 shows average number of original terms' synonyms and how much variants were generated.…”

Section: Design and Methodology (mentioning)

confidence: 99%

Increasing papers’ discoverability with precise semantic labeling: The sci.AI publishing platform

Gurinovich, Pashuk, Petrovskiy et al. 2018. ISU

“…Tokenization. Biomedical text data poses additional challenges to the problem of tokenization [24]. DNA sequences, chemical substances and mathematical formula's appear frequently in this domain, but are not easily captured by simple tokenizers.…”

Section: Corpus (mentioning)

confidence: 99%

The Case of Imperfect Negation Cues: A Two-Step Approach for Automatic Negation Scope Resolution

de Jong, Bagheri. 2022. Natural Language Processing and Information Systems

Negation is a complex grammatical phenomenon that has received considerable attention in the biomedical natural language processing domain. While neural network-based methods are the state-of-the-art in negation scope resolution, they often use the unrealistic assumption that negation cue information is completely accurate. Even if this assumption holds, there remains a dependency on engineered features from state-of-the-art machine learning methods. To tackle this issue, in this study, we adopted a two-step negation resolving approach to assess whether a neural network-based model, here a bidirectional long short-term memory, can be an alternative for cue detection. Furthermore, we investigate how inaccurate cue predictions would affect the scope resolution performance. We ran various experiments on the open access Bio-Scope corpus. Experimental results suggest that word embeddings alone can detect cues reasonably well, but there still exist better alternatives for this task. As expected, scope resolution performance suffers from imperfect cue information, but remains acceptable on the Abstracts subcorpus. We also found that the scope resolution performance is most robust against inaccurate information for models with a recurrent layer only, compared to extensions with a conditional random field layer and extensions with a postprocessing algorithm. We advocate for more research into the application of automated deep learning on the effect of imperfect information on scope resolution.
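The tokenization problem raised in the citation statement above (chemical substances are "not easily captured by simple tokenizers") can be seen in a small sketch. Both functions below are illustrative assumptions, not the tokenizers used in the cited work: `naive_tokenize` splits on every punctuation mark, while `whitespace_tokenize` splits on whitespace and only detaches punctuation at the end of a token.

```python
import re

def naive_tokenize(text):
    # Runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def whitespace_tokenize(text):
    # Split on whitespace, then detach trailing sentence punctuation only,
    # so commas and hyphens inside a chemical name stay in place.
    tokens = []
    for chunk in text.split():
        core = chunk.rstrip(".,;:")
        if core:
            tokens.append(core)
        tokens.extend(chunk[len(core):])
    return tokens

sentence = "Cells were treated with 1,2-dichloroethane."
print(naive_tokenize(sentence))
print(whitespace_tokenize(sentence))
```

The naive version breaks the compound name into five pieces, while the whitespace-based version keeps it as one token. Neither handles every biomedical case; the sketch only shows why punctuation-based splitting is fragile in this domain.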


“…Tokenization. Biomedical text data poses additional challenges to the problem of tokenization [46]. DNA sequences, chemical substances and mathematical formula's appear frequently in this domain, but are not easily captured by simple tokenizers.…”

Section: Corpus (mentioning)

confidence: 99%

Scope resolution of predicted negation cues: A two-step neural network-based approach

de Jong. 2021. Preprint

Neural network-based methods are the state of the art in negation scope resolution. However, they often use the unrealistic assumption that cue information is completely accurate. Even if this assumption holds, there remains a dependency on engineered features from state-of-the-art machine learning methods. The current study adopted a two-step negation resolving approach to assess whether a Bidirectional Long Short-Term Memory-based method can be used for cue detection as well, and how inaccurate cue predictions would affect the scope resolution performance. Results suggest that this method is not suitable for negation detection. Scope resolution performance is most robust against inaccurate information for models with a recurrent layer only, compared to extensions with a Conditional Random Fields layer or a post-processing algorithm. We advocate for more research into the application of deep learning on negation detection and the effect of imperfect information on scope resolution.
