Clinical information extraction using small data: An active learning approach based on sequence representations and word embeddings

Kholghi, Mahnoosh; Vine, Lance De; Sitbon, Laurianne; Zuccon, Guido; Nguyen, Anthony

doi:10.1002/asi.23936

Cited by 15 publications

(11 citation statements)

References 33 publications

“…Since the initial labelled set (i.e. seed set) is an important factor in increasing the performance of AL at early iterations (Kholghi, Vine, Sitbon, Zuccon, & Nguyen, ), it could have influenced the AL performance reported in Qian et al. ().…”

Section: Introduction (mentioning)

confidence: 99%

Active learning for classifying long‐duration audio recordings of the environment

et al. 2018

Self Cite

This paper presents an active learning (AL) framework for the classification of 1‐min audio recordings derived from long‐duration recordings of the environment. The goal of the framework was to investigate the efficacy of AL in reducing the manual annotation effort required to label a large volume of acoustic data according to its dominant sound source, while ensuring the high quality of automatically labelled data. We present a comprehensive empirical comparison, through extensive simulation experiments, of a range of AL approaches against a Random Sampling baseline for soundscape classification. Random Forest is used as a benchmark supervised approach to build classifiers in the AL framework. Also, 12 summary indices extracted for each 1 min of a 13‐month recording are used as features for training the classifiers. Our experimental findings demonstrate that (a) among existing query strategies, those based on classifier confidence and diversity of samples are more effective for very large datasets where the classes are imbalanced in size; (b) by considering a practical target performance (i.e. F‐measure equal to or greater than 0.8, 0.85 and 0.9) for AL, only 5–16 hr of manual annotation effort is required to build a classifier that automatically annotates a large amount (13 months) of unlabelled audio data. Active learning has a key role to play in alleviating the burden of manual annotation required to build classifiers which can support effective monitoring of species diversity in at‐risk ecosystems.

“…Although DKI achieved the lowest time rate among AL query strategies, it should be noted that it strongly relies on the availability of the domain knowledge (i.e., less generalizable) compared to unsupervised-based (i.e., ULC, UID, and 2L-UID) and similarity-based approaches (i.e., IDiv) [186]. Another observation is that the time rates in Table 4 show that the actual time savings are much closer to the estimated concept annotation rates than to the sequence and token annotation rates.…”

Section: Discussion (mentioning)

confidence: 99%

“…We also study the role of a smart seed selection approach in reducing the annotation time from early batches of active learning. Our previous study demonstrated that Longest Sequence Cluster (LSC) can lead to an initial model with significantly higher effectiveness at early batches of AL compared to when using RS [186]. We use LSC and RS seed selection approaches to build two initial models.…”

Section: Objective (mentioning)

confidence: 99%

“…According to the results from our previous study [186], the Longest Sequence Cluster (LSC) approach is used for smart seed selection and contrasted with a random seed selection. The batch size (Β) is set to 88, leading to a total of 100 batches.…”

Section: Active Learning Setup (mentioning)

confidence: 99%

“…A wide range of query strategies (Table 3) are evaluated against Random Sampling and Longest sequence (LS) baselines: Least Confidence (LC) [106], Sequence Entropy (SE) [49], Margin [107], Information Density (IDen) [49], Information Diversity (IDiv) [176], Information Density and Diversity (IDD) [176], Domain Knowledge Informativeness (DKI) [176], Unsupervised Least Confidence (ULC) [186], Unsupervised Information Diversity [186], and Two-Level Unsupervised Information Diversity (2L-UID) [186].…”

Section: Active Learning Setup (mentioning)

confidence: 99%

Active Learning for Concept Extraction from Clinical Free Text

Kholghi¹

Self Cite

An increasing volume of clinical free-text data, such as discharge summaries and progress reports, has been collected by hospitals and healthcare centres and stored electronically for further processing. Extracting structured clinical information from such unstructured text resources is necessary for enabling secondary usage of reports, such as reporting, reasoning and retrieving, and for further processing in downstream eHealth workflows. However, this analysis cannot be done manually, due to the high cost incurred by qualified experts to annotate the clinical free text. A significant initial step in extracting information from clinical free text is concept extraction, which involves identifying entities of interest in the clinical domain (such as diseases, medications, and symptoms). Currently, supervised machine learning approaches effectively extract clinical concepts by building powerful statistical models. However, these approaches require a large amount of high-quality annotated training data, which is created manually by domain experts through a costly and time-consuming process. This results in a robust active learning framework for extracting clinical concepts, using state-of-the-art active learning approaches. The second step is to leverage clinical information resources (i.e., terminologies and clinical information extraction tools) and other machine learning approaches (i.e., semi-supervised learning, unsupervised learning, and representation learning) to develop domain-specific and generic active learning approaches. This leads to a number of novel active learning query strategies and a seed selection approach that outperform the state-of-the-art approaches with less manual annotation effort. The last step is to validate the benefits of the developed AL-based framework in reducing the annotation cost (i.e., time) through a comprehensive user study.
An AL-assisted pre-annotation scheme is also introduced, in which the learning models built across the AL process generate high-quality pre-annotations to be reviewed by human annotators. This further accelerates the annotation process, by significantly reducing the number of manual annotations that must be added or corrected compared to de novo annotation. The results of this study demonstrate that AL plays an important role in reducing the manual annotation cost. The CEAL framework extracts high-quality domain concepts from clinical narratives, while significantly reducing the labour cost with up to 35% less annotation time required. Additionally, AL-assisted pre-annotations accelerate the de novo annotation process with a further 20% less annotation time required. This thesis contributes to information extraction from clinical unstructured text resources by alleviating the burden of manual annotation. The practical significance of this research is three-fold: (1) benefitting the overall patient healthcare by facilitating downstream eHealth workflows such as supporting clinical information processing, reporting, reasoning, and efficient decision m...
