New Arabic Medical Dataset for Diseases Classification

Hammoud, Jaafar; Vatian, Aleksandra; Dobrenko, Natalia; Vedernikov, Nikolay; Shalyto, Anatoly; Gusarova, Natalia

doi:10.1007/978-3-030-91608-4_20

Cited by 9 publications

(6 citation statements)

References 43 publications

(40 reference statements)


“…Unfortunately, many corpora in LoE remain unavailable to the public for various reasons (ethics, data sensitivity, company policy, etc.). Nevertheless, they are often featured in publications that carry detailed and valuable information on the specificities of a particular LoE (Ukrainian [35]); the resource selection (Arabic [36]); the annotation process (Tibetan [37]), or the evaluation of different machine learning methods (French [38]).…”

Section: New Multilingual Resources and Monolingual Datasets in LoE (mentioning)

confidence: 99%

Exploring the Latest Highlights in Medical Natural Language Processing across Multiple Languages: A Survey

Shaitarova, Zaghir, Lavelli et al. 2023

Yearb Med Inform


Objectives: This survey aims to provide an overview of the current state of biomedical and clinical Natural Language Processing (NLP) research and practice in Languages other than English (LoE). We pay special attention to data resources, language models, and popular NLP downstream tasks. Methods: We explore the literature on clinical and biomedical NLP from the years 2020-2022, focusing on the challenges of multilinguality and LoE. We query online databases and manually select relevant publications. We also use recent NLP review papers to identify the possible information lacunae. Results: Our work confirms the recent trend towards the use of transformer-based language models for a variety of NLP tasks in medical domains. In addition, there has been an increase in the availability of annotated datasets for clinical NLP in LoE, particularly in European languages such as Spanish, German and French. Common NLP tasks addressed in medical NLP research in LoE include information extraction, named entity recognition, normalization, linking, and negation detection. However, there is still a need for the development of annotated datasets and models specifically tailored to the unique characteristics and challenges of medical text in some of these languages, especially low-resources ones. Lastly, this survey highlights the progress of medical NLP in LoE, and helps at identifying opportunities for future research and development in this field.


“…In addition, (Abdelhay et al, 2023) tackled the challenges of implementing medical bots in Arabic with the introduction of the MAQA dataset, highlighting the effectiveness of Transformer models. (Hammoud et al, 2020) fine-tuned neural networks for medical entity recognition in Arabic medical texts, while (Hammoud et al, 2021) presented a novel dataset for disease classification, emphasizing the potential of pre-trained models. Finally, (Samy et al, 2012) compared strategies for medical term extraction, revealing the advantages of using Arabic equivalents of Latin prefixes and suffixes.…”

Section: Related Work (mentioning)

confidence: 99%

Automated De-Identification of Arabic Medical Records

Kocaman, Mellah, Haq et al. 2023

Proceedings of ArabicNLP 2023


As Electronic Health Records (EHR) become ubiquitous in healthcare systems worldwide, including in Arabic-speaking countries, the dual imperative of safeguarding patient privacy and leveraging data for research and quality improvement grows. This paper presents a first-of-its-kind automated de-identification pipeline for medical text specifically tailored for the Arabic language. This includes accurate medical Named Entity Recognition (NER) for identifying personal information; data obfuscation models to replace sensitive entities with fake entities; and an implementation that natively scales to large datasets on commodity clusters. This research makes two contributions. First, we adapt two existing NER architectures, BERT For Token Classification (BFTC) and BiLSTM-CNN-Char, to accommodate the unique syntactic and morphological characteristics of the Arabic language. Comparative analysis suggests that BFTC models outperform Bi-LSTM models, achieving higher F1 scores for both identifying and redacting personally identifiable information (PII) from Arabic medical texts. Second, we augment the deep learning models with a contextual parser engine to handle commonly missed entities. Experiments show that the combined pipeline demonstrates superior performance with micro F1 scores ranging from 0.94 to 0.98 on the test dataset, which is a translated version of the i2b2 2014 de-identification challenge, across 17 sensitive entities. This level of accuracy is in line with that achieved with manual de-identification by domain experts, suggesting that a fully automated and scalable process is now viable.


“…Hammoud et al [12] presented a new Arabic medical dataset for text classification. The dataset included 2,000 articles over 10 classes (blood, bone, cardiovascular, ear, endocrine, eye, gastrointestinal, immune, liver, and nephrological) of disease.…”

Section: Related Work (mentioning)

confidence: 99%

Classification of specialities in textual medical reports based on natural language processing and feature selection

Almuhana, Abbas 2022

IJEECS


Nowadays, a great deal of detailed information about patients, including disease status, medication history, and side effects, is collected in an electronic format called an electronic medical record (EMR), and the data serves as a valuable resource for further analysis, diagnosis, and treatment. The huge quantity of detailed patient information in these medical texts produces a huge challenge in terms of processing this data efficiently, however. Machine learning (ML) algorithms, artificial intelligence techniques, and natural language processing tools can have the potential effect of simplifying unstructured data, which could positively affect medical report analysis. Natural language processing (NLP) has recently made huge advances on a variety of tasks. In this paper, an automatic system was thus produced to classify specialist consultant interactions based on patients’ medical reports. NLP was used as a pre-processing step on a dataset formed of unstructured medical reports. Feature extraction and selection methods were used to convert the textual reports into sets of features and to extract the most effective features to increase classification accuracy and reduce execution time. Various classification methods were then applied (ML perceptron, logistic regression, random forest (RF), and linear support vector classifier (LSVC)). The highest accuracy (99.39%) was achieved in ML-perceptron classification techniques.
