Natural language processing for under-resourced languages: Developing a Welsh natural language toolkit

Cunliffe, Daniel; Vlachidis, Andreas; Williams, D. F.; Tudhope, Douglas

doi:10.1016/j.csl.2021.101311

Cited by 9 publications

(6 citation statements)

References 20 publications


“…This study presents the use of a GATE-based system for extracting outcomes from TTE reports, with a special emphasis on capturing both discrete and continuous variables. Most of the rule-based research in the field of NLP in healthcare have focused on general clinical text, methodology or specific medical domains, with limited exploration in the context of echocardiography reports or the use of GUI based interface to interpret extraction [18, 20–22, 24–28]. To the best of our knowledge, this is one of the first studies to use a GATE-based NLP system for TTE extraction, adding to the understanding of natural language processing (NLP) in this area.…”

Section: Discussion (mentioning)

confidence: 99%

Development and Evaluation of a Natural Language Processing System for Curating a Trans-Thoracic Echocardiogram (TTE) Database

Dong, Sunderland, Nightingale et al. 2023

Bioengineering

Background: Although electronic health records (EHR) provide useful insights into disease patterns and patient treatment optimisation, their reliance on unstructured data presents a difficulty. Echocardiography reports, which provide extensive pathology information for cardiovascular patients, are particularly challenging to extract and analyse, because of their narrative structure. Although natural language processing (NLP) has been utilised successfully in a variety of medical fields, it is not commonly used in echocardiography analysis. Objectives: To develop an NLP-based approach for extracting and categorising data from echocardiography reports by accurately converting continuous (e.g., LVOT VTI, AV VTI and TR Vmax) and discrete (e.g., regurgitation severity) outcomes in a semi-structured narrative format into a structured and categorised format, allowing for future research or clinical use. Methods: 135,062 Trans-Thoracic Echocardiogram (TTE) reports were derived from 146,967 baseline echocardiogram reports and split into three cohorts: Training and Validation (n = 1,075), Test Dataset (n = 98) and Application Dataset (n = 133,889). The NLP system was developed and was iteratively refined using medical expert knowledge. The system was used to curate a moderate-fidelity database from extractions of 133,889 reports. A hold-out validation set of 98 reports was blindly annotated and extracted by two clinicians for comparison with the NLP extraction. Agreement, discrimination, accuracy and calibration of outcome measure extractions were evaluated. Results: Continuous outcomes including LVOT VTI, AV VTI and TR Vmax exhibited perfect inter-rater reliability using intra-class correlation scores (ICC = 1.00, p < 0.05) alongside high R² values, demonstrating an ideal alignment between the NLP system and clinicians.
A good level (ICC = 0.75–0.9, p < 0.05) of inter-rater reliability was observed for outcomes such as LVOT Diam, Lateral MAPSE, Peak E Velocity, Lateral E’ Velocity, PV Vmax, Sinuses of Valsalva and Ascending Aorta diameters. Furthermore, the accuracy rate for discrete outcome measures was 91.38% in the confusion matrix analysis, indicating effective performance. Conclusions: The NLP-based technique yielded good results when it came to extracting and categorising data from echocardiography reports. The system demonstrated a high degree of agreement and concordance with clinician extractions. This study contributes to the effective use of semi-structured data by providing a useful tool for converting semi-structured text to a structured echo report that can be used for data management. Additional validation and implementation in healthcare settings can improve data availability and support research and clinical decision-making.


“…Returning to the relationship between technological support and language revitalization, we can state that the availability of this kind of tool in a language, together with other extralinguistic factors, can strongly influence a researcher's decision to work with it: the lack of tools makes research on an LIT (technologically under-supported language) less frequent and, as a result, makes the creation of those same tools slow and difficult (Maxwell & Hughes, 2006; Cunliffe et al., 2022). The availability of these elements, in turn, makes it possible for speakers of the language to have access to different applications of information and communication technologies, such as machine translation or digital dictionaries.…”

Section: Foundations (unclassified)

“…However, the development of language technology tools does not keep pace with the urgent needs of all minority languages. As with the inequality in language vitality, where, for example, 37.7% of native speakers are distributed across eight languages out of a total of 7111 living languages (Eberhard et al., 2019), most researchers and programmers focus on the languages that predominate in commercial and academic relations, such as English, Mandarin or Spanish (Ostler, 2014; Camacho & Zevallos, 2020; Cunliffe et al., 2022).…”

Section: Introduction (unclassified)

“…Thus, technologically under-supported languages (LIT) face a range of challenges with regard to the various technological applications developed in recent decades, from tools for systematizing corpora to user interfaces in everyday systems. In this sense, revitalization (that is, the processes aimed at the maintenance and social survival of linguistic diversity) demands technological support for both researchers and speakers (Ostler, 2014; Cunliffe et al., 2022).…”

Section: Introduction (unclassified)


UnderRL Tagger: un etiquetador gramatical para lenguas infrasoportadas tecnológicamente y lenguas minoritarias

Pemberty Tamayo, Molina Mejía, Vallejo Zapata 2023

Forma. func.

This article presents UnderRL Tagger, a freely available computer program designed for part-of-speech (POS) tagging in languages that lack automatic taggers. The program aims to facilitate corpus work in these technologically under-supported languages and in minority languages, thereby contributing to revitalization processes through descriptive research and computational tools. UnderRL Tagger lets the manual tagging process gradually become automatic, thanks to a system that remembers and reuses tags, handles large amounts of text, and generates output files in XML format with tags based on the standardized EAGLES system. The article describes the modelling and development of the system, its different functionalities, and prospects for future work.
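
The remember-and-reuse behaviour described in this abstract can be illustrated with a minimal Python sketch. It is not UnderRL Tagger's actual code: the class `TagMemory`, the function `to_xml`, the XML element names `sentence` and `w`, the attribute `pos`, and the tag `NC` are all hypothetical placeholders chosen for illustration, and the sketch does not reproduce the EAGLES tag set.

```python
import xml.etree.ElementTree as ET


class TagMemory:
    """Remembers manually assigned tags and reuses them on later tokens."""

    def __init__(self):
        # Maps a lowercased word form to the tag a human assigned to it.
        self.known = {}

    def learn(self, word, tag):
        """Store a manual tagging decision for later reuse."""
        self.known[word.lower()] = tag

    def tag(self, tokens):
        """Return (token, tag) pairs; tag is None for unseen tokens."""
        return [(tok, self.known.get(tok.lower())) for tok in tokens]


def to_xml(tagged):
    """Serialise (token, tag) pairs as a flat XML string."""
    root = ET.Element("sentence")
    for word, tag in tagged:
        # Unseen tokens are marked so that a human can tag them later.
        el = ET.SubElement(root, "w", pos=tag if tag is not None else "UNKNOWN")
        el.text = word
    return ET.tostring(root, encoding="unicode")


memory = TagMemory()
memory.learn("casa", "NC")  # "NC" is a placeholder tag, not a real EAGLES code
xml_output = to_xml(memory.tag(["casa", "grande"]))
```

Each manual decision stored with `learn` shrinks the set of tokens that come back untagged, which is one simple way a manual process could become progressively automatic.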


“…, English) have been proposed in abundance, as shown in this section. Although multilingual content is available on social media platforms, yet little efforts have been made to cater for the resource-poor languages (Cunliffe et al., 2022). This has increased the need for automatic offensive language detection systems for low or poor-resource languages.…”

Section: Introduction (mentioning)

confidence: 99%

Detection of offensive terms in resource-poor language using machine learning algorithms

Raza, Mahoto, Hamdi et al. 2023

PeerJ Computer Science

The use of offensive terms in user-generated content on different social media platforms is one of the major concerns for these platforms. The offensive terms have a negative impact on individuals, which may lead towards the degradation of societal and civilized manners. The immense amount of content generated at a higher speed makes it humanly impossible to categorise and detect offensive terms. Besides, it is an open challenge for natural language processing (NLP) to detect such terminologies automatically. Substantial efforts are made for high-resource languages such as English. However, it becomes more challenging when dealing with resource-poor languages such as Urdu. Because of the lack of standard datasets and pre-processing tools for automatic offensive terms detection. This paper introduces a combinatorial pre-processing approach in developing a classification model for cross-platform (Twitter and YouTube) use. The approach uses datasets from two different platforms (Twitter and YouTube) the training and testing the model, which is trained to apply decision tree, random forest and naive Bayes algorithms. The proposed combinatorial pre-processing approach is applied to check how machine learning models behave with different combinations of standard pre-processing techniques for low-resource language in the cross-platform setting. The experimental results represent the effectiveness of the machine learning model over different subsets of traditional pre-processing approaches in building a classification model for automatic offensive terms detection for a low resource language, i.e., Urdu, in the cross-platform scenario. In the experiments, when dataset D1 is used for training and D2 is applied for testing, the pre-processing approach named Stopword removal produced better results with an accuracy of 83.27%. 
Whilst, in this case, when dataset D2 is used for training and D1 is applied for testing, stopword removal and punctuation removal were observed as a better preprocessing approach with an accuracy of 74.54%. The combinatorial approach proposed in this paper outperformed the benchmark for the considered datasets using classical as well as ensemble machine learning with an accuracy of 82.9% and 97.2% for dataset D1 and D2, respectively.
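
The idea of testing different combinations of standard pre-processing techniques can be sketched as follows. This is a hedged illustration in Python, not the authors' pipeline: the step functions, the tiny English stopword set, and the names `STEPS`, `step_combinations` and `apply_steps` are assumptions made for this example, and the paper itself works with Urdu text and trained classifiers that are not reproduced here.

```python
from itertools import combinations
import string

# Hypothetical stopword list, for illustration only (the paper targets Urdu).
STOPWORDS = {"the", "a", "is"}


def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]


def remove_punctuation(tokens):
    # Strip leading/trailing ASCII punctuation and drop tokens left empty.
    stripped = (t.strip(string.punctuation) for t in tokens)
    return [t for t in stripped if t]


def lowercase(tokens):
    return [t.lower() for t in tokens]


# Registry of candidate steps; insertion order fixes the order of application.
STEPS = {
    "stopword_removal": remove_stopwords,
    "punctuation_removal": remove_punctuation,
    "lowercasing": lowercase,
}


def step_combinations(step_names):
    """Yield every non-empty subset of the given step names."""
    names = list(step_names)
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)


def apply_steps(text, names):
    """Whitespace-tokenise the text, then apply the named steps in order."""
    tokens = text.split()
    for name in names:
        tokens = STEPS[name](tokens)
    return tokens


# One pre-processed variant of the text per combination of steps.
variants = {
    combo: apply_steps("The film is bad !", combo)
    for combo in step_combinations(STEPS)
}
```

In a cross-platform experiment of the kind the abstract describes, each variant would feed a classifier trained on one dataset and tested on the other, so that accuracy can be compared across combinations.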

