Extraction of breast cancer biomarker status using natural language processing

Dexter, Paul; Jian, He; Baker, Jarod; Eckert, George J.; Church, Abby; Zhang, Ning Jackie

doi:10.1504/ijcmh.2019.104365

Cited by 1 publication

(7 citation statements)

References 6 publications


“…This study was designed to evaluate the possibility of automatically extracting the status of the 3 main breast cancer biomarkers (ER, PR, and HER2) from the contents of pathology reports written in two different languages, and coming from 82 different providers, using conventional machine learning models. After testing different classifiers, the best performing ones achieved macro-averaged F1 scores ranging from 0.89 to 0.92 on the held-out test sets, which is on par with best efforts in the literature (6, 7, 11, 12). The reported F1 scores in the literature range between 0.87 and 1, but use only three possible labels for HER2, whereas five are used in the present work.…”

Section: Discussion (supporting)

confidence: 58%
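
The macro-averaged F1 score mentioned in the statement above is the unweighted mean of the per-class F1 scores, so rare labels count as much as common ones. A minimal sketch in plain Python follows; the function name `macro_f1` and the toy labels are illustrative assumptions and are not taken from the cited papers.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with hypothetical labels (not the label set used in the study)
y_true = ["positive", "positive", "negative", "equivocal"]
y_pred = ["positive", "negative", "negative", "equivocal"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally, a classifier that misses a rare HER2 label is penalized as heavily as one that misses a common label, which matters when comparing a five-label task with three-label tasks.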

“…Within the context of these activities, having the breast cancer receptor status at its disposal would undoubtedly be of added value. In breast cancer, estrogen receptor (ER), progesterone receptor (PR), and Erb-b2 receptor tyrosine kinase 2 (ERBB2, previously named Human Epidermal Growth Factor 2 or HER2 or HER-2/neu) are biomarkers known to be related to tumor growth and prognosis, and assessing their expression is necessary to define therapeutic management (3–7). Currently this information is not available in a structured form at the BCR.…”

Section: Introduction (mentioning)

confidence: 99%

“…For instance, Dexter et al wrote a series of rules to identify and classify sentences corresponding information about the biomarker status (7). Designing rule-based tools is very time-consuming, because there are many ways to express the information of interest (6, 11, 12).…”

Section: Introduction (mentioning)

confidence: 99%

“…Several natural language processing (NLP) tools have previously been developed to automatically extract ER, PR, and HER2 status from free-text reports written in English (7, 11), Chinese (12), and Bulgarian (6). In general, clinical NLP tools can be separated into three categories, according to the applied methodology: rule-based (7, 13), conventional machine learning (11, 12), and deep learning (6).…”

Section: Introduction (mentioning)

confidence: 99%

“…Rule-based clinical NLP tools rely on a set of rules written by experts describing how a computer should classify a report. For instance, Dexter et al wrote a series of rules to identify and classify sentences corresponding information about the biomarker status (7). Designing rule-based tools is very time-consuming, because there are many ways to express the information of interest (6, 11, 12).…”

Section: Introduction (mentioning)

confidence: 99%
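
To make the rule-based approach described in the statement above concrete, here is a minimal sketch of sentence-level rules written as regular expressions. It is not the rule set of Dexter et al; the patterns, the `extract_status` function, and the two-value status vocabulary are assumptions chosen for illustration only.

```python
import re

# Hypothetical patterns for naming each biomarker in a sentence
BIOMARKERS = {
    "ER": r"\b(?:ER|estrogen receptor)\b",
    "PR": r"\b(?:PR|progesterone receptor)\b",
    "HER2": r"\bHER-?2(?:/neu)?\b",
}
# Hypothetical status vocabulary; real reports use many more phrasings
STATUS = re.compile(r"\b(positive|negative)\b", re.IGNORECASE)


def extract_status(report):
    """Map each biomarker to the first status found in a sentence naming it."""
    results = {}
    # Naive sentence split on whitespace that follows a period or semicolon
    for sentence in re.split(r"(?<=[.;])\s+", report):
        status = STATUS.search(sentence)
        if not status:
            continue
        for name, pattern in BIOMARKERS.items():
            if re.search(pattern, sentence, re.IGNORECASE):
                # Keep the first mention; later sentences do not overwrite it
                results.setdefault(name, status.group(1).lower())
    return results


report = "Estrogen receptor: positive. PR negative. HER2/neu was negative by IHC."
found = extract_status(report)
```

Even this toy version shows why such tools are costly to maintain: negations, percentages, scores such as "3+", and each additional language all need further rules.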


Machine Learning-Based Extraction of Breast Cancer Receptor Status From Bilingual Free-Text Pathology Reports

Pironet, Poirel, Tambuyzer, et al. 2021. Front. Digit. Health


As part of its core business of gathering population-based information on new cancer diagnoses, the Belgian Cancer Registry receives free-text pathology reports describing results of (pre-)malignant specimens. These reports are provided by 82 laboratories and written in one of 2 national languages, Dutch or French. For breast cancer, the reports characterize the status of estrogen receptor, progesterone receptor, and Erb-b2 receptor tyrosine kinase 2. These biomarkers are related to tumor growth and prognosis and are essential to define therapeutic management. The availability of population-scale information about their status in breast cancer patients can therefore be considered crucial to enrich real-world scientific studies and to guide public health policies regarding personalized medicine. The main objective of this study is to expand the data available at the Belgian Cancer Registry by automatically extracting the status of these biomarkers from the pathology reports. Various types of numeric features are computed from over 1,300 manually annotated reports linked to breast tumors diagnosed in 2014. A range of popular machine learning classifiers, such as support vector machines, random forests, and logistic regression, are trained on this data and compared using their F1 scores on a separate validation set. On a held-out test set, the best performing classifiers achieve F1 scores ranging from 0.89 to 0.92 for the four classification tasks. The extraction is thus reliable and makes it possible to significantly increase the availability of this valuable information on breast cancer receptor status at a population level.
