Anonymization of German financial documents using neural network-based language models with contextual word representations

Biesner, David; Ramamurthy, Rajkumar; Stenzel, Robin; Lübbering, Max; Hillebrand, Lars; Ladi, Anna; Pielka, Maren; Loitz, Rüdiger; Bauckhage, Christian; Sifa, Rafet

doi:10.1007/s41060-021-00285-x

Cited by 16 publications

(8 citation statements)

References 18 publications


“…There are efforts to automate the de-identification of German text data using ML methods, from medical and other domains. However, currently, these methods cannot guarantee 100% accuracy [10, 12]. Consequently, efficient development of NLP models for structuring radiological reports on-site is of great interest.…”

Section: Introduction (mentioning)

confidence: 99%

Transformer-based structuring of free-text radiology report databases

et al. 2023


Objectives: To provide insights for on-site development of transformer-based structuring of free-text report databases by investigating different labeling and pre-training strategies.

Methods: A total of 93,368 German chest X-ray reports from 20,912 intensive care unit (ICU) patients were included. Two labeling strategies were investigated to tag six findings of the attending radiologist. First, a system based on human-defined rules was applied for annotation of all reports (termed “silver labels”). Second, 18,000 reports were manually annotated in 197 h (termed “gold labels”) of which 10% were used for testing. An on-site pre-trained model (Tmlm) using masked-language modeling (MLM) was compared to a public, medically pre-trained model (Tmed). Both models were fine-tuned on silver labels only, gold labels only, and first with silver and then gold labels (hybrid training) for text classification, using varying numbers (N: 500, 1000, 2000, 3500, 7000, 14,580) of gold labels. Macro-averaged F1-scores (MAF1) in percent were calculated with 95% confidence intervals (CI).

Results: Tmlm,gold (95.5 [94.5–96.3]) showed significantly higher MAF1 than Tmed,silver (75.0 [73.4–76.5]) and Tmlm,silver (75.2 [73.6–76.7]), but not significantly higher MAF1 than Tmed,gold (94.7 [93.6–95.6]), Tmed,hybrid (94.9 [93.9–95.8]), and Tmlm,hybrid (95.2 [94.3–96.0]). When using 7000 or less gold-labeled reports, Tmlm,gold (N: 7000, 94.7 [93.5–95.7]) showed significantly higher MAF1 than Tmed,gold (N: 7000, 91.5 [90.0–92.8]). With at least 2000 gold-labeled reports, utilizing silver labels did not lead to significant improvement of Tmlm,hybrid (N: 2000, 91.8 [90.4–93.2]) over Tmlm,gold (N: 2000, 91.4 [89.9–92.8]).

Conclusions: Custom pre-training of transformers and fine-tuning on manual annotations promises to be an efficient strategy to unlock report databases for data-driven medicine.

Key Points:
• On-site development of natural language processing methods that retrospectively unlock free-text databases of radiology clinics for data-driven medicine is of great interest.
• For clinics seeking to develop methods on-site for retrospective structuring of a report database of a certain department, it remains unclear which of previously proposed strategies for labeling reports and pre-training models is the most appropriate in context of, e.g., available annotator time.
• Using a custom pre-trained transformer model, along with a little annotation effort, promises to be an efficient way to retrospectively structure radiological databases, even if not millions of reports are available for pre-training.

“…Recently, the memorization effect in LMs has been further exploited in the federated learning setting (Konečný et al., 2016), where in combination with the information leakage from model updates (Melis et al., 2019; Huang et al., 2020), the attacker is capable of recovering private text in federated learning (Gupta et al., 2022). To mitigate privacy risks, there is a growing interest in making language models privacy-preserving (Yu et al., 2022; Shi et al., 2022b; Yue et al., 2023; Cummings et al., 2023) by training them with a differential privacy guarantee (Dwork et al., 2006b; Abadi et al., 2016) or with various anonymization approaches (Nakamura et al., 2020; Biesner et al.).…”

Section: Privacy Risks In Language Models (mentioning)

confidence: 99%

Privacy Implications of Retrieval-Based Language Models

Huang, Gupta, Zhong et al. 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing


Retrieval-based language models (LMs) have demonstrated improved interpretability, factuality, and adaptability compared to their parametric counterparts by incorporating retrieved text from external datastores. While it is well known that parametric models are prone to leaking private data, it remains unclear how the addition of a retrieval datastore impacts model privacy. In this work, we present the first study of privacy risks in retrieval-based LMs, particularly kNN-LMs. Our goal is to explore the optimal design and training procedure in domains where privacy is of concern, aiming to strike a balance between utility and privacy. Crucially, we find that kNN-LMs are more susceptible to leaking private information from their private datastore than parametric models. We further explore mitigations of privacy risks: When privacy information is targeted and readily detected in the text, we find that a simple sanitization step would eliminate the risks while decoupling query and key encoders achieves an even better utility-privacy trade-off. Otherwise, we consider strategies of mixing public and private data in both datastore and encoder training. While these methods offer modest improvements, they leave considerable room for future work. Together, our findings provide insights for practitioners to better understand and mitigate privacy risks in retrieval-based LMs.


“…IRE extract formulas from the text of numerical description and then these formulas are used for NCC. Biesner et al. [1] developed a framework based on state-of-the-art deep learning techniques to anonymize sensitive information in financial documents in the German language so that the documents can be further used in other applications without any restriction. As compared to the approaches that do consistency checking, this paper's approach automates the consistency checking task using different transformer-based tabular models.…”

Section: Related Work (mentioning)

confidence: 99%

Automatic Consistency Checking of Table and Text in Financial Documents

Ali, Deußer, Houben et al. 2023

nldl


A company's financial documents use tables along with text to organize the data containing key performance indicators (KPIs) (such as profit and loss) and a financial quantity linked to them. The KPI’s linked quantity in a table might not be equal to the similarly described KPI's quantity in a text. Auditors take substantial time to manually audit these financial mistakes and this process is called consistency checking. As compared to existing work, this paper attempts to automate this task with the help of transformer-based models. Furthermore, for consistency checking it is essential for the table's KPIs embeddings to encode the semantic knowledge of the KPIs and the structural knowledge of the table. Therefore, this paper proposes a pipeline that uses a tabular model to get the table's KPIs embeddings. The pipeline takes input table and text KPIs, generates their embeddings, and then checks whether these KPIs are identical. The pipeline is evaluated on the financial documents in the German language and a comparative analysis of the cell embeddings' quality from the three tabular models is also presented. From the evaluation results, the experiment that used the English-translated text and table KPIs and Tabbie model to generate table KPIs’ embeddings achieved an accuracy of 72.81% on the consistency checking task, outperforming the benchmark, and other tabular models.
