Utility and Privacy Assessments of Synthetic Data for Regression Tasks

Hittmeir, Markus; Ekelhart, Andreas; Mayer, Rudolf

doi:10.1109/bigdata47090.2019.9005476

Cited by 30 publications

(27 citation statements)

References 13 publications


“…Narrow or specific measures are widely used for assessing synthetic data [15], [19], [20], [27], [31], [32]. They are useful when the analysis to be performed on the synthetic data is known ahead of time.…”

Section: B Utility Metrics: Overview and Classification (mentioning)

confidence: 99%


A Multi-Dimensional Evaluation of Synthetic Data Generators

2022


“…Pairwise Correlations are sometimes measured using pairwise correlation plots such as heat maps [27], [32], but more often using statistical measures such as pairwise correlation difference (𝑃𝐶𝐷) [18]. We assess the correlations between attribute pairs using the latter.…”

Section: Bivariate Fidelity (mentioning)

confidence: 99%

A Multi-Dimensional Evaluation of Synthetic Data Generators

2022
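The pairwise correlation difference mentioned in the statement above can be illustrated with a minimal sketch. It assumes the commonly used definition of PCD as the Frobenius norm of the difference between the Pearson correlation matrices of the real and the synthetic data; the citing paper may define or normalize it differently. The function names and the column-list input format are hypothetical choices made for this example.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlation_difference(real_cols, synth_cols):
    """Frobenius norm of Corr(real) - Corr(synthetic).

    Both arguments are lists of columns (one list of numbers per attribute),
    with attributes in the same order. This is an assumed definition of PCD.
    """
    if len(real_cols) != len(synth_cols):
        raise ValueError("real and synthetic data need the same attributes")
    total = 0.0
    for i, j in combinations(range(len(real_cols)), 2):
        diff = pearson(real_cols[i], real_cols[j]) - pearson(synth_cols[i], synth_cols[j])
        # The correlation matrix is symmetric, so each off-diagonal pair
        # appears twice; the diagonal entries (1 - 1) contribute nothing.
        total += 2 * diff ** 2
    return math.sqrt(total)


# Toy data: the real attributes are perfectly correlated, the synthetic
# ones perfectly anti-correlated, so the two correlations differ by 2.
real = [[1, 2, 3, 4], [2, 4, 6, 8]]
synthetic = [[1, 2, 3, 4], [8, 6, 4, 2]]
pcd = pairwise_correlation_difference(real, synthetic)
```

A value of 0 means the synthetic data reproduces every pairwise correlation exactly; larger values indicate greater divergence in the correlation structure.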


“…A way to ensure this from a mathematical perspective is to train the generative models with a differential privacy (DP) objective. The premise of DP is that no output could be directly attributed to a single training instance [2,7,19,35]. In this study, we consciously chose not to include DP to maximize the utility of the synthetic corpora for the downstream task, but we recommend that future research uses DP in order to minimize privacy risks.…”

Section: Privacy Of Synthetic Text (mentioning)

confidence: 99%

Generating Synthetic Training Data for Supervised De-Identification of Electronic Health Records

Libbi, Trienes, Trieschnigg et al. 2021

Future Internet


A major hurdle in the development of natural language processing (NLP) methods for Electronic Health Records (EHRs) is the lack of large, annotated datasets. Privacy concerns prevent the distribution of EHRs, and the annotation of data is known to be costly and cumbersome. Synthetic data presents a promising solution to the privacy concern, if synthetic data has comparable utility to real data and if it preserves the privacy of patients. However, the generation of synthetic text alone is not useful for NLP because of the lack of annotations. In this work, we propose the use of neural language models (LSTM and GPT-2) for generating artificial EHR text jointly with annotations for named-entity recognition. Our experiments show that artificial documents can be used to train a supervised named-entity recognition model for de-identification, which outperforms a state-of-the-art rule-based baseline. Moreover, we show that combining real data with synthetic data improves the recall of the method, without manual annotation effort. We conduct a user study to gain insights on the privacy of artificial text. We highlight privacy risks associated with language models to inform future research on privacy-preserving automated text generation and metrics for evaluating privacy-preservation during text generation.


“…The dataset used in this study is the Boston Housing data, that is, data on the housing market in the city of Boston, United States, collected by the Statlib Library of Carnegie Mellon University [5]. This dataset is often used in data mining research, such as studies on housing price prediction [6] and regression [7]. The K-Means algorithm is implemented in the R framework using library.…”

Section: Introduction (unclassified)

Optimasi Algoritma K-Means Clustering dengan Parallel Processing menggunakan Framework R

Marieska, Lestari, Mahendra et al. 2021

JEPIN


Parallel processing is often used to optimize the execution time of data mining algorithms. In this study, parallel processing is used to optimize the K-Means clustering algorithm. The K-Means algorithm is implemented using packages available in the R framework and is run both serially and in parallel. To obtain the optimization percentage, the execution time under parallel processing is compared with the execution time under serial processing. This study uses the Boston Housing dataset, which is commonly used in data mining. The test scenarios differ by the number of cores and the number of centroids. The results show that parallel processing has a shorter execution time than serial processing in every scenario. The resulting optimization is fairly significant, ranging from 20% to 52%. The highest optimization is obtained with the largest number of cores and the largest number of centroids.
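The abstract does not state how the optimization percentage is computed. One plausible reading, assumed here, is the relative reduction in execution time of the parallel run compared with the serial run. The function name and the timings below are hypothetical and are not taken from the paper.

```python
def time_reduction_percent(serial_seconds, parallel_seconds):
    """Relative execution-time reduction of a parallel run, in percent.

    Assumed formula: (serial - parallel) / serial * 100.
    """
    if serial_seconds <= 0:
        raise ValueError("serial execution time must be positive")
    return (serial_seconds - parallel_seconds) / serial_seconds * 100.0


# Hypothetical timings: a 10 s serial run that takes 8 s in parallel.
reduction = time_reduction_percent(10.0, 8.0)
```

Under this reading, a reported optimization of 20% means the parallel run took 80% of the serial execution time.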
