Abstract: In the past decade there have been many well-publicized cases of source code leaking from well-known companies. These leaks pose a serious problem when the source code contains sensitive information encoded in its identifier names and comments. Unfortunately, redacting the sensitive information requires obfuscating the identifiers, which quickly interferes with program comprehension. Program comprehension is key for programmers in understanding the source code, so sensitive information is often …
“…First, we show the method with only one quartile of the terms visible. We obfuscated the other three quartiles using a standard term-replacement technique (replace the terms with non-meaningful strings such as xxxx) [50]. For example, a Java method with 20 terms would have about five terms visible, and about 15 terms obfuscated.…”
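The replacement scheme quoted above is simple enough to sketch in code. A minimal illustration in Python, assuming a pre-tokenized method body and an already-chosen set of visible terms; the term list and function name here are invented for the example:

def obfuscate_terms(terms, visible):
    # Replace every term not in the visible set with a
    # non-meaningful placeholder, per the quoted technique.
    return [t if t in visible else "xxxx" for t in terms]

# Hypothetical example: a 20-term method keeps its top quartile
# (5 terms) visible; the remaining 15 terms become "xxxx".
terms = ("parse request header read buffer validate token send "
         "response close socket log error retry count flush "
         "stream reset state timeout").split()
visible = {"parse", "request", "validate", "token", "response"}
print(obfuscate_terms(terms, visible))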
Software Engineering research has become heavily dependent on terms (words in textual data) extracted from source code. Different techniques have been proposed to extract the most "important" terms from code. These terms are typically used as input to research prototypes, so the quality of the prototypes' output depends on the quality of the term extraction technique. At present no consensus exists about which technique predicts the best terms for code comprehension. We perform a literature review and propose a unified prediction model based on a Naive Bayes algorithm. We evaluate our model in a field study with professional programmers, as well as in a standard 10-fold synthetic study. We found that our model predicts the top quartile of the most-important terms with approximately 50% precision and recall, outperforming other popular techniques, and that its predictions help programmers to the same degree as the gold set.
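As a rough illustration of the kind of model this abstract describes, here is a sketch that trains a Naive Bayes classifier over per-term features and keeps the top quartile of terms ranked by predicted importance. The three features and all values are invented stand-ins; the paper's actual feature set comes from its literature review:

from sklearn.naive_bayes import GaussianNB

# Hypothetical per-term feature vectors: [frequency in method,
# appears in the method name (0/1), appears in a comment (0/1)].
X_train = [
    [5, 1, 1],  # frequent, in name, in comment -> important
    [4, 1, 0],
    [1, 0, 0],  # rare, nowhere prominent -> unimportant
    [2, 0, 1],
    [1, 0, 1],
    [6, 1, 1],
]
y_train = [1, 1, 0, 0, 0, 1]  # 1 = top-quartile ("important") term

model = GaussianNB().fit(X_train, y_train)

# Rank unseen terms by predicted probability of importance and
# keep the top quartile, mirroring the evaluation setup above.
X_new = [[3, 1, 0], [1, 0, 0], [2, 0, 1], [5, 1, 1]]
scores = model.predict_proba(X_new)[:, 1]
ranked = sorted(range(len(X_new)), key=lambda i: -scores[i])
top_quartile = ranked[: max(1, len(ranked) // 4)]
print(top_quartile, scores)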
“…Values at the leaves are generalized by replacing them with the sub-ranges [3-6] or (6-14]. These in turn can be replaced by [3-14]. Or the leaf values can be suppressed by replacing them with a symbol such as the stars at the top of the hierarchy. Datafly then replaces values in the quasi-identifiers according to the hierarchy.…”
Section: Datafly for K-anonymity (mentioning, confidence: 99%)
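The generalization hierarchy described in this snippet is easy to mimic. A sketch, assuming one hypothetical numeric quasi-identifier whose levels run from the raw value through the sub-ranges [3-6] and (6-14] up to full suppression:

def generalize(value, level):
    # One hypothetical generalization hierarchy in the spirit of
    # the Datafly description: level 0 = raw value, level 1 =
    # sub-range, level 2 = full range, level 3 = suppression.
    if level == 0:
        return str(value)
    if level == 1:
        return "[3-6]" if 3 <= value <= 6 else "(6-14]"
    if level == 2:
        return "[3-14]"
    return "*"  # suppressed: the stars at the top of the hierarchy

for v in (4, 11, 12, 14):
    print(v, "->", [generalize(v, lvl) for lvl in range(4)])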
“…Randomly choose an instance from the data; for this example we will use row 1 in Table 4.2b. Randomly select an attribute from A, e.g. wmc, and pair it with its sub-range (6-14].…”
Section: Query Generator (mentioning, confidence: 99%)
“…In the end the query we generate is wmc = (6-14]. Table 4.3 shows more examples of queries, their sizes, and the number of rows they match from the data set.…”
Section: Query Generator (mentioning, confidence: 99%)
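Putting the two Query Generator snippets together, here is a sketch of the sampling step they describe. The rows, the attribute set, and the sub-range lookup are stand-ins, not the dissertation's Table 4.2b:

import random

# Tiny stand-in for Table 4.2b: rows of quasi-identifier values.
data = [
    {"wmc": 9, "loc": 120},
    {"wmc": 4, "loc": 80},
    {"wmc": 13, "loc": 200},
]

def sub_range(attr, value):
    # Hypothetical hierarchy; only 'wmc' is generalized here.
    if attr == "wmc":
        return "[3-6]" if value <= 6 else "(6-14]"
    return str(value)

def generate_query(rows):
    # Pick a random row and a random attribute, and emit a query
    # pairing the attribute with its generalized sub-range.
    row = random.choice(rows)
    attr = random.choice(list(row))
    return f"{attr} = {sub_range(attr, row[attr])}"

random.seed(1)
print(generate_query(data))  # e.g. "wmc = (6-14]"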
“…Figure 8.1: Pie chart showing privacy research in software engineering. The pie slices are sized according to the number of publications in each area of research: 1) software testing [3-6], 2) bug reporting [7, 8], 3) requirements [9, 10], 4) cross defect prediction [11, 12], and 5) program comprehension [13].…”
LACE: Supporting Privacy-Preserving Data Sharing in Transfer Defect Learning
Cross Project Defect Prediction (CPDP) is a field of study in which an organization lacking enough local data can use data from other organizations or projects to build defect predictors. Research in CPDP has shown challenges in using "other" data, so transfer defect learning has emerged to improve the quality of CPDP results. With this newfound success in CPDP, it is increasingly important to focus on the privacy concerns of data owners. To support CPDP, data must be shared, yet many privacy threats inhibit data sharing. We focus on sensitive attribute disclosure threats or attacks, where an attacker seeks to associate records in a data set with their sensitive information. Solutions to this sharing problem come from the field of Privacy Preserving Data Publishing (PPDP), which has emerged as a means to confuse the efforts of sensitive attribute disclosure attacks and therefore reduce privacy concerns. PPDP covers methods and tools used to disguise raw data for publishing. However, prior work warned that increasing data privacy decreases the efficacy of data mining on privatized data. The goal of this research is to encourage organizations and individuals to share their data publicly and/or with each other for research purposes and/or to improve the quality of their software products through defect prediction. The contributions of this work give data owners willing to share privatized data three benefits: 1) they are fully aware of the sensitive attribute disclosure risks involved, so they can make an informed decision about what to share; 2) they are provided with the ability to privatize their data and have it remain useful; and 3) they can work with others to share their data based on what they learn from each other's data. We call this private multiparty data sharing. To achieve these benefits, this dissertation presents LACE (Large-scale Assurance of Confidentiality Environment). LACE incorporates a privacy metric called IPR (Increased Privacy Ratio), which calculates the risk of sensitive attribute disclosure by comparing the results of queries (attacks) on the original data and on a privatized version of that data. LACE also includes a privacy algorithm that uses intelligent instance selection to prune the data to as little as 10% of the original (thus offering complete privacy to the other 90%). It then mutates the remaining data, making it possible that over 70% of sensitive attribute disclosure attacks are unsuccessful. Finally, LACE can facilitate private multiparty data sharing via a unique leader-follower algorithm (developed for this dissertation). The algorithm allows data owners to serially build a privatized data set by contributing only data that are not already in the private cache. In this scenario, each data owner shares even less of their data, some as little as 2%. The experiments of this thesis lead to the following conclusion: at least for the defe...
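One way to read the IPR idea in this abstract is as a changed-answers ratio over attack queries. The sketch below follows that reading; the computation, field names, and data are illustrative, not the dissertation's exact definition:

def ipr(original, privatized, queries):
    # Increased Privacy Ratio, in spirit: the fraction of attack
    # queries whose sensitive-attribute answer on the privatized
    # data no longer matches the answer on the original data.
    changed = 0
    for q in queries:
        if answer(original, q) != answer(privatized, q):
            changed += 1
    return changed / len(queries)

def answer(rows, query):
    # Return the sensitive values of rows matching the query,
    # i.e. what a sensitive-attribute disclosure attack learns.
    attr, value = query
    return sorted(r["bug"] for r in rows if r[attr] == value)

original = [{"wmc": "(6-14]", "bug": 1}, {"wmc": "[3-6]", "bug": 0}]
privatized = [{"wmc": "(6-14]", "bug": 0}, {"wmc": "[3-6]", "bug": 0}]
queries = [("wmc", "(6-14]"), ("wmc", "[3-6]")]
print(ipr(original, privatized, queries))  # 0.5: one attack foiled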
The amount of data for processing and categorization grows at an ever-increasing rate. At the same time, the demand for collaboration and transparency in organizations, government, and businesses drives the release of data from internal repositories to the public or third-party domain. This in turn increases the potential for sharing sensitive information. The leak of sensitive information can be very costly, both financially for organizations and for individuals. In this work we address the important problem of sensitive information detection. Specifically, we focus on detection in unstructured text documents. We show that simplistic, brittle rule sets for detecting sensitive information find only a small fraction of the actual sensitive information. Furthermore, we show that previous state-of-the-art approaches have been implicitly tailored to such simplistic scenarios and thus fail to detect actual sensitive content. We develop a novel family of sensitive information detection approaches which assume only access to labeled examples, rather than unrealistic assumptions such as access to a set of generating rules or descriptive topical seed words. Our approaches are inspired by the current state of the art for paraphrase detection, and we adapt deep learning approaches over recursive neural networks to the problem of sensitive information detection. We show that our context-based approaches significantly outperform the family of previous state-of-the-art approaches for sensitive information detection, so-called keyword-based approaches, on real-world data with human-labeled examples of sensitive and non-sensitive documents. A key challenge in the field of sensitive information detection is the lack of publicly available real-world datasets on which to train and/or benchmark, due to the inherently sensitive nature of the data in question. We address this issue by releasing publicly labeled examples of sensitive and non-sensitive content: a total of 8 different types of sensitive information over 2 distinct sets of documents, with human domain experts labeling both datasets for 4 complex types of informational content per set. This release totals 750,000 labeled sentences with their parse trees for the research community to make use of.
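The gap this abstract describes between keyword-based and learned, context-aware detection can be shown on a toy example. In this sketch the seed keywords, sentences, and labels are all invented, and a linear bag-of-words classifier stands in for the recursive neural networks the thesis actually uses:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

SEEDS = {"salary", "diagnosis"}  # a brittle keyword rule set

def keyword_detect(sentence):
    # Flag a sentence only if it contains a seed keyword.
    return int(any(w in SEEDS for w in sentence.lower().split()))

# Hypothetical labeled examples (1 = sensitive). A real corpus
# would be the human-labeled sentences released with the thesis.
train = [
    ("the salary of the ceo is confidential", 1),
    ("her diagnosis was shared with the insurer", 1),
    ("we discuss compensation for a named employee", 1),
    ("the meeting is scheduled for tuesday", 0),
    ("please review the attached slides", 0),
    ("lunch will be served at noon", 0),
]
texts, labels = zip(*train)

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

test = "we discuss compensation for a named employee"
print(keyword_detect(test))                   # 0: no seed keyword present
print(clf.predict(vec.transform([test]))[0])  # 1: learned model uses context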
Resumé (translated from Danish): The amount of available information that must be automatically handled and processed is growing explosively. This happens alongside an increased focus on data sharing and demands for transparency, which raises the risk of sharing potentially sensitive information that should not have been shared. Such erroneous disclosures of sensitive information carry high costs. This thesis addresses the growing and complex problem area of finding sensitive information by means of computational algorithms, focusing specifically on finding sensitive information in unstructured text documents. We show that simple rule sets find only a relatively small part of the actual …