Abstract: In the past decade there have been many well-publicized cases of source code leaking from well-known companies. These leaks pose a serious problem when the source code contains sensitive information encoded in its identifier names and comments. Unfortunately, redacting the sensitive information requires obfuscating the identifiers, which quickly interferes with program comprehension. Program comprehension is key for programmers in understanding the source code, so sensitive information is often …
“…First, we show the method with only one quartile of the terms visible. We obfuscated the other three quartiles using a standard term-replacement technique (replace the terms with non-meaningful strings such as xxxx) [50]. For example, a Java method with 20 terms would have about five terms visible, and about 15 terms obfuscated.…”
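The replacement scheme quoted above is simple enough to sketch in code. A minimal illustration in Python, assuming a pre-tokenized method body and an already-chosen set of visible terms; the term list and function name here are invented for the example:

def obfuscate_terms(terms, visible):
    # Replace every term not in the visible set with a
    # non-meaningful placeholder, per the quoted technique.
    return [t if t in visible else "xxxx" for t in terms]

# Hypothetical example: a 20-term method keeps its top quartile
# (5 terms) visible; the remaining 15 terms become "xxxx".
terms = ("parse request header read buffer validate token send "
         "response close socket log error retry count flush "
         "stream reset state timeout").split()
visible = {"parse", "request", "validate", "token", "response"}
print(obfuscate_terms(terms, visible))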
Software Engineering research has become heavily dependent on terms (words in textual data) extracted from source code. Different techniques have been proposed to extract the most "important" terms from code. These terms are typically used as input to research prototypes, so the quality of the prototypes' output depends on the quality of the term extraction technique. At present no consensus exists about which technique predicts the best terms for code comprehension. We perform a literature review and propose a unified prediction model based on a Naive Bayes algorithm. We evaluate our model in a field study with professional programmers, as well as in a standard 10-fold synthetic study. We found that our model predicts the top quartile of the most-important terms with approximately 50% precision and recall, outperforming other popular techniques, and that its predictions help programmers to the same degree as the gold set.
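As a rough illustration of the kind of model this abstract describes, here is a sketch that trains a Naive Bayes classifier over per-term features and keeps the top quartile of terms ranked by predicted importance. The three features and all values are invented stand-ins; the paper's actual feature set comes from its literature review:

from sklearn.naive_bayes import GaussianNB

# Hypothetical per-term feature vectors: [frequency in method,
# appears in the method name (0/1), appears in a comment (0/1)].
X_train = [
    [5, 1, 1],  # frequent, in name, in comment -> important
    [4, 1, 0],
    [1, 0, 0],  # rare, nowhere prominent -> unimportant
    [2, 0, 1],
    [1, 0, 1],
    [6, 1, 1],
]
y_train = [1, 1, 0, 0, 0, 1]  # 1 = top-quartile ("important") term

model = GaussianNB().fit(X_train, y_train)

# Rank unseen terms by predicted probability of importance and
# keep the top quartile, mirroring the evaluation setup above.
X_new = [[3, 1, 0], [1, 0, 0], [2, 0, 1], [5, 1, 1]]
scores = model.predict_proba(X_new)[:, 1]
ranked = sorted(range(len(X_new)), key=lambda i: -scores[i])
top_quartile = ranked[: max(1, len(ranked) // 4)]
print(top_quartile, scores)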
“…Values at the leaves are generalized by replacing them with the sub-ranges [3-6] or (6-14]. These in turn can be replaced by [3-14]. Or the leaf values can be suppressed by replacing them with a symbol such as the stars at the top of the hierarchy. Datafly then replaces values in the quasi-identifiers according to the hierarchy.…”
Section: Datafly for K-anonymity (mentioning, confidence: 99%)
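The generalization hierarchy described in this snippet is easy to mimic. A sketch, assuming one hypothetical numeric quasi-identifier whose levels run from the raw value through the sub-ranges [3-6] and (6-14] up to full suppression:

def generalize(value, level):
    # One hypothetical generalization hierarchy in the spirit of
    # the Datafly description: level 0 = raw value, level 1 =
    # sub-range, level 2 = full range, level 3 = suppression.
    if level == 0:
        return str(value)
    if level == 1:
        return "[3-6]" if 3 <= value <= 6 else "(6-14]"
    if level == 2:
        return "[3-14]"
    return "*"  # suppressed: the stars at the top of the hierarchy

for v in (4, 11, 12, 14):
    print(v, "->", [generalize(v, lvl) for lvl in range(4)])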
“…Randomly choose an instance from the data; for this example we will use row 1 in Table 4.2b. Randomly select an attribute from A, e.g. wmc, and pair it with its sub-range (6-14].…”
Section: Query Generator (mentioning, confidence: 99%)
“…In the end the query we generate is wmc = (6-14]. Table 4.3 shows more examples of queries, their sizes, and the number of rows they match from the data set.…”
Section: Query Generator (mentioning, confidence: 99%)
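Putting the two Query Generator snippets together, here is a sketch of the sampling step they describe. The rows, the attribute set, and the sub-range lookup are stand-ins, not the dissertation's Table 4.2b:

import random

# Tiny stand-in for Table 4.2b: rows of quasi-identifier values.
data = [
    {"wmc": 9, "loc": 120},
    {"wmc": 4, "loc": 80},
    {"wmc": 13, "loc": 200},
]

def sub_range(attr, value):
    # Hypothetical hierarchy; only 'wmc' is generalized here.
    if attr == "wmc":
        return "[3-6]" if value <= 6 else "(6-14]"
    return str(value)

def generate_query(rows):
    # Pick a random row and a random attribute, and emit a query
    # pairing the attribute with its generalized sub-range.
    row = random.choice(rows)
    attr = random.choice(list(row))
    return f"{attr} = {sub_range(attr, row[attr])}"

random.seed(1)
print(generate_query(data))  # e.g. "wmc = (6-14]"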
“…Figure 8.1: Pie chart showing privacy research in software engineering. The pie slices are sized according to the number of publications in each area of research: 1) software testing [3-6], 2) bug reporting [7, 8], 3) requirements [9, 10], 4) cross defect prediction [11, 12], and 5) program comprehension [13].…”
LACE: Supporting Privacy-Preserving Data Sharing in Transfer Defect Learning
Cross Project Defect Prediction (CPDP) is a field of study in which an organization lacking enough local data can use data from other organizations or projects to build defect predictors. Research in CPDP has shown challenges in using "other" data, so transfer defect learning has emerged to improve the quality of CPDP results. With this newfound success in CPDP, it is increasingly important to focus on the privacy concerns of data owners. To support CPDP, data must be shared, yet many privacy threats inhibit data sharing. We focus on sensitive attribute disclosure threats or attacks, where an attacker seeks to associate records in a data set with their sensitive information. Solutions to this sharing problem come from the field of Privacy Preserving Data Publishing (PPDP), which has emerged as a means to confuse the efforts of sensitive attribute disclosure attacks and therefore reduce privacy concerns. PPDP covers methods and tools used to disguise raw data for publishing. However, prior work warned that increasing data privacy decreases the efficacy of data mining on privatized data. The goal of this research is to encourage organizations and individuals to share their data publicly and/or with each other for research purposes and/or to improve the quality of their software products through defect prediction. The contributions of this work give data owners willing to share privatized data three benefits: 1) they are fully aware of the sensitive attribute disclosure risks involved, so they can make an informed decision about what to share; 2) they are provided with the ability to privatize their data and have it remain useful; and 3) they can work with others to share their data based on what they learn from each other's data. We call this private multiparty data sharing. To achieve these benefits, this dissertation presents LACE (Large-scale Assurance of Confidentiality Environment). LACE incorporates a privacy metric called IPR (Increased Privacy Ratio), which calculates the risk of sensitive attribute disclosure by comparing the results of queries (attacks) on the original data and on a privatized version of that data. LACE also includes a privacy algorithm that uses intelligent instance selection to prune the data to as little as 10% of the original (thus offering complete privacy to the other 90%). It then mutates the remaining data, making it possible that over 70% of sensitive attribute disclosure attacks are unsuccessful. Finally, LACE can facilitate private multiparty data sharing via a unique leader-follower algorithm (developed for this dissertation). The algorithm allows data owners to serially build a privatized data set by contributing only data that are not already in the private cache. In this scenario, each data owner shares even less of their data, some as little as 2%. The experiments of this thesis lead to the following conclusion: at least for the defe...
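One way to read the IPR idea in this abstract is as a changed-answers ratio over attack queries. The sketch below follows that reading; the computation, field names, and data are illustrative, not the dissertation's exact definition:

def ipr(original, privatized, queries):
    # Increased Privacy Ratio, in spirit: the fraction of attack
    # queries whose sensitive-attribute answer on the privatized
    # data no longer matches the answer on the original data.
    changed = 0
    for q in queries:
        if answer(original, q) != answer(privatized, q):
            changed += 1
    return changed / len(queries)

def answer(rows, query):
    # Return the sensitive values of rows matching the query,
    # i.e. what a sensitive-attribute disclosure attack learns.
    attr, value = query
    return sorted(r["bug"] for r in rows if r[attr] == value)

original = [{"wmc": "(6-14]", "bug": 1}, {"wmc": "[3-6]", "bug": 0}]
privatized = [{"wmc": "(6-14]", "bug": 0}, {"wmc": "[3-6]", "bug": 0}]
queries = [("wmc", "(6-14]"), ("wmc", "[3-6]")]
print(ipr(original, privatized, queries))  # 0.5: one attack foiled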
The amount of data for processing and categorization grows at an ever-increasing rate. At the same time, the demand for collaboration and transparency in organizations, government, and businesses drives the release of data from internal repositories to the public or third-party domain. This in turn increases the potential for sharing sensitive information. The leak of sensitive information can be very costly, both financially for organizations and for individuals. In this work we address the important problem of sensitive information detection. Specifically, we focus on detection in unstructured text documents. We show that simplistic, brittle rule sets for detecting sensitive information find only a small fraction of the actual sensitive information. Furthermore, we show that previous state-of-the-art approaches have been implicitly tailored to such simplistic scenarios and thus fail to detect actual sensitive content. We develop a novel family of sensitive information detection approaches which assume only access to labeled examples, rather than unrealistic assumptions such as access to a set of generating rules or descriptive topical seed words. Our approaches are inspired by the current state of the art for paraphrase detection, and we adapt deep learning approaches over recursive neural networks to the problem of sensitive information detection. We show that our context-based approaches significantly outperform the family of previous state-of-the-art approaches for sensitive information detection, so-called keyword-based approaches, on real-world data with human-labeled examples of sensitive and non-sensitive documents. A key challenge in the field of sensitive information detection is the lack of publicly available real-world datasets on which to train and/or benchmark, due to the inherently sensitive nature of the data in question. We address this issue by releasing publicly labeled examples of sensitive and non-sensitive content: a total of 8 different types of sensitive information over 2 distinct sets of documents, with human domain experts labeling both datasets for 4 complex types of informational content per set. This release totals 750,000 labeled sentences with their parse trees for the research community to make use of.
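The gap this abstract describes between keyword-based and learned, context-aware detection can be shown on a toy example. In this sketch the seed keywords, sentences, and labels are all invented, and a linear bag-of-words classifier stands in for the recursive neural networks the thesis actually uses:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

SEEDS = {"salary", "diagnosis"}  # a brittle keyword rule set

def keyword_detect(sentence):
    # Flag a sentence only if it contains a seed keyword.
    return int(any(w in SEEDS for w in sentence.lower().split()))

# Hypothetical labeled examples (1 = sensitive). A real corpus
# would be the human-labeled sentences released with the thesis.
train = [
    ("the salary of the ceo is confidential", 1),
    ("her diagnosis was shared with the insurer", 1),
    ("we discuss compensation for a named employee", 1),
    ("the meeting is scheduled for tuesday", 0),
    ("please review the attached slides", 0),
    ("lunch will be served at noon", 0),
]
texts, labels = zip(*train)

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

test = "we discuss compensation for a named employee"
print(keyword_detect(test))                   # 0: no seed keyword present
print(clf.predict(vec.transform([test]))[0])  # 1: learned model uses context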
Resumé (translated from Danish): The amount of available information that must be automatically handled and processed is growing explosively. This happens alongside an increased focus on data sharing and demands for transparency, which raises the risk of sharing potentially sensitive information that should not have been shared. Such erroneous disclosures of sensitive information carry high costs. This thesis addresses the growing and complex problem area of finding sensitive information by means of computational algorithms, focusing specifically on finding sensitive information in unstructured text documents. We show that simple rule sets find only a relatively small part of the actual …