Generation of synthetic data sets for evaluating the accuracy of knowledge discovery systems

Jeske, Daniel R.; Samadi, B.; Lin, P.J.; Ye, Lan; Cox, Sean; Xiao, Rui; Younglove, Theodore; Ly, Minh Huong Phu; Holt, Douglas B.; Rich, Ryan

doi:10.1145/1081870.1081969

Cited by 28 publications

(17 citation statements)

References 9 publications

(11 reference statements)


“…In a few approaches the underlying model is well described and defined, e.g., by intersecting planes [72] or variations of the SwissRoll [97]. Other approaches [55,56] use rules and statistics to encode relationships between data instances (e.g., older person implies higher income) and allow one to insert anomalies for different applications. The data is typically created in a black-box manner, making its scope and validity hard to grasp.…”

Section: Multi-dimensional Data Visualization (mentioning)

confidence: 99%

Generative Data Models for Validation and Evaluation of Visualization Techniques

Schulz, Nocaj, El‐Assady et al. 2016

Proceedings of the Sixth Workshop on Beyond Time and Errors on Novel Evaluation Methods for Visualization


We argue that there is a need for substantially more research on the use of generative data models in the validation and evaluation of visualization techniques. For example, user studies will require the display of representative and unconfounded visual stimuli, while algorithms will need functional coverage and assessable benchmarks. However, data is often collected in a semi-automatic fashion or entirely hand-picked, which obscures the view of generality, impairs availability, and potentially violates privacy. There are some sub-domains of visualization that use synthetic data in the sense of generative data models, whereas others work with real-world-based data sets and simulations. Depending on the visualization domain, many generative data models are "side projects" as part of an ad-hoc validation of a techniques paper and thus neither reusable nor general-purpose. We review existing work on popular data collections and generative data models in visualization to discuss the opportunities and consequences for technique validation, evaluation, and experiment design. We distill handling and future directions, and discuss how we can engineer generative data models and how visualization research could benefit from more and better use of generative data models.


“…However, it was claimed that based on the heuristic devised, the system could be extended to handle three or higher dimensional data. Jeske et al (2005) proposed an architecture for an information discovery analysis system data and scenario generator that generates synthetic datasets on a to-be-decided semantic graph. Based on this architecture, Lin et al (2006) developed a prototype of this system, which is capable of generating synthetic data for a particular scenario, such as credit card transactions.…”

Section: Related Work (mentioning)

confidence: 99%

“…A solution to these problems could be using synthetic generated data with intrinsic patterns. There are a number of approaches and techniques that have been developed for generating synthetic data (Coyle et al, 2013, Frasch et al, 2011, van der Walt and Bernard, 2007, Sanchez-Monedero et al, 2013, Jeske et al, 2005, Lin et al, 2006, and Pei and Zaiane, 2006). However, since each of the previous research was either focused on a particular category, such as clustering, or using some special techniques, there are still spaces for further research.…”

Section: Introduction (mentioning)

confidence: 99%

An Evaluation of the Challenges of Multilingualism in Data Warehouse Development

Dedić, Stanier 2016

Proceedings of the 18th International Conference on Enterprise Information Systems


ICEIS 2016 received 257 paper submissions from 42 countries in all continents, which makes it one of the largest conferences in the World in the area of Information Systems, thus demonstrating the success and global dimension of this conference. From these, 42 papers were selected for publication and presentation at the Conference as full papers. These numbers, leading to a full-paper acceptance ratio of 16%, show the intention of preserving a high quality forum for this conference, a quality that we intend to maintain in the future, for the next editions of this conference.

The high number and high quality of the received papers imposed difficult choices in the selection process. To evaluate each submission, a double blind paper review was performed by the Program Committee, whose members are highly qualified researchers in ICEIS topic areas.

All presented papers will be available at the SCITEPRESS Digital Library and will be submitted for indexation by Thomson Reuters Conference Proceedings Citation Index (ISI), INSPEC, DBLP, EI (Elsevier Index) and Scopus.

Additionally, a short list of presented papers will be selected to be expanded into a forthcoming book of ICEIS 2016 Selected Papers to be published by Springer in the LNBIP Series.

The technical program of the conference included a panel and 4 invited talks delivered by internationally distinguished speakers, namely: Claudia Loebbecke (University of Cologne, Germany), Sergio Gusmeroli (TXT e-solutions SPA, Italy), Wil Van Der Aalst (Technische Universiteit Eindhoven, Netherlands) and Jan Vom Brocke (University of Liechtenstein, Liechtenstein). Their participation positively contributes to reinforce the overall quality of the Conference and to provide a deeper understanding of the fields addressed by the conference.

Moreover, ICEIS 2016 had a Doctoral Consortium on Enterprise Information Systems and 1 tutorial. We are thankful to the Conference Co-chairs (Olivier Camp and José Cordeiro) and Program Co-chairs (Slimane Hammoudi, Leszek Maciaszek and Michele M. Missikoff) for their dedication and hard work in organizing these events.

We sincerely thank all the authors for their submissions and participation in ICEIS 2016. Furthermore, we would like to thank all the members of the program committee and reviewers, who helped us with their expertise, dedication and time. We would also like to thank the invited speakers for their excellent contribution in sharing their knowledge and vision and the workshop/special session chairs whose collaboration with ICEIS 2015 was much appreciated. Finally, we gratefully acknowledge the professional support of the ICEIS 2016 team for all organizational processes.

We hope that all colleagues find this a fruitful and inspiring conference. We hope to contribute to the development of the Enterprise Information Systems community and look forward to having additional research results presented at the next edition of ICEIS, details of which are available at http://www.iceis.org.

Slimane Hammoudi University of Cologne, Germany Abstract:...


“…There are a number of approaches and techniques that have been developed for generating synthetic data (Coyle et al, 2013, Frasch et al, 2011, van der Walt and Bernard, 2007, Sanchez-Monedero et al, 2013, Jeske et al, 2005, Lin et al 2006, and Pei and Zaiane, 2006). However, since each of the previous research was either focused on a particular category, such as clustering, or using some special techniques, there are still spaces for further research.…”

Section: Introduction (mentioning)

confidence: 99%

Towards a Synthetic Data Generator for Matching Decision Trees

Peng, Hanke 2016

Proceedings of the 18th International Conference on Enterprise Information Systems


Abstract: It is popular to use real-world data to evaluate or teach data mining techniques. However, there are some disadvantages to using real-world data for such purposes. First, real-world data in most domains is difficult to obtain for budgetary, technical, or ethical reasons. Second, the use of many real-world data sets is restricted; in the case of data mining, such data sets either do not contain specific patterns that are easy to mine for teaching purposes, or the data needs special preparation and the algorithm needs very specific settings in order to find patterns in it. A solution could be the generation of synthetic, "meaningful data" (data with intrinsic patterns). This paper presents a framework for such a data generator, which is able to generate datasets with intrinsic patterns, such as decision trees. A preliminary run of the prototype proves that the generation of such "meaningful data" is possible. The proposed approach could also be extended to generate synthetic data with other intrinsic patterns.
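To make the idea of data with an intrinsic decision-tree pattern concrete, the following is a minimal Python sketch. It is not the framework described in the abstract: the tree, the feature names (`age`, `income`), the value ranges, and the helper names `classify` and `generate` are all hypothetical choices made for illustration. The sketch samples feature values uniformly at random and assigns each record the label that a predefined tree would give it, so a decision-tree learner run on the output should be able to recover the planted pattern.

```python
import random
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str


@dataclass
class Split:
    feature: str
    threshold: float
    left: "Node"   # branch taken when value <= threshold
    right: "Node"  # branch taken when value > threshold


Node = Union[Leaf, Split]


def classify(node, record):
    """Walk the tree from the root until a leaf is reached."""
    while isinstance(node, Split):
        if record[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


def generate(tree, ranges, n, seed=0):
    """Sample n records uniformly within `ranges` and label them with `tree`."""
    rng = random.Random(seed)  # fixed seed keeps the data set reproducible
    rows = []
    for _ in range(n):
        record = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        record["label"] = classify(tree, record)
        rows.append(record)
    return rows


# Hypothetical planted pattern: age > 40 -> "high"; otherwise split on income.
TREE = Split(
    "age", 40.0,
    Split("income", 30000.0, Leaf("low"), Leaf("medium")),
    Leaf("high"),
)
RANGES = {"age": (18.0, 90.0), "income": (10000.0, 120000.0)}

if __name__ == "__main__":
    for row in generate(TREE, RANGES, 5):
        print(row)
```

Because the labels are derived deterministically from the tree, the ground truth is known exactly, which is what makes such data usable for checking whether a mining algorithm finds the intended pattern. Adding label noise or correlated features would be a natural extension, but is outside this sketch.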
