Everything you always wanted to know about a dataset: Studies in data summarisation

Koesten, Laura; Simperl, Elena; Blount, Tom; Kacprzak, Emilia; Tennison, Jeni

doi:10.1016/j.ijhcs.2019.10.004

Cited by 27 publications

(37 citation statements)

References 79 publications


“…Choosing a dataset greatly depends on the information provided alongside it. A number of studies indicate that standard metadata does not provide sufficient information for dataset reuse [81,106]. Recent studies have discussed textual ([81,129]) or visual [138] surrogates of datasets that aim to help people identify relevant documents and increase accuracy and/or satisfaction with their relevance judgments.…”

Section: Results Presentation (mentioning)

confidence: 99%

“…Users judge the relevance of datasets for a specific task based on the dataset's scope (e.g. geographical and temporal scope) [104,75], basic statistics about the dataset such as counts and value ranges, and information about granularity of information in the data [81]. The documentation of variables and the context from which the dataset comes from also play a key role.…”

Section: Results Presentation (mentioning)

confidence: 99%

“…Summarization and Annotations. To help both search and user understanding, summarizations and annotations are additional metadata that can be generated about the underlying dataset [81]. For instance, [102] deal with the problem that the underlying dataset cannot be exposed, but good summaries may help the user undertake the task of data access.…”

Section: Data Handling (mentioning)

confidence: 99%

“…These limitations impact the use of the retrieved data - machine learning can be unduly affected by the processing that was performed over a dataset prior to its release [125], while knowing the original purpose for collecting the data aids interpretation and analysis [140]. In other words, in a dataset search context, approaches need to consider additional aspects such as data provenance [27,53,64,87,101,142], annotations [67,93,144], quality [116,131,148], granularity of content [81], and schema [9,20] to effectively evaluate a dataset's fitness for a particular use. The user does not have the ability to introspect over large amounts of data, and their attention must be prioritized [13].…”

Section: Introduction (mentioning)

confidence: 99%

“…; DCAT [95] is the W3C standard for interoperability of catalogues, and contains a representation and vocabulary for datasets. Additional metadata, such as summarizations [81,106,144] could also be contributed. Unfortunately, the creation and maintenance of this metadata is currently resource intensive.…”

Section: Introduction (mentioning)

confidence: 99%


Dataset search: a survey

et al. 2019

Self Cite


Generating value from data requires the ability to find, access and make sense of datasets. There are many efforts underway to encourage data sharing and reuse, from scientific publishers asking authors to submit data alongside manuscripts to data marketplaces, open data portals and data communities. Google recently beta released a search service for datasets, which allows users to discover data stored in various online repositories via keyword queries. These developments foreshadow an emerging research field around dataset search or retrieval that broadly encompasses frameworks, methods and tools that help match a user data need against a collection of datasets. Here, we survey the state of the art of research and commercial systems in dataset retrieval. We identify what makes dataset search a research field in its own right, with unique challenges and methods and highlight open problems. We look at approaches and implementations from related areas dataset search is drawing upon, including information retrieval, databases, entity-centric and tabular search in order to identify possible paths to resolve these open problems as well as immediate next steps that will take the field forward.


Toward Best Practices for Unstructured Descriptions of Research Data

Phillips, Smit 2021

Proceedings of the Association for Information Science and Tech


Achieving the potential of widespread sharing of open research data requires that sharing data is straightforward, supported, and well-understood; and that data is discoverable by researchers. Our literature review and environment scan suggest that while substantial effort is dedicated to structured descriptions of research data, unstructured fields are commonly available (title, description) yet poorly understood. There is no clear description of what information should be included, in what level of detail, and in what order. These human-readable fields, routinely used in indexing and search features and reliably federated, are essential to the research data user experience. We propose a set of high-level best practices for unstructured description of datasets, to serve as the essential starting point for more granular, discipline-specific guidance. We based these practices on extensive review of literature on research article abstracts; archival practice; experience in supporting research data management; and grey literature on data documentation. They were iteratively refined based on comments received in a webinar series with researchers, data curators, data repository managers, and librarians in Canada. We demonstrate the need for information research to more closely examine these unstructured fields and provide a foundation for a more detailed conversation.


Towards the FAIRification of Meteorological Data: A Meteorological Semantic Model

Annane, Kamel, Santos et al. 2022

Metadata and Semantic Research


HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
