Are Data Sets Like Documents?: Evaluating Similarity-Based Ranked Search over Scientific Data

Megler, V. M.; Maier, David

doi:10.1109/tkde.2014.2320737

Cited by 7 publications

(6 citation statements)

References 30 publications


“…A true primary key will have a uniqueness ratio of 1, but because the profiler uses the Hyperloglog sketch to compute this, we may have a small error rate, so we simply check that the value is close to 1. When we retrieve the content-similar candidates, we iterate over them (line 18) and check whether they are PK/FK candidates (line 19), in which case we add the candidate PK/FK relationship to the EKG (20). This method is similar to PowerPivot's [7] approach, and works well in practice, as shown in the evaluation.…”

Section: Building and Maintaining the EKG (mentioning)

confidence: 98%

Aurum: A Data Discovery System

Fernandez, Abedjan, Koko, et al. 2018

2018 IEEE 34th International Conference on Data Engineering (ICDE)

106

118


Organizations face a data discovery problem when their analysts spend more time looking for relevant data than analyzing it. This problem has become commonplace in modern organizations as: i) data is stored across multiple storage systems, from databases to data lakes, to the cloud; ii) data scientists do not operate within the limits of well-defined schemas or a small number of data sources; instead, to answer complex questions they must access data spread across thousands of data sources. To address this problem, we capture relationships between datasets in an enterprise knowledge graph (EKG), which helps users to navigate among disparate sources. The contribution of this paper is AURUM, a system to build, maintain and query the EKG. To build the EKG, we introduce a two-step process which scales to large datasets and requires only one pass over the data, avoiding overloading the source systems. To maintain the EKG without re-reading all data every time, we introduce a resource-efficient sampling signature (RESS) method which works by only using a small sample of the data. Finally, to query the EKG, we introduce a collection of composable primitives, thus allowing users to define many different types of discovery queries. We describe our experience using AURUM in three corporate scenarios and do a performance evaluation of each component.
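
The citation statement quoted for this entry describes a primary-key check: a column's uniqueness ratio should be 1, but because the distinct count is estimated with a sketch, the test is only that the ratio is "close to 1". The following is a minimal illustration of that idea, not Aurum's code. It assumes an exact `set`-based distinct count as a stand-in for the HyperLogLog estimate, and the function names and the `tolerance` value are hypothetical.

```python
from typing import Hashable, Iterable


def uniqueness_ratio(values: Iterable[Hashable]) -> float:
    """Distinct values divided by total values.

    Assumption: the distinct count here is exact. A profiler using a
    cardinality sketch would supply an estimate for the numerator instead.
    """
    items = list(values)
    if not items:
        return 0.0
    return len(set(items)) / len(items)


def is_pk_candidate(values: Iterable[Hashable], tolerance: float = 0.05) -> bool:
    """True when the uniqueness ratio is within `tolerance` of 1.

    The tolerance is a hypothetical value chosen for illustration. The
    absolute difference is used because an estimated ratio can land
    slightly above 1 as well as below it.
    """
    return abs(uniqueness_ratio(values) - 1.0) <= tolerance


if __name__ == "__main__":
    print(is_pk_candidate(["a1", "a2", "a3", "a4"]))  # all values distinct
    print(is_pk_candidate(["x", "x", "y", "y"]))      # many repeats
```

Comparing against a tolerance rather than testing for equality is what lets an approximate distinct count be used without rejecting true keys.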


“…T d behind the oth our similarity fu data within the uch summarizati search engine ty between the g [14]. The result ity [14,15] We separately t poral, and envi es from a prior the dataset con that study. We s Figure 5: in eac oint-wise calcul culated from the scores for tempo atial search term in Figure 5c.…”

Section: Zing Data (mentioning)

confidence: 99%

“…With correlat search term tes n is reasonably v imilar ranking. W of using a co ance is reasonabl rsus applying th f the bubble re in both scoring other high-scor and scores calc ons in the datase ested search ter ironmental varia study [15]. We ntents to the 2,9 show the results ch case we grap lation (on the v e dataset summ oral search term ms in Figure 5 The results for d n. Finally, we co ms for each searc g orientation of most cases, the ween the two app ficient of determ mbined scores.…”

Section: Zing Data (mentioning)

confidence: 99%

“…DNH usability and effectiveness was shown in two user studies. Megler and Maier [15] initially evaluated the appropriateness of a proposed similarity measure, based in cognitive science, via a user study that tested whether a population of scientists had a relatively consistent mental model of what it meant for a collection of numeric data to be "close" to a numeric search term. They found that the population of scientists surveyed provided markedly similar responses, and concluded that it might be possible to replicate the scientists' assessments.…”

Section: Introduction (mentioning)

confidence: 99%


Demonstrating "Data Near Here"

Megler, Maier. 2015

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Self Cite


Prior work proposed "Data Near Here" (DNH), a data search engine for scientific archives that is modeled on Internet search engines. DNH performs a periodic, asynchronous scan of each dataset in an archive, extracting lightweight features that are combined to form a dataset summary. During a search, DNH assesses the similarity of the search terms to the summary features and returns to the user, at interactive timescales, a ranked list of datasets for further exploration and analysis. We will demonstrate the search capabilities and ancillary metadata-browsing features for an archive of observational oceanographic data. While comparing search terms to complete datasets might seem ideal, interactive search speed would be impossible with archives of realistic size. We include an analysis showing that our summary-based approach gives a reasonable approximation of such a "complete dataset" similarity measure.
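
The summary-then-rank pattern in this abstract can be sketched in a few lines. This is an illustration only, not the DNH implementation. It assumes each dataset summary is a single numeric range, and it uses a hypothetical score that is 1 when the query range overlaps the summary and decays with the gap between them. The `Summary` class, `score`, and `rank` are invented names, and the scoring formula is not the similarity measure evaluated in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Summary:
    """Hypothetical lightweight summary: the value range of one variable."""
    name: str
    lo: float
    hi: float


def score(summary: Summary, q_lo: float, q_hi: float) -> float:
    """Similarity of a query range to a summary range, in (0, 1].

    Assumed formula: 1.0 when the ranges overlap, otherwise decreasing
    with the gap between them measured in query widths.
    """
    gap = max(summary.lo - q_hi, q_lo - summary.hi, 0.0)
    width = max(q_hi - q_lo, 1e-9)  # guard against a zero-width query
    return 1.0 / (1.0 + gap / width)


def rank(summaries: List[Summary], q_lo: float, q_hi: float) -> List[Summary]:
    """Return summaries ordered from most to least similar to the query."""
    return sorted(summaries, key=lambda s: score(s, q_lo, q_hi), reverse=True)


if __name__ == "__main__":
    archive = [
        Summary("far", 100.0, 110.0),
        Summary("overlapping", 0.0, 10.0),
        Summary("near", 20.0, 30.0),
    ]
    for s in rank(archive, 5.0, 15.0):
        print(s.name, round(score(s, 5.0, 15.0), 3))
```

Because only the summaries are consulted at query time, the cost of a search depends on the number of datasets rather than on their size, which is the trade-off the abstract points to.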


“…From user perspectives, research has revealed the similarities and differences between dataset search and document retrieval (Kern & Mathiak, 2015;Megler & Maier, 2015), and recognized that locating data for reuse in research is challenging for many (Gregory et al, 2020). Specifically, data discovery and reuse are purpose-driven activities, characterized by a variety of data needs and discovery strategies (Gregory et al, 2020b) that are interwoven with processes of sensemaking (Koesten et al, 2020).…”

Section: Introduction (mentioning)

confidence: 99%

Data Discovery and Reuse in Data Service Practices: A Global Perspective

Liu, Chen, Katô, et al. 2021

Proceedings of the Association for Information Science and Tech


The proposed panel will address the issues of the discovery and reuse of publicly available data on the web in the context of data service practices from a global perspective. Thousands of data discovery services have appeared around the world since the promotion of 'open science', reproducible research, and the FAIR (Findable, Accessible, Interoperable and Reusable) data principles in the research sector. However, there is also increasing demand for transparency of search algorithms, and in the design, development, evaluation, and deployment of current data search services; this requires a better understanding of how users approach data discovery and interact with data in search settings. From a global perspective, we will identify and discuss the specific system design issues in data discovery and reuse, drawing on our organization of the NTCIR (NII Testbeds and Community for Information access Research) project of Data Search track, the design and evaluation of the data discovery service of the Australian Research Data Commons (ARDC), and studies examining researchers' practices of data discovery and reuse.

