Leveraging Schema Labels to Enhance Dataset Search

Chen, Zhiyu; Jia, Haiyan; Heflin, Jeff; Davison, Brian D.

doi:10.1007/978-3-030-45439-5_18

Cited by 17 publications

(10 citation statements)

References 14 publications


“…Trabelsi et al [37] propose custom embeddings for column headers based on multiple contexts for table retrieval, and find representing numerical cell values to be useful. Chen et al [8] utilize matrix factorization to generate additional table headers and then show that those generated headers can improve the performance of unsupervised table search.…”

mentioning

confidence: 99%

Table Search Using a Deep Contextualized Language Model

Chen

Trabelsi

Heflin

et al. 2020

Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval

Self Cite


Pretrained contextualized language models such as BERT have achieved impressive results on various natural language processing benchmarks. Benefiting from multiple pretraining tasks and large scale training corpora, pretrained models can capture complex syntactic word relations. In this paper, we use the deep contextualized language model BERT for the task of ad hoc table retrieval. We investigate how to encode table content considering the table structure and input length limit of BERT. We also propose an approach that incorporates features from prior literature on table retrieval and jointly trains them with BERT. In experiments on public datasets, we show that our best approach can outperform the previous state-of-the-art method and BERT baselines with a large margin under different evaluation metrics.

CCS CONCEPTS: • Information systems → Content analysis and feature selection; Retrieval models and ranking; • Computing methodologies → Search methodologies.


“…The metadata is then mapped to the Google's knowledge graph, which is then used for dataset duplicates detection and for dataset discovery. Chen et al (2020) enrich metadata records with labels based on the dataset content. Chapman et al (2020) describe the whole dataset discovery process comprising querying for datasets, query processing resulting in a list of datasets, result handling and its presentation.…”

Section: Dataset Discovery Techniques

mentioning

confidence: 99%

Modular framework for similarity-based dataset discovery using external knowledge

Nečaský

Škoda

Bernhauer

et al. 2022

DTA


Purpose: Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemes, shared attributes, vocabulary, structure and semantics. The existing dataset catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual metadata attached to the datasets. The search results are thus often insufficient. However, there exist many ways of improving the dataset discovery by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, countless feature extraction methods and description models and so forth.

Design/methodology/approach: In this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery.

Findings: The study proposes several proof-of-concept pipelines including experimental evaluation, which showcase the usage of the framework.

Originality/value: To the best of authors’ knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. The prototype implementation of the framework is available on GitHub.


“…Zhang & Balog, (2018) propose a semantic matching method for table retrieval where various embedding features are used. Chen et al (2020a) first learn the embedding representations of table headers and generate new headers with embedding features and curated features (Chen et al, 2018) for data tables. They show that the generated headers can be combined with the original fields of the table in order to accurately predict the relevance score of a query-table pair, and improve ranking performance.…”

Section: Structured Document Retrieval

mentioning

confidence: 99%

Neural ranking models for document retrieval

et al. 2021

Self Cite


Ranking models are the main components of information retrieval systems. Several approaches to ranking are based on traditional machine learning algorithms using a set of hand-crafted features. Recently, researchers have leveraged deep learning models in information retrieval. These models are trained end-to-end to extract features from the raw data for ranking tasks, so that they overcome the limitations of hand-crafted features. A variety of deep learning models have been proposed, and each model presents a set of neural network components to extract features that are used for ranking. In this paper, we compare the proposed models in the literature along different dimensions in order to understand the major contributions and limitations of each model. In our discussion of the literature, we analyze the promising neural components, and propose future research directions. We also show the analogy between document retrieval and other retrieval tasks where the items to be ranked are structured documents, answers, images and videos.
