2020
DOI: 10.1007/s10618-020-00680-1

ptype: probabilistic type inference

Abstract: Type inference refers to the task of inferring the data type of a given column of data. Current approaches often fail when data contains missing data and anomalies, which are found commonly in real-world data sets. In this paper, we propose ptype, a probabilistic robust type inference method that allows us to detect such entries, and infer data types. We further show that the proposed method outperforms the existing methods.

Cited by 8 publications (20 citation statements)
References 27 publications

“…Background: In this work, we extend the probabilistic type inference method called ptype [3]. Assuming that the data entries are read as strings, ptype allows us to infer a plausible column type (Boolean, date, float, integer or string) for a data column, and, conditioned on that type, identify any values which are deemed missing or anomalous.…”
Section: Methods (mentioning)
confidence: 99%
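
To make the task in this statement concrete, the following is a minimal sketch of inferring one type for a column of strings and then, conditioned on that type, labelling each entry as ok, missing, or anomalous. It is a deterministic, rule-based stand-in and not ptype's probabilistic model: the regular expressions, the missing-value tokens, the majority rule, and the name `infer_column` are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; ptype itself uses a probabilistic model, not these rules.
PATTERNS = {
    "boolean": re.compile(r"^(true|false|yes|no|t|f|y|n)$", re.IGNORECASE),
    "integer": re.compile(r"^[+-]?\d+$"),
    "float": re.compile(r"^[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}
# Hypothetical set of tokens treated as missing data.
MISSING = {"", "na", "n/a", "nan", "null", "none", "-", "?"}


def infer_column(values):
    """Return (column_type, per-entry labels) for a list of strings."""
    cleaned = [v.strip() for v in values]
    present = [s for s in cleaned if s.lower() not in MISSING]

    # Count how many non-missing entries each candidate type can explain.
    counts = {name: sum(1 for s in present if pat.match(s))
              for name, pat in PATTERNS.items()}

    # max() keeps the first maximal key, so "integer" wins ties against "float".
    best = max(PATTERNS, key=lambda name: counts[name])
    # Fall back to "string" unless the best type explains a strict majority.
    column_type = best if counts[best] * 2 > len(present) else "string"

    labels = []
    for s in cleaned:
        if s.lower() in MISSING:
            labels.append("missing")
        elif column_type == "string" or PATTERNS[column_type].match(s):
            labels.append("ok")
        else:
            labels.append("anomaly")
    return column_type, labels


column_type, labels = infer_column(["1", "2", "NA", "3", "abc", "4"])
print(column_type, labels)
```

A hard majority rule like this is brittle when anomalies are frequent, which is the kind of failure a probabilistic treatment is meant to soften.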
“…For example, in a data table about clothing, a variable "Class Name" could be a categorical variable taking on values such as Jackets, Dresses and Pants, while a variable "Rating" may take on values in a fixed range 1 through 5. To the best of our knowledge, these issues are not addressed by any existing work in the literature, except Bot (proposed by Majoor and Vanschoren [1]), OpenML and Weka which tackle the type inference based on heuristics such as labeling a column as categorical when the number of unique values is lower than a threshold (see Sec. 3 for a detailed discussion).…”
Section: Introduction (mentioning)
confidence: 99%
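
The unique-value heuristic mentioned in this statement can be sketched in a few lines. This is an illustrative reading of the general idea, not the actual rule used by Bot, OpenML, or Weka: the function name `looks_categorical` and the default threshold of 10 are assumptions.

```python
from typing import Iterable


def looks_categorical(values: Iterable[str], max_unique: int = 10) -> bool:
    """Hypothetical heuristic: a column is labelled categorical when its
    number of distinct non-empty values is below max_unique (an assumed threshold)."""
    distinct = {v.strip() for v in values if v is not None and v.strip() != ""}
    return 0 < len(distinct) < max_unique


class_name = ["Jackets", "Dresses", "Pants", "Dresses", "Jackets"]
rating = ["1", "5", "3", "4", "2", "5"]
identifiers = [str(i) for i in range(100)]

print(looks_categorical(class_name))
print(looks_categorical(rating))
print(looks_categorical(identifiers))
```

Under this sketch the "Class Name" and "Rating" columns receive the same label, because a count of unique values alone does not separate a set of named categories from a bounded numeric range.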
“…Automatically discovering the statistical and semantic types of data in tables is a valuable tool in data preparation and information retrieval. Accordingly, methods have been presented that predict the type of a column [3,4]. These methods expect the values in a column to have the same type.…”
Section: Single Column Type Detection (mentioning)
confidence: 99%
“…Table segmentation is related to statistical and semantic type detection, where the goal is to find the data type of a set of values. Unlike our unsupervised segmentation approach, type detection generally works in a predictive setting, where the goal is to classify the statistical type of columns or to annotate them with semantic types [3,4]. As data is assumed to be grouped in sets of values that share a distinctive type, table segmentation can serve as a preprocessing step.…”
Section: Related Work (mentioning)
confidence: 99%