2018
DOI: 10.1109/tbdata.2017.2688360

Online Similarity Learning for Big Data with Overfitting

Abstract: In this paper, we propose a general model to address the overfitting problem in online similarity learning for big data, which is generally caused by two kinds of redundancy: 1) feature redundancy, i.e., there exist redundant (irrelevant) features in the training data; and 2) rank redundancy, i.e., the non-redundant (relevant) features lie in a low-rank space. To overcome these, our model is designed to obtain a simple and robust metric matrix by detecting the redundant rows and columns in t…
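The abstract describes learning a metric matrix online while suppressing redundant rows and columns. As a rough, hedged illustration of that general setting (not the authors' algorithm), the sketch below pairs an OASIS-style online update on similarity triplets with an elementwise soft-thresholding step so that rows and columns tied to irrelevant features shrink toward zero; the triplet construction, learning rate, and shrinkage parameter are assumptions made only for this example.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact algorithm):
# online bilinear similarity learning on triplets (x, x_pos, x_neg) with a
# hinge loss, followed by soft-thresholding of the metric matrix W as a crude
# way to suppress redundant (irrelevant) features.
import numpy as np

def similarity(W, a, b):
    """Bilinear similarity S_W(a, b) = a^T W b."""
    return a @ W @ b

def online_similarity_learning(triplets, dim, lr=0.1, margin=1.0, shrink=0.01):
    W = np.eye(dim)  # start from the identity (plain dot-product similarity)
    for x, x_pos, x_neg in triplets:
        # hinge loss: we want S(x, x_pos) to exceed S(x, x_neg) by the margin
        loss = margin - similarity(W, x, x_pos) + similarity(W, x, x_neg)
        if loss > 0:
            # OASIS-style rank-one update on the violated triplet
            W += lr * np.outer(x, x_pos - x_neg)
        # soft-threshold entries so rows/columns of irrelevant features decay
        W = np.sign(W) * np.maximum(np.abs(W) - shrink * lr, 0.0)
    return W

# Toy usage: 5-dimensional data where the last two features are pure noise.
rng = np.random.default_rng(0)

def make_triplet():
    x = rng.normal(size=5)
    x_pos = x + 0.1 * rng.normal(size=5)   # a similar point
    x_neg = rng.normal(size=5)             # a dissimilar point
    x[3:] = rng.normal(size=2)             # overwrite the noise dimensions
    x_pos[3:] = rng.normal(size=2)
    return x, x_pos, x_neg

W = online_similarity_learning([make_triplet() for _ in range(2000)], dim=5)
print(np.round(W, 2))   # rows/columns 3 and 4 should end up close to zero
```

On the toy data, the entries of W tied to the two noise dimensions should be driven toward zero, which is the kind of redundancy suppression the abstract alludes to.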

Cited by 23 publications (6 citation statements) | References: 30 publications
“…We will investigate the effect of additional data context types on the wrangling pipeline, and on other wrangling stages such as Web data extraction [49]. To further address time-varying variety and veracity problems in data wrangling, we will investigate feedback-based learning and model refinement techniques such as those presented in [42] or [50]. Furthermore, we are exploring how to combine evidence gained from data context with user preferences, as shown in [23], to broaden the possibilities of tailoring a data product for users with different requirements.…”
Section: Discussion (mentioning)
confidence: 99%
“…Feature selection is the process of removing irrelevant or redundant features while preserving important ones, so that the remaining features can describe the model more accurately. Redundant and irrelevant features act as noise in machine learning models: they waste computational effort and cause the model to overfit, reducing its accuracy [11]. There are three families of feature selection methods: filter, wrapper, and embedded selection.…”
Section: Feature Selection (mentioning)
confidence: 99%
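The statement above distinguishes filter, wrapper, and embedded feature selection. As a small, hedged illustration of the filter family only (the data, threshold, and scoring rule are assumptions made for this example), the sketch below scores each feature by its absolute Pearson correlation with the target and keeps those above a threshold:

```python
# Illustrative filter-method feature selection: rank features by absolute
# Pearson correlation with the target and keep those above a threshold.
# (Wrapper and embedded methods would instead score features through a model.)
import numpy as np

def filter_select(X, y, threshold=0.1):
    """Return indices of columns of X with |corr(X[:, j], y)| >= threshold."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.where(scores >= threshold)[0], scores

# Toy data: 6 features, only the first two actually drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.normal(size=500)

kept, scores = filter_select(X, y, threshold=0.2)
print("correlation scores:", np.round(scores, 2))
print("kept feature indices:", kept)   # expected: [0 1]
```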
“…al. [1] describes a model to control the overfitting problem in online similarity learning for big data. The model yields a simple and robust metric matrix by detecting redundant rows and columns in the metric matrix.…”
Section: Literature Survey (mentioning)
confidence: 99%
“…In the current research environment, big data [1] plays a major role in managing high volumes of data. Many sectors, such as agriculture, banking, and online marketing, have adopted big data and analytics.…”
Section: Introduction (mentioning)
confidence: 99%