Scalable spam classifier for web tables

Villasenor, Santiago; Nguyen, Tom; Kola, Anusha; Soderman, Sean; Gubanov, Michael

doi:10.1109/bigdata.2017.8258564

Cited by 4 publications

(5 citation statements)

References 2 publications


“…Villasenor, S., et al, Machine-learning techniques for green and powerful Web-tables junk mail filtering that become examined on a huge scalable Web-tables aggregation of approx. 36 million tables to used (Villasenor, S., et al, 2017).…”

Section: Spam Filtering (mentioning)

confidence: 99%

Analysis of Cloud and Self-Web-Hosting Services Based on Security Parameters

Khare

Badholia

2022

International Journal of Information System Modeling and Design


Most web applications are hosted in the cloud because of its low cost and minimal infrastructure setup; in addition, users do not need to maintain the infrastructure themselves. In this paper, the two technologies, self-web-hosting and cloud-web-hosting, are compared on security parameters such as key generation (PKI), automatic authentication and protection of intra-tenant networks, secure logging of system events, spam filtering, CAPTCHA generation and authentication, and software widgets such as password metering. These parameters are classified into seven categories, and the review is organized by these categories. A bibliometric analysis was conducted on more than 70 research papers found in the Web of Science (WoS) database, using the bibliometric library and biblioshine package in R. The outcome of this analysis is presented in tabular form, and the open challenges of both technologies are discussed in detail with proposed solutions.



“…We have used 100,000 dimensional feature space, i.e. 100K English terms in our vocabulary that we have selected by taking all terms from our datasets, sorting by frequency and cutting off the noise words and spam [33]. Increasing the dimensionality further led to significantly slower training time, which would prevent or make the experiments much more difficult (see the Section below for the configuration of our cluster).…”

Section: Feature Space (mentioning)

confidence: 99%
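The vocabulary selection described in the quoted statement above (take all terms, sort by frequency, cut off noise words) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the tokenizer, the `NOISE_WORDS` list, and the function names `build_vocabulary` and `vectorize` are all assumptions made for the example.

```python
from collections import Counter
import re

# Hypothetical noise-word list; the quoted passage does not give the real one.
NOISE_WORDS = {"the", "a", "an", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents, max_terms=100_000, noise_words=NOISE_WORDS):
    """Return up to max_terms of the most frequent terms, skipping noise words."""
    counts = Counter()
    for text in documents:
        counts.update(tokenize(text))
    # most_common() sorts terms by descending frequency.
    ranked = [term for term, _ in counts.most_common() if term not in noise_words]
    return ranked[:max_terms]


def vectorize(text, vocabulary_index):
    """Map a text to a sparse {dimension: count} vector over the vocabulary."""
    vector = Counter()
    for token in tokenize(text):
        dimension = vocabulary_index.get(token)
        if dimension is not None:
            vector[dimension] += 1
    return dict(vector)


docs = ["the price of the phone", "phone price list", "phone"]
vocabulary = build_vocabulary(docs, max_terms=2)
vocabulary_index = {term: i for i, term in enumerate(vocabulary)}
sparse = vectorize("phone phone price the", vocabulary_index)
```

Capping `max_terms` is what bounds the dimensionality of the feature space; the statement reports a cap of 100K terms because larger vocabularies slowed training.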

Hybrid Metadata Classification in Large-scale Structured Datasets

Pavia

Piraino

Islam

et al. 2022

JDI

Self Cite


Metadata location and classification is an important problem for large-scale structured datasets. For example, Web table corpora contain hundreds of millions of tables, but often have missing or incorrect labels for rows (or columns) with attribute names. Such errors significantly complicate all data management tasks, such as query processing, data integration, and indexing. Different sources or authors position metadata rows/columns differently inside a table, which makes reliable identification challenging. In this work we describe our scalable, hybrid two-layer Deep- and Machine-learning based ensemble, combining Long Short Term Memory (LSTM) and a Naive Bayes Classifier to accurately identify metadata-containing rows or columns in a table. We have performed an extensive evaluation on several datasets, including an ultra large-scale dataset containing more than 15 million tables from more than 26 thousand sources, to justify scalability and resistance to the variety stemming from a large number of sources. We observed superiority of this two-layer ensemble compared to recent previous approaches and report 95.73% accuracy at scale with our ensemble model using regular LSTM.


“…We use a large-scale Web tables dataset of ≈ 86 million Web table tuples (Approximately 55 million non-spam tuples) from [28,55]. These instances came from Web tables pulled from sources such as online forums, social media sites, product offers, and others.…”

Section: Dataset (mentioning)

confidence: 99%

“…Similar to E-mail or Web pages, tables extracted from the Web also have spam (examples include empty tables, HTML formatting, junk advertisements, etc) and require cleaning before ingestion. We trained our own J48 web table spam classifier [43,55] to filter out tables with these characteristics. Using 10-fold cross-validation, a technique for estimating model performance [43] (in this case, the performance of our classifier), we observed 72.6% precision and 70.6% recall.…”

Section: Ingestion (mentioning)

confidence: 99%
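The evaluation procedure named in the quoted statement above, 10-fold cross-validation scored by precision and recall, can be illustrated with the sketch below. It is an assumption-laden example, not the cited system: `train_threshold` is a deliberately trivial stand-in learner (not J48), the single numeric feature per table is hypothetical, and the folds are contiguous and unshuffled for simplicity.

```python
def k_fold_indices(n, k=10):
    """Split range(n) into k contiguous folds of near-equal size (no shuffling)."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cross_validate(X, y, train_fn, k=10):
    """Predict each fold with a model trained on the other folds, then score once."""
    predictions = [None] * len(X)
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [label for i, label in enumerate(y) if i not in held_out]
        model = train_fn(train_X, train_y)
        for i in fold:
            predictions[i] = model(X[i])
    return precision_recall(y, predictions)


def train_threshold(train_X, train_y):
    """Stand-in learner (not J48): label 1 (spam) below the midpoint of class means.

    Assumes spam tables have the lower feature value, e.g. few non-empty cells.
    """
    spam = [x for x, label in zip(train_X, train_y) if label == 1]
    ham = [x for x, label in zip(train_X, train_y) if label == 0]
    if not spam or not ham:
        majority = 1 if len(spam) >= len(ham) else 0
        return lambda x: majority
    threshold = (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2
    return lambda x: 1 if x < threshold else 0


# Hypothetical feature: number of non-empty cells per table; label 1 marks spam.
X = [0, 1, 2, 3, 4, 10, 11, 12, 13, 14] * 2
y = [1 if value < 5 else 0 for value in X]
precision, recall = cross_validate(X, y, train_threshold, k=10)
```

Because every table is predicted by a model that never saw it during training, the pooled precision and recall estimate performance on unseen tables, which is the role cross-validation plays in the statement.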

“…ti − idi (Term Intersection -Inverse Document Intersection): Our ranking function for large-scale structured datasets accounts for high redundancy of search keywords in arbitrary database rows, a phenomena we observed for such datasets. This could happen even when the row in question is not spam [55]. An example of this would be a row containing a large amount of information concerning nobility, the word "lady" could appear multiple times.…”

Section: Baseline Ranking (mentioning)

confidence: 99%


Hybrid.AI

Soderman

Kola

Podkorytov

et al. 2018

Companion of the Web Conference 2018 on the Web Conference 2018 - WWW '18

Self Cite


Variety of Big data [17, 40, 44, 47, 52] is a significant impediment for anyone who wants to search inside a large-scale structured dataset. For example, there are millions of tables available on the Web, but the most relevant search result does not necessarily match the keyword query exactly, because the same information can be represented in a variety of ways. Here we describe Hybrid.AI, a learning search engine for large-scale structured data that uses automatically generated machine learning classifiers and Unified Famous Objects (UFOs) [33] to return the most relevant search results from a large-scale Web tables corpus. We evaluate it over this corpus, collecting 99 queries and their results from users, and observe a significant relevance gain.
