2017 IEEE International Conference on Big Data (Big Data) 2017
DOI: 10.1109/bigdata.2017.8258564

Scalable spam classifier for web tables

citations
Cited by 4 publications
(5 citation statements)
references
References 2 publications
“…Villasenor, S., et al, Machine-learning techniques for green and powerful Web-tables junk mail filtering that become examined on a huge scalable Web-tables aggregation of approx. 36 million tables to used (Villasenor, S., et al, 2017).…”
Section: Spam Filtering (mentioning)
confidence: 99%
“…We have used 100,000 dimensional feature space, i.e. 100K English terms in our vocabulary that we have selected by taking all terms from our datasets, sorting by frequency and cutting off the noise words and spam [33]. Increasing the dimensionality further led to significantly slower training time, which would prevent or make the experiments much more difficult (see the Section below for the configuration of our cluster).…”
Section: Feature Space (mentioning)
confidence: 99%
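The vocabulary selection described in the statement above (collect all terms, sort by frequency, cut off noise words) can be sketched as follows. This is a minimal illustration, not the citing authors' code: the function name `build_vocabulary`, the whitespace tokenizer, and the `noise_words` set are all assumptions made for the example, and `max_terms` stands in for the 100K cutoff mentioned in the quote.

```python
from collections import Counter


def build_vocabulary(documents, max_terms, noise_words=frozenset()):
    """Return up to max_terms terms, most frequent first.

    Hypothetical helper: tokenization is a plain lowercase whitespace
    split, and noise_words is a caller-supplied stop list.
    """
    counts = Counter()
    for doc in documents:
        for term in doc.lower().split():
            if term not in noise_words:
                counts[term] += 1
    # most_common sorts by descending frequency and truncates the list.
    return [term for term, _ in counts.most_common(max_terms)]


# Toy corpus standing in for the Web table text.
docs = ["cheap pills buy now", "buy the table of prices", "the prices table"]
vocab = build_vocabulary(docs, max_terms=3, noise_words={"the", "of"})
```

Capping `max_terms` fixes the dimensionality of the feature space, which is the quantity the statement ties to training time.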
“…We use a large-scale Web tables dataset of ≈ 86 million Web table tuples (Approximately 55 million non-spam tuples) from [28,55]. These instances came from Web tables pulled from sources such as online forums, social media sites, product offers, and others.…”
Section: Dataset (mentioning)
confidence: 99%
“…Similar to E-mail or Web pages, tables extracted from the Web also have spam (examples include empty tables, HTML formatting, junk advertisements, etc) and require cleaning before ingestion. We trained our own J48 web table spam classifier [43,55] to filter out tables with these characteristics. Using 10-fold cross-validation, a technique for estimating model performance [43] (in this case, the performance of our classifier), we observed 72.6% precision and 70.6% recall.…”
Section: Ingestion (mentioning)
confidence: 99%
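The evaluation procedure named in the statement above, k-fold cross-validation reporting precision and recall, can be sketched as below. This is an illustration under stated assumptions only: the citing authors used a J48 decision tree, which is not reproduced here; `train_stump` is a hypothetical one-feature threshold classifier used as a stand-in, and the toy data (a single numeric feature per table) is invented for the example.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def cross_validated_precision_recall(X, y, train_fn, k=10):
    """Pool true/false positives over all folds, then compute the metrics."""
    tp = fp = fn = 0
    for train, test in k_fold_indices(len(X), k):
        predict = train_fn([X[i] for i in train], [y[i] for i in train])
        for i in test:
            predicted_spam = predict(X[i])
            if predicted_spam and y[i]:
                tp += 1
            elif predicted_spam and not y[i]:
                fp += 1
            elif not predicted_spam and y[i]:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def train_stump(X, y):
    """Hypothetical stand-in for J48: threshold halfway between class means.

    Assumes spam tables have the smaller feature value and that both
    classes appear in the training fold.
    """
    spam = [x for x, label in zip(X, y) if label]
    ham = [x for x, label in zip(X, y) if not label]
    threshold = (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2
    return lambda x: x < threshold


# Toy data: feature = number of non-empty cells; label True means spam.
X = [0, 1, 0, 1, 5, 6, 7, 8, 5, 6]
y = [True, True, True, True, False, False, False, False, False, False]
precision, recall = cross_validated_precision_recall(X, y, train_stump, k=5)
```

Pooling the counts across folds before dividing gives a single precision and recall for the whole dataset; averaging per-fold scores is a common alternative and can give slightly different numbers.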