Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2008
DOI: 10.1145/1401890.1401936

Fast logistic regression for text categorization with variable-length n-grams

Abstract: A common representation used in text categorization is the bag-of-words model (a.k.a. the unigram model). Learning with this representation typically involves some preprocessing, e.g., stopword removal and stemming, which results in one explicit tokenization of the corpus. In this work, we introduce a logistic regression approach where learning involves automatic tokenization. This allows us to weaken the a priori required knowledge about the corpus and results in a tokenization with variable-length (word or …
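To make the contrast concrete: a bag-of-words model fixes one tokenization up front, while a variable-length n-gram representation enumerates candidate features of several lengths and lets the learner select the discriminative ones. The sketch below is illustrative only; the function name and the hard `max_n` cutoff are assumptions for the example, not the paper's method, which searches the n-gram space without fixing such a cutoff in advance.

```python
from collections import Counter

def variable_length_ngrams(text, max_n=3):
    """Enumerate all word n-grams up to max_n (hypothetical helper).

    A learner such as logistic regression can then select the
    discriminative n-grams instead of committing to one tokenization."""
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

# Unigram (bag-of-words) features are the special case max_n = 1.
print(variable_length_ngrams("the cat sat on the mat", max_n=2))
```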

Cited by 70 publications (50 citation statements)
References 35 publications
“…In this work we study a sequence classification method, the Sequence Learner (SEQL), introduced in [10], [11]. Due to its greedy optimization approach, SEQL can quickly capture the distinct patterns of sequence data in very high-dimensional spaces.…”
Section: Related Work
confidence: 99%
“…Sequence Learner SEQL learns discriminative subsequences from training data by exploiting the all-subsequence space using a coordinate gradient descent approach [10], [11]. The key idea is to exploit the structure of the subsequence space in order to efficiently optimize a classification loss function, such as the binomial log-likelihood loss of Logistic Regression or squared hinge loss of Support Vector Machines.…”
Section: Classification With Sequence Learner
confidence: 99%
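For reference, the two loss functions named in this statement have the following standard forms; the notation (a label y ∈ {−1, +1} and a linear score over subsequence-count features) is illustrative shorthand, not taken from the paper.

```latex
% Standard forms of the two losses named above, for a label
% y \in \{-1,+1\} and a linear score f(x) = \beta^\top \phi(x),
% where \phi(x) is an (illustrative) vector of subsequence counts.
\[
  L_{\mathrm{logistic}}\bigl(y, f(x)\bigr) = \log\bigl(1 + e^{-y\,f(x)}\bigr)
\]
\[
  L_{\mathrm{sq\text{-}hinge}}\bigl(y, f(x)\bigr) = \max\bigl(0,\; 1 - y\,f(x)\bigr)^{2}
\]
```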
“…Studies in [76], [77] designed linear classifiers to train explicit mappings of sequence data, where features correspond to subsequences. Using the relation between subsequences, they are able to design efficient training methods for very high dimensional mappings.…”
Section: A. Training and Testing Explicit Data Mappings via Linear Classifiers
confidence: 99%
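A minimal sketch of the idea this statement describes, an explicit n-gram feature mapping fed to a linear classifier, assuming scikit-learn and a toy dataset; it does not reproduce the efficient training methods of [76], [77], which exploit the structure of the subsequence space rather than materializing the mapping naively.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus and binary labels (illustrative only).
docs = ["good movie", "bad movie", "great plot", "terrible plot"]
labels = [1, 0, 1, 0]

# Explicit mapping: features are word n-grams of length 1 to 3.
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)

# Linear classifier trained directly on the explicit mapping.
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vectorizer.transform(["good plot"])))
```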