“…The applied anomaly detection techniques based on one-class SVM and k-medoids cluster dissimilarity are probably for the first time combined with the GloVe representation, applied to text data, and compared to their counterparts using the bag of words representation. Prior work using word2vec for anomaly detection (Bertero et al 2017;Pande and Ahuja 2017;Bakarov et al 2018) did not combine it with clustering-based detection methods and did not include comparisons with bag of words. 3.…”
Section: Noveltymentioning
confidence: 99%
“…This work revisits the less popular but more easily and widely applicable idea of applying general-purpose algorithms to text transformed to a vector representation (Manevitz and Yousef 2002). There is already some evidence that recent developments in word embeddings make this path more useful (Bertero et al 2017;Pande and Ahuja 2017;Bakarov et al 2018).…”
Section: Unsupervised Anomaly Detectionmentioning
confidence: 99%
“…This produces word vectors (vector representations of words), but a related doc2vec algorithm can be used to obtain document vectors as well (Le and Mikolov 2014). There are some recent encouraging results with anomaly detection using the word2vec representation (Bertero et al 2017;Pande and Ahuja 2017;Bakarov et al 2018). A related embedding-based data representation was also proposed for anomaly detection in a series of categorical events (Chen et al 2016).…”
Section: Global Vectorsmentioning
confidence: 99%
“…Besides the most common bag of words text representation, a recent more refined Global Vectors (GloVe) representation (Pennington, Socher, and Manning 2014) based on word embeddings is employed which makes it easy to control the dimensionality and apply arbitrary general-purpose classification and clustering algorithms. While text representations based on word embeddings are becoming popular (Goldberg and Levy 2014; Lau and Baldwin 2016; Bakarov 2018), there have been only few demonstrations of their utility for anomaly detection (Bertero et al 2017; Pande and Ahuja 2017; Bakarov et al 2018), using word2vec (Mikolov et al 2013a) rather than GloVe embeddings.The applied anomaly detection techniques based on one-class SVM and k -medoids cluster dissimilarity are probably for the first time combined with the GloVe representation, applied to text data, and compared to their counterparts using the bag of words representation. Prior work using word2vec for anomaly detection (Bertero et al 2017; Pande and Ahuja 2017; Bakarov et al 2018) did not combine it with clustering-based detection methods and did not include comparisons with bag of words.Unlike in most prior work on anomaly detection using word embeddings (Bertero et al 2017; Bakarov et al 2018), large datasets are used (tens of thousands rather than hundreds) to better match the scale of realistic applications.Unlike in most prior work on clustering-based anomaly detection (He et al 2003; Al-Zoubi 2009; Gao 2009; Amer and Goldstein 2012), the cluster dissimilarity approach is applied as a modeling algorithm, that is, with an anomaly detection model created on the training set and applicable to new data.New cluster dissimilarity-based anomaly score definitions are proposed that may be promising alternatives to those known from the literature (He et al 2003; Amer and Goldstein 2012).…”
Anomaly detection can be seen as an unsupervised learning task in which a predictive model created on historical data is used to detect outlying instances in new data. This work addresses possibly promising but relatively uncommon application of anomaly detection to text data. Two English-language and one Polish-language Internet discussion forums devoted to psychoactive substances received from home-grown plants, such as hashish or marijuana, serve as text sources that are both realistic and possibly interesting on their own, due to potential associations with drug-related crime. The utility of two different vector text representations is examined: the simple bag of words representation and a more refined Global Vectors (GloVe) representation, which is an example of the increasingly popular word embedding approach. They are both combined with two unsupervised anomaly detection methods, based on one-class support vector machines (SVM) and based on dissimilarity to k-medoids clusters. The GloVe representation is found definitely more useful for anomaly detection, permitting better detection quality and ameliorating the curse of dimensionality issues with text clustering. The cluster dissimilarity approach combined with this representation outperforms one-class SVM with respect to detection quality and appears a more promising approach to anomaly detection in text data.
“…The applied anomaly detection techniques based on one-class SVM and k-medoids cluster dissimilarity are probably for the first time combined with the GloVe representation, applied to text data, and compared to their counterparts using the bag of words representation. Prior work using word2vec for anomaly detection (Bertero et al 2017;Pande and Ahuja 2017;Bakarov et al 2018) did not combine it with clustering-based detection methods and did not include comparisons with bag of words. 3.…”
Section: Noveltymentioning
confidence: 99%
“…This work revisits the less popular but more easily and widely applicable idea of applying general-purpose algorithms to text transformed to a vector representation (Manevitz and Yousef 2002). There is already some evidence that recent developments in word embeddings make this path more useful (Bertero et al 2017;Pande and Ahuja 2017;Bakarov et al 2018).…”
Section: Unsupervised Anomaly Detectionmentioning
confidence: 99%
“…This produces word vectors (vector representations of words), but a related doc2vec algorithm can be used to obtain document vectors as well (Le and Mikolov 2014). There are some recent encouraging results with anomaly detection using the word2vec representation (Bertero et al 2017;Pande and Ahuja 2017;Bakarov et al 2018). A related embedding-based data representation was also proposed for anomaly detection in a series of categorical events (Chen et al 2016).…”
Section: Global Vectorsmentioning
confidence: 99%
“…Besides the most common bag of words text representation, a recent more refined Global Vectors (GloVe) representation (Pennington, Socher, and Manning 2014) based on word embeddings is employed which makes it easy to control the dimensionality and apply arbitrary general-purpose classification and clustering algorithms. While text representations based on word embeddings are becoming popular (Goldberg and Levy 2014; Lau and Baldwin 2016; Bakarov 2018), there have been only few demonstrations of their utility for anomaly detection (Bertero et al 2017; Pande and Ahuja 2017; Bakarov et al 2018), using word2vec (Mikolov et al 2013a) rather than GloVe embeddings.The applied anomaly detection techniques based on one-class SVM and k -medoids cluster dissimilarity are probably for the first time combined with the GloVe representation, applied to text data, and compared to their counterparts using the bag of words representation. Prior work using word2vec for anomaly detection (Bertero et al 2017; Pande and Ahuja 2017; Bakarov et al 2018) did not combine it with clustering-based detection methods and did not include comparisons with bag of words.Unlike in most prior work on anomaly detection using word embeddings (Bertero et al 2017; Bakarov et al 2018), large datasets are used (tens of thousands rather than hundreds) to better match the scale of realistic applications.Unlike in most prior work on clustering-based anomaly detection (He et al 2003; Al-Zoubi 2009; Gao 2009; Amer and Goldstein 2012), the cluster dissimilarity approach is applied as a modeling algorithm, that is, with an anomaly detection model created on the training set and applicable to new data.New cluster dissimilarity-based anomaly score definitions are proposed that may be promising alternatives to those known from the literature (He et al 2003; Amer and Goldstein 2012).…”
Anomaly detection can be seen as an unsupervised learning task in which a predictive model created on historical data is used to detect outlying instances in new data. This work addresses possibly promising but relatively uncommon application of anomaly detection to text data. Two English-language and one Polish-language Internet discussion forums devoted to psychoactive substances received from home-grown plants, such as hashish or marijuana, serve as text sources that are both realistic and possibly interesting on their own, due to potential associations with drug-related crime. The utility of two different vector text representations is examined: the simple bag of words representation and a more refined Global Vectors (GloVe) representation, which is an example of the increasingly popular word embedding approach. They are both combined with two unsupervised anomaly detection methods, based on one-class support vector machines (SVM) and based on dissimilarity to k-medoids clusters. The GloVe representation is found definitely more useful for anomaly detection, permitting better detection quality and ameliorating the curse of dimensionality issues with text clustering. The cluster dissimilarity approach combined with this representation outperforms one-class SVM with respect to detection quality and appears a more promising approach to anomaly detection in text data.
“…3. Перед применением алгоритма классификации временные окна каждого журнала проходят процесс приведения к нормальному виду, построение векторного представления [16] и присвоения каждому слову весовых коэффициентов TF-IDF [17]. 4.…”
Section: использование журналов спо схд для диагностики неисправностейunclassified
This paper describes application of diagnostic model, created with ontological modelling methods and machine learning text classifi cation algorithms, for fault detection, based on system log messages data, in enterprise-level storage system. Proposed fault detection model uses external procedures for the description ofthe relations between parameters and states of storage systems, based on the implementation of the machine learning algorithms. As an example of such relation, author describes application of the text classifi cation method for the task of software log analysis.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.