2016
DOI: 10.1016/j.patrec.2016.06.012

Representation learning for very short texts using weighted word embedding aggregation

Abstract: • We create text representations by weighting word embeddings using idf information. • A novel median-based loss is designed to mitigate the negative effect of outliers. • A dataset of semantically related textual pairs from Wikipedia and Twitter…
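The idf-weighted aggregation named in the highlights can be illustrated with a minimal sketch. The dictionary-based embedding lookup, the 300-dimensional vectors, and the toy idf table below are assumptions for illustration only, not the authors' released code.

```python
# Sketch of idf-weighted word-embedding aggregation (illustrative only:
# the vectors and idf values here are toy placeholders, not real data).
import numpy as np

def weighted_text_vector(tokens, vectors, idf, dim=300):
    """Idf-weighted mean of the embeddings of in-vocabulary tokens."""
    acc, total = np.zeros(dim), 0.0
    for tok in tokens:
        if tok in vectors and tok in idf:   # skip out-of-vocabulary tokens
            acc += idf[tok] * vectors[tok]
            total += idf[tok]
    return acc / total if total > 0 else acc

# Toy example with random vectors standing in for pretrained embeddings.
vectors = {"cat": np.random.rand(300), "sat": np.random.rand(300)}
idf = {"cat": 2.3, "sat": 1.1}
text_vec = weighted_text_vector(["the", "cat", "sat"], vectors, idf)
```

Rarer (higher-idf) words pull the aggregate vector toward themselves, which is the intuition behind weighting short texts this way.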

Cited by 150 publications (93 citation statements)
References 19 publications (20 reference statements)
“…We use the average of the word embeddings of content words in the tweet. Averages of word embeddings have been used for different NLP tasks (De Boom et al, 2016; Yoon et al, 2018; Orasan, 2018; Komatsu et al, 2015; Ettinger et al, 2018). As in past work, words that were not learned in the embeddings are dropped during the computation of the tweet vector.…”
Section: Word-based Representations
confidence: 99%
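A minimal sketch of the plain averaging scheme this citing work describes, where out-of-vocabulary words are dropped before averaging; the dictionary-based lookup and toy data are assumptions for illustration.

```python
# Unweighted embedding average for a tweet: drop words missing from the
# embedding vocabulary, then take the mean of what remains.
import numpy as np

def average_tweet_vector(tokens, vectors, dim=300):
    known = [vectors[t] for t in tokens if t in vectors]  # OOV words dropped
    return np.mean(known, axis=0) if known else np.zeros(dim)

vectors = {"storm": np.random.rand(300), "coming": np.random.rand(300)}
tweet_vec = average_tweet_vector(["storm", "is", "coming"], vectors)
```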
“…There are some advantages to this continuous space: the dimensionality is greatly reduced, and words that are close in meaning are close in this new continuous space. Several applications of neural-network-based word embeddings have been introduced, including word2vec [26], the Dictionary of Affect in Language (DAL) [44], SentiWordNet [43], GloVe [26] and Wiktionary [45].…”
Section: Word-embedding Descriptor
confidence: 99%
“…In Twitter, word embeddings are generally used for classification tasks that focus on sentiment classification, such as [60], [61], and also for other classification tasks like [62], [63]. Among the research that uses word embeddings, [64] has the approach most similar to our work, as it uses a hybrid approach combining tf-idf and word embeddings. [64] is evaluated with Wikipedia and Twitter data.…”
Section: Related Work
confidence: 99%
“…Among the research that uses word embeddings, [64] has the approach most similar to our work, as it uses a hybrid approach combining tf-idf and word embeddings. [64] is evaluated with Wikipedia and Twitter data. It performs well on Wikipedia; however, the error rate on Twitter is very high because each tweet contains too few words for tf-idf.…”
Section: Related Work
confidence: 99%