2017
DOI: 10.1108/idd-04-2017-0043

Identifying domain relevant user generated content through noise reduction: a test in a Chinese stock discussion forum

Abstract: Purpose – Getting high-quality data by removing noisy data from user-generated content (UGC) is the first step toward data mining and effective decision-making based on ubiquitous and unstructured social media data. This paper aims to design a framework for removing noisy data from UGC. Design/methodology/approach – The authors propose a classification-based framework to remove noise from unstructured UGC in a social media community. They treat the noise as the concerned topic non-r…
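The abstract describes a classification-based filter that treats topic-irrelevant posts as noise. As a minimal sketch of that general idea (not the authors' actual pipeline; the labels, feature set and classifier choice here are assumptions), one could train a bag-of-words classifier to separate on-topic stock-forum posts from noise:

```python
# Minimal sketch of a classification-based noise filter for forum posts.
# Assumptions: a small hand-labeled sample (1 = topic-relevant, 0 = noise)
# and a TF-IDF + Naive Bayes pipeline; the paper's actual features and
# classifier may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled posts from a stock discussion forum.
posts = [
    "Earnings beat estimates, expecting the stock to rally",  # relevant
    "Click here for free movie downloads!!!",                 # noise
    "Volume is unusually high ahead of the dividend date",    # relevant
    "Happy birthday to my cousin, nothing to do with stocks", # noise
]
labels = [1, 0, 1, 0]

# Train the classifier, then keep only posts predicted to be relevant.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(posts, labels)

new_posts = ["Analysts raised the price target after the report"]
relevant = [p for p in new_posts if clf.predict([p])[0] == 1]
print(relevant)
```

In practice such a filter would be trained on a much larger labeled sample and evaluated before being used to clean a corpus; the point here is only the shape of the approach: label, train, then discard posts classified as noise.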

Cited by 2 publications (3 citation statements)
References 44 publications
“…In particular, there has been evidence that the interaction between sentiment expressed by social and conventional media has a strong effect on the prediction of market variables (Yu et al., 2013; Li et al., 2016; Agarwal et al., 2019). Therefore, it is crucial to identify high-quality relevant content and conduct SA on the integrated data, extract collective market sentiment and understand their joint influences (Kearney and Liu, 2014; Yan et al., 2017; Li et al., 2018).…”
Section: Stock Market Lexicons (mentioning)
confidence: 99%
“…[1] were the first to propose that method, with a training dataset of 27,356 English SMS phrases. Their research was the basis of several similar works in Portuguese [8], Turkish [9] and Chinese [10], but never in Arabic or French. In addition, none of these works is open source, and they did not share the word embedding models, the lexicons or the dictionaries.…”
Section: Related Work (mentioning)
confidence: 99%
“…For the English language, the 3,429 English words from the Oxford dictionary (in which English stop-words are included), the 500 most frequently used words on Twitter and a list of 90 frequent sentiment words in tweets [13] were combined. A list of 3,501 English words is the result of combining the previous lists, after removing duplicates.…”
Section: Lists of Standard-form Seed-words (mentioning)
confidence: 99%
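The statement above describes a simple union-with-deduplication step over several seed-word lists. A minimal sketch of that step, assuming three hypothetical word-list files (the file names and one-word-per-line format are assumptions, not taken from the cited work):

```python
# Sketch: combine several seed-word lists and drop duplicates across them,
# preserving first-seen order. File names below are hypothetical.
def load_words(path: str) -> list[str]:
    """Read one lowercased word per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip().lower() for line in f if line.strip()]

# Hypothetical source lists: dictionary words, frequent Twitter words,
# and frequent sentiment words in tweets.
sources = ["oxford_words.txt", "twitter_top500.txt", "sentiment_words.txt"]

seen: set[str] = set()
combined: list[str] = []
for path in sources:
    for word in load_words(path):
        if word not in seen:  # skip duplicates appearing in earlier lists
            seen.add(word)
            combined.append(word)

print(f"{len(combined)} unique seed words")
```

The set-plus-list pattern keeps lookup constant-time while preserving the order in which words first appear, which matters if earlier lists are considered more authoritative.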