2016
DOI: 10.1007/978-3-319-31750-2_34

Enabling Hierarchical Dirichlet Processes to Work Better for Short Texts at Large Scale

Cited by 5 publications (3 citation statements) | References 9 publications

“…• Long texts: We use two large datasets of long texts for evaluation: the PubMed dataset consists of 330,000 articles from PubMed Central, and the New York Times dataset consists of 300,000 news articles. • Short texts: We use three large datasets of short texts for evaluation: Yahoo Questions, crawled from answers.yahoo.com, where each document is a question; Tweets from Twitter (twitter.com), where each document is the text content of a tweet; and NYT-Titles from The New York Times (www.nytimes.com), where each document is the title of an article [58]. These datasets are preprocessed by tokenizing, stemming, removing stop words, removing low-frequency words (appearing in fewer than 3 documents), and removing extremely short documents (fewer than 3 words)…”
Section: B Empirical Evaluation
confidence: 99%
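
The preprocessing pipeline quoted above (tokenize, stem, remove stop words, drop words appearing in fewer than 3 documents, drop documents left with fewer than 3 words) is standard enough to sketch. Below is a minimal version in Python, assuming NLTK for tokenization, stemming, and stop-word lists; the `preprocess` helper and its thresholds follow the quoted description, not necessarily the citing paper's exact setup.

```python
# Minimal sketch of the quoted preprocessing pipeline, assuming NLTK
# (requires nltk.download("punkt") and nltk.download("stopwords")).
# Helper name and thresholds mirror the quoted description; the citing
# paper's exact setup may differ.
from collections import Counter

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

MIN_DOC_FREQ = 3  # drop words that appear in fewer than 3 documents
MIN_DOC_LEN = 3   # drop documents left with fewer than 3 words


def preprocess(raw_docs):
    stemmer = PorterStemmer()
    stops = set(stopwords.words("english"))

    # Tokenize, lowercase, keep alphabetic tokens, remove stop words, stem.
    docs = [
        [stemmer.stem(t) for t in word_tokenize(d.lower())
         if t.isalpha() and t not in stops]
        for d in raw_docs
    ]

    # Document frequency: in how many documents does each word occur?
    df = Counter(w for doc in docs for w in set(doc))

    # Remove low-frequency words, then extremely short documents.
    docs = [[w for w in doc if df[w] >= MIN_DOC_FREQ] for doc in docs]
    return [doc for doc in docs if len(doc) >= MIN_DOC_LEN]
```
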
“…These datasets are preprocessed by tokenizing, stemming, removing stop words, removing low-frequency words (appearing in fewer than 3 documents), and removing extremely short documents (fewer than 3 words). The shortness of texts poses various difficulties [58]–[60] because of their natural characteristics, such as sparseness, large scale, immediacy, and non-standardization [57]. It is difficult for traditional methods to deal with short texts, mainly because the few words in a short text cannot adequately represent the feature space or the relationships between words and documents…”
Section: B Empirical Evaluation
confidence: 99%
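
The sparseness argument in this statement is easy to make concrete: in a bag-of-words representation, a tweet or a title activates only a handful of entries out of a vocabulary of tens of thousands. A toy illustration using scikit-learn's CountVectorizer (the three short documents below are hypothetical):

```python
# Toy illustration of short-text sparseness: each short document fills
# only a few entries of its bag-of-words vector. With a realistic
# vocabulary (tens of thousands of terms), the document-term matrix is
# almost entirely zeros. The corpus below is hypothetical.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "stock markets rally as interest rates fall",    # a news title
    "why does my python loop run so slowly",         # a question
    "coffee study links caffeine to better memory",  # a tweet-length text
]

X = CountVectorizer().fit_transform(corpus)  # sparse document-term matrix
for i in range(X.shape[0]):
    print(f"doc {i}: {X[i].nnz} nonzero of {X.shape[1]} vocabulary entries")
```

On a full corpus such as Yahoo Questions or NYT-Titles, the same count would be a handful of nonzeros against tens of thousands of vocabulary columns, which is the sparsity the quotation points to.
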
“…Simultaneously, it should be plastic when concept drift happens. Secondly, noisy and sparse data (Nguyen et al., 2021; Ha et al., 2019; Mai et al., 2016; Tuan et al., 2020) pose many difficulties for learning methods. While sparse data does not provide a clear context, noisy data can mislead the methods…”
Section: Introduction
confidence: 99%