2014
DOI: 10.1186/1687-4722-2014-14

Classification of heterogeneous text data for robust domain-specific language modeling

Abstract: The robustness of n-gram language models depends on the quality of text data on which they have been trained. The text corpora collected from various resources such as web pages or electronic documents are characterized by many possible topics. In order to build efficient and robust domain-specific language models, it is necessary to separate domain-oriented segments from the large amount of text data, and the remaining out-of-domain data can be used only for updating of existing in-domain n-gram probability e…

Citations: Cited by 12 publications (9 citation statements)
References: 16 publications
“…When the number of documents increased, the computational complexity also increased (Stas, Juhar, & Hladek, 2014). ML is often seen as an offshoot of statistics as far as data mining is concerned.…”
Section: Literature Review
mentioning
confidence: 99%
“…Text classification is a well-studied area in Natural Language Processing, yet it still is a very demanding research subject [50][51][52]. Most of the text classification methods concentrate on the context classification.…”
Section: Methods
mentioning
confidence: 99%
“…To calculate the numeric value of the features in sentence S_k, Eqs. (14) and (15) are introduced, where NwS_k is the number of words in S_k. (9) In Fig.…”
Section: Capturing Domain Sensitive Features (Dsf)
mentioning
confidence: 99%
“…Meanwhile, ML-based techniques rely on ML algorithms and see SA as a regular text classification task. Text classification task assigns a piece of text data into several predefined classes involving ML algorithms [15]. In terms of SA task, ML-based techniques classify text document into one out of three classes namely positive class, neutral class, and negative class.…”
mentioning
confidence: 99%