Tibetan text classification using distributed representations of words

Jiang, Tao; Yu, Hongzhi; Zhang, Bing

doi:10.1109/ialp.2015.7451547

Cited by 8 publications

(5 citation statements)

References 7 publications


“…Early analysis of shallow word embedding models showed that word vectors providing a stronger semantic representation have a higher norm [34]. Moreover, when comparing the norm of the vectors with their term frequency within the training corpus, it is possible to notice that highly frequent terms, as well as rare ones, have a considerably smaller norm.…”

Section: Vector Significance (mentioning)

confidence: 99%

Static Fuzzy Bag-of-Words: Exploring Static Universe Matrices for Sentence Embeddings

Muffo, Tedesco, Sbattella et al. 2022

Signals and Communication Technology


Vector semantics has become a key tool for Natural Language Processing, especially for text analysis. This kind of vector representation is usually encoded through embeddings, which can capture semantic information at different levels of granularity. Over the years, models have been developed not only for word embeddings but also for sentences and documents. In this work we address sentence embeddings, in particular the non-parametric ones, which offer a good trade-off between performance and inference speed. We present the Static Fuzzy Bag-of-Words (SFBoW) model, a refinement of the Fuzzy Bag-of-Words approach that yields fixed-dimension sentence embeddings. We targeted fixed-size embeddings to promote caching and reusability, speeding up inference in a system that relies on our model. In this paper we explore various approaches for constructing a static universe matrix, which is fundamental to making the sentence embeddings of fixed size. To show the validity of our approach, we benchmarked our model on a semantic similarity task, obtaining competitive performance.


“…The task of sensitive word detection has attracted a lot of attention, due to the prevalence of online user-generated content (UGC). The majority of detection algorithms are based on the concept of the sensitive word tree (SMT), which represents one sensitive word by a node path from the root to a certain leaf node [5][6][7]. Note that common prefix characters from different sensitive words will usually occupy the same nodes in the sensitive word tree.…”

Section: Sensitive Words Detection (mentioning)

confidence: 99%

“…Sensitive word detection is a particular problem in content monitoring, which refers to the procedure of identifying target words in the given documents. The majority of existing detection algorithms are based on the concept of the sensitive word tree (SMT) [5][6][7]. As a tree structure, the SMT is a variant of the hash tree.…”

Section: Introduction (mentioning)

confidence: 99%

A Graph Convolutional Network‐Based Sensitive Information Detection Algorithm

Liu, Yang 2021

Complexity


In the field of natural language processing (NLP), the task of sensitive information detection refers to the procedure of identifying sensitive words for given documents. The majority of existing detection methods are based on the sensitive-word tree, which is usually constructed via the common prefixes of different sensitive words from the given corpus. Yet, these traditional methods suffer from a couple of drawbacks, such as poor generalization and low efficiency. For improvement purposes, this paper proposes a novel self-attention-based detection algorithm using the implementation of graph convolutional network (GCN). The main contribution is twofold. Firstly, we consider a weighted GCN to better encode word pairs from the given documents and corpus. Secondly, a simple, yet effective, attention mechanism is introduced to further integrate the interaction among candidate words and corpus. Experimental results from the benchmarking dataset of THUC news demonstrate a promising detection performance, compared to existing work.
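The prefix-sharing tree that this abstract and the quoted statements describe can be sketched as a character trie. This is a minimal illustration under stated assumptions, not the implementation from any of the cited papers: the names `TrieNode`, `SensitiveWordTree`, `add`, and `find_all`, as well as the example words, are hypothetical.

```python
class TrieNode:
    """One character position in the tree; children are keyed by character."""

    def __init__(self):
        self.children = {}
        self.is_word = False  # True if a sensitive word ends at this node


class SensitiveWordTree:
    """Hypothetical sketch of a sensitive-word tree built from shared prefixes."""

    def __init__(self, words=()):
        self.root = TrieNode()
        self.node_count = 1  # the root itself
        for word in words:
            self.add(word)

    def add(self, word):
        """Insert a word; characters of a shared prefix reuse existing nodes."""
        if not word:
            return
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
                self.node_count += 1
            node = node.children[ch]
        node.is_word = True

    def find_all(self, text):
        """Return (start_index, word) for every sensitive word found in text."""
        hits = []
        for start in range(len(text)):
            node = self.root
            for end in range(start, len(text)):
                node = node.children.get(text[end])
                if node is None:
                    break  # no sensitive word continues along this path
                if node.is_word:
                    hits.append((start, text[start:end + 1]))
        return hits


tree = SensitiveWordTree(["bad", "badge", "ban"])
print(tree.node_count)
print(tree.find_all("a bad badge"))
```

Because "bad", "badge", and "ban" share the prefix "ba", the three words need only six character nodes below the root rather than eleven, which is the node sharing the quoted statement refers to.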


“…In recent years, the classification of Tibetan texts has received more and more attention. Jiang Tao [5] used the distributed representation of Tibetan words as a feature to significantly improve the performance of Tibetan text classification. Cao Hui [6] proposed an improved TF-IDF weighting algorithm.…”

Section: Related Work (mentioning)

confidence: 99%

Study of Tibetan Text Classification based on fastText

Ma, Yu, Ma 2019

Proceedings of the 3rd International Conference on Computer Engineering, Information Science & Application Technology (ICCI

Self Cite


Tibetan text classification is an important research topic in Tibetan information processing. In this paper, we attempt to apply the fastText text classification tool and fastText pre-trained word vectors to Tibetan text classification. In the experiment, for the Tibetan language corpus segmented by Tibetan syllable points, we represent all the words in each document with the fastText pre-trained word vectors, and then average all the word vectors in the document. The average vector (docvec) represents each document and is fed into an SVM classifier. The results show that the model outperforms the traditional Tibetan text classification method, improving the F-measure by 10%.
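The averaging step in this abstract can be sketched in a few lines. This is only an illustration of the idea, not the authors' code: the function names `split_syllables` and `average_vector`, the toy two-dimensional embedding table, and the choice to skip out-of-vocabulary tokens are all assumptions, as is the use of the tsheg character (U+0F0B) as the syllable delimiter. The SVM stage is omitted.

```python
TSHEG = "\u0f0b"  # Tibetan intersyllabic mark; assumed here to be the "syllable point"


def split_syllables(text, delimiter=TSHEG):
    """Split text on the syllable delimiter, dropping empty pieces."""
    return [piece for piece in text.split(delimiter) if piece]


def average_vector(tokens, embeddings, dim):
    """Average the vectors of all known tokens into one document vector.

    Tokens missing from `embeddings` are skipped (an assumption of this
    sketch); a document with no known tokens maps to the zero vector.
    """
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            continue
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return total
    return [component / count for component in total]


# Toy embedding table standing in for pre-trained word vectors.
toy_embeddings = {"a": [1.0, 3.0], "b": [3.0, 5.0]}
tokens = split_syllables("a" + TSHEG + "b" + TSHEG + "zz")
docvec = average_vector(tokens, toy_embeddings, dim=2)
print(docvec)
```

In a full pipeline, one such vector per document, paired with its class label, would then be passed to a classifier such as an SVM.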
