PSWG: An automatic stop-word list generator for Persian information retrieval systems based on similarity function & POS information

Yaghoub-Zadeh-Fard, Mohammad-Ali; Minaei-Bidgoli, Behrouz; Rahmani, Saeed; Shahrivari, Saeed

doi:10.1109/kbei.2015.7436031

Cited by 9 publications

(4 citation statements)

References 7 publications

“…Ref. [15] exploits Part-of-Speech information. These approaches cannot be directly compared to our proposal, in which we purposely start from plain text and avoid any kind of aid or pre-processing.…”

Section: Related Work (mentioning)

confidence: 99%

“…As witnessed by the most recent survey paper available on Stopword Removal, Reference [4], the literature after Reference [20] mainly focused on specific and peculiar languages, especially those using non-Latin script. A list of such works (often published in National conferences or journals) includes Arabic [24][25][26], Chinese [27,28], Persian [15], Sanskrit [29], Gujarati [30], Punjabi [31], Hindi [32][33][34], Bengali [35], Sinhala [36], and Tamil [37]. Here, we aim at devising an approach that can be applied to different languages; thus, we will not discuss these works in the following, nor can we compare our proposal to these works, which use very tailored approaches.…”

mentioning

confidence: 99%

Automatic Multilingual Stopwords Identification from Very Small Corpora

Ferilli

2021

Electronics

Tools for Natural Language Processing rely on linguistic resources that are language-specific. The complexity of building such resources causes many languages to lack them, so learning them automatically from sample texts would be a desirable solution. This usually requires huge training corpora, which are not available for many local languages and jargons that lack a wide literature. This paper focuses on stopwords, i.e., terms in a text which do not contribute to conveying its topic or content. It provides two main, inter-related and complementary methodological contributions: (i) it proposes a novel approach based on term and document frequency to rank candidate stopwords, which also works on very small corpora (even single documents); and (ii) it proposes an automatic cutoff strategy to select the best candidates in the ranking, thus addressing one of the most critical problems in stopword identification practice. Nice features of these approaches are that (i) they are generic and applicable to different languages, (ii) they are fully automatic, and (iii) they do not require any previous linguistic knowledge. Extensive experiments show that both are extremely effective and reliable. The former outperforms all comparable state-of-the-art approaches, both in terms of performance (Precision stays at 100% or nearly so for a large portion of the top-ranked candidate stopwords, while Recall is quite close to the theoretical maximum) and in smooth behavior (Precision is monotonically decreasing and Recall is monotonically increasing, allowing the experimenter to choose the preferred balance). The latter is more flexible than existing solutions in the literature, requiring just one parameter intuitively related to the desired balance between Precision and Recall.
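The abstract above describes ranking candidate stopwords by term and document frequency. As a rough illustration only, the sketch below scores each term by the product of its normalized term frequency and normalized document frequency. The scoring formula, the whitespace tokenizer, and the name `rank_stopword_candidates` are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def rank_stopword_candidates(documents):
    """Rank terms by (normalized term frequency) * (normalized document frequency).

    This score is an illustrative assumption, not the formula from the paper.
    """
    term_freq = Counter()  # total occurrences of each term in the corpus
    doc_freq = Counter()   # number of documents containing each term
    for doc in documents:
        tokens = doc.lower().split()  # naive whitespace tokenizer (assumption)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    total_tokens = sum(term_freq.values())
    n_docs = len(documents)
    scores = {
        term: (term_freq[term] / total_tokens) * (doc_freq[term] / n_docs)
        for term in term_freq
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
]
ranking = rank_stopword_candidates(docs)
print(ranking[:3])
```

On this toy corpus the term "the" ranks first because it is both frequent and present in every document. A real system would still need a cutoff strategy to decide how many top-ranked terms to keep, which the abstract treats as a separate contribution.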

“…In a more recent study, [17], linguistic and syntactic information are aggregated to build stop-word list in Persian information retrieval systems. In [17], part of speech (POS) tags are employed together with statistical measures such as entropy and the method is assessed by precision. The precision values reported are in range [0.25 0.3] for the whole set of different POS tags.…”

Section: Related Work (mentioning)

confidence: 99%

Stop Word Detection as a Binary Classification Problem

Metin

Karaoğlan

2017

Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering

In a wide group of languages, stop words, which have only grammatical roles and do not contribute to information content, may be exposed simply by their relatively high occurrence frequencies. In agglutinative or inflectional languages, however, a stop word may be observed in several different surface forms because of inflection, which produces noise. In this study, some well-known binary classification methods are employed to overcome the inflectional noise problem in stop word detection. The experiments are conducted on corpora of an agglutinative language, Turkish, in which the amount of inflection is high, and a non-agglutinative language, English, in which inflection is lower for stop words. The evaluations demonstrate that on the Turkish corpus, the classification methods improve stop word detection relative to the frequency-based method. On the English corpora, by contrast, the classification methods showed no improvement in stop word detection performance.
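The citation statement for this entry mentions entropy as one of the statistical measures used alongside POS tags. As a hedged sketch of what an entropy measure for stop-word detection can look like, the code below computes the Shannon entropy of each term's distribution across documents: a term spread evenly over many documents gets high entropy and is a stronger stop-word candidate. This is a generic textbook formulation, not the PSWG method, and the name `term_entropy` and the whitespace tokenizer are assumptions for this example.

```python
import math
from collections import Counter


def term_entropy(documents):
    """Shannon entropy (in bits) of each term's distribution over documents."""
    counts_per_doc = [Counter(doc.lower().split()) for doc in documents]
    totals = Counter()
    for counts in counts_per_doc:
        totals.update(counts)

    entropy = {}
    for term, total in totals.items():
        h = 0.0
        for counts in counts_per_doc:
            if counts[term]:
                p = counts[term] / total  # share of the term's occurrences in this document
                h -= p * math.log2(p)
        entropy[term] = h
    return entropy


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
]
entropy = term_entropy(docs)
# Terms sorted from most evenly distributed to most concentrated.
candidates = sorted(entropy, key=lambda t: (-entropy[t], t))
print(candidates[:3])
```

A term that occurs in only one document has entropy 0, while a term split evenly between two documents has entropy 1 bit. Precision, the evaluation measure named in the citation statement, would then be the fraction of top-ranked candidates that appear in a reference stop-word list.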

“…Stop-words list was automatically generated for Egyptian dialect using frequency method [30]. The aggregate method was used for generation of stop-words list for Persian language by combining statistical and similarity function approaches [31]. A deterministic finite automaton was used for generation of stopwords for Hindi text [32].…”

Section: Introduction (mentioning)

confidence: 99%

Automatic construction of generic stop words list for Hausa text

Bichi

Samsudin

Hassan

2022

IJEECS

Stop-words are words having the highest frequencies in a document without any significant information. They are characterized by having common relations within a cluster. They are the noise of the text, evenly distributed over a document. Removal of stop words improves the performance and accuracy of information retrieval algorithms and machine learning at large. It saves storage space by reducing the vector space dimension, and helps in effective document indexing. This research generated a list of Hausa stop words automatically using an aggregated method that combines frequency and statistical methods. The experiments are conducted using a primarily collected Hausa corpus consisting of 841 Hausa news articles totaling 646,862 words, and a final list of 81 distinct Hausa stop words is generated.
