Analyzing Documents with TF-IDF

Lavin, Matthew J.

doi:10.46430/phen0082

Cited by 15 publications

(13 citation statements)

References 0 publications

“…Unfortunately, no such qualitative gold standard existed for the mentioned languages. Therefore, the method of "TF-IDF" (term frequency-inverse document frequency) was used to establish important words [20][21][22][23] (1)-(3). Important words were used to measure word overlap between sentences.…”

Section: Fig 1 the Proposed Methods Workflow

Citation type: mentioning

confidence: 99%
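
The statement above describes scoring words with TF-IDF and then comparing sentences by the important words they share. A minimal sketch of that idea follows, using only the Python standard library. The tokenized toy sentences, the cutoff `k`, and the choice of Jaccard overlap are illustrative assumptions, not details taken from the cited paper.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: tf-idf score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


def important_words(score, k=2):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(score.items(), key=lambda item: (-item[1], item[0]))
    return {term for term, _ in ranked[:k]}


def overlap(words_a, words_b):
    """Jaccard overlap between two sets of important words (an assumed measure)."""
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Hypothetical, already-tokenized sentences.
sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
scores = tfidf_scores(sentences)
key_words = [important_words(s) for s in scores]
similarity = overlap(key_words[0], key_words[1])
```

A word such as "the" that occurs in every sentence receives a score of zero, so it never counts toward the overlap.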

Aligning and extending technologies of parallel corpora for the Kazakh language

Rakhimova

Karibayeva

2022

EEJET

The paper presents the two-stage alignment and extending methods of parallel corpora for the Kazakh language. The Kazakh language is agglutinative with rich morphology and related to the Turkic language group. So, the traditional alignment methods for similar languages do not work for the Kazakh language. The alignment is used primarily to ensure that the fragment corresponding to the original is found in the translation. After that, identical fragments of parallel texts are compared with each other. At the initial stage, the question is what needs to be leveled. It is possible to align word by word, but this often becomes almost impossible for several reasons: sets of lexemes and expressions do not match in different languages. Considering the linguistic peculiarities of languages, the developed technologies and ways of universal alignment of parallel text may not work in languages with agglutination. It means that the form of the word is formed by additional affixes and auxiliary words that carry semantic and morphological information. The approach presented in this paper is to use a two-stage alignment, which uses a bilingual dictionary of synonyms. The evaluation with the use of the English-Kazakh corpus verifies that our method shows an average of 89 % correct alignment. The second method is designed to expand the parallel corpus due to the lack of natural parallel corpora of the Kazakh-English language pair with good quality. The developed method uses a combinatorial method taking into account the semantic and grammatical features of the Kazakh language. Different tenses of the Kazakh language are used for sentence generation, and different endings for parts of speech are also considered.

“…To analyze this aspect at an individual account granularity, we first identify the most important words shared by known troll accounts, and then check whether an account detected as a troll by TROLLMAGNIFIER posted about any of these words. To do this, we calculate the TF-IDF (Term Frequency-Inverse Document Frequency) of the corpus of messages shared by known troll accounts [29]. We then select the top 10 keywords identified by this approach as a proxy for the important narratives shared by known trolls, and check if a detected account included each of those keywords in any of their submissions or comments.…”

Section: Validation

Citation type: mentioning

confidence: 99%

“…Topic Discussed. To identify relevant words discussed by the known trolls, we calculate the TF-IDF (Term Frequency-Inverse Document Frequency) of the corpus of submissions and comments that they posted [29]. The TF is calculated on the known troll account dataset and the IDF on the entire dataset of 53,763 accounts.…”

Section: Validation - Account-level Indicators

Citation type: mentioning

confidence: 99%
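
The two statements above describe computing term frequency on one subset of documents and inverse document frequency on the full dataset, then keeping the top-ranked keywords and checking whether another account used them. The sketch below illustrates that split under stated assumptions: the function names, the toy posts, and the smoothed IDF formula are hypothetical choices, not the cited paper's implementation.

```python
import math
from collections import Counter


def top_keywords(subset_docs, all_docs, k=10):
    """Rank terms by TF (from subset_docs) times IDF (from all_docs)."""
    tf = Counter(term for doc in subset_docs for term in doc)
    total_terms = sum(tf.values())
    df = Counter(term for doc in all_docs for term in set(doc))
    n_docs = len(all_docs)
    scores = {
        # Smoothed IDF (an assumption) so unseen terms never divide by zero.
        term: (count / total_terms) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


def keyword_hits(account_tokens, keywords):
    """Report, per keyword, whether the account's text contains it."""
    seen = set(account_tokens)
    return {keyword: keyword in seen for keyword in keywords}


# Hypothetical tokenized posts; the first two stand in for the known subset.
all_posts = [
    ["vote", "election", "news"],
    ["election", "fraud", "vote"],
    ["cat", "news"],
    ["dog", "news"],
]
known_posts = all_posts[:2]
keywords = top_keywords(known_posts, all_posts, k=2)
hits = keyword_hits(["the", "election", "was"], keywords)
```

Computing IDF on the larger dataset down-weights words that are common everywhere, so the ranking favors terms that are distinctive of the subset.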

TrollMagnifier: Detecting State-Sponsored Troll Accounts on Reddit

Saeed

Ali

Blackburn

et al. 2022

2022 IEEE Symposium on Security and Privacy (SP)

Growing evidence points to recurring influence campaigns on social media, often sponsored by state actors aiming to manipulate public opinion on sensitive political topics. Typically, campaigns are performed through instrumented accounts, known as troll accounts; despite their prominence, however, little work has been done to detect these accounts in the wild. In this paper, we present TROLLMAGNIFIER, a detection system for troll accounts. Our key observation, based on analysis of known Russian-sponsored troll accounts identified by Reddit, is that they show loose coordination, often interacting with each other to further specific narratives. Therefore, troll accounts controlled by the same actor often show similarities that can be leveraged for detection. TROLLMAGNIFIER learns the typical behavior of known troll accounts and identifies more that behave similarly. We train TROLLMAGNIFIER on a set of 335 known troll accounts and run it on a large dataset of Reddit accounts. Our system identifies 1,248 potential troll accounts; we then provide a multi-faceted analysis to corroborate the correctness of our classification. In particular, 66% of the detected accounts show signs of being instrumented by malicious actors (e.g., they were created on the same exact day as a known troll, they have since been suspended by Reddit, etc.). They also discuss similar topics as the known troll accounts and exhibit temporal synchronization in their activity. Overall, we show that using TROLLMAGNIFIER, one can grow the initial knowledge of potential trolls provided by Reddit by over 300%.

“…We adopt a simple query set that consists of binary queries probing the existence of words in the extended headline. The words are chosen from a pre-defined vocabulary obtained by stemming all words in the HuffPost dataset and choosing the top-1,000 according to their tf-idf scores [90]. We process the dataset to merge redundant categories (such as Style & Beauty and Beauty & Style), remove semantically ambiguous, HuffPost-specific categories (e.g.…”

Section: Word-based Queries

Citation type: mentioning

confidence: 99%
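
The statement above describes selecting a fixed vocabulary by tf-idf score and then asking binary queries about whether each vocabulary word occurs in a text. The following sketch shows one way this could look. It assumes the tokens are already stemmed, and it ranks words by their tf-idf summed over documents, which is an assumption since the statement does not say how per-document scores are aggregated. All names and data are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(docs, size):
    """Pick the `size` terms with the highest tf-idf summed over all documents."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    totals = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            totals[term] += (count / len(doc)) * math.log(n_docs / df[term])
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(totals, key=lambda term: (-totals[term], term))[:size]


def answer_queries(doc, vocabulary):
    """One binary answer per vocabulary word: 1 if the word occurs in doc."""
    present = set(doc)
    return [int(word in present) for word in vocabulary]


# Hypothetical stemmed headlines.
headlines = [
    ["stock", "market", "fall"],
    ["market", "rally", "stock"],
    ["film", "award", "night"],
]
vocabulary = build_vocabulary(headlines, size=3)
answers = answer_queries(headlines[2], vocabulary)
```

Each position in `answers` corresponds to one query of the form "does this word appear?", so the resulting vector stays directly readable.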

Interpretable by Design: Learning Predictors by Composing Interpretable Queries

Chattopadhyay,

Slocum,

Haeffele

et al. 2022

Preprint

There is a growing concern about typically opaque decision-making with high-performance machine learning algorithms. Providing an explanation of the reasoning process in domain-specific terms can be crucial for adoption in risk-sensitive domains such as healthcare. We argue that machine learning algorithms should be interpretable by design and that the language in which these interpretations are expressed should be domain- and task-dependent. Consequently, we base our model's prediction on a family of user-defined and task-specific binary functions of the data, each having a clear interpretation to the end-user. We then minimize the expected number of queries needed for accurate prediction on any given input. As the solution is generally intractable, following prior work, we choose the queries sequentially based on information gain. However, in contrast to previous work, we need not assume the queries are conditionally independent. Instead, we leverage a stochastic generative model (VAE) and an MCMC algorithm (Unadjusted Langevin) to select the most informative query about the input based on previous query-answers. This enables the online determination of a query chain of whatever depth is required to resolve prediction ambiguities. Finally, experiments on vision and NLP tasks demonstrate the efficacy of our approach and its superiority over post-hoc explanations.
