2019
DOI: 10.46430/phen0082

Analyzing Documents with TF-IDF

Abstract: This lesson focuses on a foundational natural language processing and information retrieval method called Term Frequency - Inverse Document Frequency (tf-idf). This lesson explores the foundations of tf-idf, and will also introduce you to some of the questions and concepts of computationally oriented text analysis.
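As a rough illustration of the method the abstract names, here is a minimal sketch of a tf-idf computation in plain Python. It assumes raw term counts for tf and the weighting idf = ln(N / df) + 1, which is only one of several common variants; libraries and the lesson itself may use a different formula. The function names `tf_idf` and `top_terms` and the whitespace tokenizer are hypothetical choices made for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumption: tf is the raw count of a term in a document and
    idf = ln(N / df) + 1, where N is the number of documents and
    df is the number of documents containing the term.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append(
            {term: count * (math.log(n_docs / df[term]) + 1)
             for term, count in tf.items()}
        )
    return scores


def top_terms(score, k=3):
    """Return the k highest-scoring terms of one document."""
    return sorted(score, key=score.get, reverse=True)[:k]


docs = ["the cat sat", "the dog sat", "the cat ran fast"]
scores = tf_idf(docs)
```

With these three toy documents, a word that appears in every document ("the") gets the lowest possible weight, while a word unique to one document ("dog") scores highest there. Calling `top_terms(scores[1], 1)` picks out that distinctive term.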

Cited by 15 publications
(13 citation statements)
References 0 publications
“…Unfortunately, no such qualitative gold standard existed for the mentioned languages. Therefore, the method of "TF-IDF" (term frequency-inverse document frequency) was used to establish important words [20][21][22][23] (1)-(3). Important words were used to measure word overlap between sentences.…”
Section: Fig 1 the Proposed Methods Workflow (mentioning)
confidence: 99%
“…To analyze this aspect at an individual account granularity, we first identify the most important words shared by known troll accounts, and then check whether an account detected as a troll by TROLLMAGNIFIER posted about any of these words. To do this, we calculate the TF-IDF (Term Frequency-Inverse Document Frequency) of the corpus of messages shared by known troll accounts [29]. We then select the top 10 keywords identified by this approach as a proxy for the important narratives shared by known trolls, and check if a detected account included each of those keywords in any of their submissions or comments.…”
Section: Validation (mentioning)
confidence: 99%
“…Topic Discussed. To identify relevant words discussed by the known trolls, we calculate the TF-IDF (Term Frequency-Inverse Document Frequency) of the corpus of submissions and comments that they posted [29]. The TF is calculated on the known troll account dataset and the IDF on the entire dataset of 53,763 accounts.…”
Section: Validation - Account-level Indicators (mentioning)
confidence: 99%
“…We adopt a simple query set that consists of binary queries probing the existence of words in the extended headline. The words are chosen from a pre-defined vocabulary obtained by stemming all words in the HuffPost dataset and choosing the top-1,000 according to their tf-idf scores [90]. We process the dataset to merge redundant categories (such as Style & Beauty and Beauty & Style), remove semantically ambiguous, HuffPost-specific categories (e.g.…”
Section: Word-based Queries (mentioning)
confidence: 99%