2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA)
DOI: 10.1109/icmla.2019.00120
Text Similarity in Vector Space Models: A Comparative Study

Cited by 54 publications (25 citation statements). References 7 publications.
“…Each vector is determined via a term‐frequency inverse‐document‐frequency (TFIDF) approach that up‐weights rare words and down‐weights common words. Although the TFIDF approach is relatively simple, a benchmarking study on patent data shows that TFIDF performs well in situations of long, extended, and highly granular text (Shahmirzadi, Lugowski, and Younge, 2018). The similarity measure is then computed for each pair of patents by determining the angular distance between them using a cosine measure between the patents' two vectors.…”
Section: Results From Testing Assumptions (mentioning)
confidence: 99%
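
A minimal sketch of the pipeline this excerpt describes, assuming scikit-learn as the implementation (the excerpt names no library) and using made-up placeholder texts in place of real patents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder "patent" texts; the cited work vectorizes full patent documents.
patents = [
    "a rotary engine with improved fuel injection timing",
    "fuel injection control for a rotary combustion engine",
]

# TF-IDF up-weights rare terms (high idf) and down-weights common ones.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(patents)  # sparse (n_docs, n_terms) matrix

# Cosine similarity between the two document vectors; 1.0 means a zero angle.
sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"cosine similarity: {sim:.3f}")
```

The cosine here is the similarity form of the angular distance the excerpt mentions: a larger cosine corresponds to a smaller angle between the two TF-IDF vectors.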
“…A disadvantage of TF-IDF and other bag-of-words methods is that they do not take the ordering of words into account, thereby ignoring syntax. However, in practice, TF-IDF is often found to be a strong baseline [44].…”
Section: Feature Extraction (mentioning)
confidence: 99%
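
To make the word-order limitation concrete, a tiny illustration (hypothetical sentences; scikit-learn is an assumed tool, not one named in the excerpt): two sentences containing the same words in a different order receive identical bag-of-words TF-IDF vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Same words, different order, different meaning: identical TF-IDF vectors,
# because bag-of-words counts tokens and discards their positions.
docs = ["the dog bit the man", "the man bit the dog"]
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

print(np.allclose(tfidf[0], tfidf[1]))  # True
```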
“…The number of training iterations was 10. These parameters were taken from Shahmirzadi et al. (2018). The learning rate was set to 0.025 and reduced by 0.002 in every epoch.…”
Section: Word Embedding Training (mentioning)
confidence: 99%
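
A hedged sketch of such a training run, assuming gensim's Word2Vec (the excerpt does not name an implementation): an initial learning rate of 0.025 reduced by 0.002 in each of 10 epochs ends at 0.005, which gensim expresses through its alpha and min_alpha parameters with linear decay over training.

```python
from gensim.models import Word2Vec

# Toy placeholder corpus; the citing work trains on a real text collection.
sentences = [
    ["patent", "text", "similarity", "in", "vector", "space"],
    ["tfidf", "is", "a", "strong", "baseline", "for", "text", "similarity"],
]

model = Word2Vec(
    sentences,
    epochs=10,        # 10 training iterations, as quoted above
    alpha=0.025,      # initial learning rate of 0.025
    min_alpha=0.005,  # 0.025 - 10 * 0.002; gensim decays alpha linearly
    min_count=1,      # keep every token in this toy corpus
)

print(model.wv.most_similar("similarity", topn=2))
```

Note that gensim decays the learning rate linearly over all training batches rather than in discrete per-epoch steps, so this reproduces the quoted schedule only approximately.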