2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA)
DOI: 10.1109/icmla.2019.00120
Text Similarity in Vector Space Models: A Comparative Study

Cited by 54 publications (25 citation statements). References 7 publications.
“…Each vector is determined via a term‐frequency inverse‐document‐frequency (TFIDF) approach that up‐weights rare words and down‐weights common words. Although the TFIDF approach is relatively simple, a benchmarking study on patent data shows that TFIDF performs well in situations of long, extended, and highly granular text (Shahmirzadi, Lugowski, and Younge, 2018). The similarity measure is then computed for each pair of patents by determining the angular distance between them using a cosine measure between the patents' two vectors.…”
Section: Results From Testing Assumptions (mentioning)
confidence: 99%
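
A minimal sketch of the pipeline this excerpt describes, assuming scikit-learn as the implementation (the excerpt names no library) and using made-up placeholder texts in place of real patents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder "patent" texts; the cited work vectorizes full patent documents.
patents = [
    "a rotary engine with improved fuel injection timing",
    "fuel injection control for a rotary combustion engine",
]

# TF-IDF up-weights rare terms (high idf) and down-weights common ones.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(patents)  # sparse (n_docs, n_terms) matrix

# Cosine similarity between the two document vectors; 1.0 means a zero angle.
sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"cosine similarity: {sim:.3f}")
```

The cosine here is the similarity form of the angular distance the excerpt mentions: a larger cosine corresponds to a smaller angle between the two TF-IDF vectors.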
“…A disadvantage of TF-IDF and other bag-of-words methods is that they do not take the ordering of words into account, thereby ignoring syntax. However, in practice, TF-IDF is often found to be a strong baseline [44].…”
Section: Feature Extraction (mentioning)
confidence: 99%
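
To make the word-order limitation concrete, a tiny illustration (hypothetical sentences; scikit-learn is an assumed tool, not one named in the excerpt): two sentences containing the same words in a different order receive identical bag-of-words TF-IDF vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Same words, different order, different meaning: identical TF-IDF vectors,
# because bag-of-words counts tokens and discards their positions.
docs = ["the dog bit the man", "the man bit the dog"]
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

print(np.allclose(tfidf[0], tfidf[1]))  # True
```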
“…The number of training iterations was 10. These parameters were taken from Shahmirzadi et al. (2018). The learning rate was set to 0.025 and reduced by 0.002 in every epoch.…”
Section: Word Embedding Training (mentioning)
confidence: 99%
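
A hedged sketch of such a training run, assuming gensim's Word2Vec (the excerpt does not name an implementation): an initial learning rate of 0.025 reduced by 0.002 in each of 10 epochs ends at 0.005, which gensim expresses through its alpha and min_alpha parameters with linear decay over training.

```python
from gensim.models import Word2Vec

# Toy placeholder corpus; the citing work trains on a real text collection.
sentences = [
    ["patent", "text", "similarity", "in", "vector", "space"],
    ["tfidf", "is", "a", "strong", "baseline", "for", "text", "similarity"],
]

model = Word2Vec(
    sentences,
    epochs=10,        # 10 training iterations, as quoted above
    alpha=0.025,      # initial learning rate of 0.025
    min_alpha=0.005,  # 0.025 - 10 * 0.002; gensim decays alpha linearly
    min_count=1,      # keep every token in this toy corpus
)

print(model.wv.most_similar("similarity", topn=2))
```

Note that gensim decays the learning rate linearly over all training batches rather than in discrete per-epoch steps, so this reproduces the quoted schedule only approximately.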