2020
DOI: 10.31341/jios.44.2.2

A Comparison of Approaches for Measuring the Semantic Similarity of Short Texts Based on Word Embeddings

Abstract: Measuring the semantic similarity of texts has a vital role in various tasks from the field of natural language processing. In this paper, we describe a set of experiments we carried out to evaluate and compare the performance of different approaches for measuring the semantic similarity of short texts. We perform a comparison of four models based on word embeddings: two variants of Word2Vec (one based on Word2Vec trained on a specific dataset and the second extending it with embeddings of word senses), FastTe…

Cited by 6 publications (3 citation statements)
References 30 publications
“…A drawback of word-embedding approaches, however, is that all information about word order is lost. Nonetheless, surprisingly competitive results are obtained in many applications by aggregating word vectors despite this limitation (Babić et al., 2019; Kenter & de Rijke, 2015; Sinoara et al., 2019). Today, bidirectional encoder representations from transformers (BERT; Devlin et al., 2018) enable a new generation of technologies (generally referred to as transformers) to directly provide a representation of a sentence considering words in their context.…”
Section: Text Representation: From Words to Vectors
confidence: 99%
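The word-order drawback noted above is easy to demonstrate. Below is a minimal sketch, assuming a toy three-dimensional vocabulary in place of real pretrained embeddings: averaging word vectors produces the same sentence representation for "man bites dog" and "dog bites man".

```python
import numpy as np

# Toy word vectors with made-up values; in practice these would come
# from a pretrained model such as Word2Vec or FastText.
vectors = {
    "man":   np.array([0.2, 0.8, 0.1]),
    "bites": np.array([0.9, 0.1, 0.4]),
    "dog":   np.array([0.3, 0.5, 0.7]),
}

def average_embedding(tokens, vectors):
    """Order-insensitive sentence vector: the mean of the word vectors."""
    return np.mean([vectors[t] for t in tokens], axis=0)

a = average_embedding("man bites dog".split(), vectors)
b = average_embedding("dog bites man".split(), vectors)
print(np.allclose(a, b))  # True: aggregation discards word order
```

Contextual models such as BERT sidestep this because each token's vector depends on its neighbours, so the two sentences receive different representations.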
“…Some DL models do not generate a vector representation of documents or larger text units, but only vector representations of words. For these models, it was necessary to build document vectors from word vectors, for example in the form of centroids like the ones used in (Babić et al., 2019; Babić, Guerra, et al., 2020). Since no existing method for combining word vectors into document vectors is generally accepted as the best one, several combination methods were tried and the one that produced the best results was chosen.…”
Section: Experiments Setup
confidence: 99%
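As a minimal sketch of the centroid construction mentioned above, assuming gensim's Word2Vec and a tiny illustrative corpus (the cited works train on task-specific datasets instead), each document vector is the mean of its word vectors and documents are compared by cosine similarity:

```python
import numpy as np
from gensim.models import Word2Vec

# Tiny illustrative corpus (hypothetical sentences).
corpus = [
    "measuring semantic similarity of short texts".split(),
    "word embeddings represent words as dense vectors".split(),
    "semantic similarity helps many nlp tasks".split(),
]

model = Word2Vec(corpus, vector_size=50, min_count=1, seed=1, workers=1)

def centroid(tokens, wv):
    """Document vector as the centroid (mean) of its word vectors."""
    return np.mean([wv[t] for t in tokens if t in wv], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

d1 = centroid(corpus[0], model.wv)
d2 = centroid(corpus[2], model.wv)
print(f"document similarity: {cosine(d1, d2):.3f}")
```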
“…The initial phase of neural language models, inspired by their work, featured shallow models. These models showcased the effectiveness of neural text representations through attributes such as lower-dimensional vector representations and the direct calculation of word similarity [10]. Moreover, leveraging embeddings as input led to enhanced performance across various NLP tasks [11], [12].…”
Section: Introduction
confidence: 98%
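The "direct calculation of word similarity" that embeddings make possible amounts to comparing vectors geometrically. A brief sketch, assuming the pretrained GloVe vectors from the gensim-data catalogue (downloaded on first use):

```python
import gensim.downloader as api

# 50-dimensional GloVe vectors; the model name assumes the standard
# gensim-data catalogue.
wv = api.load("glove-wiki-gigaword-50")

# Similarity falls out of the geometry directly: cosine of two vectors.
print(wv.similarity("car", "truck"))   # high: related words
print(wv.similarity("car", "banana"))  # low: unrelated words
print(wv.most_similar("similar", topn=3))
```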