Natural Language Understanding has seen an increasing number of publications in the last years, especially after robust word embedding models became popular. These models gained a special place in the spotlight when they proved themselves able to capture and represent semantic relations underneath huge amounts of data. Nevertheless, traditional models often fall short in intrinsic issues of linguistics, such as polysemy and homonymy. Multi-sense word embeddings were devised to alleviate these and other problems by representing each word-sense separately, but studies in this area are still in its infancy and much can be explored. We follow this scenario by proposing an unsupervised technique that disambiguates and annotates words by their specific sense, considering their context influence. These are later used to train a word embeddings model to produce a more accurate vector representation. We test our approach in 6 different benchmarks for the word similarity task, showing that our approach can sustain good results and often outperforms current state-of-the-art systems.
Research on academic integrity has identified online paraphrasing tools as a severe threat to the effectiveness of plagiarism detection systems. To enable the automated identification of machineparaphrased text, we make three contributions. First, we evaluate the effectiveness of six prominent word embedding models in combination with five classifiers for distinguishing human-written from machine-paraphrased text. The best performing classification approach achieves an accuracy of 99.0% for documents and 83.4% for paragraphs. Second, we show that the best approach outperforms human experts and established plagiarism detection systems for these classification tasks. Third, we provide a Web application that uses the best performing classification approach to indicate whether a text underwent machine-paraphrasing. The data and code of our study are openly available.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.