Automatic plagiarism detection tools have evolved considerably in recent years. Owing partly to technological developments that provide more powerful processing capacity, and partly to the research interest that plagiarism detection has attracted among computational linguists, results are nowadays more accurate and reliable. However, most freely and commercially available plagiarism detection systems are still based on similarity measures, whose algorithms search for similar or, at most, identical strings of text within a relatively short search distance. Although these methods tend to perform well in detecting literal, verbatim plagiarism, their performance drops when other strategies are used, such as word substitution or reordering. This paper presents the results of a forensic linguistic analysis of real plagiarism cases among higher education students. By comparing the suspect plagiarised strings against the most likely originals from a legal perspective, it demonstrates that strategies other than literal borrowing are increasingly used to plagiarise. A forensic linguistic explanation of the strategies used, and of why they represent instances of plagiarism, is then offered, and examples are provided to illustrate why existing software fails to detect them. The paper concludes by arguing that commonly used detection software packages can be effective in identifying matching text, but are not necessarily good plagiarism detection systems. More in-depth research and improvements in computational linguistics and natural language processing are required to increase the accuracy and reliability of machine detection procedures.
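The limitation the abstract describes can be illustrated with a minimal sketch (assumed for illustration, not any specific tool's algorithm): a word n-gram overlap measure of the kind similarity-based detectors rely on scores verbatim reuse highly but collapses to zero under systematic word substitution.

```python
# Minimal sketch of string-similarity plagiarism detection via word trigram
# overlap. Function names and examples are illustrative assumptions, not
# taken from any particular detection system.

def ngrams(tokens, n=3):
    """Return the set of word n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(source, suspect, n=3):
    """Jaccard similarity of word trigram sets (1.0 = identical trigram sets)."""
    a = ngrams(source.lower().split(), n)
    b = ngrams(suspect.lower().split(), n)
    return len(a & b) / len(a | b) if a | b else 0.0

original = "the results clearly demonstrate that the method is effective"
verbatim = "the results clearly demonstrate that the method is effective"
paraphrase = "the findings plainly show that the approach works well"

print(overlap_score(original, verbatim))    # 1.0: verbatim copy is flagged
print(overlap_score(original, paraphrase))  # 0.0: substitution evades detection
```

The paraphrase preserves the borrowed content word for word in structure, yet shares no trigram with the original, which is precisely why such measures miss non-literal plagiarism.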
Abstract. In this paper we propose a set of stylistic markers for automatically attributing authorship to micro-blogging messages. The proposed markers include highly personal and idiosyncratic editing options, such as 'emoticons', interjections, punctuation, abbreviations and other low-level features. We evaluate the ability of these features to help discriminate the authorship of Twitter messages among three authors. For that purpose, we train SVM classifiers to learn stylometric models for each author based on different combinations of the groups of stylistic features that we propose. Results show relatively good performance in attributing authorship of micro-blogging messages (F = 0.63) using this set of features, even when training the classifiers with as few as 60 examples from each author (F = 0.54). Additionally, we conclude that emoticons are the most discriminating features among these groups.
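A sketch of the kind of low-level feature extraction the abstract describes (the emoticon pattern, abbreviation list, and feature names are assumptions for illustration; the paper's actual feature inventory and SVM training setup are not reproduced here):

```python
import re

# Illustrative extraction of low-level stylistic features (emoticons,
# punctuation, abbreviations) from a micro-blogging message. The resulting
# dictionary could serve as a feature vector for an SVM classifier.

EMOTICONS = re.compile(r"[:;=8][\-o\*']?[\)\]\(\[dDpP/]")  # assumed pattern
ABBREVS = {"lol", "omg", "btw", "idk", "brb"}              # assumed list

def stylistic_features(msg):
    tokens = msg.lower().split()
    return {
        "emoticons": len(EMOTICONS.findall(msg)),
        "exclamations": msg.count("!"),
        "ellipses": msg.count("..."),
        "abbreviations": sum(t.strip(".,!?") in ABBREVS for t in tokens),
        "avg_token_len": sum(map(len, tokens)) / len(tokens) if tokens else 0.0,
    }

feats = stylistic_features("omg that was great :) lol!!!")
print(feats)
```

Each message becomes a fixed-length numeric vector, so any standard classifier (SVM in the paper's case) can be trained on per-author examples.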
New language technologies are coming, thanks to the huge and competing private investment fuelling rapid progress; we can either understand and foresee their effects, or be taken by surprise and spend our time trying to catch up. This report sketches out some transformative new technologies that are likely to fundamentally change our use of language. Some of these may feel unrealistically futuristic or far-fetched, but a central purpose of this report, and of the wider LITHME network, is to illustrate that these are mostly just the logical development and maturation of technologies currently in prototype. But will everyone benefit from all these shiny new gadgets? Throughout this report we emphasise a range of groups who will be disadvantaged, and issues of inequality. Important issues of security and privacy will accompany new language technologies. A further caution is to re-emphasise the current limitations of AI. Looking ahead, we see many intriguing opportunities and new capabilities, but also a range of uncertainties and inequalities. New devices will enable new ways to talk, to translate, to remember, and to learn. But advances in technology will reproduce existing inequalities: among those who cannot afford these devices, among the world's smaller languages, and especially for sign languages. Debates over privacy and security will flare and crackle with every new immersive gadget. We will move together into this curious new world with a mix of excitement and apprehension, reacting, debating, sharing and disagreeing as we always do. Plug in, as the human-machine era dawns.
Fake news has been the focus of debate, especially since the election of Donald Trump (2016), and remains a topic of concern in democratic countries worldwide, given (a) its threat to democratic systems and (b) the difficulty of detecting it. Despite the deployment of sophisticated computational systems to identify fake news, as well as the streamlining of fact-checking methods, appropriate fake news detection mechanisms have not yet been found. In fact, technological approaches are likely to be inefficient, given that fake news is based mostly on partisanship and identity politics, and not necessarily on outright deception. However, as disinformation is inherently expressed linguistically, it offers a privileged arena for forensic linguistic analysis. This article builds upon a forensic linguistic analysis of fake news pieces published in English and in Portuguese, collected since 2019 from acknowledged fake news outlets. The preliminary empirical analysis reveals that fake news pieces employ particular linguistic features, e.g. at the levels of typography, orthography and spelling, and morphosyntax. The systematic identification of these features, which will allow the mapping of linguistic resources and patterns used in those contexts, contributes to scholarship not only by enabling the streamlined development of computational detection systems, but, more importantly, by allowing the forensic linguistics expert to assist criminal investigations and give evidence in court.
This paper describes our submission to the SemEval 2019 Hyperpartisan News Detection task. Our system aims for linguistics-based document classification from a minimal set of interpretable features, while maintaining good performance. To this end, we follow a feature-based approach and perform several experiments with different machine learning classifiers. On the main task, our model achieved an accuracy of 71.7%, which was improved after the task's end to 72.9%. We also participated in the meta-learning sub-task, classifying documents using the binary predictions of all submitted systems as input, and achieving an accuracy of 89.9%.
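A minimal sketch of the "small set of interpretable features" approach the abstract describes (the feature inventory below is an assumption for illustration, not the team's actual feature set; the resulting vector would feed a standard classifier):

```python
import re

# Illustrative interpretable, linguistics-based document features for
# hyperpartisan news detection: intensifier and hedge rates, quotation
# marks, and average sentence length. Word lists are assumed examples.

INTENSIFIERS = {"very", "extremely", "totally", "absolutely"}
HEDGES = {"perhaps", "possibly", "arguably", "somewhat"}

def document_features(text):
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    tokens = text.lower().split()
    n = len(tokens) or 1
    words = [t.strip('.,!?"') for t in tokens]
    return {
        "intensifier_rate": sum(w in INTENSIFIERS for w in words) / n,
        "hedge_rate": sum(w in HEDGES for w in words) / n,
        "quote_marks": text.count('"'),
        "avg_sentence_len": n / (len(sentences) or 1),
    }

feats = document_features('This is totally outrageous! Perhaps. So they say "quote" often.')
print(feats)
```

Because each feature is a named, human-readable quantity, the classifier's decisions remain inspectable, which is the interpretability goal the submission emphasises.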