This paper presents two anonymisation methods to process an SMS corpus. The first one is based on an unsupervised approach called Seek&Hide. The implemented system uses several dictionaries and rules in order to predict if a SMS needs anonymisation process. The second method is based on a supervised approach using machine learning techniques. We evaluate the two approaches and we propose a way to use them together. Only when the two methods do not agree on their prediction, will the SMS be checked by a human expert. This greatly reduces the cost of anonymising the corpus.
Nous présentons notre approche fondée sur les données authentiques, en nous concentrant sur des recherches récentes, portant sur le recueil, le traitement et l'analyse d'un grand corpus de SMS en français, intitulé 88milSMS (http://88milsms.huma-num.fr/, Panckhurst, Détrie, Lopez, Moïse, Roche, Verine, 2014), incluant un questionnaire sociolinguistique soumis aux donateurs au moment de la collecte ainsi que leurs réponses. Puis nous expliquons pourquoi, dans une démarche pluridisciplinaire (située entre sciences du langage, informatique et traitement automatique du langage naturel), nous avons décidé de fournir à la communauté scienti que et au grand public le corpus de SMS.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.