Abstract. The negative consequences of cyberbullying are becoming more alarming every day, and technical solutions that allow appropriate action to be taken through automated detection are still very limited. Until now, studies on cyberbullying detection have focused on individual comments only, disregarding context such as users' characteristics and profile information. In this paper we show that taking user context into account improves the detection of cyberbullying.

Introduction

More and more teenagers in online communities are exposed to and harmed by cyberbullying. Studies¹ show that in Europe about 18% of children have been involved in cyberbullying, leading to severe depression and even suicide attempts. Cyberbullying is defined as an aggressive, intentional act carried out by a group or individual, using electronic forms of contact repeatedly or over time, against a victim who cannot easily defend him- or herself [1]. Besides social measures, technical solutions have to be found to deal with this social problem. At present, social network platforms rely on users alerting network moderators, who in turn may remove bullying comments. This alerting process can be improved by automatically detecting such comments, allowing a moderator to act faster. Studies on automatic cyberbullying detection are few, are typically limited to individual comments, and do not take context into account [2][3]. In this study we show that taking user context into account, such as a user's comment history and user characteristics [4], can considerably improve the performance of detection tools for cyberbullying incidents. We approach cyberbullying detection as a supervised classification task, for which we investigated three incremental feature sets.
In the next sections the experimental setup and results will be described, followed by a discussion of related work and conclusions.

¹ EU COST Action IS0801 on Cyberbullying (https://sites.google.com/site/costis0801/).

M. Dadvar et al.

Experiment

Corpus

YouTube is the world's largest user-generated content site, and its broad scope in terms of audience, videos, and users' comments makes it a platform that is susceptible to bullying and therefore an appropriate platform for collecting datasets for cyberbullying studies. As no cyberbullying dataset was publicly available, we collected a dataset of comments on YouTube videos. To cover a variety of topics, we collected the comments from the top 3 videos in the different categories found on YouTube. For each comment, the user ID and the date and time of posting were also stored. Only the users with public profiles (78%) were kept. The final dataset consists of 4626 comments from 3858 distinct users. The comments were manually labelled as bullying (9.7%) or non-bullying based on the definition of cyberbullying used in this study (inter-annotator agreement 93%). For each user we collected the comment history, consisting of up to 6 months of comments, on average 54 comments per user.

Feature Space Design

The following three feature sets were...
Motivation: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent in free-text data. Different methods of automating the assignment of MeSH concepts have been proposed to replace manual annotation, but they are either limited to a small subset of MeSH or have only been compared with a limited number of other systems.

Results: We compare the performance of six MeSH classification systems [MetaMap, EAGL, a language model-based approach, a vector space model-based approach, a K-Nearest Neighbor (KNN) approach and MTI] in terms of reproducing and complementing manual MeSH annotations. A KNN system clearly outperforms the other published approaches and scales well with large amounts of text using the full MeSH thesaurus. Our measurements demonstrate to what extent manual MeSH annotations can be reproduced and how they can be complemented by automatic annotations. We also show that a statistically significant improvement can be obtained in information retrieval (IR) when the text of a user's query is automatically annotated with MeSH concepts, compared to using the original textual query alone.

Conclusions: The annotation of biomedical texts using controlled vocabularies such as MeSH can be automated to improve text-only IR. Furthermore, the automatic MeSH annotation system we propose is highly scalable and it generates improvements in IR comparable with those observed for manual annotations.

Contact: trieschn@ewi.utwente.nl

Supplementary information: Supplementary data are available at Bioinformatics online.
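To make the KNN idea concrete, the following is a minimal sketch of nearest-neighbour concept assignment, assuming a simple bag-of-words cosine similarity and similarity-weighted voting. It is not the system evaluated in the paper: the function names (`tf_vector`, `cosine`, `knn_annotate`), the parameters `k` and `top_n`, and the three-document toy corpus are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (text, manually assigned concepts). Illustration only.
TOY_CORPUS = [
    ("insulin resistance in type 2 diabetes patients",
     ["Diabetes Mellitus, Type 2", "Insulin Resistance"]),
    ("insulin therapy for diabetes",
     ["Diabetes Mellitus", "Insulin"]),
    ("gene expression in breast cancer cells",
     ["Breast Neoplasms", "Gene Expression"]),
]


def tf_vector(text):
    """Lower-cased, whitespace-tokenised term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_annotate(text, labelled_docs, k=2, top_n=3):
    """Rank concepts by the summed similarity of the k nearest labelled documents."""
    query = tf_vector(text)
    scored = [(cosine(query, tf_vector(doc)), labels) for doc, labels in labelled_docs]
    neighbours = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

    votes = defaultdict(float)
    for similarity, labels in neighbours:
        for label in labels:
            votes[label] += similarity

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(votes.items(), key=lambda pair: (-pair[1], pair[0]))
    return [label for label, score in ranked[:top_n] if score > 0]


if __name__ == "__main__":
    print(knn_annotate("insulin and diabetes treatment", TOY_CORPUS))
```

A realistic system would presumably replace the linear scan with an index over the labelled collection, which is one way such an approach could scale to large amounts of text.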
Abstract. Cyberbullying is becoming a major concern in online environments, with troubling consequences. However, most technical studies have focused on detecting cyberbullying by identifying harassing comments rather than on preventing incidents by detecting the bullies. In this work we study the automatic detection of bully users on YouTube. We compare three types of automatic detection: an expert system, supervised machine learning models, and a hybrid type combining the two. All these systems assign a score indicating the level of "bulliness" of online bullies. We demonstrate that the expert system outperforms the machine learning models. The hybrid classifier shows an even better performance.

Introduction

With the growing use of the Internet as a social medium, a new form of bullying has emerged, called cyberbullying. Cyberbullying is defined as an aggressive, intentional act carried out by a group or individual, using electronic forms of contact repeatedly and over time against a victim who cannot easily defend him- or herself [1]. One of the most common forms is the posting of hateful comments about someone in social networks. Many social studies have been conducted to provide support and training for adults and teenagers [2,3]. The majority of the existing technical studies on cyberbullying have concentrated on the detection of bullying or harassing comments [4-6], while there is hardly any work on the more challenging task of detecting cyberbullies. There are a few exceptions, however, that point to an interesting direction: incorporating user information in detecting offensive content. Even so, more advanced user information or personal characteristics, such as writing style or possible network activities, have not been included in these studies [7,8].
Cyberbullying prevention based on user profiles was addressed for the first time in our latest study, in which an expert system was developed that assigns scores to social network users to indicate their level of 'bulliness' and their potential for future misbehaviour based on the history of their activities [9]. In that previous work we did not investigate machine learning models. In this study we focus again on the detection of bully users in online social networks, but now we look into the effectiveness of both expert systems and machine learning models for identifying potential bully users. We compare the performance of both systems on the task of assigning a score to social network users that indicates their level of bulliness. We demonstrate that the expert system outperforms the machine learner and that the two can be effectively combined in a hybrid classifier. The approach we propose can be used for building monitoring tools to stop potential bullies from causing further harm.

Data Collection and Feature Selection

In this section we will explain the characteristics of the corpus used in this study. We also describe the feature space and the three feature categories that have been used...
Abstract. Over the years, various meta-languages have been used to manually enrich documents with conceptual knowledge of some kind. Examples include keyword assignment to citations or, more recently, tags to websites. In this paper we propose generative concept models as an extension to query modeling within the language modeling framework, which leverages these conceptual annotations to improve retrieval. By means of relevance feedback the original query is translated into a conceptual representation, which is subsequently used to update the query model.

Extensive experimental work on five test collections in two domains shows that our approach gives significant improvements in terms of recall, initial precision and mean average precision with respect to a baseline without relevance feedback. On one test collection, it is also able to outperform a text-based pseudo-relevance feedback approach based on relevance models. On the other test collections it performs similarly to relevance models. Overall, conceptual language models have the added advantage of offering query and browsing suggestions in the form of conceptual annotations. In addition, the internal structure of the meta-language can be exploited to add related terms.

Our contributions are threefold. First, an extensive study is conducted on how to effectively translate a textual query into a conceptual representation. Second, we propose a method for updating a textual query model using the concepts in the conceptual representation. Finally, we provide an extensive analysis of when and how this conceptual feedback improves retrieval.
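As a rough illustration of the query-update step described above, one common way to fold a conceptual representation back into a textual query model is linear interpolation. The form below is a sketch under assumptions and not necessarily the estimator used in the paper; every symbol in it is introduced here for illustration only.

```latex
P(t \mid \hat{\theta}_Q) \;=\; \lambda \, P(t \mid \tilde{\theta}_Q) \;+\; (1 - \lambda) \sum_{c \in C} P(t \mid c) \, P(c \mid Q)
```

Here $\tilde{\theta}_Q$ stands for the initial textual query model, $C$ for the set of concepts in the meta-language, $P(c \mid Q)$ for the conceptual representation of the query obtained through relevance feedback, $P(t \mid c)$ for a term distribution associated with concept $c$, and $\lambda \in [0, 1]$ for a weight balancing the original query against the concept-derived terms. With $\lambda = 1$ the expression reduces to the original query model, which corresponds to a baseline without feedback.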