Over the past years, the amount of online offensive speech has been growing steadily. To successfully cope with it, machine learning is applied. However, ML-based techniques require sufficiently large annotated datasets. In the last years, different datasets were published, mainly for English. In this paper, we present a new dataset for Portuguese, which has not been in focus so far. The dataset is composed of 5,668 tweets. For its annotation, we defined two different schemes used by annotators with different levels of expertise. First, non-experts annotated the tweets with binary labels ('hate' vs. 'no-hate'). Then, expert annotators classified the tweets following a fine-grained hierarchical multiple label scheme with 81 hate speech categories in total. The inter-annotator agreement varied from category to category, which reflects the insight that some types of hate speech are more subtle than others and that their detection depends on personal perception. The hierarchical annotation scheme is the main contribution of the presented work, as it facilitates the identification of different types of hate speech and their intersections. To demonstrate the usefulness of our dataset, we carried a baseline classification experiment with pre-trained word embeddings and LSTM on the binary classified data, with a state-of-the-art outcome.
Research data management is rapidly becoming a regular concern for researchers, and institutions need to provide them with platforms to support data organization and preparation for publication. Some institutions have adopted institutional repositories as the basis for data deposit, whereas others are experimenting with richer environments for data description, in spite of the diversity of existing workflows. This paper is a synthetic overview of current platforms that can be used for data management purposes. Adopting a pragmatic view on data management, the paper focuses on solutions that can be adopted in the longtail of science, where investments in tools and manpower are modest. First, a broad set of data management platforms is presented-some designed for institutional repositories and digital libraries-to select a short list of the more promising ones for data management. These platforms are compared considering This paper is an extended version of a previously published comparative study. Please refer to the WCIST 2015 conference proceedings
Abstract. Research data management is acknowledged as an important concern for institutions and several platforms to support data deposits have emerged. In this paper we start by overviewing the current practices in the data management workflow and identifying the stakeholders in this process. We then compare four recently proposed data repository platforms-DSpace, CKAN, Zenodo and Figshare-considering their architecture, support for metadata, API completeness, as well as their search mechanisms and community acceptance. To evaluate these features, we take into consideration the identified stakeholders' requirements. In the end, we argue that, depending on local requirements, different data repositories can meet some of the stakeholders requirements. Nevertheless, there is still room for improvements, mainly regarding the compatibility with the description of data from different research domains, to further improve data reuse.
Most current research data management solutions rely on a fixed set of descriptors (e.g. Dublin Core Terms) for the description of the resources that they manage. These are easy to understand and use, but their semantics are limited to general concepts, leaving out domain-specific metadata. The textual values for descriptors are easily indexed through free-text indexes, but faceted search and dataset interlinking becomes limited. From the point of view of the relational database schema modeler, designing a more flexible metadata model represents a non-trivial challenge because it means representing entities with attributes unknown at the time of modeling and that can change in time. Those traits, combined with the presence of hierarchies among the entities, can make the relational schema quite complex. This work demonstrates the approaches followed by current opensource platforms and proposes a graph-based model for achieving modular, ontology-based metadata for interlinked data assets in the Semantic Web. The proposed model was implemented in a collaborative research data management platform currently under development at the University of Porto.
Archives have well-established description standards, namely the ISAD(G) and ISAAR(CPF) with a hierarchical structure adapted to the nature of archival assets. However, as archives connect to a growing diversity of data, they aim to make their representations more apt to the so-called linked data cloud. The corresponding move from hierarchical, ISAD-conforming descriptions to graph counterparts requires stateof-the-art technologies, data models and vocabularies. Our approach addresses this problem from two perspectives. The first concerns the data model and description vocabularies, as we adopt and build upon the CIDOC-CRM standard. The second is the choice of technologies to support a knowledge graph, including a graph database and an Object Graph Mapping library. The case study is the Portuguese National Archives, Torre do Tombo, and the overall goal is to build a CIDOC-CRM-compliant system for document description and retrieval, to be used by professionals and the public. The early stages described here include the design of the core data model for archival records represented as the ArchOnto ontology and its embodiment in the ArchGraph knowledge graph. The goal of a semantic archival information systems will be pursued in the migration of existing records to the richer representation and the development of applications supported on the graph.
Research data are the cornerstone of science and their current fast rate of production is disquieting researchers. Adequate research data management strongly depends on accurate metadata records that capture the production context of the datasets, thus enabling data interpretation and reuse. This chapter reports on the authors' experience in the development of the metadata models, formalized as ontologies, for several research domains, involving members from small research teams in the overall process. This process is instantiated with four case studies: vehicle simulation; hydrogen production; biological oceanography and social sciences. The authors also present a data description workflow that includes a research data management platform, named Dendro, where researchers can prepare their datasets for further deposit in external data repositories.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.