Code-switching is a prevalent phenomenon in multilingual communities and in social media interaction. Over the past ten years, we have witnessed an explosion of code-switched data on social media that brings together low-resourced and high-resourced languages in the same text, sometimes written in a non-native script. This increases the demand for processing code-switched data to assist users in various natural language processing tasks such as part-of-speech tagging, named entity recognition, sentiment analysis, conversational systems, and machine translation. The available corpora for code-switching research have played a major role in advancing this area of research. In this paper, we propose a set of quality metrics to evaluate these datasets and categorize them accordingly.
Removal of noise and removal of pectoral muscles are two important pre-processing steps in a CAD system for the diagnosis of breast cancer. This work combines the Robust Outlyingness Ratio (ROR) mechanism with an extended NL-Means (ROR-NLM) filter based on the Discrete Cosine Transform (DCT) for the detection and removal of noise. This method removes Gaussian and impulse noise very effectively without any loss of desired data. For segmenting and removing pectoral muscles, this paper uses global thresholding to identify the pectoral muscles, edge detection to identify the edge of the full breast, and connected component labelling to identify and remove the connected pixels outside the breast region. The results show that our approach removes Gaussian and impulse noise effectively without any loss of desired data and gives 90.06% accuracy overall.
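The thresholding and connected component labelling steps named above can be illustrated with a minimal sketch. This is not the authors' implementation: the threshold value, the 4-connectivity, the rule of keeping only the largest component as the assumed breast region, and the function names `threshold`, `label_components`, and `keep_largest_component` are all assumptions made for illustration, and the edge detection step is omitted.

```python
from collections import deque


def threshold(image, t):
    """Global thresholding: 1 where the pixel is >= t, else 0."""
    return [[1 if px >= t else 0 for px in row] for row in image]


def label_components(mask):
    """Label 4-connected foreground regions with a breadth-first search.

    Returns (labels, sizes): a label image (0 = background) and a dict
    mapping each label to its pixel count.
    """
    h = len(mask)
    w = len(mask[0]) if h else 0
    labels = [[0] * w for _ in range(h)]
    sizes = {}
    next_label = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not labels[y][x]:
                next_label += 1
                labels[y][x] = next_label
                queue = deque([(y, x)])
                count = 0
                while queue:
                    cy, cx = queue.popleft()
                    count += 1
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
                sizes[next_label] = count
    return labels, sizes


def keep_largest_component(image, t):
    """Zero out every pixel outside the largest thresholded component.

    Assumption: the largest connected region stands in for the breast;
    smaller regions are treated as artefacts to be removed.
    """
    labels, sizes = label_components(threshold(image, t))
    if not sizes:
        return [[0] * len(row) for row in image]
    largest = max(sizes, key=sizes.get)
    return [[px if lab == largest else 0 for px, lab in zip(row, lab_row)]
            for row, lab_row in zip(image, labels)]
```

For example, on a toy 3x4 image with one large bright region and one isolated bright pixel, `keep_largest_component(image, 5)` keeps the large region and sets the isolated pixel to 0.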
This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff’s alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.
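As an illustration of the agreement measure mentioned above, the following is a minimal sketch of Krippendorff's alpha for nominal labels, built from the standard coincidence-matrix definition. The abstract does not state which level of measurement or which tooling was used, so the nominal variant and the function name `krippendorff_alpha_nominal` are assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists; each inner list holds the labels that the
    annotators assigned to one item (missing ratings are simply left out).
    Items with fewer than two labels are not pairable and are skipped.
    """
    # Coincidence matrix: each ordered pair of labels from different
    # annotators within an item contributes 1 / (m - 1).
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected
```

With two annotators who agree on two items and disagree on a third, `krippendorff_alpha_nominal([["a", "a"], ["a", "b"], ["b", "b"]])` evaluates to 4/9, and full agreement gives 1.0.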
Sentiment analysis of Dravidian languages has received attention in recent years. However, most social media text is code-mixed, and there is no research available on the sentiment analysis of code-mixed Dravidian languages. Dravidian-CodeMix-FIRE 2020 (https://dravidian-codemix.github.io/2020/), a track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, focused on creating a platform for researchers to come together and investigate the problem. Two language tracks, Tamil and Malayalam, were created as a part of Dravidian-CodeMix-FIRE 2020. The goal of this shared task was to classify the sentiment of a given code-mixed YouTube comment into one of five classes: positive, negative, neutral, mixed-feeling, and comment not in the intended language. The performance of the systems developed by participants was evaluated in terms of weighted F1 score.
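The weighted F1 score used for evaluation can be sketched as the support-weighted mean of per-class F1 scores. The code below is an illustrative implementation of that common definition, not the organisers' scoring script; the function name `weighted_f1` and the example labels are assumptions.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support.

    Support is the number of gold examples of each class, so classes that
    appear only in the predictions contribute nothing to the average.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    support = Counter(y_true)        # gold examples per class
    predicted = Counter(y_pred)      # predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / n
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += (n / total) * f1
    return score
```

For gold labels `["positive", "positive", "negative", "neutral"]` and predictions `["positive", "negative", "negative", "neutral"]`, the per-class F1 scores are 2/3, 2/3, and 1, and the weighted F1 is 0.75. Weighting by support keeps frequent classes from being under-counted, which matters when the class distribution is imbalanced.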