This paper describes Kata.ai's submission for the Social Media Mining for Health (SMM4H) 2021 shared task. We participated in three tasks: classifying adverse drug effect, COVID-19 self-report, and COVID-19 symptoms. Our system is based on BERT model pre-trained on the domain-specific text. In addition, we perform data cleaning and augmentation, as well as hyperparameter optimization and model ensemble to further boost the BERT performance. We achieved the first rank in both classifying adverse drug effects and COVID-19 selfreport tasks.
Indonesian language is heavily riddled with colloquialism whether in written or spoken forms. In this paper, we identify a class of Indonesian colloquial words that have undergone morphological transformations from their standard forms, categorize their word formations, and propose a benchmark dataset of Indonesian Colloquial Lexicons (IndoCollex) consisting of informal words on Twitter expertly annotated with their standard forms and their word formation types/tags. We evaluate several models for character-level transduction to perform morphological word normalization on this testbed to understand their failure cases and provide baselines for future work. As IndoCollex catalogues word formation phenomena that are also present in the non-standard text of other languages, it can also provide an attractive testbed for methods tailored for cross-lingual word normalization and non-standard word formation.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.