Jaimeen Ahn scite author profile

Jaimeen Ahn

8 Publications

36 Citation Statements Received

200 Citation Statements Given

Affiliations

Korea Advanced Institute of Science and Technology

Publications

Mitigating Language-Dependent Ethnic Bias in BERT

Ahn¹, Oh², 2021

BERT and other large-scale language models (LMs) contain gender and racial bias. They also exhibit other dimensions of social bias, most of which have not been studied in depth, and some of which vary depending on the language. In this paper, we study ethnic bias and how it varies across languages by analyzing and mitigating ethnic bias in monolingual BERT for English, German, Spanish, Korean, Turkish, and Chinese. To observe and quantify ethnic bias, we develop a novel metric called Categorical Bias score. Then we propose two methods for mitigation; first using a multilingual model, and second using contextual word alignment of two monolingual models. We compare our proposed methods with monolingual BERT and show that these methods effectively alleviate the ethnic bias. Which of the two methods works better depends on the amount of NLP resources available for that language. We additionally experiment with Arabic and Greek to verify that our proposed methods work for a wider variety of languages.

KOLD: Korean Offensive Language Dataset

Jeong¹, Oh², Ahn³ et al., 2022

Preprint

Warning: this paper contains content that may be offensive or upsetting. Although large attention has been paid to the detection of hate speech, most work has been done in English, failing to make it applicable to other languages. To fill this gap, we present a Korean offensive language dataset (KOLD), 40k comments labeled with offensiveness, target, and targeted group information. We also collect two types of span, offensive and target span that justifies the decision of the categorization within the text. Comparing the distribution of targeted groups with the existing English dataset, we point out the necessity of a hate speech dataset fitted to the language that best reflects the culture. Trained with our dataset, we report the baseline performance of the models built on top of large pretrained language models. We also show that title information serves as context and is helpful to discern the target of hatred, especially when they are omitted in the comment.

KOLD: Korean Offensive Language Dataset

Jeong¹, Oh², Lee³ et al., 2022

Why Knowledge Distillation Amplifies Gender Bias and How to Mitigate from the Perspective of DistilBERT

Ahn¹, Lee², Kim³ et al., 2022

Knowledge distillation is widely used to transfer the language understanding of a large model to a smaller model. However, after knowledge distillation, it was found that the smaller model is more biased by gender compared to the source large model. This paper studies what causes gender bias to increase after the knowledge distillation process. Moreover, we suggest applying a variant of the mixup on knowledge distillation, which is used to increase generalizability during the distillation process, not for augmentation. By doing so, we can significantly reduce the gender bias amplification after knowledge distillation. We also conduct an experiment on the GLUE benchmark to demonstrate that even if the mixup is applied, it does not have a significant adverse effect on the model's performance.

Mitigating Language-Dependent Ethnic Bias in BERT

Ahn¹, Oh², 2021

Preprint

Suicidal Risk Detection for Military Personnel

Park¹, Park², Ahn³ et al., 2020

We analyze social media for detecting the suicidal risk of military personnel, which is especially crucial for countries with compulsory military service such as the Republic of Korea. From a widely-used Korean social Q&A site, we collect posts containing military-relevant content written by active-duty military personnel. We then annotate the posts with two groups of experts: military experts and mental health experts. Our dataset includes 2,791 posts with 13,955 corresponding expert annotations of suicidal risk levels, and this dataset is available to researchers who consent to research ethics agreement. Using various finetuned state-of-the-art language models, we predict the level of suicide risk, reaching .88 F1 score for classifying the risks.

Models and Benchmarks for Representation Learning of Partially Observed Subgraphs

Kim, Jin, Ahn et al., 2022

Models and Benchmarks for Representation Learning of Partially Observed Subgraphs

Kim¹, Jin², Ahn³ et al., 2022

Preprint

Subgraphs are rich substructures in graphs, and their nodes and edges can be partially observed in real-world tasks. Under partial observation, existing node- or subgraph-level message-passing produces suboptimal representations. In this paper, we formulate a novel task of learning representations of partially observed subgraphs. To solve this problem, we propose Partial Subgraph InfoMax (PSI) framework and generalize existing InfoMax models, including DGI, InfoGraph, MVGRL, and GraphCL, into our framework. These models maximize the mutual information between the partial subgraph's summary and various substructures from nodes to full subgraphs. In addition, we suggest a novel two-stage model with 𝑘-hop PSI, which reconstructs the representation of the full subgraph and improves its expressiveness from different local-global structures. Under training and evaluation protocols designed for this problem, we conduct experiments on three real-world datasets and demonstrate that PSI models outperform baselines.
