2021
DOI: 10.3390/fi13110275

Introducing Various Semantic Models for Amharic: Experimentation and Evaluation with Multiple Tasks and Datasets

Abstract: The availability of different pre-trained semantic models has enabled the quick development of machine learning components for downstream applications. However, even when texts are abundant for a low-resource language, very few semantic models are publicly available for it. Most of the publicly available pre-trained models are built as multilingual versions of semantic models that do not fit the needs of low-resource languages well. We introduce different semantic models for Amharic, a morphologically…

Cited by 11 publications (7 citation statements)
References 33 publications
“…The success of GPT and similar models has led to the development of conversational AI models by other companies and research organizations. For instance, Google’s Bidirectional Encoder Representations from Transformers (BERT) and Facebook’s RoBERTa models (a reimplementation of BERT with some modifications to the key hyperparameters and minor embedding tweaks) were trained on even larger text datasets and achieved state-of-the-art results in a range of NLP tasks [9, 10].…”
Section: Results
Citation type: mentioning (confidence: 99%)
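Both checkpoints named in this statement are published on the Hugging Face hub, so a minimal, hedged sketch of loading them for feature extraction may be useful; the model identifiers (bert-base-uncased, roberta-base) and the transformers calls shown are standard public usage, not details taken from the cited paper.

```python
# Minimal sketch: loading the pre-trained BERT / RoBERTa checkpoints
# mentioned above via Hugging Face transformers (assumed setup, not
# taken from the cited paper): pip install transformers torch
from transformers import AutoTokenizer, AutoModel

for checkpoint in ("bert-base-uncased", "roberta-base"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    # Encode a sample sentence and inspect the contextual embeddings.
    inputs = tokenizer("Pre-trained models enable quick NLP development.",
                       return_tensors="pt")
    outputs = model(**inputs)
    print(checkpoint, outputs.last_hidden_state.shape)
```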
“…Segmentation of sentences essentially involves the disambiguation of end-of-sentence punctuation. For Amharic, we have used the available Python-based Amharic sentence segmentation module (pip install amseg) (Yimam et al., 2021; Belay et al., 2021).…”
Section: Data Pre-processing
Citation type: mentioning (confidence: 99%)
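Since this statement points at a concrete tool (pip install amseg), a minimal usage sketch follows; the import path amseg.amharicSegmenter, the AmharicSegmenter class, and the tokenize_sentence method follow the package's published examples and should be treated as assumptions that may vary across versions.

```python
# Minimal sketch of Amharic sentence segmentation with amseg
# (pip install amseg). Import path and constructor signature follow
# the package's published examples; treat them as assumptions.
from amseg.amharicSegmenter import AmharicSegmenter

sent_punct = []   # empty list -> use the package's default sentence punctuation
word_punct = []   # empty list -> use the package's default word punctuation
segmenter = AmharicSegmenter(sent_punct, word_punct)

text = "አበበ በሶ በላ። ጫላ ጩቤ ጨበጠ፡፡"
sentences = segmenter.tokenize_sentence(text)  # splits on ። and ፡፡
print(sentences)
```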
“…Some research has been conducted to create pre-trained Amharic models, among them fastText [37,38], word2vec [26,38–40], and XLMR [41]. Some of these models were trained for cross-lingual purposes and are not usable for the needs of most NLP tasks.…”
Section: Amharic Language
Citation type: mentioning (confidence: 99%)
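For context on how such pre-trained Amharic models are typically consumed, here is a hedged sketch using the official fastText Python bindings to fetch the publicly released Amharic vectors; the cc.am.300.bin model name and the download_model helper come from fastText's public distribution and are not artifacts of the cited work.

```python
# Sketch: loading the publicly released Amharic fastText vectors
# (pip install fasttext). The cc.am.300.bin name and download helper
# follow fastText's published distribution; assumptions, not the
# cited paper's artifacts. Requires network access on first run.
import fasttext
import fasttext.util

fasttext.util.download_model("am", if_exists="ignore")  # fetches cc.am.300.bin
ft = fasttext.load_model("cc.am.300.bin")

vec = ft.get_word_vector("ሰላም")  # 300-d embedding for "peace / hello"
print(vec.shape)
print(ft.get_nearest_neighbors("ሰላም")[:3])  # (score, word) pairs
```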