Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014)
DOI: 10.3115/v1/d14-1077

Analyzing Stemming Approaches for Turkish Multi-Document Summarization

Abstract: In this study, we analyzed the effects of applying different levels of stemming, such as fixed-length word truncation and morphological analysis, for multi-document summarization (MDS) on Turkish, which is an agglutinative and morphologically rich language. We constructed a manually annotated MDS data set and, to the best of our knowledge, reported the first results on Turkish MDS. Our results show that a simple fixed-length word truncation approach performs slightly better than no stemming, whereas applying…
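The fixed-length word truncation baseline mentioned in the abstract is simple enough to sketch directly. The snippet below is a minimal illustration in Python; the prefix length of 5 characters and the whitespace tokenization are assumptions for demonstration, not the exact configuration evaluated in the paper.

```python
# Minimal sketch of fixed-length word truncation, one of the stemming
# levels compared in the paper. The prefix length (5) and whitespace
# tokenization are illustrative assumptions.

def truncate_stem(token: str, prefix_len: int = 5) -> str:
    """Keep only the first prefix_len characters of a token."""
    return token[:prefix_len]

def stem_sentence(sentence: str, prefix_len: int = 5) -> list:
    """Lowercase, split on whitespace, and truncate every token."""
    return [truncate_stem(tok, prefix_len) for tok in sentence.lower().split()]

if __name__ == "__main__":
    # Inflected forms of the same Turkish stem collapse to a common prefix.
    print(stem_sentence("evlerimizden evde evler"))
    # ['evler', 'evde', 'evler']
```

Because truncation needs no morphological resources, it is a cheap way to reduce the vocabulary of an agglutinative language before downstream processing, which is consistent with the abstract's finding that it already improves over no stemming.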

Cited by 9 publications (17 citation statements)
References 14 publications (12 reference statements)
“…Given a query as a natural language statement, EvidenceMiner retrieves textual evidence at the sentence level from the CORD-19 corpus for life sciences. More recently, Raza et al (2022) present an Information Retrieval System that uses latent information to select relevant works related to specific concepts. Otegi et al (2022) develop a Question Answering system that receives a set of questions asked by experts about the disease COVID-19 and SARS-CoV-2 virus, and provides a ranked list of expert-level answers to each question.…”
Section: Related Work
confidence: 99%
“…The first component, responsible for extracting the latent concepts learned by a model is based on work done by Dalvi et al (2022), called Latent Concept Analysis. At a high level, feature vectors (contextualized representations) are first generated by performing a forward pass on the model.…”
Section: Concept Discovery
confidence: 99%
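For readers unfamiliar with the pipeline this excerpt refers to, a rough sketch of the "forward pass, then cluster the contextualized representations" step might look like the following. The model (bert-base-uncased), the layer used (last hidden state), and k-means clustering are assumptions made purely for illustration; they are not the exact setup of Dalvi et al. (2022).

```python
# Rough sketch: generate contextualized token representations with a
# forward pass, then group them into latent "concepts" by clustering.
# Model, layer, and clustering algorithm are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The bank approved the loan.", "She sat on the river bank."]

token_vectors, token_labels = [], []
with torch.no_grad():
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]            # (seq_len, dim)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
        for tok, vec in zip(tokens, hidden):
            if tok not in ("[CLS]", "[SEP]"):
                token_vectors.append(vec.numpy())
                token_labels.append(tok)

# Cluster the token representations; each cluster is a candidate concept.
clusters = KMeans(n_clusters=4, n_init=10).fit_predict(token_vectors)
for tok, c in zip(token_labels, clusters):
    print(f"{tok}\tconcept-{c}")
```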
“…It is important to observe that this complexity constrains the implementation of state-of-the-art models and algorithms developed, for example, for English. In order to overcome data sparsity in Turkish, pre-processing tasks such as stemming or lemmatization (possibly followed by a feature selection step) should be introduced before NLP pipelines [44]. Both stemming and lemmatization aim to reduce inflectional or derivational forms of words to a common base form.…”
Section: Turkish Language Modelling Challenges Based On Its Morphological Complexity
confidence: 99%
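To make the data-sparsity point in this excerpt concrete, the toy example below compares vocabulary size before and after a crude stemming step (fixed-length truncation, as studied in the cited paper). The three-sentence corpus and the prefix length are invented for illustration only.

```python
# Illustration of the data-sparsity argument: stemming (here, fixed-length
# truncation) shrinks the vocabulary a downstream model has to learn.
# The corpus and prefix length are toy assumptions.
from collections import Counter

corpus = [
    "evde kitap okudum",
    "evlerimizden kitaplarımı aldım",
    "evler kitapları sever",
]

def truncate(tok: str, n: int = 5) -> str:
    return tok[:n]

raw_vocab = Counter(tok for sent in corpus for tok in sent.split())
stemmed_vocab = Counter(truncate(tok) for sent in corpus for tok in sent.split())

print("surface vocabulary size:", len(raw_vocab))      # every inflected form counts
print("stemmed vocabulary size:", len(stemmed_vocab))  # inflections collapse
```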
“…In this context, Turkish words take numerous inflectional and derivational suffixes, and it is possible to derive a single Turkish word that corresponds to an entire English sentence (Oflazer, 2014): yap+abil+ecek+se+niz -> if you will be able to do (it). One of the main problems of Turkish morphology arises when building a vector space model for machine learning classifiers. More specifically, Turkish words are generally composed of many morphemes, which may lead to data sparsity and thus decrease classifier performance (Nuzumlalı & Özgür, 2014). This problem is typically handled with stemming and lemmatization, whose goal is to obtain the base forms of words by reducing inflectional forms.…”
Section: Introduction
confidence: 99%
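The vector-space-model issue described in this excerpt can be shown with a small bag-of-words example: without stemming, each inflected form of ev ("house") becomes its own feature dimension, while with truncation they share one. The documents, prefix length, and use of scikit-learn's CountVectorizer are assumptions for demonstration only.

```python
# Sketch of how stemming changes the bag-of-words feature space:
# inflected forms of the same stem either get separate dimensions
# (no stemming) or collapse into one (fixed-length truncation).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["evde kaldım", "evlerimizden geldim", "evler güzeldi"]

def truncating_tokenizer(text: str, n: int = 5):
    return [tok[:n] for tok in text.split()]

plain = CountVectorizer(tokenizer=str.split)
stemmed = CountVectorizer(tokenizer=truncating_tokenizer)

plain.fit(docs)
stemmed.fit(docs)

print("features without stemming:", sorted(plain.get_feature_names_out()))
print("features with truncation: ", sorted(stemmed.get_feature_names_out()))
```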