In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.
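To make the idea of proclitic segmentation concrete, here is a minimal sketch (not the scheme or tools evaluated in the paper): a greedy splitter over a small, hand-picked proclitic inventory in Buckwalter transliteration, applying the ordering constraint conjunction, then preposition, then definite article. The inventory, the minimum-stem heuristic, and the deterministic strategy are all simplifying assumptions; as the abstract notes, real schemes rely on morphological analysis and disambiguation to resolve ambiguous splits in context.

```python
# Minimal, illustrative proclitic splitter for Arabic in Buckwalter
# transliteration. The inventory and the greedy strategy are assumptions
# for illustration, not the preprocessing scheme tested in the paper.

# Ordered proclitic classes: conjunctions, then prepositions, then article.
PROCLITIC_CLASSES = [("w", "f"), ("b", "k", "l"), ("Al",)]

def split_proclitics(token: str, min_stem: int = 3) -> list[str]:
    """Greedily peel proclitics off a token, keeping a minimal stem."""
    segments = []
    for cls in PROCLITIC_CLASSES:
        for pro in cls:
            if token.startswith(pro) and len(token) - len(pro) >= min_stem:
                segments.append(pro + "+")
                token = token[len(pro):]
                break  # at most one proclitic per class
    return segments + [token]

# "wAlktAb" ("and the book") -> ["w+", "Al+", "ktAb"]
print(split_proclitics("wAlktAb"))
```

A deterministic splitter like this over- or under-segments ambiguous tokens, which is precisely why the small-data setting described above benefits from full morphological disambiguation.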
Statistical machine translation is quite robust when it comes to the choice of input representation: it requires only consistency between training and testing. As a result, there is a wide range of possible preprocessing choices for data used in statistical machine translation, and even more so for morphologically rich languages such as Arabic. In this paper, we study the effect of different word-level preprocessing schemes for Arabic on the quality of phrase-based statistical machine translation. We also present and evaluate different methods for combining preprocessing schemes, which results in improved translation quality.
Modern Standard Arabic (MSA) is the formal language in most Arab countries, while Arabic Dialects (AD), the languages of daily communication, differ from MSA, especially in social media. Moreover, most Arabic social media texts mix forms and show many variations, particularly between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for AD classification using probabilistic models across social media datasets. We present a set of experiments using character n-gram Markov language models and Naive Bayes classifiers, with a detailed examination of which models perform best under different conditions in a social media context. Experimental results show that a Naive Bayes classifier based on a character bigram model can identify 18 different Arabic dialects with a considerable overall accuracy of 98%.
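The winning configuration in this abstract, Naive Bayes over character bigrams, can be prototyped in a few lines. The sketch below assumes scikit-learn, and the two-dialect training set and labels are invented placeholders; the paper's experiments cover 18 dialects on real social media text.

```python
# Toy sketch of dialect identification with a character-bigram Naive Bayes
# model. The training examples and labels below are invented placeholders,
# not the paper's dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["an Egyptian Arabic post", "a Gulf Arabic post"]  # placeholders
train_labels = ["EGY", "GLF"]

clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 2)),  # character bigrams
    MultinomialNB(),
)
clf.fit(train_texts, train_labels)
print(clf.predict(["an unseen social media post"]))
```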
This paper focuses on exploiting different models and methods for bilingual lexicon extraction, from either parallel or comparable corpora, in specialized domains. First, special attention is given to the use of multilingual thesauri, and different search strategies based on such thesauri are investigated. Then, a method for combining the different models for bilingual lexicon extraction is presented. Our results show that combining the models significantly improves results, and that the hierarchical information contained in our thesaurus, UMLS/MeSH, is of primary importance. Lastly, methods for bilingual terminology extraction and thesaurus enrichment are discussed.
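As a sketch of what combining extraction models can look like, the weighted score fusion below is a hypothetical illustration, not necessarily the combination method used in the paper; the model names, weights, and candidate scores are all invented.

```python
# Hypothetical weighted combination of bilingual-lexicon candidate scores.
# Each model maps a source term's target candidates to scores; the weights
# are illustrative and would be tuned on held-out data in practice.
def combine_models(candidates_by_model: dict[str, dict[str, float]],
                   weights: dict[str, float]) -> list[tuple[str, float]]:
    combined: dict[str, float] = {}
    for model, candidates in candidates_by_model.items():
        w = weights.get(model, 1.0)
        for target, score in candidates.items():
            combined[target] = combined.get(target, 0.0) + w * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Invented example: a context-vector model vs. a thesaurus-based model
ranked = combine_models(
    {"context_vectors": {"fievre": 0.7, "temperature": 0.5},
     "thesaurus":       {"fievre": 0.9}},
    weights={"context_vectors": 0.6, "thesaurus": 0.4},
)
print(ranked)  # approximately [('fievre', 0.78), ('temperature', 0.30)]
```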
We describe an approach to automatic source-language syntactic preprocessing in the context of Arabic-English phrase-based machine translation. Source-language labeled dependencies that are word-aligned with target-language words in a parallel corpus are used to automatically extract syntactic reordering rules, in the same spirit as Xia and McCord (2004) and Zhang et al. (2007). The extracted rules are used to reorder the source-language side of the training and test data. Our results show that with monotonic decoding and translations for unigram source-language phrases only, source-language reordering gives very significant gains over no reordering (a 25% relative increase in BLEU score). With decoder distortion turned on and with access to all phrase translations, the differences in BLEU scores diminish. However, an analysis of sentence-level BLEU scores shows that reordering outperforms no reordering in over 40% of the sentences. These results suggest that the approach holds great promise, but that much more work on Arabic parsing may be needed.
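To make the reordering step concrete, here is a minimal sketch of applying one rule to a dependency-parsed source sentence. The single hand-written VSO-to-SVO rule and the flat data structures are assumptions for illustration; the paper extracts its rules automatically from word-aligned labeled dependencies.

```python
# Minimal sketch of source-side syntactic reordering before translation.
# The hand-written rule below (move a verb's subject in front of the verb,
# turning Arabic VSO order into English-like SVO) stands in for the rules
# the paper learns automatically from aligned labeled dependencies.

def reorder_vso_to_svo(tokens, heads, labels):
    """tokens[i] depends on tokens[heads[i]] with dependency label labels[i]."""
    order = list(range(len(tokens)))
    for i, label in enumerate(labels):
        head = heads[i]
        if label == "subj" and 0 <= head < i:  # subject appears after its verb
            order.remove(i)
            order.insert(order.index(head), i)  # move subject before the verb
    return [tokens[i] for i in order]

# "wrote the-boy the-letter" (VSO) -> "the-boy wrote the-letter" (SVO)
tokens = ["ktb", "Alwld", "AlrsAlp"]   # Buckwalter transliteration
heads  = [-1, 0, 0]                    # verb is root; both nouns depend on it
labels = ["root", "subj", "obj"]
print(reorder_vso_to_svo(tokens, heads, labels))
```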