Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data, and their quality degrades as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment. Our multilingual embeddings are created from monolingual data only, without relying on any parallel data or dictionaries. We find that alignments created from embeddings are superior for four language pairs and comparable for two, compared to those produced by traditional statistical aligners, even with abundant parallel data; e.g., contextualized embeddings achieve a word alignment F1 for English-German that is 5 percentage points higher than eflomal, a high-quality statistical aligner, trained on 100k parallel sentences.
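To illustrate how alignments can be read off embeddings, the following is a minimal sketch, not necessarily the exact procedure of this work: it aligns one sentence pair by taking mutual argmaxes over the cosine-similarity matrix of its contextualized token embeddings. The function name and the mutual-argmax heuristic are assumptions chosen for illustration.

import numpy as np

def align_from_embeddings(src_vecs, tgt_vecs):
    """Hypothetical sketch: extract word alignment links from contextualized
    token embeddings of a sentence pair via mutual argmax over cosine similarity.
    src_vecs: (m, d) array of source token embeddings.
    tgt_vecs: (n, d) array of target token embeddings.
    Returns a set of (source_index, target_index) links."""
    # Normalize rows so that dot products equal cosine similarities.
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = src @ tgt.T  # (m, n) similarity matrix

    links = set()
    for i in range(sim.shape[0]):
        j = int(np.argmax(sim[i]))
        # Keep the link only if it is also the best choice in the reverse direction.
        if int(np.argmax(sim[:, j])) == i:
            links.add((i, j))
    return links

Mutual argmax is only one simple symmetrization heuristic; other extraction rules over the same similarity matrix are possible.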
With more than 7000 languages worldwide, multilingual natural language processing (NLP) is essential from both an academic and a commercial perspective. Researching typological properties of languages is fundamental for progress in multilingual NLP. Examples include assessing language similarity for effective transfer learning, injecting inductive biases into machine learning models, or creating resources such as dictionaries and inflection tables. We provide ParCourE, an online tool that allows users to browse a word-aligned parallel corpus covering 1334 languages. We give evidence that this is useful for typological research. ParCourE can be set up for any parallel corpus and can thus be used for typological research on other corpora as well as for exploring their quality and properties.
We introduce CaMEL (Case Marker Extraction without Labels), a novel and challenging task in computational morphology that is especially relevant for low-resource languages. We propose a first model for CaMEL that uses a massively multilingual corpus to extract case markers in 83 languages based only on a noun phrase chunker and an alignment system. To evaluate CaMEL, we automatically construct a silver standard from UniMorph. The case markers extracted by our model can be used to detect and visualise similarities and differences between the case systems of different languages as well as to annotate fine-grained deep cases in languages in which they are not overtly marked.
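As a rough illustration of one ingredient of such a pipeline, the sketch below counts frequent word-final character n-grams over tokens grouped by a noun phrase chunker as candidate case markers; it is a simplified assumption for illustration and omits the alignment component described in the abstract. The function name and thresholds are hypothetical.

from collections import Counter

def candidate_case_markers(noun_phrases, max_suffix_len=3, min_count=5):
    """Hypothetical sketch: rank word-final character n-grams of tokens inside
    chunked noun phrases as candidate case markers.
    noun_phrases: iterable of token lists, one list per chunked noun phrase.
    Returns (suffix, count) pairs sorted by frequency."""
    counts = Counter()
    for np_tokens in noun_phrases:
        for token in np_tokens:
            for k in range(1, max_suffix_len + 1):
                if len(token) > k:
                    counts[token[-k:]] += 1
    return [(suffix, c) for suffix, c in counts.most_common() if c >= min_count]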
Low-dimensional word vectors have long been used in a wide range of applications in natural language processing. In this paper we shed light on estimating query vectors in ad-hoc retrieval, where only limited information is available in the original query. Pseudo-relevance feedback (PRF) is a well-known technique for updating query language models and expanding queries with a number of relevant terms. We formulate query updating in low-dimensional spaces first by rotating the query vector and then by scaling it. These successive steps are embedded in a query-specific projection matrix capturing both angle and scale. Based on this query projection algorithm, we propose a new, though not necessarily the most effective, technique for PRF in language modeling. We learn an embedded coefficient matrix for each query, whose aim is to improve the vector representation of the query by transforming it into a more reliable space, and then update the query language model. The proposed embedded coefficient divergence minimization model (ECDMM) takes the top-ranked documents retrieved by the query and obtains positive and negative sample sets; these samples are used for learning the coefficient matrix, which is then used for projecting the query vector and updating the query language model via a softmax function. Experimental results on several TREC and CLEF data sets in several languages demonstrate the effectiveness of ECDMM. They reveal that the new formulation works as well as state-of-the-art PRF techniques overall and significantly outperforms them on a TREC collection in terms of MAP, P@5, and P@10.
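The projection-and-softmax step can be pictured with the following minimal sketch, assuming the learned coefficient matrix and the term embeddings are given; the function name, argument layout, and the temperature parameter are assumptions for illustration, not the paper's implementation.

import numpy as np

def update_query_lm(query_vec, proj_matrix, term_vecs, temperature=1.0):
    """Hypothetical sketch: project the query vector with a learned
    query-specific coefficient matrix (capturing rotation and scaling),
    then derive an updated query language model over candidate terms
    via a softmax of similarity scores.
    query_vec: (d,) query embedding.
    proj_matrix: (d, d) learned query-specific coefficient matrix.
    term_vecs: (V, d) embeddings of candidate expansion terms.
    Returns a probability distribution of shape (V,)."""
    projected = proj_matrix @ query_vec           # rotate and scale the query
    scores = term_vecs @ projected / temperature  # similarity of each term to the projected query
    scores -= scores.max()                        # for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

The resulting distribution can then be interpolated with the original query language model, as is standard in PRF.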
With the advent of end-to-end deep learning approaches in machine translation, interest in word alignments initially decreased; more recently, however, they have again become a focus of research. Alignments are useful for typological research, for transferring formatting such as markup to translated texts, and for decoding in machine translation systems. At the same time, massively multilingual processing is becoming an important NLP scenario, and pretrained language and machine translation models that are truly multilingual are being proposed. However, most alignment algorithms rely on bitexts only and do not leverage the fact that many parallel corpora are multiparallel. In this work, we exploit the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph and then predicting additional edges in the graph. We present two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction. Our experimental results show absolute improvements in F1 of up to 28% over the baseline bilingual word aligner on different datasets.
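As a minimal sketch of the link-prediction idea, assuming word occurrences across the multiparallel sentences are nodes and initial bilingual alignments are edges, the code below scores unconnected node pairs with the Adamic-Adar index, one standard link-prediction measure, and proposes pairs above a threshold as additional alignment edges. The exact scoring function and threshold used in the paper may differ; both are assumptions here.

import math
from itertools import combinations

def predict_alignment_edges(links, threshold=1.0):
    """Hypothetical sketch: link prediction on an alignment graph.
    links: iterable of (word_a, word_b) bilingual alignment edges, where each
    word node is identified e.g. by (language, sentence_id, token_index).
    Returns (node_u, node_v, score) triples for predicted additional edges."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    predicted = []
    # Quadratic in the number of nodes; fine for a per-sentence-group sketch.
    for u, v in combinations(adj, 2):
        if v in adj[u]:
            continue  # already aligned
        common = adj[u] & adj[v]
        score = sum(1.0 / math.log(len(adj[w])) for w in common if len(adj[w]) > 1)
        if score >= threshold:
            predicted.append((u, v, score))
    return predicted

In practice such scoring would be restricted to word pairs within the same multiparallel sentence group rather than run over the full graph.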