Aitor Ormazabal scite author profile

Aitor Ormazabal

4 Publications

58 Citation Statements Received

65 Citation Statements Given

Publications

Analyzing the Limitations of Cross-lingual Word Embedding Mappings

Ormazabal, Artetxe, Labaka et al. 2019

Recent research in cross-lingual word embeddings has almost exclusively focused on offline methods, which independently train word embeddings in different languages and map them to a shared space through linear transformations. While several authors have questioned the underlying isomorphism assumption, which states that word embeddings in different languages have approximately the same structure, it is not clear whether this is an inherent limitation of mapping approaches or a more general issue when learning cross-lingual embeddings. To answer this question, we experiment with parallel corpora, which allows us to compare offline mapping to an extension of skip-gram that jointly learns both embedding spaces. We observe that, under these ideal conditions, joint learning yields more isomorphic embeddings, is less sensitive to hubness, and obtains stronger results in bilingual lexicon induction. We thus conclude that current mapping methods do have strong limitations, calling for further research to jointly learn cross-lingual embeddings with a weaker cross-lingual signal.

Principled Paraphrase Generation with Parallel Corpora

Ormazabal, Artetxe, Soroa et al. 2022

Round-trip Machine Translation (MT) is a popular choice for paraphrase generation, which leverages readily available parallel corpora for supervision. In this paper, we formalize the implicit similarity function induced by this approach, and show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation. Based on these insights, we design an alternative similarity metric that mitigates this issue by requiring the entire translation distribution to match, and implement a relaxation of it through the Information Bottleneck method. Our approach incorporates an adversarial term into MT training in order to learn representations that encode as much information about the reference translation as possible, while keeping as little information about the input as possible. Paraphrases can be generated by decoding back to the source from this representation, without having to generate pivot translations. In addition to being more principled and efficient than round-trip MT, our approach offers an adjustable parameter to control the fidelity-diversity trade-off, and obtains better results in our experiments. S(x_p, x_s) / Z(x_s), where S is given by Equation 3, and Z is a normalizing factor that does not depend on x_p.

Beyond Offline Mapping: Learning Cross-lingual Word Embeddings through Context Anchoring

Ormazabal, Artetxe, Soroa et al. 2021

Recent research on cross-lingual word embeddings has been dominated by unsupervised mapping approaches that align monolingual embeddings. Such methods critically rely on those embeddings having a similar structure, but it was recently shown that the separate training in different languages causes departures from this assumption. In this paper, we propose an alternative approach that does not have this limitation, while requiring a weak seed dictionary (e.g., a list of identical words) as the only form of supervision. Rather than aligning two fixed embedding spaces, our method works by fixing the target language embeddings, and learning a new set of embeddings for the source language that are aligned with them. To that end, we use an extension of skip-gram that leverages translated context words as anchor points, and incorporates self-learning and iterative restarts to reduce the dependency on the initial dictionary. Our approach outperforms conventional mapping methods on bilingual lexicon induction, and obtains competitive results in the downstream XNLI task.

PoeLM: A Meter- and Rhyme-Controllable Language Model for Unsupervised Poetry Generation

Ormazabal, Artetxe, Agirrezabal et al. 2022
