Mayank Jobanputra scite author profile

Mayank Jobanputra

5 Publications

35 Citation Statements Received

92 Citation Statements Given


Affiliations

Indian Institute of Technology Madras, Ahmedabad University

Publications


Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

Ramesh¹,

Doddapaneni²,

Bheemaraj³

et al. 2022


We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora, and additionally mine 37.4 million sentence pairs from the Web, resulting in a 4× increase. We mine the parallel sentences from the Web by combining many corpora, tools, and methods: (a) Web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar, which outperform existing models and baselines on publicly available benchmarks, such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at Samanantar and we hope they will help advance research in NMT and multilingual NLP for Indic languages.
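The alignment step in (c) and (d) can be illustrated with a minimal sketch. It is not the authors' pipeline: the three-dimensional vectors below are made-up stand-ins for multilingual sentence embeddings, the sentences and the 0.9 threshold are hypothetical, and an exact brute-force search replaces the approximate nearest neighbor search named in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_pairs(src, tgt, threshold=0.9):
    """Pair each source sentence with its most similar target sentence.

    src and tgt map sentences to embedding vectors. A pair is kept only
    if its similarity reaches the threshold (a hypothetical value here).
    """
    pairs = []
    for s, s_vec in src.items():
        best_t, best_score = max(
            ((t, cosine(s_vec, t_vec)) for t, t_vec in tgt.items()),
            key=lambda item: item[1],
        )
        if best_score >= threshold:
            pairs.append((s, best_t, best_score))
    return pairs

# Toy embeddings (assumed values, not produced by any real model).
english = {
    "hello": [1.0, 0.0, 0.1],
    "thank you": [0.0, 1.0, 0.1],
    "unmatched": [0.0, 0.0, 1.0],
}
hindi = {
    "namaste": [0.9, 0.1, 0.1],
    "dhanyavaad": [0.1, 0.9, 0.1],
}

for s, t, score in mine_pairs(english, hindi):
    print(f"{s} <-> {t} ({score:.3f})")
```

At the scale described in the abstract, the brute-force loop would be replaced by an approximate index, since comparing every source sentence against every target sentence grows quadratically.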


Question Answering for Fact-Checking

Jobanputra¹

2019


Recent Deep Learning (DL) models have succeeded in achieving human-level accuracy on various natural language tasks such as question-answering, natural language inference (NLI), and textual entailment. Solving these tasks efficiently requires not only contextual knowledge but also reasoning abilities. In this paper, we propose an unsupervised question-answering based approach for a similar task, fact-checking. We transform the FEVER dataset into a Cloze task by masking named entities provided in the claims. To predict the answer token, we utilize pre-trained Bidirectional Encoder Representations from Transformers (BERT). The classifier computes the label based on the correctly answered questions and a threshold. Currently, the classifier is able to classify the claims as "SUPPORTS" and "MANUAL REVIEW". This approach achieves a label accuracy of 80.2% on the development set and 80.25% on the test set of the transformed dataset.
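A minimal sketch of the Cloze transformation and the threshold classifier might look as follows. The function names, the `[MASK]` token, the use of the fraction of correct answers, and the threshold values are all assumptions for illustration; the abstract does not specify the exact rule, and the predictions here are hard-coded rather than produced by BERT.

```python
def make_cloze_questions(claim, entities, mask_token="[MASK]"):
    """Build one (question, answer) pair per named entity in the claim.

    Each question masks the first occurrence of one entity.
    """
    return [
        (claim.replace(entity, mask_token, 1), entity)
        for entity in entities
        if entity in claim
    ]

def classify(questions, predictions, threshold=0.5):
    """Label a claim from the share of correctly answered questions.

    The ratio rule and the default threshold are assumptions.
    """
    if not questions:
        return "MANUAL REVIEW"
    correct = sum(
        1
        for (_, answer), predicted in zip(questions, predictions)
        if predicted.strip().lower() == answer.lower()
    )
    ratio = correct / len(questions)
    return "SUPPORTS" if ratio >= threshold else "MANUAL REVIEW"

claim = "Paris is the capital of France"
questions = make_cloze_questions(claim, ["Paris", "France"])
# Hypothetical model outputs: the first answer is right, the second is wrong.
predictions = ["Paris", "Germany"]

print(questions)
print(classify(questions, predictions, threshold=0.5))
print(classify(questions, predictions, threshold=0.75))
```

With one of two questions answered correctly, the claim passes a threshold of 0.5 but is sent to manual review at 0.75, which shows how the threshold trades coverage against caution.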


Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

Ramesh¹,

Doddapaneni²,

Bheemaraj³

et al. 2021

Preprint


We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 46.9 million sentence pairs between English and 11 Indic languages (from two language families). In particular, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora, and we additionally mine 34.6 million sentence pairs from the web, resulting in a 2.8× increase in publicly available sentence pairs. We mine the parallel sentences from the web by combining many corpora, tools, and methods. In particular, we use (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 language pairs. Further, we extracted 82.7 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar and compared with other baselines and previously reported results on publicly available benchmarks. Our models outperform existing models on these benchmarks, establishing the utility of Samanantar. Our data and models will be available publicly and we hope they will help advance research in Indic NMT and multilingual NLP for Indic languages.
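The pivoting step can be sketched as a join on the shared English side. This is an illustrative assumption about how such an extraction could work, matching English sentences by exact string equality; the function name and the toy sentence pairs are hypothetical. The count of 55 follows from choosing 2 of the 11 Indic languages.

```python
from collections import defaultdict
from math import comb

def pivot_pairs(en_to_x, en_to_y):
    """Join two English-centric corpora on identical English sentences.

    en_to_x and en_to_y are lists of (english, translation) pairs.
    Returns (x_sentence, y_sentence) pairs that share an English pivot.
    """
    by_english = defaultdict(list)
    for english, x_sentence in en_to_x:
        by_english[english].append(x_sentence)
    return [
        (x_sentence, y_sentence)
        for english, y_sentence in en_to_y
        for x_sentence in by_english.get(english, [])
    ]

# Toy romanized data (assumed, not taken from the corpus).
en_hi = [("hello", "namaste"), ("thank you", "dhanyavaad")]
en_ta = [("hello", "vanakkam"), ("good night", "iniya iravu")]

print(pivot_pairs(en_hi, en_ta))
print(comb(11, 2))  # number of unordered pairs among 11 languages
```

Only "hello" appears on the English side of both toy corpora, so it is the only sentence that yields a Hindi-Tamil pair.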


Mining Similar Methods for Test Adaptation

Sondhi¹,

Jobanputra²,

Rani³

et al. 2022

IEEE Trans. Software Eng.


OversampledML at SemEval-2022 Task 8: When multilingual news similarity met Zero-shot approaches

Jobanputra¹,

Rodríguez²

2022


We investigate the capabilities of pre-trained models without any fine-tuning for a document-level multilingual news similarity task of SemEval-2022. We utilize title and news content with appropriate pre-processing techniques. Our system derives 14 different similarity features using a combination of pre-trained MPNet with well-known statistical methods (i.e., TF-IDF, Word Mover's distance). We formulate the multilingual news similarity task as a regression task and approximate the overall similarity between two news articles using these features. Our best performing system achieved a correlation score of 70.1% and was ranked 20th among the 34 participating teams. In this paper, in addition to a system description, we also provide further analysis of our results and an ablation study highlighting the strengths and limitations of our features. We make our code publicly available at https://github.com/cicliscl/multinewssimilarity.
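One of the statistical features, a TF-IDF similarity, can be sketched as below. This covers a single illustrative feature, not the 14 features or the MPNet and Word Mover's distance components of the described system; the whitespace tokenization, the smoothed IDF formula, and the sample headlines are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical headlines used only to exercise the functions.
docs = [
    "floods hit the city",
    "city hit by floods",
    "team wins the final",
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # related headlines
print(cosine(vectors[0], vectors[2]))  # unrelated headlines
```

In a regression setup like the one described, a score such as this would be one input column among several, with the model learning how much weight each feature deserves.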

