Machine transliteration and transliterated text retrieval: a survey

Prabhakar, Dinesh Kumar; Pal, Sukomal

doi:10.1007/s12046-018-0828-8

Cited by 11 publications

(2 citation statements)

References 89 publications


“…By design, the silver data provides only a single translation for each English NE. However, multiple translations are often correct, due to the variability of morphology, transliteration, naming conventions and dialects (Prabhakar and Pal, 2018). For example, the English NE "Paul" can be aligned to "Pavel" and "Pavla".…”

Section: Silver Evaluation (mentioning)

confidence: 99%

Towards a Broad Coverage Named Entity Resource: A Data-Efficient Approach for Many Diverse Languages

Severini, Imani, Dufter et al. 2022

Preprint


Parallel corpora are ideal for extracting a multilingual named entity (MNE) resource, i.e., a dataset of names translated into multiple languages. Prior work on extracting MNE datasets from parallel corpora required resources such as large monolingual corpora or word aligners that are unavailable or perform poorly for underresourced languages. We present CLC-BN, a new method for creating an MNE resource, and apply it to the Parallel Bible Corpus, a corpus of more than 1000 languages. CLC-BN learns a neural transliteration model from parallel-corpus statistics, without requiring any other bilingual resources, word aligners, or seed data. Experimental results show that CLC-BN clearly outperforms prior work. We release an MNE resource for 1340 languages and demonstrate its effectiveness in two downstream tasks: knowledge graph augmentation and bilingual lexicon induction.


“…The content written on Facebook, more than 33% comments are written using phonetic text and more than 38% comments are written using code mixed phonetic text (bilingual) [4]. These text do not follow any standard spelling rules, but are based on the pronunciation of the words [5]. So, the development of phonetic dataset(s) is required for the text mining, opinion mining, information retrieval, feedback analysis, business intelligence, data analytics, etc.…”

Section: Introduction (mentioning)

confidence: 99%

GRT: Gurmukhi to Roman Transliteration System using Character Mapping and Handcrafted Rules

Singh, Sachan 2019

IJITEE


In the last two decades, the transliteration system has got significant research attention. It is observed that Punjabi to English transliteration for all type of part-of-speech words is comparably less studied. Currently, some research work in this area is carried out but only for proper nouns and some technical terms. So, there is need to focus on all type of words. The Gurmukhi to Roman transliteration (GRT) system is the first proposed system for transliteration of all kind of part-of-speech words. This system uses the handcrafted-rules and character mapping (CM) approach for transliteration between languages involved. The CM is done for Gurmukhi script with its equivalent to Roman script. It transliterates text written in Punjabi language into English language. It is tested on 65,130 Punjabi words and achieved accuracy of 99.27%, which is better than other state-of-art system results. The developed system can be used in social media text normalization, translation, sentiment analysis of multilingual text, text summarization of multilingual text, etc.


Sentiment Analysis of Multilingual Mixed-Code, Twitter Data Using Machine Learning Approach

Swamy, Kundale, Jadhav 2021

Advances in Intelligent Systems and Computing

