Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 2021
DOI: 10.18653/v1/2021.eacl-demos.3

Finite-state script normalization and processing utilities: The Nisaba Brahmic library

Abstract: This paper presents an open-source library for efficient low-level processing of ten major South Asian Brahmic scripts. The library provides a flexible and extensible framework for supporting crucial operations on Brahmic scripts, such as NFC, visual normalization, reversible transliteration, and validity checks, implemented in Python within a finite-state transducer formalism. We survey some common Brahmic script issues that may adversely affect the performance of downstream NLP tasks, and provide the rational…

Cited by 5 publications (5 citation statements)
References 23 publications
“…Firstly, normalize the information attributes of the power grid around each airport in the unit [7][8], which means that all attribute values are mapped between 0 and 1. The processing formula is as follows:…”
Section: Information Fusion Of the Power Grid Around The Airport
type: mentioning
confidence: 99%
“…All the native script data was normalized using Unicode NFC (Whistler, 2021). The data was then further transformed using language-specific visual normalization for Brahmic and Perso-Arabic writing systems using the Nisaba script normalization library (Johny et al., 2021; Gutkin et al., 2022). Both NFC and visual normalization operations preserve visual invariance of the input text, with visual normalization handling many ambiguous cases that fall outside the scope of standard NFC.…”
Section: D43 Data Preparation
type: mentioning
confidence: 99%
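To make the NFC step mentioned in the statement above concrete, here is a minimal sketch using only Python's standard `unicodedata` module. It is not the Nisaba API: the helper name `nfc` and the sample strings are assumptions chosen for illustration, and the language-specific visual normalization that Nisaba adds on top of NFC is not shown.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the Unicode NFC (canonical composition) form of text.

    Hypothetical helper for illustration; it wraps the standard library only.
    """
    return unicodedata.normalize("NFC", text)


# Devanagari NA (U+0928) + NUKTA (U+093C) is canonically equivalent to the
# single code point NNNA (U+0929); NFC composes the pair.
deva_decomposed = "\u0928\u093C"
deva_composed = "\u0929"

# Bengali vowel sign E (U+09C7) + vowel sign AA (U+09BE) is canonically
# equivalent to vowel sign O (U+09CB); NFC composes the pair.
beng_decomposed = "\u0995\u09C7\u09BE"  # KA + E + AA
beng_composed = "\u0995\u09CB"          # KA + O

# Devanagari QA (U+0958) is a composition exclusion, so NFC goes the other
# way and yields KA (U+0915) + NUKTA (U+093C).
deva_qa = "\u0958"
deva_qa_nfc = "\u0915\u093C"

# The raw strings differ code point by code point even though they render alike.
print(deva_decomposed == deva_composed)            # False
print(nfc(deva_decomposed) == nfc(deva_composed))  # True
```

After NFC, canonically equivalent inputs compare equal as code point sequences, which is the property the quoted statement relies on before applying further normalization.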
“…Our transliterator uses the open-source Nisaba library of finite-state script normalization and processing utilities (Johny et al., 2021). The script operations in Nisaba are efficiently and succinctly represented as weighted finite-state transducers (WFSTs) using Pynini finite-state grammars (Gorman, 2016; Gorman & Sproat, 2021).…”
Section: Transliteration For Meiteilon
type: mentioning
confidence: 99%
“…The four component WFSTs are compiled into the final transliteration WFST T = N • R • P • R⁻¹, where "•" denotes FST composition operation (Mohri, 2009). The first component transducer N implements visual normalization of the Bengali script input that consists of visually invariant normalization transformations including NFC (Johny et al., 2021). This is followed by the Meiteilon-specific Bengali to Latin script many-to-one mapping R that produces Latin script output in ISO 15919 format (ISO, 2001) augmented with some placeholder markers required for the next processing stage.…”
Section: Transliteration For Meiteilon
type: mentioning
confidence: 99%
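The composition chain in the statement above can be illustrated with a toy model that treats each transducer as a finite, unweighted relation over whole strings. This is only a sketch of what composition and inversion mean, not a WFST implementation and not Nisaba or Pynini code: the functions `compose`, `invert`, and `apply_relation` and all symbols in the sample relations are hypothetical placeholders.

```python
from typing import Set, Tuple

# A toy "transducer": a finite set of (input, output) string pairs.
Relation = Set[Tuple[str, str]]


def compose(a: Relation, b: Relation) -> Relation:
    """Relate x to z whenever a maps x to some y and b maps that y to z."""
    return {(x, z) for (x, y1) in a for (y2, z) in b if y1 == y2}


def invert(a: Relation) -> Relation:
    """Swap the input and output sides of every pair."""
    return {(y, x) for (x, y) in a}


def apply_relation(rel: Relation, s: str) -> Set[str]:
    """Collect every output the relation assigns to input s."""
    return {y for (x, y) in rel if x == s}


# Made-up data: N folds a variant form "A'" into "A"; R is many-to-one,
# sending both "A" and "B" to "p"; P is an identity placeholder on "p".
N: Relation = {("A'", "A"), ("A", "A"), ("B", "B")}
R: Relation = {("A", "p"), ("B", "p")}
P: Relation = {("p", "p")}

# T = N . R . P . R^-1, composed left to right.
T = compose(compose(compose(N, R), P), invert(R))

print(apply_relation(compose(N, R), "A'"))  # {'p'}
print(sorted(apply_relation(T, "A'")))      # ['A', 'B']
```

Because R is many-to-one, its inverse is one-to-many, so the composed relation can return several candidates for a single input. In a weighted setting, path weights would be one way to rank such candidates; this toy model carries no weights.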