2008
DOI: 10.3366/e1749503208000166

Construction and annotation of a corpus of contemporary Nepali

Abstract: In this paper, we describe the construction of the 14-million-word Nepali National Corpus (NNC). This corpus includes both spoken and written data, the latter incorporating a Nepali match for FLOB and a broader collection of text. Additional resources within the NNC include parallel data (English–Nepali and Nepali–English) and a speech corpus. The NNC is encoded as Unicode text and marked up in CES-compatible XML. The whole corpus is also annotated with part-of-speech tags. We describe the process of devising …

Cited by 22 publications
(6 citation statements)
References 12 publications
“…MATERIALS USED. One of the corpora used for this project is the Nepali National Corpus (NNC) (Yadava et al 2008), which consists of 14 million words from papers, books, and websites spanning a wide variety of topics. The original data consists of works between 1990 through 1992, referred to hereafter as NNC-O (for NNC-Original).…”
Section: Methods (mentioning)
confidence: 99%
“…We train two different types of embeddings, Monolingual and Multilingual. Monolingual embedding is trained on Nepali texts collected only from Nepali National Corpus (NNC) [28]. Multilingual embedding is trained on text collected from NNC + Nepali OSCAR [29] + English texts extracted from latest 169899 articles from Wikipedia dump.…”
Section: Methods (mentioning)
confidence: 99%
“…They also annotated their dataset for POS tagging but didn't perform manual annotations for it. They used the Nepali National Corpus (NCC) [40] to train a BiLSTM model and used the model to annotate the NepaliNER dataset with POS tags. Adding to the contribution in datasets, Sitaula et al [41] designed a dataset by crawling news documents from popular Nepali online news portals like Kantipur, Ratopati, and Nagarik to collect a total of 35,651 documents across 17 news categories.…”
Section: The Pressing Need Of Data In Nepali Language (mentioning)
confidence: 99%