2019
DOI: 10.1109/access.2019.2947898

Dynamic N-Gram System Based on an Online Croatian Spellchecking Service

Abstract: As an infrastructure able to accelerate the development of natural language processing applications, large-scale lexical n-gram databases are at present important data systems. However, deriving such systems for world minority languages as it was done in the Google n-gram project leads to many obstacles. This paper presents an innovative approach to large-scale n-gram system creation applied to the Croatian language. Instead of using the Web as the world's largest text repository, our process of n-gram collect…

citations
Cited by 14 publications
(12 citation statements)
references
References 18 publications
“…However, the n-gram collection used in our research greatly exceeds the corpora available for training machine learning models. Hascheck is an online spell checking service that has been collecting n-grams for more than two decades, and its processed corpora are estimated to total more than 7 billion tokens [4]. Therefore, we assumed that using an n-gram collection of such volume could provide better results than expected.…”
Section: Croatian Language Network In a Smart Environment (mentioning)
confidence: 99%
“…Currently, N-gram search service is based on 3-gram system, which contains word sequences comprised of three words. However, this can be changed in future research since Hascheck contains n-gram collections of different lengths, with 2 <= n <= 7 [4]. N-gram system used in scope of our work is represented by a directed graph.…”
Section: A3 N-gram Search Service (mentioning)
confidence: 99%
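
The statement above mentions word 3-grams, n-gram lengths with 2 <= n <= 7, and a directed-graph representation. The following is a minimal Python sketch of that idea, not the cited system's implementation. The quoted text does not say how the graph is built, so the sketch assumes that each word is a node and that each adjacent word pair inside a 3-gram adds one to a weighted edge. The function names `extract_ngrams` and `build_graph` and the sample sentence are hypothetical and used only for illustration.

```python
from collections import Counter, defaultdict


def extract_ngrams(tokens, n):
    """Return all contiguous word n-grams of length n as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_graph(trigrams):
    """Build a weighted directed graph from 3-grams.

    Assumed representation (not taken from the source): a node is a word,
    and an edge u -> v counts how often v directly follows u in a 3-gram.
    """
    graph = defaultdict(Counter)
    for w1, w2, w3 in trigrams:
        graph[w1][w2] += 1
        graph[w2][w3] += 1
    return graph


# Illustrative sentence only; real input would be a tokenized corpus.
tokens = "ovo je mali primjer teksta".split()

trigrams = extract_ngrams(tokens, 3)
graph = build_graph(trigrams)

# Number of n-grams found for each length in the range 2 <= n <= 7.
counts = {n: len(extract_ngrams(tokens, n)) for n in range(2, 8)}
```

With only five tokens, the lengths 6 and 7 yield no n-grams, which is why large processed corpora matter for the longer collections.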
“…Currently, the n-gram service is based on 3-gram system, which contains word sequences comprised of three words. However, Hascheck contains n-gram collections of different lengths, with 2 <= n <= 7 [34].…”
Section: N-gram Service (mentioning)
confidence: 99%
“…Morphological lexicon was constructed using a segment of the morphological database available in Hascheck [34], summing up to approximately 700 000 entries. The implemented service provides the morphological descriptor for the given word, thus determining its exact morphological form.…”
Section: Morphological Lexicon Service (mentioning)
confidence: 99%