2014
DOI: 10.1016/j.procs.2014.08.123

An Economic Approach to Big Data in a Minority Language

Abstract: Google's n-gram project recently brought the benefits of big data to several major world languages, such as English and Chinese. Any attempt to build such systems for the world's minority languages in the same manner, with the aim of accelerating the development of NLP applications, encounters many obstacles. This paper presents an innovative and economic approach to large-scale n-gram system creation, applied to the case of the Croatian language. Instead of using the Web as the world's biggest text repos…

Cited by 6 publications (10 citation statements)
References 13 publications
“…Second, results from psycholinguistic research (based on the frequency data of the terms used in the partition of the color spectrum) enabled comparison to the data collected via other empirically-based methods. For example, the EOSS data for Croatian were compared to the frequency data from the Croatian n-gram system (based on the Web as Corpus approach) consisting of 1.72 billion tokens (Dembitz et al, 2014). The 165 different Croatian color terms (types) from the EOSS project were checked in the Croatian n-gram system in order to provide evidence about their attestation in a large language resource.…”
Section: Color Terms (mentioning)
confidence: 99%
“…The update of the n-gram database is performed monthly. Further details about n-gram system creation and maintenance are given in [8].…”
Section: Croatian N-gram System Characteristics (mentioning)
confidence: 99%
“…In [8], we have presented a figure, copied here as Fig. 1, that implies the 4-grams are the richest n-grams in Croatian, and no n-grams, n > 4, can ever overcome them.…”
Section: Heaps' Law Applied To Croatian N-grams (mentioning)
confidence: 99%