2020
DOI: 10.3233/faia200614

Lessons Learned from Creating a Balanced Corpus from Online Data

Abstract: This paper describes lessons learned from developing the most recent Balanced Corpus of Modern Latvian (LVK2018) from various online sources. Most new corpora are created from data obtained from various text holders, which requires a cooperation agreement with each of them. Reaching these agreements is difficult and time-consuming, and may not be necessary if the resource to be created is not on the scale of hundreds of millions. Although there are many different resources available …

Citations: cited by 1 publication (1 citation statement)
References: 5 publications
“…Also, a large data set is necessary for ML algorithms to learn and it is not always possible to collect enough data for learning purposes, especially in languages with a relatively small number of users, such as Latvian (Jasmonts et al, 2022). Integration with other systems is possible, but this specific set of biology data may not be sufficient for neural network training (Darģis et al, 2020). In addition, not all information systems permit access via application programming interface to collect data dynamically and more effectively.…”
Section: Introduction (mentioning)
confidence: 99%