Proceedings of the 9th Web as Corpus Workshop (WaC-9) 2014
DOI: 10.3115/v1/w14-0405

{bs,hr,sr}WaC - Web Corpora of Bosnian, Croatian and Serbian

Abstract: In this paper we present the construction process of top-level-domain web corpora of Bosnian, Croatian and Serbian. For constructing the corpora we use the SpiderLing crawler with its associated tools adapted for simultaneous crawling and processing of text written in two scripts, Latin and Cyrillic. In addition to the modified collection process we focus on two sources of noise in the resulting corpora: 1. they contain documents written in the other, closely related languages that cannot be identified with s…

Cited by 79 publications (54 citation statements)
References 6 publications
“…To this end we concatenated the hrWaC corpus (Ljubešić and Klubička, 2014) with the target side of the aforementioned parallel corpora.…”
Section: MT Systems and Datasets
Citation type: mentioning
confidence: 99%
“…The database rests on empirically determined values for the categories of concreteness, imageability, subjective frequency, and age of acquisition for 3,000 lexemes (nouns, verbs, and adjectives) of Croatian from the hrWaC corpus (Ljubešić and Klubička 2014). Because random selection of lexemes from the corpus did not capture many of the most frequent Croatian lexemes, the database will be supplemented during 2018 with a further 1,500 of the most frequent lexemes from the Hrvatski čestotni rječnik (Moguš, Bratanić and Tadić 1999) for which ratings of the observed categories have not yet been collected, and a further 1,500 lexemes excerpted from the textbooks for Croatian language, nature studies, mathematics, history, and geography for grades 4, 5, and 6 of primary school that are most widely chosen by the teachers who use them in class.…”
Section: Introduction
Citation type: unclassified
“…In the Croatian National Corpus 3.0 (CNC 3.0) [27] which contains about 234,000,000 tokens, we found altogether 2,908,182 occurrences of the identified homograms and 2,488 occurrences of unique homograms, which make up 72% of all homograms that we obtained in the lexicon. The Croatian web corpus hrWaC [28] is a web corpus that contains 1.9 billion tokens and is annotated with the lemma, morphosyntax, and dependency syntax layers. There were 20,694,430 total occurrences of the homograms and 3,180 of unique homograms (92% of the homograms from the lexicon) identified in the hrWaC corpus.…”
Section: A Identifying and Disambiguating Homograms In Corpus
Citation type: mentioning
confidence: 99%