2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS)
DOI: 10.1109/snams.2018.8554689
LinguaKit: A Big Data-Based Multilingual Tool for Linguistic Analysis and Information Extraction

Cited by 30 publications (34 citation statements). References 17 publications.
“…Thus, each of them contains 50 million tokens from Wikipedia, 20 million from the Europarl corpus (Koehn, 2005), 10 million from OpenSubtitles (Lison and Tiedemann, 2016), and a set of 20 million tokens formed by news, web pages, and small corpora from the Universal Dependencies 2018 and PARSEME 1.1 shared tasks (Zeman et al., 2018; Ramisch et al., 2018). The texts were tokenized, PoS-tagged and lemmatized by LinguaKit (Gamallo et al., 2018), and parsed by UDPipe, a state-of-the-art dependency parser based on neural networks (Straka and Straková, 2017). We used the Universal Dependencies formalism, which yielded the best results in a similar comparison (Uhrig et al., 2018), training the models with version 2.3 of the UD treebanks.…”
Section: Data
confidence: 99%
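The excerpt above describes a pipeline that tokenizes, PoS-tags, and lemmatizes text with LinguaKit and then dependency-parses it with UDPipe; such pipelines conventionally emit CoNLL-U. As a minimal stdlib-only sketch of consuming that output, the reader below extracts (form, lemma, UPOS, head, deprel) tuples. The sample sentence is invented for illustration and is not taken from the corpora described in the excerpt.

```python
# Illustrative CoNLL-U sample (10 tab-separated columns per token line).
CONLLU_SAMPLE = (
    "# text = LinguaKit analyses texts.\n"
    "1\tLinguaKit\tLinguaKit\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tanalyses\tanalyse\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\ttexts\ttext\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

def read_conllu(text):
    """Yield (form, lemma, upos, head, deprel) tuples from CoNLL-U lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) < 8 or not cols[0].isdigit():
            continue  # skip multiword-token (1-2) and empty-node (1.1) lines
        yield cols[1], cols[2], cols[3], int(cols[6]), cols[7]

tokens = list(read_conllu(CONLLU_SAMPLE))
lemmas = [lemma for _, lemma, _, _, _ in tokens]
print(len(tokens), lemmas)  # → 4 ['LinguaKit', 'analyse', 'text', '.']
```

Lemma and deprel columns like these are exactly what the corpus-processing step described above would feed into downstream model training.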
“…As with other NLP tasks and algorithms, the development of methods and resources for Portuguese is increasing day by day. Some important examples are HAREM and Second HAREM [32], Linguakit [13], and SIEMÊS [33], algorithms and resources for unsupervised named entity recognition, together with well-known suites such as FreeLing [34] or Stanford CoreNLP [35] for Portuguese and related supervised initiatives based on conditional random fields [36]. It is also worth mentioning similar works focused only on semantic relation extraction [37].…”
Section: Unsupervised Information Extraction In Portuguese: Linguakit
confidence: 99%
“…The paper is organized as follows: The remainder of this Introduction section presents the historical context, analysis criteria, and motivation for this work, as well as a review of existing initiatives applying natural language processing in similar forensic contexts. Section 2 describes the materials and methods employed, including the particularities of the natural language suite Linguakit [13] for Portuguese, which was used as the basis for information extraction, as well as the forensic corpus analyzed. Section 3 presents the results obtained according to the expert criteria adopted: (1) common causes of death, (2) relevant body locations, (3) personal belongings terminology, and (4) correlations between actors.…”
Section: Introduction
confidence: 99%
“…In order to build bilingual compositional vectors, we made use of the English and Spanish Wikipedias (dump files of December 2018), with 21 and 5 billion words, respectively. The two Wikipedias were PoS-tagged and syntactically analyzed with LinguaKit (Gamallo et al., 2018). The syntactically analyzed corpus was the basis for the elaboration of the salient lexico-syntactic contexts with which we constructed selectional preferences and contextualized vectors.…”
Section: Corpora and Distributional Models
confidence: 99%
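The excerpt above derives selectional preferences and contextualized vectors from salient lexico-syntactic contexts in a parsed corpus. As a hypothetical sketch of the general idea (not the authors' implementation), the code below builds count vectors over (head lemma, dependency relation) contexts for each dependent lemma and compares them with cosine similarity; the dependency triples are invented examples.

```python
# Sketch: distributional vectors from lexico-syntactic contexts.
# Each context is a (head_lemma, dependency_relation) pair; the triples
# below are invented for illustration, not drawn from Wikipedia.
import math
from collections import Counter, defaultdict

TRIPLES = [
    ("drink", "obj", "water"),
    ("drink", "obj", "wine"),
    ("pour", "obj", "water"),
    ("drink", "nsubj", "man"),
]

def context_vectors(triples):
    """Map each dependent lemma to a Counter over its (head, rel) contexts."""
    vectors = defaultdict(Counter)
    for head, rel, dep in triples:
        vectors[dep][(head, rel)] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = context_vectors(TRIPLES)
print(cosine(vecs["water"], vecs["wine"]))  # → 0.7071067811865475
```

In a real setting the counts would typically be reweighted (e.g. with PMI) before comparison, and the most salient contexts per head would define its selectional preferences; this sketch shows only the raw counting step.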