2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS)
DOI: 10.1109/snams.2018.8554689
LinguaKit: A Big Data-Based Multilingual Tool for Linguistic Analysis and Information Extraction

Cited by 30 publications (34 citation statements). References 17 publications.
“…Thus, each of them contains 50 million tokens from Wikipedia, 20 million from the Europarl corpus (Koehn, 2005), 10 million from OpenSubtitles (Lison and Tiedemann, 2016), and a set of 20 million tokens formed by news, web pages, and small corpora from the Universal Dependencies 2018 and PARSEME 1.1 shared tasks (Zeman et al., 2018; Ramisch et al., 2018). The texts were tokenized, PoS-tagged and lemmatized by LinguaKit (Gamallo et al., 2018), and parsed by UDPipe, a state-of-the-art dependency parser based on neural networks (Straka and Straková, 2017). We used the Universal Dependencies formalism, which yielded the best results in a similar comparison (Uhrig et al., 2018), training the models with version 2.3 of the UD treebanks.…”
Section: Data
confidence: 99%
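The excerpt above describes a pipeline that tokenizes, PoS-tags, and lemmatizes text with LinguaKit and then dependency-parses it with UDPipe; such pipelines conventionally emit CoNLL-U. As a minimal stdlib-only sketch of consuming that output, the reader below extracts (form, lemma, UPOS, head, deprel) tuples. The sample sentence is invented for illustration and is not taken from the corpora described in the excerpt.

```python
# Illustrative CoNLL-U sample (10 tab-separated columns per token line).
CONLLU_SAMPLE = (
    "# text = LinguaKit analyses texts.\n"
    "1\tLinguaKit\tLinguaKit\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tanalyses\tanalyse\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\ttexts\ttext\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

def read_conllu(text):
    """Yield (form, lemma, upos, head, deprel) tuples from CoNLL-U lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) < 8 or not cols[0].isdigit():
            continue  # skip multiword-token (1-2) and empty-node (1.1) lines
        yield cols[1], cols[2], cols[3], int(cols[6]), cols[7]

tokens = list(read_conllu(CONLLU_SAMPLE))
lemmas = [lemma for _, lemma, _, _, _ in tokens]
print(len(tokens), lemmas)  # → 4 ['LinguaKit', 'analyse', 'text', '.']
```

Lemma and deprel columns like these are exactly what the corpus-processing step described above would feed into downstream model training.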
“…As with other NLP tasks and algorithms, the development of methods and resources for Portuguese is increasing day by day. Some important examples are HAREM and Second HAREM [32], Linguakit [13], and SIEMÊS [33], algorithms and resources for unsupervised named entity recognition, together with well-known suites such as FreeLing [34] or Stanford CoreNLP [35] for Portuguese and related supervised initiatives based on conditional random fields [36]. It is also worth mentioning similar works focused only on semantic relation extraction [37].…”
Section: Unsupervised Information Extraction In Portuguese: Linguakit
confidence: 99%
“…The paper is organized as follows: The remainder of this Introduction section presents the historical context, analysis criteria, and motivation for this work, as well as a review of existing initiatives applying natural language processing in similar forensic contexts. Section 2 describes the materials and methods employed, including the particularities of the natural language suite Linguakit [13] for Portuguese, which was used as the basis for information extraction, as well as the forensic corpus analyzed. Section 3 presents the results obtained according to the expert criteria adopted: (1) common causes of death, (2) relevant body locations, (3) personal belongings terminology, and (4) correlations between actors.…”
Section: Introduction
confidence: 99%
“…In order to build bilingual compositional vectors, we made use of the English and Spanish Wikipedias (dump files of December 2018), with 21 and 5 billion words, respectively. The two Wikipedias were PoS-tagged and syntactically analyzed with LinguaKit (Gamallo et al., 2018). The syntactically analyzed corpus was the basis for the elaboration of the salient lexico-syntactic contexts with which we constructed selectional preferences and contextualized vectors.…”
Section: Corpora and Distributional Models
confidence: 99%
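The excerpt above derives selectional preferences and contextualized vectors from salient lexico-syntactic contexts in a parsed corpus. As a hypothetical sketch of the general idea (not the authors' implementation), the code below builds count vectors over (head lemma, dependency relation) contexts for each dependent lemma and compares them with cosine similarity; the dependency triples are invented examples.

```python
# Sketch: distributional vectors from lexico-syntactic contexts.
# Each context is a (head_lemma, dependency_relation) pair; the triples
# below are invented for illustration, not drawn from Wikipedia.
import math
from collections import Counter, defaultdict

TRIPLES = [
    ("drink", "obj", "water"),
    ("drink", "obj", "wine"),
    ("pour", "obj", "water"),
    ("drink", "nsubj", "man"),
]

def context_vectors(triples):
    """Map each dependent lemma to a Counter over its (head, rel) contexts."""
    vectors = defaultdict(Counter)
    for head, rel, dep in triples:
        vectors[dep][(head, rel)] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = context_vectors(TRIPLES)
print(cosine(vecs["water"], vecs["wine"]))  # → 0.7071067811865475
```

In a real setting the counts would typically be reweighted (e.g. with PMI) before comparison, and the most salient contexts per head would define its selectional preferences; this sketch shows only the raw counting step.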