2020
DOI: 10.3233/faia200600

Automatic Extraction of Lithuanian Cybersecurity Terms Using Deep Learning Approaches

Abstract: The paper presents the results of research on deep learning methods aiming to determine the most effective one for automatic extraction of Lithuanian terms from a specialized domain (cybersecurity) with very restricted resources. A semi-supervised approach to deep learning was chosen for the research, as Lithuanian is a less-resourced language and the large amounts of data necessary for unsupervised methods are not available in the selected domain. The findings of the research show that a Bi-LSTM network with Bidirectional Encoder Representations from Transformers (BERT) embeddings achieved the best results, reaching an F1 score of 78.6%.
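As a point of reference for the architecture named in the abstract, below is a minimal sketch of a Bi-LSTM token tagger over multilingual BERT embeddings in PyTorch. This is not the authors' code: the tag set, hidden size, the choice to freeze BERT, and the example sentence are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class BiLSTMTermTagger(nn.Module):
    """Bi-LSTM tagger over frozen multilingual BERT embeddings.
    All hyperparameters here are illustrative, not the paper's."""
    def __init__(self, bert_name="bert-base-multilingual-cased",
                 hidden=256, num_tags=3):  # assumed BIO scheme: B-TERM, I-TERM, O
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        for p in self.bert.parameters():   # use BERT purely as an embedder
            p.requires_grad = False
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, input_ids, attention_mask):
        emb = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state
        feats, _ = self.lstm(emb)          # contextualise in both directions
        return self.out(feats)             # per-subword tag logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
batch = tokenizer(["Ugniasienė blokuoja kenkėjišką programinę įrangą."],
                  return_tensors="pt")
with torch.no_grad():
    logits = BiLSTMTermTagger()(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # (1, sequence_length, num_tags)
```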

Cited by 4 publications (7 citation statements: 0 supporting, 7 mentioning, 0 contrasting).
References 7 publications (15 reference statements).
“…In their work, pretrained language models clearly outperformed a classification method based on a variety of features, such as statistical descriptors and the domain-specificity measure termhood (Kageura and Umino, 1996). A recently published approach (Rokas et al., 2020) relies on LSTM, GRU and BERT embeddings and achieves high F1 scores for ATE of Lithuanian terms in the cybersecurity domain. Several approaches build on word embeddings to perform ATE on specific domains, such as medicine (e.g.…”
Section: Related Work
Mentioning, confidence: 98%
“…The best results were achieved with a Bidirectional Long Short-Term Memory (Bi-LSTM) model using multilingual Bidirectional Encoder Representations from Transformers (BERT) embeddings, reaching an F1 score of 78.6%. The achieved high score suggests that the semi-supervised deep learning approach is the way to go (Rokas et al., 2020).…”
Section: Development of Gold Standard Corpora
Mentioning, confidence: 99%
“…Current terminology extraction methods employ machine learning and deep learning approaches. Our pilot study on automatic extraction of monolingual (Lithuanian) cybersecurity terms proved that this methodology can achieve high results even with very limited resources (Rokas et al., 2020). In the pilot study, several neural network configurations were iteratively tested by comparing their results against the gold standard; the networks were trained on a very small manually annotated dataset (66,706 words, of which 1,258 were annotated as cybersecurity terms) compiled specifically for the extraction of cybersecurity terminology.…”
Section: Development of Gold Standard Corpora
Mentioning, confidence: 99%
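The comparison against the gold standard comes down to scoring predicted term tags against the manual annotation. Below is a minimal sketch of token-level F1 for a single TERM class; the tag scheme and the example sequences are assumptions for illustration, not the paper's evaluation code.

```python
def term_f1(pred, gold, term_tag="TERM"):
    """Token-level F1 for the term class against a gold annotation."""
    tp = sum(p == g == term_tag for p, g in zip(pred, gold))
    fp = sum(p == term_tag and g != term_tag for p, g in zip(pred, gold))
    fn = sum(p != term_tag and g == term_tag for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Made-up example sequences, not data from the annotated corpus:
gold = ["TERM", "O", "TERM", "TERM", "O"]
pred = ["TERM", "O", "O", "TERM", "O"]
print(f"F1 = {term_f1(pred, gold):.3f}")  # F1 = 0.800
```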
“…The project aims to employ current deep learning terminology extraction methods. In 2020, the project team (Rokas et al., 2020) completed a pilot study on semi-supervised automatic extraction of Lithuanian CS terms from a Lithuanian monolingual corpus. A small-scale manually annotated dataset (a 66,706-word corpus with 1,258 annotated cybersecurity terms) was used as training data.…”
Section: Introduction
Mentioning, confidence: 99%
“…A small-scale manually annotated dataset (a 66,706-word corpus with 1,258 annotated cybersecurity terms) was used as training data. The pilot study was performed in several stages: first, various baseline LSTM and GRU networks were tested using the Adam optimiser and FastText embeddings; second, each of the best baseline LSTM and GRU networks was tested with various optimisers; and finally, the best model was compared with a model trained using multilingual BERT embeddings (Rokas et al., 2020). The latter approach proved to be the most efficient: the Bidirectional Long Short-Term Memory (Bi-LSTM) model using multilingual Bidirectional Encoder Representations from Transformers (BERT) embeddings reached an F1 score of 78.6%.…”
Section: Introduction
Mentioning, confidence: 99%
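As a hedged illustration of the three-stage selection just described, the sketch below mirrors its structure with a stubbed training routine; the architecture names, optimiser list, and returned scores are placeholders, not the paper's actual configurations or results.

```python
import random

def train_and_eval(arch, embeddings, optimiser):
    """Stub: a real run would train `arch` on the annotated corpus with the
    given embeddings and optimiser, then return its F1 on held-out data."""
    random.seed(hash((arch, embeddings, optimiser)) % 2**32)
    return round(random.uniform(0.5, 0.8), 3)

# Stage 1: baseline LSTM/GRU variants with FastText embeddings and Adam.
stage1 = {a: train_and_eval(a, "fasttext", "adam")
          for a in ("lstm", "bilstm", "gru", "bigru")}
best_arch = max(stage1, key=stage1.get)

# Stage 2: re-test the best baseline with various optimisers.
stage2 = {o: train_and_eval(best_arch, "fasttext", o)
          for o in ("adam", "sgd", "rmsprop", "adagrad")}
best_opt = max(stage2, key=stage2.get)

# Stage 3: compare against the same model with multilingual BERT embeddings.
bert_f1 = train_and_eval(best_arch, "mbert", best_opt)
print(best_arch, best_opt, bert_f1)
```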