Italian in the Trenches: Linguistic Annotation and Analysis of Texts of the Great War

De Felice, Irene; Dell’Orletta, Felice; Venturi, Giulia; Lenci, Alessandro; Montemagni, Simonetta

doi:10.4000/books.aaccademia.3273

Cited by 2 publications

(4 citation statements)

References 1 publication

“…These also include lexical variants corresponding to archaisms, neologisms, as well as dialectal forms or terminology of a specific domain. We report below, by way of example, some cases recorded in the Voci della Grande Guerra corpus, which collects texts of different genres and linguistic registers from the period of the First World War (De Felice et al 2018): obsolete forms rarely used in contemporary Italian (e.g., costì, tardanza); literary forms, such as pelago and nocumento; variants of current forms and/or lemmas, such as comperare for comprare, spedale for ospedale; diatopically marked forms, typical of a regional variety of Italian like cocuzza or mencio, or dialectal forms like batajun or preive. In addition to these, there are graphical variants of contemporary forms (such as pei for per i, pur troppo for purtroppo) that also have an impact on sentence segmentation.…”

Section: Challenges (mentioning)

confidence: 99%

“…More recently, POS tagging and lemmatization adaptation experiments have been carried out by using (relatively small) manually revised historical corpora to retrain the tools trained on contemporary language, with significantly improved results. This is the case of De Felice et al (2018) for the Voci della Grande Guerra Corpus, of Montemagni (2021, 2022a) for a subset of the VoDIM corpus (see below), and of Favaro et al (2022) for the quotations in the Grande dizionario della lingua italiana ('Great Dictionary of Italian Language', in short GDLI). Last but not least, Palmero Aprosio, Menini, and Tonelli (2022) introduce BERToldo, one of the BERT-like models, trained from scratch on historical data.…”

Section: Solutions (mentioning)

confidence: 99%

“…In the revision, we took advantage of the experience gained in the project Voci della Grande Guerra 'Voices of the Great War' (VGG) (De Felice et al 2018; Lenci et al 2020), especially for what concerns sentence splitting, tokenization and lemmatization problems derived from the automatic processing of a non-standard historical language variety. With regard to tokenization, some of the annotation problems observed in VGG are the same as those experienced in the annotation of the VoDIM subcorpus.…”

Section: The Corpus (mentioning)

confidence: 99%

“…); an excessive number of suspension marks is another feature creating sentence splitting problems. Furthermore, scientific texts include a large number of acronyms and symbols, causing various hyposegmentation issues (De Felice et al 2018).…”

Section: The Corpus (mentioning)

confidence: 99%

POS Tagging and Lemmatization of Historical Varieties of Languages. The Challenge of Old Italian

Favaro, Biffi, Montemagni, 2023, IJCoL

The paper discusses the challenges of POS tagging and lemmatization of historical varieties of Italian, and reports for both tasks the results of experiments carried out in a classical supervised domain adaptation scenario using the diachronic and typologically differentiated corpus built for the "Vocabolario Dinamico dell'Italiano Moderno" (VoDIM). For what concerns POS tagging, the effectiveness of retrained models is illustrated and substantiated with quantitative data, with a specific view to linguistic annotation results obtained with respect to specific language evolution stages, domains and textual genres. For lemmatization, different customized models have been developed, including lexicon-assisted ones and models retrained with historical annotated texts. In both cases, a detailed error analysis is provided.

An AI framework to support decisions on GDPR compliance

et al. 2023

The Italian Public Administration (PA) relies on costly manual analyses to ensure the GDPR compliance of public documents and secure personal data. Although recent advances in Artificial Intelligence (AI) have benefited many legal fields, the automation of workflows for data protection of public documents is still only marginally affected. The main aim of this work is to design a framework that can be effectively adopted to check whether PA documents written in Italian meet the GDPR requirements. The main outcome of our interdisciplinary research is INTREPID (artIficial iNTelligence for gdpR compliancE of Public admInistration Documents), an AI-based framework that can help the Italian PA to ensure GDPR compliance of public documents. INTREPID is realized by tuning some linguistic resources for Italian language processing (i.e. SpaCy and Tint) to the GDPR intelligence. In addition, we set the foundations for a text classification methodology to recognise the public documents published by the Italian PA that involve data breaches. We show the effectiveness of the framework over a text corpus of public documents that were published online by the Italian PA. We also perform an inter-annotator study and analyse the agreement of the annotation predictions of the proposed methodology with the annotations by domain experts. Finally, we evaluate the accuracy of the proposed text classification model in detecting breaches of security.
