Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations 2020
DOI: 10.18653/v1/2020.acl-demos.14

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

Abstract: We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multiword token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora…
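
As a concrete illustration of the stages listed in the abstract, here is a minimal sketch using Stanza's public Python API (stanza.download and stanza.Pipeline). The example sentence and the exact processor list are illustrative, and processor availability varies by language; French is used because it has multiword tokens (e.g. "du" → "de le"), so the MWT expansion stage is exercised.

```python
import stanza

stanza.download("fr")  # fetch French models on first use

# French has multiword tokens ("du" -> "de le"), so every stage from the
# abstract applies: tokenize, MWT expansion, POS/morphology, lemma,
# dependency parse, NER.
nlp = stanza.Pipeline("fr", processors="tokenize,mwt,pos,lemma,depparse,ner")

doc = nlp("Le président du conseil est arrivé à Paris.")

for sentence in doc.sentences:
    for word in sentence.words:
        # surface form, lemma, universal POS tag, dependency relation
        print(word.text, word.lemma, word.upos, word.deprel)

# Named entities collected over the whole document
print([(ent.text, ent.type) for ent in doc.ents])
```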

Cited by 955 publications (735 citation statements) | References 9 publications
“…The current versions of MuST-C and MuST-Cinema do not include Japanese as a source language; however, we will still perform the analysis on JESC and OpenSubtitles. For the Chink-Chunk algorithm we preprocess the data using the Stanza toolkit (Qi et al., 2020). We first tokenise and perform Multi-Word Token (MWT) expansion to split the words into syntactic units.…”
Section: Methods
confidence: 99%
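
A minimal sketch of the tokenisation and MWT expansion step quoted above, using Stanza's public API. German is shown here because it has multiword tokens (e.g. "zum" → "zu dem"); the citing paper applies the same step to its own corpora, and the sentence is invented for illustration.

```python
import stanza

stanza.download("de")
# Only the two stages named in the quote: tokenisation + MWT expansion.
nlp = stanza.Pipeline("de", processors="tokenize,mwt")

doc = nlp("Wir gehen zum Bahnhof.")
for token in doc.sentences[0].tokens:
    # token.words holds the expanded syntactic words behind each
    # surface token, e.g. "zum" -> ["zu", "dem"]
    print(token.text, "->", [w.text for w in token.words])
```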
“…AMR parsers in the literature rely on several pre- and postprocessing rules. We extend these rules for the cross-lingual AMR parsing task based on several multilingual resources such as Wikipedia, BabelNet 4.0 (Navigli and Ponzetto, 2010), and the DBpedia Spotlight API (Daiber et al., 2013) for entity identification in all languages but Chinese, for which we use Babelfy (Moro et al., 2014) instead; Stanford CoreNLP for the English preprocessing pipeline; the Stanza Toolkit (Qi et al., 2020) for Chinese, German and Spanish sentences; and Tint (Aprosio and Moretti, 2016) for Italian. The preprocessing steps consist of: i) lemmatization, ii) PoS tagging, iii) NER, iv) re-categorization of entities and senses, v) removal of wiki links and polarity attributes.…”
Section: Pre- and Postprocessing
confidence: 99%
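
The per-language routing described above (steps i–iii via Stanza for Chinese, German and Spanish) might look as follows. This is a hedged sketch only: the CoreNLP, Tint and Babelfy calls are omitted, and the example sentence is invented.

```python
import stanza

# Chinese has no multiword tokens, so its processor list omits "mwt".
PROCESSORS = {
    "zh": "tokenize,pos,lemma,ner",
    "de": "tokenize,mwt,pos,lemma,ner",
    "es": "tokenize,mwt,pos,lemma,ner",
}

pipelines = {}
for lang, procs in PROCESSORS.items():
    stanza.download(lang)  # fetch models on first use
    pipelines[lang] = stanza.Pipeline(lang, processors=procs)

def preprocess(lang, text):
    """Steps i)-iii): lemmas, PoS tags and named entities for one text."""
    doc = pipelines[lang](text)
    words = [(w.text, w.lemma, w.upos) for s in doc.sentences for w in s.words]
    ents = [(e.text, e.type) for e in doc.ents]
    return words, ents

words, ents = preprocess("es", "Angela Merkel visitó Madrid.")
print(words)
print(ents)
```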
“…Preprocessing This step consists of: i) lemmatization, ii) PoS-tagging, iii) NER, iv) re-categorization of entities and senses and v) removal of wiki links and polarity attributes. As NLP pipelines (steps i-iii) we use Stanford CoreNLP for English sentences, the Stanza Toolkit (Qi et al., 2020) for Chinese, German and Spanish sentences, and Tint (Aprosio and Moretti, 2016) for Italian. Re-categorization and anonymization of entities is often used in English AMR parsing to reduce data sparsity (Lyu and Titov, 2018; Peng et al., 2017; Konstas et al., 2017).…”
Section: A Cross-lingual AMR Pre- and Postprocessing
confidence: 99%
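
Step iv), the re-categorization/anonymization of entities, can be sketched with Stanza's NER output as below. The placeholder scheme is illustrative and not the citing paper's exact re-categorization rules; entity spans in Stanza carry character offsets, which makes the substitution straightforward.

```python
import stanza

stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,ner")

def anonymise(text):
    """Replace each entity span with a category placeholder."""
    doc = nlp(text)
    out, cursor = [], 0
    for ent in doc.ents:  # spans carry character offsets into `text`
        out.append(text[cursor:ent.start_char])
        out.append("<" + ent.type + ">")
        cursor = ent.end_char
    out.append(text[cursor:])
    return "".join(out)

print(anonymise("Angela Merkel met Emmanuel Macron in Berlin."))
# e.g. "<PERSON> met <PERSON> in <GPE>."
```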
“…For the rule-based model, we used the "GUM" model of Stanford's Stanza toolkit [16] for tokenisation and the "GENIA+PubMed" model of the BLLIP parser [4] for parsing. We converted the resulting trees into Universal Dependencies using the Stanford Dependencies Converter [18].…”
Section: Training and Evaluation
confidence: 99%
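
Loading treebank-specific models such as the "GUM" tokeniser can be sketched as follows, assuming Stanza's package argument selects the treebank-specific models. The biomedical sentence is invented, and the BLLIP parsing and UD conversion steps are separate tools not shown here.

```python
import stanza

# Select the GUM-treebank models rather than the default English package.
stanza.download("en", package="gum")
nlp = stanza.Pipeline("en", package="gum", processors="tokenize")

doc = nlp("The protein binds to the receptor and activates it.")
print([token.text for sent in doc.sentences for token in sent.tokens])

# Constituency parsing with BLLIP's "GENIA+PubMed" model and conversion
# to Universal Dependencies would follow as separate steps (not shown).
```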