Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning - NeMLaP 1998
DOI: 10.3115/1603899.1603909

Automation of treebank annotation

Abstract: This paper describes applications of stochastic and symbolic NLP methods to treebank annotation. In particular, we focus on (1) the automation of treebank annotation, (2) the comparison of conflicting annotations for the same sentence and (3) the automatic detection of inconsistencies. These techniques are currently employed for building a German treebank.

Cited by 27 publications (23 citation statements)
References 8 publications
“…There is a long history of scaling for language models, for both the model and dataset sizes. Brants et al. (2007) showed the benefits of using language models trained on 2 trillion tokens, resulting in 300 billion n-grams, on the quality of machine translation. In the context of neural language models, Jozefowicz et al. (2016) obtained state-of-the-art results on the Billion Word benchmark by scaling LSTMs to 1 billion parameters.…”
Section: Related Work
confidence: 99%
“…Dickinson and Meurers [24] introduced an algorithm to detect POS tag errors in gold-standard annotations. They present three error detection methods, which are related to the common inter-annotator agreement evaluation strategy [25]. Thiele et al. [26] applied a similar technique to detect POS errors and developed a graphical interface that enables users to find and evaluate annotation errors.…”
Section: Visualization and Computational Linguistics
confidence: 99%
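The error-detection idea referenced in the excerpt above can be illustrated with a minimal sketch: flag tokens that occur in the corpus with more than one POS tag, since such variation is a candidate inconsistency. This is a deliberately simplified unigram version (the cited work uses richer context); the corpus and tag names below are invented for illustration.

```python
from collections import defaultdict

def find_pos_variation(tagged_corpus):
    """Return {word: set_of_tags} for words annotated with more
    than one POS tag across the corpus -- candidate inconsistencies."""
    tags_seen = defaultdict(set)
    for word, tag in tagged_corpus:
        tags_seen[word].add(tag)
    return {w: tags for w, tags in tags_seen.items() if len(tags) > 1}

# Invented toy corpus of (word, tag) pairs.
corpus = [("the", "DT"), ("can", "MD"), ("can", "NN"), ("rusts", "VBZ")]
print(find_pos_variation(corpus))  # only "can" varies
```

In practice each flagged word would be inspected in context, since genuine ambiguity (e.g. "can" as modal vs. noun) also produces variation.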
“…Since in ² All trees in this contribution follow the data format for trees defined by the NEGRA project of the Sonderforschungsbereich 378 at the University of the Saarland, Saarbrücken. They were printed by the NEGRA annotation tool [5]. ³ Memory-based learning has recently been applied to a variety of NLP classification tasks, including part-of-speech tagging, noun phrase chunking, grapheme-phoneme conversion, word sense disambiguation, and pp attachment (see [9], [14], [15] for details).…”
Section: Similarity-based Tree Construction
confidence: 99%
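Memory-based learning, as mentioned in the excerpt above, stores all training cases and classifies a new instance by majority vote over its nearest stored neighbours. The sketch below shows the core idea with a Hamming distance over fixed-length feature tuples; the feature scheme and toy instance base are invented, not taken from the cited systems.

```python
from collections import Counter

def knn_classify(instance_base, query, k=3):
    """Memory-based classification: keep every training case and
    label a query by majority vote over its k nearest neighbours.
    Instances are (features, label); distance = count of differing features."""
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b))
    nearest = sorted(instance_base, key=lambda fc: dist(fc[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented instance base: (previous tag, word suffix) -> POS tag.
base = [(("DT", "og"), "NN"), (("DT", "at"), "NN"),
        (("NN", "ns"), "VBZ"), (("MD", "un"), "VB")]
print(knn_classify(base, ("DT", "ig")))  # two DT-context neighbours vote NN
```

Real memory-based taggers use weighted feature metrics and efficient indexing rather than a full sort, but the classification principle is the same.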
“…67,000 fully annotated sentences or sentence fragments. ⁵ The evaluation consisted of a ten-fold cross-validation test, where the training data provide an instance base of already seen cases for TüSBL's tree construction module.…”
Section: Quantitative Evaluation
confidence: 99%
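The ten-fold cross-validation setup mentioned in the excerpt above partitions the data into ten folds, holding each fold out once as test data while the remaining nine form the instance base. A minimal sketch of that splitting scheme (the data here are stand-in integers, not the actual treebank sentences):

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) partitions for k-fold cross-validation:
    each of the k folds serves exactly once as held-out test data."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

sentences = list(range(20))  # stand-in for annotated sentences
for train, test in k_fold_splits(sentences):
    assert len(test) == 2 and len(train) == 18  # 10% held out per fold
```

Averaging an evaluation metric over the ten test folds gives a score that uses every sentence for testing exactly once.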