The Sketch Engine is a leading corpus tool, widely used in lexicography. Now, at 10 years old, it is mature software. The Sketch Engine website offers many ready-to-use corpora, and tools for users to build, upload and install their own corpora. The paper describes the core functions (word sketches, concordancing, thesaurus). It outlines the different kinds of users, and the approach taken to working with many different languages. It then reviews the kinds of corpora available in the Sketch Engine, gives a brief tour of some of the innovations from the last few years, and surveys other corpus tools and websites.
Gorman and Curran (2006) argue that thesaurus generation for billion+-word corpora is problematic as the full computation takes many days. We present an algorithm with which the computation takes under two hours. We have created, and made publicly available, thesauruses based on large corpora for (at time of writing) seven major world languages. The development is implemented in the Sketch Engine (Kilgarriff et al., 2004). Another innovative development in the same tool is the presentation of the grammatical behaviour of a word against the background of how all other words of the same word class behave. Thus, 75% of the occurrences of the English noun constraint are in the plural. Is this a salient lexical fact? To form a judgement, we need to know the distribution for all nouns. We use histograms to present the distribution in a way that is easy to grasp.
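The idea of judging one noun's plural share against the distribution for all nouns can be illustrated with a minimal sketch. It is not the Sketch Engine implementation: the function names are hypothetical, and the counts are invented toy data chosen only so that constraint comes out at 75% plural, as in the example above.

```python
def plural_share(singular: int, plural: int) -> float:
    """Fraction of a noun's occurrences that are plural."""
    total = singular + plural
    if total == 0:
        raise ValueError("noun has no occurrences")
    return plural / total


def histogram(shares, bins=10):
    """Count how many nouns fall into each equal-width bin over [0, 1]."""
    counts = [0] * bins
    for s in shares:
        # A share of exactly 1.0 is placed in the last bin.
        idx = min(int(s * bins), bins - 1)
        counts[idx] += 1
    return counts


def share_of_nouns_below(shares, value):
    """Fraction of nouns whose plural share is strictly below `value`."""
    shares = list(shares)
    return sum(1 for s in shares if s < value) / len(shares)


# Invented toy counts: noun -> (singular occurrences, plural occurrences).
TOY_COUNTS = {
    "constraint": (25, 75),
    "water": (98, 2),
    "house": (70, 30),
    "idea": (60, 40),
    "scissors": (0, 100),
    "information": (100, 0),
    "child": (50, 50),
    "detail": (35, 65),
}

shares = {noun: plural_share(s, p) for noun, (s, p) in TOY_COUNTS.items()}
hist = histogram(shares.values(), bins=5)

# Text rendering of the histogram, one row per bin.
for i, count in enumerate(hist):
    low, high = i / 5, (i + 1) / 5
    print(f"{low:.1f}-{high:.1f} | {'#' * count}")

rank = share_of_nouns_below(shares.values(), shares["constraint"])
print(f"constraint: {shares['constraint']:.0%} plural, "
      f"higher than {rank:.0%} of the toy nouns")
```

On real data the shares would be computed from corpus frequencies for every noun, and the histogram would show at a glance whether a 75% plural share is unusual or ordinary for the word class.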
Abstract. This paper deals with DESAM, a disambiguated Czech corpus. It is a tagged corpus which has been manually disambiguated and can be used in various applications. We discuss the structure of the corpus, the tools used for managing it, linguistic applications, and the possible use of machine learning techniques relying on the disambiguated data. Possible ways of developing procedures for fully automatic disambiguation are considered.
Introduction
In computational linguistics, a "corpus" is a collection of written (or sometimes spoken) texts. Corpora can be used in several application areas: building dictionaries, general linguistic research, natural language processing, information retrieval, machine translation, etc.
In corpus exploration, a user must be able to express the query as precisely as possible in order to minimize the number of concordance items that have to be searched. It should be possible to refer to linguistic or structural information in the corpus. We use the term "tagged (annotated) corpus" for a corpus which consists not only of a sequence of words but also of additional information. Typically, this includes linguistic information associated with the particular word forms in the corpus: the most common linguistic tags are the lemma (the basic word form), the part of speech (POS) and the respective grammatical categories. Another level of annotation concerns structural information, which identifies the metatext structure of the text in the corpus. For example, we can mark (annotate) that a sequence of word forms is part of a headline or of a regular sentence in a paragraph [1].
The most reasonable way to build large annotated corpora is automatic tagging of the texts by computer programmes. However, natural languages display a rather complex structure, and it is therefore no surprise that attempts to process them with simple deterministic algorithms do not always yield satisfactory results.
The result is that present tagging programmes are not able to give fully reliable results, and there are many ambiguities in their output. Various strategies for resolving the ambiguities in tagged corpora have been developed and applied within the field of corpus linguistics. The most frequently used are the following:
For many languages there are no large, general-language corpora available. Before the Web, all but the richest institutions could do little but shake their heads in dismay, as corpus-building was long, slow and expensive. With the advent of the Web, it can be highly automated and thereby fast and inexpensive. We have developed a 'corpus factory' where we build large corpora. In this paper we describe the method we use, how it has worked, and how various problems were solved, for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese. The corpora we have developed are available for use in the Sketch Engine corpus query tool.