This article was developed at the intersection of Natural Language Processing and Language Studies and is based on a corpus compiled with computational tools. The study assumes that it is useful to trace a close relationship between corpus generation and annotation and the assessment of the constitutive elements of the source text genre. It aims to demonstrate, through specific studies of structured data from the text genre ‘scientific article’, alternatives to automatic text processing techniques. To that end, the authors created a computational model for compiling a specialized linguistic corpus representative of the scientific article genre, CorpACE. The object of study comprises the constitutive elements of scientific articles, marked up in XML, extracted and collected from the SciELO (Scientific Electronic Library Online) database. The final product is a database of information extracted and structured in XML format, which designates and identifies the markup of the genre under analysis and is available for use in many tools and applications. The results demonstrate how the representation of the constitutive elements of the genre can condense the available information through hierarchical and dynamic processes built during compilation. The study concludes that more research is needed to bring Language Science and Computer Science closer together, with emphasis on NLP, in the attempt to represent and manipulate linguistic knowledge at its many levels (morphological, syntactic, semantic and discursive) in order to improve the implementation and manipulation of automatic text processing.
ABSTRACT: The diversity of language resources that enables the construction of Natural Language Processing applications creates the need for tools that are equally flexible. In addition, these tools should be as user-friendly as they are useful, in order to reduce the effort required of novice users while still delivering efficient performance for advanced users. This article presents AnoTex, a textual annotator capable of filtering structured data of the textual genre scientific article, collected from the files available in the database of the SciELO (Scientific Electronic Library Online) electronic library. The extraction process produced a database of filtered information structured in XML format, which delimits and identifies the markup of the genre under analysis and is available for use in various tools and applications. Other existing text annotation tools are presented, and it is argued that AnoTex is the first to combine a good level of ease of use with structured resources, constitutive of the genre, of high linguistic quality. The results demonstrate how the categorization of the constitutive elements of the genre, through their representation in treebanks, can condense the available information in a hierarchical and dynamic form built during compilation. These characteristics may indicate new strategies for using the collected markup, in order to meet the need for improved information access and retrieval provided by text processing tools.
KEYWORDS: Natural Language Processing; textual genre; textual annotator; corpus annotation.
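As a rough illustration of the kind of filtering described above, the following minimal Python sketch pulls a few constitutive elements out of an XML-marked article and arranges them hierarchically. It is not the AnoTex implementation: the sample document, the JATS-like tag names (`article-title`, `kwd-group`, `sec`) and the function name `extract_elements` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the tag names are assumed for illustration
# and are not taken from the article or from any specific SciELO schema.
SAMPLE = """<article>
  <front>
    <article-title>Sample title</article-title>
    <abstract>Sample abstract text.</abstract>
    <kwd-group><kwd>corpus</kwd><kwd>genre</kwd></kwd-group>
  </front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p></sec>
    <sec><title>Methods</title><p>Second paragraph.</p></sec>
  </body>
</article>"""


def extract_elements(xml_text):
    """Collect a few constitutive elements of an article into a nested dict."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("front/article-title", default=""),
        "abstract": root.findtext("front/abstract", default=""),
        "keywords": [k.text or "" for k in root.iterfind("front/kwd-group/kwd")],
        # Each section keeps its own title and paragraphs, preserving hierarchy.
        "sections": [
            {
                "title": sec.findtext("title", default=""),
                "paragraphs": [p.text or "" for p in sec.iterfind("p")],
            }
            for sec in root.iterfind("body/sec")
        ],
    }


if __name__ == "__main__":
    record = extract_elements(SAMPLE)
    print(record["title"])
    for section in record["sections"]:
        print("-", section["title"], len(section["paragraphs"]))
```

The nested dictionary is only one possible representation; a real annotator would have to follow the markup actually present in the source files.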