This paper presents a study of whether and how automatically extracted keywords can be used to improve text categorization. In summary, we show that higher performance, as measured by micro-averaged F-measure on a standard text categorization collection, is achieved when the full-text representation is combined with the automatically extracted keywords. The combination is obtained by giving higher weights to words in the full texts that are also extracted as keywords. We also present results for experiments in which the keywords are the only input to the categorizer, represented either as unigrams or intact. Of these two experiments, the unigram representation performs better, although neither performs as well as using headlines only.
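The weighting combination described above can be pictured with a minimal sketch: in a bag-of-words representation, full-text terms that also appear among the extracted keywords receive a boosted weight. The function name, the boost factor, and the unigram-level keyword matching are illustrative assumptions, not the paper's actual implementation.

```python
from collections import Counter

def keyword_boosted_weights(fulltext_tokens, keyword_unigrams, boost=2.0):
    """Term weights for a bag-of-words model where full-text terms that were
    also extracted as keywords receive a higher weight (illustrative only)."""
    keywords = {k.lower() for k in keyword_unigrams}
    counts = Counter(t.lower() for t in fulltext_tokens)
    return {t: c * (boost if t in keywords else 1.0) for t, c in counts.items()}

# "categorization" and "keywords" get twice the weight of the other terms.
print(keyword_boosted_weights(
    ["improving", "text", "categorization", "with", "keywords"],
    ["categorization", "keywords"],
))
```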
We present a multilingual evaluation of approaches for spelling normalisation of historical text based on data from five languages: English, German, Hungarian, Icelandic, and Swedish. Three different normalisation methods are evaluated: a simplistic filtering model, a Levenshtein-based approach, and a character-based statistical machine translation approach. The evaluation shows that the machine translation approach often gives the best results, but also that all approaches improve over the baseline and that no single method works best for all languages.
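As an illustration of the general idea behind a Levenshtein-based approach, a historical spelling can be normalised by choosing the modern lexicon entry with the smallest edit distance. This is only a sketch under that simplifying assumption; the systems evaluated in the paper use their own candidate generation and training data.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalise(historical_word, modern_lexicon):
    """Return the modern word closest to the historical spelling."""
    return min(modern_lexicon, key=lambda w: levenshtein(historical_word, w))

# e.g. an old Swedish spelling mapped to its closest modern form
print(normalise("hafva", ["hava", "huvud", "halv"]))  # -> "hava"
```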
The article presents a new language learner corpus for Swedish, SweLL, and the methodology behind it, from essay collection and pseudonymisation (to protect learners' personal information) to annotation adapted to second language learning. The main aim is to deliver a well-annotated corpus of essays written by second language learners of Swedish and to make it available for research through a browsable environment. To that end, a new annotation tool and a new project management tool have been implemented, both with the main purpose of ensuring the reliability and quality of the final corpus. In the article, we discuss the reasoning behind metadata selection and the principles of gold corpus compilation, and argue for separating normalisation from correction annotation.
European libraries and archives are filled with enciphered manuscripts from the early modern period. These include military and diplomatic correspondence, records of secret societies, private letters, and so on. Although they are enciphered with classical cryptographic algorithms, their contents are unavailable to working historians. We therefore attack the problem of automatically converting cipher manuscript images into plaintext. We develop unsupervised models for character segmentation, character-image clustering, and decipherment of cluster sequences. We experiment with both pipelined and joint models, and we give empirical results for multiple ciphers.
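The decipherment step can be pictured, in the simplest setting of a 1:1 substitution cipher, as mapping cluster IDs (from character-image clustering) to plaintext letters. The frequency-matching heuristic below is only a toy illustration of that idea; it is not the unsupervised pipelined or joint models evaluated in the paper, and the letter frequencies are made up for the example.

```python
from collections import Counter

def frequency_decipher(cluster_sequence, plaintext_letter_freqs):
    """Toy decipherment: map each cluster ID to a plaintext letter by
    matching frequency ranks (assumes a 1:1 substitution cipher)."""
    cluster_ranked = [c for c, _ in Counter(cluster_sequence).most_common()]
    letters_ranked = sorted(plaintext_letter_freqs,
                            key=plaintext_letter_freqs.get, reverse=True)
    mapping = dict(zip(cluster_ranked, letters_ranked))
    return "".join(mapping.get(c, "?") for c in cluster_sequence)

# Cluster IDs would come from character-image clustering; letter frequencies
# from a plaintext language model.
print(frequency_decipher([3, 1, 3, 2, 3, 1], {"e": 0.12, "t": 0.09, "a": 0.08}))
```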
Historical ciphertexts and keys contain a wide range of symbols, from digits and letters of known alphabets to various types of graphic signs. To be able to study ciphertexts and keys empirically on a large(r) scale, a consistent representation of the symbol systems used in ciphers is essential. In this paper, we present guidelines for the transcription of ciphertexts, keys, and cipher-related cleartext documents. We hope that the guidelines not only contribute to systematic and consistent text representation across ciphertexts and keys, but also lead to more accurate and reliable transcriptions.
The Copiale Cipher is a 105-page, hand-written encrypted manuscript from the mid-eighteenth century. Its code was cracked and the text deciphered using modern computational technology combined with philological methods. We describe the book and the features of the text, and give a brief summary of the method by which we deciphered it. Finally, we present its content and the secret society, the Oculists, who were hiding behind the cipher.
This study investigates the linguistic characteristics of Swedish clinical text in radiology reports and doctors' daily notes from electronic health records (EHRs) in comparison to general Swedish and biomedical journal text. We quantify linguistic features through a comparative register analysis to determine how the free text of EHRs differs from general and biomedical Swedish text in terms of lexical complexity, word and sentence composition, and common sentence structures. The linguistic features are extracted using state-of-the-art computational tools: a tokenizer, a part-of-speech tagger, and scripts for statistical analysis. Results show that technical terms and abbreviations are more frequent in clinical text, and that lexical variance is low. Moreover, clinical text frequently omits subjects, verbs, and function words, resulting in shorter sentences. Clinical text differs not only from general Swedish but also internally, across its sub-domains; e.g., sentences lacking a verb are significantly more frequent in radiology reports. These results provide a foundation for future development of automatic methods for EHR simplification or clarification.
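A few of the surface features compared in such a register analysis can be computed directly from tokenised and POS-tagged sentences, as in the minimal sketch below. The tag names and the toy input are assumptions for illustration; the study's own feature extraction, based on a Swedish tokenizer and part-of-speech tagger, is considerably more extensive.

```python
def register_features(tagged_sentences):
    """Compute a few simple register features: type/token ratio, mean
    sentence length, and share of verbless sentences.
    Input: list of sentences, each a list of (token, pos_tag) pairs."""
    tokens = [tok.lower() for sent in tagged_sentences for tok, _ in sent]
    ttr = len(set(tokens)) / len(tokens)
    mean_len = len(tokens) / len(tagged_sentences)
    verbless = sum(all(tag != "VERB" for _, tag in sent) for sent in tagged_sentences)
    return {
        "type_token_ratio": ttr,
        "mean_sentence_length": mean_len,
        "verbless_sentence_share": verbless / len(tagged_sentences),
    }

# Tiny toy input; in practice the tags would come from a Swedish POS tagger.
print(register_features([
    [("lungor", "NOUN"), ("ua", "ABBR")],                      # verbless, radiology-style note
    [("patienten", "NOUN"), ("mår", "VERB"), ("bra", "ADV")],
]))
```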