Despite increasing amounts of data and ever-improving natural language generation techniques, work on automated journalism is still relatively scarce. In this paper, we explore the field and the challenges associated with building a journalistic natural language generation system. We present a set of requirements that should guide system design, including transparency, accuracy, modifiability and transferability. Guided by the requirements, we present a data-driven architecture for automated journalism that is largely domain and language independent. We illustrate its practical application by producing news articles, on user request, about the 2017 Finnish municipal elections in three languages, demonstrating the success of the design's data-driven, modular approach. We then draw some lessons for future automated journalism.
We address the problem of automatically acquiring knowledge of event sequences from text, with the aim of providing a predictive model for use in narrative generation systems. We present a neural network model that simultaneously learns embeddings for words describing events, a function to compose the embeddings into a representation of the event, and a coherence function to predict the strength of association between two events. We introduce a new development of the narrative cloze evaluation task, better suited to a setting where rich information about events is available. We compare models that learn vector-space representations of the events denoted by verbs in chains centering on a single protagonist. We find that recent work on learning vector-space embeddings to capture word meaning can be effectively applied to this task, including simple incorporation of a verb's arguments in the representation by vector addition. These representations provide a good initialization for learning the richer, compositional model of events with a neural network, vastly outperforming a number of baselines and competitive alternatives.
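The idea of representing an event by adding the vectors of a verb and its arguments, then scoring how well two events go together, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy two-dimensional embeddings, the function names, and in particular the use of cosine similarity as a stand-in for the learned neural coherence function described in the abstract.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]

# Hypothetical toy embeddings; real ones would be learned from a corpus.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "arrest": [1.0, 0.0],
    "charge": [0.9, 0.1],
    "eat": [0.0, 1.0],
    "suspect": [0.5, 0.0],
    "meal": [0.0, 0.5],
}


def event_vector(words: Sequence[str], embeddings: Dict[str, Vector]) -> Vector:
    """Compose an event representation by summing the vectors of its words."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        raise ValueError("no known words in event")
    # Element-wise sum across the verb and argument vectors.
    return [sum(components) for components in zip(*known)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, used here as a simple stand-in coherence score."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_continuation(
    context: Sequence[str],
    candidates: Sequence[Sequence[str]],
    embeddings: Dict[str, Vector],
) -> Sequence[str]:
    """Pick the candidate event most similar to the context event."""
    context_vec = event_vector(context, embeddings)
    return max(
        candidates,
        key=lambda cand: cosine(context_vec, event_vector(cand, embeddings)),
    )


if __name__ == "__main__":
    chosen = best_continuation(
        ["arrest", "suspect"],
        [["eat", "meal"], ["charge", "suspect"]],
        TOY_EMBEDDINGS,
    )
    print(chosen)
```

Choosing among candidate events in this way loosely mirrors a cloze-style evaluation, where a model must select the most plausible event given its context.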
Hierarchical structure similar to that associated with prosody and syntax in language can be identified in the rhythmic and harmonic progressions that underlie Western tonal music. Analysing such musical structure resembles natural language parsing: it requires the derivation of an underlying interpretation from an unstructured sequence of highly ambiguous elements: in the case of music, the notes. The task here is not merely to decide whether the sequence is grammatical, but rather to decide which among a large number of analyses it has. An analysis of this sort is a part of the cognitive processing performed by listeners familiar with a musical idiom, whether musically trained or not. Our focus is on the analysis of the structure of expectations and resolutions created by harmonic progressions. Building on previous work, we define a theory of tonal harmonic progression, which plays a role analogous to semantics in language. Our parser uses a formal grammar of jazz chord sequences, of a kind widely used for natural language processing (NLP), to map music, in the form of chord sequences used by performers, onto a representation of the structured relationships between chords. It uses statistical modelling techniques used for wide-coverage parsing in NLP to make practical parsing feasible in the face of considerable ambiguity in the grammar. Using machine learning over a small corpus of jazz chord sequences annotated with harmonic analyses, we show that grammar-based musical interpretation using simple statistical parsing models is more accurate than a baseline HMM. The experiment demonstrates that statistical techniques adapted from NLP can be profitably applied to the analysis of harmonic structure.
We address the problem of cognate identification across vocabularies of any pair of languages. In particular, we focus on the case where the examined languages are low-resource, to the extent that no training data whatsoever in these languages, or even closely related ones, is available for the task. We investigate the extent to which training data from another, unrelated language family can be used instead. Our approach consists of learning a similarity metric from example cognates in Indo-European languages and applying it to low-resource Sami languages of the Uralic family. We apply two models, following previous work: a Siamese convolutional neural network (S-CNN) and a support vector machine (SVM), and compare them with a Levenshtein distance baseline. We test performance on three Sami languages and find that the S-CNN outperforms the other approaches, suggesting that it is better able to learn general characteristics of cognateness that carry over across language families. We also experiment with fine-tuning the S-CNN model with data from within the language family in order to quantify how well this model can make use of a small amount of target-domain data to adapt.
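A Levenshtein distance baseline of the kind mentioned above can be sketched as follows. This is only an illustrative sketch: the length normalization, the `threshold` value, and the function names are assumptions, not details taken from the abstract.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map edit distance to a similarity in [0, 1] by dividing by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_cognate_candidate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Flag a word pair as a possible cognate; the threshold is an assumed value."""
    return normalized_similarity(a, b) >= threshold


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(normalized_similarity("kitten", "sitting"))
```

Because such a baseline compares surface strings only, it has no way to learn regular sound correspondences, which is the kind of information a trained similarity model could in principle capture.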