“…Much of this work relies on the ability to extract entities accurately, including work focused on modeling (Bamman et al., 2014; Iyyer et al., 2016; Chaturvedi et al., 2017). And yet, with notable exceptions (Vala et al., 2015; Brooke et al., 2016), nearly all of this work uses NER models trained on non-literary data, for the simple reason that labeled data exists for domains like news through standard datasets such as ACE (Walker et al., 2006), CoNLL (Tjong Kim Sang and De Meulder, 2003), and OntoNotes (Hovy et al., 2006), and even for historical non-fiction (DeLozier et al., 2016; Rayson et al., 2017), but not for literary texts. This is naturally problematic for two reasons: models trained on out-of-domain data degrade in performance when applied to a very different domain, especially for NER, as Augenstein et al. (2017) have shown; and without in-domain test data, it is difficult to directly estimate the severity of this degradation.…”