Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities 2017
DOI: 10.1145/3149858.3149865
A deeply annotated testbed for geographical text analysis

Abstract: This paper describes the development of an annotated corpus which forms a challenging testbed for geographical text analysis methods. This dataset, the Corpus of Lake District Writing (CLDW), consists of 80 manually digitised and annotated texts (comprising over 1.5 million word tokens). These texts were originally composed between 1622 and 1900, and they represent a range of different genres and authors. Collectively, the texts in the CLDW constitute an indicative sample of writing about the English Lake District…

Cited by 20 publications (14 citation statements) · References 19 publications (21 reference statements)
“…In a final exploration of our corpora, we set out to discover not only what was talked about, but where. Place names in the CLDW have been georeferenced using toponym recognition and resolution (Rayson et al 2017) and used to explore broader themes, including local variation in the use of aesthetic language and acoustic experience. In contrast, the Geograph corpus explicitly links descriptions to 1km grid squares, and texts can therefore be mapped without any additional processing.…”
Section: Mapping the Silence
confidence: 99%
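The contrast drawn in the statement above, between toponyms that must first be recognised and resolved before they can be mapped and Geograph entries that already carry a 1km grid reference, can be illustrated with a short sketch. This is not the paper's actual pipeline: the gazetteer entries and place names below are hypothetical stand-ins, and the grid-reference conversion follows the standard Ordnance Survey letter-pair scheme, returning eastings/northings rather than latitude/longitude.

```python
# Minimal sketch of the two mapping routes contrasted above: resolving CLDW
# toponyms against a gazetteer versus reading Geograph's 1km OS grid squares
# directly. The GAZETTEER values and place names are illustrative only.

GAZETTEER = {
    # place name -> (OSGB easting, northing) in metres; hypothetical values
    "Keswick":    (326500, 523500),
    "Windermere": (341000, 498000),
}

def resolve_toponym(name):
    """Toponym resolution reduced to a dictionary lookup; a real resolver
    would also rank candidate senses (e.g. prefer Lake District matches)."""
    return GAZETTEER.get(name)

def grid_square_centre(gridref):
    """Convert a 1km OS grid reference such as 'NY1807' to the centre of the
    square in OSGB eastings/northings (metres)."""
    letters, digits = gridref[:2].upper(), gridref[2:]
    def idx(ch):                                   # grid letters skip 'I'
        i = ord(ch) - ord('A')
        return i - 1 if ch > 'I' else i
    l1, l2 = idx(letters[0]), idx(letters[1])
    e100 = ((l1 - 2) % 5) * 5 + (l2 % 5)           # 100km square easting index
    n100 = (19 - (l1 // 5) * 5) - (l2 // 5)        # 100km square northing index
    half = len(digits) // 2
    e = e100 * 100_000 + int(digits[:half]) * 1_000 + 500
    n = n100 * 100_000 + int(digits[half:]) * 1_000 + 500
    return e, n

if __name__ == "__main__":
    print(resolve_toponym("Keswick"))         # gazetteer route for CLDW texts
    print(grid_square_centre("NY1807"))       # direct route for Geograph texts
```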
“…To process the plague reports, we used the Edinburgh Geoparser [Grover et al., 2010], a text mining pipeline which has previously been applied to other types of historical text [Rupp et al., 2013; Clifford et al., 2016; Rayson et al., 2017; Alex et al., 2019]. This tool is made up of a series of processing components.…”
Section: Automatic Annotation and Text Mining
confidence: 99%
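As a rough illustration of what "a series of processing components" means in practice, the sketch below chains three toy stages (tokenisation, place-name recognition, georesolution) in the way a geoparsing pipeline does. It is not the Edinburgh Geoparser's own code or interface; every function body, place name, and gazetteer entry here is an invented stand-in.

```python
# Schematic component chain analogous to a geoparsing pipeline; each stage is a
# plain function so the stages can be inspected or swapped independently.

import re

def tokenise(text):
    return re.findall(r"[A-Za-z']+", text)

def recognise_place_names(tokens, known_places):
    # toy recogniser: keep capitalised tokens that appear in a place-name list
    return [t for t in tokens if t[0].isupper() and t in known_places]

def georesolve(places, gazetteer):
    # toy resolver: look each recognised name up in a gazetteer
    return {p: gazetteer[p] for p in places if p in gazetteer}

def run_pipeline(text, known_places, gazetteer):
    tokens = tokenise(text)
    places = recognise_place_names(tokens, known_places)
    return georesolve(places, gazetteer)

if __name__ == "__main__":
    gaz = {"Ullswater": (54.57, -2.87)}     # hypothetical (lat, lon) entry
    report = "The report mentions deaths recorded near Ullswater."
    print(run_pipeline(report, known_places=set(gaz), gazetteer=gaz))
```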
“…Much of this work relies on the ability to extract entities accurately, including work focused on modeling (Bamman et al., 2014; Iyyer et al., 2016; Chaturvedi et al., 2017). And yet, with notable exceptions (Vala et al., 2015; Brooke et al., 2016), nearly all of this work tends to use NER models that have been trained on non-literary data, for the simple reason that labeled data exists for domains like news through standard datasets like ACE (Walker et al., 2006), CoNLL (Tjong Kim Sang and De Meulder, 2003) and OntoNotes (Hovy et al., 2006), and even for historical non-fiction (DeLozier et al., 2016; Rayson et al., 2017), but not for literary texts. This is naturally problematic for several reasons: models trained on out-of-domain data surely degrade in performance when applied to a very different domain, especially for NER, as Augenstein et al. (2017) have shown; and without in-domain test data, it is difficult to directly estimate the severity of this degradation.…”
Section: Introduction
confidence: 99%
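The degradation argument in that statement can be made concrete with a small check: given even a handful of hand-labelled in-domain sentences, an off-the-shelf NER model can be scored against them. The sample sentence and gold spans below are invented for illustration, and the spaCy model named here is simply a convenient stand-in for "a model trained on non-literary data" (requires spaCy and the en_core_web_sm model).

```python
# Sketch of estimating out-of-domain NER degradation from a tiny labelled sample.

import spacy

nlp = spacy.load("en_core_web_sm")   # model trained largely on non-literary data

# hypothetical in-domain sample: (text, set of gold entity surface forms)
sample = [
    ("From Keswick we rowed across Derwentwater towards Borrowdale.",
     {"Keswick", "Derwentwater", "Borrowdale"}),
]

tp = fp = fn = 0
for text, gold in sample:
    predicted = {ent.text for ent in nlp(text).ents}
    tp += len(predicted & gold)
    fp += len(predicted - gold)
    fn += len(gold - predicted)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"precision={precision:.2f} recall={recall:.2f}")
```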