Automating the Curation Process of Historical Literature on Marine Biodiversity Using Text Mining: The DECO Workflow

Paragkamian, Savvas; Sarafidou, Georgia; Mavraki, Dimitra; Pavloudi, Christina; Beja, Joana; Eliezer, Menashè; Lipizer, Marina; Boicenco, Laura; Vandepitte, L.; Perez-Perez, Ruben; Zafeiropoulos, Haris; Arvanitidis, Christos; Pafilis, Evangelos; Gerovasileiou, Vasilis

doi:10.3389/fmars.2022.940844

Cited by 1 publication (2 citation statements)

References 105 publications

“…The growing volume of scientific literature on biodiversity has led to a focus on the development of computational methods for extracting meaningful information from unstructured textual data (Farrell et al, 2022; Paragkamian et al, 2022). This computational task is known as text mining, and it has been used to identify trends, patterns, and relationships that would otherwise be difficult to detect.…”

Section: Related Work (mentioning)

confidence: 99%

“…Information extraction (IE) is an umbrella term for tasks that seek to automatically extract structured information from unstructured text. With the exponential growth of digitized literature over the years, IE has become increasingly pertinent, due to its role in (semi-)automatically populating databases with content (Ravikumar et al, 2015; Lee et al, 2018; Paragkamian et al, 2022). Relation extraction (RE) is an IE task that is concerned with the identification of semantic relationships between entities or concepts in text.…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised literature mining approaches for extracting relationships pertaining to habitats and reproductive conditions of plant species

Gabud, Lapitan, Mariano et al. 2024

Front. Artif. Intell.

Introduction: Fine-grained, descriptive information on habitats and reproductive conditions of plant species are crucial in forest restoration and rehabilitation efforts. Precise timing of fruit collection and knowledge of species' habitat preferences and reproductive status are necessary especially for tropical plant species that have short-lived recalcitrant seeds, and those that exhibit complex reproductive patterns, e.g., species with supra-annual mass flowering events that may occur in irregular intervals. Understanding plant regeneration in the way of planning for effective reforestation can be aided by providing access to structured information, e.g., in knowledge bases, that spans years if not decades as well as covering a wide range of geographic locations. The content of such a resource can be enriched with literature-derived information on species' time-sensitive reproductive conditions and location-specific habitats.

Methods: We sought to develop unsupervised approaches to extract relationships pertaining to habitats and their locations, and reproductive conditions of plant species and corresponding temporal information. Firstly, we handcrafted rules for a traditional rule-based pattern matching approach. We then developed a relation extraction approach building upon transformer models, i.e., the Text-to-Text Transfer Transformer (T5), casting the relation extraction problem as a question answering and natural language inference task. We then propose a novel unsupervised hybrid approach that combines our rule-based and transformer-based approaches.

Results: Evaluation of our hybrid approach on an annotated corpus of biodiversity-focused documents demonstrated an improvement of up to 15 percentage points in recall and best performance over solely rule-based and transformer-based methods with F1-scores ranging from 89.61 to 96.75% for reproductive condition - temporal expression relations, and ranging from 85.39% to 89.90% for habitat - geographic location relations. Our work shows that even without training models on any domain-specific labeled dataset, we are able to extract relationships between biodiversity concepts from literature with satisfactory performance.

Section: Related Work (mentioning)

confidence: 99%