Abstract: Due to the increased amount of available biodiversity data, many biodiversity research institutions are now making their databases openly available on the web. Researchers in the field use these databases to extract new knowledge and to share their own discoveries. However, when these researchers need to find relevant information in the data, they still rely on traditional search based on text matching, which is ill-suited to large amounts of heterogeneous biodiversity data and leads to search results with low precision and recall. We present a new architecture that tackles this problem using a semantic search system for biodiversity data. Semantic search aims to improve search accuracy by using ontologies to understand user objectives and the contextual meaning of the terms used in the search, so as to generate more relevant results. Biodiversity data is mapped to terms from relevant ontologies, such as Darwin Core, DBpedia, OntoBio and Catalogue of Life, stored in semantic web formats and queried using semantic web tools (such as triple stores). A prototype semantic search tool was successfully implemented and evaluated by users from the National Research Institute for the Amazon (INPA). Our results show that the semantic search approach has better precision (28% improvement) and recall (25% improvement) than keyword-based search on a large set of representative biodiversity data (206,000 records) from INPA and the Emílio Goeldi Museum in Pará (MPEG). We also show that, because the biodiversity data is now in semantic web format and mapped to ontology terms, it is easy to enhance it with information from other sources; as an example, we enrich collection data with deforestation data from the National Institute for Space Research (INPE).
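To make the two evaluation metrics concrete, here is a minimal sketch of how precision and recall can be computed for a single query. The record IDs and result sets are hypothetical and are not taken from the INPA or MPEG data; the function name `precision_recall` is likewise an assumption introduced for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of record IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant records that were actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth and result sets, for illustration only.
relevant = {"r1", "r2", "r3", "r4"}
keyword_results = {"r1", "r2", "r9", "r10"}
semantic_results = {"r1", "r2", "r3", "r9"}

print("keyword: ", precision_recall(keyword_results, relevant))
print("semantic:", precision_recall(semantic_results, relevant))
```

Precision penalizes irrelevant records in the result list, while recall penalizes relevant records that were missed, which is why the abstract reports the two improvements separately.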
Abstract: Biodiversity encompasses all life forms found in nature. Maintaining biological diversity is important because it is essential to life on Earth. The lack of accurate geographic information in species occurrence data, especially from diversity-rich regions (like the Amazon Forest), leads to problems in many conservation activities, such as systematic planning for the protection of endangered species. In this paper, we present a gazetteer (a geographical directory that associates place names with geographic coordinates) for biodiversity data that is available as a Linked Open Data resource (using a GeoSPARQL endpoint) and show how it can be used to improve inaccurate geographic collection data. We compared our gazetteer with three openly available resources, Geonames, WikiMapia and Wikipedia, and obtained a recall rate 10% better than these resources. We also used the gazetteer to correct geographic data in a large sample (327,000 occurrence records) from SpeciesLink and GBIF, two large open-access repositories of biodiversity occurrence data. In this data set, we were able to add geographic coordinates to around 14% of the records that did not have them before.
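The coordinate-filling step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the paper's gazetteer is served through a GeoSPARQL endpoint, whereas this sketch substitutes a small in-memory dictionary. The place names, the approximate coordinates, and the record fields (`locality`, `lat`, `lon`) are all hypothetical.

```python
# Hypothetical in-memory gazetteer: normalized place name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "reserva ducke": (-2.93, -59.97),
    "manaus": (-3.10, -60.02),
}


def normalize(name):
    """Lower-case a place name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def fill_coordinates(records, gazetteer):
    """Return (new_records, filled_count); only records lacking coordinates are changed."""
    filled = 0
    out = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are left untouched
        if rec.get("lat") is None or rec.get("lon") is None:
            match = gazetteer.get(normalize(rec.get("locality", "")))
            if match is not None:
                rec["lat"], rec["lon"] = match
                filled += 1
        out.append(rec)
    return out, filled


# Hypothetical occurrence records.
records = [
    {"id": "o1", "locality": "Reserva  Ducke", "lat": None, "lon": None},
    {"id": "o2", "locality": "Unknown creek", "lat": None, "lon": None},
    {"id": "o3", "locality": "Manaus", "lat": -3.1, "lon": -60.0},
]
fixed, filled = fill_coordinates(records, GAZETTEER)
print(filled, fixed[0])
```

Exact matching on normalized names is the simplest possible lookup; a real gazetteer would also need to handle spelling variants and ambiguous place names.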
Biological diversity is essential to sustaining life on Earth and motivates many efforts to collect data about species, which gives rise to a large amount of information. Biodiversity data is, in most cases, stored in relational databases. Researchers use this data to extract knowledge and to share their new discoveries about living things. However, the traditional search approach, based essentially on keyword matching, is not appropriate for large amounts of heterogeneous biodiversity data: keyword search has low precision and recall on this kind of data. This work presents a new architecture that tackles this problem using a semantic search system for biodiversity data, together with semantic web formats and tools to represent this data. Semantic search aims to improve search accuracy by using ontologies to understand user objectives and the contextual meaning of the terms used in the search, so as to generate more relevant results. This work also presents test results on a set of representative biodiversity data from the National Research Institute for the Amazon (INPA) and the Emílio Goeldi Museum in Pará (MPEG). Ontologies allow knowledge to be organized into conceptual spaces according to its meaning. For semantic search to work, a key step is to create mappings between the data (in this case, INPA's and MPEG's biodiversity data) and the ontologies describing it, here the species taxonomy (a taxonomy is an ontology in which each class can have just one parent) and OntoBio, INPA's biodiversity ontology. These mappings were created after we extracted the taxonomic classification from the Catalogue of Life (CoL) website and created a new version of OntoBio. A prototype of the architecture was built and tested using INPA's and MPEG's use cases and data. The results showed that the semantic search approach had better precision (28% improvement) and recall (25% improvement) than keyword-based search. They also showed that it was possible to easily connect the mapped data to other Linked Open Data sources, such as the Amazon Forest Linked Data from the National Institute for Space Research (INPE).
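The difference between keyword matching and taxonomy-aware search can be illustrated with a small sketch. It is not the architecture described above, which relies on ontologies and triple stores; here the taxonomy is reduced to a child-to-parent dictionary (each class has exactly one parent, as in the definition given in the text), and the records, field names, and function names are hypothetical.

```python
# Tiny single-parent taxonomy: child -> parent.
PARENT = {
    "Panthera onca": "Panthera",
    "Panthera": "Felidae",
    "Leopardus pardalis": "Leopardus",
    "Leopardus": "Felidae",
    "Felidae": "Carnivora",
}

# Hypothetical collection records, each mapped to one taxon.
RECORDS = [
    {"id": "r1", "taxon": "Panthera onca", "notes": "adult female, riverbank"},
    {"id": "r2", "taxon": "Leopardus pardalis", "notes": "camera trap"},
    {"id": "r3", "taxon": "Inia geoffrensis", "notes": "seen near Felidae tracks"},
]


def ancestors(taxon, parent):
    """Walk up the taxonomy and return all ancestors of a taxon, nearest first."""
    chain = []
    while taxon in parent:
        taxon = parent[taxon]
        chain.append(taxon)
    return chain


def keyword_search(term, records):
    """Plain text matching: the term must literally occur in the record text."""
    term = term.lower()
    return [r["id"] for r in records
            if term in (r["taxon"] + " " + r["notes"]).lower()]


def semantic_search(term, records, parent):
    """Return records whose taxon is the term itself or one of its descendants."""
    return [r["id"] for r in records
            if r["taxon"] == term or term in ancestors(r["taxon"], parent)]


print(keyword_search("Felidae", RECORDS))           # only the note that mentions the word
print(semantic_search("Felidae", RECORDS, PARENT))  # the two records classified under Felidae
```

In this toy data set, the keyword query returns one irrelevant record and misses both cats, while the taxonomy-aware query returns exactly the records classified under the requested family.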
Abstract: The Web has become one of the main sources of biodiversity information. An increasing number of biodiversity research institutions add new specimens and their related information to their biological collections and make this information available on the Web. However, the mechanisms currently available provide insufficient provenance for biodiversity information. In this paper, we propose a new biodiversity provenance model that extends the W3C PROV Data Model. Biodiversity data is mapped to terms from relevant ontologies, such as Dublin Core and GeoSPARQL, stored in triple stores and queried using SPARQL endpoints. Additionally, we provide a use case that uses our provenance model to enrich collection data.
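As a rough illustration of what provenance for a specimen record might look like, the sketch below builds a few triples using only core W3C PROV-O terms (`prov:Entity`, `prov:Activity`, `prov:Agent`, `prov:wasGeneratedBy`, `prov:wasAssociatedWith`, `prov:wasAttributedTo`). It does not reproduce the authors' extended model, whose classes are not given in the abstract; the `http://example.org/` identifiers and the function name are hypothetical.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace for illustration


def specimen_provenance(specimen_id, activity_id, agent_id):
    """Return (subject, predicate, object) triples linking a specimen record
    to the collecting activity that produced it and the responsible agent."""
    specimen, activity, agent = EX + specimen_id, EX + activity_id, EX + agent_id
    return [
        (specimen, RDF_TYPE, PROV + "Entity"),
        (activity, RDF_TYPE, PROV + "Activity"),
        (agent, RDF_TYPE, PROV + "Agent"),
        (specimen, PROV + "wasGeneratedBy", activity),
        (activity, PROV + "wasAssociatedWith", agent),
        (specimen, PROV + "wasAttributedTo", agent),
    ]


def to_ntriples(triples):
    """Serialize IRI-only triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = specimen_provenance("specimen/42", "collecting/2019-07", "collector/jane")
print(to_ntriples(triples))
```

Output in this form could be loaded into a triple store and queried through a SPARQL endpoint, in line with the storage approach the abstract describes.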
I would like to thank Erik for his unconditional support of my work. I would like to acknowledge the researchers from the Data Science Laboratory for their feedback and collaboration on my research. I would also like to acknowledge the Laboratory of Molecular Biodiversity and Conservation of the Federal University of São Carlos for their help in discovering issues and for their useful suggestions for this thesis. Many friends and family members supported me during my PhD. First and foremost, I would like to thank my mom Florencia, my siblings Josimar and Stefany, my boyfriend Evan, my sister-in-law Carolina, and my niece Miranda for their constant love and support. I also thank my friends from Perú, Brazil and Belgium. I am lucky to have met friends from different countries. Thank you to all of them for their support.