A wealth of detailed data on geological topics and geoscience knowledge is buried in the geoscience literature and rarely used. Named entity recognition (NER) presents both an opportunity and a challenge for leveraging this data for analysis and further information extraction. Existing NER models and techniques are mainly rule‐based or supervised, and developing such systems requires costly manual effort. In this paper, we first design a generic stepwise framework for domain‐specific NER. Following this framework, domain‐specific entities and domain‐general words are collected and selected as seed terms. Normalization and grouping are then applied to these seed terms for further analysis. A random extraction algorithm based on a unigram language model is used to generate a large‐scale training data set consisting of probabilistically labeled pseudosentences. Each generated sentence is then used as input to the self‐training and learning algorithm. Experimental results on two constructed data sets demonstrate that the proposed model effectively recognizes geological named entities.
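The pseudosentence generation step can be illustrated with a minimal sketch. It assumes that seed terms carry raw frequency counts and a single label each; the seed terms, counts, label names, and function names below are hypothetical and are not taken from the paper. The paper describes probabilistic labels, whereas this sketch simplifies by copying each term's label directly from the seed dictionary.

```python
import random

def build_unigram_model(seed_counts):
    """Convert raw seed-term counts into unigram probabilities."""
    total = sum(seed_counts.values())
    return {term: count / total for term, count in seed_counts.items()}

def generate_pseudosentence(unigram, labels, length, rng):
    """Sample `length` seed terms independently from the unigram
    distribution and pair each sampled term with its label."""
    terms = list(unigram)
    weights = [unigram[t] for t in terms]
    sampled = rng.choices(terms, weights=weights, k=length)
    return [(term, labels[term]) for term in sampled]

# Hypothetical seed terms: domain-specific entities plus domain-general words.
seed_counts = {"granite": 5, "fault": 3, "Jurassic": 2, "the": 10, "of": 8}
labels = {
    "granite": "ROCK",
    "fault": "STRUCTURE",
    "Jurassic": "AGE",
    "the": "O",  # "O" marks a non-entity token (assumed convention)
    "of": "O",
}

unigram = build_unigram_model(seed_counts)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
corpus = [generate_pseudosentence(unigram, labels, 6, rng) for _ in range(3)]

for sentence in corpus:
    print(sentence)
```

Because every token is drawn independently, the generated sequences are not grammatical; they only provide labeled token streams of the kind a self-training algorithm could start from.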
is used to capture the abundant word-level, grammatical-structure, and semantic features in sentences. The self-learning strategy, assisted by domain knowledge, automatically constructs the domain training corpus without manual intervention. A set of experiments verifies the effectiveness of the proposed method on an available, manually constructed hybrid dataset.
Spatial relation extraction (e.g., of topological, directional, and distance relations) from natural language descriptions is a fundamental but challenging task in several practical applications. Current state‐of‐the‐art methods rely on rule‐based metrics, either those specifically developed for extracting spatial relations or those integrated in methods that combine multiple metrics. However, these methods all depend on predefined rules and do not effectively capture the characteristics of natural language spatial relations, because the descriptions may be heterogeneous, vague, and context sparse. In this article, we present a spatially oriented piecewise convolutional neural network (SP‐CNN) that is specifically designed with these linguistic issues in mind. Our method extends a general piecewise convolutional neural network with a set of improvements designed to tackle the task of spatial relation extraction. We also propose an automated workflow for generating training datasets by integrating new sentences with those in a knowledge base, based on string similarity and semantic similarity, and then transforming the sentences into training data. We exploit a spatially oriented channel that uses prior human knowledge to automatically match words and interpret the linguistic clues to spatial relations, which leads to an extraction decision. We present both the qualitative and quantitative performance of the proposed methodology using a large dataset collected from Wikipedia. The experimental results demonstrate that the supervised SP‐CNN significantly outperforms current state‐of‐the‐art methods on the constructed datasets.
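The string-similarity part of the training-data workflow can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the knowledge-base sentences, relation names, threshold value, and function names are hypothetical, the similarity measure is the standard library's `difflib.SequenceMatcher` ratio, and the semantic-similarity component mentioned in the abstract is omitted.

```python
from difflib import SequenceMatcher

def string_similarity(a, b):
    """Character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def label_from_knowledge_base(sentence, knowledge_base, threshold=0.6):
    """Return (relation, score) of the most similar knowledge-base
    sentence, or (None, score) if no match reaches the threshold."""
    best_label, best_score = None, 0.0
    for kb_sentence, relation in knowledge_base:
        score = string_similarity(sentence, kb_sentence)
        if score > best_score:
            best_label, best_score = relation, score
    if best_score >= threshold:
        return best_label, best_score
    return None, best_score

# Hypothetical knowledge base of (sentence, spatial relation) pairs.
knowledge_base = [
    ("The lake lies north of the city.", "directional"),
    ("The park is inside the old town.", "topological"),
    ("The station is two kilometers from the harbor.", "distance"),
]

exact = label_from_knowledge_base("The lake lies north of the city.", knowledge_base)
near = label_from_knowledge_base("The lake lies north of the town.", knowledge_base)
none = label_from_knowledge_base("12345", knowledge_base)
print(exact, near, none)
```

A new sentence that inherits a relation this way could then be converted into a training instance; sentences below the threshold would be left unlabeled.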
Unlike English and other Western languages, Chinese does not delimit words with whitespace. Chinese word segmentation (CWS) is therefore a crucial first step in natural language processing. For the geoscience domain, however, the CWS problem remains unresolved and poses many challenges. Although traditional methods can be used to process geoscience documents, they lack the domain knowledge needed for massive geoscience document collections. These challenges motivated us to build a segmenter specifically for the geoscience domain. Currently, most state-of-the-art methods for CWS are based on supervised learning, with features mostly extracted from a local context. In this paper, we propose a framework for sequence learning that incorporates cyclic self-learning corpus training. Following this framework, we build GeoSegmenter, based on the bidirectional long short-term memory (Bi-LSTM) network model, to perform Chinese word segmentation. The model benefits substantially from iterating over the training data. Experimental results on geoscience documents and benchmark datasets show that GeoSegmenter can segment geological documents and can also handle generic documents.
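Treating segmentation as sequence learning means assigning one tag to each character. The abstract does not state which tagging scheme GeoSegmenter uses; the sketch below assumes the common B/M/E/S scheme (begin, middle, end, single-character word), and the function names and example words are hypothetical.

```python
def words_to_tags(words):
    """Convert a segmented sentence into one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Rebuild words from characters and their predicted tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # keep a trailing, unclosed word
        words.append(current)
    return words

# Hypothetical segmented geoscience phrase.
words = ["花岗岩", "分布", "于", "矿区"]
chars = "".join(words)
tags = words_to_tags(words)
print(list(zip(chars, tags)))
print(tags_to_words(chars, tags))
```

A sequence model such as a Bi-LSTM would be trained to predict the tag list from the character list, and `tags_to_words` would turn its predictions back into a segmentation.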
Geological reports are frequently used by geologists involved in geological surveys and scientific research to record the results and outcomes of geological surveys. In such a rich data source, a substantial amount of knowledge has yet to be mined and analyzed. This paper focuses on automatic information extraction from geological reports, namely, geological named entity recognition. Geological named entity recognition plays an important role in data mining, knowledge discovery, and knowledge graph construction. Existing general named entity recognition models and tools are limited in the geoscience domain because of the language irregularities associated with geological text, such as informal sentence structures, numerous geoscience‐specific words, long character sequences, and multiple combinations of independent words. We present BERT‐BiGRU‐CRF (bidirectional encoder representations from transformers, bidirectional gated recurrent unit network, conditional random field), a deep learning‐based geological named entity recognition model designed specifically with these linguistic irregularities in mind. The integrated model uses the pretrained BERT language model to obtain character vectors rich in semantic information, which compensates for the lack of specificity of static word vectors (e.g., word2vec) and improves the extraction of complex geological entities. We evaluate the proposed model on four test datasets, including a geoscience NER data set from regional geological reports, and compare its performance with that of five baseline models.
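The role of the CRF layer at the end of such a pipeline can be illustrated with a small Viterbi decoder. This is a generic sketch, not the authors' implementation: the tag set, the emission scores (which an encoder such as BERT plus BiGRU would supply in a real model), and the transition scores are all hypothetical toy values.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions:   list of {tag: score}, one dict per character position
    transitions: {(previous_tag, current_tag): score}
    """
    # best[t][tag] is the best score of any path ending in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag] is the previous tag on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

tags = ["B", "I", "O"]
# Toy transition scores: an entity cannot continue (I) right after O.
transitions = {(p, c): 0.0 for p in tags for c in tags}
transitions[("O", "I")] = -10.0

# Toy per-position emission scores for a three-character input.
emissions = [
    {"B": 0.0, "I": 0.0, "O": 2.0},
    {"B": 1.0, "I": 1.2, "O": 0.0},
    {"B": 0.0, "I": 2.0, "O": 0.5},
]

greedy = [max(tags, key=lambda tag: e[tag]) for e in emissions]
decoded = viterbi_decode(emissions, transitions, tags)
print(greedy)   # per-position argmax, ignores transitions
print(decoded)  # globally best sequence under the transition scores
```

In this toy case the per-position argmax yields the invalid sequence O, I, I, while decoding with the transition scores yields O, B, I. This is the kind of label-consistency constraint a CRF layer contributes on top of the encoder's scores.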