Expert Systems, computer programs that capture human expertise and mimic experts' reasoning, can support the design of future space missions by assimilating and facilitating access to accumulated knowledge. To organise these data, such a virtual assistant needs to understand the concepts characterising space systems engineering. In other words, it needs an ontology of space systems. Unfortunately, there is currently no official European space systems ontology. Developing an ontology is a lengthy and tedious process, involving several human domain experts, and is therefore prone to human error and subjectivity. Could the foundations of an ontology instead be semi-automatically extracted from unstructured data related to space systems engineering? This paper presents an implementation of the first layers of the Ontology Learning Layer Cake, an approach to semi-automatically generate an ontology. Candidate entities and synonyms are extracted from three corpora: a set of 56 feasibility reports provided by the European Space Agency, 40 publicly available books on space mission design, and a collection of 273 Wikipedia pages. Lexica of relevant space systems entities are semi-automatically generated with three different methods: a frequency analysis, a term frequency-inverse document frequency analysis, and Weirdness Index filtering. The frequency-based lexicon of the combined corpora is then fed to a word embedding method, word2vec, to learn the context of each entity. Concepts with similar contexts are then matched through a cosine similarity analysis.
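The two scoring steps named above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the Weirdness Index is computed as the standard ratio of a term's normalised frequency in the domain corpus to its normalised frequency in a general reference corpus, and cosine similarity compares two word2vec-style embedding vectors. All counts and vectors below are made-up examples.

```python
from math import sqrt

def weirdness_index(domain_count, domain_total, general_count, general_total):
    """Ratio of normalised domain-corpus frequency to normalised
    general-corpus frequency; high values flag domain-specific terms."""
    if general_count == 0:
        return float("inf")  # term never seen in the general corpus
    return (domain_count / domain_total) / (general_count / general_total)

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors,
    used to match concepts with similar contexts."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical counts: "payload" is ~1000x more frequent (relatively)
# in the space corpus than in the general corpus.
print(weirdness_index(50, 1_000, 5, 100_000))   # -> 1000.0
print(cosine_similarity([1.0, 0.0], [0.6, 0.8]))
```

In practice the embedding vectors would come from a trained word2vec model (e.g. gensim's `Word2Vec`), and terms would be kept in the lexicon only when their Weirdness Index exceeds a chosen threshold.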
The Transformer architecture and transfer learning have radically reshaped the Natural Language Processing (NLP) landscape, enabling new applications in fields where open-source labelled datasets are scarce. Space systems engineering is a field with limited access to large labelled corpora and a need for enhanced knowledge reuse of accumulated design data. Transformer models such as the Bidirectional Encoder Representations from Transformers (BERT) and the Robustly Optimised BERT Pretraining Approach (RoBERTa) are, however, trained on general corpora. To address the need for domain-specific contextualised word embeddings in the space field, we propose SpaceTransformers, a novel family of three models, SpaceBERT, SpaceRoBERTa and SpaceSciBERT, further pre-trained from BERT, RoBERTa and SciBERT respectively on our domain-specific corpus. We collect and label a new dataset of space systems concepts based on space standards. We fine-tune and compare our domain-specific models to their general counterparts on a domain-specific Concept Recognition (CR) task. Our study demonstrates that the models further pre-trained on a space corpus outperform their respective baselines on the Concept Recognition task, with SpaceRoBERTa achieving a significantly higher overall ranking.