The CL-SciSumm Shared Task at the EMNLP 2020 Workshop consists of three subtasks on automatic summarization of research papers. This paper introduces the systems for Task 1A and Task 1B submitted by team NLP-PINGAN-TECH. Task 1A is to identify the cited text spans in the reference paper, and Task 1B is to determine the discourse facet of the cited text. Task 1A is treated as a binary classification task over sentence pairs, and strategies based on language models are proposed. Combining contextualized embeddings with extra information is further explored. For Task 1B, pre-trained language models are fine-tuned to perform a multi-label classification task. The results show that extra information can improve the identification of cited text spans. The end-to-end trained models outperform models trained in two stages, and the averaged prediction of multiple models is more accurate than that of an individual model.
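The averaging of predictions across models can be illustrated with a minimal sketch. It assumes each model has already scored every (citing sentence, candidate reference sentence) pair with a probability that the candidate is a cited text span. The function name `average_predictions`, the 0.5 threshold, and the toy scores are illustrative assumptions, not the submitted system.

```python
from statistics import mean


def average_predictions(model_scores, threshold=0.5):
    """Average per-model probabilities for each candidate sentence.

    model_scores[m][i] is model m's probability that candidate sentence i
    is a cited text span (hypothetical inputs for illustration).
    Returns the averaged scores and the indices at or above the threshold.
    """
    if not model_scores:
        raise ValueError("need at least one model")
    n_candidates = len(model_scores[0])
    if any(len(scores) != n_candidates for scores in model_scores):
        raise ValueError("all models must score the same candidates")

    # zip(*...) groups the scores by candidate, one column per sentence.
    averaged = [mean(column) for column in zip(*model_scores)]
    selected = [i for i, p in enumerate(averaged) if p >= threshold]
    return averaged, selected


# Toy scores from three hypothetical models over three candidate sentences.
scores = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.3],
    [0.8, 0.3, 0.9],
]
averaged, selected = average_predictions(scores)
```

In this toy case the second model alone would reject the third candidate, while the averaged score keeps it, which is the kind of disagreement that averaging is meant to smooth out.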
BACKGROUND
As the number of biomedical scientific publications grows, it is of great value to characterize the research status of subtopics in this field, especially for specific diseases. However, no fully automated pipeline yet exists for mining and analysing research hotspots in this field.
OBJECTIVE
We propose a fully automatic method based on natural language processing technology to analyse scientific innovations in a specific disease area.
METHODS
The pipeline consists of three steps: keyphrase extraction, clustering, and cluster naming. It extends existing literature analysis methods (including keyphrase extraction, document clustering, and paper ranking), adds advanced semantic mining technology (contextualized embeddings from pre-trained language models), and introduces a document cluster naming strategy based on core document mining and topic-related phrase mining. With this pipeline, a full picture of the research field of a specific disease is established. Distinct document clusters are generated to describe the various subfields of disease-related research. Core documents and topic-related phrases are used to name the clusters and to convey the questions that researchers focus on. In addition, the relations between clusters are analysed. Finally, several important clusters are analysed in detail: their core citation paths illustrate the research roadmap of a subfield, and their phrases directly describe the hotspots in that subfield.
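The cluster naming step can be sketched as follows. This is a minimal illustration that assumes keyphrases and cluster assignments are already available, and it swaps in a simple TF-IDF-style score computed at the cluster level as a stand-in for topic-related phrase mining. The function `name_clusters`, the scoring formula, and the toy documents are assumptions for illustration, not the method described here.

```python
import math
from collections import Counter, defaultdict


def name_clusters(doc_phrases, doc_cluster, top_k=2):
    """Pick the top_k most cluster-specific phrases for each cluster.

    doc_phrases: dict mapping document id -> list of extracted keyphrases.
    doc_cluster: dict mapping document id -> cluster id.
    Score (assumed): count in cluster * log((1 + #clusters) / #clusters
    containing the phrase), so phrases shared by many clusters rank lower.
    """
    # Count how often each phrase occurs within each cluster.
    cluster_counts = defaultdict(Counter)
    for doc_id, phrases in doc_phrases.items():
        cluster_counts[doc_cluster[doc_id]].update(phrases)

    # Count in how many clusters each phrase appears at least once.
    n_clusters = len(cluster_counts)
    cluster_freq = Counter()
    for counts in cluster_counts.values():
        cluster_freq.update(counts.keys())

    names = {}
    for cluster, counts in cluster_counts.items():
        scored = {
            phrase: count * math.log((1 + n_clusters) / cluster_freq[phrase])
            for phrase, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for stability.
        ranked = sorted(scored, key=lambda p: (-scored[p], p))
        names[cluster] = ranked[:top_k]
    return names


# Toy input: four hypothetical documents in two clusters.
doc_phrases = {
    "d1": ["cataract formation", "lens opacity", "cataract extraction"],
    "d2": ["cataract formation", "oxidative stress"],
    "d3": ["femtosecond laser", "cataract extraction"],
    "d4": ["femtosecond laser", "capsulotomy", "cataract extraction"],
}
doc_cluster = {"d1": 0, "d2": 0, "d3": 1, "d4": 1}
names = name_clusters(doc_phrases, doc_cluster)
```

Here "cataract extraction" appears in both clusters, so it is down-weighted and does not end up in either name even though it is frequent overall.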
RESULTS
We applied the method to the field of cataracts. From 35,117 cataract publications, the proposed method extracted high-frequency phrases such as cataract extraction, cataract formation, and intraocular pressure. The method also identified the most important documents in this field, which reveal how research hotspots have shifted over time. A total of 23 communities were generated, and the top 10 topic-related phrases and core documents were extracted to name them. The cluster with the most papers is mainly about cataract formation. The cluster with the most high-impact papers focuses on common cataract diseases and related cataract epidemiology surveys. The cluster with the highest novelty and the highest progressiveness is related to the femtosecond laser technique.
CONCLUSIONS
This fully automated method can produce a full picture of the research status of a specific disease field without expert annotation.