Many studies pre-trained BERT models on biomedical literature (Lee et al., 2020; Beltagy et al., 2019) or clinical notes (Alsentzer et al., 2019; Peng et al., 2019) to develop domain-specific language models, and these studies showed that domain-specific models generally outperform off-the-shelf models on varied clinical NLP tasks, such as clinical NER (Yang et al., 2020b; Greenspan et al., 2020), relation extraction, sentence similarity (Peng et al., 2019), negation detection (Lin et al., 2020), and concept normalization. However, for clinical text classification, which generally requires a series of clinical notes as input (e.g., automatic ICD coding, clinical outcome prediction), BERT does not always perform well, likely because of its computational cost and fixed-length input constraint (Li and Yu, 2020; Makarenkov and Rokach, 2020). Our work is also built on top of Transformers, with an emphasis on effectively representing document sequences, such as all of a patient's clinical notes from an inpatient visit.
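To make the fixed-length constraint concrete, below is a minimal sketch, not the method of this paper or of any cited work, of the common chunk-and-pool workaround: a long concatenation of notes exceeds BERT's 512-token limit, so it is split into overlapping windows that are encoded separately and then pooled. The checkpoint name (Bio_ClinicalBERT), the window/stride values, and the mean-pooling step are all illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint: any BERT-style model exhibits the same 512-token limit.
MODEL_NAME = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

# Stand-in for all notes from one inpatient visit, concatenated.
notes = " ".join(["Patient admitted with chest pain and dyspnea."] * 300)

token_ids = tokenizer(notes, add_special_tokens=False)["input_ids"]
print(f"total tokens: {len(token_ids)}")  # far beyond BERT's 512-token limit

# Split into overlapping windows of 510 tokens (leaving room for [CLS]/[SEP]).
# Window and stride sizes are illustrative, not values from the paper.
window, stride = 510, 384
chunks = [token_ids[i:i + window] for i in range(0, len(token_ids), stride)]

chunk_vecs = []
with torch.no_grad():
    for chunk in chunks:
        ids = torch.tensor([[tokenizer.cls_token_id, *chunk, tokenizer.sep_token_id]])
        out = model(input_ids=ids)
        chunk_vecs.append(out.last_hidden_state[:, 0])  # [CLS] vector per chunk

# Naive mean-pooling over chunk vectors as a document representation;
# hierarchical models replace this step with a learned aggregator.
doc_vec = torch.cat(chunk_vecs).mean(dim=0)
print(doc_vec.shape)  # hidden size, e.g., torch.Size([768])
```

The final mean-pooling step is exactly where such a naive pipeline loses information across chunks, which is the gap that learned aggregation over document sequences aims to close.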