“…We conduct experiments on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018). We compare our method with the baseline methods on two single-sentence classification tasks (CoLA (Warstadt et al., 2018) and SST-2 (Socher et al., 2013)), two similarity and paraphrase tasks (MRPC (Dolan & Brockett, 2005) and QQP (Chen et al., 2018)), and three inference tasks (MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), and RTE (Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009)). We report accuracy for MNLI, QNLI, QQP, SST-2, and RTE, F1 for MRPC, and the Matthews correlation coefficient for CoLA.…”
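
The per-task metrics named in the excerpt can be illustrated with a minimal, dependency-free sketch. This is not the authors' evaluation code. The names `TASK_METRIC`, `accuracy`, `f1`, `matthews`, and `score` are hypothetical, and the sketch assumes binary labels encoded as 0/1 with 1 as the positive class for F1 and the Matthews correlation coefficient. Accuracy works for any label set, including the three-class MNLI.

```python
import math

# Task-to-metric mapping as stated in the excerpt (dictionary name is hypothetical).
TASK_METRIC = {
    "MNLI": "accuracy",
    "QNLI": "accuracy",
    "QQP": "accuracy",
    "SST-2": "accuracy",
    "RTE": "accuracy",
    "MRPC": "f1",
    "CoLA": "matthews",
}


def _confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1(y_true, y_pred):
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    tp, _, fp, fn = _confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews(y_true, y_pred):
    """Binary Matthews correlation coefficient; returns 0.0 when undefined."""
    tp, tn, fp, fn = _confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def score(task, y_true, y_pred):
    """Score predictions with the metric the excerpt reports for the given task."""
    metrics = {"accuracy": accuracy, "f1": f1, "matthews": matthews}
    return metrics[TASK_METRIC[task]](y_true, y_pred)
```

For example, `score("CoLA", labels, predictions)` would apply the Matthews correlation coefficient, while `score("RTE", labels, predictions)` would apply plain accuracy. In practice an established library implementation is preferable to this hand-rolled sketch.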