“…Following prior work (Kim et al., 2019a; Shen et al., 2018, 2019; Cao et al., 2020), we remove punctuation and collapse unary chains before evaluation, and we calculate F1 ignoring trivial spans, i.e., single-word spans and whole-sentence spans. We perform the averaging at the sentence level (macro average) rather than the span level (micro average): we compute F1 for each sentence and then average over all sentences. We also mention the oracle […].

Model                                   Scores
(Shen et al., 2019)                     47.7   49.4   63.9   –
Tree Transformer† (Wang et al., 2019)   50.5   52.0   66.2   –
Neural PCFG† (Kim et al., 2019a)        50.8   52.6   64.6   –
DIORA (Drozdov et al., 2019)            –      58.9   60.5   –
Compound PCFG† (Kim et al., 2019a)      55.2   60.1   70.5   –
S-DIORA† (Drozdov et al., 2020)         57.6   64.0   71.8   –
Constituency Test (Cao et al., 2020)    62.8   65.9   […]

We follow Kim et al. (2019a) and take the baseline numbers of certain models from Kim et al. (2019a) and Cao et al. (2020). † denotes models trained without punctuation and ‡ denotes models trained on additional data.…”
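The sentence-level (macro-averaged) span F1 described above can be sketched as follows. This is an illustrative reconstruction, not the cited authors' evaluation code: the function names and the `(start, end)` exclusive-end span encoding are assumptions.

```python
# Sketch of the evaluation protocol: unlabeled span F1 per sentence,
# ignoring trivial spans, then macro-averaged over sentences.
# Spans are (start, end) with `end` exclusive; names are illustrative.

def nontrivial_spans(spans, sent_len):
    """Drop single-word spans and the whole-sentence span."""
    return {(i, j) for (i, j) in spans
            if j - i > 1 and not (i == 0 and j == sent_len)}

def sentence_f1(pred_spans, gold_spans, sent_len):
    """Unlabeled F1 between predicted and gold span sets for one sentence."""
    pred = nontrivial_spans(pred_spans, sent_len)
    gold = nontrivial_spans(gold_spans, sent_len)
    if not pred and not gold:   # no non-trivial spans to score
        return 1.0
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    p = overlap / len(pred)
    r = overlap / len(gold)
    return 2 * p * r / (p + r)

def corpus_f1(corpus):
    """Macro average: F1 per sentence first, then mean over sentences."""
    scores = [sentence_f1(pred, gold, n) for pred, gold, n in corpus]
    return sum(scores) / len(scores)
```

Note that the micro-averaged alternative would instead pool span counts over the whole corpus before computing a single F1; the macro average used here weights every sentence equally regardless of length.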