“…We observe that even on PTB, there is enough variation in setups across prior work on grammar induction to render a meaningful comparison difficult. Some important dimensions along which prior works vary include: (1) lexicalization: earlier work on grammar induction generally assumed gold (or induced) part-of-speech tags (Klein and Manning, 2004; Smith and Eisner, 2004; Bod, 2006; Snyder et al., 2009), while more recent works induce grammars directly from words (Spitkovsky et al., 2013; Shen et al., 2018); (2) use of punctuation: even among papers that induce a grammar directly from words, some employ heuristics based on punctuation, since punctuation is usually a strong signal for the start/end of constituents (Seginer, 2007; Ponvert et al., 2011; Spitkovsky et al., 2013), some train with punctuation (Jin et al., 2018; Drozdov et al., 2019; Kim et al., 2019), while others discard punctuation altogether for training (Shen et al., 2018, 2019); (3) train/test data: some works do not explicitly separate out train/test sets (Reichart and Rappoport, 2010; Golland et al., 2012), while others do (Huang et al., 2012; Parikh et al., 2014; Htut et al., 2018). Maintaining train/test splits is less of an issue for unsupervised structure learning; however, in this work we follow the latter approach and separate train/test data.…”