“…For this reason, EM is typically only used to train log-linear model weights when Z(θ) = 1, e.g., for hidden Markov models, probabilistic context-free grammars, and models composed of locally-normalized log-linear models (Berg-Kirkpatrick et al, 2010), among others. There have been efforts at approximating the summation over elements of X, whether by limiting sequence length (Haghighi and Klein, 2006), only summing over observations in the training data (Riezler, 1999), restricting the observation space based on the task , or using Gibbs sampling to obtain an unbiased sample of the full space (Della Pietra et al, 1997;Rosenfeld, 1997).…”