We present a joint model of three core tasks in the entity analysis stack: coreference resolution (within-document clustering), named entity recognition (coarse semantic typing), and entity linking (matching to Wikipedia entities). Our model is formally a structured conditional random field. Unary factors encode local features from strong baselines for each task. We then add binary and ternary factors to capture cross-task interactions, such as the constraint that coreferent mentions have the same semantic type. On the ACE 2005 and OntoNotes datasets, we achieve state-of-the-art results for all three tasks. Moreover, joint modeling improves performance on each task over strong independent baselines.
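A minimal sketch of the idea of combining per-task unary factors with cross-task consistency factors, under assumptions (the names and the toy scoring scheme below are illustrative, not the paper's actual factor graph or features):

```python
# Score one joint assignment over mentions: NER unary factors plus
# entity-linking unary factors, plus a binary cross-task factor that
# rewards coreferent mentions sharing the same semantic type.

def score_joint_assignment(ner_unary, link_unary, ner_tags, links,
                           coref_pairs, consistency_bonus=1.0):
    total = 0.0
    for m, tag in enumerate(ner_tags):
        total += ner_unary[m][tag]          # NER unary factor
        total += link_unary[m][links[m]]    # entity-linking unary factor
    for i, j in coref_pairs:                # cross-task consistency factor
        if ner_tags[i] == ner_tags[j]:
            total += consistency_bonus
    return total

# Toy example: two mentions, mention 1 coreferent with mention 0.
ner_unary = [{"PER": 2.0, "ORG": 0.5}, {"PER": 0.4, "ORG": 0.6}]
link_unary = [{"Barack_Obama": 1.5}, {"Barack_Obama": 1.2}]
print(score_joint_assignment(ner_unary, link_unary,
                             ner_tags=["PER", "PER"],
                             links=["Barack_Obama", "Barack_Obama"],
                             coref_pairs=[(0, 1)]))
```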
We present a discriminative model for single-document summarization that integrally combines compression and anaphoricity constraints. Our model selects textual units to include in the summary based on a rich set of sparse features whose weights are learned on a large corpus. We allow for the deletion of content within a sentence when that deletion is licensed by compression rules; in our framework, these are implemented as dependencies between subsentential units of text. Anaphoricity constraints then improve cross-sentence coherence by guaranteeing that, for each pronoun included in the summary, the pronoun's antecedent is included as well or the pronoun is rewritten as a full mention. When trained end-to-end, our final system outperforms prior work on both ROUGE and human judgments of linguistic quality.
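The two constraint types can be illustrated with a small validity check. This is a hedged sketch, not the paper's ILP formulation; the function and variable names are assumptions:

```python
# A candidate summary is valid only if (a) every kept unit's compression
# parent is also kept, and (b) every kept pronoun either has a kept unit
# containing its antecedent or has been rewritten as a full mention.

def is_valid_summary(selected, parent, pronoun_antecedents, rewritten=()):
    """selected: set of kept unit ids.
    parent[u]: compression parent of unit u (None for sentence roots).
    pronoun_antecedents[u]: set of units containing u's antecedent."""
    for u in selected:
        p = parent.get(u)
        if p is not None and p not in selected:
            return False  # kept a child while deleting its parent
    for u, antecedents in pronoun_antecedents.items():
        if u in selected and u not in rewritten:
            if not antecedents & selected:
                return False  # kept a pronoun without its antecedent
    return True

# Toy example: unit 2 hangs off unit 1; unit 3's pronoun resolves to unit 1.
parent = {1: None, 2: 1, 3: None}
print(is_valid_summary({1, 3}, parent, {3: {1}}))   # True
print(is_valid_summary({2, 3}, parent, {3: {1}}))   # False: parent of 2 dropped
```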
A key challenge in entity linking is making effective use of contextual information to disambiguate mentions that might refer to different entities in different contexts. We present a model that uses convolutional neural networks to capture semantic correspondence between a mention's context and a proposed target entity. These convolutional networks operate at multiple granularities to exploit various kinds of topic information, and their rich parameterization gives them the capacity to learn which n-grams characterize different topics. We combine these networks with a sparse linear model to achieve state-of-the-art performance on multiple entity linking datasets, outperforming the prior systems of Durrett and Klein (2014) and Nguyen et al. (2014).
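A rough sketch of the scoring intuition, with simplifying assumptions (single filter, average-pooled n-gram windows, random toy embeddings); the paper's actual convolutional architecture differs:

```python
# Compare context representations at two granularities (sentence and
# document) against a candidate entity embedding via cosine similarity.

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def conv_pool(embs, filt, n=3):
    """Toy convolution: score each n-gram window with one filter and
    max-pool, returning the best-scoring window's mean vector."""
    windows = [embs[i:i + n].mean(axis=0) for i in range(len(embs) - n + 1)]
    scores = [w @ filt for w in windows]
    return windows[int(np.argmax(scores))]

rng = np.random.default_rng(0)
dim = 50
sentence = rng.normal(size=(12, dim))     # mention's sentence context
document = rng.normal(size=(80, dim))     # wider document context
entity = rng.normal(size=dim)             # candidate entity embedding
filt = rng.normal(size=dim)

# Combine similarities from the two granularities into one linking score.
score = cosine(conv_pool(sentence, filt), entity) + \
        cosine(conv_pool(document, filt), entity)
print(round(score, 3))
```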
Pre-trained Transformers are now ubiquitous in natural language processing, but despite their high end-task performance, little is known empirically about whether they are calibrated. Specifically, do these models' posterior probabilities provide an accurate empirical measure of how likely the model is to be correct on a given example? We focus on BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) in this work, and analyze their calibration across three tasks: natural language inference, paraphrase detection, and commonsense reasoning. For each task, we consider in-domain as well as challenging out-of-domain settings, where models face more examples they should be uncertain about. We show that: (1) when used out-of-the-box, pre-trained models are calibrated in-domain, and compared to baselines, their calibration error out-of-domain can be as much as 3.5× lower; (2) temperature scaling is effective at further reducing calibration error in-domain, and using label smoothing to deliberately increase empirical uncertainty helps calibrate posteriors out-of-domain.
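For concreteness, here is a small sketch of two ingredients the abstract refers to, under simple assumptions: expected calibration error computed over equal-width confidence bins, and temperature scaling applied to logits before the softmax. The toy logits and labels are invented for illustration:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T; larger T flattens the posterior."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin predictions by confidence and average |accuracy - confidence|."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    ece = 0.0
    for lo in np.linspace(0.0, 1.0, n_bins, endpoint=False):
        mask = (conf > lo) & (conf <= lo + 1.0 / n_bins)
        if mask.any():
            acc = (pred[mask] == labels[mask]).mean()
            ece += mask.mean() * abs(acc - conf[mask].mean())
    return ece

# Toy check: raising the temperature softens overconfident posteriors.
logits = np.array([[4.0, 0.0], [3.5, 0.2], [0.1, 2.5]])
labels = np.array([0, 1, 1])
for T in (1.0, 2.0):
    print(T, round(expected_calibration_error(softmax(logits, T), labels), 3))
```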
A hallmark of variational autoencoders (VAEs) for text processing is their combination of powerful encoder-decoder models, such as LSTMs, with simple latent distributions, typically multivariate Gaussians. These models pose a difficult optimization problem: there is an especially bad local optimum where the variational posterior always equals the prior and the model does not use the latent variable at all, a kind of "collapse" which is encouraged by the KL divergence term of the objective. In this work, we experiment with another choice of latent distribution, namely the von Mises-Fisher (vMF) distribution, which places mass on the surface of the unit hypersphere. With this choice of prior and posterior, the KL divergence term now only depends on the variance of the vMF distribution, giving us the ability to treat it as a fixed hyperparameter. We show that doing so not only averts the KL collapse, but consistently gives better likelihoods than Gaussians across a range of modeling conditions, including recurrent language modeling and bag-of-words document modeling. An analysis of the properties of our vMF representations shows that they learn richer and more nuanced structures in their latent representations than their Gaussian counterparts.
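A minimal sketch of the objective's shape under this choice, with assumptions made explicit: the encoder only produces the vMF mean direction, the concentration is fixed, and `kl_constant` is a placeholder value rather than the closed-form KL:

```python
import numpy as np

def vmf_mean_direction(encoder_output):
    """Project the encoder output onto the unit hypersphere to obtain the
    vMF mean direction mu; the concentration kappa is a fixed hyperparameter."""
    return encoder_output / (np.linalg.norm(encoder_output) + 1e-8)

def neg_elbo(reconstruction_nll, kl_constant):
    # With fixed kappa, KL(q(z|x) || uniform prior on the sphere) depends only
    # on kappa and the dimensionality, so it enters the loss as a constant and
    # the posterior can never collapse onto the prior at zero cost.
    return reconstruction_nll + kl_constant

mu = vmf_mean_direction(np.array([2.0, -1.0, 0.5]))
print(np.linalg.norm(mu))              # 1.0: mu lies on the unit sphere
print(neg_elbo(85.2, kl_constant=6.0))  # toy numbers for illustration
```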
This paper describes a parsing model that combines the exact dynamic programming of CRF parsing with the rich nonlinear featurization of neural net approaches. Our model is structurally a CRF that factors over anchored rule productions, but instead of linear potential functions based on sparse features, we use nonlinear potentials computed via a feedforward neural network. Because potentials are still local to anchored rules, structured inference (CKY) is unchanged from the sparse case. Computing gradients during learning involves backpropagating an error signal formed from standard CRF sufficient statistics (expected rule counts). Using only dense features, our neural CRF already exceeds a strong baseline CRF model (Hall et al., 2014). In combination with sparse features, our system achieves 91.1 F1 on section 23 of the Penn Treebank, and more generally outperforms the best prior single parser results on a range of languages.
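An illustrative sketch of a feedforward potential over an anchored rule; the dimensions, features, and initialization here are assumptions, not the paper's network, and CKY would simply consume these scores in place of sparse linear potentials:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rules, feat_dim, hidden = 20, 8, 16
W1 = rng.normal(scale=0.1, size=(hidden, feat_dim))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(n_rules, hidden))   # one output unit per rule
b2 = np.zeros(n_rules)

def anchored_rule_potential(span_features, rule_id):
    """phi(rule, span): the output unit for rule_id given dense span features.
    The potential is still local to one anchored rule, so CKY is unchanged."""
    h = np.maximum(0.0, W1 @ span_features + b1)      # ReLU hidden layer
    return float((W2 @ h + b2)[rule_id])

span_feats = rng.normal(size=feat_dim)   # stand-in for dense span features
print(anchored_rule_potential(span_feats, rule_id=3))
```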
Recent neural network approaches to summarization are largely either selection-based extraction or generation-based abstraction. In this work, we present a neural model for single-document summarization based on joint extraction and syntactic compression. Our model chooses sentences from the document, identifies possible compressions based on constituency parses, and scores those compressions with a neural model to produce the final summary. For learning, we construct oracle extractive-compressive summaries, then learn both of our components jointly with this supervision. Experimental results on the CNN/Daily Mail and New York Times datasets show that our model achieves strong performance (comparable to state-of-the-art systems) as evaluated by ROUGE. Moreover, our approach outperforms an off-the-shelf compression module, and human and manual evaluation shows that our model's output generally remains grammatical.
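A toy sketch of the extract-then-compress decision, under assumptions (scores are given as inputs, the threshold is invented, and the names are not the paper's code):

```python
# Pick the k highest-scoring sentences, then apply any candidate compression
# whose score clears a threshold by deleting that constituent's tokens.

def summarize(sentences, sent_scores, compressions, comp_scores,
              k=2, threshold=0.5):
    """sentences: list of token lists; compressions[i]: list of token-index
    sets that may be deleted from sentence i; comp_scores[i]: their scores."""
    keep = sorted(range(len(sentences)), key=lambda i: -sent_scores[i])[:k]
    summary = []
    for i in sorted(keep):
        drop = set()
        for span, score in zip(compressions[i], comp_scores[i]):
            if score > threshold:
                drop |= span
        summary.append([t for j, t in enumerate(sentences[i]) if j not in drop])
    return summary

sents = [["The", "cat", ",", "a", "tabby", ",", "slept"], ["It", "rained"]]
print(summarize(sents, [0.9, 0.2],
                compressions=[[{2, 3, 4, 5}], []],
                comp_scores=[[0.8], []], k=1))
# -> [['The', 'cat', 'slept']]: the appositive is compressed away
```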
Neural entity linking models are very powerful, but run the risk of overfitting to the domain they are trained in. For this problem, a “domain” is characterized not just by genre of text but even by factors as specific as the particular distribution of entities, as neural models tend to overfit by memorizing properties of frequent entities in a dataset. We tackle the problem of building robust entity linking models that generalize effectively and do not rely on labeled entity linking data with a specific entity distribution. Rather than predicting entities directly, our approach models fine-grained entity properties, which can help disambiguate between even closely related entities. We derive a large inventory of types (tens of thousands) from Wikipedia categories, and use hyperlinked mentions in Wikipedia to distantly label data and train an entity typing model. At test time, we classify a mention with this typing model and use soft type predictions to link the mention to the most similar candidate entity. We evaluate our entity linking system on the CoNLL-YAGO dataset (Hoffart et al. 2011) and show that our approach outperforms prior domain-independent entity linking systems. We also test our approach in a harder setting derived from the WikilinksNED dataset (Eshel et al. 2017) where all the mention-entity pairs are unseen during test time. Results indicate that our approach generalizes better than a state-of-the-art neural model on the dataset.
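A minimal sketch of the linking step under assumptions: the typing model's soft scores over the type inventory are given, each candidate entity carries a binary type vector derived from Wikipedia categories, and the mention is linked to the most similar candidate. The toy inventory and entity names are illustrative:

```python
import numpy as np

def link(mention_type_probs, candidates):
    """candidates: dict mapping entity -> binary type vector over the same
    inventory; return the candidate most similar to the predicted types."""
    def similarity(entity):
        return float(mention_type_probs @ candidates[entity])
    return max(candidates, key=similarity)

# Toy inventory: [musician, politician, city]
mention_probs = np.array([0.1, 0.8, 0.1])   # context suggests a politician
candidates = {
    "George_Harrison_(musician)": np.array([1, 0, 0]),
    "George_Harrison_(politician)": np.array([0, 1, 0]),
}
print(link(mention_probs, candidates))   # -> the politician entity
```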