We contribute the largest publicly available dataset of naturally occurring factual claims for the purpose of automatic claim verification. It is collected from 26 fact checking websites in English, paired with textual sources and rich metadata, and labelled for veracity by human expert journalists. We present an in-depth analysis of the dataset, highlighting characteristics and challenges. Further, we present results for automatic veracity prediction, both with established baselines and with a novel method for joint ranking of evidence pages and predicting veracity that outperforms all baselines. Significant performance increases are achieved by encoding evidence, and by modelling metadata. Our best-performing model achieves a Macro F1 of 49.2%, showing that this is a challenging testbed for claim veracity prediction.
When the meaning of a phrase cannot be inferred from the individual meanings of its words (e.g., hot dog), that phrase is said to be non-compositional. Automatic compositionality detection in multiword phrases is critical in any application of semantic processing, such as search engines [9]; failing to detect non-compositional phrases can hurt system effectiveness notably. Existing research treats phrases as either compositional or non-compositional in a deterministic manner. In this paper, we operationalize the viewpoint that compositionality is contextual rather than deterministic, i.e., that whether a phrase is compositional or non-compositional depends on its context. For example, the phrase "green card" is compositional when referring to a green colored card, whereas it is non-compositional when meaning permanent residence authorization. We address the challenge of detecting this type of contextual compositionality as follows: given a multi-word phrase, we enrich the word embedding representing its semantics with evidence about its global context (terms it often collocates with) as well as its local context (narratives where that phrase is used, which we call usage scenarios). We further extend this representation with information extracted from external knowledge bases. The resulting representation incorporates both localized context and more general usage of the phrase and allows us to detect its compositionality in a non-deterministic and contextual way. Empirical evaluation of our model on a dataset of phrase compositionality, manually collected by crowdsourcing contextual compositionality assessments, shows that our model notably outperforms state-of-the-art baselines at detecting phrase compositionality.
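The core idea of scoring compositionality against context can be sketched as follows. This is our own simplified illustration, assuming naive additive composition of word vectors and a single context vector per usage scenario; the paper's actual model is considerably richer (global collocates, usage scenarios, and knowledge-base signals).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def compositionality_score(phrase_vec, word_vecs, context_vec, alpha=0.5):
    """Toy contextual compositionality score in [-1, 1].

    phrase_vec  : embedding of the whole phrase (e.g., "green card")
    word_vecs   : embeddings of its constituent words
    context_vec : embedding summarizing the current usage scenario
    alpha       : interpolation weight between phrase and context evidence

    A high score means the context-conditioned phrase representation stays
    close to the naive composition of its words (compositional usage);
    a low score suggests idiomatic (non-compositional) usage.
    """
    composed = np.mean(word_vecs, axis=0)               # naive additive composition
    contextual = alpha * phrase_vec + (1 - alpha) * context_vec
    return cosine(contextual, composed)
```

The same phrase thus receives different scores in different contexts, which is exactly the non-deterministic behavior the abstract argues for.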
Most fact checking models for automatic fake news detection are based on reasoning: given a claim with associated evidence, the models aim to estimate the claim veracity based on the supporting or refuting content within the evidence. When these models perform well, it is generally assumed to be due to the models having learned to reason over the evidence with regard to the claim. In this paper, we investigate this assumption of reasoning by exploring the relationship and importance of both claim and evidence. Surprisingly, we find on political fact checking datasets that the highest effectiveness is most often obtained by utilizing only the evidence, as the impact of including the claim is either negligible or harmful to the effectiveness. This highlights an important problem in what constitutes evidence in existing approaches for automatic fake news detection.
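The ablation described above can be illustrated with a toy experiment: train the same classifier once on claim-plus-evidence inputs and once on evidence-only inputs, then compare. The data, bag-of-words features, and logistic regression here are our own stand-ins, not the paper's datasets or models.

```python
# Toy claim/evidence ablation (illustrative data; labels: 0 = false, 1 = true).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

data = [
    ("senator cut taxes", "budget records show tax rates rose", 0),
    ("city crime doubled", "police statistics report crime fell", 0),
    ("vaccine reduces risk", "trials confirm a large risk reduction", 1),
    ("law created jobs", "employment data confirm strong job growth", 1),
]

def train_accuracy(use_claim):
    """Fit the same pipeline with or without the claim text and return
    training accuracy, mimicking the paper's claim/evidence ablation."""
    texts = [(claim + " " + evidence) if use_claim else evidence
             for claim, evidence, _ in data]
    labels = [label for _, _, label in data]
    model = make_pipeline(CountVectorizer(), LogisticRegression())
    model.fit(texts, labels)
    return model.score(texts, labels)
```

If the evidence-only variant matches or beats the claim-plus-evidence variant on a real dataset, the model is plausibly exploiting evidence-side signals rather than reasoning about the claim.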
We provide a general framework for investigating partial identification of structural dynamic discrete choice models and their counterfactuals, along with uniformly valid inference procedures. In doing so, we derive sharp bounds for the model parameters, counterfactual behavior, and low-dimensional outcomes of interest, such as the average welfare effects of hypothetical policy interventions. We characterize the properties of the sets analytically and show that when the target outcome of interest is a scalar, its identified set is an interval whose endpoints can be calculated by solving well-behaved constrained optimization problems via standard algorithms. We obtain a uniformly valid inference procedure by an appropriate application of subsampling. To illustrate the performance and computational feasibility of the method, we consider both a Monte Carlo study of firm entry/exit, and an empirical model of export decisions applied to plant-level data from Colombian manufacturing industries. In these applications, we demonstrate how the identified sets shrink as we incorporate alternative model restrictions, providing intuition regarding the source and strength of identification.
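The interval characterization of a scalar target outcome can be written out explicitly (notation here is ours, not necessarily the paper's):

```latex
% For a scalar outcome g(\theta) and identified set \Theta_I of model
% parameters, the identified set of the outcome is the interval
\left[\ \min_{\theta \in \Theta_I} g(\theta),\ \ \max_{\theta \in \Theta_I} g(\theta)\ \right],
```

where each endpoint is obtained by solving a constrained optimization problem, minimizing or maximizing $g(\theta)$ subject to the restrictions defining $\Theta_I$, which is what makes the endpoints computable by standard algorithms.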
The state of the art in learning meaningful semantic representations of words is the Transformer model and its attention mechanisms. Simply put, the attention mechanisms learn to attend to specific parts of the input, dispensing with recurrence and convolutions. While some of the learned attention heads have been found to play linguistically interpretable roles, they can be redundant or prone to errors. We propose a method to guide the attention heads towards roles identified in prior work as important. We do this by defining role-specific masks to constrain the heads to attend to specific parts of the input, such that different heads are designed to play different roles. Experiments on text classification and machine translation using 7 different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines.
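A role-specific mask of the kind described above can be sketched in a few lines of numpy. This is a minimal illustration under our own assumptions, using a hypothetical "previous token" role (each position may attend only to the token before it); function names and shapes are ours, not the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; exp(-inf) terms vanish to zero weight."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention with an additive role mask.

    mask[i, j] = 0 where position i may attend to position j,
    and -inf where it may not, so disallowed positions receive
    exactly zero attention weight after the softmax.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + mask
    return softmax(scores, axis=-1) @ V

def previous_token_mask(n):
    """Role mask forcing each position to attend only to its predecessor."""
    mask = np.full((n, n), -np.inf)
    for i in range(1, n):
        mask[i, i - 1] = 0.0
    mask[0, 0] = 0.0  # the first position has no predecessor; attend to itself
    return mask
```

Other roles (e.g., attending to syntactic heads or rare tokens) would follow the same pattern with a different mask per head, which is how one head can be designed to play one role.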
Introduction: Early gastric cancer is defined as adenocarcinoma confined to the mucosa or submucosa, regardless of lymph node involvement. When diagnosed at this stage, the five-year survival rate exceeds 90%. Objective: To characterize the clinical-epidemiological profile of patients with early gastric cancer treated at the Hospital de Referência em Oncologia de Teresina from 2004 to 2009. Method: A descriptive, retrospective, quantitative study was carried out through a review of the medical records of confirmed cases of early gastric cancer seen at the hospital from 2004 to 2009. Results: Twenty-two patients were studied, corresponding to 3.8% of all gastric cancer cases; 13 (59%) were male, and the mean age was 59.7 years. The most common symptom was epigastric pain (n=10; 38.5%). The predominant site was the gastric antrum (n=12; 54%), and the most affected layer was the submucosa (n=14; 63.6%). The most prevalent degree of differentiation was G3 (n=14; 68.2%). The most frequently recorded macroscopic types were IIc and III, with 7 (31%) cases each. Treatment was surgical, with total gastrectomy the most commonly used technique (n=13; 59%). Conclusion: Early gastric cancer is rarely diagnosed, a reality also observed in Western studies, and its profile is similar to those described in the literature.