Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and of Measuring Match Quality

Mozer, Reagan; Miratrix, Luke; Kaufman, Aaron R.; Anastasopoulos, L. Jason

doi:10.1017/pan.2020.1

Cited by 46 publications

(68 citation statements)

References 65 publications

(94 reference statements)


“…We see several opportunities for pushing forward the text-based confounding literature. We hope that scholars will extend our work to a proposed alternative to TIRM, a task already started by Mozer et al (2020) and Veitch, Sridhar, and Blei (2019). A central challenge is developing general-purpose methods for evaluating these new models.…”

Section: Results (mentioning)

confidence: 95%

“…Our approach of augmenting propensity scores with information about topic balance is most similar to the covariate-balancing propensity scores of Imai and Ratkovic (2014). Mozer et al (2020) and Veitch, Sridhar, and Blei (2019) directly build on our framework to propose alternative text adjustment approaches, and the related literature is reviewed in Keith, Jensen, and O'Connor (2020).…”

Section: Related Work (mentioning)

confidence: 99%

“…A strength of matching relative to other conditioning strategies is that analysts can evaluate the quality of 1 Our primary contribution is to pose the problem of text-based confounding and offer TIRM as one possible solution. Since our paper started circulating in July 2015, Mozer et al (2020) and Veitch, Sridhar, and Blei (2019) have introduced alternative approaches to text confounding. We hope that there will be further developments in this area.…”

mentioning

confidence: 99%


Adjusting for Confounding with Text Matching

Roberts

Stewart

Nielsen

2020

American J Political Sci


We identify situations in which conditioning on text can address confounding in observational studies. We argue that a matching approach is particularly well-suited to this task, but existing matching methods are ill-equipped to handle high-dimensional text data. Our proposed solution is to estimate a low-dimensional summary of the text and condition on this summary via matching. We propose a method of text matching, topical inverse regression matching, that allows the analyst to match both on the topical content of confounding documents and the probability that each of these documents is treated. We validate our approach and illustrate the importance of conditioning on text to address confounding with two applications: the effect of perceptions of author gender on citation counts in the international relations literature and the effects of censorship on Chinese social media users.

Verification Materials: The materials required to verify the computational reproducibility of the results, procedures, and analyses in this article are available on the American Journal of Political Science Dataverse within the Harvard Dataverse Network, at: https://doi.org/10.7910/DVN/HTMX3K.

Social media users in China are censored every day, but it is largely unknown how the experience of being censored affects their future online experience. Are social media users who are censored for the first time flagged by censors for increased scrutiny in the future? Is censorship "targeted" and "customized" toward specific users? Do social media users avoid writing after being censored? Do they continue to write on sensitive topics or do they avoid them? Experimentally manipulating censorship would allow us to make credible causal inferences about the effects of experiencing censorship, but this is impractical


“…Our method performs substantially better than methods limited to one-to-one correspondence and without using grouping information. Our methodological framework is particularly appealing because it can be extended to a wide range of applications, including confounding adjustment via text matching using text data in social science (Roberts et al 2018, Mozer et al 2018), and cross-language record linkage (Song et al 2016, McNamee et al 2011). The learned mapping matrix Π and translation matrix W have key practical value in transferring statistical models across systems (Torrey & Shavlik 2010), capturing the pose of objects (Zhou et al 2014), estimating the relative angle of proteins (Sael & Kihara 2010) and so on.…”

Section: Discussion (mentioning)

confidence: 99%

Spherical Regression Under Mismatch Corruption With Application to Automated Knowledge Translation

Shi

Cai

2020

Journal of the American Statistical Association


Motivated by a series of applications in data integration, language translation, bioinformatics, and computer vision, we consider spherical regression with two sets of unit-length vectors when the data are corrupted by a small fraction of mismatch in the response-predictor pairs. We propose a three-step algorithm in which we initialize the parameters by solving an orthogonal Procrustes problem to estimate a translation matrix W ignoring the mismatch. We then estimate a mapping matrix aiming to correct the mismatch using hard-thresholding to induce sparsity, while incorporating potential group information. We eventually obtain a refined estimate for W by removing the estimated mismatched pairs. We derive the error bound for the initial estimate of W in both fixed and high-dimensional setting. We demonstrate that the refined estimate of W achieves an error rate that is as good as if no mismatch is present. We show that our mapping recovery method not only correctly distinguishes one-to-one and one-to-many correspondences, but also consistently identifies the matched pairs and estimates the weight vector for combined correspondence. We examine the finite sample performance of the proposed method via extensive simulation studies, and with application to the unsupervised translation of medical codes using electronic health records data.
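The initialization step named in this abstract, solving an orthogonal Procrustes problem to estimate W while ignoring mismatch, can be illustrated with a minimal sketch. It assumes noise-free synthetic data with no mismatched pairs and covers only that first step, not the mismatch-correction or refinement steps. The function name `procrustes_rotation` and the simulated data are hypothetical and do not come from the paper.

```python
import numpy as np


def procrustes_rotation(X, Y):
    """Return the orthogonal W minimizing ||X @ W - Y||_F.

    Standard SVD solution: if X.T @ Y = U S V^T, then W = U V^T.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Synthetic unit-length predictors (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# A random orthogonal "translation" matrix and the responses it generates.
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
Y = X @ Q

# With no noise and no mismatch, the estimate should recover Q.
W_hat = procrustes_rotation(X, Y)
print(np.allclose(W_hat, Q))
```

When some response-predictor pairs are mismatched, this estimate is only a starting point, which is why the abstract describes further steps that identify the mismatched pairs before W is re-estimated.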


“…The second will take advantage of stochastic variational inference (Hoffman et al, 2013) to enable Bayesian Word Embeddings to scale to massive corpora. Finally, the third track for future work will involve tying the anchoring approach discussed above with the emerging literature on making causal claims from text (Fong and Grimmer, 2016; Mozer et al, 2018), and taking advantage of the word similarities to identify appropriate linguistic counterfactuals. Figure 4: There is no significant difference between foreign and international topics before and after 1945, uncertainty is displayed with 95% confidence intervals.…”

Section: Materials Conflict Events Increase In Response To Bellicosity (mentioning)

confidence: 99%

Identification, Interpretability, and

Lauretig

2019

Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science


Social scientists have recently turned to analyzing text using tools from natural language processing like word embeddings to measure concepts like ideology, bias, and affinity. However, word embeddings are difficult to use in the regression framework familiar to social scientists: embeddings are neither identified, nor directly interpretable. I offer two advances on standard embedding models to remedy these problems. First, I develop Bayesian Word Embeddings with Automatic Relevance Determination priors, relaxing the assumption that all embedding dimensions have equal weight. Second, I apply work identifying latent variable models to anchor the dimensions of the resulting embeddings, identifying them, and making them interpretable and usable in a regression. I then apply this model and anchoring approach to two cases, the shift in internationalist rhetoric in the American presidents' inaugural addresses, and the relationship between bellicosity in American foreign policy decision-makers' deliberations. I find that inaugural addresses became less internationalist after 1945, which goes against the conventional wisdom, and that an increase in bellicosity is associated with an increase in hostile actions by the United States, showing that elite deliberations are not cheap talk, and helping confirm the validity of the model.
