Modern deep learning models for natural language processing rely heavily on large amounts of annotated texts. However, obtaining such texts may be difficult when they contain personal or confidential information, for example, in health or legal domains. In this work, we propose a method of de-identifying free-form text documents by carefully redacting sensitive data in them. We show that our method preserves data utility for text classification, sequence labeling, and question answering tasks.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.