Text preprocessing is not only an essential step in preparing a corpus for modeling but also a key factor that directly affects the results of natural language processing (NLP) applications. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve by analyzing the data. Conventional text preprocessing practices generally suffice, but there are situations where preprocessing must be customized to obtain better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. We then provide examples of text datasets that require special preprocessing and describe how previous researchers handled these challenges. We expect this article to serve as a starting guideline for selecting and fine-tuning text preprocessing methods.
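As a minimal sketch of several of the steps listed above (text normalization, tokenization, stopword removal, and n-gramming), consider the following pure-Python pipeline. The stopword list and the regex tokenizer are illustrative simplifications, not the method of any particular paper; real applications would tune each step to the corpus and task.

```python
import re

# Illustrative stopword list; practical lists are much larger and task-dependent.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}

def preprocess(text):
    """Lowercase the text, tokenize it with a naive regex, and drop stopwords."""
    text = text.lower()                       # text normalization
    tokens = re.findall(r"[a-z0-9']+", text)  # naive regex tokenization (strips punctuation)
    return [t for t in tokens if t not in STOPWORDS]

def bigrams(tokens):
    """Produce n-grams with n = 2 from a token list."""
    return list(zip(tokens, tokens[1:]))

tokens = preprocess("The corpus is tokenized before modeling.")
print(tokens)           # ['corpus', 'tokenized', 'before', 'modeling']
print(bigrams(tokens))  # [('corpus', 'tokenized'), ('tokenized', 'before'), ('before', 'modeling')]
```

Each step here embodies a design choice discussed in the article: for example, the regex tokenizer discards punctuation entirely, which is often fine for topic modeling but would harm tasks where punctuation carries meaning.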
The last decade has seen great progress in both dynamic network modeling and topic modeling. This paper draws upon both areas to create a bespoke Bayesian model, applied to a dataset consisting of the top 467 US political blogs in 2012, their posts over the year, and their links to one another. Our model allows dynamic topic discovery to inform the latent network model and the network structure to facilitate topic identification. Our results reveal complex community structure within this set of blogs, where community membership depends strongly on the set of topics in which the blogger is interested. We examine the time-varying nature of the Sensational Crime topic, as well as the network properties of the Election News topic, as notable and easily interpretable empirical examples.