2020
DOI: 10.1007/s40745-020-00277-x

Sparse Principal Component Analysis for Natural Language Processing

Abstract: High dimensional data are rapidly growing in many different disciplines, particularly in natural language processing. The analysis of natural language processing requires working with high dimensional matrices of word embeddings obtained from text data. Those matrices are often sparse in the sense that they contain many zero elements. Sparse principal component analysis is an advanced mathematical tool for the analysis of high dimensional data. In this paper, we study and apply the sparse principal component a…
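As a concrete illustration of the technique the abstract describes, here is a minimal sketch of sparse PCA applied to a sparse document-term matrix, assuming the scikit-learn stack; the corpus, number of components, and penalty strength are placeholders, not values from the paper.

```python
# Minimal sketch (not the paper's code): sparse PCA on a TF-IDF matrix.
from sklearn.decomposition import SparsePCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus; any collection of documents works the same way.
corpus = [
    "sparse principal component analysis for text data",
    "word embeddings from text are high dimensional",
    "many entries of a document-term matrix are zero",
]

# TF-IDF yields a sparse matrix: most entries are exactly zero.
X = TfidfVectorizer().fit_transform(corpus)

# "Sparse" in sparse PCA refers to the L1-penalized loadings; scikit-learn's
# SparsePCA itself expects a dense array, hence the toarray() call.
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0)
Z = spca.fit_transform(X.toarray())

print(Z.shape)            # (3, 2): one 2-D representation per document
print(spca.components_)   # loading vectors with many exact zeros
```

The L1 penalty (alpha) controls how many loadings are driven to exactly zero, which is what makes each component interpretable in terms of a small set of words.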

Cited by 16 publications (14 citation statements). References 20 publications.
“…PCA, a commonly used analysis in exploratory factor analysis, is a dimensionality reduction technique used to reduce the complexity, or number of components, of data while still maintaining the data's integrity [49, 50]. For text mining analysis, all words weighted by TF-IDF and assigned to one of the k clusters are reduced to simple X and Y coordinates.…”
Section: Methods (mentioning)
confidence: 99%
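A minimal sketch of the pipeline this statement describes (TF-IDF weighting followed by k-means clustering), assuming scikit-learn; the corpus and k are illustrative, not taken from the citing study:

```python
# Hypothetical corpus: TF-IDF weights each word, k-means groups documents.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are assets",
    "bonds hedge stock risk",
]

X = TfidfVectorizer().fit_transform(corpus)   # sparse TF-IDF matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # one cluster index per document, e.g. [0 0 1 1]
```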
“…Matrices produced by the TF-IDF and k-means clustering algorithms are highly complex, multidimensional, and difficult to interpret [41]. To simplify these matrices for data visualization purposes, we apply PCA, which reduces the data to two dimensions, a common setting for data visualization in NLP analyses.…”
Section: Methods (mentioning)
confidence: 99%
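The two-dimensional reduction step could look like the following sketch (our illustration, assuming scikit-learn and matplotlib; the random matrix and labels stand in for a real TF-IDF matrix and k-means output):

```python
# Sketch: PCA to two dimensions for visualization.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 500))          # placeholder: 100 documents, 500 terms
labels = rng.integers(0, 3, 100)    # placeholder cluster assignments

coords = PCA(n_components=2).fit_transform(X)   # the X and Y coordinates

plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```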
“…Remark 6.7. In certain NLP [DL20] and biological tasks [TPK02] where d = n^c for a positive integer c, Theorem 6.6 provides a fast algorithm for which the running time depends nearly linearly on nd. We also remark that the algorithm we use for Theorem 6.6 is inspired by the idea of [BPSW21] (their situation involves c = 4).…”
Section: Kernel Linear Systems (mentioning)
confidence: 99%
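To make the remark concrete with our own numbers (not the cited paper's): if c = 4, then d = n^4 and nd = n^5, whereas forming the kernel matrix entry by entry costs on the order of n^2 · d = n^6 time, so a running time nearly linear in nd is roughly an n-fold improvement over the naive approach.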
“…Typically, applying the kernel function to each pair of data points takes O(d) time. This is especially undesirable in applications for natural language processing [DL20] and computational biology [TPK02], where d can be as large as poly(n), with n being the number of data points. To compute the kernel matrix, the algorithm does have to read the d × n input matrix.…”
Section: Introduction (mentioning)
confidence: 99%
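The cost this statement describes is easy to see in code. Below is a minimal sketch (our illustration, not the cited algorithm) of the naive kernel-matrix computation: each of the n² entries requires one O(d) kernel evaluation on the d × n input.

```python
# Naive kernel matrix: n^2 pairs, each costing O(d), i.e. Theta(n^2 * d) work.
import numpy as np

def rbf(x, y, gamma=1.0):
    # One kernel evaluation reads two length-d vectors: O(d) time.
    return np.exp(-gamma * np.sum((x - y) ** 2))

n, d = 100, 10_000                             # when d = poly(n), d dominates
X = np.random.default_rng(0).random((d, n))    # the d x n input matrix

K = np.empty((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = rbf(X[:, i], X[:, j])        # n^2 evaluations in total
```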