2020
DOI: 10.1007/s40745-020-00277-x

Sparse Principal Component Analysis for Natural Language Processing

Abstract: High dimensional data are rapidly growing in many different disciplines, particularly in natural language processing. The analysis of natural language processing requires working with high dimensional matrices of word embeddings obtained from text data. Those matrices are often sparse in the sense that they contain many zero elements. Sparse principal component analysis is an advanced mathematical tool for the analysis of high dimensional data. In this paper, we study and apply the sparse principal component a…
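As a concrete illustration of the technique the abstract describes, here is a minimal sketch of sparse PCA applied to a sparse document-term matrix, assuming the scikit-learn stack; the corpus, number of components, and penalty strength are placeholders, not values from the paper.

```python
# Minimal sketch (not the paper's code): sparse PCA on a TF-IDF matrix.
from sklearn.decomposition import SparsePCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus; any collection of documents works the same way.
corpus = [
    "sparse principal component analysis for text data",
    "word embeddings from text are high dimensional",
    "many entries of a document-term matrix are zero",
]

# TF-IDF yields a sparse matrix: most entries are exactly zero.
X = TfidfVectorizer().fit_transform(corpus)

# "Sparse" in sparse PCA refers to the L1-penalized loadings; scikit-learn's
# SparsePCA itself expects a dense array, hence the toarray() call.
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0)
Z = spca.fit_transform(X.toarray())

print(Z.shape)            # (3, 2): one 2-D representation per document
print(spca.components_)   # loading vectors with many exact zeros
```

The L1 penalty (alpha) controls how many loadings are driven to exactly zero, which is what makes each component interpretable in terms of a small set of words.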

Cited by 16 publications (14 citation statements). References 20 publications.
“…PCA, a commonly used analysis in exploratory factor analysis, is a dimensionality reduction technique used to reduce the complexity, or number of components, of data while still maintaining the data's integrity [49, 50]. For text mining analysis, all words weighted by TF-IDF and assigned to one of the k clusters are reduced to simple X and Y coordinates.…”
Section: Methods (mentioning)
confidence: 99%
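A minimal sketch of the pipeline this statement describes (TF-IDF weighting followed by k-means clustering), assuming scikit-learn; the corpus and k are illustrative, not taken from the citing study:

```python
# Hypothetical corpus: TF-IDF weights each word, k-means groups documents.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are assets",
    "bonds hedge stock risk",
]

X = TfidfVectorizer().fit_transform(corpus)   # sparse TF-IDF matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # one cluster index per document, e.g. [0 0 1 1]
```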
“…Matrices produced by the TF-IDF and k-means clustering algorithms are highly complex, multidimensional, and difficult to interpret [41]. To simplify these matrices for data visualization purposes, we apply PCA, which reduces the data to two dimensions, a common setting for data visualization in NLP analyses.…”
Section: Methods (mentioning)
confidence: 99%
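The two-dimensional reduction step could look like the following sketch (our illustration, assuming scikit-learn and matplotlib; the random matrix and labels stand in for a real TF-IDF matrix and k-means output):

```python
# Sketch: PCA to two dimensions for visualization.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 500))          # placeholder: 100 documents, 500 terms
labels = rng.integers(0, 3, 100)    # placeholder cluster assignments

coords = PCA(n_components=2).fit_transform(X)   # the X and Y coordinates

plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```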
“…Remark 6.7. In certain NLP [DL20] and biological tasks [TPK02] where d = n^c for a positive integer c, Theorem 6.6 provides a fast algorithm for which the running time depends nearly linearly on nd. We also remark that the algorithm we use for Theorem 6.6 is inspired by the idea of [BPSW21] (their situation involves c = 4).…”
Section: Kernel Linear Systems (mentioning)
confidence: 99%
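To make the remark concrete with our own numbers (not the cited paper's): if c = 4, then d = n^4 and nd = n^5, whereas forming the kernel matrix entry by entry costs on the order of n^2 · d = n^6 time, so a running time nearly linear in nd is roughly an n-fold improvement over the naive approach.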
“…Typically, applying the kernel function to each pair of data points takes O(d) time. This is especially undesirable in applications for natural language processing [DL20] and computational biology [TPK02], where d can be as large as poly(n), with n being the number of data points. To compute the kernel matrix, the algorithm does have to read the d × n input matrix.…”
Section: Introduction (mentioning)
confidence: 99%
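The cost this statement describes is easy to see in code. Below is a minimal sketch (our illustration, not the cited algorithm) of the naive kernel-matrix computation: each of the n² entries requires one O(d) kernel evaluation on the d × n input.

```python
# Naive kernel matrix: n^2 pairs, each costing O(d), i.e. Theta(n^2 * d) work.
import numpy as np

def rbf(x, y, gamma=1.0):
    # One kernel evaluation reads two length-d vectors: O(d) time.
    return np.exp(-gamma * np.sum((x - y) ** 2))

n, d = 100, 10_000                             # when d = poly(n), d dominates
X = np.random.default_rng(0).random((d, n))    # the d x n input matrix

K = np.empty((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = rbf(X[:, i], X[:, j])        # n^2 evaluations in total
```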