Yanglei Song scite author profile

Scalable topical phrase mining from text corpora

While most topic modeling algorithms model text corpora with unigrams, human interpretation often relies on inherent grouping of terms into phrases. As such, we consider the problem of discovering topical phrases of mixed lengths. Existing work either performs post-processing on the results of unigram-based topic models, or utilizes complex n-gram discovery topic models. These methods generally produce low-quality topical phrases or suffer from poor scalability on even moderately sized datasets. We propose a different approach that is both computationally efficient and effective. Our solution combines a novel phrase mining framework to segment a document into single and multi-word phrases, and a new topic model that operates on the induced document partition. Our approach discovers high-quality topical phrases with negligible extra cost to the bag-of-words topic model in a variety of datasets including research publication titles, abstracts, reviews, and news articles.

Asymptotically optimal, sequential, multiple testing procedures with prior information on the number of signals

Song

Fellouris

2017

Electron. J. Statist.

Assuming that data are collected sequentially from independent streams, we consider the simultaneous testing of multiple binary hypotheses under two general setups: when the number of signals (correct alternatives) is known in advance, and when we only have a lower and an upper bound for it. In each of these setups, we propose feasible procedures that control, without any distributional assumptions, the familywise error probabilities of both type I and type II below given, user-specified levels. Then, in the case of i.i.d. observations in each stream, we show that the proposed procedures achieve the optimal expected sample size, under every possible signal configuration, asymptotically as the two error probabilities vanish at arbitrary rates. A simulation study is presented in a completely symmetric case and supports insights obtained from our asymptotic results, such as the fact that knowledge of the exact number of signals roughly halves the expected number of observations compared to the case of no prior information. MSC 2010 subject classifications: Primary 62L10:60G40.

RelSim: Relation Similarity Search in Schema-Rich Heterogeneous Information Networks

Wang

Sun

Song

et al. 2016

Sequential multiple testing with generalized error control: An asymptotic optimality theory

Song

Fellouris

2019

Ann. Statist.

The sequential multiple testing problem is considered under two generalized error metrics. Under the first one, the probability of at least $k$ mistakes, of any kind, is controlled. Under the second, the probabilities of at least $k_1$ false positives and at least $k_2$ false negatives are simultaneously controlled. For each formulation, the optimal expected sample size is characterized, to a first-order asymptotic approximation as the error probabilities go to 0, and a novel multiple testing procedure is proposed and shown to be asymptotically efficient under every signal configuration. These results are established when the data streams for the various hypotheses are independent and each local log-likelihood ratio statistic satisfies a certain Strong Law of Large Numbers. In the special case of i.i.d. observations in each stream, the gains of the proposed sequential procedures over fixed-sample size schemes are quantified. MSC 2010 subject classifications: Primary 62L10.

Approximating high-dimensional infinite-order $U$-statistics: Statistical and computational guarantees

Song

Chen

Kato

2019

Electron. J. Statist.

We study the problem of distributional approximations to high-dimensional non-degenerate $U$-statistics with random kernels of diverging orders. Infinite-order $U$-statistics (IOUS) are a useful tool for constructing simultaneous prediction intervals that quantify the uncertainty of ensemble methods such as subbagging and random forests. A major obstacle in using the IOUS is their computational intractability when the sample size and/or order are large. In this article, we derive non-asymptotic Gaussian approximation error bounds for an incomplete version of the IOUS with a random kernel. We also study data-driven inferential methods for the incomplete IOUS via bootstraps and develop their statistical and computational guarantees.
