We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. The words in a multi-author paper are assumed to be generated from a mixture of the topic distributions of its authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system, including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. An online query interface to the model is also discussed that allows interactive exploration of author-topic models for corpora such as CiteSeer.
We propose a new unsupervised learning technique for extracting information about authors and topics from large text collections. We model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distribution over topics in a multi-author paper is a mixture of the distributions associated with the authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to three large text corpora: 150,000 abstracts from the CiteSeer digital library, 1,740 papers from the Neural Information Processing Systems (NIPS) Conferences, and 121,000 emails from the Enron corporation. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. Experiments based on perplexity scores for test documents and precision-recall for document retrieval are used to illustrate systematic differences between the proposed author-topic model and a number of alternatives. Extensions to the model, allowing (for example) generalizations of the notion of an author, are also briefly discussed.
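The two-stage generative process described above can be sketched in code. This is a minimal illustrative simulation, not the authors' implementation: the corpus sizes, vocabulary size, and Dirichlet hyperparameters (`alpha`, `beta`) below are assumed values chosen for the example, and the actual model learns the distributions by MCMC rather than sampling them once as done here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions for illustration only (not from the paper).
n_topics, n_words, n_authors = 4, 50, 3
alpha, beta = 0.5, 0.1  # assumed Dirichlet hyperparameters

# Stage 1 priors: each author is a distribution over topics,
# and each topic is a distribution over words.
author_topic = rng.dirichlet(alpha * np.ones(n_topics), size=n_authors)
topic_word = rng.dirichlet(beta * np.ones(n_words), size=n_topics)

def generate_document(authors, doc_length):
    """Generate one document: for each word token, pick one of the
    paper's authors uniformly, draw a topic from that author's topic
    distribution, then draw a word from that topic's word distribution."""
    words = []
    for _ in range(doc_length):
        a = rng.choice(authors)                      # responsible author
        z = rng.choice(n_topics, p=author_topic[a])  # topic for this token
        w = rng.choice(n_words, p=topic_word[z])     # word from the topic
        words.append(int(w))
    return words

# A two-author document: its topic mixture is an equal-weight mixture
# of the two authors' topic distributions.
doc = generate_document(authors=[0, 2], doc_length=20)
```

Inference then runs this process in reverse: given the observed words and author lists, the Gibbs sampler recovers the `author_topic` and `topic_word` distributions.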
Breast cancer is the second leading cause of cancer-related deaths and the most commonly diagnosed cancer in women across the world (1). Digital mammography (DM) is the primary imaging modality for breast cancer screening in asymptomatic women and in the diagnostic workup setting (2). DM has been shown to reduce breast cancer mortality (3). In standard clinical practice, a radiologist reads mammograms and classifies the findings according to the American College of Radiology (4) Breast Imaging Reporting and Data System (BI-RADS) lexicon. An abnormal finding depicted at DM typically requires a diagnostic workup, which may include additional mammographic views or possibly additional imaging modalities. If a lesion is suspicious for cancer, further evaluation with a biopsy is recommended. Analyzing these images is challenging because of the subtle differences between lesions and background fibroglandular tissue, different lesion types, the nonrigid nature of the breast, and the relatively small proportion of cancers in a screening population of women at average risk (2). This leads to substantial intraobserver and interobserver variability (5). The average performance measures for screening mammography by a radiologist were reported by Lehman et al (6) to be 86.9% sensitivity and 88.9% specificity.

Breast cancer risk prediction models based on clinical features can help physicians estimate the probability that an individual or population will develop breast cancer within a certain time frame. As a result, they are often used to recommend an individual screening plan. In a systematic survey of risk prediction models, Meads et al (7) reported limited performance when the models were applied to general populations (area under the receiver operating characteristic curve [AUC], 0.67; 95% confidence interval [CI]: 0.65, 0.68) and improved results when they were applied to high-risk populations (AUC, 0.76; 95% CI: 0.70, 0.82).