We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. The words in a multi-author paper are assumed to be generated from a mixture of the topic distributions of its authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system, including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. An online query interface to the model is also discussed that allows interactive exploration of author-topic models for corpora such as CiteSeer.
We propose a new unsupervised learning technique for extracting information about authors and topics from large text collections. We model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distribution over topics in a multi-author paper is a mixture of the distributions associated with the authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to three large text corpora: 150,000 abstracts from the CiteSeer digital library, 1,740 papers from the Neural Information Processing Systems (NIPS) Conferences, and 121,000 emails from the Enron corporation. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. Experiments based on perplexity scores for test documents and precision-recall for document retrieval are used to illustrate systematic differences between the proposed author-topic model and a number of alternatives. Extensions to the model, allowing (for example) generalizations of the notion of an author, are also briefly discussed.
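The two-stage generative process described above can be sketched in code. This is a minimal illustrative simulation, not the authors' implementation: the corpus sizes, vocabulary size, and Dirichlet hyperparameters (`alpha`, `beta`) below are assumed values chosen for the example, and the actual model learns the distributions by MCMC rather than sampling them once as done here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions for illustration only (not from the paper).
n_topics, n_words, n_authors = 4, 50, 3
alpha, beta = 0.5, 0.1  # assumed Dirichlet hyperparameters

# Stage 1 priors: each author is a distribution over topics,
# and each topic is a distribution over words.
author_topic = rng.dirichlet(alpha * np.ones(n_topics), size=n_authors)
topic_word = rng.dirichlet(beta * np.ones(n_words), size=n_topics)

def generate_document(authors, doc_length):
    """Generate one document: for each word token, pick one of the
    paper's authors uniformly, draw a topic from that author's topic
    distribution, then draw a word from that topic's word distribution."""
    words = []
    for _ in range(doc_length):
        a = rng.choice(authors)                      # responsible author
        z = rng.choice(n_topics, p=author_topic[a])  # topic for this token
        w = rng.choice(n_words, p=topic_word[z])     # word from the topic
        words.append(int(w))
    return words

# A two-author document: its topic mixture is an equal-weight mixture
# of the two authors' topic distributions.
doc = generate_document(authors=[0, 2], doc_length=20)
```

Inference then runs this process in reverse: given the observed words and author lists, the Gibbs sampler recovers the `author_topic` and `topic_word` distributions.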
Breast cancer is the second leading cause of cancer-related deaths and the most commonly diagnosed cancer in women across the world (1). Digital mammography (DM) is the primary imaging modality for breast cancer screening in asymptomatic women and in the diagnostic workup setting (2). DM has been shown to reduce breast cancer mortality (3). In standard clinical practice, a radiologist reads mammograms and classifies the findings according to the American College of Radiology (4) Breast Imaging Reporting and Data System (BI-RADS) lexicon. An abnormal finding depicted at DM typically requires a diagnostic workup, which may include additional mammographic views or possibly additional imaging modalities. If a lesion is suspicious for cancer, further evaluation with a biopsy is recommended. Analyzing these images is challenging because of the subtle differences between lesions and background fibroglandular tissue, different lesion types, the nonrigid nature of the breast, and the relatively small proportion of cancers in a screening population of women at average risk (2). This leads to substantial intraobserver and interobserver variability (5). The average performance measures for screening mammography by a radiologist were reported by Lehman et al (6) to be 86.9% sensitivity and 88.9% specificity.

Breast cancer risk prediction models based on clinical features can help physicians estimate the probability that an individual or population will develop breast cancer within a certain time frame. As a result, they are often used to recommend an individual screening plan. In a systematic survey of risk prediction models, Meads et al (7) reported limited performance when the models were applied to general populations (area under the receiver operating characteristic curve [AUC], 0.67; 95% confidence interval [CI]: 0.65, 0.68) and improved results when they were applied to high-risk populations (AUC, 0.76; 95% CI: 0.70, 0.82).