2006
DOI: 10.1186/1471-2105-7-58

Identifying biological concepts from a protein-related corpus with a probabilistic topic model

Abstract: Background: Biomedical literature, e.g., MEDLINE, contains a wealth of knowledge regarding functions of proteins. Major recurring biological concepts within such text corpora represent the domains of this body of knowledge. The goal of this research is to identify the major biological topics/concepts from a corpus of protein-related MEDLINE titles and abstracts by applying a probabilistic topic model.
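The abstract describes applying a probabilistic topic model to protein-related MEDLINE titles and abstracts; the citing statements below identify the model as LDA. Purely as an illustration of that setup, not the authors' actual pipeline, here is a minimal sketch using scikit-learn's LatentDirichletAllocation; the toy abstracts and the topic count of 2 are hypothetical placeholders.

```python
# Minimal sketch (not the paper's code): fit an LDA topic model to a handful
# of protein-related abstracts. `abstracts` and the topic count are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "protein kinase phosphorylation signal transduction pathway",
    "dna binding transcription factor gene expression regulation",
    "membrane transport ion channel receptor ligand binding",
]

# Bag-of-words counts; word order is ignored, matching the LDA assumption.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(abstracts)

# The number of topics must be chosen up front; it is an input parameter.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # per-document topic proportions

print(doc_topic)  # each row: the mixture of topics for one abstract
```

Each document comes out as a mixture over topics rather than a single label, which is the property the citing statements below emphasise.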

Cited by 55 publications (43 citation statements); references 24 publications.
“…Although the documents, or abstracts, are known and observed, the topics are hidden or latent (Piepenbrink & Nurmammadov, 2015). This allows for modelling at a fine granularity as it realistically sees texts as made up of different topics rather than being "about" one topic alone (Zheng, Mclean, & Lu, 2006).…”
Section: Topic Models - LDA (mentioning)
confidence: 99%
“…First, it sees a document as a bag of words, where the order of words is inconsequential for our analysis (Blei et al, 2003; Grimmer & Stewart, 2013). The choice of the correct number of topics is crucial as it determines the granularity of the results and the fit of the model for the data, that is, how well the model describes the underlying data (Griffiths & Steyvers, 2004; Zheng et al, 2006). Second, it is based on the assumption that the number of topics k is fixed and known, which is an input parameter of the LDA.…”
Section: Topic Models - LDA (mentioning)
confidence: 99%
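The statement above points to two assumptions: documents are bags of words, and the topic count k is a fixed input whose choice determines granularity and model fit. A minimal sketch of what that looks like in practice, again assuming scikit-learn rather than anything from the paper; the placeholder documents and candidate k values are arbitrary, and a real evaluation would score held-out documents.

```python
# Hypothetical sketch: k is a required LDA input; sweep a few candidate topic
# counts and compare fit via perplexity (lower indicates better fit). In
# practice this comparison would use held-out documents, not the training set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [  # placeholder stand-ins for MEDLINE titles/abstracts
    "protein kinase phosphorylation signalling pathway",
    "transcription factor binds dna and regulates gene expression",
    "ion channel mediates membrane transport of calcium",
    "enzyme catalyses hydrolysis of the peptide substrate",
    "receptor ligand interaction triggers downstream signalling",
    "mutation in the gene alters protein folding and stability",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)

for k in (2, 3, 5):  # candidate topic counts, chosen arbitrarily here
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
    print(f"k={k}  perplexity={lda.perplexity(counts):.1f}")
```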
“…Thus, LDA automatically finds topics in a text, or in other words, LDA attempts to go back from the document and find the set of topics that may have generated it. Zheng, McLean, and Lu (2006) make use of LDA to identify biological topics, i.e., concepts, from a corpus composed of biomedical articles that belong to MEDLINE; to that end, first, they use LDA to identify the most relevant concepts, and subsequently, these concepts are mapped to a biomedical vocabulary: Gene Ontology.…”
Section: Latent Dirichlet Allocation (mentioning)
confidence: 99%
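Per the statement above, the learned topics are read off as lists of high-probability words (the "concepts"), which Zheng, McLean, and Lu then map onto the Gene Ontology vocabulary. A minimal sketch of that first step, assuming an LDA model and vectorizer fitted as in the sketch after the abstract; the GO mapping itself is domain-specific and is not shown.

```python
# Hypothetical sketch: list the top words of each fitted LDA topic. These word
# lists are the "concepts" that would subsequently be mapped to a controlled
# vocabulary such as Gene Ontology (mapping step not shown).
import numpy as np

def top_words_per_topic(lda, vectorizer, n_top=10):
    """Return the n_top highest-weight words for every learned topic."""
    vocab = vectorizer.get_feature_names_out()
    topics = []
    for weights in lda.components_:          # one row of word weights per topic
        best = np.argsort(weights)[::-1][:n_top]
        topics.append([vocab[i] for i in best])
    return topics

# Usage, assuming `lda` and `vectorizer` fitted as in the earlier sketch:
# for words in top_words_per_topic(lda, vectorizer):
#     print(", ".join(words))
```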