Accurate and effective latent concept modeling for ad hoc information retrieval

Deveaud, Romain; SanJuan, Eric; Bellot, Patrice

doi:10.3166/dn.17.1.61-84

Cited by 372 publications

(259 citation statements)

References 39 publications

(35 reference statements)

Supporting

Mentioning

249

Contrasting

Unclassified


“…Similarly to Arun et al (2010), Deveaud et al (2014) The last information we compiled regarding the number of topics is the perplexity, a common strategy to evaluate an LDA fitted model. The perplexity is a metric resulting from the comparison of probability models that assess how well a probability distribution predicts a sample.…”

Section: Selecting the Number Of Topics: Cross-validation Analyses

mentioning

confidence: 99%
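The perplexity mentioned in the citation statement above can be illustrated with a minimal sketch. It assumes the fitted model has already assigned a probability to each held-out token; the function name `perplexity` and the toy inputs are illustrative only and are not taken from the cited work.

```python
import math


def perplexity(token_probs):
    """Perplexity of a held-out sample, given the model probability of each token.

    Assumes every probability is strictly positive (hypothetical helper,
    not part of any topic-modelling library).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_neg_log_lik = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_lik)


# A model that spreads its mass uniformly over four words is as "surprised"
# as a fair four-sided die, so its perplexity is approximately 4.
uniform_sample = [0.25] * 8
print(perplexity(uniform_sample))

# A model that predicts the sample better yields a lower perplexity.
sharper_sample = [0.5] * 8
print(perplexity(sharper_sample))
```

Lower values indicate that the probability distribution predicts the sample better, which is why perplexity is often compared across candidate numbers of topics.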

The Personality Lexicon in Brazilian Portuguese: Studies with Natural Language

Peres

2018

Preprint


Procura da poesia (Search for Poetry)

(...) Come closer and contemplate the words. Each one has a thousand secret faces beneath its neutral face and asks you, with no interest in the answer, poor or terrible, that you may give it: Did you bring the key? (...)

Carlos Drummond de Andrade, in A Rosa do Povo (1945)

From March 1979

Tired of all who come with words, words but no language I went to the snow-covered island. The wild does not have words. The unwritten pages spread out on all sides! I come upon the tracks of roe deer in the snow. Language but no words.

Tomas Tranströmer, in The Wild Market Square (1983, trans. Robin Fulton)

Dedication

This dissertation is dedicated to Professor Jacob Laros and to Professor Luiz Pasquali, who advised me during my postgraduate studies with inspiring wisdom, not only embracing my projects, but also kindly respecting my ideas and decisions, and patiently guiding me through my doubts and after my stumbles. It is an honor to be an apprentice and a friend of such masters.

Acknowledgements

A doctorate is not the result of the student's individual dedication alone. In that spirit, I offer the following acknowledgements. To Renata and Bartholomeu, for their love and companionship. To my parents, who always strove to offer their children and other relatives opportunities for educational and professional development. To my siblings, my in-laws, and the rest of my large and dear family from Minas Gerais. To Thaís, Dudu and Raphael, whom I hope to inspire to pursue an academic career. To my teachers in basic education, who taught me how to learn and inspired my fondness for science, the arts and philosophy. I would especially like to remember the teachers Laudiene, Eliana, Magda

Table 3. The Six-Topic Model with the 10 most relevant terms of the topics, reliability and presumed correspondence with other psycholexical models

Table 4. The Seven-Topic Model with the 10 most relevant terms of the topics, reliability and presumed correspondence with other psycholexical models

Table 5. The Fourteen-Topic Model with the 10 most relevant terms of the topics, reliability and presumed correspondence with other psycholexical models

Table 6. The Fifteen-Topic Model with the 10 most relevant terms of the topics, reliability and presumed correspondence with other psycholexical models

Table 7. User frequency or the number of users that used the term, overall term frequency in the corpus, inverse document frequency, mean, minimum, maximum, and range

General Abstract

This dissertation consists of three studies concerning the lexical approach of research in the field of personality, with a focus on Brazilian culture and natural language. The first study is of a theoretical nature and explores some of the criticisms regarding the lexical approach to personality research with its origin in the psychological study of natural language and cross-cultural psychology, as well as methodolo...



“…We tested three statistical methods to find the best number of topics: (1) Arun2010 [26], (2) Cao2009 [27], and (3) Deveaud2014 [28]. However, these methods did not converge on our Twitter corpus.…”

Section: Step 3: Topic Modeling

mentioning

confidence: 99%

Mining Twitter to assess the determinants of health behavior toward human papillomavirus vaccination in the United States

Zhang, Wheldon, Dunn, et al.

2019

Journal of the American Medical Informatics Association


Objectives: To test the feasibility of using Twitter data to assess determinants of consumers' health behavior towards Human papillomavirus (HPV) vaccination informed by the Integrated Behavior Model (IBM).

Methods: We used three Twitter datasets spanning from 2014 to 2018. We preprocessed and geocoded the tweets, and then built a rule-based model that classified each tweet into either promotional information or consumers' discussions. We applied topic modeling to discover major themes, and subsequently explored the associations between the topics learned from consumers' discussions and the responses to HPV-related questions in the Health Information National Trends Survey (HINTS).

Results: We collected 2,846,495 tweets and analyzed 335,681 geocoded tweets. Through topic modeling, we identified 122 high-quality topics. The most discussed consumer topic is "cervical cancer screening", while in promotional tweets the most popular topic is to increase awareness of "HPV causes cancer". 87 out of the 122 topics are correlated between promotional information and consumers' discussions. Guided by IBM, we examined the alignment between our Twitter findings and the results obtained from HINTS. 35 topics can be mapped to HINTS questions by keywords, 112 topics can be mapped to IBM constructs, and 45 topics have statistically significant correlations with HINTS responses in terms of geographic distributions.

Conclusion: Mining Twitter to assess consumers' health behaviors can not only obtain results comparable to surveys but also yield additional insights via a theory-driven approach. Limitations exist; nevertheless, these encouraging results impel us to develop innovative ways of leveraging social media in the changing health communication landscape.

BACKGROUND and SIGNIFICANCE

Human papillomavirus (HPV) is the most common sexually transmitted disease (STD) in the United States (US) [1]. Although HPV infections are transient, persistent infection can lead to cancer. An estimated 33,700 new patients are diagnosed with HPV-associated cancers (e.g., anal, penile, cervical, and oral cancers) each year [2] in the US. The HPV vaccine is effective in preventing most of these HPV-related cancers for individuals at an early age [3]. Nevertheless, in 2017, only 48.6% of US adolescents received the recommended HPV vaccination series, and 65.5% received ≥1 dose of the series [4]. HPV vaccination coverage also varies greatly by state. Only three states (i.e., District of Columbia: 91.9%, Rhode Island: 88.6%, and Massachusetts: 81.9%) have more than 80% coverage for the first dose, while the bottom three states (i.e., Kentucky: 49.6%, Mississippi: 49.6%, and Wyoming: 46.9%) have coverage rates less than 50% [4]. There is a huge public health need to increase the awareness of HPV-related issues to promote HPV vaccination.

To increase HPV vaccination initiation and coverage, we first need to understand factors that affect people's health behavior towards vaccination uptake. Recognized by the Integrated


“…The number of topics in our LDA model was selected using the optimization method proposed by Deveaud, SanJuan, and Bellot (2014).…”

Section: Topic Modelling

mentioning

confidence: 99%

“…Using the R package "ldatuning" (Murzintcev, 2014), we created 50 different LDA models by varying the K-parameter from 1 to 50. The number of topics in our LDA model was selected using the optimization method proposed by Deveaud, SanJuan, and Bellot (2014). The final LDA "best" model was fitted using the R package "topicmodels" (Hornik & Grün, 2011).…”

Section: Topic Modelling

mentioning

confidence: 99%
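The selection procedure described in the citation statement above (fit one LDA model per candidate K, then keep the K that scores best) can be sketched as follows. This is a rough illustration, not the authors' or the ldatuning package's code: it assumes the criterion is the average pairwise symmetrised Kullback-Leibler divergence between topic-word distributions, maximised over K, and it assumes the topic-word matrices have already been obtained from fitted models. The names `symmetrised_kl`, `topic_divergence_score`, `select_num_topics` and the toy matrices are hypothetical.

```python
import math
from itertools import combinations


def symmetrised_kl(p, q):
    """Symmetrised KL divergence between two word distributions.

    Assumes both distributions are strictly positive, as smoothed LDA
    topic-word distributions normally are.
    """
    forward = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    backward = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return 0.5 * forward + 0.5 * backward


def topic_divergence_score(topic_word):
    """Sum of pairwise divergences over unordered topic pairs, divided by K(K-1).

    The normalisation is an assumption of this sketch. The score is not
    defined for a single topic, so 0.0 is returned in that case.
    """
    k = len(topic_word)
    if k < 2:
        return 0.0
    total = sum(symmetrised_kl(p, q) for p, q in combinations(topic_word, 2))
    return total / (k * (k - 1))


def select_num_topics(models):
    """Return the K whose topics are, on average, most dissimilar.

    `models` maps each candidate K to its topic-word probability matrix
    (one row per topic, each row summing to 1).
    """
    return max(models, key=lambda k: topic_divergence_score(models[k]))


# Toy matrices over a two-word vocabulary (made up for illustration).
models = {
    2: [[0.9, 0.1], [0.1, 0.9]],              # two well-separated topics
    3: [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],  # three indistinguishable topics
}
print(select_num_topics(models))
```

In practice the matrices would come from models fitted for each K in the range under study; the toy example only shows that well-separated topics score higher than redundant ones.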

Trait‐based ecology of fishes: A quantitative assessment of literature trends and knowledge gaps using topic modelling

et al. 2019


Species traits are a new data currency to enhance our understanding of ecological patterns and processes. Trait‐based studies of fishes are numerous in comparison with other animal groups, reflecting the diversity of fish forms and the functions they provide to aquatic ecosystems. We conduct a retrospective examination of the literature to identify knowledge gaps and provide guidance for future research in trait‐based fish ecology. We apply automated text mining and topic modelling to track the evolution of research topics within peer‐reviewed articles on functional traits in marine and freshwater fishes published over the past half century, explore the inter‐connections among those topics and identify emerging avenues for investigation. By mapping the topic landscape of the literature, 16 latent topics emerged that vary in their prevalence. Our results show a decline in the frequency of studies using reproductive traits to model and explore the way fish allocate energy for reproduction, and an increase in studies reporting functional diversity metrics and utilizing the concept of multivariate functional space. Research focused on contributions of fish traits to ecosystem functioning has also increased in frequency. We revealed large gaps in information between growing and decreasing topics, and these gaps were derived from the different types of traits being considered. We suggest that scientists break free from the traditions of their research field by targeting investigations that: (a) apply functional diversity metrics to a broader assortment of traits, (b) focus on traits influencing energy allocation to growth/reproduction and (c) integrate trophic‐web and behavioural studies with other topics.


