A method for determining the number of documents needed for a gold standard corpus

Juckett, David A.

doi:10.1016/j.jbi.2011.12.010

Cited by 19 publications

(11 citation statements)

References 9 publications

“…The sparse literature suggests no standard rules for determining sizes of gold standard and training sets. One method of determining the size of gold standard/training corpus is by Juckett et al., 2012; however, that paper also mentions how most studies decide on a gold standard or training size purely by ad hoc reasoning depending on the data, financial, time or personnel constraints 42.…”

Section: Methods (mentioning)

confidence: 99%

Identifying Suicide Ideation and Suicidal Attempts in a Psychiatric Clinical Research Database using Natural Language Processing

Fernandes

Dutta

Velupillai

et al. 2018

Sci Rep

129


Research into suicide prevention has been hampered by methodological limitations such as low sample size and recall bias. Recently, Natural Language Processing (NLP) strategies have been used with Electronic Health Records to increase information extraction from free-text notes as well as structured fields concerning suicidality, which allows access to much larger cohorts than previously possible. This paper presents two novel NLP approaches: a rule-based approach to classify the presence of suicide ideation and a hybrid machine learning and rule-based approach to identify suicide attempts in a psychiatric clinical database. Good performance of the two classifiers in the evaluation study suggests that they can be used to accurately detect mentions of suicide ideation and attempt within free-text documents in this psychiatric database. The novelty of the two approaches lies in the malleability of each classifier if a need arises to refine performance or meet alternate classification requirements. The algorithms can also be adapted to fit infrastructures of other clinical datasets given sufficient clinical recording practice knowledge, without dependency on medical codes or additional data extraction of known risk factors to predict suicidal behaviour.

“…Finally, a corpus’ size should be dependent on the questions that it is aimed to answer and the type of tasks where it can be applied [12, 13]. However, in practice it is largely restrained according to available resources (time, money, and people).…”

Section: Methods (mentioning)

confidence: 99%

Similarity corpus on microbial transcriptional regulation

et al. 2019


Background: The ability to express the same meaning in different ways is a well-known property of natural language. This amazing property is the source of major difficulties in natural language processing. Given the constant increase in published literature, its curation and information extraction would strongly benefit from efficient automatic processes, for which corpora of sentences evaluated by experts are a valuable resource. Results: Given our interest in applying such approaches to the benefit of curation of the biomedical literature, specifically that about gene regulation in microbial organisms, we decided to build a corpus with graded textual similarity evaluated by curators and that was designed specifically oriented to our purposes. Based on the predefined statistical power of future analyses, we defined features of the design, including sampling, selection criteria, balance, and size, among others. A non-fully crossed study design was applied. Each pair of sentences was evaluated by 3 annotators from a total of 7; the scale used in the semantic similarity assessment task within the Semantic Evaluation workshop (SEMEVAL) was adapted to our goals in four successive iterative sessions with clear improvements in the agreed guidelines and interrater reliability results. Alternatives for such a corpus evaluation have been widely discussed. Conclusions: To the best of our knowledge, this is the first similarity corpus (a dataset of pairs of sentences for which human experts rate the semantic similarity of each pair) in this domain of knowledge. We have initiated its incorporation in our research towards high-throughput curation strategies based on natural language processing.


“…Finally, the corpus’ size should be dependent on the questions that it is aimed to answer and the type of tasks where it would be applied ([PCH07], [Juc12]). However, in practice it is largely restrained according to available resources (time, money and people).…”

Section: Methods (mentioning)

confidence: 99%

Similarity corpus on microbial transcriptional regulation

Lithgow-Serrano

Gama-Castro

Ishida-Gutiérrez

et al. 2017

Preprint


The ability to express the same meaning in different ways is a well known property of natural language. This amazing property is the source of major difficulties in natural language processing. Given the constant increase in published literature, its curation and information extraction would strongly benefit by efficient automatic processes, for which corpora of sentences evaluated by experts is a valuable resource. Given our interest in applying such approaches to the benefit of curation of the biomedical literature, specifically about gene regulation in microbial organisms, we decided to build a corpus with graded textual similarity evaluated by curators, and designed specifically oriented to our purposes. Based on the predefined statistical power of future analyses, we defined features of the design including sampling, selection criteria, balance, and size among others. A non-fully crossed design was performed for each pair of sentences by 3 evaluators from 7 different groups, adapting the SEMEVAL scale to our goals in four successive iterative sessions with clear improvement in the consensuated guidelines and inter-rater reliability results. Alternatives for the corpus evaluation are widely discussed. To the best of our knowledge this is the first similarity corpus in this domain of knowledge. We have initiated its incorporation in our research towards high throughput curation strategies based in natural language processing.
