Text Mining the History of Medicine

Thompson, Paul; Batista-Navarro, Riza Theresa; Kontonatsios, Georgios; Carter, Jacob; Toon, Elizabeth; McNaught, John; Timmermann, Carsten; Worboys, Michael; Ananiadou, Sophia

doi:10.1371/journal.pone.0144717

Cited by 54 publications

(51 citation statements)

References 51 publications

“…Some PDF files without texts are scans of the original article (point 1). We did not attempt to make an optical character recognition conversion (OCR) as the old typesetting fonts often are less compatible with present day OCR programs, and this can lead to text recognition errors [28, 29]. For any discarded document, we still used the meta-data to calculate summary statistics.…”

Section: Methods (mentioning)

confidence: 99%

A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts

et al. 2018

Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.

“…All data must be digital before it can be processed; but not all data that requires processing is in a usable digital format. While it is rare for data scientists to interact with non-digital data, many clinicians [1], historians [2], educators and field researchers [3] still regularly capture or must work with historical archives of paper-based spreadsheet data. For small datasets, manual transcription of these records is feasible, but as data requirements grow, researchers and professionals are required to invest considerable resources to transcribe their paper-based data into digital form [4], [5].…”

Section: Introduction (mentioning)

confidence: 99%

“…Data in the educational and healthcare domains, for instance, often contain sensitive personal information requiring specialized authorization to share with third parties. These constraints make transcription arduous and costly [1].…”

Section: Introduction (mentioning)

confidence: 99%

An open-source tool for the transcription of paper-spreadsheet data: Code and supplemental materials available online: Https://github.com/deskool/images to spreadsheets

Ghassemi, Jarvis, Alhanai et al. 2017

2017 IEEE International Conference on Big Data (Big Data)

Clinical researchers, historians, educators and field researchers alike still regularly capture data on paper spreadsheets. In the case of health care and education, data will often contain sensitive personal information, further complicating the process of transcribing paper-based archives into digital form. In this work, we describe a tool that utilizes machine learning and crowd intelligence to automatically transcribe images of paper-based spreadsheets into electronic form while protecting sensitive personal information. Our solution consists of four high-level stages: (1) the extraction of cell-level images from the spreadsheet grid, (2) machine recognition of digits within the cells, (3) human transcription of cell contents that the machine was uncertain of and (4) feedback of human transcription results to the machine to improve future classification performance. We test the algorithm on a novel data-set of 135 heterogeneous clinical flow-sheet images collected from the Massachusetts General Hospital (MGH), 2 hand-drawn spreadsheets, one chalkboard drawing, and one printed table. We demonstrate that our algorithm provides a generalized solution for spreadsheet transcription that maintains privacy, is up to 10 times faster and twice as cost-effective as existing alternatives. Our work is valuable both as a tool and as a starting point for the development of better algorithms.

“…A recent study by Thompson et al (115) may serve as an example of combining all of the elements of text mining within a single project. The goal was to analyse medical vocabulary from a historical perspective, observing how certain terms and concepts appear, transform and wither across the years.…”

Section: Discussion (mentioning)

confidence: 99%

Text mining resources for the life sciences

Przybyła, Shardlow, Aubin et al. 2016

Database

Self Cite

Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable—those that have the crucial ability to share information, enabling smooth integration and reusability.
