Automatic classification of scanned electronic health record documents

Goodrum, Heath; Roberts, Kirk; Bernstam, Elmer V.

doi:10.1016/j.ijmedinf.2020.104302

Cited by 36 publications

(31 citation statements)

References 16 publications

“…The cause of the error can be in the form of the characters in the letter number are illegible (text characters from OCR were not recognized correctly), data does not match the provided regular expression pattern and unpredictable. This study adapts, modifies and combines the methods in previous studies (scanned document classification with OCR-assisted text approach [17]- [19], hierarchical classification [28], CNN [20]- [22], regular expression [23]- [25] and framework Hadoop [29] which in the end this proposed method is able to overcome the problem of classifying scanned documents (using a text-based approach with the help of OCR) at a depth of 4 levels automatically in a hierarchical manner that is able to classify different document types with document conditions that have unstructured text content using CNN and have special patterns (specific and short strings) using regular expression and implementation of big data technology using Hadoop framework for store and analysis of large-scale data. This method is powerful and effective to overcome the multilevel classification problem in the case of this electronic mail document.…”

Section: Results (mentioning)

confidence: 99%

“…Based on small trial, the accuracy performance of Google Vision OCR was the best comparing to other OCR tools [16]. In previous studies, the automatic classification of scanned electronic health record documents done by extracted text using (OCR and multiple text classification machine learning models, including both "bag of words" and deep learning approaches [17], the classifying image spam detection using OCR, machine learning and natural language processing [18] and the classifying promotion images using OCR and Naïve Bayes classifier [19]. From research [17]- [19] show that text-based classification systems can accurately classify scanned documents.…”

Section: Introduction (mentioning)

confidence: 99%

“…In previous studies, the automatic classification of scanned electronic health record documents done by extracted text using (OCR and multiple text classification machine learning models, including both "bag of words" and deep learning approaches [17], the classifying image spam detection using OCR, machine learning and natural language processing [18] and the classifying promotion images using OCR and Naïve Bayes classifier [19]. From research [17]- [19] show that text-based classification systems can accurately classify scanned documents. The problem of text classification can also be solved by deep learning using the convolutional neural network (CNN) such as for hate speech classification [20], news classification [21] and sentiment analysis [22].…”

Section: Introduction (mentioning)

confidence: 99%

Automated hierarchical classification of scanned documents using convolutional neural network and regular expression

Arief, Mutiara, Kusuma, et al. 2022. IJECE

This research proposed automated hierarchical classification of scanned documents with characteristics content that have unstructured text and special patterns (specific and short strings) using convolutional neural network (CNN) and regular expression method (REM). The research data using digital correspondence documents with format PDF images from pusat data teknologi dan informasi (technology and information data center). The document hierarchy covers type of letter, type of manuscript letter, origin of letter and subject of letter. The research method consists of preprocessing, classification, and storage to database. Preprocessing covers extraction using Tesseract optical character recognition (OCR) and formation of word document vector with Word2Vec. Hierarchical classification uses CNN to classify 5 types of letters and regular expression to classify 4 types of manuscript letter, 15 origins of letter and 25 subjects of letter. The classified documents are stored in the Hive database in Hadoop big data architecture. The amount of data used is 5200 documents, consisting of 4000 for training, 1000 for testing and 200 for classification prediction documents. The trial result of 200 new documents is 188 documents correctly classified and 12 documents incorrectly classified. The accuracy of automated hierarchical classification is 94%. Next, the search of classified scanned documents based on content can be developed.

“…As others have noted, the literature devoted to scanned documents and images within EHRs is smaller than we expected given the importance of this commonly used means for HIE in the early decades of EHR use in our country. 18 Our study is limited by its small size-it is a pilot-and by the population that we used which is from an academic center. The number of cancer risk factors identified in scanned records may be different in other populations.…”

Section: Discussion (mentioning)

confidence: 99%

Searching the PDF Haystack: Automated Knowledge Discovery in Scanned EHR Documents

2021

Background: Clinicians express concern that they may be unaware of important information contained in voluminous scanned and other outside documents contained in electronic health records (EHRs). An example is “unrecognized EHR risk factor information,” defined as risk factors for heritable cancer that exist within a patient's EHR but are not known by current treating providers. In a related study using manual EHR chart review, we found that half of the women whose EHR contained risk factor information meet criteria for further genetic risk evaluation for heritable forms of breast and ovarian cancer. They were not referred for genetic counseling.

Objectives: The purpose of this study was to compare the use of automated methods (optical character recognition with natural language processing) versus human review in their ability to identify risk factors for heritable breast and ovarian cancer within EHR scanned documents.

Methods: We evaluated the accuracy of the chart review by comparing our criterion standard (physician chart review) versus an automated method involving Amazon's Textract service (Amazon.com, Seattle, Washington, United States), a clinical language annotation modeling and processing toolkit (CLAMP) (Center for Computational Biomedicine at The University of Texas Health Science, Houston, Texas, United States), and a custom-written Java application.

Results: We found that automated methods identified most cancer risk factor information that would otherwise require clinician manual review and therefore is at risk of being missed.

Conclusion: The use of automated methods for identification of heritable risk factors within EHRs may provide an accurate yet rapid review of patients' past medical histories. These methods could be further strengthened via improved analysis of handwritten notes, tables, and colloquial phrases.

“…The paper suggests that a more accurate engine must be used to recognize cursive handwriting to improve accuracy. The paper [6] proposes a system to group the clinical and nonclinical documents into suitable categories which are again subclassified. .Electronic Health Records have also known as (EHR's) contain a large number of scanned documents such as radiology reports, clinical correspondence, identification cards, etc.…”

Section: Literature Survey (mentioning)

confidence: 99%

A Review of Optical Character Recognition (OCR) in Healthcare

Bhure. 2021. IJRASET

Information is present everywhere in newspapers, magazines, documents etc. but healthcare information majorly consisting of medicine labels, drug information, and personal health records etc. is something which is important and confidential at the same time. Today's world is the world of digitization. Technology is advancing day by day and medical healthcare is no exception. Everyday millions of electronic health records (EHR's) are generated, hundreds of invoices are generated, and prescriptions are written. But how to categorize this data and make the best out of it? Optical character recognition (OCR) proves to be of great help in this field. OCR can be used to categorize the EHR's under certain labels. Even most of the time a physician's written prescription is unrecognizable. With the help of OCR, this text can be identified. Knowing one's medicine is highly important , it is easy for normal people but considering visually impaired people, OCR and TTS technologies can be used to get these and other clinical information available to them in the form of an audio. Optical Character Recognition (OCR) is a technique where images or scanned records are perused and converted into OCR recognizable characters, which are then extended for editing and searching purposes. Pattern Recognition techniques and advanced Computer Vision are the main building blocks of working behind an OCR. The paper discusses the role of OCR in various healthcare applications.
