TMIXT: A process flow for Transcribing MIXed handwritten and machine-printed Text

Medhat, Fady; Mohammadi, Mahnaz; Jaf, Sardar; Willcocks, Chris G.; Breckon, Toby P.; Matthews, Peter; McGough, Andrew Stephen; Theodoropoulos, Georgios; Obara, Bogusław

doi:10.1109/bigdata.2018.8622136

Cited by 3 publications

(7 citation statements)

References 34 publications


“…To explore the advancement in TIE techniques, [57] and as encoder in attention mechanism outperformed others [56]. Although, these techniques are showing promising results, but diversity in data sources makes the system complex [55]. The effectiveness of these techniques for complex, diverse, high dimensional and heterogeneous datasets must be investigated.…”

Section: Text Recognition (mentioning)

confidence: 99%


An analytical study of information extraction from unstructured and multidimensional big data

2019


Adnan and Akbar, J Big Data (2019) 6:91

Abstract: Process of information extraction (IE) is used to extract useful information from unstructured or semi-structured data. Big data arise new challenges for IE techniques with the rapid growth of multifaceted also called as multidimensional unstructured data. Traditional IE systems are inefficient to deal with this huge deluge of unstructured big data. The volume and variety of big data demand to improve the computational capabilities of these IE systems. It is necessary to understand the competency and limitations of the existing IE techniques related to data pre-processing, data extraction and transformation, and representations for huge volumes of multidimensional unstructured data. Numerous studies have been conducted on IE, addressing the challenges and issues for different data types such as text, image, audio and video. Very limited consolidated research work have been conducted to investigate the task-dependent and task-independent limitations of IE covering all data types in a single study. This research work address this limitation and present a systematic literature review of state-of-the-art techniques for a variety of big data, consolidating all data types. Recent challenges of IE are also identified and summarized. Potential solutions are proposed giving future research directions in big data IE. The research is significant in terms of recent trends and challenges related to big data analytics. The outcome of the research and recommendations will help to improve the big data analytics by making it more productive.

Introduction: Information extraction (IE) process extracts useful structured information from the unstructured data in the form of entities, relations, objects, events and many other types. The extracted information from unstructured data is used to prepare data for analysis. Therefore, the efficient and accurate transformation of unstructured data in the IE process improves the data analysis. Numerous techniques have been introduced for different data types i.e. text, image, audio, and video. The advancement in technology promoted the rapid growth of data volume in recent years. The volume, variety (structured, unstructured, and semi-structured data) and velocity of big data have also changed the paradigm of computational capabilities of the systems. IBM estimated that more than 2.5 quintillion bytes of data are generated every day. Among these statistics, it was also predicted that unstructured data from diverse sources will grow up to 90% in few years. IDC estimated that unstructured data will be 95% of the global data in 2020 with estimated 65% annual growth rate [1]. The common characteristics of unstructured data are, (i) it comes in multiple formats...


“…Unstructured big data comes with high dimensionality [16,18,66], diversity [55,124], dynamicity [32] and heterogeneity [33,131]. Dimensionality reduction [18] and semantic annotation [131] can further improve the IE performance of high dimensional and heterogeneous data respectively.…”

Section: Dimensionality and Heterogeneity (mentioning)

confidence: 99%

An analytical study of information extraction from unstructured and multidimensional big data

2019


“…To the best of our knowledge, this is the first work reported on applying neural network model on mixed text recognition. We apply our postprocessing approach to the output of the pipeline proposed by [15] for mixed text recognition over IAM handwriting database [25] to show the effectiveness of neural network based natural language generation on the improvement of OCR accuracy.…”

Section: Related Work (mentioning)

confidence: 99%

“…Table I shows total number of characters/tokens before and after cleaning the data and also total number of unique characters/tokens after cleaning data for both train files and test files (i.e. result of applying TMIXT [15] on IAM handwriting database for text recognition). The Vocabulary size, bolded in table I, shows the number of cleaned and unique characters/tokens (words) for character level and word level language models.…”

Section: A Data Set and Analysis (mentioning)

confidence: 99%


On the Use of Neural Text Generation for the Task of Optical Character Recognition

Mohammadi, Jaf, McGough, et al. 2019

2019 IEEE/ACS 16th International Conference on Computer Systems and Applications (AICCSA)

Self Cite


Optical Character Recognition (OCR) is the extraction of textual data from scanned text documents to facilitate their indexing, searching, and editing and to reduce storage space. Although OCR systems have improved significantly in recent years, they still suffer in situations where the OCR output does not match the text in the original document. Deep learning models have contributed positively to many problems, but their full potential for many other problems is yet to be explored. In this paper we propose a post-processing approach based on the application of deep learning to improve the accuracy of OCR systems (minimizing the error rate). We report on the use of neural network language models to accomplish the task of correcting characters/words incorrectly predicted by OCR systems. We applied our approach to the IAM handwriting database. Our proposed approach delivers a significant accuracy improvement of 20.41% in F-score, 10.86% in character level comparison using Levenshtein distance and 20.69% in document level comparison over previously reported context based OCR empirical results on the IAM handwriting database.
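For illustration, the character level comparison using Levenshtein distance named in the abstract above might be computed along the lines of the following minimal sketch. The function names `levenshtein` and `char_accuracy`, and the choice to normalise by the length of the longer string, are assumptions made for this example and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string as b so each row stays as small as possible.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] is the distance between the empty prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(ocr_text: str, reference: str) -> float:
    """Hypothetical similarity score in [0, 1]: one minus the edit
    distance normalised by the longer string's length (an assumption)."""
    longest = max(len(ocr_text), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(ocr_text, reference) / longest
```

A score of 1.0 means the OCR output matches the reference exactly; lower values indicate more character edits are needed. Other normalisations (for example, by the reference length only) are equally plausible.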


Exploring AI-driven approaches for unstructured document analysis and future horizons

Mahadevkar, Patil, Kotecha, et al. 2024

J Big Data


In the current industrial landscape, a significant number of sectors are grappling with the challenges posed by unstructured data, which incurs financial losses amounting to millions annually. If harnessed effectively, this data has the potential to substantially boost operational efficiency. Traditional methods for extracting information have their limitations; however, solutions powered by artificial intelligence (AI) could provide a more fitting alternative. There is an evident gap in scholarly research concerning a comprehensive evaluation of AI-driven techniques for the extraction of information from unstructured content. This systematic literature review aims to identify, assess, and deliberate on prospective research directions within the field of unstructured document information extraction. It has been observed that prevailing extraction methods primarily depend on static patterns or rules, often proving inadequate when faced with complex document structures typically encountered in real-world scenarios, such as medical records. Datasets currently available to the public suffer from low quality and are tailored for specific tasks only. This underscores an urgent need for developing new datasets that accurately reflect complex issues encountered in practical settings. The review reveals that AI-based techniques show promise in autonomously extracting information from diverse unstructured documents, encompassing both printed and handwritten text. Challenges arise, however, when dealing with varied document layouts. Proposing a framework through hybrid AI-based approaches, this review envisions processing a high-quality dataset for automatic information extraction from unstructured documents. Additionally, it emphasizes the importance of collaborative efforts between organizations and researchers to address the diverse challenges associated with unstructured data analysis.

