Corpus Conversion Service

Staar, Peter; Dolfi, Michele; Auer, Christoph; Bekas, Costas

doi:10.1145/3219819.3219834

Cited by 37 publications

(7 citation statements)

References 11 publications

“…DeepSearch has been previously used in various publications such as [1][2][3][4]. These publications demonstrate the effectiveness of DeepSearch in extracting and analyzing text-based data.…”

Section: Appendix A Methods (mentioning)

confidence: 91%

Plan for Constructing DataDiscoveryLab

Keskinoglu

2023

Preprint

DataDiscoveryLab is a software tool that enables users to recommend possible pathways to their research with references by extracting valuable insights from academic articles by parsing them into text and figures and processing the image data using computer vision algorithms. The software creates two databases for text-based purposes, one for titles, figure captions, and references, and another for abstracts, introductions, methods, and results using NLP techniques. The software then compares these databases to users' research questions, finds similarities, and presents the findings. Additionally, the software takes data from researchers' scientific software and devices to compare with the current figure-based databases, creating a loop until the best answer and pathways to research and articles to recommend can be found. This tool provides valuable insights and context for researchers, helping them make informed decisions about their research.

“…The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) (Zhang et al 2023; Li et al 2022; Huang et al 2022). On the other hand, deep learning-based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al 2022; Livathinos et al 2021; Staar et al 2018).…”

Section: Related Work (mentioning)

confidence: 99%

ESG Accountability Made Easy: DocQA at Your Service

Mishra, Berrospi, Dinkla et al.

2024

AAAI

We present Deep Search DocQA. This application enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines consisting of document conversion to machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations. The Deep Search platform can be accessed at: https://ds4sd.github.io.

“…A PDF document provides identical representation on any device and any OS. PDF documents are the de facto standard electronic document, and Adobe has estimated that there were 2.5 trillion PDF documents in circulation [7]. Furthermore, PDF has the specification to validate the content integrity by using the digital signature [8].…”

Section: Related Research 2.1 PDF and HTML (mentioning)

confidence: 99%

Design of Chained Document HTML Generation Technique Based on Blockchain for Trusted Document Communication

Hwang, Kim

2022

Electronics

Digital document communication between an enterprise and a customer is becoming a primary form of communication rather than the traditional physical document communication. A PDF document, the most popular document format, provides an identical document layout regardless of OS or device and has a content integrity verification feature with a digital signature. However, it has a bad user experience, such as low readability on a mobile device. On the other hand, an HTML document has a weakness in verifying the content integrity even though it is the primary document format and provides a good user experience on mobile devices. There are certified document services using blockchain technology, but it is still vulnerable to verifying content integrity. Furthermore, research on the document HTML has proposed the trusted document generation technique by HTML conformance and digital signature; however, this research does not provide content delivery verification, and there is a file size overhead. In this paper, we have developed the chained document HTML by defining HTML conformance, digital signature, and blockchain technology. First, the chained document HTML has to embed all resources and does not allow loading content on-demand. Second, the file is signed by a digital signature, and the signature value is added in the file header. Lastly, the metadata to verify the content integrity is inserted in a blockchain node. We have created the chained document HTML generation and verification experiment environment by Ethereum and Python. We have confirmed that the chained document HTML provides content and delivery integrity verification in the research. We expect the chained document HTML will be widely used in document communication between an enterprise and a customer, especially if the document has sensitive personal information that might have a legal dispute.
