DocDig: Content Based Figure Search in Digitized Documents

Eken, Süleyman; Atay, Burak; Sönmez, Büşra Ceren; Sayar, Ahmet

doi:10.29130/dubited.330094

Cited by 5 publications

(3 citation statements)

References 11 publications


“…It simultaneously produced region proposals and embedded them into a word-embedding space in which searches were performed. Atay et al [41] developed an architecture that makes content-based figure searches possible on these scanned documents in large quantities. The user can search with some keywords and display related figures in digital documents with their captions.…”

Section: Related Work (mentioning)

confidence: 99%

Searchable Turkish OCRed historical newspaper collection 1928–1942

Menhour, Şahin, Sarıkaya, et al. 2021

Journal of Information Science

Self Cite


The newspaper emerged as a distinct cultural form in early 17th-century Europe and is bound up with the early modern period of history. Historical newspapers are of the utmost importance to nations and their people, and researchers from different disciplines rely on them to improve our understanding of the past. To meet this need, the Istanbul University Head Office of Library and Documentation provides access to a large database of scanned historical newspapers. To make the documents more accessible, we need to run optical character recognition (OCR) and named entity recognition (NER) on the whole database and index the results to allow full-text search. We design and implement a system that covers the whole pipeline, from scraping the dataset from the original website to providing a graphical user interface for running search queries. The proposed system supports searching for people-, culture-, and security-related keywords and visualising the results.


“…Document understanding is generally performed on scanned documents/images (Aiello, Monz, Todoran, & Worring, 2002; Altamura, Esposito, & Malerba, 2000; Eken, Atay, Sönmez & Sayar, 2018; Eken, Karabaş, Sarı & Sayar, 2018; Eken & Sayar, 2013). Within the scope of the project, as in those studies, we detected textual objects in résumé PDF documents, such as first name and last name (personal information), contact information, education, work experience, references, and personal interests, as well as visual objects such as the person's photograph, together with their positions in the document (layout), and expressed each résumé in XML format.…”

Section: Introduction (unclassified)

“…Another topic of interest is the conversion of PDF and XML into each other. The purpose of obtaining XML documents from PDF documents is to narrow the search space for searches performed over the documents through indexing and retrieval (Eken, Ekinci & Sayar, 2014). Such studies are referred to in the literature as "document summarization/discourse extraction".…”

Section: Introduction (unclassified)

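Dijital Dokümanlar Üzerinde Otomatik Biçim Tanıma ve Farklı İçeriklere Uyarlama: Özgeçmişler Üzerinde Durum Çalışması [Automatic Format Recognition on Digital Documents and Adaptation to Different Contents: A Case Study on Résumés]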

Kantarci, Eken, Sayar 2019

European Journal of Science and Technology

Self Cite


Abstract: With regard to bulk categorization, which lies at the heart of most computing operations, two types of data affect information retrieval: structured data and unstructured data. Structured data refers to highly organized information, such as data held in a relational database. Unstructured data, by contrast, may have its own internal structure but does not map neatly onto a spreadsheet or database. Résumés are data of this kind. Résumés, which are usually provided in PDF (Portable Document Format), can be made structured by using the PDF tagging feature; however, most PDF data is untagged and unstructured. Non-technical business users and data analysts find it very difficult to deal with such black boxes.


Figure search by text in large scale digital document collections

Yurtsever, Özcan, Taruz, et al. 2021

Concurrency and Computation

Self Cite


Summary: Digital document collections have been created by transferring large numbers of documents to digital media, and these digital archives have provided many benefits to users. As the diversity and size of digital image collections have grown exponentially, obtaining the desired image from them has become both more important and more difficult. The images in a document might contain critical information about its subject. In this study, an architecture is developed that can work on large‐scale data by combining regular expressions with full‐text search approaches. The performance of the system has been tested on different academic documents, and Elasticsearch and Apache Solr insert times are compared. Compared to Elasticsearch, Apache Solr achieved faster and more successful results.
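The summary above mentions regular expressions used together with full-text search to find figures by text. The following is only a minimal sketch of that general idea, not the authors' implementation: the caption pattern, the function names, and the in-memory inverted index (standing in for a search engine such as Elasticsearch or Apache Solr) are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed caption format: "Figure 3. text" or "Fig. 3: text" at the start of a line.
# Real documents may use other conventions; this pattern is illustrative only.
CAPTION_RE = re.compile(
    r"^(?:Figure|Fig\.)\s*(\d+)\s*[.:]\s*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_captions(doc_id, text):
    """Return (doc_id, figure_number, caption) tuples found in extracted text."""
    return [
        (doc_id, int(m.group(1)), m.group(2).strip())
        for m in CAPTION_RE.finditer(text)
    ]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(captions):
    """Map each token to the positions of the captions that contain it."""
    index = defaultdict(set)
    for pos, (_, _, caption) in enumerate(captions):
        for token in tokenize(caption):
            index[token].add(pos)
    return index


def search(query, captions, index):
    """Return the captions that contain every keyword in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return [captions[i] for i in sorted(hits)]
```

In this sketch, `extract_captions` would be run on the text extracted from each document, and `search("query latency", captions, index)` would return only the captions that contain both keywords. A production system would hand the extracted captions to a dedicated search engine rather than keep them in a Python dictionary.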
