We present emtsv, a more efficient version of the e-magyar NLP pipeline for Hungarian. It integrates Hungarian NLP tools in a framework whose individual modules can be developed or replaced independently and which allows new ones to be added. The design also allows convenient inspection and manual correction of the data flowing from one module to the next. The improvements we publish include effective communication between the modules and support for using individual modules both within the chain and standing alone. These goals are accomplished using extended tsv (tab separated values) files, a simple, uniform, generic and self-documenting input/output format. Our vision is to maintain the system for a long time and to make it easier for external developers to fit their own modules into it, thus sharing existing competencies in processing Hungarian, a mid-resourced language. The source code is available under the LGPL 3.0 license.
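The module interface described above can be illustrated with a minimal sketch: each module reads a tab-separated stream with a header row, one token per line and an empty line between sentences, and appends its own output as a new column. The column names ("form") and the added "upper" field below are illustrative assumptions, not the actual emtsv schema.

```python
# Minimal sketch of an extended-tsv module: read header + token rows,
# append one new column, write the enriched stream back out.
import sys


def add_column(lines, new_field, compute):
    """Append a column named `new_field`, filled per token by `compute(row)`."""
    header = lines[0].rstrip("\n").split("\t")
    yield "\t".join(header + [new_field]) + "\n"
    for line in lines[1:]:
        line = line.rstrip("\n")
        if not line:                      # empty line marks a sentence boundary
            yield "\n"
            continue
        fields = line.split("\t")
        row = dict(zip(header, fields))
        yield "\t".join(fields + [compute(row)]) + "\n"


if __name__ == "__main__":
    # Toy "module": copy the form column in upper case into a new field.
    for out in add_column(sys.stdin.readlines(), "upper",
                          lambda row: row["form"].upper()):
        sys.stdout.write(out)
```

Because every module consumes and produces the same self-documenting format, the output of any stage can be inspected, corrected by hand, and fed back into the next stage.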
The CoNLL-2000 dataset is the de facto standard for measuring chunkers on the task of chunking base noun phrases (NP) and arbitrary phrases. The state-of-the-art tagging method utilises TnT, an HMM-based part-of-speech (POS) tagger, with simple majority voting on different representations and fine-grained classes created by lexicalising the tags. In this paper, the state-of-the-art English phrase chunking method was thoroughly investigated, re-implemented and evaluated with several modifications. We also investigated less studied aspects of phrase chunking: voting between different currently available taggers, checking for invalid tag sequences, and how the state-of-the-art method can be adapted to morphologically rich, agglutinative languages. We propose a new, mild level of lexicalisation and a better combination of representations and taggers for English. The final architecture outperformed the state of the art on both tasks, achieving F-scores of 95.06% for arbitrary phrase identification and 96.49% for noun phrase chunking.
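The simple majority voting mentioned above can be sketched as follows, assuming each tagger produces one IOB chunk tag per token; breaking ties by tagger order is an illustrative choice here, not necessarily the scheme used in the paper.

```python
# Minimal sketch of per-token majority voting over several taggers' outputs.
from collections import Counter


def vote(tag_sequences):
    """tag_sequences: list of equally long IOB tag lists, one list per tagger."""
    voted = []
    for token_tags in zip(*tag_sequences):
        counts = Counter(token_tags)
        best = max(counts.values())
        # The first tagger in the list wins ties.
        voted.append(next(t for t in token_tags if counts[t] == best))
    return voted


print(vote([["B-NP", "I-NP", "O"],
            ["B-NP", "B-NP", "O"],
            ["B-NP", "I-NP", "B-VP"]]))
# ['B-NP', 'I-NP', 'O']
```

A real system would additionally check that the voted sequence is valid (e.g. no I-NP immediately after O) and repair or re-vote invalid transitions.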
Web archives store born-digital documents, which are usually collected from the Internet by crawlers and stored in the Web ARChive (WARC) format. The trustworthiness and integrity of web archives are still an open challenge, especially in the news portal domain, which faces additional challenges of censorship even in democratic societies. The aim of this paper is to present a lightweight, blockchain-based solution for web archive validation, which ensures that documents retrieved by crawlers remain verifiably authentic for many years to come. We developed our archive validation solution as an extension and continuation of our work on web crawler development, mainly targeting news portals. The system is designed as an overlay over a blockchain with a proof-of-stake (PoS) distributed consensus algorithm. PoS was chosen because of its lower ecological footprint compared to proof-of-work solutions (e.g. Bitcoin) and the lower expected investment in computing infrastructure. We based our prototype on the open-source Nxt blockchain and implemented it in Python. The prototype was tested on web archive content crawled from Hungarian news portals at two different timestamps, with more than 1 million articles in total. We concluded that the proposed solution is accessible and usable by different stakeholders to validate crawled content, can be deployed on cheap commodity hardware, tackles the archive integrity challenge and is capable of efficiently managing duplicate documents.
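The core validation idea can be sketched as below: a fingerprint of each crawled document is recorded at archiving time and later compared against a fingerprint recomputed from the stored copy. How the fingerprint is anchored on the Nxt blockchain (transaction format, lookup API) is abstracted away here and would differ in the actual prototype.

```python
# Minimal sketch of hash-based validation of an archived document.
import hashlib


def document_fingerprint(payload: bytes) -> str:
    """Deterministic fingerprint of an archived document's payload."""
    return hashlib.sha256(payload).hexdigest()


def validate(payload: bytes, recorded_fingerprint: str) -> bool:
    """True if the archived copy still matches the fingerprint recorded on-chain."""
    return document_fingerprint(payload) == recorded_fingerprint


# Usage: `recorded` would in practice be retrieved from the blockchain.
recorded = document_fingerprint(b"<html>original article</html>")
print(validate(b"<html>original article</html>", recorded))   # True
print(validate(b"<html>tampered article</html>", recorded))   # False
```

Identical fingerprints also make duplicate documents trivial to detect, which is how the prototype keeps duplicate management cheap.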