Fusion architectures for automatic subject indexing under concept drift

Toepfer, Martin; Seifert, Christin

doi:10.1007/s00799-018-0240-3

Cited by 8 publications

(8 citation statements)

References 29 publications

(105 reference statements)


“…In both cases, the result is a list of candidate subjects for the document. In order to determine the final set of suggested subjects for the document, the candidates must then be ranked and only the most promising ones retained (Medelyan, 2009; Toepfer & Seifert, 2018).…”

Section: Process Of Automated Indexing (mentioning)

confidence: 99%

“…Algorithms for automated subject indexing can generally be divided into lexical and associative approaches (Toepfer & Seifert, 2018). In lexical approaches, frequently occurring or otherwise salient terms in the document are matched with terms in the vocabulary.…”

Section: Approaches (mentioning)

confidence: 99%

“…Associative approaches, including machine learning algorithms, instead find correlations between words (or, more generally, short sequences of words called n-grams) in document text and subjects, based on a large amount of training data. These two approaches can be considered complementary, and often the best results are obtained by combining results from both kinds of algorithms using ensembles and/or fusion architectures (Toepfer & Seifert, 2018).…”

Section: Approaches (mentioning)

confidence: 99%

“…Fusion methods for automated subject indexing (Toepfer & Seifert, 2018) are ways of combining results from multiple algorithms. The algorithms are combined into an ensemble and the final prediction of subjects is made by using a decision function applied on the predictions of individual algorithms.…”

Section: Ensembles and Data Fusion (mentioning)

confidence: 99%
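
The statement above describes fusion as a decision function applied to the predictions of individual algorithms. The following is a minimal sketch of that idea, assuming a plain weighted mean as the decision function. The function name `fuse_scores`, the weights, and the toy scores are hypothetical illustrations and are not taken from Toepfer & Seifert (2018) or from any particular tool.

```python
from collections import defaultdict


def fuse_scores(predictions, weights=None, top_k=3):
    """Combine per-algorithm subject scores with a weighted mean.

    predictions: list of dicts mapping subject -> score in [0, 1],
                 one dict per algorithm (hypothetical input format).
    weights:     optional list of per-algorithm weights (default: equal).
    top_k:       number of highest-ranked subjects to retain.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)

    fused = defaultdict(float)
    for scores, weight in zip(predictions, weights):
        for subject, score in scores.items():
            # A subject missing from one algorithm contributes 0 for it.
            fused[subject] += weight * score / total

    # Rank by fused score (descending), break ties alphabetically.
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Toy scores from one lexical and one associative algorithm (made up).
lexical = {"economics": 0.9, "trade": 0.4}
associative = {"economics": 0.5, "inflation": 0.8}
result = fuse_scores([lexical, associative], top_k=2)
```

With equal weights, a subject proposed by both algorithms tends to outrank one proposed by only a single algorithm, which is one simple way the complementary strengths of lexical and associative approaches can be combined.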


Annif: DIY automated subject indexing using multiple algorithms

Suominen

2019

LIBER


Manually indexing documents for subject-based access is a labour-intensive process. We propose using metadata gathered from bibliographic databases to train algorithms that assist librarians in that work. We have developed Annif, an open source tool and microservice for automated subject indexing. After training it with a subject vocabulary and existing metadata, Annif can be used to assign subject headings for new documents. We have tested Annif with different document collections including scientific papers, old scanned books and contemporary e-books, Q&A pairs from an "ask a librarian" service, Finnish Wikipedia, and the archives of a local newspaper. The results of analysing scientific papers and current books have been reassuring, while other types of documents have proved to be more challenging. The current version is based on a combination of existing natural language processing and machine learning tools. By combining multiple approaches and existing open source algorithms, Annif can build on the strengths of individual algorithms and adapt to different settings. With Annif, we expect to improve subject indexing and classification processes especially for electronic documents as well as collections that otherwise would not be indexed at all.



“…Figures 6.10c and 6.11c). This resembles a challenge because label annotations suffer from concept drift over time [TS20]. We use the years 2012 and 2013 as test documents for EconBiz and the year 2016 for IREON to obtain a 90:10 train-test ratio, as in the citation recommendation datasets described above.…”

Section: Chronological Train-test Splits (mentioning)

confidence: 99%
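
The statement above uses the most recent years as test documents to obtain a 90:10 train-test ratio. The following is a minimal sketch of such a chronological split, assuming each document carries a publication year. The function name `chronological_split`, the `year` field, and the toy corpus are hypothetical and do not reproduce the EconBiz or IREON data.

```python
def chronological_split(documents, first_test_year):
    """Split so that every training document predates every test document."""
    train = [doc for doc in documents if doc["year"] < first_test_year]
    test = [doc for doc in documents if doc["year"] >= first_test_year]
    return train, test


# Toy corpus: one hypothetical document per year, 1994 through 2013.
documents = [{"id": i, "year": 1994 + i} for i in range(20)]

# Hold out the two most recent years (2012 and 2013) as test data.
train, test = chronological_split(documents, first_test_year=2012)
test_share = len(test) / len(documents)
```

Unlike a random split, this keeps all test documents strictly later than the training documents, so any drift in how labels are assigned over time shows up in the evaluation.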

Representation Learning for Texts and Graphs

Galke


[...] This thesis is situated between natural language processing and graph representation learning and investigates selected connections. First, we introduce matrix embeddings as an efficient text representation sensitive to word order. [...] Experiments with ten linguistic probing tasks, 11 supervised, and five unsupervised downstream tasks reveal that vector and matrix embeddings have complementary strengths and that a jointly trained hybrid model outperforms both. Second, a popular pretrained language model, BERT, is distilled into matrix embeddings. [...] The results on the GLUE benchmark show that these models are competitive with other recent contextualized language models while being more efficient in time and space. Third, we compare three model types for text classification: bag-of-words, sequence-, and graph-based models. Experiments on five datasets show that, surprisingly, a wide multilayer perceptron on top of a bag-of-words representation is competitive with recent graph-based approaches, questioning the necessity of graphs synthesized from the text. [...] Fourth, we investigate the connection between text and graph data in document-based recommender systems for citations and subject labels. Experiments on six datasets show that the title as side information improves the performance of autoencoder models. [...] We find that the meaning of item co-occurrence is crucial for the choice of input modalities and an appropriate model. Fifth, we introduce a generic framework for lifelong learning on evolving graphs in which new nodes, edges, and classes appear over time. [...] The results show that by reusing previous parameters in incremental training, it is possible to employ smaller history sizes with only a slight decrease in accuracy compared to training with complete history. Moreover, weighting the binary cross-entropy loss function is crucial to mitigate the problem of class imbalance when detecting newly emerging classes. [...]


Automated Dewey Decimal Classification of Swedish library metadata using Annif software

Golub, Suominen, Mohammed et al. 2024


Purpose: In order to estimate the value of semi-automated subject indexing in operative library catalogues, the study aimed to investigate five different automated implementations of an open source software package on a large set of Swedish union catalogue metadata records, with Dewey Decimal Classification (DDC) as the target classification system. It also aimed to contribute to the body of research on aboutness and related challenges in automated subject indexing and evaluation.

Design/methodology/approach: On a sample of over 230,000 records with close to 12,000 distinct DDC classes, the open source tool Annif, developed by the National Library of Finland, was applied in the following implementations: lexical algorithm, support vector classifier, fastText, Omikuji Bonsai and an ensemble approach combining the former four. A qualitative study involving two senior catalogue librarians and three students of library and information studies was also conducted to investigate the value and inter-rater agreement of automatically assigned classes, on a sample of 60 records.

Findings: The best results were achieved using the ensemble approach, which reached 66.82% accuracy on the three-digit DDC classification task. The qualitative study confirmed earlier studies reporting low inter-rater agreement but also pointed to the potential value of automatically assigned classes as additional access points in information retrieval.

Originality/value: The paper presents an extensive study of automated classification in an operative library catalogue, accompanied by a qualitative study of automated classes. It demonstrates the value of applying semi-automated indexing in operative information retrieval systems.

