Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Hall, Mark; Clough, Paul; Stevenson, Mark

doi:10.1007/978-3-642-33290-6_35

Cited by 13 publications

(7 citation statements)

References 24 publications

(25 reference statements)


“…Since its inception, the method of Chang et al (2009) has been used variously as a means of assessing topic models (Paul and Girju, 2010; Reisinger et al, 2010; Hall et al, 2012). Despite its wide acceptance, the method relies on manual annotation and has never been automated.…”

Section: Introduction (mentioning)

confidence: 99%

Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality

Lau, Newman, Baldwin, 2014

Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

401

238


Topic models based on latent Dirichlet allocation and related methods are used in a range of user-focused tasks including document navigation and trend analysis, but evaluation of the intrinsic quality of the topic model and topics remains an open research area. In this work, we explore the two tasks of automatic evaluation of single topics and automatic evaluation of whole topic models, and provide recommendations on the best strategy for performing the two tasks, in addition to providing an open-source toolkit for topic and topic model evaluation.


“…As the acceptance of topic coherence measures increases as a mean of topic model assessment (Paul and Girju, 2010; Reisinger et al, 2010; Hall et al, 2012), recent research trends focus on proposing fast and efficient models that can be scaled up to big amounts of data (Yang et al, 2015; Nguyen et al, 2015), using the whole text per document for training.…”

Section: Related Work (mentioning)

confidence: 99%

Improving Topic Coherence Using Entity Extraction Denoising

Cardenas, Bello, Coronado et al., 2018

The Prague Bulletin of Mathematical Linguistics


Managing large collections of documents is an important problem for many areas of science, industry, and culture. Probabilistic topic modeling offers a promising solution. Topic modeling is an unsupervised machine learning method and the evaluation of this model is an interesting problem on its own. Topic interpretability measures have been developed in recent years as a more natural option for topic quality evaluation, emulating human perception of coherence with word sets correlation scores. In this paper, we show experimental evidence of the improvement of topic coherence score by restricting the training corpus to that of relevant information in the document obtained by Entity Recognition. We experiment with job advertisement data and find that with this approach topic models improve interpretability by about 40 percentage points on average. Our analysis reveals as well that using the extracted text chunks, some redundant topics are joined while others are split into more skill-specific topics. Fine-grained topics observed in models using the whole text are preserved.


“…It is possible to receive multiple records for the same object from the same institution. A quality control failure during the data ingestion process can let duplicates be published in the Europeana portal. Clustering allows us to identify these duplicates with a high degree of accuracy; often the exact same metadata appears in many fields.…”

Section: Qualitative Evaluation and Categorisation Of Clusters (mentioning)

confidence: 99%

“…In the latter case, an automatic procedure would have difficulty making the distinction with other types of relations. Derivative works: These are objects which are derived from another one, such as reprint. Fig.…”

Section: Qualitative Evaluation and Categorisation Of Clusters (mentioning)

confidence: 99%

“…Recently, researchers started looking at using external knowledge bases such as Wikipedia [7] or WordNet [14] to help measuring similarities between objects. Different similarity measures were compared [8,3] but most existing work explore a single dimension of similarity, which does not take into account the multidimensionality of CH collections; it also focuses on smaller-scale collections. The extraction of FRBR-like relations, a topic researched for more than a decade [17,9], has been a clear source for inspiration for us.…”

Section: Related Work (mentioning)

confidence: 99%


Hierarchical Structuring of Cultural Heritage Objects within Large Aggregations

Wang, Isaac, Charles et al., 2013

Research and Advanced Technology for Digital Libraries


Huge amounts of cultural content have been digitised and are available through digital libraries and aggregators like Europeana.eu. However, it is not easy for a user to have an overall picture of what is available nor to find related objects. We propose a method for hierarchically structuring cultural objects at different similarity levels. We describe a fast, scalable clustering algorithm with an automated field selection method for finding semantic clusters. We report a qualitative evaluation on the cluster categories based on records from the UK and a quantitative one on the results from the complete Europeana dataset.

Iterative parallel clustering based on compression similarity

The clustering process is iterative as follows:

Step 1: Choose a similarity level and set the maximum iteration (see footnote 6).

Footnote 5: The size of these groups depends on the desired similarity level. If clustering at level 100, 16 minhashes are randomly chosen for each group, while if at level 20, only 2 minhashes are selected. In this way, clusters at higher similarity levels have higher probability to be precise than those at lower levels.

Footnote 6: In our experiments, the maximum iteration is set at 5.
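The minhash grouping described above can be pictured with a small sketch. This is not the authors' implementation: the function names, the use of BLAKE2b as the hash family, the signature length of 32, and the contiguous (rather than randomly chosen) groups are all assumptions made for illustration. It only shows why a larger group size makes a shared bucket a stricter test of similarity.

```python
import hashlib
from collections import defaultdict


def minhash_signature(tokens, num_hashes=32):
    """Return one minimum hash value per seeded hash function.

    `tokens` is a non-empty set of strings. `num_hashes` is an assumed
    signature length, not a value taken from the paper.
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(),
                "big",
            )
            for t in tokens
        ))
    return tuple(signature)


def candidate_clusters(records, group_size, num_hashes=32):
    """Group records that agree on every minhash in at least one group.

    Groups here are contiguous slices of the signature (a simplification;
    the quoted text says the minhashes are randomly chosen). A larger
    `group_size` means more minhashes must match, so buckets are stricter.
    """
    buckets = defaultdict(set)
    for rec_id, tokens in records.items():
        sig = minhash_signature(tokens, num_hashes)
        for start in range(0, num_hashes - group_size + 1, group_size):
            buckets[(start, sig[start:start + group_size])].add(rec_id)
    # Keep only buckets that actually bring two or more records together.
    return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}


# Hypothetical metadata records, tokenised into sets of words.
records = {
    "a": {"portrait", "of", "queen", "victoria", "engraving"},
    "b": {"portrait", "of", "queen", "victoria", "engraving"},
    "c": {"roman", "coin", "silver", "denarius"},
}
strict = candidate_clusters(records, group_size=16)  # cf. "level 100"
loose = candidate_clusters(records, group_size=2)    # cf. "level 20"
```

Exact duplicates such as "a" and "b" share a bucket at either group size, while a record with no tokens in common stays apart. Partially overlapping records are more likely to be bucketed together at the smaller group size, which is the trade-off the footnote describes.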

