Huge amounts of cultural content have been digitised and are available through digital libraries and aggregators like Europeana.eu. However, it is not easy for a user to get an overall picture of what is available or to find related objects. We propose a method for hierarchically structuring cultural objects at different similarity levels. We describe a fast, scalable clustering algorithm with an automated field selection method for finding semantic clusters. We report a qualitative evaluation of the cluster categories based on records from the UK and a quantitative evaluation of the results from the complete Europeana dataset.
Iterative parallel clustering based on compression similarity

The clustering process is iterative, as follows:

Step 1 Choose a similarity level and set the maximum number of iterations. 6

Footnotes:
5 The size of these groups depends on the desired similarity level. When clustering at level 100, 16 minhashes are randomly chosen for each group, whereas at level 20 only 2 minhashes are selected. In this way, clusters at higher similarity levels are more likely to be precise than those at lower levels.
6 In our experiments, the maximum number of iterations is set to 5.
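The note on minhash groups can be illustrated with a minimal sketch. It is not the authors' implementation and does not cover the compression-similarity step. The group sizes for levels 100 and 20 are the ones stated in the text; everything else is an assumption made for illustration: whitespace tokenisation, the seeded BLAKE2b hash, the 64-value signature, the number of groups, and the helper names `minhash_signature` and `candidate_clusters`. Records that agree on every minhash in at least one group land in the same bucket, so larger groups demand closer agreement.

```python
import hashlib
import random
from collections import defaultdict

# Group sizes stated in the text; sizes for other levels are not given there.
GROUP_SIZE = {100: 16, 20: 2}


def minhash_signature(token_set, num_hashes=64):
    """Return one minimum hash value per seeded hash function (assumed scheme)."""
    if not token_set:
        raise ValueError("cannot compute a minhash signature of an empty record")
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{tok}".encode(), digest_size=8).digest(),
                "big",
            )
            for tok in token_set
        ))
    return signature


def candidate_clusters(records, level, num_groups=8, num_hashes=64, rng_seed=0):
    """Bucket records that agree on all minhashes of at least one random group."""
    size = GROUP_SIZE[level]
    rng = random.Random(rng_seed)
    # Each group is a random selection of minhash positions.
    groups = [rng.sample(range(num_hashes), size) for _ in range(num_groups)]

    buckets = defaultdict(set)
    for record_id, text in records.items():
        sig = minhash_signature(set(text.lower().split()), num_hashes)
        for g, positions in enumerate(groups):
            key = (g, tuple(sig[p] for p in positions))
            buckets[key].add(record_id)

    # Keep only buckets holding more than one record, without duplicates.
    return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}


if __name__ == "__main__":
    # Hypothetical records, used only to exercise the sketch.
    demo = {
        "a": "oil painting of a river landscape",
        "b": "oil painting of a river landscape",
        "c": "bronze coin roman empire",
    }
    print(candidate_clusters(demo, level=100))
```

With 16 minhashes per group, two records share a bucket only if they agree on all 16 values, which is unlikely unless their token sets are nearly identical. With 2 minhashes per group, loosely related records collide far more often, which matches the observation that higher levels yield more precise clusters.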