Abstract. We propose a novel methodology for clustering XML documents on the basis of their structural similarities. The idea is to equip each cluster with an XML cluster representative, i.e. an XML document subsuming the most typical structural specifics of a set of XML documents. Clustering is essentially accomplished by comparing cluster representatives, and updating the representatives as soon as new clusters are detected. We present an algorithm for the computation of an XML representative based on suitable techniques for identifying significant node matchings and for reliably merging and pruning XML trees. Experimental evaluation performed on both synthetic and real data shows the effectiveness of our approach.
We propose an incremental technique for discovering duplicates in large databases of textual sequences, i.e., syntactically different tuples, that refer to the same real-world entity. The problem is approached from a clustering perspective: given a set of tuples, the objective is to partition them into groups of duplicate tuples. Each newly arrived tuple is assigned to an appropriate cluster via nearest-neighbor classification. This is achieved by means of a suitable hash-based index, that maps any tuple to a set of indexing keys and assigns tuples with high syntactic similarity to the same buckets. Hence, the neighbors of a query tuple can be efficiently identified by simply retrieving those tuples that appear in the same buckets associated to the query tuple itself, without completely scanning the original database. Two alternative schemes for computing indexing keys are discussed and compared. An extensive experimental evaluation on both synthetic and real data shows the effectiveness of our approach.
Recommendation on signed social rating networks is studied through an innovative approach. Bayesian probabilistic modeling is used to postulate a realistic generative process, wherein user and item interactions are explained by latent factors, whose relevance varies within the underlying network organization into user communities and item groups. Approximate posterior inference captures distrust propagation and drives Gibbs sampling to allow rating and (dis)trust prediction for recommendation along with the unsupervised exploratory analysis of network organization. Comparative experiments reveal the superiority of our approach in rating and link prediction on
Epinions
and
Ciao
, besides community quality and recommendation sensitivity to network organization.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations鈥揷itations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.