The Web offers a corpus of over 100 million tables [6], but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases, and even when they do, the attribute names are typically useless. We describe a system that attempts to recover the semantics of tables by enriching each table with additional annotations. Our annotations facilitate operations such as searching for tables and finding related tables. To recover the semantics of tables, we leverage a database of class labels and relationships automatically extracted from the Web. This database has very wide coverage but is also noisy. We attach a class label to a column if a sufficient number of the values in the column are identified with that label in the database of class labels, and analogously for binary relationships. We describe a formal model for reasoning about when we have seen sufficient evidence for a label, and show that it performs substantially better than a simple majority scheme. We describe a set of experiments that illustrate the utility of the recovered semantics for table search and show that it performs substantially better than previous approaches. In addition, we characterize what fraction of tables on the Web can be annotated using our approach.
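The simple majority scheme that the abstract uses as a baseline can be sketched roughly as follows. This is an illustration only, not the formal model the authors propose: the function name `majority_labels`, the `threshold` parameter, and the toy `label_db` dictionary are all assumptions made for the example.

```python
from collections import Counter


def majority_labels(column_values, label_db, threshold=0.5):
    """Return the class labels supported by more than `threshold` of the cells.

    `label_db` is a hypothetical stand-in for the extracted database of
    class labels: it maps a normalized cell value to a set of labels.
    """
    if not column_values:
        return []
    counts = Counter()
    for value in column_values:
        # Normalize the cell, then count each label at most once per cell.
        for label in set(label_db.get(value.strip().lower(), ())):
            counts[label] += 1
    n = len(column_values)
    return sorted(label for label, c in counts.items() if c / n > threshold)


# Toy data (assumed, not from the paper).
label_db = {
    "paris": {"city", "capital"},
    "london": {"city", "capital"},
    "chicago": {"city"},
}
column = ["Paris", "London", "Chicago", "Atlantis"]
print(majority_labels(column, label_db))
```

Here "city" is supported by three of the four cells and is kept, while "capital" is supported by exactly half and falls short of a strict majority. A fixed threshold like this ignores how noisy or how specific a label is, which is the kind of weakness a more principled evidence model would need to address.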
Fully automatic methods that extract lists of objects from the Web have been studied extensively. Record extraction, the first step of this object extraction process, identifies a set of Web page segments, each of which represents an individual object (e.g., a product). State-of-the-art methods suffice for simple search, but they often fail to handle more complicated or noisy Web page structures due to a key limitation: their greedy manner of identifying a list of records through pairwise comparison (i.e., similarity match) of consecutive segments. This paper introduces a new method for record extraction that captures a list of objects in a more robust way based on a holistic analysis of a Web page. The method focuses on how a distinct tag path appears repeatedly in the DOM tree of the Web document. Instead of comparing a pair of individual segments, it compares a pair of tag path occurrence patterns (called visual signals) to estimate how likely these two tag paths represent the same list of objects. The paper introduces a similarity measure that captures how closely the visual signals appear and interleave. Clustering of tag paths is then performed based on this similarity measure, and sets of tag paths that form the structure of data records are extracted. Experiments show that this method achieves higher accuracy than previous methods.
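To make the notion of a tag path occurrence pattern more concrete, the sketch below collects, for each distinct tag path, the positions at which it occurs while scanning a document. It covers only this first step. The paper's similarity measure and clustering are not reproduced, and the class `TagPathCollector`, the helper `tag_path_signals`, and the `VOID_TAGS` set are names assumed for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, assumed sufficient here).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagPathCollector(HTMLParser):
    """Record the occurrence positions of every root-to-node tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags
        self.position = 0     # index of the next start tag in document order
        self.occurrences = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        path = "/".join(self.stack + [tag])
        self.occurrences[path].append(self.position)
        self.position += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_path_signals(html):
    """Map each tag path to the list of positions where it occurs."""
    parser = TagPathCollector()
    parser.feed(html)
    parser.close()
    return dict(parser.occurrences)


html = "<ul><li><a>A</a></li><li><a>B</a></li></ul>"
signals = tag_path_signals(html)
print(signals)
```

In this toy page the paths `ul/li` and `ul/li/a` occur at alternating positions, which is the kind of repeated, interleaved pattern the abstract describes as evidence that two tag paths belong to the same list of records.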
Collaborative networks are a special type of social network formed by members who collectively achieve specific goals, such as fixing software bugs and resolving customers' problems. In such networks, information flow among members is driven by the tasks assigned to the network, and by the expertise of its members to complete those tasks. In this work, we analyze real-life collaborative networks to understand their common characteristics and how information is routed in these networks. Our study shows that collaborative networks exhibit significantly different properties compared with other complex networks. Collaborative networks have truncated power-law node degree distributions and other organizational constraints. Furthermore, the number of steps along which information is routed follows a truncated power-law distribution. Based on these observations, we developed a network model that can generate synthetic collaborative networks subject to certain structure constraints. Moreover, we developed a routing model that emulates task-driven information routing conducted by human beings in a collaborative network. Together, these two models can be used to study the efficiency of information routing for different types of collaborative networks, a problem that is important in practice yet difficult to solve without the method proposed in this paper.
Biclustering is crucial in finding co-expressed genes and their associated conditions in gene expression data. While various biclustering algorithms (e.g., combinatorial, probabilistic modelling, and matrix factorization) have been proposed and constantly improved in the past decade, data noise and bicluster overlaps still make biclustering a challenging task. It becomes difficult to further improve biclustering performance without resorting to a new approach. Inspired by the recent progress in unsupervised feature learning using deep neural networks [1], in this work we propose a novel model for biclustering, named AutoDecoder (AD), by relating biclusters to features and leveraging a neural network that is able to automatically learn features from the input data. To suppress the severe noise present in gene expression data, we introduce a non-uniform signal recovery mechanism: instead of reconstructing the whole input data to capture the bicluster patterns, AD weighs the zero and non-zero parts of the input data differently and is more flexible in dealing with different types of noise. AD is also properly regularized to deal with bicluster overlaps. To the best of our knowledge, this is the first biclustering algorithm that leverages neural network techniques to recover overlapped biclusters hidden in noisy gene expression data. We compared our approach with four state-of-the-art biclustering algorithms on both synthetic and real datasets. On three out of the four real datasets, AD significantly outperforms the other approaches. On controlled synthetic datasets, AD performs the best when the noise level is beyond 15%. Source code: http://grafia.cs.ucsb.edu/autodecoder/
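The idea of weighing zero and non-zero input entries differently can be illustrated with a weighted squared reconstruction error. This is a minimal sketch under stated assumptions, not AD's actual objective: the squared-error form, the function name `weighted_reconstruction_error`, and the default weights are all chosen here for illustration, and the abstract does not specify them.

```python
def weighted_reconstruction_error(x, x_hat, zero_weight=0.2, nonzero_weight=1.0):
    """Sum of squared errors, weighted by whether the input entry is zero.

    `x` is the input vector and `x_hat` its reconstruction. The weights are
    hypothetical defaults, not values taken from the paper.
    """
    if len(x) != len(x_hat):
        raise ValueError("x and x_hat must have the same length")
    total = 0.0
    for xi, xhi in zip(x, x_hat):
        weight = zero_weight if xi == 0 else nonzero_weight
        total += weight * (xi - xhi) ** 2
    return total


x = [0, 0, 1, 2]
x_hat = [0.5, 0, 1, 1]
print(weighted_reconstruction_error(x, x_hat))
```

With a smaller weight on zero entries, an error made on a zero cell is penalized less than the same error on a non-zero cell, so the reconstruction is pushed to recover the non-zero signal first. Setting both weights to the same value reduces this to an ordinary uniform squared error.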