We present the design of a structured search engine that returns a multi-column table in response to a query consisting of keywords describing each of its columns. We answer such queries by exploiting the millions of tables on the Web, because these are much richer sources of structured knowledge than free-format text. However, a corpus of tables harvested from arbitrary HTML web pages presents huge challenges of diversity and redundancy not seen in centrally edited knowledge bases. We concentrate on one concrete task in this paper: given a set of Web tables T1, ..., Tn and a query Q with q sets of keywords Q1, ..., Qq, decide for each Ti whether it is relevant to Q and, if so, identify the mapping between the columns of Ti and the query columns. We represent this task as a graphical model that jointly maps all tables by incorporating diverse sources of clues spanning matches in different parts of the table, corpus-wide co-occurrence statistics, and content overlap across table columns. We define a novel query segmentation model for matching keywords to table columns, and a robust mechanism for exploiting content overlap across table columns. We design efficient inference algorithms based on bipartite matching and constrained graph cuts to solve the joint labeling task. Experiments on a workload of 59 queries over a corpus of 25 million Web tables show a significant boost in accuracy over baseline IR methods.
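As a rough illustration of the bipartite-matching idea in the abstract above, the following Python sketch maps query columns to distinct table columns by maximizing a total match score. The scoring function (`overlap_score`, plain token overlap with the column header), the relevance threshold `min_total`, and all names are assumptions made for illustration. The abstract does not specify them, and the full model also draws on co-occurrence statistics and content overlap across tables, which are omitted here.

```python
from itertools import permutations

def overlap_score(keywords, header):
    """Hypothetical score: fraction of query keywords found in the header."""
    kw = set(keywords.lower().split())
    hd = set(header.lower().split())
    return len(kw & hd) / len(kw) if kw else 0.0

def best_column_mapping(query_cols, table_headers, min_total=0.5):
    """Exhaustive max-weight bipartite matching (fine for a handful of columns).

    Returns (mapping, total), where mapping[i] is the table column index
    assigned to query column i, or (None, total) if the table looks irrelevant.
    """
    q, n = len(query_cols), len(table_headers)
    if q > n:
        return None, 0.0
    scores = [[overlap_score(qc, h) for h in table_headers] for qc in query_cols]
    best_total, best_perm = -1.0, None
    # Try every injective assignment of query columns to table columns.
    for perm in permutations(range(n), q):
        total = sum(scores[i][c] for i, c in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    if best_total < min_total:
        return None, best_total  # assumed relevance cut-off
    return list(best_perm), best_total

mapping, total = best_column_mapping(
    ["country name", "currency"],
    ["Currency", "ISO code", "Country"],
)
```

The exhaustive loop is only practical for small tables; a polynomial-time assignment solver would be the natural replacement at scale.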
We present a decision support system, built on top of enterprise human resources management databases, for managing and optimizing screening activities during the hiring process in a large organization. The basic idea is to prioritize the efforts of human resource practitioners so that they focus on candidates who are likely to be of high quality, likely to accept a job offer if made one, and likely to remain with the organization for the long term. To do so, the system first ranks candidates individually along several dimensions, using a keyword matching algorithm and several bipartite ranking algorithms with univariate loss trained on historical actions. Next, the individual rankings are aggregated into a single list that is presented to the recruitment team through an interactive portal. The portal supports multiple filters that facilitate effective identification of candidates. We demonstrate the usefulness of our system on data collected from a large organization over several years, with business value metrics showing greater hiring yield with fewer interviews. Similarly, using historical pre-hire data, we demonstrate accurate identification of candidates who go on to leave the organization quickly. The system has been deployed as described in a large, globally integrated enterprise.
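The aggregation step described above can be pictured with a simple Borda count, shown here as a hedged Python sketch. The abstract does not say which aggregation rule the system uses, so Borda counting, the dimension names, and the candidate identifiers are all assumptions for illustration.

```python
def borda_aggregate(rankings):
    """Combine several best-first rankings into one list.

    Each candidate earns (n - position - 1) points per ranking; ties in
    total points are broken alphabetically so the output is deterministic.
    """
    candidates = sorted(set().union(*rankings.values()))
    n = len(candidates)
    points = {c: 0 for c in candidates}
    for ranked in rankings.values():
        for pos, cand in enumerate(ranked):
            points[cand] += n - pos - 1
    return sorted(candidates, key=lambda c: (-points[c], c))

# Hypothetical per-dimension rankings (best candidate first).
rankings = {
    "quality":    ["ann", "bob", "cy"],
    "acceptance": ["bob", "ann", "cy"],
    "retention":  ["bob", "cy", "ann"],
}
shortlist = borda_aggregate(rankings)
```

Here `shortlist` places the candidate with the most accumulated points first; a deployed system could just as well weight the dimensions unequally.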
Label propagation is a well-explored family of methods for training a semi-supervised classifier in which the input data points (both labeled and unlabeled) are connected in the form of a weighted graph. For binary classification, the performance of these methods degrades considerably whenever the input dataset exhibits the following characteristics: (i) one of the class labels is rare or, equivalently, the class imbalance (CI) is very high, and (ii) the degree of supervision (DoS), defined as the fraction of labeled points, is very low. These characteristics are common in many real-world datasets relating to network fraud detection. Moreover, in such applications, the amount of class imbalance is not known a priori. In this paper, we propose and justify an alternative formulation for graph label propagation under such extreme dataset behavior. In our formulation, the objective function is the difference of two convex quadratic functions and the constraints are box constraints. We solve this program using the Concave-Convex Procedure (CCCP). Whenever the problem size becomes too large, we suggest working with a k-NN subgraph of the given graph, which can be sampled using Locality Sensitive Hashing (LSH). We also discuss various issues that one typically faces while sampling such a k-NN subgraph in practice. Further, we propose a novel label flipping method on top of the CCCP solution, which further improves the CCCP result whenever class imbalance information is available a priori. Our method can be easily adapted to a MapReduce platform such as Hadoop. We have conducted experiments on 11 datasets with graph sizes of up to 20K nodes, CI as high as 99.6%, and DoS as low as 0.5%. Our method yields up to a 19.5-times improvement in F-measure and up to a 17.5-times improvement in AUC-PR over baseline methods.
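For readers unfamiliar with the baseline family, the Python sketch below shows one standard form of graph label propagation: labeled nodes are clamped and each unlabeled node is repeatedly set to the weighted average of its neighbors. This is not the CCCP formulation proposed in the abstract; the function name, the 0.5 initial score, and the toy graph are assumptions for illustration only.

```python
def propagate_labels(weights, labels, iters=500):
    """Iterative label propagation on a weighted undirected graph.

    weights: dict node -> dict neighbor -> positive edge weight
    labels:  dict node -> 0.0 or 1.0 for the labeled nodes only
    Returns a dict of soft scores in [0, 1] for every node.
    """
    scores = {v: labels.get(v, 0.5) for v in weights}
    for _ in range(iters):
        for v, nbrs in weights.items():
            if v in labels or not nbrs:
                continue  # keep labeled nodes clamped; skip isolated nodes
            total = sum(nbrs.values())
            scores[v] = sum(w * scores[u] for u, w in nbrs.items()) / total
    return scores

# Path graph a - b - c - d with one positive and one negative seed.
graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
scores = propagate_labels(graph, {"a": 1.0, "d": 0.0})
```

On this toy path the scores of the unlabeled nodes fall off linearly between the two seeds, which hints at why a single rare positive label has limited reach when supervision is scarce.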