We study the problem of discovering functional dependencies (FDs) from a noisy data set. We adopt a statistical perspective and draw connections between FD discovery and structure learning in probabilistic graphical models. We show that discovering FDs from a noisy data set is equivalent to learning the structure of a model over binary random variables, where each random variable corresponds to a functional of the data set attributes. We build upon this observation to introduce FDX, a conceptually simple framework in which learning functional dependencies corresponds to solving a sparse regression problem. We show that FDX can recover true functional dependencies across a diverse array of real-world and synthetic data sets, even in the presence of noisy or missing data. We find that FDX scales to large data instances with millions of tuples and hundreds of attributes, while yielding an average F1 improvement of 2× over state-of-the-art FD discovery methods.
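The reduction above can be illustrated on a toy relation. In the sketch below, each pair of tuples is mapped to a vector of binary agreement variables (one per attribute), and a candidate FD `lhs -> rhs` is scored by how often agreement on `lhs` implies agreement on `rhs`. This is a minimal sketch of the binary-variable construction only: the data, attribute names, and the simple agreement-rate score are illustrative stand-ins, not the actual FDX sparse-regression estimator.

```python
from itertools import combinations

# Toy relation with a (slightly noisy) FD: city -> zip.
rows = [
    ("NYC", "10001", "A"), ("NYC", "10001", "B"), ("NYC", "10001", "C"),
    ("LA",  "90001", "A"), ("LA",  "90001", "B"),
    ("SF",  "94103", "A"),
    ("NYC", "10002", "D"),  # one noisy tuple violating city -> zip
]

def agreement_vectors(rows):
    """For every pair of tuples, emit one binary value per attribute:
    1 if the pair agrees on that attribute, else 0. These are the
    binary random variables of the structure-learning view."""
    vecs = {a: [] for a in range(len(rows[0]))}
    for r, s in combinations(rows, 2):
        for a in range(len(r)):
            vecs[a].append(1 if r[a] == s[a] else 0)
    return vecs

def fd_score(vecs, lhs, rhs):
    """Fraction of tuple pairs agreeing on lhs that also agree on rhs.
    A score near 1 suggests an (approximate) FD lhs -> rhs."""
    hits = [y for x, y in zip(vecs[lhs], vecs[rhs]) if x == 1]
    return sum(hits) / len(hits) if hits else 0.0

vecs = agreement_vectors(rows)
print(f"city  -> zip: {fd_score(vecs, 0, 1):.2f}")  # high despite noise
print(f"other -> zip: {fd_score(vecs, 2, 1):.2f}")  # near zero
```

FDX replaces this pairwise score with a sparse regression over the same binary variables, which is what lets it scale and handle many attributes jointly.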
The exploratory and confirmatory approaches of factor analysis lie at two ends of a continuum of substantive input for scale development. Recent advancements in Bayesian regularization methods enable more flexibility in covering a wide range of this substantive continuum. Based on the Bayesian Lasso (least absolute shrinkage and selection operator) methods for the regression model and the covariance matrix, this research proposes a partially confirmatory approach that addresses the loading and residual structures at the same time. With at least one specified loading per item, a one-step procedure can be applied to recover both structures simultaneously. With a few specified loadings per factor, a two-step procedure is preferred to capture the model configuration correctly. In both cases, the Bayesian hierarchical formulation is implemented using Markov chain Monte Carlo estimation with different Lasso or regular priors. Both simulated and real data sets were analyzed to evaluate the validity, robustness, and practical usefulness of the proposed approach across different situations.
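The shrinkage that drives the partially confirmatory approach can be illustrated with the simplest case of Lasso estimation: under a Laplace (Lasso) prior and an orthonormal design, the MAP estimate of a coefficient is the soft-thresholding of its least-squares estimate. This is a minimal sketch of that shrinkage principle, not the paper's MCMC procedure; the numeric loading estimates and threshold below are illustrative.

```python
def soft_threshold(z, lam):
    """MAP estimate of a coefficient under a Laplace (Lasso) prior
    with an orthonormal design: shrink toward zero, and set small
    values exactly to zero (pruning unspecified loadings)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Small cross-loadings are pruned; substantial loadings survive, shrunken.
estimates = [0.05, -0.08, 0.70, -0.65]
print([round(soft_threshold(z, 0.1), 2) for z in estimates])
# → [0.0, 0.0, 0.6, -0.55]
```

In the partially confirmatory setting, the handful of analyst-specified loadings anchor each factor, while the Lasso prior decides which of the remaining (unspecified) loadings and residual covariances survive this kind of shrinkage.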
Condor is used extensively in the HEP environment. It is the batch system of choice for many compute farms, including several WLCG Tier 1s, Tier 2s, and Tier 3s. It is also the building block of one of the Grid pilot infrastructures, namely glideinWMS. As with any software, Condor does not scale indefinitely with the number of users and/or the number of resources being handled. In this paper we present the currently observed scalability limits of both the latest production and the latest development release of Condor, and compare them with the limits reported in previous publications. We also describe the changes that were introduced to remove the previous scalability limits.
Hotspots, a small set of tuples frequently read or written by a large number of transactions, cause contention in a concurrency control protocol. While a hotspot may comprise only a small fraction of a transaction's execution time, conventional strict two-phase locking allows a transaction to release its locks only after the transaction completes, which leaves significant parallelism unexploited. Ideally, a concurrency control protocol serializes transactions only for the duration of the hotspots, rather than the duration of the transactions. We observe that exploiting such parallelism requires violating two-phase locking. In this paper, we propose Bamboo, a new concurrency control protocol that enables such parallelism by modifying conventional two-phase locking while maintaining the same correctness guarantees. We thoroughly analyze the effect of the cascading aborts involved in reading uncommitted data and discuss optimizations that further improve performance. Our evaluation on TPC-C shows a performance improvement of up to 4× over the best of the pessimistic and optimistic baseline protocols. On synthetic workloads that contain a single hotspot, Bamboo achieves a speedup of up to 19× over the baselines.
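The gap between the two serialization strategies can be seen with a back-of-envelope model. The sketch below compares the makespan of n conflicting transactions when the hotspot lock is held until commit (strict 2PL) versus when only the hotspot section serializes and the remaining work overlaps. The timing parameters are hypothetical and this is an illustration of the idea, not of Bamboo's actual protocol (which must also handle the cascading aborts that early release introduces).

```python
# Hypothetical cost model: each of n transactions spends `hotspot`
# time units on the contended tuple and `rest` units elsewhere.

def strict_2pl_makespan(n_txns, hotspot, rest):
    """Locks are held until commit, so transactions touching the
    hotspot serialize over their *entire* execution."""
    return n_txns * (hotspot + rest)

def early_release_makespan(n_txns, hotspot, rest):
    """Bamboo-style idea: only the hotspot sections serialize; the
    non-hotspot work of different transactions overlaps fully."""
    return n_txns * hotspot + rest

# The advantage grows with contention (more conflicting transactions).
for n in (2, 8, 32):
    print(n, strict_2pl_makespan(n, 1, 9), early_release_makespan(n, 1, 9))
```

With hotspot = 1 and rest = 9, eight conflicting transactions take 80 units under strict 2PL but only 17 when just the hotspot serializes, which mirrors why the speedup in the single-hotspot experiments is so large.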
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the citing article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.