Teppo Niinimäki scite author profile

Heikkilä

Honkela

et al. 2019

Motivation Human genomic datasets often contain sensitive information that limits use and sharing of the data. In particular, simple anonymization strategies fail to provide sufficient level of protection for genomic data, because the data are inherently identifiable. Differentially private machine learning can help by guaranteeing that the published results do not leak too much information about any individual data point. Recent research has reached promising results on differentially private drug sensitivity prediction using gene expression data. Differentially private learning with genomic data is challenging because it is more difficult to guarantee privacy in high dimensions. Dimensionality reduction can help, but if the dimension reduction mapping is learned from the data, then it needs to be differentially private too, which can carry a significant privacy cost. Furthermore, the selection of any hyperparameters (such as the target dimensionality) needs to also avoid leaking private information. Results We study an approach that uses a large public dataset of similar type to learn a compact representation for differentially private learning. We compare three representation learning methods: variational autoencoders, principal component analysis and random projection. We solve two machine learning tasks on gene expression of cancer cell lines: cancer type classification, and drug sensitivity prediction. The experiments demonstrate significant benefit from all representation learning methods with variational autoencoders providing the most accurate predictions most often. Our results significantly improve over previous state-of-the-art in accuracy of differentially private drug sensitivity prediction. Availability and implementation Code used in the experiments is available at https://github.com/DPBayes/dp-representation-transfer.

The Mixing of Markov Chains on Linear Extensions in Practice

Talvitie

Koivisto

2017

We investigate almost uniform sampling from the set of linear extensions of a given partial order. The most efficient schemes stem from Markov chains whose mixing time bounds are polynomial, yet impractically large. We show that, on instances one encounters in practice, the actual mixing times can be much smaller than the worst-case bounds, and particularly so for a novel Markov chain we put forward. We circumvent the inherent hardness of estimating standard mixing times by introducing a refined notion, which admits estimation for moderate-size partial orders. Our empirical results suggest that the Markov chain approach to sample linear extensions can be made to scale well in practice, provided that the actual mixing times can be realized by instance-sensitive upper bounds or termination rules. Examples of the latter include existing perfect simulation algorithms, whose running times in our experiments follow the actual mixing times of certain chains, albeit with significant overhead.

A Scalable Scheme for Counting Linear Extensions

Talvitie

Kangas

et al. 2018

Counting the linear extensions of a given partial order not only has several applications in artificial intelligence but also represents a hard problem that challenges modern paradigms for approximate counting. Recently, Talvitie et al. (AAAI 2018) showed that an exponential time scheme beats the fastest known polynomial time schemes in practice, even if allowing hours of running time. Here, we present a novel scheme, relaxation Tootsie Pop, which in our experiments exhibits polynomial characteristics and significantly outperforms previous schemes. We also instantiate state-of-the-art model counters for CNF formulas; two natural encodings yield schemes that, however, are inferior to the more specialized schemes.

Counting Linear Extensions in Practice: MCMC Versus Exponential Monte Carlo

Talvitie

Kangas

et al. 2018

AAAI

Counting the linear extensions of a given partial order is a #P-complete problem that arises in numerous applications. For polynomial-time approximation, several Markov chain Monte Carlo schemes have been proposed; however, little is known of their efficiency in practice. This work presents an empirical evaluation of the state-of-the-art schemes and investigates a number of ideas to enhance their performance. In addition, we introduce a novel approximation scheme, adaptive relaxation Monte Carlo (ARMC), that leverages exact exponential-time counting algorithms. We show that approximate counting is feasible up to a few hundred elements on various classes of partial orders, and within this range ARMC typically outperforms the other schemes.