ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) is a high-throughput technique to identify genomic regions that are bound in vivo by a particular protein, e.g., a transcription factor (TF). Biological factors, such as chromatin state, indirect and cooperative binding, as well as experimental factors, such as antibody quality, cross-linking, and PCR biases, are known to affect the outcome of ChIP-seq experiments. However, the relative impact of these factors on inferences made from ChIP-seq data is not entirely clear. Here, via a detailed ChIP-seq simulation pipeline, ChIPulate, we assess the impact of various biological and experimental sources of variation on several outcomes of a ChIP-seq experiment, viz., the recoverability of the TF binding motif, accuracy of TF-DNA binding detection, the sensitivity of inferred TF-DNA binding strength, and number of replicates needed to confidently infer binding strength. We find that the TF motif can be recovered despite poor and non-uniform extraction and PCR amplification efficiencies. The recovery of the motif is, however, affected to a larger extent by the fraction of sites that are either cooperatively or indirectly bound. Importantly, our simulations reveal that the number of ChIP-seq replicates needed to accurately measure in vivo occupancy at high-affinity sites is larger than the recommended community standards. Our results establish statistical limits on the accuracy of inferences of protein-DNA binding from ChIP-seq and suggest that increasing the mean extraction efficiency, rather than amplification efficiency, would better improve sensitivity. The source code and instructions for running ChIPulate can be found at https://github.com/vishakad/chipulate. The author list is alphabetical.[1].
Upon mapping of the DNA fragments bound by the TF to the reference genome, the genomic loci bound by the TF are identified as high density mapped regions or peaks, where each peak is associated with an intensity based on the number of sequenced fragments arising from it. The intensity reflects the in vivo occupancy of the TF at that locus.
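As a rough illustration of how a peak intensity can be derived from mapped fragments, the sketch below counts the fragments that fall inside each locus. It is a toy example and not the procedure used by any particular peak caller or by ChIPulate: the function name `peak_intensities`, the locus names, and the coordinates are all hypothetical, and each fragment is assumed to be summarized by a single mapped position.

```python
from collections import Counter


def peak_intensities(fragment_positions, loci):
    """Count the mapped fragments that fall inside each locus.

    fragment_positions: iterable of integer genomic coordinates (assumed
        to be one representative position per sequenced fragment).
    loci: dict mapping a locus name to a half-open interval (start, end).
    Returns a dict mapping each locus name to its fragment count.
    """
    counts = Counter()
    for pos in fragment_positions:
        for name, (start, end) in loci.items():
            if start <= pos < end:
                counts[name] += 1
                break  # assume loci do not overlap
    # Report every locus, including those with zero fragments.
    return {name: counts[name] for name in loci}


# Hypothetical loci and fragment positions, for illustration only.
loci = {"locusA": (100, 200), "locusB": (500, 600)}
fragments = [110, 150, 199, 505, 700]
print(peak_intensities(fragments, loci))
```

In this toy input, three fragments fall in `locusA`, one in `locusB`, and the fragment at position 700 is outside both loci and is not counted.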
Several studies of ChIP-seq data have focussed on the biological factors distinguishing the loci bound by the TF. It has been shown that in addition to the affinities of binding sites present at a locus, nucleosome positioning is a strong determinant of TF binding in vivo [2, 3, 4, 5].
Other studies have shown that the concentration of the target TF [6, 7], short-range cooperative interactions between the target TF and other TFs [8], and variation in chromatin accessibility [5, 7] explain the variation in intensities across peaks. Some of the variation can arise due to indirect binding, where the target TF binds DNA indirectly via a second 10, 11]. The intensity of such peaks is then no longer directly dependent on the affinity of the target TF for the sequence at the bound locus.
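To make the dependence on TF concentration and affinity concrete, the sketch below uses a simple two-state equilibrium occupancy model. This logistic form is a common modelling assumption and is not stated in the text above; the function names, the energy values, and the use of a chemical potential as a proxy for TF concentration are all illustrative choices.

```python
import math


def occupancy(binding_energy, chemical_potential):
    """Equilibrium occupancy of one site under an assumed two-state model.

    Both arguments are in units of kT. A lower binding energy means a
    higher affinity; a higher chemical potential stands in for a higher
    TF concentration.
    """
    return 1.0 / (1.0 + math.exp(binding_energy - chemical_potential))


def peak_occupancy(target_energy, partner_energy, chemical_potential, indirect):
    """Occupancy that sets the peak intensity at a locus.

    For an indirectly bound locus, the occupancy is assumed to follow the
    partner protein's site, so it no longer depends on the target TF's
    affinity for the local sequence.
    """
    energy = partner_energy if indirect else target_energy
    return occupancy(energy, chemical_potential)


# Hypothetical energies (in kT), for illustration only.
for mu in (0.0, 2.0, 4.0):
    direct = peak_occupancy(3.0, 1.0, mu, indirect=False)
    via_partner = peak_occupancy(3.0, 1.0, mu, indirect=True)
    print(f"mu={mu}: direct={direct:.3f}, indirect={via_partner:.3f}")
```

Under these assumptions, occupancy rises with the chemical potential at every site, while an indirectly bound locus gives the same value whatever the target TF's own binding energy is.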
Since the distribution of ChIP-seq peaks and the...