Abstract. Sorting permutations by transpositions is an important problem in genome rearrangements. A transposition is a rearrangement operation in which a segment is cut out of the permutation and pasted in a different location. The complexity of this problem is still open, and improving on the best known 1.5-approximation algorithm has been an open problem for ten years. In this paper we provide a 1.375-approximation algorithm for sorting by transpositions. The algorithm is based on a new upper bound on the diameter of 3-permutations. In addition, we present some new results regarding the transposition diameter: we improve the lower bound for the transposition diameter of the symmetric group, and determine the exact transposition diameter of 2-permutations and simple permutations.
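To make the operation concrete, the following is a minimal Python sketch (ours, not the paper's) of a transposition as defined above: a segment is cut out of the permutation and pasted in a different location. The function name and index convention are assumptions for illustration.

    def apply_transposition(pi, i, j, k):
        # Cut the segment pi[i:j] and paste it so that it now precedes
        # the element that was at position k (requires i < j <= k).
        assert 0 <= i < j <= k <= len(pi)
        return pi[:i] + pi[j:k] + pi[i:j] + pi[k:]

    # One transposition sorts [1, 3, 4, 2, 5]: move the segment [3, 4]
    # past the element 2.
    print(apply_transposition([1, 3, 4, 2, 5], 1, 3, 4))  # [1, 2, 3, 4, 5]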
One of the most promising ways to determine the evolutionary distance between two organisms is to compare the order of appearance of orthologous genes in their genomes. The resulting genome rearrangement problem calls for finding a shortest sequence of rearrangement operations that sorts one genome into the other. In this paper we provide a 1.5-approximation algorithm for the problem of sorting by transpositions and transreversals, improving on the 1.75 ratio that has stood for five years. Our algorithm is also faster than current approaches and requires O(n^{3/2} √(log n)) time for n genes.
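For comparison with the sketch above, a transreversal is commonly defined as a transposition in which the moved segment is also reversed; the following illustration assumes that definition and unsigned permutations, which may differ in details from the paper's formal model.

    def apply_transreversal(pi, i, j, k):
        # Cut the segment pi[i:j], reverse it, and paste it before
        # the element that was at position k.
        assert 0 <= i < j <= k <= len(pi)
        return pi[:i] + pi[j:k] + pi[i:j][::-1] + pi[k:]

    # One transreversal sorts [1, 4, 3, 2, 5]: move [4, 3] past the 2,
    # reversing it to [3, 4] along the way.
    print(apply_transreversal([1, 4, 3, 2, 5], 1, 3, 4))  # [1, 2, 3, 4, 5]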
Abstract. Weak designs were defined in R. Raz, O. Reingold, and S. Vadhan [Extracting all the randomness and reducing the error in Trevisan's extractors, Proc 31st ACM Symp Theory of Computing, Atlanta, GA, May 1999; to appear in J Comput System Sci, Special Issue on STOC 99] and are used there in constructions of extractors. Roughly speaking, a weak design is a collection of subsets satisfying certain near-disjointness properties. Raz et al. give constructions of weak designs with certain parameters. These constructions are explicit in the sense that they require time and space polynomial in the number of subsets; however, they require that much time and space even when needed to output only one specific subset of the collection, and hence are not explicit in a stronger sense. In this work we provide constructions of weak designs (with parameters similar to those of Raz et al.) that can be carried out in space logarithmic in the number of subsets. Moreover, our constructions are explicit in a stronger sense: given an index to a subset, we output the specified subset in time and space polynomial in the size of the index. Using our constructions, we obtain extractors similar in parameters to some of those given in Raz et al., and that can be evaluated in logarithmic space. Our main construction is algebraic. In order to prove the properties of weak designs, we prove some algebro-combinatorial lemmas that may be interesting in their own right. These lemmas concern the number of roots of polynomials over finite fields. In particular, we prove that the fraction of polynomials (over any finite field) with k roots vanishes exponentially in k. In other words, the number of roots of a random polynomial is not only bounded by its degree (a well-known fact) but is concentrated exponentially around its expectation (which is 1). Our lemmas are proved by algebro-combinatorial arguments; the main lemma is also proved by a probabilistic argument.
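The exponential concentration claimed above is easy to check empirically on a small field. The sketch below (an illustration, not the paper's proof technique) enumerates all nonzero polynomials of degree at most d over GF(p) and tabulates the fraction with exactly k roots; the values of p and d are arbitrary choices.

    from itertools import product
    from collections import Counter

    p, d = 5, 4                      # GF(5), degree at most 4
    counts = Counter()
    for coeffs in product(range(p), repeat=d + 1):
        if all(c == 0 for c in coeffs):
            continue                 # skip the zero polynomial
        roots = sum(1 for x in range(p)
                    if sum(c * pow(x, e, p) for e, c in enumerate(coeffs)) % p == 0)
        counts[roots] += 1

    total = sum(counts.values())
    for k in sorted(counts):
        print(f"k = {k}: fraction {counts[k] / total:.4f}")

Running it shows the fraction dropping off sharply as k grows, with the average number of roots close to 1, matching the statement above.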
Background: Automated machine-learning systems are able to de-identify electronic medical records, including free-text clinical notes. Use of such systems would greatly boost the amount of data available to researchers, yet their deployment has been limited due to uncertainty about their performance when applied to new datasets. Objective: We present practical options for clinical note de-identification, assessing the performance of machine learning systems ranging from off-the-shelf to fully customized. Methods: We implement a state-of-the-art machine learning de-identification system, training and testing on pairs of datasets that match the deployment scenarios. We use clinical notes from two i2b2 competition corpora, the PhysioNet Gold Standard corpus, and parts of the MIMIC-III dataset. Results: Fully customized systems remove 97-99% of personally identifying information. Performance of off-the-shelf systems varies by dataset, though it is mostly above 90%. Providing a small labeled dataset or a large unlabeled dataset allows for fine-tuning that improves performance over off-the-shelf systems. Conclusion: Health organizations should be aware of the levels of customization available when selecting a de-identification deployment solution, in order to choose the one that best matches their resources and target performance level.
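As a minimal sketch of how a headline number like "removes 97-99% of personally identifying information" can be computed, the following assumes token-level gold PHI annotations and a set of tokens the system removed; the data and the helper name phi_recall are hypothetical, not part of the systems studied.

    def phi_recall(gold_phi_tokens, removed_tokens):
        # Fraction of gold PHI tokens that the system actually removed.
        gold = set(gold_phi_tokens)
        return len(gold & set(removed_tokens)) / len(gold)

    gold = {"John", "Smith", "02/14/1961", "Boston"}
    removed = {"John", "Smith", "02/14/1961"}
    print(f"{phi_recall(gold, removed):.0%}")  # 75%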