Abstract. The performance of similarity measures for search, indexing, and data mining applications tends to degrade rapidly as the dimensionality of the data increases. The effects of the so-called 'curse of dimensionality' have been studied by researchers for data sets generated according to a single data distribution. In this paper, we study the effects of this phenomenon on different similarity measures for multiply-distributed data. In particular, we assess the performance of shared-neighbor similarity measures, which are secondary similarity measures based on the rankings of data objects induced by some primary distance measure. We find that rank-based similarity measures can result in more stable performance than their associated primary distance measures.
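As a rough illustration of what a shared-neighbor (secondary) similarity looks like, the following minimal sketch ranks objects by a primary distance and then scores a pair by the overlap of their k-nearest-neighbor lists. The choice of Euclidean distance as the primary measure, the exclusion of each point from its own neighbor list, the normalization by k, and the function names `knn_indices` and `snn_similarity` are all assumptions made for this example; the abstract does not specify them.

```python
import numpy as np


def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row of X.

    Assumption: Euclidean distance is the primary measure, and a point
    is not counted as its own neighbor.
    """
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude self from the ranking
    return np.argsort(dist, axis=1)[:, :k]


def snn_similarity(X, k):
    """Pairwise shared-neighbor similarity: |NN_k(x) & NN_k(y)| / k."""
    n = X.shape[0]
    nn = knn_indices(X, k)
    # member[i, j] == 1 when j is among the k nearest neighbors of i.
    member = np.zeros((n, n), dtype=int)
    member[np.arange(n)[:, None], nn] = 1
    # Entry (i, j) of the product counts neighbors shared by i and j.
    overlap = member @ member.T
    return overlap / k


if __name__ == "__main__":
    # Two well-separated toy clusters of four points each.
    a = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    X = np.vstack([a, a + 10.0])
    S = snn_similarity(X, k=3)
    print(S[0, 1], S[0, 4])  # within-cluster vs. across-cluster score
```

Because the score depends only on neighbor rankings, any primary distance that induces the same rankings gives the same secondary similarity in this sketch.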
Interactive massively parallel computations are critical for machine learning and data analysis. These computations are a staple of the MIT Lincoln Laboratory Supercomputing Center (LLSC) and have required the LLSC to develop unique interactive supercomputing capabilities. Scaling interactive machine learning frameworks, such as TensorFlow, and data analysis environments, such as MATLAB/Octave, to tens of thousands of cores presents many technical challenges, in particular rapidly dispatching many tasks through a scheduler, such as Slurm, and starting many instances of applications with thousands of dependencies. Careful tuning of launches and prepositioning of applications overcome these challenges and allow the launching of thousands of tasks in seconds on a 40,000-core supercomputer. Specifically, this work demonstrates launching 32,000 TensorFlow processes in 4 seconds and launching 262,000 Octave processes in 40 seconds. These capabilities allow researchers to rapidly explore novel machine learning architectures and data analysis algorithms.
This paper is concerned with the estimation of continuous intrinsic dimension (ID), a measure of intrinsic dimensionality recently proposed by Houle. Continuous ID can be regarded as an extension of Karger and Ruhl’s expansion dimension to a statistical setting in which the distribution of distances to a query point is modeled in terms of a continuous random variable. This form of intrinsic dimensionality can be particularly useful in search, classification, outlier detection, and other contexts in machine learning, databases, and data mining, as it has been shown to be equivalent to a measure of the discriminative power of similarity functions. Several estimators of continuous ID are proposed and analyzed based on extreme value theory, using maximum likelihood estimation (MLE), the method of moments (MoM), probability weighted moments (PWM), and regularly varying functions (RV). An experimental evaluation is also provided, using both real and artificial data.
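To make the MLE variant concrete, here is a minimal sketch of a Hill-type maximum likelihood estimate of intrinsic dimension computed from the distances between a query point and its nearest neighbors. The abstract does not give the estimator's formula, so the form used here, minus the reciprocal of the mean log-ratio of each distance to the largest one, is an assumption based on the standard extreme-value derivation, and the name `lid_mle` is hypothetical.

```python
import numpy as np


def lid_mle(distances):
    """Hill-type MLE of intrinsic dimension from neighbor distances.

    Assumed form: ID = -1 / mean(ln(x_i / w)), where x_i are the
    positive distances from the query to its neighbors and w is the
    largest of them.
    """
    d = np.sort(np.asarray(distances, dtype=float))
    if d.size < 2:
        raise ValueError("need at least two neighbor distances")
    if d[0] <= 0:
        raise ValueError("distances must be strictly positive")
    w = d[-1]
    mean_log_ratio = np.mean(np.log(d / w))
    if mean_log_ratio == 0:
        raise ValueError("all distances are equal; estimate is undefined")
    return -1.0 / mean_log_ratio


if __name__ == "__main__":
    # Artificial check: for points uniform in a unit m-dimensional ball,
    # the distance to the center has CDF r**m, so r = U**(1/m).
    rng = np.random.default_rng(0)
    m = 5
    r = (1.0 - rng.random(20000)) ** (1.0 / m)
    print(lid_mle(r))  # should be close to m
```

In practice the input would be the k smallest distances from a query to the data set, and the estimate is local to that query; the choice of k trades bias against variance.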
The Amino acid-Polyamine-organoCation (APC) superfamily is the second largest superfamily of secondary carriers currently known. In the current study, we establish homology between previously recognized APC superfamily members and proteins of seven new families. These families include the PAAP (Putative Amino Acid Permease), LIVCS (Branched Chain Amino Acid:Cation Symporter), NRAMP (Natural Resistance-Associated Macrophage Protein), CstA (Carbon starvation A protein), KUP (K+ Uptake Permease), BenE (Benzoate:H+ Symporter) and AE (Anion Exchanger). The topology of the well-characterized human Anion Exchanger 1 (AE1) conforms to a UraA-like topology of 14 TMSs (12 α-helical TMSs and 2 mixed coil/helical TMSs). All functionally characterized members of the APC superfamily use cation symport for substrate accumulation except for members of the AE family which frequently use anion:anion exchange. We show how the different topologies fit into the framework of the common LeuT-like fold, defined earlier (Proteins. 2014 Feb;82(2):336–46), and determine that some of the new members contain previously undocumented topological variations. All new entries contain the two 5 or 7 TMS APC superfamily repeat units, sometimes with extra TMSs at the ends, the variations being greatest within the CstA family. New, functionally characterized members transport amino acids, peptides, and inorganic anions or cations. Except for anions, these are typical substrates of established APC superfamily members. Active site TMSs are rich in glycyl residues in variable but conserved constellations. This work expands the APC superfamily and our understanding of its topological variations.