Bioconductor: open software development for computational biology and bioinformatics
The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.
Non-biological experimental variation or "batch effects" are commonly observed across multiple batches of microarray experiments, often rendering the task of combining data from these batches difficult. The ability to combine microarray data sets is advantageous to researchers to increase statistical power to detect biological phenomena from studies where logistical considerations restrict sample size or in studies that require the sequential hybridization of arrays. In general, it is inappropriate to combine data sets without adjusting for batch effects. Methods have been proposed to filter batch effects from data, but these are often complicated and require large batch sizes (> 25) to implement. Because the majority of microarray studies are conducted using much smaller sample sizes, existing methods are not sufficient. We propose parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that are robust to outliers in small sample sizes and perform comparably to existing methods for large samples. We illustrate our methods using two example data sets and show that our methods are justifiable, easy to apply, and useful in practice. Software for our method is freely available at: http://biosun1.harvard.edu/complab/batch/.
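To make concrete what "adjusting for batch effects" can mean, the sketch below applies a plain per-batch mean shift to the values of a single gene. This is a deliberately simpler technique than the empirical Bayes frameworks described in the abstract, which are not reproduced here: it corrects location only, pools no information across genes, and would also erase real biological differences that happen to coincide with batch. The function name `center_by_batch` and the toy numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def center_by_batch(values, batches):
    """Shift each batch so its mean equals the grand mean (location only).

    Assumption: `values` holds one gene's expression across samples and
    `batches` holds the batch label of each sample, in the same order.
    This is NOT the empirical Bayes adjustment from the abstract.
    """
    if len(values) != len(batches) or not values:
        raise ValueError("values and batches must be non-empty and equally long")

    grand_mean = mean(values)

    # Collect the values belonging to each batch.
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_means = {batch: mean(vals) for batch, vals in by_batch.items()}

    # Remove each batch's own mean, then restore the grand mean.
    return [v - batch_means[b] + grand_mean for v, b in zip(values, batches)]


# Hypothetical toy data: batch "b" reads about 10 units higher than batch "a".
toy_values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
toy_batches = ["a", "a", "a", "b", "b", "b"]
toy_adjusted = center_by_batch(toy_values, toy_batches)
```

After the shift both batches share the same mean, while the spread within each batch is left untouched.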
We have updated and extended our semi-analytic galaxy formation modelling capabilities and applied them simultaneously to the stored halo/subhalo merger trees of the Millennium and Millennium-II Simulations (MS and MS-II, respectively). These differ by a factor of 125 in mass resolution, allowing explicit testing of resolution effects on predicted galaxy properties. We have revised the treatment of the transition between the rapid infall and cooling flow regimes of gas accretion, of the sizes of bulges, and of gaseous and stellar discs, of supernova feedback, of the transition between central and satellite status as galaxies fall into larger systems, and of gas and star stripping, once they become satellites. Plausible values of efficiency and scaling parameters yield an excellent fit not only to the observed abundance of low-redshift galaxies over five orders of magnitude in stellar mass and 9 mag in luminosity, but also to the observed abundance of Milky Way satellites. This suggests that reionization effects may not be needed to solve the 'missing-satellite' problem, except, perhaps, for the faintest objects. The same model matches the observed large-scale clustering of galaxies as a function of stellar mass and colour. The fit remains excellent down to ∼30 kpc for massive galaxies. For M* < 6 × 10¹⁰ M⊙, however, the model overpredicts clustering at scales below ∼1 Mpc, suggesting that the assumed fluctuation amplitude, σ₈ = 0.9, is too high. The observed difference in clustering between active and passive galaxies is matched quite well for all masses. Galaxy distributions within rich clusters agree between the simulations and match those observed, but only if galaxies without dark matter subhaloes (so-called orphans) are included. Even at MS-II resolution, schemes which assign galaxies only to resolved dark matter subhaloes cannot match observed clusters.
Our model predicts a larger passive fraction among low-mass galaxies than is observed, as well as an overabundance of ∼10¹⁰ M⊙ galaxies beyond z ∼ 0.6. (The abundance of ∼10¹¹ M⊙ galaxies is matched out to z ∼ 3.) These discrepancies appear to reflect deficiencies in the way star formation rates are modelled.
We use a modified version of the halo-based group finder developed by Yang et al. to select galaxy groups from the Sloan Digital Sky Survey (SDSS DR4). In the first step, a combination of two methods is used to identify the centers of potential groups and to estimate their characteristic luminosity. Using an iterative approach, the adaptive group finder then uses the average mass-to-light ratios of groups, obtained from the previous iteration, to assign a tentative mass to each group. This mass is then used to estimate the size and velocity dispersion of the underlying halo that hosts the group, which in turn is used to determine group membership in redshift space. Finally, each individual group is assigned two different halo masses: one based on its characteristic luminosity, and the other based on its characteristic stellar mass. Applying the group finder to the SDSS DR4, we obtain 301237 groups in a broad dynamic range, including systems of isolated galaxies. We use detailed mock galaxy catalogues constructed for the SDSS DR4 to test the performance of our group finder in terms of completeness of true members, contamination by interlopers, and accuracy of the assigned masses. This paper is the first in a series and focuses on the selection procedure, tests of the reliability of the group finder, and the basic properties of the group catalogue (e.g. the mass-to-light ratios, the halo mass to stellar mass ratios, etc.). The group catalogues including the membership of the groups are available at these links¹. Subject headings: dark matter - large-scale structure of the universe - galaxies: halos - methods: statistical
¹ Shanghai Astronomical Observatory, the Partner Group of MPA,
We have generated a molecular taxonomy of lung carcinoma, the leading cause of cancer death in the United States and worldwide. Using oligonucleotide microarrays, we analyzed mRNA expression levels corresponding to 12,600 transcript sequences in 186 lung tumor samples, including 139 adenocarcinomas resected from the lung. Hierarchical and probabilistic clustering of expression data defined distinct subclasses of lung adenocarcinoma. Among these were tumors with high relative expression of neuroendocrine genes and of type II pneumocyte genes, respectively. Retrospective analysis revealed a less favorable outcome for the adenocarcinomas with neuroendocrine gene expression. The diagnostic potential of expression profiling is emphasized by its ability to discriminate primary lung adenocarcinomas from metastases of extra-pulmonary origin. These results suggest that integration of expression profile data with clinical parameters could aid in diagnosis of lung cancer patients.
Recent advances in cDNA and oligonucleotide DNA arrays have made it possible to measure the abundance of mRNA transcripts for many genes simultaneously. The analysis of such experiments is nontrivial because of large data size and many levels of variation introduced at different stages of the experiments. The analysis is further complicated by the large differences that may exist among different probes used to interrogate the same gene. However, an attractive feature of high-density oligonucleotide arrays such as those produced by photolithography and inkjet technology is the standardization of chip manufacturing and hybridization process. As a result, probe-specific biases, although significant, are highly reproducible and predictable, and their adverse effect can be reduced by proper modeling and analysis methods. Here, we propose a statistical model for the probe-level data, and develop model-based estimates for gene expression indexes. We also present model-based methods for identifying and handling crosshybridizing probes and contaminating array regions. Applications of these results will be presented elsewhere.
Oligonucleotide expression array technology (1) has recently been adopted in many areas of biomedical research. As reviewed in ref. 2, 14 to 20 probe pairs are used to interrogate each gene, each probe pair has a Perfect Match (PM) and Mismatch (MM) signal, and the average of the PM-MM differences for all probe pairs in a probe set (called "average difference") is used as an expression index for the target gene. Researchers rely on the average differences as the starting point for "high-level analysis" such as SOM analysis (3) or two way clustering (4).
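The "average difference" index described above is simple enough to write down directly. The following is a minimal sketch of that definition only, not of the model-based expression index the authors propose; the function name `average_difference` and the intensity values are hypothetical.

```python
from statistics import mean


def average_difference(pm, mm):
    """Average of PM - MM over the probe pairs of one probe set.

    Assumption: `pm` and `mm` are equally long sequences holding the
    Perfect Match and Mismatch intensities of the same probe pairs,
    in the same order, for one gene on one array.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and equally long")
    return mean(p - m for p, m in zip(pm, mm))


# Hypothetical intensities for a probe set with four probe pairs
# (real probe sets use 14 to 20 pairs, per the text).
pm_example = [1200.0, 950.0, 1800.0, 400.0]
mm_example = [300.0, 450.0, 600.0, 350.0]
avg_diff_example = average_difference(pm_example, mm_example)
```

Because every probe pair contributes with equal weight, a single unusually bright or dim probe moves the index directly, which is the probe-specific variability the text goes on to discuss.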
Besides the original publications by Affymetrix scientists (1, 5), there have been very few studies on important "low-level" analysis issues such as feature extraction, normalization, and computation of expression indexes (6). One of the most critical issues is the way probe-specific effects are handled. We have found that even after making use of the control information provided by the MM intensity, the information on expression level provided by the different probes for the same gene is still highly variable. We use a set of 21 HuGeneFL arrays to illustrate our discussion. This data set is typical, in terms of quality and sample size, of a data set from a single-laboratory experiment. We have applied the methodology to many sets of arrays from different laboratories and obtained similar results. Each of these 21 arrays contains more than 250,000 features and 7,129 probe sets. Figs. 1 and 2 show data for one probe set in the first six arrays. This probe set (no. 6,457) will be called probe set A hereafter. There are considerable differences in the expression levels of this gene in the samples being interrogated, as the between-array variation in PM-MM differences is substantial. More noteworthy is the dramatic variation among the PM-MM differences of the 20 probes that interrogate the transcript level. ANOVA of the PM-MM differences of this pro...
The newly identified 2019 novel coronavirus (2019-nCoV) has caused more than 11,900 laboratory-confirmed human infections, including 259 deaths, posing a serious threat to human health. Currently, however, there is no specific antiviral treatment or vaccine. Considering the relatively high identity of the receptor-binding domain (RBD) in 2019-nCoV and SARS-CoV, it is urgent to assess the cross-reactivity of anti-SARS-CoV antibodies with 2019-nCoV spike protein, which could have important implications for rapid development of vaccines and therapeutic antibodies against 2019-nCoV. Here, we report for the first time that a SARS-CoV-specific human monoclonal antibody, CR3022, could bind potently with 2019-nCoV RBD (KD of 6.3 nM). The epitope of CR3022 does not overlap with the ACE2 binding site within 2019-nCoV RBD. These results suggest that CR3022 may have the potential to be developed as a candidate therapeutic, alone or in combination with other neutralizing antibodies, for the prevention and treatment of 2019-nCoV infections. Interestingly, some of the most potent SARS-CoV-specific neutralizing antibodies (e.g. m396, CR3014) that target the ACE2 binding site of SARS-CoV failed to bind 2019-nCoV spike protein, implying that the difference in the RBD of SARS-CoV and 2019-nCoV has a critical impact on the cross-reactivity of neutralizing antibodies, and that it is still necessary to develop novel monoclonal antibodies that could bind specifically to 2019-nCoV RBD.
For any assumed standard stellar initial mass function, the Sloan Digital Sky Survey (SDSS) gives a precise determination of the abundance of galaxies as a function of their stellar mass over the full stellar mass range 10⁸ M⊙ < M* < 10¹² M⊙. Within the concordance Λ cold dark matter (ΛCDM) cosmology, the Millennium Simulations give precise halo abundances as a function of mass and redshift for all haloes within which galaxies can form. Under the plausible hypothesis that the stellar mass of a galaxy is an increasing function of the maximum mass ever attained by its halo, these results combine to give halo mass as a function of stellar mass. The result agrees quite well with observational estimates of mean halo mass as a function of stellar mass from stacking analyses of the gravitational lensing signal and the satellite dynamics of SDSS galaxies. For M* ∼ 5.5 × 10¹⁰ M⊙, the stellar mass usually assumed for the Milky Way (MW), the implied halo mass is ∼2 × 10¹² M⊙, consistent with most recent direct estimates and inferences from the MW/M31 timing argument. The fraction of the baryons associated with each halo which are present as stars in its central galaxy reaches a maximum of 20 per cent at masses somewhat below that of the MW and falls rapidly at both higher and lower masses. These conversion efficiencies are lower than in almost all recent high-resolution simulations of galaxy formation, showing that these are not yet viable models for the formation of typical members of the galaxy population. When inserted in the Millennium-II Simulation, our derived relation between stellar mass and halo mass predicts a stellar mass autocorrelation function in excellent agreement with that measured directly in the SDSS. The implied Tully–Fisher relation also appears consistent with observation, suggesting that galaxy luminosity functions and Tully–Fisher relations can be reproduced simultaneously in a ΛCDM cosmology.
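The monotonicity hypothesis above implies a rank-order matching: the galaxy with the k-th largest stellar mass is assigned to the halo with the k-th largest mass. The sketch below illustrates only that idea. It assumes both lists are complete samples drawn from the same comoving volume, that the halo masses supplied are already the maximum mass ever attained, and that there is no scatter in the relation. The function name `abundance_match` and the toy values (arbitrary units) are hypothetical and are not taken from the paper.

```python
def abundance_match(stellar_masses, halo_masses):
    """Pair galaxies and haloes by rank, most massive first.

    Assumptions (not from the source): both inputs sample the same
    volume, halo masses are the maximum mass ever attained, and the
    stellar mass is a strictly increasing function of halo mass.
    Returns a list of (stellar_mass, halo_mass) pairs.
    """
    stars = sorted(stellar_masses, reverse=True)
    haloes = sorted(halo_masses, reverse=True)
    # zip stops at the shorter list, so surplus low-mass objects are dropped.
    return list(zip(stars, haloes))


# Hypothetical toy values in arbitrary units.
toy_stars = [1.0, 4.0, 2.0]
toy_haloes = [30.0, 10.0, 20.0, 5.0]
toy_matched = abundance_match(toy_stars, toy_haloes)
```

With real data the same matching is usually done on cumulative number densities rather than on raw lists, so that surveys and simulations of different volumes can be compared.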