Neuroimaging allows for the non-invasive study of the brain in rich detail. Data-driven discovery of patterns of population variability in the brain has the potential to be extremely valuable for early disease diagnosis and understanding the brain. The resulting patterns can be used as imaging-derived phenotypes (IDPs), and may complement existing expert-curated IDPs. However, population datasets, comprising many different structural and functional imaging modalities from thousands of subjects, provide a computational challenge not previously addressed. Here, for the first time, a multimodal independent component analysis approach is presented that is scalable for data fusion of voxel-level neuroimaging data in the full UK Biobank (UKB) dataset, which will soon reach 100,000 imaged subjects. This new computational approach can estimate modes of population variability that enhance the ability to predict thousands of phenotypic and behavioural variables using data from UKB and the Human Connectome Project. A high-dimensional decomposition achieved improved predictive power compared with widely-used analysis strategies, single-modality decompositions and existing IDPs. In UKB data (14,503 subjects with 47 different data modalities), many interpretable associations with non-imaging phenotypes were identified, including multimodal spatial maps related to fluid intelligence, handedness and disease, in some cases where IDP-based approaches failed.
Introduction
Large-scale multimodal brain imaging has enormous potential for boosting epidemiological and neuroscientific studies, generating markers for early disease diagnosis and prediction of disease progression, and the understanding of human cognition, by means of linking to clinical or behavioural variables. Recent major studies have been acquiring brain magnetic resonance imaging (MRI), genetics and demographic/behavioural data from large cohorts. Examples are the UK Biobank (UKB) 1, the Human Connectome Project (HCP) 2 and the Adolescent Brain Cognitive Development (ABCD) study 3. These studies involve multimodal data, meaning that several distinct types of MRI data are acquired, mapping activity, functional networks, structural connectivity, white matter microstructure, and organisation and volumes of different brain tissues and sub-structures 1. However, the multimodal, high-dimensional and noisy nature of such big datasets makes many existing analytical approaches for extracting interpretable information impractical 4.
Traditionally, large-scale neuroimaging studies first summarize the imaging data into interpretable image-derived phenotypes (IDPs) 1, 5, which are scalar quantities derived from raw imaging data (e.g., regional volumes from structural MRI, mean task activations from task MRI, resting-state functional connectivities between brain parcels). This knowledge-based approach is simple and efficient, and effectively reduces the high-dimensional data into interpretable, compact, convenient features. However, ...