Age-dependent topic modeling of comorbidities in UK Biobank identifies disease subtypes with differential genetic risk
The analysis of longitudinal data from electronic health records (EHRs) has the potential to improve clinical diagnoses and enable personalized medicine, motivating efforts to identify disease subtypes from patient comorbidity information.
Here we introduce an age-dependent topic modeling (ATM) method that provides a low-rank representation of longitudinal records of hundreds of distinct diseases in large EHR datasets. We applied ATM to 282,957 UK Biobank samples, identifying 52 diseases with heterogeneous comorbidity profiles; analyses of 211,908 All of Us samples produced concordant results.
We defined subtypes of the 52 heterogeneous diseases based on their comorbidity profiles and compared genetic risk across disease subtypes using polygenic risk scores (PRSs), identifying 18 disease subtypes whose PRS differed significantly from other subtypes of the same disease. We further identified specific genetic variants with subtype-dependent effects on disease risk. In conclusion, ATM identifies disease subtypes with differential genome-wide and locus-specific genetic risk profiles. Reference
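As a rough illustration of the PRS comparison described above: a polygenic risk score is conventionally a weighted sum of effect-allele dosages. The sketch below assumes per-variant weights from an external GWAS. The function names, genotypes, weights and subtype labels are all hypothetical and are not taken from the study.

```python
from statistics import mean


def polygenic_risk_score(dosages, weights):
    """Weighted sum of effect-allele dosages (0, 1 or 2 copies per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def mean_prs_by_subtype(samples, weights):
    """Average the PRS within each subtype label.

    samples: list of (subtype_label, dosages) pairs.
    """
    scores = {}
    for label, dosages in samples:
        scores.setdefault(label, []).append(polygenic_risk_score(dosages, weights))
    return {label: mean(values) for label, values in scores.items()}


# Hypothetical toy data: three variants, two subtypes of one disease.
weights = [0.10, -0.20, 0.05]
samples = [
    ("subtype_A", [0, 1, 2]),
    ("subtype_A", [1, 0, 2]),
    ("subtype_B", [2, 2, 0]),
]
means = mean_prs_by_subtype(samples, weights)
```

A real comparison would test whether the per-subtype means differ significantly, typically with a regression that adjusts for covariates; this sketch only computes the means.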
Rare variant associations with plasma protein levels in the UK Biobank
Integrating human genomics and proteomics can help elucidate disease mechanisms, identify clinical biomarkers and discover drug targets. Because previous proteogenomic studies have focused on common variation via genome-wide association studies, the contribution of rare variants to the plasma proteome remains largely unknown.
Here we identify associations between rare protein-coding variants and 2,923 plasma protein abundances measured in 49,736 UK Biobank individuals. Our variant-level exome-wide association study identified 5,433 rare genotype–protein associations, of which 81% were undetected in a previous genome-wide association study of the same cohort.
We then looked at aggregate signals using gene-level collapsing analysis, which revealed 1,962 gene–protein associations. Of the 691 gene-level signals from protein-truncating variants, 99.4% were associated with decreased protein levels. STAB1 and STAB2, encoding scavenger receptors involved in plasma protein clearance, emerged as pleiotropic loci, with 77 and 41 protein associations, respectively. We demonstrate the utility of our publicly accessible resource through several applications. Reference
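The idea behind gene-level collapsing can be sketched as follows: all qualifying rare variants in a gene are reduced to one carrier indicator per sample, and protein levels are then compared between carriers and non-carriers. This is a minimal sketch under assumptions; the function names and toy data are hypothetical, and a real analysis would use a regression model with covariates rather than a difference in means.

```python
from statistics import mean


def collapse_carriers(genotypes):
    """Map each sample to True if it carries any qualifying variant in the gene.

    genotypes: dict of sample id -> list of alternate-allele counts,
    one entry per qualifying variant.
    """
    return {sample: any(g > 0 for g in counts) for sample, counts in genotypes.items()}


def carrier_effect(carriers, protein_level):
    """Difference in mean protein level, carriers minus non-carriers."""
    carrier_levels = [protein_level[s] for s, is_carrier in carriers.items() if is_carrier]
    other_levels = [protein_level[s] for s, is_carrier in carriers.items() if not is_carrier]
    return mean(carrier_levels) - mean(other_levels)


# Hypothetical toy data: four samples, three qualifying variants in one gene.
genotypes = {"s1": [0, 0, 1], "s2": [0, 0, 0], "s3": [1, 0, 0], "s4": [0, 0, 0]}
protein_level = {"s1": -1.2, "s2": 0.1, "s3": -0.8, "s4": 0.3}

carriers = collapse_carriers(genotypes)
effect = carrier_effect(carriers, protein_level)
```

In this toy example the effect is negative, which is the direction the abstract reports for nearly all protein-truncating signals.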
Molecular classification of hormone receptor-positive HER2-negative breast cancer
Hormone receptor-positive (HR+)/human epidermal growth factor receptor 2-negative (HER2−) breast cancer is the most prevalent type of breast cancer, in which endocrine therapy resistance and distant relapse remain unmet challenges.
Accurate molecular classification is urgently required for guiding precision treatment. We established a large-scale multi-omics cohort of 579 patients with HR+/HER2− breast cancer and identified the following four molecular subtypes: canonical luminal, immunogenic, proliferative and receptor tyrosine kinase (RTK)-driven. Tumors of these four subtypes showed distinct biological and clinical features, suggesting subtype-specific therapeutic strategies. The RTK-driven subtype was characterized by the activation of the RTK pathways and associated with poor outcomes.
The immunogenic subtype had enriched immune cells and could benefit from immune checkpoint therapy. In addition, we developed convolutional neural network models to discriminate these subtypes based on digital pathology for potential clinical translation. The molecular classification provides insights into molecular heterogeneity and highlights the potential for precision treatment of HR+/HER2− breast cancer. Reference
Epigenomic dissection of Alzheimer’s disease pinpoints causal variants and reveals epigenome erosion
Recent work has identified dozens of non-coding loci for Alzheimer’s disease (AD) risk, but their mechanisms and AD transcriptional regulatory circuitry are poorly understood.
Here, we profile epigenomic and transcriptomic landscapes of 850,000 nuclei from prefrontal cortexes of 92 individuals with and without AD to build a map of the brain regulome, including epigenomic profiles, transcriptional regulators, co-accessibility modules, and peak-to-gene links in a cell-type-specific manner. We develop methods for multimodal integration and detecting regulatory modules using peak-to-gene linking.
We show AD risk loci are enriched in microglial enhancers and for specific TFs including SPI1, ELF2, and RUNX1. We detect 9,628 cell-type-specific ATAC-QTL loci, which we integrate alongside peak-to-gene links to prioritize AD variant regulatory circuits. We report differential accessibility of regulatory modules in late AD in glia and in early AD in neurons. Strikingly, late-stage AD brains show global epigenome dysregulation indicative of epigenome erosion and cell identity loss. Reference
Genome-wide enhancer-gene regulatory maps
Genome-wide association studies have identified numerous variants associated with human complex traits, most of which reside in the non-coding regions, but biological mechanisms remain unclear.
Assigning function to these non-coding elements remains challenging. Here we apply the Activity-by-Contact (ABC) model to evaluate enhancer-gene regulatory effects by integrating multi-omics data, and identify 544,849 connections across 20 cancer types. The ABC model outperforms previous approaches in linking regulatory variants to target genes. Furthermore, we identify over 30,000 enhancer-gene connections in colorectal cancer (CRC) tissues.
By integrating large-scale population cohorts (23,813 cases and 29,973 controls) and multipronged functional assays, we demonstrate that an ABC regulatory variant, rs4810856, is associated with CRC risk (odds ratio = 1.11, 95% CI = 1.05–1.16, P = 4.02 × 10⁻⁵) and acts as an allele-specific enhancer to distally facilitate PREX1, CSE1L and STAU1 expression, which synergistically activate p-AKT signaling. Our study provides comprehensive regulation maps and illuminates a single variant regulating multiple genes, providing insights into cancer etiology. Reference
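For orientation, the ABC score as defined in the original ABC model publication is an element's activity times its contact frequency with the gene's promoter, divided by the sum of that product over all candidate elements near the gene. The sketch below assumes that definition; the abstract does not spell it out. The element names and numbers are hypothetical.

```python
def abc_scores(activity, contact):
    """Activity-by-Contact scores for all candidate elements of one gene.

    activity: dict of element id -> enhancer activity (e.g. from accessibility
              and histone-mark signal).
    contact:  dict of element id -> contact frequency with the gene promoter.
    Each score is the element's share of the total activity x contact.
    """
    products = {element: activity[element] * contact[element] for element in activity}
    total = sum(products.values())
    if total == 0:
        return {element: 0.0 for element in products}
    return {element: value / total for element, value in products.items()}


# Hypothetical toy data: three candidate enhancers for a single gene.
activity = {"e1": 10.0, "e2": 5.0, "e3": 1.0}
contact = {"e1": 0.2, "e2": 0.6, "e3": 1.0}
scores = abc_scores(activity, contact)
```

Because the scores are shares of a total, they sum to 1 for each gene, and a moderately active element in frequent contact (e2) can outrank a highly active element in rare contact (e1).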
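The reported odds ratio, confidence interval and P value for rs4810856 can be checked for internal consistency. The sketch below assumes a Wald test on the log odds ratio, with the standard error recovered from the width of the 95% CI; that is an assumption about how the statistics were derived, not something the abstract states.

```python
import math


def wald_from_odds_ratio(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover the z statistic and two-sided P value from an OR and its 95% CI."""
    # On the log scale the CI spans 2 * z_crit standard errors.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided tail probability of a standard normal.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


z, p = wald_from_odds_ratio(1.11, 1.05, 1.16)
```

With the rounded values from the abstract this gives a P value of the same order as the reported one; small differences are expected because the odds ratio and interval are rounded to two decimals.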
Accurate proteome-wide missense variant effect prediction with AlphaMissense
Genome sequencing has revealed extensive genetic variation in human populations. Missense variants are genetic variants that alter the amino acid sequence of proteins. Pathogenic missense variants disrupt protein function and reduce organismal fitness, while benign missense variants have limited effect.
Classifying these variants is an important ongoing challenge in human genetics. Of more than 4 million observed missense variants, only an estimated 2% have been clinically classified as pathogenic or benign, while the vast majority of them are of unknown clinical significance. This limits the diagnosis of rare diseases, as well as the development or application of clinical treatments that target the underlying genetic cause. Machine learning approaches could close the variant interpretation gap by exploiting patterns in biological data to predict the pathogenicity of unannotated variants.
We developed AlphaMissense to leverage advances on multiple fronts: (i) unsupervised protein language modeling to learn amino acid distributions conditioned on sequence context; (ii) incorporating structural context by using an AlphaFold-derived system; and (iii) fine-tuning on weak labels from population frequency data, thereby avoiding bias from human-curated annotations. AlphaMissense achieves state-of-the-art missense pathogenicity predictions in clinical annotation, de novo disease variants, and experimental assay benchmarks without explicitly training on such data. Reference
A robust deep learning workflow to predict CD8+ T-cell epitopes
T-cells play a crucial role in the adaptive immune system by triggering responses against cancer cells and pathogens, while maintaining tolerance against self-antigens, which has sparked interest in the development of various T-cell-focused immunotherapies.
However, the identification of antigens recognised by T-cells is low-throughput and laborious. To overcome some of these limitations, computational methods for predicting CD8+ T-cell epitopes have emerged. Despite recent developments, most immunogenicity algorithms struggle to learn features of peptide immunogenicity from small datasets, suffer from HLA bias and are unable to reliably predict pathology-specific CD8+ T-cell epitopes.
TRAP, the deep learning workflow presented here, was used to identify epitopes from glioblastoma patients as well as SARS-CoV-2 peptides, and it outperformed other algorithms in both cancer and pathogenic settings. TRAP was especially effective at extracting immunogenicity-associated properties from restricted data of emerging pathogens and translating them onto related species, as well as minimising the loss of likely epitopes in imbalanced datasets. Reference
An immune cell atlas reveals the dynamics of human macrophage specification during prenatal development
Macrophages are heterogeneous and play critical roles in development and disease, but their diversity, function, and specification remain inadequately understood during human development.
We generated a single-cell RNA sequencing map of the dynamics of human macrophage specification from postconception weeks (PCW) 4–26 across 19 tissues. Among 15 macrophage subtypes, we identified a microglia-like population and a proangiogenic population. Microglia-like cells, molecularly and morphologically similar to microglia in the CNS, are present in the fetal epidermis, testicle, and heart. They are the major immune population in the early epidermis, exhibit a polarized distribution along the dorsal-lateral-ventral axis, and interact with neural crest cells, modulating their differentiation along the melanocyte lineage.
Through spatial and differentiation trajectory analysis, we also showed that proangiogenic macrophages are perivascular across fetal organs and, like microglia, are likely yolk-sac-derived. Our study provides a comprehensive map of the heterogeneity and developmental dynamics of human macrophages and unravels their diverse functions during development. Reference
GWAS of random glucose in 476,326 individuals
Conventional measurements of fasting and postprandial blood glucose levels investigated in genome-wide association studies (GWAS) cannot capture the effects of DNA variability on ‘around the clock’ glucoregulatory processes.
Here we show that GWAS meta-analysis of glucose measurements under nonstandardized conditions (random glucose (RG)) in 476,326 individuals of diverse ancestries and without diabetes enables locus discovery and innovative pathophysiological observations.
We discovered 120 RG loci represented by 150 distinct signals, including 13 with sex-dimorphic effects, two cross-ancestry and seven rare frequency signals. Of these, 44 loci are new for glycemic traits. Regulatory, glycosylation and metagenomic annotations highlight ileum and colon tissues, indicating an underappreciated role of the gastrointestinal tract in controlling blood glucose. Reference
Spatial multimodal analysis of transcriptomes and metabolomes in tissues
We present a spatial omics approach that combines histology, mass spectrometry imaging and spatial transcriptomics to facilitate precise measurements of mRNA transcripts and low-molecular-weight metabolites across tissue regions.
The workflow is compatible with commercially available Visium glass slides. We demonstrate the potential of our method using mouse and human brain samples in the context of dopamine and Parkinson’s disease. Reference
The Oncology Biomarker Discovery framework reveals cetuximab and bevacizumab response patterns in metastatic colorectal cancer
Precision medicine has revolutionised cancer treatments; however, actionable biomarkers remain scarce. To address this, we develop the Oncology Biomarker Discovery (OncoBird) framework for analysing the molecular and biomarker landscape of randomised controlled clinical trials.
OncoBird identifies biomarkers based on single genes or mutually exclusive genetic alterations in isolation or in the context of tumour subtypes, and finally, assesses predictive components by their treatment interactions. Here, we utilise the open-label, randomised phase III trial (FIRE-3, AIO KRK-0306) in metastatic colorectal carcinoma patients, who received either cetuximab or bevacizumab in combination with 5-fluorouracil, folinic acid and irinotecan (FOLFIRI).
We systematically identify five biomarkers with predictive components, e.g., patients with tumours that carry chr20q amplifications or lack mutually exclusive ERK signalling mutations benefited from cetuximab compared to bevacizumab. In summary, OncoBird characterises the molecular landscape and outlines actionable biomarkers, which generalises to any molecularly characterised randomised controlled trial. Reference
A pan-cancer single-cell panorama of human natural killer cells
Natural killer (NK) cells play indispensable roles in innate immune responses against tumor progression. To depict their phenotypic and functional diversities in the tumor microenvironment, we perform integrative single-cell RNA sequencing analyses on NK cells from 716 patients with cancer, covering 24 cancer types.
We observed heterogeneity in NK cell composition in a tumor-type-specific manner. Notably, we have identified a group of tumor-associated NK cells that are enriched in tumors, show impaired anti-tumor functions, and are associated with unfavorable prognosis and resistance to immunotherapy.
Specific myeloid cell subpopulations, in particular LAMP3+ dendritic cells, appear to mediate the regulation of NK cell anti-tumor immunity. Our study provides insights into NK-cell-based cancer immunity and highlights potential clinical utilities of NK cell subsets as therapeutic targets. Reference