Integrative multi-omics and drug response profiling of childhood acute lymphoblastic leukemia cell lines
Acute lymphoblastic leukemia (ALL) is the most common childhood cancer. Although standard-of-care chemotherapeutics are sufficient for most ALL cases, subsets of patients respond poorly and relapse. The biology underlying differences between subtypes and their response to therapy has only partially been explained by genetic and transcriptomic profiling.
Here, we perform comprehensive multi-omic analyses of 49 readily available childhood ALL cell lines, using proteomics, transcriptomics, and pharmacoproteomic characterization. We connect the molecular phenotypes with drug responses to 528 oncology drugs, identifying drug correlations as well as lineage-dependent correlations. We also identify the diacylglycerol-analog bryostatin-1 as a therapeutic candidate in the MEF2D-HNRNPUL1 fusion high-risk subtype, for which this drug activates pro-apoptotic ERK signaling associated with molecular mediators of pre-B cell negative selection. Reference
MUON: multimodal omics analysis framework
Advances in multi-omics have led to an explosion of multimodal datasets to address questions from basic biology to translation. While these data provide novel opportunities for discovery, they also pose management and analysis challenges, thus motivating the development of tailored computational solutions.
Here, we present a data standard and an analysis framework for multi-omics, MUON, designed to organise, analyse, visualise, and exchange multimodal data. MUON stores multimodal data in an efficient yet flexible and interoperable data structure. MUON enables a versatile range of analyses, from data preprocessing to flexible multi-omics alignment. Reference
ClusterMap for multi-scale clustering analysis of spatial gene expression
Quantifying RNAs in their spatial context is crucial to understanding gene expression and regulation in complex tissues. In situ transcriptomic methods generate spatially resolved RNA profiles in intact tissues. However, there is a lack of a unified computational framework for integrative analysis of in situ transcriptomic data.
Here, we introduce an unsupervised and annotation-free framework, termed ClusterMap, which incorporates the physical location and gene identity of RNAs, formulates the task as a point pattern analysis problem, and identifies biologically meaningful structures by density peak clustering (DPC). Specifically, ClusterMap precisely clusters RNAs into subcellular structures, cell bodies, and tissue regions in both two- and three-dimensional space, and performs consistently on diverse tissue types, including mouse brain, placenta, gut, and human cardiac organoids. We demonstrate ClusterMap to be broadly applicable to various in situ transcriptomic measurements to uncover gene expression patterns, cell niche, and tissue organization principles from images with high-dimensional transcriptomic profiles. Reference
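As a rough illustration of the density peak clustering idea named above, the following is a minimal sketch in plain Python. It is not ClusterMap's implementation: the function name, the `cutoff` radius, and the fixed `n_clusters` argument are assumptions made for this example, and the points stand in for RNA spot coordinates only.

```python
import math


def density_peak_clustering(points, cutoff, n_clusters):
    """Toy density peak clustering of coordinate tuples.

    `cutoff` and `n_clusters` are illustrative parameters chosen for this
    sketch; they are not taken from ClusterMap.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density: number of other points within the cutoff radius.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]

    # Rank points from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # delta: distance to the nearest higher-ranked point (its "parent").
    # The top-ranked point has no parent and gets its largest distance.
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], parent[i] = dist[i][j], j

    # Cluster centres combine high density with a large delta.
    centres = sorted(order, key=lambda i: rho[i] * delta[i],
                     reverse=True)[:n_clusters]
    labels = [-1] * n
    for label, i in enumerate(centres):
        labels[i] = label

    # Walk in density order so every parent is labelled before its child.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels
```

Calling `density_peak_clustering(points, cutoff=0.5, n_clusters=2)` on two well-separated groups of 2D points gives each group its own label. A real pipeline would also use gene identity and would pick the centres from the data rather than from a fixed count.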
Subtype heterogeneity and epigenetic convergence in neuroendocrine prostate cancer
Neuroendocrine carcinomas (NEC) are tumors expressing markers of neuronal differentiation that can arise at different anatomic sites but have strong histological and clinical similarities.
Here we report the chromatin landscapes of a range of human NECs and show convergence to the activation of a common epigenetic program. With a particular focus on treatment-emergent neuroendocrine prostate cancer (NEPC), we analyze cell lines, patient-derived xenograft (PDX) models and human clinical samples to show the existence of two distinct NEPC subtypes based on the expression of the neuronal transcription factors ASCL1 and NEUROD1. While in cell lines and PDX models these subtypes are mutually exclusive, single-cell analysis of human clinical samples exhibits a more complex tumor structure with subtypes coexisting as separate sub-populations within the same tumor. These tumor sub-populations differ genetically and epigenetically, contributing to intra- and inter-tumoral heterogeneity in human metastases. Overall, our results provide a deeper understanding of the shared clinicopathological characteristics shown by NECs. Reference
Single-cell ATAC and RNA sequencing reveal pre-existing and persistent cells associated with prostate cancer relapse
Prostate cancer is heterogeneous and patients would benefit from methods that stratify those who are likely to respond to systemic therapy.
Here, we employ single-cell assays for transposase-accessible chromatin (ATAC) and RNA sequencing in models of early treatment response and resistance to enzalutamide. In doing so, we identify pre-existing and treatment-persistent cell subpopulations that possess regenerative potential when subjected to treatment. We find distinct chromatin landscapes associated with enzalutamide treatment and resistance that are linked to alternative transcriptional programs. Transcriptional profiles characteristic of persistent cells are able to stratify the treatment response of patients. Reference
EpiScanpy: integrated single-cell epigenomic analysis
EpiScanpy is a toolkit for the analysis of single-cell epigenomic data, namely single-cell DNA methylation and single-cell ATAC-seq data. To address the modality specific challenges from epigenomics data, epiScanpy quantifies the epigenome using multiple feature space constructions and builds a nearest neighbour graph using epigenomic distance between cells.
EpiScanpy makes the many existing scRNA-seq workflows from scanpy available to large-scale single-cell data from other -omics modalities, including methods for common clustering, dimension reduction, cell type identification and trajectory learning techniques, as well as an atlas integration tool for scATAC-seq datasets. The toolkit also features numerous useful downstream functions, such as differential methylation and differential openness calling, mapping epigenomic features of interest to their nearest gene, or constructing gene activity matrices using chromatin openness. We successfully benchmark epiScanpy against other scATAC-seq analysis tools and show that it outperforms them at discriminating cell types. Reference
NucHMM: a method for quantitative modeling of nucleosome organization identifying functional nucleosome states distinctly associated with splicing potentiality
We develop a novel computational method, NucHMM, to identify functional nucleosome states associated with cell type-specific combinatorial histone marks and nucleosome organization features such as phasing, spacing and positioning.
We test it on publicly available MNase-seq and ChIP-seq data in MCF7, H1, and IMR90 cells and identify 11 distinct functional nucleosome states. We demonstrate these nucleosome states are distinctly associated with the splicing potentiality of skipping exons. This advances our understanding of the chromatin function at the nucleosome level and offers insights into the interplay between nucleosome organization and splicing processes. Reference
Integration of metabolomics, genomics, and immune phenotypes reveals the causal roles of metabolites in disease
Recent studies highlight the role of metabolites in immune diseases, but it remains unknown how much of this effect is driven by genetic and non-genetic host factors.
We systematically investigate circulating metabolites in a cohort of 500 healthy subjects (500FG) in whom immune function and activity are deeply measured and whose genetics are profiled. Our data reveal that several major metabolic pathways, including the alanine/glutamate pathway and the arachidonic acid pathway, have a strong impact on cytokine production in response to ex vivo stimulation. We also examine the genetic regulation of metabolites associated with immune phenotypes through genome-wide association analysis and identify 29 significant loci, including eight novel independent loci. Reference
scMC learns biological variation through the alignment of multiple single-cell genomics datasets
Distinguishing biological from technical variation is crucial when integrating and comparing single-cell genomics datasets across different experiments.
Existing methods lack the ability to explicitly distinguish these two sources of variation, often leading to the removal of both. Here, we present an integration method, scMC, to remove the technical variation while preserving the intrinsic biological variation. scMC learns biological variation via variance analysis to subtract technical variation inferred in an unsupervised manner. Application of scMC to both simulated and real datasets from single-cell RNA-seq and ATAC-seq experiments demonstrates its capability of detecting context-shared and context-specific biological signals via accurate alignment. Reference
Single-cell atlas of the first intra-mammalian developmental stage of the human parasite Schistosoma mansoni
Over 250 million people suffer from schistosomiasis, a tropical disease caused by parasitic flatworms known as schistosomes. Humans become infected by free-swimming, water-borne larvae, which penetrate the skin.
The earliest intra-mammalian stage, called the schistosomulum, undergoes a series of developmental transitions. These changes are critical for the parasite to adapt to its new environment as it navigates through host tissues to reach its niche, where it will grow to reproductive maturity. Unravelling the mechanisms that drive intra-mammalian development requires knowledge of the spatial organisation and transcriptional dynamics of different cell types that comprise the schistosomulum body. To fill these important knowledge gaps, we perform single-cell RNA sequencing on two-day-old schistosomula of Schistosoma mansoni. We identify likely gene expression profiles for muscle, nervous system, tegument, oesophageal gland, parenchymal/primordial gut cells, and stem cells. In addition, we validate cell markers for all these clusters by in situ hybridisation in schistosomula and adult parasites. Reference
FAN-C: a feature-rich framework for the analysis and visualisation of chromosome conformation capture data
Chromosome conformation capture data, particularly from high-throughput approaches such as Hi-C, are typically very complex to analyse. Existing analysis tools are often single-purpose, or limited in compatibility to a small number of data formats, frequently making Hi-C analyses tedious and time-consuming.
Here, we present FAN-C, an easy-to-use command-line tool and powerful Python API with a broad feature set covering matrix generation, analysis, and visualisation for C-like data (https://github.com/vaquerizaslab/fanc). Due to its compatibility with the most prevalent Hi-C storage formats, FAN-C can be used in combination with a large number of existing analysis tools, thus greatly simplifying Hi-C matrix analysis. Reference
Genome-wide association and multi-omic analyses reveal ACTN2 as a gene linked to heart failure
Heart failure is a major public health problem affecting over 23 million people worldwide. In this study, we present the results of a large-scale meta-analysis of heart failure GWAS and replication in a comparably sized cohort to identify one known and two novel loci associated with heart failure.
Heart failure sub-phenotyping shows that a new locus in chromosome 1 is associated with left ventricular adverse remodeling and clinical heart failure, in response to different initial cardiac muscle insults. Functional characterization and fine-mapping of that locus reveal a putative causal variant in a cardiac muscle-specific regulatory region activated during cardiomyocyte differentiation that binds to the ACTN2 gene, which encodes a crucial structural protein of the cardiac sarcomere (Hi-C interaction p-value = 0.00002). Reference
A rare codon-based translational program of cell proliferation
The speed of translation elongation is primarily determined by the abundance of tRNAs. Thus, the codon usage influences the rate with which individual mRNAs are translated.
As the nature of tRNA pools and modifications can vary across biological conditions, codon elongation rates may also vary, leading to fluctuations in the protein production from individual mRNAs. Although it has been observed that functionally related mRNAs exhibit similar codon usage, presumably to provide an effective way to coordinate expression of multiple proteins, experimental evidence for codon-mediated translation efficiency modulation of functionally related mRNAs in specific conditions is scarce and the associated mechanisms are still debated. Reference
Altered chromatin landscape and enhancer engagement underlie transcriptional dysregulation in MED12 mutant uterine leiomyomas
Uterine leiomyomas (fibroids) are a major source of gynecologic morbidity in reproductive age women and are characterized by the excessive deposition of a disorganized extracellular matrix, resulting in rigid benign tumors.
Although down regulation of the transcription factor AP-1 is highly prevalent in leiomyomas, the functional consequence of AP-1 loss on gene transcription in uterine fibroids remains poorly understood. Using high-resolution ChIP-sequencing, promoter capture Hi-C, and RNA-sequencing of matched normal and leiomyoma tissues, here we show that modified enhancer architecture is a major driver of transcriptional dysregulation in MED12 mutant uterine leiomyomas. Furthermore, modifications in enhancer architecture are driven by the depletion of AP-1 occupancy on chromatin. Reference
Immuno-genomic landscape of osteosarcoma
Limited clinical activity has been seen in osteosarcoma (OS) patients treated with immune checkpoint inhibitors (ICI). To gain insights into the immunogenic potential of these tumors, we conducted whole genome, RNA, and T-cell receptor sequencing, immunohistochemistry and reverse phase protein array profiling (RPPA) on OS specimens from 48 pediatric and adult patients with primary, relapsed, and metastatic OS.
Median immune infiltrate level was lower than in other tumor types where ICI are effective, with concomitant low T-cell receptor clonalities. Neoantigen expression in OS was lacking and significantly associated with high levels of nonsense-mediated decay (NMD). Samples with low immune infiltrate had a higher number of deleted genes, while those with high immune infiltrate expressed higher levels of adaptive resistance pathways. Reference
Pan-cancer analysis reveals cooperativity of both strands of microRNA that regulate tumorigenesis and patient survival
Recently, both 5p and 3p miRNA strands have been recognized as functional, rather than only one, leaving many miRNA strands uninvestigated. To determine whether both miRNA strands, which have different mRNA-targeting sequences, cooperate to regulate pathways/functions across cancer types, we evaluate genomic, epigenetic, and molecular profiles of >5200 patient samples from 14 different cancers, and RNA interference and CRISPR screens in 290 cancer cell lines.
We identify concordantly dysregulated miRNA 5p/3p pairs that coordinately modulate oncogenic pathways and/or cell survival/growth across cancers. Down-regulation of both strands of miR-30a and miR-145 recurrently increased cell cycle pathway genes and significantly reduced patient survival in multiple cancers. Forced expression of all four strands shows cooperativity, reducing cell cycle pathways and inhibiting lung cancer cell proliferation and migration. Reference
A computational platform for high-throughput analysis of RNA sequences and modifications by mass spectrometry
The field of epitranscriptomics continues to reveal how post-transcriptional modification of RNA affects a wide variety of biological phenomena. A pivotal challenge in this area is the identification of modified RNA residues within their sequence contexts. Mass spectrometry (MS) offers a comprehensive solution by using analogous approaches to shotgun proteomics.
However, software support for the analysis of RNA MS data is inadequate at present and does not allow high-throughput processing. Existing software solutions lack the raw performance and statistical grounding to efficiently handle the numerous modifications found on RNA. We present a free and open-source database search engine for RNA MS data, called NucleicAcidSearchEngine (NASE), that addresses these shortcomings. Reference
Robustness and applicability of transcription factor and pathway analysis tools on single-cell RNA-seq data
Many functional analysis tools have been developed to extract functional and mechanistic insight from bulk transcriptome data. With the advent of single-cell RNA sequencing (scRNA-seq), it is in principle possible to do such an analysis for single cells.
However, scRNA-seq data has characteristics such as drop-out events and low library sizes. It is thus not clear if functional TF and pathway analysis tools established for bulk sequencing can be applied to scRNA-seq in a meaningful way. Reference
Epigenetic specifications of host chromosome docking sites for latent Epstein-Barr virus
Epstein-Barr virus (EBV) genomes persist in latently infected cells as extrachromosomal episomes that attach to host chromosomes through the tethering functions of EBNA1, a viral encoded sequence-specific DNA binding protein.
Here we employ circular chromosome conformation capture (4C) analysis to identify genome-wide associations between EBV episomes and host chromosomes. We find that EBV episomes in Burkitt’s lymphoma cells preferentially associate with cellular genomic sites containing EBNA1 binding sites enriched with B-cell factors EBF1 and RBP-jK, the repressive histone mark H3K9me3, and AT-rich flanking sequence. These attachment sites correspond to transcriptionally silenced genes with GO enrichment for neuronal function and protein kinase A pathways. Reference
Single-cell RNA-sequencing of differentiating iPS cells reveals dynamic genetic effects on gene expression
Recent developments in stem cell biology have enabled the study of cell fate decisions in early human development that are impossible to study in vivo. However, how development varies across individuals and, in particular, the influence of common genetic variants during this process have not been characterised.
Here, we exploit human iPS cell lines from 125 donors, a pooled experimental design, and single-cell RNA-sequencing to study population variation of endoderm differentiation. We identify molecular markers that are predictive of differentiation efficiency of individual lines, and utilise heterogeneity in the genetic background across individuals to map hundreds of expression quantitative trait loci that influence expression dynamically during differentiation and across cellular contexts. Reference
Systematic functional identification of cancer multi-drug resistance genes
Drug resistance is a major obstacle in cancer therapy. To elucidate the genetic factors that regulate sensitivity to anti-cancer drugs, we performed CRISPR-Cas9 knockout screens for resistance to a spectrum of drugs.
In addition to known drug targets and resistance mechanisms, this study revealed novel insights into drug mechanisms of action, including cellular transporters, drug target effectors, and genes involved in target-relevant pathways. Importantly, we identified ten multi-drug resistance genes, including an uncharacterized gene C1orf115, which we named Required for Drug-induced Death 1 (RDD1). Loss of RDD1 resulted in resistance to five anti-cancer drugs. Finally, targeting RDD1 leads to chemotherapy resistance in mice and low RDD1 expression is associated with poor prognosis in multiple cancers. Reference
Pan-Cancer Analysis of Whole Genomes
Cancer is a disease of the genome, caused by a cell’s acquisition of somatic mutations in key cancer genes. These mutations alter pathways involved in regulating cellular growth and interactions with the tissue environment. Until recently, research on the cancer genome was focused on protein-coding genes, which together account for only 1% of the genome.
To address this issue, the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Project performed whole genome sequencing and integrative analysis on over 2,600 primary cancers and their matching normal tissues across 38 distinct tumor types. This study revealed the extensive role played by large-scale structural mutations in cancer, identified previously-unknown cancer-related mutations in gene regulatory regions, inferred tumor evolution across multiple cancer types, illuminated the interactions between somatic mutations and the transcriptome, and studied the role of germline genetic variants in modulating mutational processes. Reference
scAI: an unsupervised approach for the integrative analysis of parallel single-cell transcriptomic and epigenomic profiles
Simultaneous measurements of transcriptomic and epigenomic profiles in the same individual cells provide an unprecedented opportunity to understand cell fates. However, effective approaches for the integrative analysis of such data are lacking.
Here, we present a single-cell aggregation and integration (scAI) method to deconvolute cellular heterogeneity from parallel transcriptomic and epigenomic profiles. Through iterative learning, scAI aggregates sparse epigenomic signals in similar cells learned in an unsupervised manner, allowing coherent fusion with transcriptomic measurements. Simulation studies and applications to three real datasets demonstrate its capability of dissecting cellular heterogeneity within both transcriptomic and epigenomic layers and understanding transcriptional regulatory mechanisms. Reference
A proactive genotype-to-patient-phenotype map for cystathionine beta-synthase
For the majority of rare clinical missense variants, pathogenicity status cannot currently be classified. Classical homocystinuria, characterized by elevated homocysteine in plasma and urine, is caused by variants in the cystathionine beta-synthase (CBS) gene, most of which are rare. With early detection, existing therapies are highly effective.
Damaging CBS variants can be detected based on their failure to restore growth in yeast cells lacking the yeast ortholog CYS4. This assay has only been applied reactively, after first observing a variant in patients. Using saturation codon-mutagenesis, en masse growth selection, and sequencing, we generated a comprehensive, proactive map of CBS missense variant function. Reference
scMAGeCK links genotypes with multiple phenotypes in single-cell CRISPR screens
We present scMAGeCK, a computational framework to identify genomic elements associated with multiple expression-based phenotypes in CRISPR/Cas9 functional screening that uses single-cell RNA-seq as readout.
scMAGeCK outperforms existing methods, identifies genes and enhancers with known and novel functions in cell proliferation, and enables an unbiased construction of a genotype-phenotype network. Single-cell CRISPR screening on mouse embryonic stem cells identifies key genes associated with different pluripotency states. Applying scMAGeCK to multiple datasets, we identify key factors that improve the power of single-cell CRISPR screening. Collectively, scMAGeCK is a novel tool to study genotype-phenotype relationships at a single-cell level. Reference
Genome-wide rare variant analysis for thousands of phenotypes in over 70,000 exomes from two cohorts
Understanding the impact of rare variants is essential to understanding human health. We analyze rare (MAF < 0.1%) variants against 4264 phenotypes in 49,960 exome-sequenced individuals from the UK Biobank and 1934 phenotypes (1821 overlapping with UK Biobank) in 21,866 members of the Healthy Nevada Project (HNP) cohort who underwent Exome+ sequencing at Helix.
After using our rare-variant-tailored methodology to reduce test statistic inflation, we identify 64 statistically significant gene-based associations in our meta-analysis of the two cohorts and 37 for phenotypes available in only one cohort. Singletons make significant contributions to our results, and the vast majority of the associations could not have been identified with a genotyping chip. Reference
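To make the rare-variant definition above concrete, here is a small sketch of how a minor allele frequency (MAF) filter at the 0.1% threshold and a singleton check might look. The function names and the genotype encoding (alternate-allele counts of 0, 1, or 2 per diploid individual, `None` for a missing call) are assumptions for illustration, not the study's pipeline.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF for one biallelic variant.

    `genotypes` holds alternate-allele counts (0, 1 or 2) per diploid
    individual; None marks a missing call. This encoding is an assumption
    of the sketch.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes for this variant")
    alt_freq = sum(called) / (2 * len(called))
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1 - alt_freq)


def is_rare(genotypes, threshold=0.001):
    """True if the variant's MAF is below the threshold (0.1% by default)."""
    return minor_allele_frequency(genotypes) < threshold


def is_singleton(genotypes):
    """True if exactly one alternate allele copy is seen among called samples."""
    return sum(g for g in genotypes if g is not None) == 1
```

For example, a single heterozygous carrier among 1,000 called individuals gives a MAF of 1/2000 = 0.05%, which passes the filter and also counts as a singleton.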
CUBIC: an atlas of genetic architecture promises directed maize improvement
Identifying genotype-phenotype links and causative genes from quantitative trait loci (QTL) is challenging for complex agronomically important traits. To accelerate maize gene discovery and breeding, we present the Complete-diallel design plus Unbalanced Breeding-like Inter-Cross (CUBIC) population, consisting of 1404 individuals created by extensively inter-crossing 24 widely used Chinese maize founders.
Hundreds of QTL for 23 agronomic traits are uncovered with 14 million high-quality SNPs and a high-resolution identity-by-descent map, which account for an average of 75% of the heritability for each trait. We find epistasis contributes to phenotypic variance widely. Integrative cross-population analysis and cross-omics mapping allow effective and rapid discovery of underlying genes, validated here with a case study on leaf width. Reference
Genomic surveillance for hypervirulence and multi-drug resistance in invasive Klebsiella pneumoniae from South and Southeast Asia
Klebsiella pneumoniae is a leading cause of bloodstream infection (BSI). Strains producing extended-spectrum beta-lactamases (ESBLs) or carbapenemases are considered global priority pathogens for which new treatment and prevention strategies are urgently required, due to severely limited therapeutic options.
South and Southeast Asia are major hubs for antimicrobial-resistant (AMR) K. pneumoniae and also for the characteristically antimicrobial-sensitive, community-acquired “hypervirulent” strains. The emergence of hypervirulent AMR strains and lack of data on exopolysaccharide diversity pose a challenge for K. pneumoniae BSI control strategies worldwide. Reference
Benchmarking principal component analysis for large-scale single-cell RNA-sequencing
Principal component analysis (PCA) is an essential method for analyzing single-cell RNA-seq (scRNA-seq) datasets, but for large-scale scRNA-seq datasets, the computation is slow and consumes large amounts of memory.
In this work, we review the existing fast and memory-efficient PCA algorithms and implementations and evaluate their practical application to large-scale scRNA-seq datasets. Our benchmark shows that some PCA algorithms based on Krylov subspace and randomized singular value decomposition are fast, memory-efficient, and more accurate than the other algorithms. Reference
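For readers unfamiliar with randomized singular value decomposition, the sketch below shows one common form of the idea (a random range finder with a few power iterations) applied to PCA, using NumPy. It is an illustrative assumption about how such an algorithm can be written, not one of the benchmarked implementations; the function name and the `n_oversamples`, `n_iter`, and `seed` parameters are made up for this example.

```python
import numpy as np


def randomized_pca(X, n_components, n_oversamples=10, n_iter=4, seed=0):
    """Approximate PCA of a cells-by-genes matrix via randomized SVD.

    Parameter names and defaults are assumptions of this sketch.
    Returns (scores, components, singular_values).
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)  # centre each gene (column)
    k = min(n_components + n_oversamples, min(Xc.shape))

    # Sketch the column space of Xc with a random Gaussian test matrix.
    Q, _ = np.linalg.qr(Xc @ rng.standard_normal((Xc.shape[1], k)))

    # Power iterations sharpen the sketch; QR keeps it well conditioned.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Q)

    # Exact SVD of the small projected matrix (k x genes).
    U_small, S, Vt = np.linalg.svd(Q.T @ Xc, full_matrices=False)
    U = Q @ U_small

    scores = U[:, :n_components] * S[:n_components]
    return scores, Vt[:n_components], S[:n_components]
```

The expensive step is reduced to a few products between the data matrix and a thin matrix, which is why methods of this family scale to large datasets. This dense version still centres the whole matrix in memory, so a sparse or out-of-core variant would apply the centring implicitly.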
Machine learning can identify newly diagnosed patients with CLL at high risk of infection
Infections have become the major cause of morbidity and mortality among patients with chronic lymphocytic leukemia (CLL) due to immune dysfunction and cytotoxic CLL treatment.
Yet, predictive models for infection are missing. In this work, we develop the CLL Treatment-Infection Model (CLL-TIM) that identifies patients at risk of infection or CLL treatment within 2 years of diagnosis as validated on both internal and external cohorts. CLL-TIM is an ensemble algorithm composed of 28 machine learning algorithms based on data from 4,149 patients with CLL. The model is capable of dealing with heterogeneous data, including the high rates of missing data to be expected in the real-world setting, with a precision of 72% and a recall of 75%. To address concerns regarding the use of complex machine learning algorithms in the clinic, for each patient with CLL, CLL-TIM provides explainable predictions through uncertainty estimates and personalized risk factors. Reference
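The precision and recall figures quoted above follow the standard definitions, which the short sketch below spells out. The confusion-matrix counts in the usage note are hypothetical numbers chosen only so that the arithmetic reproduces 72% and 75%; they are not taken from the study.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    if true_pos + false_pos == 0 or true_pos + false_neg == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = true_pos / (true_pos + false_pos)  # flagged patients who had the event
    recall = true_pos / (true_pos + false_neg)     # patients with the event who were flagged
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With hypothetical counts of 72 true positives, 28 false positives, and 24 false negatives, `precision_recall_f1(72, 28, 24)` gives a precision of 0.72 and a recall of 0.75, with an F1 score of roughly 0.73.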
DENDRO: genetic heterogeneity profiling and subclone detection by single-cell RNA sequencing
Although scRNA-seq is now ubiquitously adopted in studies of intratumor heterogeneity, detection of somatic mutations and inference of clonal membership from scRNA-seq is currently unreliable.
We propose DENDRO, an analysis method for scRNA-seq data that clusters single cells into genetically distinct subclones and reconstructs the phylogenetic tree relating the subclones. DENDRO utilizes transcribed point mutations and accounts for technical noise and expression stochasticity. We benchmark DENDRO and demonstrate its application on simulation data and real data from three cancer types. In particular, on a mouse melanoma model in response to immunotherapy, DENDRO delineates the role of neoantigens in treatment response. Reference
Transcription phenotypes of pancreatic cancer are driven by genomic events during tumor evolution
Pancreatic adenocarcinoma presents as a spectrum of a highly aggressive disease in patients. The basis of this disease heterogeneity has proved difficult to resolve due to poor tumor cellularity and extensive genomic instability.
To address this, a dataset of whole genomes and transcriptomes was generated from purified epithelium of primary and metastatic tumors. Transcriptome analysis demonstrated that molecular subtypes are a product of a gene expression continuum driven by a mixture of intratumoral subpopulations, which was confirmed by single-cell analysis. Integrated whole-genome analysis uncovered that molecular subtypes are linked to specific copy number aberrations in genes such as mutant KRAS and GATA6. By mapping tumor genetic histories, tetraploidization emerged as a key mutational process behind these events. Reference
A transcriptome-wide Mendelian randomization study to uncover tissue-dependent regulatory mechanisms across the human phenome
Developing insight into tissue-specific transcriptional mechanisms can help improve our understanding of how genetic variants exert their effects on complex traits and disease.
In this study, we apply the principles of Mendelian randomization to systematically evaluate transcriptome-wide associations between gene expression (across 48 different tissue types) and 395 complex traits. Our findings indicate that variants which influence gene expression levels in multiple tissues are more likely to influence multiple complex traits. Moreover, detailed investigations of our results highlight tissue-specific associations, drug validation opportunities, insight into the likely causal pathways for trait-associated variants and also implicate putative associations at loci yet to be implicated in disease susceptibility. Reference
Chromatin interactome mapping at 139 independent breast cancer risk signals
Genome-wide association studies have identified 196 high confidence independent signals associated with breast cancer susceptibility. Variants within these signals frequently fall in distal regulatory DNA elements that control gene expression.
We designed a Capture Hi-C array to enrich for chromatin interactions between the credible causal variants and target genes in six human mammary epithelial and breast cancer cell lines. We show that interacting regions are enriched for open chromatin, histone marks for active enhancers, and transcription factors relevant to breast biology. We exploit this comprehensive resource to identify candidate target genes at 139 independent breast cancer risk signals and explore the functional mechanism underlying altered risk at the 12q24 risk region. Reference
Single-cell analysis based dissection of clonality in myelofibrosis
Cancer development is an evolutionary genomic process with parallels to Darwinian selection. It requires acquisition of multiple somatic mutations that collectively cause a malignant phenotype and continuous clonal evolution is often linked to tumor progression.
Here, we show the clonal evolution structure in 15 myelofibrosis (MF) patients while receiving treatment with JAK inhibitors (mean follow-up 3.9 years). Whole-exome sequencing at multiple time points reveals acquisition of somatic mutations and copy number aberrations over time. While JAK inhibition therapy does not seem to create a clear evolutionary bottleneck, we observe a more complex clonal architecture over time, and appearance of unrelated clones. Disease progression associates with increased genetic heterogeneity and gain of RAS/RTK pathway mutations. Clonal diversity results in clone-specific expansion within different myeloid cell lineages. Single-cell genotyping of circulating CD34+ progenitor cells allows the reconstruction of MF phylogeny, demonstrating loss of heterozygosity and parallel evolution as recurrent events. Reference
Epigenetics meets proteomics in an epigenome-wide association study with circulating blood plasma protein traits
DNA methylation and blood circulating proteins have been associated with many complex disorders, but the underlying disease-causing mechanisms often remain unclear. Here, we report an epigenome-wide association study of 1123 proteins from 944 participants of the KORA population study and replication in a multi-ethnic cohort of 344 individuals.
We identify 98 CpG-protein associations (pQTMs) at a stringent Bonferroni level of significance. Overlapping associations with transcriptomics, metabolomics, and clinical endpoints suggest implication of processes related to chronic low-grade inflammation, including a network involving methylation of NLRC5, a regulator of the inflammasome, and associated pQTMs implicating key proteins of the immune system, such as CD48, CD163, CXCL10, CXCL11, LAG3, FCGR3B, and B2M. Our study links DNA methylation to disease endpoints via intermediate proteomics phenotypes and identifies correlative networks that may eventually be targeted in a personalized approach of chronic low-grade inflammation. Reference
In vivo functional analysis of non-conserved human lncRNAs associated with cardiometabolic traits
Unlike protein-coding genes, the majority of human long non-coding RNAs (lncRNAs) are considered non-conserved. Although lncRNAs have been shown to function in diverse pathophysiological processes in mice, it remains largely unknown whether human lncRNAs have such in vivo functions. Here, we describe an integrated pipeline to define the in vivo function of non-conserved human lncRNAs.
We first identify lncRNAs with high function potential using multiple indicators derived from human genetic data related to cardiometabolic traits, then define lncRNA’s function and specific target genes by integrating its correlated biological pathways in humans and co-regulated genes in a humanized mouse model. Finally, we demonstrate that the in vivo function of human-specific lncRNAs can be successfully examined in the humanized mouse model, and experimentally validate the predicted function of an obesity-associated lncRNA, LINC01018, in regulating the expression of genes in fatty acid oxidation in humanized livers through its interaction with RNA-binding protein HuR. Reference
scRNA-seq assessment of the human lung, spleen, and esophagus tissue stability after cold preservation
The Human Cell Atlas is a large international collaborative effort to map all cell types of the human body. Single-cell RNA sequencing can generate high-quality data for the delivery of such an atlas. However, delays between fresh sample collection and processing may lead to poor data and difficulties in experimental design.
This study assesses the effect of cold storage on fresh healthy spleen, esophagus, and lung from ≥ 5 donors over 72 h. We collect 240,000 high-quality single-cell transcriptomes with detailed cell type annotations and whole genome sequences of donors, enabling future eQTL studies. Our data provide a valuable resource for the study of these 3 organs and will allow cross-organ comparison of cell types. Reference
Identifying cross-disease components of genetic risk across hospital data in the UK Biobank
Genetic risk factors frequently affect multiple common human diseases, providing insight into shared pathophysiological pathways and opportunities for therapeutic development. However, systematic identification of genetic profiles of disease risk is limited both by the availability of comprehensive clinical data on population-scale cohorts and by the lack of suitable statistical methodology that can handle the scale of, and differential power inherent in, multi-phenotype data.
Here, we develop a disease-agnostic approach to cluster the genetic risk profiles for 3,025 genome-wide independent loci across 19,155 disease classification codes from 320,644 participants in the UK Biobank, representing a large and heterogeneous population. We identify 339 distinct disease association profiles and use multiple approaches to link clusters to the underlying biological pathways. Reference
The somatic mutation landscape of the human body
Somatic mutations in healthy tissues contribute to aging, neurodegeneration, and cancer initiation, yet they remain largely uncharacterized. To gain a better understanding of the genome-wide distribution and functional impact of somatic mutations, we leverage the genomic information contained in the transcriptome to uniformly call somatic mutations from over 7500 tissue samples, representing 36 distinct tissues.
This catalog, containing over 280,000 mutations, reveals a wide diversity of tissue-specific mutation profiles associated with gene expression levels and chromatin states. For example, lung samples with low expression of the mismatch-repair gene MLH1 show a mutation signature of deficient mismatch repair. In addition, we find pervasive negative selection acting on missense and nonsense mutations, except for mutations previously observed in cancer samples, which are under positive selection and are highly enriched in many healthy tissues. Reference
Knowledge Base Commons (KBCommons) v1.1: a universal framework for multi-omics data integration and biological discoveries
Knowledge Base Commons (KBCommons) v1.1 is a universal and all-inclusive web-based framework providing generic functionalities for storing, sharing, analyzing, exploring, integrating and visualizing multiple organisms’ genomics and integrative omics data. KBCommons is designed and developed to integrate diverse multi-level omics data and to support biological discoveries for all species via a common platform.
KBCommons has four modules including data storage, data processing, data accessing, and web interface for data management and retrieval. It provides a comprehensive framework for new plant-specific, animal-specific, virus-specific, bacteria-specific or human disease-specific knowledge base (KB) creation, for adding new genome versions and additional multi-omics data to existing KBs, and for exploring existing datasets within current KBs. Reference
Genomic risk score offers predictive performance comparable to clinical risk factors for ischaemic stroke
Recent genome-wide association studies in stroke have enabled the generation of genomic risk scores (GRS) but their predictive power has been modest compared to established stroke risk factors.
Here, using a meta-scoring approach, we develop a metaGRS for ischaemic stroke (IS) and analyse this score in the UK Biobank (n = 395,393; 3075 IS events by age 75). The metaGRS hazard ratio for IS (1.26, 95% CI 1.22–1.31 per metaGRS standard deviation) doubles that of a previous GRS, identifying a subset of individuals at monogenic levels of risk: the top 0.25% of metaGRS have three-fold risk of IS. The metaGRS is similarly or more predictive compared to several risk factors, such as family history, blood pressure, body mass index, and smoking. Reference
Uterine adenomyosis is an oligoclonal disorder associated with KRAS mutations
Uterine adenomyosis is a benign disorder that often co-occurs with endometriosis and/or leiomyoma, and impairs quality of life. The genomic features of adenomyosis are unknown. Here we apply next-generation sequencing to adenomyosis (70 individuals and 192 multi-regional samples), as well as co-occurring leiomyoma and endometriosis, and find recurring KRAS mutations in 26/70 (37.1%) of adenomyosis cases.
Multi-regional sequencing reveals oligoclonality in adenomyosis, with some mutations also detected in normal endometrium and/or co-occurring endometriosis. KRAS mutations are more frequent in cases of adenomyosis with co-occurring endometriosis, low progesterone receptor (PR) expression, or progestin (dienogest; DNG) pretreatment. DNG’s anti-proliferative effect is diminished via epigenetic silencing of PR in immortalized cells with mutant KRAS. Reference
SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies
Genomic differences range from single nucleotide differences to complex structural variations. Current methods typically annotate sequence differences ranging from SNPs to large indels accurately but do not unravel the full complexity of structural rearrangements, including inversions, translocations, and duplications, where highly similar sequence changes in location, orientation, or copy number.
Here, we present SyRI, a pairwise whole-genome comparison tool for chromosome-level assemblies. SyRI starts by finding rearranged regions and then searches for differences in the sequences, which are distinguished by whether they reside in syntenic or rearranged regions. This distinction is important as rearranged regions are inherited differently compared to syntenic regions. Reference
Recurrent PTPRT/JAK2 mutations in lung adenocarcinoma among African Americans
Reducing or eliminating persistent disparities in lung cancer incidence and survival has been challenging because our current understanding of lung cancer biology is derived primarily from populations of European descent.
Here we show results from a targeted sequencing panel using NCI-MD Case Control Study patient samples and reveal a significantly higher prevalence of PTPRT and JAK2 mutations in lung adenocarcinomas among African Americans compared with European Americans. This increase in mutation frequency was validated with independent WES data from the NCI-MD Case Control Study and TCGA. We find that patients carrying these mutations have a concomitant increase in IL-6/STAT3 signaling and miR-21 expression. Reference
Vireo: Bayesian demultiplexing of pooled single-cell RNA-seq data without genotype reference
Multiplexed single-cell RNA-seq analysis of multiple samples using pooling is a promising experimental design, offering increased throughput while helping to overcome batch variation.
To reconstruct the sample identity of each cell, genetic variants that segregate between the samples in the pool have been proposed as natural barcodes for cell demultiplexing. Existing demultiplexing strategies rely on availability of complete genotype data from the pooled samples, which limits the applicability of such methods, in particular when genetic variation is not the primary object of study. To address this, we here present Vireo, a computationally efficient Bayesian model to demultiplex single-cell data from pooled experimental designs. Uniquely, our model can be applied in settings when only partial or no genotype information is available. Using pools based on synthetic mixtures and results on real data, we demonstrate the robustness of Vireo and illustrate the utility of multiplexed experimental designs for common expression analyses. Reference
Obesity and disease severity magnify disturbed microbiome-immune interactions in asthma patients
In order to improve targeted therapeutic approaches for asthma patients, insights into the molecular mechanisms that differentially contribute to disease phenotypes, such as obese asthmatics or severe asthmatics, are required.
Here we report immunological and microbiome alterations in obese asthmatics (n = 50, mean age = 45), non-obese asthmatics (n = 53, mean age = 40), obese non-asthmatics (n = 51, mean age = 44) and their healthy counterparts (n = 48, mean age = 39). Obesity is associated with elevated proinflammatory signatures, which are enhanced in the presence of asthma. Similarly, obesity or asthma induces changes in the composition of the microbiota, while an additive effect is observed in obese asthma patients. Asthma disease severity is negatively correlated with fecal Akkermansia muciniphila levels. Reference
Text-mining clinically relevant cancer biomarkers for curation into the CIViC database
Precision oncology involves analysis of individual cancer samples to understand the genes and pathways involved in the development and progression of a cancer.
To improve patient care, knowledge of diagnostic, prognostic, predisposing, and drug response markers is essential. Several knowledgebases have been created by different groups to collate evidence for these associations. These include the open-access Clinical Interpretation of Variants in Cancer (CIViC) knowledgebase. These databases rely on time-consuming manual curation from skilled experts who read and interpret the relevant biomedical literature. Reference
Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis
Dimensionality reduction is an indispensable analytic component for many areas of single-cell RNA sequencing (scRNA-seq) data analysis. Proper dimensionality reduction can allow for effective noise removal and facilitate many downstream analyses that include cell clustering and lineage reconstruction.
Unfortunately, despite the critical importance of dimensionality reduction in scRNA-seq analysis and the vast number of dimensionality reduction methods developed for scRNA-seq studies, few comprehensive comparison studies have been performed to evaluate the effectiveness of different dimensionality reduction methods in scRNA-seq. Reference
Transcriptomic analysis of human primary breast cancer identifies fatty acid oxidation as a target for metformin
Epidemiological studies suggest that metformin may reduce the incidence of cancer in patients with diabetes and multiple late phase clinical trials assessing the potential of repurposing this drug are underway.
Transcriptomic profiling of tumour samples is an excellent tool to understand drug bioactivity, identify candidate biomarkers and assess for mechanisms of resistance to therapy. Thirty-six patients with untreated primary breast cancer were recruited to a window study, and transcriptomic profiling of tumour samples was carried out before and after metformin treatment. Reference
Dashing: fast and accurate genomic distances with HyperLogLog
Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets. It uses the HyperLogLog sketch together with cardinality estimation methods that are specialized for set unions and intersections.
Dashing summarizes genomes more rapidly than previous MinHash-based methods while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in 6 minutes. Reference
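To make the idea of sketch-based similarity concrete, the following is a minimal Python sketch of a HyperLogLog counter with a Jaccard estimate. It is not Dashing's implementation: it assumes the classic HyperLogLog estimator with a linear-counting correction and plain inclusion-exclusion for intersections, whereas the abstract says Dashing uses cardinality estimation methods specialized for unions and intersections. The class and function names, the blake2b hash, the register count (p = 12), and the k-mer length (k = 21) are all illustrative assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p: int = 12):
        self.p = p                      # number of index bits (assumed default)
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash of the item; blake2b is an arbitrary choice for this sketch.
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        index = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def union(self, other: "HyperLogLog") -> "HyperLogLog":
        # The sketch of a set union is the element-wise maximum of registers.
        assert self.p == other.p, "sketches must use the same register count"
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)              # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)            # linear counting for small sets
        return raw


def jaccard(a: HyperLogLog, b: HyperLogLog) -> float:
    """Estimate |A ∩ B| / |A ∪ B| by inclusion-exclusion on sketch cardinalities."""
    union = a.union(b).cardinality()
    if union == 0:
        return 0.0
    intersection = max(0.0, a.cardinality() + b.cardinality() - union)
    return min(1.0, intersection / union)


def sketch_sequence(seq: str, k: int = 21, p: int = 12) -> HyperLogLog:
    """Sketch all k-mers of a sequence (no canonicalization, for brevity)."""
    sketch = HyperLogLog(p)
    for i in range(len(seq) - k + 1):
        sketch.add(seq[i:i + k])
    return sketch
```

Two genomes would be compared with `jaccard(sketch_sequence(g1), sketch_sequence(g2))`. Because each sketch has a fixed size regardless of genome length, all-pairs comparison over many genomes stays cheap once the sketches are built; the inclusion-exclusion step used here becomes noisy when the true intersection is small relative to the union.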
In silico prediction of high-resolution Hi-C interaction matrices
The three-dimensional (3D) organization of the genome plays an important role in gene regulation, bringing distal sequence elements in 3D proximity to genes hundreds of kilobases away. Hi-C is a powerful genome-wide technique to study 3D genome organization. Owing to experimental costs, high resolution Hi-C datasets are limited to a few cell lines.
Computational prediction of Hi-C counts can offer a scalable and inexpensive approach to examine 3D genome organization across multiple cellular contexts. Here we present HiC-Reg, an approach to predict contact counts from one-dimensional regulatory signals. HiC-Reg predictions identify topologically associating domains and significant interactions that are enriched for CCCTC-binding factor (CTCF) bidirectional motifs and interactions identified from complementary sources. Reference
Genomic and transcriptomic insights into molecular basis of sexually dimorphic nuptial spines in Leptobrachium leishanense
Identification of genetic biomarkers associated with autism spectrum disorders (ASDs) could improve recurrence prediction for families with a child with ASD. Here, we describe clinical microarray findings for 253 longitudinally phenotyped ASD families from the Baby Siblings Research Consortium (BSRC), encompassing 288 infant siblings.
By age 3, 103 siblings (35.8%) were diagnosed with ASD and 54 (18.8%) were developing atypically. Thirteen siblings have copy number variants (CNVs) involving ASD-relevant genes: 6 with ASD, 5 atypically developing, and 2 typically developing. Within these families, an ASD-related CNV in a sibling has a positive predictive value (PPV) for ASD or atypical development of 0.83; the Simons Simplex Collection of ASD families shows similar PPVs. Polygenic risk analyses suggest that common genetic variants may also contribute to ASD. CNV findings would have been pre-symptomatically predictive of ASD or atypical development in 11 (7%) of the 157 BSRC siblings who were eventually diagnosed clinically. Reference
Orchestrating single-cell analysis with Bioconductor
Recent technological advancements have enabled the profiling of a large number of genome-wide features in individual cells. However, single-cell data present unique challenges that require the development of specialized methods and software infrastructure to successfully derive biological insights.
The Bioconductor project has rapidly grown to meet these demands, hosting community-developed open-source software distributed as R packages and featuring state-of-the-art computational methods, standardized data infrastructure and interactive data visualization tools. Reference
ReorientExpress: reference-free orientation of nanopore cDNA reads with deep learning
We describe ReorientExpress, a method to perform reference-free orientation of transcriptomic long sequencing reads.
ReorientExpress uses deep learning to correctly predict the orientation of the majority of reads, in particular when trained on a closely related species or in combination with read clustering. ReorientExpress enables long-read transcriptomics in non-model organisms and in samples without a genome reference, without using additional technologies. Reference
Comparative analysis of functional assay evidence use by ClinGen Variant Curation Expert Panels
The 2015 American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) guidelines for clinical sequence variant interpretation state that “well-established” functional studies can be used as evidence in variant classification.
These guidelines articulated key attributes of functional data, including that assays should reflect the biological environment and be analytically sound; however, details of how to evaluate these attributes were left to expert judgment. The Clinical Genome Resource (ClinGen) designates Variant Curation Expert Panels (VCEPs) in specific disease areas to make gene-centric specifications to the ACMG/AMP guidelines, including more specific definitions of appropriate functional assays. We set out to evaluate the existing VCEP guidelines for functional assays. Reference
Genomic and immune profiling of pre-invasive lung adenocarcinoma
Adenocarcinoma in situ and minimally invasive adenocarcinoma are the pre-invasive forms of lung adenocarcinoma. The genomic and immune profiles of these lesions are poorly understood. Here we report exome and transcriptome sequencing of 98 lung adenocarcinoma precursor lesions and 99 invasive adenocarcinomas.
We have identified EGFR, RBM10, BRAF, ERBB2, TP53, KRAS, MAP2K1 and MET as significantly mutated genes in the pre/minimally invasive group. Classes of genome alterations that increase in frequency during the progression to malignancy are revealed. These include mutations in TP53, arm-level copy number alterations, and HLA loss of heterozygosity. Immune infiltration is correlated with copy number alterations of chromosome arm 6p, suggesting a link between arm-level events and the tumor immune environment. Reference
Trans-splicing of mRNAs links gene transcription to translational control regulated by mTOR
In phylogenetically diverse organisms, the 5′ ends of a subset of mRNAs are trans-spliced with a spliced leader (SL) RNA. The functions of SL trans-splicing, however, remain largely enigmatic.
We quantified translation genome-wide in the marine chordate, Oikopleura dioica, under inhibition of mTOR, a central growth regulator. Translation of trans-spliced TOP mRNAs was suppressed, consistent with a role of the SL sequence in nutrient-dependent translational control of growth-related mRNAs. Under crowded, nutrient-limiting conditions, O. dioica continued to filter-feed, but arrested growth until favorable conditions returned. Upon release from unfavorable conditions, initial recovery was independent of nutrient-responsive, trans-spliced genes, suggesting animal density sensing as a first trigger for resumption of development. Reference
Immune receptor repertoires in pediatric and adult acute myeloid leukemia
Acute myeloid leukemia (AML), caused by the abnormal proliferation of immature myeloid cells in the blood or bone marrow, is one of the most common hematologic malignancies.
Currently, the interactions between malignant myeloid cells and the immune microenvironment, especially T cells and B cells, remain poorly characterized. In this study, we systematically analyzed the T cell receptor and B cell receptor (TCR and BCR) repertoires from the RNA-seq data of 145 pediatric and 151 adult AML samples as well as 73 non-tumor peripheral blood samples. Reference
DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput
We present an easy-to-use integrated software suite, DIA-NN, that exploits deep neural networks and new quantification and signal correction strategies for the processing of data-independent acquisition (DIA) proteomics experiments.
DIA-NN improves the identification and quantification performance in conventional DIA proteomic applications, and is particularly beneficial for high-throughput applications, as it is fast and enables deep and confident proteome coverage when used in combination with fast chromatographic methods. Reference
MIA-Sig: multiplex chromatin interaction analysis by signal processing and statistical algorithms
Single-molecule multiplex chromatin interaction data are generated by emerging 3D genome mapping technologies such as GAM, SPRITE, and ChIA-Drop. These datasets provide insights into high-dimensional chromatin organization, yet introduce new computational challenges.
Thus, we developed MIA-Sig, an algorithmic solution based on signal processing and information theory. We demonstrate its ability to de-noise the multiplex data, assess the statistical significance of chromatin complexes, and identify topological domains and frequent inter-domain contacts. On chromatin immunoprecipitation (ChIP)-enriched data, MIA-Sig can clearly distinguish the protein-associated interactions from the non-specific topological domains. Together, MIA-Sig represents a novel algorithmic framework for multiplex chromatin interaction analysis. Reference
Identification of cancer sex-disparity in the functional integrity of p53 and its X chromosome network
The disproportionately high prevalence of male cancer is poorly understood. We tested for sex-disparity in the functional integrity of the major tumor suppressor p53 in sporadic cancers. Our bioinformatics analyses expose three novel levels of p53 impact on sex-disparity in 12 non-reproductive cancer types.
First, TP53 mutation is more frequent in these cancers among US males than females, with poorest survival correlating with its mutation. Second, numerous X-linked genes are associated with p53, including vital genomic regulators. Males are at unique risk from alterations of their single copies of these genes. High expression of X-linked negative regulators of p53 in wild-type TP53 cancers corresponds with reduced survival. Reference
Contrasting the impact of cytotoxic and cytostatic drug therapies on tumour progression
A tumour grows when the total division (birth) rate of its cells exceeds their total mortality (death) rate. The capability for uncontrolled growth within the host tissue is acquired via the accumulation of driver mutations which enable the tumour to progress through various hallmarks of cancer.
We present a mathematical model of the penultimate stage in such a progression. We assume the tumour has reached the limit of its present growth potential due to cell competition that either results in total birth rate reduction or death rate increase. The tumour can then progress to the final stage by either seeding a metastasis or acquiring a driver mutation. We influence the ensuing evolutionary dynamics by cytotoxic (increasing death rate) or cytostatic (decreasing birth rate) therapy while keeping the effect of the therapy on net growth reduction constant. Comparing the treatments head to head we derive conditions for choosing optimal therapy. Reference
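The comparison described above, in which both therapies are constrained to give the same reduction in net growth, can be illustrated with a few lines of arithmetic. The Python sketch below is not the authors' model: the function names, the per-cell rate parameterization, and the use of turnover (birth plus death rate) as a summary quantity are assumptions made purely for illustration.

```python
def apply_therapy(birth: float, death: float, reduction: float, kind: str):
    """Return (birth, death) rates after therapy.

    `reduction` is the fixed decrease in net growth (birth - death) that
    both therapy types are constrained to achieve (hypothetical parameter).
    """
    if reduction < 0:
        raise ValueError("reduction must be non-negative")
    if kind == "cytotoxic":
        # Cytotoxic therapy raises the death rate.
        return birth, death + reduction
    if kind == "cytostatic":
        # Cytostatic therapy lowers the birth rate.
        if reduction > birth:
            raise ValueError("birth rate cannot become negative")
        return birth - reduction, death
    raise ValueError(f"unknown therapy kind: {kind!r}")


def net_growth(birth: float, death: float) -> float:
    return birth - death


def turnover(birth: float, death: float) -> float:
    return birth + death


if __name__ == "__main__":
    # A tumour at its growth limit: births balance deaths (illustrative values).
    for kind in ("cytotoxic", "cytostatic"):
        b, d = apply_therapy(1.0, 1.0, 0.25, kind)
        print(kind, "net growth:", net_growth(b, d), "turnover:", turnover(b, d))
```

Both therapies produce the same net growth, but the cytostatic option leaves fewer cell divisions per unit time. If one further assumes that driver mutations arise at division, which is an assumption of this illustration and not a statement from the abstract, the two treatments would differ in how quickly new mutations are supplied even though their effect on tumour size is identical.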