Machine learning can identify newly diagnosed patients with CLL at high risk of infection
Infections have become the major cause of morbidity and mortality among patients with chronic lymphocytic leukemia (CLL) due to immune dysfunction and cytotoxic CLL treatment.
Yet, predictive models for infection are missing. In this work, we develop the CLL Treatment-Infection Model (CLL-TIM) that identifies patients at risk of infection or CLL treatment within 2 years of diagnosis as validated on both internal and external cohorts. CLL-TIM is an ensemble algorithm composed of 28 machine learning algorithms based on data from 4,149 patients with CLL. The model is capable of dealing with heterogeneous data, including the high rates of missing data to be expected in the real-world setting, with a precision of 72% and a recall of 75%. To address concerns regarding the use of complex machine learning algorithms in the clinic, for each patient with CLL, CLL-TIM provides explainable predictions through uncertainty estimates and personalized risk factors. Reference
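As a reading aid for the reported precision and recall, the two figures can be combined into a single F1 score (their harmonic mean). This is a minimal sketch: the function name `f1_score` is illustrative, and F1 is not a metric stated in the abstract itself.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for CLL-TIM in the abstract.
cll_tim_f1 = f1_score(0.72, 0.75)
print(f"F1 = {cll_tim_f1:.3f}")  # 2 * 0.72 * 0.75 / 1.47
```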
DENDRO: genetic heterogeneity profiling and subclone detection by single-cell RNA sequencing
Although scRNA-seq is now ubiquitously adopted in studies of intratumor heterogeneity, detection of somatic mutations and inference of clonal membership from scRNA-seq is currently unreliable.
We propose DENDRO, an analysis method for scRNA-seq data that clusters single cells into genetically distinct subclones and reconstructs the phylogenetic tree relating the subclones. DENDRO utilizes transcribed point mutations and accounts for technical noise and expression stochasticity. We benchmark DENDRO and demonstrate its application on simulation data and real data from three cancer types. In particular, on a mouse melanoma model in response to immunotherapy, DENDRO delineates the role of neoantigens in treatment response. Reference
Transcription phenotypes of pancreatic cancer are driven by genomic events during tumor evolution
Pancreatic adenocarcinoma presents as a spectrum of a highly aggressive disease in patients. The basis of this disease heterogeneity has proved difficult to resolve due to poor tumor cellularity and extensive genomic instability.
To address this, a dataset of whole genomes and transcriptomes was generated from purified epithelium of primary and metastatic tumors. Transcriptome analysis demonstrated that molecular subtypes are a product of a gene expression continuum driven by a mixture of intratumoral subpopulations, which was confirmed by single-cell analysis. Integrated whole-genome analysis uncovered that molecular subtypes are linked to specific copy number aberrations in genes such as mutant KRAS and GATA6. By mapping tumor genetic histories, tetraploidization emerged as a key mutational process behind these events. Reference
A transcriptome-wide Mendelian randomization study to uncover tissue-dependent regulatory mechanisms across the human phenome
Developing insight into tissue-specific transcriptional mechanisms can help improve our understanding of how genetic variants exert their effects on complex traits and disease.
In this study, we apply the principles of Mendelian randomization to systematically evaluate transcriptome-wide associations between gene expression (across 48 different tissue types) and 395 complex traits. Our findings indicate that variants which influence gene expression levels in multiple tissues are more likely to influence multiple complex traits. Moreover, detailed investigations of our results highlight tissue-specific associations, drug validation opportunities, insight into the likely causal pathways for trait-associated variants and also implicate putative associations at loci yet to be implicated in disease susceptibility. Reference
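To make the Mendelian randomization principle concrete, the sketch below computes the standard single-instrument Wald ratio: the variant-trait effect divided by the variant-expression effect. It is an illustration only. The abstract does not describe the study's actual pipeline, the function name `wald_ratio` is hypothetical, the input numbers are invented, and the standard error uses a first-order approximation that ignores uncertainty in the variant-expression estimate.

```python
from typing import NamedTuple


class WaldEstimate(NamedTuple):
    effect: float  # estimated causal effect of expression on the trait
    se: float      # first-order approximate standard error


def wald_ratio(beta_gx: float, beta_gy: float, se_gy: float) -> WaldEstimate:
    """Wald ratio for one genetic instrument.

    beta_gx: effect of the variant on gene expression (e.g. from an eQTL study)
    beta_gy: effect of the same variant on the complex trait (e.g. from a GWAS)
    se_gy:   standard error of beta_gy
    """
    if beta_gx == 0:
        raise ValueError("The instrument must be associated with expression.")
    return WaldEstimate(effect=beta_gy / beta_gx, se=se_gy / abs(beta_gx))


# Hypothetical summary statistics, for illustration only.
estimate = wald_ratio(beta_gx=0.5, beta_gy=0.1, se_gy=0.02)
print(estimate)
```

In a transcriptome-wide setting, an estimate of this kind would be computed for each gene, tissue, and trait combination, which is why multiple-testing correction matters in such analyses.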
Chromatin interactome mapping at 139 independent breast cancer risk signals
Genome-wide association studies have identified 196 high confidence independent signals associated with breast cancer susceptibility. Variants within these signals frequently fall in distal regulatory DNA elements that control gene expression.
We designed a Capture Hi-C array to enrich for chromatin interactions between the credible causal variants and target genes in six human mammary epithelial and breast cancer cell lines. We show that interacting regions are enriched for open chromatin, histone marks for active enhancers, and transcription factors relevant to breast biology. We exploit this comprehensive resource to identify candidate target genes at 139 independent breast cancer risk signals and explore the functional mechanism underlying altered risk at the 12q24 risk region. Reference
Single-cell analysis based dissection of clonality in myelofibrosis
Cancer development is an evolutionary genomic process with parallels to Darwinian selection. It requires acquisition of multiple somatic mutations that collectively cause a malignant phenotype and continuous clonal evolution is often linked to tumor progression.
Here, we show the clonal evolution structure in 15 myelofibrosis (MF) patients while receiving treatment with JAK inhibitors (mean follow-up 3.9 years). Whole-exome sequencing at multiple time points reveals acquisition of somatic mutations and copy number aberrations over time. While JAK inhibition therapy does not seem to create a clear evolutionary bottleneck, we observe a more complex clonal architecture over time, and appearance of unrelated clones. Disease progression associates with increased genetic heterogeneity and gain of RAS/RTK pathway mutations. Clonal diversity results in clone-specific expansion within different myeloid cell lineages. Single-cell genotyping of circulating CD34+ progenitor cells allows the reconstruction of MF phylogeny demonstrating loss of heterozygosity and parallel evolution as recurrent events. Reference
Epigenetics meets proteomics in an epigenome-wide association study with circulating blood plasma protein traits
DNA methylation and blood circulating proteins have been associated with many complex disorders, but the underlying disease-causing mechanisms often remain unclear. Here, we report an epigenome-wide association study of 1123 proteins from 944 participants of the KORA population study and replication in a multi-ethnic cohort of 344 individuals.
We identify 98 CpG-protein associations (pQTMs) at a stringent Bonferroni level of significance. Overlapping associations with transcriptomics, metabolomics, and clinical endpoints suggest implication of processes related to chronic low-grade inflammation, including a network involving methylation of NLRC5, a regulator of the inflammasome, and associated pQTMs implicating key proteins of the immune system, such as CD48, CD163, CXCL10, CXCL11, LAG3, FCGR3B, and B2M. Our study links DNA methylation to disease endpoints via intermediate proteomics phenotypes and identifies correlative networks that may eventually be targeted in a personalized approach of chronic low-grade inflammation. Reference
In vivo functional analysis of non-conserved human lncRNAs associated with cardiometabolic traits
Unlike protein-coding genes, the majority of human long non-coding RNAs (lncRNAs) are considered non-conserved. Although lncRNAs have been shown to function in diverse pathophysiological processes in mice, it remains largely unknown whether human lncRNAs have such in vivo functions. Here, we describe an integrated pipeline to define the in vivo function of non-conserved human lncRNAs.
We first identify lncRNAs with high function potential using multiple indicators derived from human genetic data related to cardiometabolic traits, then define lncRNA’s function and specific target genes by integrating its correlated biological pathways in humans and co-regulated genes in a humanized mouse model. Finally, we demonstrate that the in vivo function of human-specific lncRNAs can be successfully examined in the humanized mouse model, and experimentally validate the predicted function of an obesity-associated lncRNA, LINC01018, in regulating the expression of genes in fatty acid oxidation in humanized livers through its interaction with RNA-binding protein HuR. Reference
scRNA-seq assessment of the human lung, spleen, and esophagus tissue stability after cold preservation
The Human Cell Atlas is a large international collaborative effort to map all cell types of the human body. Single-cell RNA sequencing can generate high-quality data for the delivery of such an atlas. However, delays between fresh sample collection and processing may lead to poor data and difficulties in experimental design.
This study assesses the effect of cold storage on fresh healthy spleen, esophagus, and lung from ≥ 5 donors over 72 h. We collect 240,000 high-quality single-cell transcriptomes with detailed cell type annotations and whole genome sequences of donors, enabling future eQTL studies. Our data provide a valuable resource for the study of these 3 organs and will allow cross-organ comparison of cell types. Reference
Identifying cross-disease components of genetic risk across hospital data in the UK Biobank
Genetic risk factors frequently affect multiple common human diseases, providing insight into shared pathophysiological pathways and opportunities for therapeutic development. However, systematic identification of genetic profiles of disease risk is limited both by the availability of comprehensive clinical data on population-scale cohorts and by the lack of suitable statistical methodology that can handle the scale of and differential power inherent in multi-phenotype data.
Here, we develop a disease-agnostic approach to cluster the genetic risk profiles for 3,025 genome-wide independent loci across 19,155 disease classification codes from 320,644 participants in the UK Biobank, representing a large and heterogeneous population. We identify 339 distinct disease association profiles and use multiple approaches to link clusters to the underlying biological pathways. Reference
The somatic mutation landscape of the human body
Somatic mutations in healthy tissues contribute to aging, neurodegeneration, and cancer initiation, yet they remain largely uncharacterized. To gain a better understanding of the genome-wide distribution and functional impact of somatic mutations, we leverage the genomic information contained in the transcriptome to uniformly call somatic mutations from over 7500 tissue samples, representing 36 distinct tissues.
This catalog, containing over 280,000 mutations, reveals a wide diversity of tissue-specific mutation profiles associated with gene expression levels and chromatin states. For example, lung samples with low expression of the mismatch-repair gene MLH1 show a mutation signature of deficient mismatch repair. In addition, we find pervasive negative selection acting on missense and nonsense mutations, except for mutations previously observed in cancer samples, which are under positive selection and are highly enriched in many healthy tissues. Reference
Knowledge Base Commons (KBCommons) v1.1: a universal framework for multi-omics data integration and biological discoveries
Knowledge Base Commons (KBCommons) v1.1 is a universal and all-inclusive web-based framework providing generic functionalities for storing, sharing, analyzing, exploring, integrating and visualizing multiple organisms’ genomics and integrative omics data. KBCommons is designed and developed to integrate diverse multi-level omics data and to support biological discoveries for all species via a common platform.
KBCommons has four modules including data storage, data processing, data accessing, and web interface for data management and retrieval. It provides a comprehensive framework for new plant-specific, animal-specific, virus-specific, bacteria-specific or human disease-specific knowledge base (KB) creation, for adding new genome versions and additional multi-omics data to existing KBs, and for exploring existing datasets within current KBs. Reference
Genomic risk score offers predictive performance comparable to clinical risk factors for ischaemic stroke
Recent genome-wide association studies in stroke have enabled the generation of genomic risk scores (GRS) but their predictive power has been modest compared to established stroke risk factors.
Here, using a meta-scoring approach, we develop a metaGRS for ischaemic stroke (IS) and analyse this score in the UK Biobank (n = 395,393; 3075 IS events by age 75). The metaGRS hazard ratio for IS (1.26, 95% CI 1.22–1.31 per metaGRS standard deviation) doubles that of a previous GRS, identifying a subset of individuals at monogenic levels of risk: the top 0.25% of metaGRS have three-fold risk of IS. The metaGRS is similarly or more predictive compared to several risk factors, such as family history, blood pressure, body mass index, and smoking. Reference
Uterine adenomyosis is an oligoclonal disorder associated with KRAS mutations
Uterine adenomyosis is a benign disorder that often co-occurs with endometriosis and/or leiomyoma, and impairs quality of life. The genomic features of adenomyosis are unknown. Here we apply next-generation sequencing to adenomyosis (70 individuals and 192 multi-regional samples), as well as co-occurring leiomyoma and endometriosis, and find recurring KRAS mutations in 26/70 (37.1%) of adenomyosis cases.
Multi-regional sequencing reveals oligoclonality in adenomyosis, with some mutations also detected in normal endometrium and/or co-occurring endometriosis. KRAS mutations are more frequent in cases of adenomyosis with co-occurring endometriosis, low progesterone receptor (PR) expression, or progestin (dienogest; DNG) pretreatment. DNG’s anti-proliferative effect is diminished via epigenetic silencing of PR in immortalized cells with mutant KRAS. Reference
SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies
Genomic differences range from single nucleotide differences to complex structural variations. Current methods typically annotate sequence differences ranging from SNPs to large indels accurately but do not unravel the full complexity of structural rearrangements, including inversions, translocations, and duplications, where a highly similar sequence changes in location, orientation, or copy number.
Here, we present SyRI, a pairwise whole-genome comparison tool for chromosome-level assemblies. SyRI starts by finding rearranged regions and then searches for differences in the sequences, which are distinguished by whether they reside in syntenic or rearranged regions. This distinction is important as rearranged regions are inherited differently compared to syntenic regions. Reference
Recurrent PTPRT/JAK2 mutations in lung adenocarcinoma among African Americans
Reducing or eliminating persistent disparities in lung cancer incidence and survival has been challenging because our current understanding of lung cancer biology is derived primarily from populations of European descent.
Here we show results from a targeted sequencing panel using NCI-MD Case Control Study patient samples and reveal a significantly higher prevalence of PTPRT and JAK2 mutations in lung adenocarcinomas among African Americans compared with European Americans. This increase in mutation frequency was validated with independent WES data from the NCI-MD Case Control Study and TCGA. We find that patients carrying these mutations have a concomitant increase in IL-6/STAT3 signaling and miR-21 expression. Reference
Vireo: Bayesian demultiplexing of pooled single-cell RNA-seq data without genotype reference
Multiplexed single-cell RNA-seq analysis of multiple samples using pooling is a promising experimental design, offering increased throughput while helping to overcome batch variation.
To reconstruct the sample identity of each cell, genetic variants that segregate between the samples in the pool have been proposed as natural barcodes for cell demultiplexing. Existing demultiplexing strategies rely on availability of complete genotype data from the pooled samples, which limits the applicability of such methods, in particular when genetic variation is not the primary object of study. To address this, we here present Vireo, a computationally efficient Bayesian model to demultiplex single-cell data from pooled experimental designs. Uniquely, our model can be applied in settings when only partial or no genotype information is available. Using pools based on synthetic mixtures and results on real data, we demonstrate the robustness of Vireo and illustrate the utility of multiplexed experimental designs for common expression analyses. Reference
Obesity and disease severity magnify disturbed microbiome-immune interactions in asthma patients
In order to improve targeted therapeutic approaches for asthma patients, insights into the molecular mechanisms that differentially contribute to disease phenotypes, such as obese asthmatics or severe asthmatics, are required.
Here we report immunological and microbiome alterations in obese asthmatics (n = 50, mean age = 45), non-obese asthmatics (n = 53, mean age = 40), obese non-asthmatics (n = 51, mean age = 44) and their healthy counterparts (n = 48, mean age = 39). Obesity is associated with elevated proinflammatory signatures, which are enhanced in the presence of asthma. Similarly, obesity or asthma induced changes in the composition of the microbiota, while an additive effect is observed in obese asthma patients. Asthma disease severity is negatively correlated with fecal Akkermansia muciniphila levels. Reference
Text-mining clinically relevant cancer biomarkers for curation into the CIViC database
Precision oncology involves analysis of individual cancer samples to understand the genes and pathways involved in the development and progression of a cancer.
To improve patient care, knowledge of diagnostic, prognostic, predisposing, and drug response markers is essential. Several knowledgebases have been created by different groups to collate evidence for these associations. These include the open-access Clinical Interpretation of Variants in Cancer (CIViC) knowledgebase. These databases rely on time-consuming manual curation from skilled experts who read and interpret the relevant biomedical literature. Reference
Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis
Dimensionality reduction is an indispensable analytic component for many areas of single-cell RNA sequencing (scRNA-seq) data analysis. Proper dimensionality reduction can allow for effective noise removal and facilitate many downstream analyses that include cell clustering and lineage reconstruction.
Unfortunately, despite the critical importance of dimensionality reduction in scRNA-seq analysis and the vast number of dimensionality reduction methods developed for scRNA-seq studies, few comprehensive comparison studies have been performed to evaluate the effectiveness of different dimensionality reduction methods in scRNA-seq. Reference
Transcriptomic analysis of human primary breast cancer identifies fatty acid oxidation as a target for metformin
Epidemiological studies suggest that metformin may reduce the incidence of cancer in patients with diabetes and multiple late phase clinical trials assessing the potential of repurposing this drug are underway.
Transcriptomic profiling of tumour samples is an excellent tool to understand drug bioactivity, identify candidate biomarkers and assess for mechanisms of resistance to therapy. Thirty-six patients with untreated primary breast cancer were recruited to a window study and transcriptomic profiling of tumour samples carried out before and after metformin treatment. Reference
Dashing: fast and accurate genomic distances with HyperLogLog
Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets. It uses the HyperLogLog sketch together with cardinality estimation methods that are specialized for set unions and intersections.
Dashing summarizes genomes more rapidly than previous MinHash-based methods while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in 6 minutes. Reference
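The idea of sketching k-mer sets with HyperLogLog and comparing them through set unions can be illustrated with a minimal sketch. This is not Dashing's implementation: it assumes the classic HyperLogLog estimator with a linear-counting correction for small sets and a plain inclusion-exclusion estimate of the intersection, whereas Dashing uses specialized cardinality estimators. All class and function names here are hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash: the first p bits pick a register, the rest give the rank.
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit among the remaining 64 - p bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other: "HyperLogLog") -> "HyperLogLog":
        # The sketch of a set union is the register-wise maximum.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # linear counting for small sets
        return raw


def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(items, p: int = 12) -> HyperLogLog:
    hll = HyperLogLog(p)
    for item in items:
        hll.add(item)
    return hll


def jaccard_estimate(a: HyperLogLog, b: HyperLogLog) -> float:
    """Estimate |A ∩ B| / |A ∪ B| by inclusion-exclusion on sketch cardinalities."""
    union_size = a.union(b).cardinality()
    if union_size == 0:
        return 0.0
    intersection = max(0.0, a.cardinality() + b.cardinality() - union_size)
    return intersection / union_size


# Toy usage on two short sequences (real inputs would be whole genomes).
sk_a = sketch(kmers("ACGTACGTTAGCCGATAGGCTA", 5))
sk_b = sketch(kmers("ACGTACGTTAGCCGTTAGGCAA", 5))
print(jaccard_estimate(sk_a, sk_b))
```

Because each sketch has a fixed size (2**p registers) regardless of genome length, and a union is a register-wise maximum, pairwise comparison cost does not grow with input size; that property is what makes all-pairs comparison of large collections feasible.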
In silico prediction of high-resolution Hi-C interaction matrices
The three-dimensional (3D) organization of the genome plays an important role in gene regulation bringing distal sequence elements in 3D proximity to genes hundreds of kilobases away. Hi-C is a powerful genome-wide technique to study 3D genome organization. Owing to experimental costs, high resolution Hi-C datasets are limited to a few cell lines.
Computational prediction of Hi-C counts can offer a scalable and inexpensive approach to examine 3D genome organization across multiple cellular contexts. Here we present HiC-Reg, an approach to predict contact counts from one-dimensional regulatory signals. HiC-Reg predictions identify topologically associating domains and significant interactions that are enriched for CCCTC-binding factor (CTCF) bidirectional motifs and interactions identified from complementary sources. Reference
Genomic and transcriptomic insights into molecular basis of sexually dimorphic nuptial spines in Leptobrachium leishanense
Identification of genetic biomarkers associated with autism spectrum disorders (ASDs) could improve recurrence prediction for families with a child with ASD. Here, we describe clinical microarray findings for 253 longitudinally phenotyped ASD families from the Baby Siblings Research Consortium (BSRC), encompassing 288 infant siblings.
By age 3, 103 siblings (35.8%) were diagnosed with ASD and 54 (18.8%) were developing atypically. Thirteen siblings have copy number variants (CNVs) involving ASD-relevant genes: 6 with ASD, 5 atypically developing, and 2 typically developing. Within these families, an ASD-related CNV in a sibling has a positive predictive value (PPV) for ASD or atypical development of 0.83; the Simons Simplex Collection of ASD families shows similar PPVs. Polygenic risk analyses suggest that common genetic variants may also contribute to ASD. CNV findings would have been pre-symptomatically predictive of ASD or atypical development in 11 (7%) of the 157 BSRC siblings who were eventually diagnosed clinically. Reference
Orchestrating single-cell analysis with Bioconductor
Recent technological advancements have enabled the profiling of a large number of genome-wide features in individual cells. However, single-cell data present unique challenges that require the development of specialized methods and software infrastructure to successfully derive biological insights.
The Bioconductor project has rapidly grown to meet these demands, hosting community-developed open-source software distributed as R packages and featuring state-of-the-art computational methods, standardized data infrastructure and interactive data visualization tools. Reference
ReorientExpress: reference-free orientation of nanopore cDNA reads with deep learning
We describe ReorientExpress, a method to perform reference-free orientation of transcriptomic long sequencing reads.
ReorientExpress uses deep learning to correctly predict the orientation of the majority of reads, particularly when trained on a closely related species or in combination with read clustering. ReorientExpress enables long-read transcriptomics in non-model organisms and samples without a genome reference without using additional technologies. Reference
Comparative analysis of functional assay evidence use by ClinGen Variant Curation Expert Panels
The 2015 American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) guidelines for clinical sequence variant interpretation state that “well-established” functional studies can be used as evidence in variant classification.
These guidelines articulated key attributes of functional data, including that assays should reflect the biological environment and be analytically sound; however, details of how to evaluate these attributes were left to expert judgment. The Clinical Genome Resource (ClinGen) designates Variant Curation Expert Panels (VCEPs) in specific disease areas to make gene-centric specifications to the ACMG/AMP guidelines, including more specific definitions of appropriate functional assays. We set out to evaluate the existing VCEP guidelines for functional assays. Reference
Genomic and immune profiling of pre-invasive lung adenocarcinoma
Adenocarcinoma in situ and minimally invasive adenocarcinoma are the pre-invasive forms of lung adenocarcinoma. The genomic and immune profiles of these lesions are poorly understood. Here we report exome and transcriptome sequencing of 98 lung adenocarcinoma precursor lesions and 99 invasive adenocarcinomas.
We have identified EGFR, RBM10, BRAF, ERBB2, TP53, KRAS, MAP2K1 and MET as significantly mutated genes in the pre/minimally invasive group. Classes of genome alterations that increase in frequency during the progression to malignancy are revealed. These include mutations in TP53, arm-level copy number alterations, and HLA loss of heterozygosity. Immune infiltration is correlated with copy number alterations of chromosome arm 6p, suggesting a link between arm-level events and the tumor immune environment. Reference
Trans-splicing of mRNAs links gene transcription to translational control regulated by mTOR
In phylogenetically diverse organisms, the 5′ ends of a subset of mRNAs are trans-spliced with a spliced leader (SL) RNA. The functions of SL trans-splicing, however, remain largely enigmatic.
We quantified translation genome-wide in the marine chordate, Oikopleura dioica, under inhibition of mTOR, a central growth regulator. Translation of trans-spliced TOP mRNAs was suppressed, consistent with a role of the SL sequence in nutrient-dependent translational control of growth-related mRNAs. Under crowded, nutrient-limiting conditions, O. dioica continued to filter-feed, but arrested growth until favorable conditions returned. Upon release from unfavorable conditions, initial recovery was independent of nutrient-responsive, trans-spliced genes, suggesting animal density sensing as a first trigger for resumption of development. Reference
Immune receptor repertoires in pediatric and adult acute myeloid leukemia
Acute myeloid leukemia (AML), caused by the abnormal proliferation of immature myeloid cells in the blood or bone marrow, is one of the most common hematologic malignancies.
Currently, the interactions between malignant myeloid cells and the immune microenvironment, especially T cells and B cells, remain poorly characterized. In this study, we systematically analyzed the T cell receptor and B cell receptor (TCR and BCR) repertoires from the RNA-seq data of 145 pediatric and 151 adult AML samples as well as 73 non-tumor peripheral blood samples. Reference
DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput
We present an easy-to-use integrated software suite, DIA-NN, that exploits deep neural networks and new quantification and signal correction strategies for the processing of data-independent acquisition (DIA) proteomics experiments.
DIA-NN improves the identification and quantification performance in conventional DIA proteomic applications, and is particularly beneficial for high-throughput applications, as it is fast and enables deep and confident proteome coverage when used in combination with fast chromatographic methods. Reference
MIA-Sig: multiplex chromatin interaction analysis by signal processing and statistical algorithms
The single-molecule multiplex chromatin interaction data are generated by emerging 3D genome mapping technologies such as GAM, SPRITE, and ChIA-Drop. These datasets provide insights into high-dimensional chromatin organization, yet introduce new computational challenges.
Thus, we developed MIA-Sig, an algorithmic solution based on signal processing and information theory. We demonstrate its ability to de-noise the multiplex data, assess the statistical significance of chromatin complexes, and identify topological domains and frequent inter-domain contacts. On chromatin immunoprecipitation (ChIP)-enriched data, MIA-Sig can clearly distinguish the protein-associated interactions from the non-specific topological domains. Together, MIA-Sig represents a novel algorithmic framework for multiplex chromatin interaction analysis. Reference
Identification of cancer sex-disparity in the functional integrity of p53 and its X chromosome network
The disproportionately high prevalence of male cancer is poorly understood. We tested for sex-disparity in the functional integrity of the major tumor suppressor p53 in sporadic cancers. Our bioinformatics analyses expose three novel levels of p53 impact on sex-disparity in 12 non-reproductive cancer types.
First, TP53 mutation is more frequent in these cancers among US males than females, with poorest survival correlating with its mutation. Second, numerous X-linked genes are associated with p53, including vital genomic regulators. Males are at unique risk from alterations of their single copies of these genes. High expression of X-linked negative regulators of p53 in wild-type TP53 cancers corresponds with reduced survival. Reference
Contrasting the impact of cytotoxic and cytostatic drug therapies on tumour progression
A tumour grows when the total division (birth) rate of its cells exceeds their total mortality (death) rate. The capability for uncontrolled growth within the host tissue is acquired via the accumulation of driver mutations which enable the tumour to progress through various hallmarks of cancer.
We present a mathematical model of the penultimate stage in such a progression. We assume the tumour has reached the limit of its present growth potential due to cell competition that either results in total birth rate reduction or death rate increase. The tumour can then progress to the final stage by either seeding a metastasis or acquiring a driver mutation. We influence the ensuing evolutionary dynamics by cytotoxic (increasing death rate) or cytostatic (decreasing birth rate) therapy while keeping the effect of the therapy on net growth reduction constant. Comparing the treatments head to head we derive conditions for choosing optimal therapy. Reference
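The comparison at constant net growth reduction can be written out as a short worked equation. This is a sketch under assumed notation, not the paper's model: $B$ and $D$ denote the tumour's total birth and death rates, $G$ its net growth rate, and $\Delta > 0$ the reduction in net growth achieved by either therapy.

```latex
\begin{align}
G &= B - D, \\
G_{\text{cytotoxic}} &= B - (D + \Delta) = G - \Delta, \\
G_{\text{cytostatic}} &= (B - \Delta) - D = G - \Delta.
\end{align}
```

Both therapies lower net growth by the same $\Delta$, but they leave different turnover behind: the birth rate stays at $B$ under cytotoxic therapy and falls to $B - \Delta$ under cytostatic therapy, while the death rate rises to $D + \Delta$ or stays at $D$, respectively.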
Causative role of PDLIM2 epigenetic repression in lung cancer and therapeutic resistance
Most cancers are resistant to anti-PD-1/PD-L1 and chemotherapy. Herein we identify PDLIM2 as a tumor suppressor particularly important for lung cancer therapeutic responses. While PDLIM2 is epigenetically repressed in human lung cancer, associating with therapeutic resistance and poor prognosis, its global or lung epithelial-specific deletion in mice causes increased lung cancer development, chemoresistance, and complete resistance to anti-PD-1 and epigenetic drugs.
PDLIM2 epigenetic restoration or ectopic expression shows antitumor activity, and synergizes with anti-PD-1, notably, with chemotherapy for complete remission of most lung cancers. Mechanistically, through repressing NF-κB/RelA and STAT3, PDLIM2 increases expression of genes involved in antigen presentation and T-cell activation while repressing multidrug resistance genes and cancer-related genes, thereby rendering cancer cells vulnerable to immune attacks and therapies. Reference
EpiMethylTag: simultaneous detection of ATAC-seq or ChIP-seq signals with DNA methylation
Activation of regulatory elements is thought to be inversely correlated with DNA methylation levels. However, it is difficult to determine whether DNA methylation is compatible with chromatin accessibility or transcription factor (TF) binding if assays are performed separately.
We developed a fast, low-input, low sequencing depth method, EpiMethylTag, that combines ATAC-seq or ChIP-seq (M-ATAC or M-ChIP) with bisulfite conversion, to simultaneously examine accessibility/TF binding and methylation on the same DNA. Here we demonstrate that EpiMethylTag can be used to study the functional interplay between chromatin accessibility and TF binding (CTCF and KLF4) at methylated sites. Reference
The mutational footprints of cancer therapies
Some cancer therapies damage DNA and cause mutations in both cancerous and healthy cells. Therapy-induced mutations may underlie some of the long-term and late side effects of treatments, such as mental disabilities, organ toxicity and secondary neoplasms. Nevertheless, the burden of mutation contributed by different chemotherapies has not been explored.
Here we identify the mutational signatures or footprints of six widely used anticancer therapies across more than 3,500 metastatic tumors originating from different organs. These include previously known and new mutational signatures generated by platinum-based drugs as well as a previously unknown signature of nucleoside metabolic inhibitors. Reference
A Bayesian machine learning approach for drug target identification using diverse data types
Drug target identification is a crucial step in development, yet is also among the most complex. To address this, we develop BANDIT, a Bayesian machine-learning approach that integrates multiple data types to predict drug binding targets. Integrating public data, BANDIT achieved ~90% accuracy on 2000+ small molecules.
Applied to 14,000+ compounds without known targets, BANDIT generated ~4,000 previously unknown molecule-target predictions. From this set we validate 14 novel microtubule inhibitors, including 3 with activity on resistant cancer cells. We applied BANDIT to ONC201—an anti-cancer compound in clinical development whose target had remained elusive. We identified and validated DRD2 as ONC201’s target, and this information is now being used for precise clinical trial design. Reference
Novel risk genes and mechanisms implicated by exome sequencing of 2572 individuals with pulmonary arterial hypertension
Group 1 pulmonary arterial hypertension (PAH) is a rare disease with high mortality despite recent therapeutic advances. Pathogenic remodeling of pulmonary arterioles leads to increased pulmonary pressures, right ventricular hypertrophy, and heart failure.
Mutations in bone morphogenetic protein receptor type 2 and other risk genes predispose to disease, but the vast majority of non-familial cases remain genetically undefined. To identify new risk genes, we performed exome sequencing in a large cohort from the National Biological Sample and Data Repository for PAH (PAH Biobank, n = 2572). We then carried out rare deleterious variant identification followed by case-control gene-based association analyses. Reference
NanoSatellite: accurate characterization of expanded tandem repeat length and sequence through whole genome long-read sequencing on PromethION
Technological limitations have hindered the large-scale genetic investigation of tandem repeats in disease. We show that long-read sequencing with a single Oxford Nanopore Technologies PromethION flow cell per individual achieves 30× human genome coverage and enables accurate assessment of tandem repeats including the 10,000-bp Alzheimer’s disease-associated ABCA7 VNTR.
The Guppy “flip-flop” base caller and tandem-genotypes tandem repeat caller are efficient for large-scale tandem repeat assessment, but base calling and alignment challenges persist. We present NanoSatellite, which analyzes tandem repeats directly on electric current data and improves calling of GC-rich tandem repeats, expanded alleles, and motif interruptions. Reference
A Bayesian mixture model for the analysis of allelic expression in single cells
Allele-specific expression (ASE) at single-cell resolution is a critical tool for understanding the stochastic and dynamic features of gene expression. However, low read coverage and high biological variability present challenges for analyzing ASE. We demonstrate that discarding multi-mapping reads leads to higher variability in estimates of allelic proportions, an increased frequency of sampling zeros, and can lead to spurious findings of dynamic and monoallelic gene expression.
Here, we report a method for ASE analysis from single-cell RNA-Seq data that accurately classifies allelic expression states and improves estimation of allelic proportions by pooling information across cells. We further demonstrate that combining information across cells using a hierarchical mixture model reduces sampling variability without sacrificing cell-to-cell heterogeneity. We applied our approach to re-evaluate the statistical independence of allelic bursting and track changes in the allele-specific expression patterns of cells sampled over a developmental time course. Reference
FreeHi-C simulates high-fidelity Hi-C data for benchmarking and data augmentation
The ability to simulate high-throughput chromatin conformation (Hi-C) data is foundational for benchmarking Hi-C data analysis methods.
Here we present a nonparametric strategy named FreeHi-C to simulate Hi-C data from the interacting genome fragments. Data from FreeHi-C exhibit high fidelity to biological Hi-C data. FreeHi-C boosts the precision and power of differential chromatin interaction detection through data augmentation under preserved false discovery rate control. Reference
OrthoFinder: phylogenetic orthology inference for comparative genomics
Here, we present a major advance of the OrthoFinder method. This extends OrthoFinder’s high accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomics statistics.
Each output is benchmarked on appropriate real or simulated datasets, and where comparable methods exist, OrthoFinder is equivalent to or outperforms these methods. Furthermore, OrthoFinder is the most accurate ortholog inference method on the Quest for Orthologs benchmark test. Finally, OrthoFinder’s comprehensive phylogenetic analysis is achieved with equivalent speed and scalability to the fastest, score-based heuristic methods. Reference
Super-enhancer-guided mapping of regulatory networks controlling mouse trophoblast stem cells
Trophectoderm (TE) lineage development is pivotal for proper implantation, placentation, and healthy pregnancy. However, only a few TE-specific transcription factors (TFs) have been systematically characterized, hindering our understanding of the process. To elucidate regulatory mechanisms underlying TE development, here we map super-enhancers (SEs) in trophoblast stem cells (TSCs) as a model.
We find both prominent TE-specific master TFs (Cdx2, Gata3, and Tead4), and >150 TFs that had not been previously implicated in TE lineage, that are SE-associated. Mapping targets of 27 SE-predicted TFs reveals a highly intertwined transcriptional regulatory circuitry. Intriguingly, SE-predicted TFs show 4 distinct expression patterns with dynamic alterations of their targets during TSC differentiation. Reference
Global impact of somatic structural variation on the DNA methylome of human cancers
Genomic rearrangements exert a heavy influence on the molecular landscape of cancer. New analytical approaches integrating somatic structural variants (SSVs) with altered gene features represent a framework by which we can assign global significance to a core set of genes, analogous to established methods that identify genes non-randomly targeted by somatic mutation or copy number alteration.
While recent studies have defined broad patterns of association involving gene transcription and nearby SSV breakpoints, global alterations in DNA methylation in the context of SSVs remain largely unexplored. Reference
Mapping 123 million neonatal, infant and child deaths between 2000 and 2017
Since 2000, many countries have achieved considerable success in improving child survival, but localized progress remains unclear. To inform efforts towards United Nations Sustainable Development Goal 3.2—to end preventable child deaths by 2030—we need consistently estimated data at the subnational level regarding child mortality rates and trends.
Here we quantified, for the period 2000–2017, the subnational variation in mortality rates and number of deaths of neonates, infants and children under 5 years of age within 99 low- and middle-income countries using a geostatistical survival model. We estimated that 32% of children under 5 in these countries lived in districts that had attained rates of 25 or fewer child deaths per 1,000 live births by 2017, and that 58% of child deaths between 2000 and 2017 in these countries could have been averted in the absence of geographical inequality. Reference
Single-cell transcriptomics of human T cells reveals tissue and activation signatures in health and disease
Human T cells coordinate adaptive immunity in diverse anatomic compartments through production of cytokines and effector molecules, but it is unclear how tissue site influences T cell persistence and function.
Here, we use single cell RNA-sequencing (scRNA-seq) to define the heterogeneity of human T cells isolated from lungs, lymph nodes, bone marrow and blood, and their functional responses following stimulation. Through analysis of >50,000 resting and activated T cells, we reveal tissue T cell signatures in mucosal and lymphoid sites, and lineage-specific activation states across all sites including distinct effector states for CD8+ T cells and an interferon-response state for CD4+ T cells. Comparing scRNA-seq profiles of tumor-associated T cells to our dataset reveals predominant activated CD8+ compared to CD4+ T cell states within multiple tumor types. Reference
MetaCell: analysis of single-cell RNA-seq data using K-nn graph partitions
scRNA-seq profiles each represent a highly partial sample of mRNA molecules from a unique cell that can never be resampled, and robust analysis must separate the sampling effect from biological variance.
We describe a methodology for partitioning scRNA-seq datasets into metacells: disjoint and homogeneous groups of profiles that could have been resampled from the same cell. Unlike clustering analysis, our algorithm specializes in obtaining granular as opposed to maximal groups. We show how to use metacells as building blocks for complex quantitative transcriptional maps while avoiding data smoothing. Our algorithms are implemented in the MetaCell R/C++ software package. Reference
Genome-wide association mapping of date palm fruit traits
Date palms (Phoenix dactylifera) are an important fruit crop of arid regions of the Middle East and North Africa. Despite its importance, few genomic resources exist for date palms, hampering evolutionary genomic studies of this perennial species.
Here we report an improved long-read genome assembly for P. dactylifera that is 772.3 Mb in length, with contig N50 of 897.2 Kb, and use this to perform genome-wide association studies (GWAS) of the sex determining region and 21 fruit traits. We find a fruit color GWAS peak at the R2R3-MYB transcription factor VIRESCENS gene and identify functional alleles that include a retrotransposon insertion and start codon mutation. We also find a GWAS peak for sugar composition spanning deletion polymorphisms in multiple linked invertase genes. Reference
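For readers unfamiliar with the contig N50 statistic quoted above, the following is a minimal sketch of the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name `n50` and the toy lengths are illustrative and are not taken from the date palm assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum past half defines N50.
        if running >= half_total:
            return length


# Toy assembly of five contigs (arbitrary units).
toy_n50 = n50([10, 8, 6, 4, 2])
```

In the toy case the total is 30, and the two longest contigs (10 and 8) already cover more than half of it, so the N50 is 8.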
Relating the gut metagenome and metatranscriptome to immunotherapy responses in melanoma patients
Recent evidence suggests that immunotherapy efficacy in melanoma is modulated by gut microbiota. Few studies have examined this phenomenon in humans, and none have incorporated metatranscriptomics, important for determining expression of metagenomic functions in the microbial community.
In melanoma patients undergoing immunotherapy, gut microbiome was characterized in pre-treatment stool using 16S rRNA gene and shotgun metagenome sequencing (n = 27). Transcriptional expression of metagenomic pathways was confirmed with metatranscriptome sequencing in a subset of 17. We examined associations of taxa and metagenomic pathways with progression-free survival (PFS) using 500 × 10-fold cross-validated elastic-net penalized Cox regression. Reference
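The resampling scheme can be sketched as follows, assuming that "500 × 10-fold" means 500 independent repetitions of 10-fold cross-validation; the abstract does not spell this out. The sketch only generates the train/test index splits. Fitting the elastic-net penalized Cox model on each split is left out, and the function name `repeated_kfold` and its parameters are hypothetical.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=500, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation."""
    if n_folds > n_samples:
        raise ValueError("n_folds cannot exceed n_samples")
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        folds = [order[i::n_folds] for i in range(n_folds)]
        for test in folds:
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test


# With the 27 stool samples mentioned above, folds hold two or three samples each.
n_splits = sum(1 for _ in repeated_kfold(27))  # 500 repeats x 10 folds = 5000
```

Averaging a performance metric over all splits reduces the dependence of the estimate on any single random partition, which matters with a cohort this small.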
A systematic evaluation of single cell RNA-seq analysis pipelines
The recent rapid spread of single cell RNA sequencing (scRNA-seq) methods has created a large variety of experimental and computational pipelines for which best practices have not yet been established. Here, we use simulations based on five scRNA-seq library protocols in combination with nine realistic differential expression (DE) setups to systematically evaluate three mapping, four imputation, seven normalisation and four differential expression testing approaches resulting in ~3000 pipelines, allowing us to also assess interactions among pipeline steps.
We find that choices of normalisation and library preparation protocols have the biggest impact on scRNA-seq analyses. Specifically, we find that library preparation determines the ability to detect symmetric expression differences, while normalisation dominates pipeline performance in asymmetric DE-setups. Finally, we illustrate the importance of informed choices by showing that a good scRNA-seq pipeline can have the same impact on detecting a biological signal as quadrupling the sample size. Reference
DC3 is a method for deconvolution and coupled clustering from bulk and single-cell genomics data
Characterizing and interpreting heterogeneous mixtures at the cellular level is a critical problem in genomics. Single-cell assays offer an opportunity to resolve cellular level heterogeneity, e.g., scRNA-seq enables single-cell expression profiling, and scATAC-seq identifies active regulatory elements.
Furthermore, while scHi-C can measure the chromatin contacts (i.e., loops) between active regulatory elements and their target genes in single cells, bulk HiChIP can measure such contacts at a higher resolution. In this work, we introduce DC3 (De-Convolution and Coupled-Clustering) as a method for the joint analysis of various bulk and single-cell data such as HiChIP, RNA-seq and ATAC-seq from the same heterogeneous cell population. DC3 can simultaneously identify distinct subpopulations, assign single cells to the subpopulations (i.e., clustering) and de-convolve the bulk data into subpopulation-specific data. Reference
Identifying significantly impacted pathways: a comprehensive review and assessment
Many high-throughput experiments compare two phenotypes such as disease vs. healthy, with the goal of understanding the underlying biological phenomena characterizing the given phenotype. Because of the importance of this type of analysis, more than 70 pathway analysis methods have been proposed so far.
These can be categorized into two main categories: non-topology-based (non-TB) and topology-based (TB). Although some review papers discuss this topic from different aspects, there is no systematic, large-scale assessment of such methods. Furthermore, the majority of the pathway analysis approaches rely on the assumption of uniformity of p values under the null hypothesis, which is often not true. Reference
Convergence of human and Old World monkey gut microbiomes demonstrates the importance of human ecology over phylogeny
Comparative data from non-human primates provide insight into the processes that shaped the evolution of the human gut microbiome and highlight microbiome traits that differentiate humans from other primates.
Here, in an effort to improve our understanding of the human microbiome, we compare gut microbiome composition and functional potential in 14 populations of humans from ten nations and 18 species of wild, non-human primates. Contrary to expectations from host phylogenetics, we find that human gut microbiome composition and functional potential are more similar to those of cercopithecines, a subfamily of Old World monkey, particularly baboons, than to those of African apes. Reference
Transcriptional landscape and clinical utility of enhancer RNAs for eRNA-targeted therapy in cancer
Enhancer RNA (eRNA) is a type of noncoding RNA transcribed from the enhancer. Although critical roles of eRNA in gene transcription control have been increasingly realized, the systemic landscape and potential function of eRNAs in cancer remains largely unexplored.
Here, we report the integration of multi-omics and pharmacogenomics data across large-scale patient samples and cancer cell lines. We observe a cancer-/lineage-specificity of eRNAs, which may be largely driven by tissue-specific TFs. eRNAs are involved in multiple cancer signaling pathways through putatively regulating their target genes, including clinically actionable genes and immune checkpoints. Reference
Interplay between the human gut microbiome and host metabolism
The human gut is inhabited by a complex and metabolically active microbial ecosystem. While many studies focused on the effect of individual microbial taxa on human health, their overall metabolic potential has been under-explored.
Using whole-metagenome shotgun sequencing data in 1,004 twins, we first observed that unrelated subjects share, on average, almost twice as many metabolic pathways (82%) as species (43%). Then, using 673 blood and 713 faecal metabolites, we found metabolic pathways to be associated with 34% of blood and 95% of faecal metabolites, with over 18,000 significant associations, while species showed less than 3,000 associations. Finally, we estimated that the microbiome was involved in a dialogue between 71% of faecal, and 15% of blood, metabolites. Reference
A homology-guided, genome-based proteome for improved proteomics in the alloploid Nicotiana benthamiana
Nicotiana benthamiana is an important model organism of the Solanaceae (Nightshade) family. Several draft assemblies of the N. benthamiana genome have been generated, but many of the gene-models in these draft assemblies appear incorrect.
Here we present an improved proteome based on the Niben1.0.1 draft genome assembly guided by gene models from other Nicotiana species. Due to the fragmented nature of the Niben1.0.1 draft genome, many protein-encoding genes are missing or partial. We complement these missing proteins by similarly annotating other draft genome assemblies. This approach overcomes problems caused by mis-annotated exon-intron boundaries and mis-assigned short read transcripts to homeologs in polyploid genomes. Reference
Identifying Crohn’s disease signal from variome analysis
After years of concentrated research efforts, the exact cause of Crohn’s disease (CD) remains unknown. Its accurate diagnosis, however, helps in management and preventing the onset of disease. Genome-wide association studies have identified 241 CD loci, but these carry small log odds ratios and are thus diagnostically uninformative.
Here, we describe a machine learning method—AVA,Dx (Analysis of Variation for Association with Disease)—that uses exonic variants from whole exome or genome sequencing data to extract CD signal and predict CD status. Using the person-specific coding variation in genes from a panel of only 111 individuals, we built disease-prediction models informative of previously undiscovered disease genes. Reference
Target genes, variants, tissues and transcriptional pathways influencing human serum urate levels
Elevated serum urate levels cause gout and correlate with cardiometabolic diseases via poorly understood mechanisms.
We performed a trans-ancestry genome-wide association study of serum urate in 457,690 individuals, identifying 183 loci (147 previously unknown) that improve the prediction of gout in an independent cohort of 334,880 individuals. Serum urate showed significant genetic correlations with many cardiometabolic traits, with genetic causality analyses supporting a substantial role for pleiotropy. Enrichment analysis, fine-mapping of urate-associated loci and colocalization with gene expression in 47 tissues implicated the kidney and liver as the main target organs and prioritized potentially causal genes and variants, including the transcriptional master regulators in the liver and kidney, HNF1A and HNF4A. Reference
Multi-Cell ECM compaction is predictable via superposition of nonlinear cell dynamics linearized in augmented state space
Cells interacting through an extracellular matrix (ECM) exhibit emergent behaviors resulting from collective intercellular interaction. In wound healing and tissue development, characteristic compaction of ECM gel is induced by multiple cells that generate tensions in the ECM fibers and coordinate their actions with other cells. Computational prediction of collective cell-ECM interaction based on first principles is highly complex, especially as the number of cells increases.
Here, we introduce a computationally efficient method for predicting nonlinear behaviors of multiple cells interacting mechanically through a 3-D ECM fiber network. The key enabling technique is superposition of single cell computational models to predict multicellular behaviors. While cell-ECM interactions are highly nonlinear, they can be linearized accurately with a unique method, termed Dual-Faceted Linearization. This method recasts the original nonlinear dynamics in an augmented space where the system behaves more linearly. The independent state variables are augmented with auxiliary variables that inform the nonlinear elements involved in the system. This computational method involves (a) expressing the original nonlinear state equations as two sets of linear dynamic equations, (b) reducing the order of the augmented linear system via principal component analysis, and (c) superposing individual single cell-ECM dynamics to predict the collective behaviors of multiple cells. Reference
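Step (c) rests on the superposition property of linear systems, which the sketch below illustrates on a toy discrete-time system x[t+1] = A x[t] + B u[t]. It does not implement Dual-Faceted Linearization or the PCA reduction. The matrices, the forcing sequences `cell_1` and `cell_2`, and the helper names are all hypothetical and carry no physical meaning.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def simulate(A, B, inputs, x0):
    """Iterate x[t+1] = A x[t] + B u[t] and return every visited state."""
    state = list(x0)
    trajectory = [state]
    for u in inputs:
        state = [a + b for a, b in zip(mat_vec(A, state), mat_vec(B, u))]
        trajectory.append(state)
    return trajectory


# Hypothetical 2-state linear system starting at rest.
A = [[0.5, 0.25], [0.0, 0.5]]
B = [[1.0, 0.0], [0.0, 1.0]]
x0 = [0.0, 0.0]

# Forcing contributed by two hypothetical cells over three time steps.
cell_1 = [[1.0, 0.0], [0.5, 0.0], [0.0, 0.0]]
cell_2 = [[0.0, 2.0], [0.0, 1.0], [0.0, 0.5]]
both = [[a + b for a, b in zip(u1, u2)] for u1, u2 in zip(cell_1, cell_2)]

alone_1 = simulate(A, B, cell_1, x0)
alone_2 = simulate(A, B, cell_2, x0)
together = simulate(A, B, both, x0)

# Superposition: the joint response equals the sum of the single-cell responses.
superposed = [[a + b for a, b in zip(s1, s2)] for s1, s2 in zip(alone_1, alone_2)]
matches = all(
    math.isclose(p, q, abs_tol=1e-12)
    for row_p, row_q in zip(together, superposed)
    for p, q in zip(row_p, row_q)
)
```

The equality holds only because the toy system is linear. For a nonlinear system such as the original cell-ECM dynamics, single-cell responses cannot simply be added, which is why the abstract describes linearizing in an augmented space first.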
Transcriptome-wide association study of attention deficit hyperactivity disorder identifies associated genes and phenotypes
Attention deficit/hyperactivity disorder (ADHD) is a common neurodevelopmental psychiatric disorder. Genome-wide association studies (GWAS) have identified several loci associated with ADHD. However, understanding the biological relevance of these genetic loci has proven to be difficult.
Here, we conduct an ADHD transcriptome-wide association study (TWAS) consisting of 19,099 cases and 34,194 controls and identify 9 transcriptome-wide significant hits, of which 6 genes were not implicated in the original GWAS. We demonstrate that two of the previous GWAS hits can be largely explained by expression regulation. Probabilistic causal fine-mapping of TWAS signals prioritizes KAT2B with a posterior probability of 0.467 in the dorsolateral prefrontal cortex and TMEM161B with a posterior probability of 0.838 in the amygdala. Reference
microRNA arm-imbalance in part from complementary targets mediated decay promotes gastric cancer progression
Strand-selection is the final step of microRNA biogenesis, in which functional mature miRNAs are generated from one or both arms of the precursor. The preference of strand-selection varies during development and tissue formation; however, its pathological effect is still unknown.
Here we find that two miRNA arms from the same precursor, miR-574-5p and miR-574-3p, are inversely expressed and play exactly opposite roles in gastric cancer progression. A pattern of higher 5p and lower 3p expression is significantly correlated with higher TNM stages and poor prognosis in gastric cancer patients. The increase in the miR-574-5p/-3p ratio, termed miR-574 arm-imbalance, is partially due to the dynamic expression of their highly complementary targets during gastric carcinogenesis. In turn, the arm-imbalance of miR-574 further promotes gastric cancer progression. Reference
A de novo evolved gene in the house mouse regulates female pregnancy cycles
The de novo emergence of new genes has been well documented through genomic analyses. However, a functional analysis, especially of very young protein-coding genes, is still largely lacking. Here, we identify a set of house mouse-specific protein-coding genes and assess their translation by ribosome profiling and mass spectrometry data.
We functionally analyze one of them, Gm13030, which is specifically expressed in females in the oviduct. The interruption of the reading frame affects the transcriptional network in the oviducts at a specific stage of the estrous cycle. This includes the upregulation of Dcpp genes, which are known to stimulate the growth of preimplantation embryos. As a consequence, knockout females have their second litters after shorter times and have a higher infanticide rate. Given that Gm13030 shows no signs of positive selection, our findings support the hypothesis that a de novo evolved gene can directly adopt a function without much sequence adaptation. Reference
Metabolomic adaptations and correlates of survival to immune checkpoint blockade
Despite remarkable success of immune checkpoint inhibitors, the majority of cancer patients have yet to receive durable benefits.
Here, in order to investigate the metabolic alterations in response to immune checkpoint blockade, we comprehensively profile serum metabolites in advanced melanoma and renal cell carcinoma patients treated with nivolumab, an antibody against programmed cell death protein 1 (PD1). We identify serum kynurenine/tryptophan ratio increases as an adaptive resistance mechanism associated with worse overall survival. This advocates for patient stratification and metabolic monitoring in immunotherapy clinical trials including those combining PD1 blockade with indoleamine 2,3-dioxygenase/tryptophan 2,3-dioxygenase (IDO/TDO) inhibitors. Reference
Divergent neuronal DNA methylation patterns across human cortical development reveal critical periods and a unique role of CpH methylation
DNA methylation (DNAm) is a critical regulator of both development and cellular identity and shows unique patterns in neurons. To better characterize maturational changes in DNAm patterns in these cells, we profile the DNAm landscape at single-base resolution across the first two decades of human neocortical development in NeuN+ neurons using whole-genome bisulfite sequencing and compare them to non-neurons (primarily glia) and prenatal homogenate cortex.
We show that DNAm changes more dramatically during the first 5 years of postnatal life than during the entire remaining period. We further refine global patterns of increasingly divergent neuronal CpG and CpH methylation (mCpG and mCpH) into six developmental trajectories and find that in contrast to genome-wide patterns, neighboring mCpG and mCpH levels within these regions are highly correlated. Reference
Massively parallel RNA device engineering in mammalian cells with RNA-Seq
Synthetic RNA-based genetic devices dynamically control a wide range of gene-regulatory processes across diverse cell types. However, the limited throughput of quantitative assays in mammalian cells has hindered fast iteration and interrogation of sequence space needed to identify new RNA devices.
Here we report developing a quantitative, rapid and high-throughput mammalian cell-based RNA-Seq assay to efficiently engineer RNA devices. We identify new ribozyme-based RNA devices that respond to theophylline, hypoxanthine, cyclic-di-GMP, and folinic acid from libraries of ~22,700 sequences in total. The small molecule responsive devices exhibit low basal expression and high activation ratios, significantly expanding our toolset of highly functional ribozyme switches. Reference
Genetic architecture of human plasma lipidome and its link to cardiovascular disease
Understanding the genetic architecture of the plasma lipidome could provide better insights into lipid metabolism and its link to cardiovascular diseases (CVDs).
Here, we perform genome-wide association analyses of 141 lipid species (n = 2,181 individuals), followed by phenome-wide scans with 25 CVD-related phenotypes (n = 511,700 individuals). We identify 35 lipid-species-associated loci (P < 5 × 10⁻⁸), 10 of which associate with CVD risk, including five new loci: COL5A1, GLTPD2, SPTLC3, MBOAT7 and GALNT16 (false discovery rate < 0.05). We identify loci for lipid species that are shown to predict CVD, e.g., SPTLC3 for CER(d18:1/24:1). We show that lipoprotein lipase (LPL) may more efficiently hydrolyze medium-length triacylglycerides (TAGs) than others. Polyunsaturated lipids have the highest heritability and genetic correlations, suggesting considerable genetic regulation at fatty acid levels. Reference
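The abstract reports a false discovery rate threshold of 0.05 without naming the procedure used. As an illustration only, the sketch below implements the widely used Benjamini-Hochberg step-up procedure. Treating it as the method behind the reported threshold is an assumption, and the p-values are made up.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected at FDR alpha."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose p-value satisfies p_(k) <= alpha * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    # Step-up rule: reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Made-up p-values for eight hypothetical locus-phenotype tests.
example = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
```

Because the rule is step-up, a hypothesis can be rejected even if its own p-value misses its rank-specific threshold, as long as a larger p-value further down the ranking meets its own.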