T-Scan: A Genome-wide Method for the Systematic Discovery of T Cell Epitopes
T cell recognition of specific antigens mediates protection from pathogens and controls neoplasias, but can also cause autoimmunity. Our knowledge of T cell antigens and their implications for human health is constrained by the technical limitations of T cell profiling technologies. Here, we present T-Scan, a high-throughput platform for identification of antigens productively recognized by T cells.
T-Scan uses lentiviral delivery of antigen libraries into cells for endogenous processing and presentation on major histocompatibility complex (MHC) molecules. Target cells functionally recognized by T cells are isolated using a reporter for granzyme B activity, and the antigens mediating recognition are identified by next-generation sequencing. We show T-Scan correctly identifies cognate antigens of T cell receptors (TCRs) from viral and human genome-wide libraries. We apply T-Scan to discover new viral antigens, perform high-resolution mapping of TCR specificity, and characterize the reactivity of a tumor-derived TCR. T-Scan is a powerful approach for studying T cell responses.
Single cell transcriptome analysis of developing arcuate nucleus neurons uncovers their key developmental regulators
Despite the crucial physiological processes governed by neurons in the hypothalamic arcuate nucleus (ARC), such as growth, reproduction and energy homeostasis, the developmental pathways and regulators for ARC neurons remain understudied. Our single cell RNA-seq analyses of mouse embryonic ARC revealed many cell type-specific markers for developing ARC neurons.
These markers include transcription factors whose expression is enriched in specific neuronal types and often depleted in other closely-related neuronal types, raising the possibility that these transcription factors play important roles in the fate commitment or differentiation of specific ARC neuronal types. We validated this idea with the two transcription factors, Foxp2 enriched for Ghrh-neurons and Sox14 enriched for Kisspeptin-neurons, using Foxp2- and Sox14-deficient mouse models.
Single-cell DNA replication profiling identifies spatiotemporal developmental dynamics of chromosome organization
In mammalian cells, chromosomes are partitioned into megabase-sized topologically associating domains (TADs). TADs can be in either A (active) or B (inactive) subnuclear compartments, which exhibit early and late replication timing (RT), respectively.
Here, we show that A/B compartments change coordinately with RT changes genome wide during mouse embryonic stem cell (mESC) differentiation. While A to B compartment changes and early to late RT changes were temporally inseparable, B to A changes clearly preceded late to early RT changes and transcriptional activation. Compartments changed primarily by boundary shifting, altering the compartmentalization of TADs facing the A/B compartment interface, which was conserved during reprogramming and confirmed in individual cells by single-cell Repli-seq.
A meta-analysis of genome-wide association studies identifies multiple longevity genes
Human longevity is heritable, but genome-wide association (GWA) studies have had limited success. Here, we perform two meta-analyses of GWA studies of a rigorous longevity phenotype definition including 11,262/3,484 cases surviving at or beyond the age corresponding to the 90th/99th survival percentile, respectively, and 25,483 controls whose age at death or at last contact was at or below the age corresponding to the 60th survival percentile.
Consistent with previous reports, rs429358 (apolipoprotein E (ApoE) ε4) is associated with lower odds of surviving to the 90th and 99th percentile age, while rs7412 (ApoE ε2) shows the opposite.
BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes
To fully utilize the power of single-cell RNA sequencing (scRNA-seq) technologies for identifying cell lineages and bona fide transcriptional signals, it is necessary to combine data from multiple experiments.
We present BERMUDA (Batch Effect ReMoval Using Deep Autoencoders), a novel transfer-learning-based method for batch effect correction in scRNA-seq data. BERMUDA effectively combines different batches of scRNA-seq data with vastly different cell population compositions and amplifies biological signals by transferring information among batches. We demonstrate that BERMUDA outperforms existing methods for removing batch effects and distinguishing cell types in multiple simulated and real scRNA-seq datasets.
Insight into genetic predisposition to chronic lymphocytic leukemia from integrative epigenomics
Genome-wide association studies have provided evidence for inherited genetic predisposition to chronic lymphocytic leukemia (CLL). To gain insight into the mechanisms underlying CLL risk we analyze chromatin accessibility, active regulatory elements marked by H3K27ac, and DNA methylation at 42 risk loci in up to 486 primary CLLs.
We identify that risk loci are significantly enriched for active chromatin in CLL with evidence of being CLL-specific or differentially regulated in normal B-cell development. We then use in situ promoter capture Hi-C, in conjunction with gene expression data to reveal likely target genes of the risk loci. Candidate target genes are enriched for pathways related to B-cell development such as MYC and BCL2 signalling. At 14 loci the analysis highlights 63 variants as the probable functional basis of CLL risk.
A systematic assessment of current genome-scale metabolic reconstruction tools
Several genome-scale metabolic reconstruction software platforms have been developed and are being continuously updated. These tools have been widely applied to reconstruct metabolic models for hundreds of microorganisms ranging from important human pathogens to species of industrial relevance.
However, these platforms, as yet, have not been systematically evaluated with respect to software quality, best potential uses and intrinsic capacity to generate high-quality, genome-scale metabolic models. It is therefore unclear for potential users which tool best fits the purpose of their research.
Connecting signaling and metabolic pathways in EGF receptor-mediated oncogenesis of glioblastoma
As malignant transformation requires synchronization of growth-driving signaling (S) and metabolic (M) pathways, defining cancer-specific S-M interconnected networks (SMINs) could lead to better understanding of oncogenic processes.
In a systems-biology approach, we developed a mathematical model for SMINs in mutated EGF receptor (EGFRvIII) compared to wild-type EGF receptor (EGFRwt) expressing glioblastoma multiforme (GBM). Starting with experimentally validated human protein-protein interactome data for S-M pathways, and incorporating proteomic data for EGFRvIII and EGFRwt GBM cells and patient transcriptomic data, we designed a dynamic model for EGFR-driven GBM-specific information flow. Key nodes and paths identified by in silico perturbation were validated experimentally when inhibition of signaling pathway proteins altered expression of metabolic proteins as predicted by the model.
Comprehensive transcriptomic analysis of cell lines as models of primary tumors across 22 tumor types
Cancer cell lines are a cornerstone of cancer research but previous studies have shown that not all cell lines are equal in their ability to model primary tumors.
Here we present a comprehensive pan-cancer analysis utilizing transcriptomic profiles from The Cancer Genome Atlas and the Cancer Cell Line Encyclopedia to evaluate cell lines as models of primary tumors across 22 tumor types. We perform correlation analysis and gene set enrichment analysis to understand the differences between cell lines and primary tumors. Additionally, we classify cell lines into tumor subtypes in 9 tumor types. We present our pancreatic cancer results as a case study and find that the commonly used cell line MIA PaCa-2 is transcriptionally unrepresentative of primary pancreatic adenocarcinomas.
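As an illustration of the correlation step, the following minimal sketch ranks cell lines by their mean correlation with primary tumor expression profiles. The abstract does not say which correlation measure or preprocessing was used, so the Pearson coefficient, the function names, and the toy expression values are all assumptions made for this example.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def rank_cell_lines(cell_lines, tumors):
    """Rank cell lines by mean correlation with the primary tumor profiles."""
    scores = {
        name: mean(pearson(profile, t) for t in tumors)
        for name, profile in cell_lines.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical log-expression values over the same five genes (toy data).
tumors = [[5.0, 3.0, 8.0, 1.0, 6.0], [4.5, 3.5, 7.5, 1.5, 6.5]]
cell_lines = {
    "line_A": [5.2, 2.8, 8.1, 0.9, 6.2],  # tracks the tumor profiles closely
    "line_B": [1.0, 7.0, 2.0, 8.0, 3.0],  # roughly inverted pattern
}
ranking = rank_cell_lines(cell_lines, tumors)
```

In this toy setup `line_A` ranks first and `line_B` receives a negative score, which is the kind of signal that would flag a cell line as transcriptionally unrepresentative.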
Proteogenomic landscape of squamous cell lung cancer
How genomic and transcriptomic alterations affect the functional proteome in lung cancer is not fully understood. Here, we integrate DNA copy number, somatic mutations, RNA-sequencing, and expression proteomics in a cohort of 108 squamous cell lung cancer (SCC) patients.
We identify three proteomic subtypes, two of which (Inflamed, Redox) comprise 87% of tumors. The Inflamed subtype is enriched with neutrophils, B-cells, and monocytes and expresses more PD-1. Redox tumors are enriched for oxidation-reduction and glutathione pathways and harbor more NFE2L2/KEAP1 alterations and copy gain in the 3q2 locus. Proteomic subtypes are not associated with patient survival.
BART-Seq: cost-effective massively parallelized targeted sequencing for genomics, transcriptomics, and single-cell analysis
We describe a highly sensitive, quantitative, and inexpensive technique for targeted sequencing of transcript cohorts or genomic regions from thousands of bulk samples or single cells in parallel.
Multiplexing is based on a simple method that produces extensive matrices of diverse DNA barcodes attached to invariant primer sets, which are all pre-selected and optimized in silico. By applying the matrices in a novel workflow named Barcode Assembly foR Targeted Sequencing (BART-Seq), we analyze developmental states of thousands of single human pluripotent stem cells, either in different maintenance media or upon Wnt/β-catenin pathway activation, which identifies the mechanisms of differentiation induction.
GWAS hints at pleiotropic roles for FLOWERING LOCUS T in flowering time and yield-related traits in canola
Transition to flowering at the right time is critical for local adaptation and to maximize grain yield in crops. Canola is an important oilseed crop with extensive variation in flowering time among varieties. However, our understanding of underlying genes and their role in canola productivity is limited.
We report our analyses of a diverse GWAS panel (300–368 accessions) of canola and identify SNPs that are significantly associated with variation in flowering time and response to photoperiod across multiple locations. We show that several of these associations map in the vicinity of FLOWERING LOCUS T (FT) paralogs and their known transcriptional regulators. Complementary QTL and eQTL mapping studies, conducted in an Australian doubled haploid population, also detected consistent genomic regions close to the FT paralogs associated with flowering time and yield-related traits. FT sequences vary between accessions.
The existence of discrete phenotypic traits suggests that the complex regulatory processes which produce them are functionally modular. These processes are usually represented by networks. Only modular networks can be partitioned into intelligible subcircuits able to evolve relatively independently.
Traditionally, functional modularity is approximated by detection of modularity in network structure. However, the correlation between structure and function is loose. Many regulatory networks exhibit modular behaviour without structural modularity. Here we partition an experimentally tractable regulatory network, the gap gene system of dipteran insects, using an alternative approach. We show that this system, although not structurally modular, is composed of dynamical modules driving different aspects of whole-network behaviour.
A genome-wide positioning systems network algorithm for in silico drug repurposing
Recent advances in DNA/RNA sequencing have made it possible to identify new targets rapidly and to repurpose approved drugs for treating heterogeneous diseases by the ‘precise’ targeting of individualized disease modules.
In this study, we develop a Genome-wide Positioning Systems network (GPSnet) algorithm for drug repurposing by specifically targeting disease modules derived from individual patients’ DNA and RNA sequencing profiles mapped to the human protein-protein interactome network. We investigate whole-exome sequencing and transcriptome profiles from ~5,000 patients across 15 cancer types from The Cancer Genome Atlas. We show that GPSnet-predicted disease modules can predict drug responses and prioritize new indications for 140 approved drugs.
A comprehensive examination of Nanopore native RNA sequencing for characterization of complex transcriptomes
A platform for highly parallel direct sequencing of native RNA strands was recently described by Oxford Nanopore Technologies, but despite initial efforts it remains crucial to further investigate the technology for quantification of complex transcriptomes.
Here we undertake native RNA sequencing of polyA+ RNA from two human cell lines, analysing ~5.2 million aligned native RNA reads. To enable informative comparisons, we also perform relevant ONT direct cDNA- and Illumina-sequencing. We find that while native RNA sequencing does enable some of the anticipated advantages, key unexpected aspects currently hamper its performance, most notably the quite frequent inability to obtain full-length transcripts from single reads, as well as difficulty in unambiguously inferring their true transcript of origin.
HUPAN: a pan-genome analysis pipeline for human genomes
The human reference genome is still incomplete, especially for those population-specific or individual-specific regions, which may have important functions.
Here, we developed a HUman Pan-genome ANalysis (HUPAN) system to build the human pan-genome. We applied it to 185 deep sequencing and 90 assembled Han Chinese genomes and detected 29.5 Mb novel genomic sequences and at least 188 novel protein-coding genes missing in the human reference genome (GRCh38). It can be an important resource for human genome-related biomedical studies, such as cancer genome analysis.
Spatial chromatin architecture alteration by structural variations in human genomes at the population scale
The number of reported examples of chromatin architecture alterations involved in the regulation of gene transcription and in disease is increasing. However, no genome-wide testing has been performed to assess the abundance of these events and their importance relative to other factors affecting genome regulation.
This is particularly interesting given that a vast majority of genetic variations identified in association studies are located outside coding sequences. This study attempts to address this lack by analyzing the impact on chromatin spatial organization of genetic variants identified in individuals from 26 human populations and in genome-wide association studies.
Role of p110α subunit of PI3-kinase in skeletal muscle mitochondrial homeostasis and metabolism
Skeletal muscle insulin resistance, decreased phosphatidylinositol 3-kinase (PI3K) activation and altered mitochondrial function are hallmarks of type 2 diabetes. To determine the relationship between these abnormalities, we created mice with muscle-specific knockout of the p110α or p110β catalytic subunits of PI3K.
We find that mice with muscle-specific knockout of p110α, but not p110β, display impaired insulin signaling and reduced muscle size due to enhanced proteasomal and autophagic activity. Despite insulin resistance and muscle atrophy, M-p110αKO mice show decreased serum myostatin, increased mitochondrial mass, increased mitochondrial fusion, and increased PGC1α expression, especially PGC1α2 and PGC1α3.
Hidden Markov models lead to higher resolution maps of mutation signature activity in cancer
Knowing the activity of the mutational processes shaping a cancer genome may provide insight into tumorigenesis and personalized therapy. It is thus important to characterize the signatures of active mutational processes in patients from their patterns of single base substitutions.
However, mutational processes do not act uniformly on the genome, leading to statistical dependencies among neighboring mutations. To account for such dependencies, we develop the first sequence-dependent model, SigMa, for mutation signatures. We apply SigMa to characterize genomic and other factors that influence the activity of mutation signatures in breast cancer. We show that SigMa outperforms previous approaches, revealing novel insights on signature etiology.
RaPID: ultra-fast, powerful, and accurate detection of segments identical by descent (IBD) in biobank-scale cohorts
While genetic relatedness, usually manifested as segments identical by descent (IBD), is ubiquitous in modern large biobanks, current IBD detection methods are not efficient at such a scale.
Here, we describe an efficient method, RaPID, for detecting IBD segments in a panel with phased haplotypes. RaPID achieves a time and space complexity linear in the input size and the number of reported IBDs. With simulation, we showed that RaPID is orders of magnitude faster than existing methods while offering competitive power and accuracy. In UK Biobank, RaPID identified 3,335,807 IBDs with a length ≥ 10 cM among 223,507 male X chromosomes in 11 min.
FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science
FDA proactively invests in tools to support innovation of emerging technologies, such as infectious disease next generation sequencing (ID-NGS). Here, we introduce FDA-ARGOS quality-controlled reference genomes as a public database for diagnostic purposes and demonstrate its utility on the example of two use cases.
We provide quality control metrics for the FDA-ARGOS genomic database resource and outline the need for genome quality gap filling in the public domain. In the first use case, we show more accurate microbial identification of Enterococcus avium from metagenomic samples with FDA-ARGOS reference genomes compared to non-curated GenBank genomes.
Integrating Gene and Protein Expression Reveals Perturbed Functional Networks in Alzheimer’s Disease
Asymptomatic and symptomatic Alzheimer’s disease (AD) subjects may present with equivalent neuropathological burdens but have significantly different antemortem cognitive decline rates. Using the transcriptome as a proxy for functional state, we selected 414 expression profiles of symptomatic AD subjects and age-matched non-demented controls from a community-based neuropathological study.
By combining brain tissue-specific protein interactomes with gene networks, we identified functionally distinct composite clusters of genes that reveal extensive changes in expression levels in AD. Global expression for clusters broadly corresponding to synaptic transmission, metabolism, cell cycle, survival, and immune response were downregulated, while the upregulated cluster included largely uncharacterized processes.
Landscape of transcriptomic interactions between breast cancer and its microenvironment
Solid tumours comprise mixtures of tumour cells (TCs) and tumour-adjacent cells (TACs), and the intricate interconnections between these diverse populations shape the tumour’s microenvironment. Despite this complexity, clinical genomic profiling is typically performed from bulk samples, without distinguishing TCs from TACs.
To better understand TC–TAC interactions, we computationally distinguish their transcriptomes in 1780 primary breast tumours. We show that TC and TAC mRNA abundances are divergently associated with clinical phenotypes, including tumour subtypes and patient survival. These differences reflect distinct responses of TCs and TACs to specific somatic driver mutations, particularly TP53. These data further elucidate how the molecular interplay between breast tumours and their microenvironment drives aggressive tumour phenotypes.
CellSIUS provides sensitive and specific detection of rare cell populations from complex single-cell RNA-seq data
We develop CellSIUS (Cell Subtype Identification from Upregulated gene Sets) to fill a methodology gap for rare cell population identification for scRNA-seq data.
CellSIUS outperforms existing algorithms for specificity and selectivity for rare cell types and their transcriptomic signature identification in synthetic and complex biological data. Characterization of a human pluripotent cell differentiation protocol recapitulating deep-layer corticogenesis using CellSIUS reveals unrecognized complexity in human stem cell-derived cellular populations. CellSIUS enables identification of novel rare cell populations and their signature genes providing the means to study those populations in vitro in light of their role in health and disease.
Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq
Identifying gene expression programs underlying both cell-type identity and cellular activities (e.g. life-cycle processes, responses to environmental cues) is crucial for understanding the organization of cells and tissues. Although single-cell RNA-Seq (scRNA-Seq) can quantify transcripts in individual cells, each cell’s expression profile may be a mixture of both types of programs, making them difficult to disentangle.
Here, we benchmark and enhance the use of matrix factorization to solve this problem. We show with simulations that a method we call consensus non-negative matrix factorization (cNMF) accurately infers identity and activity programs, including their relative contributions in each cell. To illustrate the insights this approach enables, we apply it to published brain organoid and visual cortex scRNA-Seq datasets; cNMF refines cell types and identifies both expected (e.g. cell cycle and hypoxia) and novel activity programs, including programs that may underlie a neurosecretory phenotype and synaptogenesis.
Gene expression profile of human T cells following a single stimulation of peripheral blood mononuclear cells with anti-CD3 antibodies
Anti-CD3 immunotherapy was initially approved for clinical use for renal transplantation rejection prevention. Subsequently, new generations of anti-CD3 antibodies have entered clinical trials for a broader spectrum of therapeutic applications, including cancer and autoimmune diseases.
Despite their extensive use, little is known about the exact mechanism of these molecules, except that they are able to activate T cells, inducing an overall immunoregulatory and tolerogenic behavior. To better understand the effects of anti-CD3 antibodies on human T cells, we stimulated PBMCs and then performed RNA-seq assays of enriched T cells to assess changes in their gene expression profiles.
Analysis of the equine “cumulome” reveals major metabolic aberrations after maturation in vitro
Maturation of oocytes under in vitro conditions (IVM) results in impaired developmental competence compared to oocytes matured in vivo.
As oocytes are closely coupled to their cumulus complex, elucidating aberrations in cumulus metabolism in vitro is important to bridge the gap towards more physiological maturation conditions. The aim of this study was to analyze the equine “cumulome” in a novel combination of proteomic (nano-HPLC MS/MS) and metabolomic (UPLC-nanoESI-MS) profiling of single cumulus complexes of metaphase II oocytes matured either in vivo (n = 8) or in vitro (n = 7).
Genome and epigenome wide studies of neurological protein biomarkers in the Lothian Birth Cohort 1936
Although plasma proteins may serve as markers of neurological disease risk, the molecular mechanisms responsible for inter-individual variation in plasma protein levels are poorly understood.
Therefore, we conduct genome- and epigenome-wide association studies on the levels of 92 neurological proteins to identify genetic and epigenetic loci associated with their plasma concentrations (n = 750 healthy older adults). We identify 41 independent genome-wide significant (P < 5.4 × 10^−10) loci for 33 proteins and 26 epigenome-wide significant (P < 3.9 × 10^−10) sites associated with the levels of 9 proteins. Using this information, we identify biological pathways in which putative neurological biomarkers are implicated (neurological, immunological and extracellular matrix metabolic pathways).
Membrane protein-regulated networks across human cancers
Alterations in membrane proteins (MPs) and their regulated pathways have been established as cancer hallmarks and extensively targeted in clinical applications. However, the analysis of MP-interacting proteins and downstream pathways across human malignancies remains challenging.
Here, we present a systematically integrated method to generate a resource of cancer membrane protein-regulated networks (CaMPNets), containing 63,746 high-confidence protein–protein interactions (PPIs) for 1962 MPs, using expression profiles from 5922 tumors with overall survival outcomes across 15 human cancers. Comprehensive analysis of CaMPNets links MP partner communities and regulated pathways to provide MP-based gene sets for identifying prognostic biomarkers and druggable targets.
CONFINED: distinguishing biological from technical sources of variation by leveraging multiple methylation datasets
Methylation datasets are affected by innumerable sources of variability, both biological (cell-type composition, genetics) and technical (batch effects).
Here, we propose a reference-free method based on sparse canonical correlation analysis to separate the biological from technical sources of variability. We show through simulations and real data that our method, CONFINED, is not only more accurate than the state-of-the-art reference-free methods for capturing known, replicable biological variability, but it is also considerably more robust to dataset-specific technical variability than previous approaches.
GEMINI: a variational Bayesian approach to identify genetic interactions from combinatorial CRISPR screens
Systems for CRISPR-based combinatorial perturbation of two or more genes are emerging as powerful tools for uncovering genetic interactions.
However, systematic identification of these relationships is complicated by sample, reagent, and biological variability. We develop a variational Bayes approach (GEMINI) that jointly analyzes all samples and reagents to identify genetic interactions in pairwise knockout screens. The improved accuracy and scalability of GEMINI enable the systematic analysis of combinatorial CRISPR knockout screens, regardless of design and dimension.
The nasal methylome as a biomarker of asthma and airway inflammation in children
The nasal cellular epigenome may serve as a biomarker of airway disease and environmental response. Here we collect nasal swabs from the anterior nares of 547 children (mean age 12.9 y), and measure DNA methylation (DNAm) with the Infinium MethylationEPIC BeadChip.
We perform nasal Epigenome-Wide Association analyses (EWAS) of current asthma, allergen sensitization, allergic rhinitis, fractional exhaled nitric oxide (FeNO) and lung function. We find multiple differentially methylated CpGs (FDR < 0.05) and Regions (DMRs; ≥ 5-CpGs and FDR < 0.05) for asthma (285-CpGs), FeNO (8,372-CpGs; 191-DMRs), total IgE (3-CpGs; 3-DMRs), environment IgE (17-CpGs; 4-DMRs), allergic asthma (1,235-CpGs; 7-DMRs) and bronchodilator response (130-CpGs).
SMURF-seq: efficient copy number profiling on long-read sequencers
We present SMURF-seq, a protocol to efficiently sequence short DNA molecules on a long-read sequencer by randomly ligating them to form long molecules. Applying SMURF-seq using the Oxford Nanopore MinION yields up to 30 fragments per read, providing an average of 6.2 and up to 7.5 million mappable fragments per run, increasing information throughput for read-counting applications.
We apply SMURF-seq on the MinION to generate copy number profiles. A comparison with profiles from Illumina sequencing reveals that SMURF-seq attains similar accuracy. More broadly, SMURF-seq expands the utility of long-read sequencers for read-counting applications.
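To make the idea of a read-counting application concrete, here is a minimal sketch of how mapped fragment positions could be turned into a crude copy number profile: count fragments per fixed-width bin, then take the log2 ratio of each bin to the median bin. This is not the SMURF-seq analysis pipeline; the bin size, pseudocount, function names, and toy positions are assumptions, and real analyses typically also correct for GC content and mappability.

```python
from math import log2
from statistics import median

def bin_counts(fragment_positions, bin_size, n_bins):
    """Count mapped fragments per fixed-width genomic bin."""
    counts = [0] * n_bins
    for pos in fragment_positions:
        idx = pos // bin_size
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

def log2_ratios(counts, pseudocount=0.5):
    """Log2 ratio of each bin's count to the median bin count."""
    ref = median(counts)
    return [log2((c + pseudocount) / (ref + pseudocount)) for c in counts]

# Toy data: four 100 bp bins; the third bin holds twice as many fragments,
# mimicking a copy number gain.
positions = [10, 50, 90, 120, 150, 180,
             210, 220, 240, 260, 280, 290,
             310, 350, 390]
counts = bin_counts(positions, bin_size=100, n_bins=4)
ratios = log2_ratios(counts)
```

Because only the number of mappable fragments per bin matters here, more fragments per run translate directly into finer bins or lower noise at a fixed bin size.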
GWAS of peripheral artery disease in the Million Veteran Program
Peripheral artery disease (PAD) is a leading cause of cardiovascular morbidity and mortality; however, the extent to which genetic factors increase risk for PAD is largely unknown. Using electronic health record data, we performed a genome-wide association study in the Million Veteran Program testing ~32 million DNA sequence variants with PAD (31,307 cases and 211,753 controls) across veterans of European, African and Hispanic ancestry.
The results were replicated in an independent sample of 5,117 PAD cases and 389,291 controls from the UK Biobank. We identified 19 PAD loci, 18 of which have not been previously reported. Eleven of the 19 loci were associated with disease in three vascular beds (coronary, cerebral, peripheral), including LDLR, LPL and LPA, suggesting that therapeutic modulation of low-density lipoprotein cholesterol, the lipoprotein lipase pathway or circulating lipoprotein(a) may be efficacious for multiple atherosclerotic disease phenotypes.
Metabolic network percolation quantifies biosynthetic capabilities across the human oral microbiome
The biosynthetic capabilities of microbes underlie their growth and interactions, playing a prominent role in microbial community structure. For large, diverse microbial communities, prediction of these capabilities is limited by uncertainty about metabolic functions and environmental conditions.
To address this challenge, we propose a probabilistic method, inspired by percolation theory, to computationally quantify how robustly a genome-derived metabolic network produces a given set of metabolites under an ensemble of variable environments. We used this method to compile an atlas of predicted biosynthetic capabilities for 97 metabolites across 456 human oral microbes. This atlas captures taxonomically-related trends in biomass composition, and makes it possible to estimate inter-microbial metabolic distances that correlate with microbial co-occurrences.
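One way to picture "producibility under an ensemble of environments" is the sketch below: sample random nutrient sets, propagate them through the reaction network until no new metabolite appears, and report the fraction of samples in which the target metabolite is reached. This is only an illustration under stated assumptions and not the published method; the network-expansion rule, the independent inclusion probability `p`, the function names, and the toy network are all hypothetical.

```python
import random

def producible(reactions, nutrients, target):
    """Network expansion: fire reactions until no new metabolite appears."""
    available = set(nutrients)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            # A reaction fires once all of its substrates are available.
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return target in available

def production_probability(reactions, candidates, target, p,
                           n_samples=1000, seed=0):
    """Fraction of random environments in which the target is producible.

    Each candidate nutrient is included independently with probability p.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        env = {m for m in candidates if rng.random() < p}
        hits += producible(reactions, env, target)
    return hits / n_samples

# Hypothetical toy network: each reaction is (substrates, products).
reactions = [
    (frozenset({"glucose"}), frozenset({"pyruvate"})),
    (frozenset({"pyruvate", "ammonia"}), frozenset({"alanine"})),
]
candidates = ["glucose", "ammonia", "acetate"]
prob_half = production_probability(reactions, candidates, "alanine", p=0.5)
```

In the toy network alanine needs both glucose and ammonia, so at `p = 0.5` the estimate should sit near 0.25; sweeping `p` from 0 to 1 traces how sharply producibility switches on, which is the percolation-style view of robustness.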
Integrative analysis of vascular endothelial cell genomic features identifies AIDA as a coronary artery disease candidate gene
Genome-wide association studies (GWAS) have identified hundreds of loci associated with coronary artery disease (CAD) and blood pressure (BP) or hypertension. Many of these loci are not linked to traditional risk factors, nor do they include obvious candidate genes, complicating their functional characterization.
We hypothesize that many GWAS loci associated with vascular diseases modulate endothelial functions. Endothelial cells play critical roles in regulating vascular homeostasis, such as roles in forming a selective barrier, inflammation, hemostasis, and vascular tone, and endothelial dysfunction is a hallmark of atherosclerosis and hypertension.
Evolving neoantigen profiles in colorectal cancers with DNA repair defects
Neoantigens that arise as a consequence of tumor-specific mutations can be recognized by T lymphocytes leading to effective immune surveillance. In colorectal cancer (CRC) and other tumor types, a high number of neoantigens is associated with patient response to immune therapies.
The molecular processes governing the generation of neoantigens and their turnover in cancer cells are poorly understood. We exploited CRC as a model system to understand how alterations in DNA repair pathways modulate neoantigen profiles over time.
Multi-region exome sequencing reveals genomic evolution from preneoplasia to lung adenocarcinoma
There has been a dramatic increase in the detection of lung nodules, many of which are preneoplasia atypical adenomatous hyperplasia (AAH), adenocarcinoma in situ (AIS), minimally invasive adenocarcinoma (MIA) or invasive adenocarcinoma (ADC).
The molecular landscape and the evolutionary trajectory of lung preneoplasia have not been well defined. Here, we perform multi-region exome sequencing of 116 resected lung nodules including AAH (n = 22), AIS (n = 27), MIA (n = 54) and synchronous ADC (n = 13). Comparing AAH to AIS, MIA and ADC, we observe progressive genomic evolution at the single nucleotide level and demarcated evolution at the chromosomal level supporting the early lung carcinogenesis model from AAH to AIS, MIA and ADC. Reference
Personal clinical history predicts antibiotic resistance of urinary tract infections
Antibiotic resistance is prevalent among the bacterial pathogens causing urinary tract infections. However, antimicrobial treatment is often prescribed ‘empirically’, in the absence of antibiotic susceptibility testing, risking mismatched and therefore ineffective treatment.
Here, linking a 10-year longitudinal data set of over 700,000 community-acquired urinary tract infections with over 5,000,000 individually resolved records of antibiotic purchases, we identify strong associations of antibiotic resistance with the demographics, records of past urine cultures and history of drug purchases of the patients. When combined together, these associations allow for machine-learning-based personalized drug-specific predictions of antibiotic resistance, thereby enabling drug-prescribing algorithms that match an antibiotic treatment recommendation to the expected resistance of each sample. Reference
RNA-Seq in 296 phased trios provides a high-resolution map of genomic imprinting
Identification of imprinted genes, demonstrating a consistent preference towards the paternal or maternal allelic expression, is important for the understanding of gene expression regulation during embryonic development and of the molecular basis of developmental disorders with a parent-of-origin effect.
Combining allelic analysis of RNA-Seq data with phased genotypes in family trios provides a powerful method to detect parent-of-origin biases in gene expression. Reference
CDetection: CRISPR-Cas12b-based DNA detection with sub-attomolar sensitivity and single-base specificity
CRISPR-based nucleic acid detection methods are reported to facilitate rapid and sensitive DNA detection. However, precise DNA detection at the single-base resolution and its wide applications including high-fidelity SNP genotyping remain to be explored. Here we develop a Cas12b-mediated DNA detection (CDetection) strategy, which shows higher sensitivity on examined targets compared with the previously reported Cas12a-based detection platform.
Moreover, we show that CDetection can distinguish differences at the single-base level upon combining the optimized tuned guide RNA (tgRNA). Therefore, our findings highlight the high sensitivity and accuracy of CDetection, which provides an efficient and highly practical platform for DNA detection. Reference
miRkwood: a tool for the reliable identification of microRNAs in plant genomes
MicroRNAs (miRNAs) play crucial roles in post-transcriptional regulation of eukaryotic gene expression and are involved in many aspects of plant development. Although several prediction tools are available for metazoan genomes, the number of tools dedicated to plants is relatively limited.
Here, we present miRkwood, a user-friendly tool for the identification of miRNAs in plant genomes using small RNA sequencing data. Deep-sequencing data of Argonaute associated small RNAs showed that miRkwood is able to identify a large diversity of plant miRNAs and limits false positive predictions. Reference
Death effector domain-containing protein induces vulnerability to cell cycle inhibition in triple-negative breast cancer
Lacking targetable molecular drivers, triple-negative breast cancer (TNBC) is the most clinically challenging subtype of breast cancer. In this study, we reveal that Death Effector Domain-containing DNA-binding protein (DEDD), which is overexpressed in > 60% of TNBCs, drives a mitogen-independent G1/S cell cycle transition through cytoplasm localization. The gain of cytosolic DEDD enhances cyclin D1 expression by interacting with heat shock 71 kDa protein 8 (HSC70). Concurrently, DEDD interacts with Rb family proteins and promotes their proteasome-mediated degradation.
DEDD overexpression renders TNBCs vulnerable to cell cycle inhibition. Patients with TNBC have been excluded from CDK 4/6 inhibitor clinical trials due to the perceived high frequency of Rb-loss in TNBCs. Interestingly, our study demonstrated that, irrespective of Rb status, TNBCs with DEDD overexpression exhibit a DEDD-dependent vulnerability to combinatorial treatment with CDK4/6 inhibitor and EGFR inhibitor in vitro and in vivo. Thus, our study provided a rationale for the clinical application of CDK4/6 inhibitor combinatorial regimens for patients with TNBC. Reference
Performance of neural network basecalling tools for Oxford Nanopore sequencing
Basecalling, the computational process of translating raw electrical signal to nucleotide sequence, is of critical importance to the sequencing platforms produced by Oxford Nanopore Technologies (ONT).
Here, we examine the performance of different basecalling tools, looking at accuracy at the level of bases within individual reads and at majority-rule consensus basecalls in an assembly. We also investigate some additional aspects of basecalling: training using a taxon-specific dataset, using a larger neural network model and improving consensus basecalls in an assembly by additional signal-level analysis with Nanopolish. Reference
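The two accuracy measures named above, per-read identity and majority-rule consensus, can be illustrated with a minimal sketch. It assumes the reads are already aligned to the reference and padded to equal length, which real basecalling benchmarks achieve with an aligner and an assembler; the function names `identity` and `majority_consensus` are hypothetical and do not come from the study.

```python
from collections import Counter


def identity(seq, ref):
    """Fraction of positions at which an aligned read matches the reference."""
    if len(seq) != len(ref) or not ref:
        raise ValueError("seq and ref must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq, ref))
    return matches / len(ref)


def majority_consensus(aligned_reads):
    """Most common base in each alignment column (majority rule).

    Ties are resolved in favour of the base seen first in the column,
    which is an arbitrary choice made for this sketch.
    """
    if len({len(read) for read in aligned_reads}) > 1:
        raise ValueError("all reads must have the same aligned length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)
```

Individual reads may each carry errors, yet the consensus can still match the reference when the errors fall at different positions, which is why the two measures are reported separately.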
Microbiota therapy acts via a regulatory T cell MyD88/RORγt pathway to suppress food allergy
The role of dysbiosis in food allergy (FA) remains unclear. We found that dysbiotic fecal microbiota in FA infants evolved compositionally over time and failed to protect against FA in mice. Infants and mice with FA had decreased IgA and increased IgE binding to fecal bacteria, indicative of a broader breakdown of oral tolerance than hitherto appreciated.
Therapy with Clostridiales species impacted by dysbiosis, either as a consortium or as monotherapy with Subdoligranulum variabile, suppressed FA in mice as did a separate immunomodulatory Bacteroidales consortium. Bacteriotherapy induced expression by regulatory T (Treg) cells of the transcription factor ROR-γt in a MyD88-dependent manner, which was deficient in FA infants and mice and ineffectively induced by their microbiota. Reference
Transcriptomic correlates of electrophysiological and morphological diversity within and across excitatory and inhibitory neuron classes
In order to further our understanding of how gene expression contributes to key functional properties of neurons, we combined publicly accessible gene expression, electrophysiology, and morphology measurements to identify cross-cell type correlations between these data modalities. Building on our previous work using a similar approach, we distinguished between correlations which were “class-driven,” meaning those that could be explained by differences between excitatory and inhibitory cell classes, and those that reflected graded phenotypic differences within classes. Taking cell class identity into account increased the degree to which our results replicated in an independent dataset as well as their correspondence with known modes of ion channel function based on the literature.
We also found a smaller set of genes whose relationships to electrophysiological or morphological properties appear to be specific to either excitatory or inhibitory cell types. Next, using data from PatchSeq experiments, allowing simultaneous single-cell characterization of gene expression and electrophysiology, we found that some of the gene-property correlations observed across cell types were further predictive of within-cell type heterogeneity. Reference
Meta-omics analysis of elite athletes identifies a performance-enhancing microbe that functions via lactate metabolism
The human gut microbiome is linked to many states of human health and disease. The metabolic repertoire of the gut microbiome is vast, but the health implications of these bacterial pathways are poorly understood. In this study, we identify a link between members of the genus Veillonella and exercise performance.
We observed an increase in Veillonella relative abundance in marathon runners postmarathon and isolated a strain of Veillonella atypica from stool samples. Inoculation of this strain into mice significantly increased exhaustive treadmill run time. Veillonella utilize lactate as their sole carbon source, which prompted us to perform a shotgun metagenomic analysis in a cohort of elite athletes, finding that every gene in a major pathway metabolizing lactate to propionate is at higher relative abundance postexercise. Using 13C3-labeled lactate in mice, we demonstrate that serum lactate crosses the epithelial barrier into the lumen of the gut. Reference
Integrated analysis of long non-coding RNA and mRNA expression in different colored skin of koi carp
Long non-coding RNAs (lncRNAs) perform crucial roles in biological processes involving complex mechanisms. However, information regarding their abundance, characteristics and potential functions linked to fish skin color is limited. Herein, Illumina sequencing and bioinformatics analyses were conducted on black, white, and red skin of Koi carp (Cyprinus carpio L.).
A total of 590,415,050 clean reads, 446,614 putative transcripts, 4252 known and 72,907 novel lncRNAs were simultaneously obtained, including 92 significant differentially expressed lncRNAs and 722 mRNAs. Ccr_lnc5622441 and Ccr_lnc765201 were up-regulated in black and red skin, Ccr_lnc14074601 and Ccr_lnc2382951 were up-regulated in white skin, and premelanosome protein a (Pmela), Pmelb and tyrosinase (Tyr) were up-regulated in black skin. Reference
MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices
Sample multiplexing facilitates scRNA-seq by reducing costs and identifying artifacts such as cell doublets. However, universal and scalable sample barcoding strategies have not been described. We therefore developed MULTI-seq: multiplexing using lipid-tagged indices for single-cell and single-nucleus RNA sequencing.
MULTI-seq reagents can barcode any cell type or nucleus from any species with an accessible plasma membrane. The method involves minimal sample processing, thereby preserving cell viability and endogenous gene expression patterns. When cells are classified into sample groups using MULTI-seq barcode abundances, data quality is improved through doublet identification and recovery of cells with low RNA content that would otherwise be discarded by standard quality-control workflows. Reference
Exome sequencing in routine diagnostics: a generic test for 254 patients with primary immunodeficiencies
Diagnosis of primary immunodeficiencies (PIDs) is complex and cumbersome yet important for the clinical management of the disease. Exome sequencing may provide a genetic diagnosis in a significant number of patients in a single genetic test.
In May 2013, we implemented exome sequencing in routine diagnostics for patients suffering from PIDs. This study reports the clinical utility and diagnostic yield for a heterogeneous group of 254 consecutively referred PID patients from 249 families. For the majority of patients, the clinical diagnosis was based on clinical criteria including rare and/or unusual severe bacterial, viral, or fungal infections, sometimes accompanied by autoimmune manifestations. Functional immune defects were interpreted in the context of aberrant immune cell populations, aberrant antibody levels, or combinations of these factors. Reference
Identification of metabolic vulnerabilities of receptor tyrosine kinases-driven cancer
One of the biggest hurdles for the development of metabolism-targeted therapies is to identify the responsive tumor subsets. However, the metabolic vulnerabilities for most human cancers remain unclear.
Establishing the link between metabolic signatures and the oncogenic alterations of receptor tyrosine kinases (RTK), the most well-defined cancer genotypes, may precisely direct metabolic intervention to a broad patient population. By integrating metabolomics and transcriptomics, we herein show that oncogenic RTK activation causes distinct metabolic preference. Specifically, EGFR activation branches glycolysis to the serine synthesis for nucleotide biosynthesis and redox homeostasis, whereas FGFR activation recycles lactate to fuel oxidative phosphorylation for energy generation. Reference
A cellular census of human lungs identifies novel cell states in health and in asthma
Human lungs enable efficient gas exchange and form an interface with the environment, which depends on mucosal immunity for protection against infectious agents. Tightly controlled interactions between structural and immune cells are required to maintain lung homeostasis.
Here, we use single-cell transcriptomics to chart the cellular landscape of upper and lower airways and lung parenchyma in healthy lungs, and lower airways in asthmatic lungs. We report location-dependent airway epithelial cell states and a novel subset of tissue-resident memory T cells. In the lower airways of patients with asthma, mucous cell hyperplasia is shown to stem from a novel mucous ciliated cell state, as well as goblet cell hyperplasia. Reference
Cold stress induces enhanced chromatin accessibility and bivalent histone modifications H3K4me3 and H3K27me3 of active genes in potato
Cold stress can greatly affect plant growth and development. Plants have developed special systems to respond to and tolerate cold stress. While plant scientists have discovered numerous genes involved in responses to cold stress, few studies have been dedicated to investigation of genome-wide chromatin dynamics induced by cold or other abiotic stresses.
Genomic regions containing active cis-regulatory DNA elements can be identified as DNase I hypersensitive sites (DHSs). We develop high-resolution DHS maps in potato (Solanum tuberosum) using chromatin isolated from tubers stored under room (22 °C) and cold (4 °C) conditions. We find that cold stress induces a large number of DHSs enriched in genic regions which are frequently associated with differential gene expression in response to temperature variation. Reference
Predicting three-dimensional genome organization with chromatin states
We introduce a computational model to simulate chromatin structure and dynamics. Starting from one-dimensional genomics and epigenomics data that are available for hundreds of cell types, this model enables de novo prediction of chromatin structures at five-kilo-base resolution.
Simulated chromatin structures recapitulate known features of genome organization, including the formation of chromatin loops, topologically associating domains (TADs) and compartments, and are in quantitative agreement with chromosome conformation capture experiments and super-resolution microscopy measurements. Detailed characterization of the predicted structural ensemble reveals the dynamical flexibility of chromatin loops and the presence of cross-talk among neighboring TADs. Analysis of the model’s energy function uncovers distinct mechanisms for chromatin folding at various length scales and suggests a need to go beyond simple A/B compartment types to predict specific contacts between regulatory elements using polymer simulations. Reference
HumanMycobiomeScan: a new bioinformatics tool for the characterization of the fungal fraction in metagenomic samples
Modern metagenomic analysis of complex microbial communities produces large amounts of sequence data containing information on the microbiome in terms of bacterial, archaeal, viral and eukaryotic composition.
HumanMycobiomeScan is a bioinformatics tool for the taxonomic profiling of the mycobiome directly from raw data of next-generation sequencing. The tool uses hierarchical databases of fungi in order to unambiguously assign reads to fungal species more accurately and > 10,000 times faster than other comparable approaches. HumanMycobiomeScan was validated using in silico generated synthetic communities and then applied to metagenomic data, to characterize the intestinal fungal components in subjects adhering to different subsistence strategies. Reference
Invasive DNA elements modify the nuclear architecture of their insertion site by KNOT-linked silencing in Arabidopsis thaliana
The three-dimensional (3D) organization of chromosomes is linked to epigenetic regulation and transcriptional activity. However, only a few functional features of 3D chromatin architecture have been described to date.
Here, we report the KNOT’s involvement in regulating invasive DNA elements. Transgenes can specifically interact with the KNOT, leading to perturbations of 3D nuclear organization, which correlates with the transgene’s expression: high KNOT interaction frequencies are associated with transgene silencing. KNOT-linked silencing (KLS) cannot readily be connected to canonical silencing mechanisms, such as RNA-directed DNA methylation and post-transcriptional gene silencing, as both cytosine methylation and small RNA abundance do not correlate with KLS. Reference
COMPASS for rapid combinatorial optimization of biochemical pathways based on artificial transcription factors
Balanced expression of multiple genes is central for establishing new biosynthetic pathways or multiprotein cellular complexes. Methods for efficient combinatorial assembly of regulatory sequences (promoters) and protein coding sequences are therefore highly wanted.
Here, we report a high-throughput cloning method, called COMPASS for COMbinatorial Pathway ASSembly, for the balanced expression of multiple genes in Saccharomyces cerevisiae. COMPASS employs orthogonal, plant-derived artificial transcription factors (ATFs) and homologous recombination-based cloning for the generation of thousands of individual DNA constructs in parallel. Reference
Codon usage optimization in pluripotent embryonic stem cells
The uneven use of synonymous codons in the transcriptome regulates the efficiency and fidelity of protein translation rates. Yet, the importance of this codon bias in regulating cell state-specific expression programmes is currently debated.
Here, we ask whether different codon usage controls gene expression programmes in self-renewing and differentiating embryonic stem cells. Using ribosome and transcriptome profiling, we identify distinct codon signatures during human embryonic stem cell differentiation. We find that cell state-specific codon bias is determined by the guanine-cytosine (GC) content of differentially expressed genes. Reference
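As a rough illustration of the quantity discussed above, the sketch below computes overall GC content and GC content at third codon positions (often called GC3) for a coding sequence. GC3 is an assumption added here as a common codon-bias summary; the abstract refers only to GC content, and the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G or C bases in a nucleotide sequence."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def gc3_content(cds):
    """GC fraction at the third position of each codon in an in-frame CDS."""
    if not cds or len(cds) % 3 != 0:
        raise ValueError("CDS length must be a non-zero multiple of 3")
    third_positions = cds[2::3]  # bases at index 2, 5, 8, ...
    return gc_content(third_positions)
```

Comparing these values between sets of differentially expressed genes is one simple way to check whether a codon signature tracks GC content.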
A high-density BAC physical map covering the entire MHC region of addax antelope genome
The mammalian major histocompatibility complex (MHC) harbours clusters of genes associated with the immunological defence of animals against infectious pathogens. At present, no complete MHC physical map is available for any of the wild ruminant species in the world.
The high-density physical map is composed of two contigs of 47 overlapping bacterial artificial chromosome (BAC) clones, with an average of 115 Kb for each BAC, covering the entire addax MHC genome. The first contig has 40 overlapping BAC clones covering an approximately 2.9 Mb region of MHC class I, class III, and class IIa, and the second contig has 7 BAC clones covering an approximately 500 Kb genomic region that harbours MHC class IIb. Reference
Comprehensive study of nuclear receptor DNA binding provides a revised framework for understanding receptor specificity
The type II nuclear receptors (NRs) function as heterodimeric transcription factors with the retinoid X receptor (RXR) to regulate diverse biological processes in response to endogenous ligands and therapeutic drugs. DNA-binding specificity has been proposed as a primary mechanism for NR gene regulatory specificity.
Here we use protein-binding microarrays (PBMs) to comprehensively analyze the DNA binding of 12 NR:RXRα dimers. We find more promiscuous NR-DNA binding than has been reported, challenging the view that NR binding specificity is defined by half-site spacing. We show that NRs bind DNA using two distinct modes, explaining widespread NR binding to half-sites in vivo. Finally, we show that the current models of NR specificity better reflect binding-site activity rather than binding-site affinity. Reference
Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer
In most cases of sporadic colorectal cancers, tumorigenesis is a multistep process, involving genomic alterations in parallel with morphologic changes. In addition, accumulating evidence suggests that the human gut microbiome is linked to the development of colorectal cancer.
Here we performed fecal metagenomic and metabolomic studies on samples from a large cohort of 616 participants who underwent colonoscopy to assess taxonomic and functional characteristics of gut microbiota and metabolites. Microbiome and metabolome shifts were apparent in cases of multiple polypoid adenomas and intramucosal carcinomas, in addition to more advanced lesions. We found two distinct patterns of microbiome elevations. Reference
The Genomic and Immune Landscapes of Lethal Metastatic Breast Cancer
The detailed molecular characterization of lethal cancers is a prerequisite to understanding resistance to therapy and escape from cancer immunoediting. We performed extensive multi-platform profiling of multi-regional metastases in autopsies from 10 patients with therapy-resistant breast cancer.
The integrated genomic and immune landscapes show that metastases propagate and evolve as communities of clones, reveal their predicted neo-antigen landscapes, and show that they can accumulate HLA loss of heterozygosity (LOH). The data further identify variable tumor microenvironments and reveal, through analyses of T cell receptor repertoires, that adaptive immune responses appear to co-evolve with the metastatic genomes. These findings reveal in fine detail the landscapes of lethal metastatic breast cancer. Reference
Diabetes causes marked inhibition of mitochondrial metabolism in pancreatic β-cells
Diabetes is a global health problem caused primarily by the inability of pancreatic β-cells to secrete adequate levels of insulin. The molecular mechanisms underlying the progressive failure of β-cells to respond to glucose in type-2 diabetes remain unresolved.
Using a combination of transcriptomics and proteomics, we find significant dysregulation of major metabolic pathways in islets of diabetic βV59M mice, a non-obese, eulipidaemic diabetes model. Multiple genes/proteins involved in glycolysis/gluconeogenesis are upregulated, whereas those involved in oxidative phosphorylation are downregulated. In isolated islets, glucose-induced increases in NADH and ATP are impaired and both oxidative and glycolytic glucose metabolism are reduced. Reference
Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data
DNA base modifications, such as C5-methylcytosine (5mC) and N6-methyldeoxyadenosine (6mA), are important types of epigenetic regulations. Short-read bisulfite sequencing and long-read PacBio sequencing have inherent limitations to detect DNA modifications.
Here, using raw electric signals of Oxford Nanopore long-read sequencing data, we design DeepMod, a bidirectional recurrent neural network (RNN) with long short-term memory (LSTM) to detect DNA modifications. We sequence a human genome HX1 and a Chlamydomonas reinhardtii genome using Nanopore sequencing, and then evaluate DeepMod on three types of genomes (Escherichia coli, Chlamydomonas reinhardtii and human genomes). Reference
A practical guide to methods controlling false discoveries in computational biology
In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as informative covariates to prioritize, weight, and group hypotheses.
However, there is currently no consensus on how the modern methods compare to one another. We investigate the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology. Reference
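For readers unfamiliar with p-value-only FDR control, the following sketch implements the Benjamini-Hochberg step-up procedure, a widely used classic method. It is given here as an assumed example; the abstract does not name the two classic methods that were benchmarked, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level alpha.

    Step-up rule: find the largest rank k such that the k-th smallest
    p value is <= (k / m) * alpha, then reject the k smallest p values.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, ordered by increasing p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_passing_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            rejected[idx] = True
    return rejected
```

The procedure uses only the p values, so every hypothesis is treated as exchangeable. The modern methods described above differ in that they also take an informative covariate into account.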
Comprehensively benchmarking applications for detecting copy number variation
Recently, copy number variation (CNV) has gained considerable interest as a type of genomic variation that plays an important role in complex phenotypes and disease susceptibility. Since a number of CNV detection methods have recently been developed, it is necessary to help investigators choose suitable methods for CNV detection depending on their objectives.
For this reason, this study compared ten commonly used CNV detection applications, including CNVnator, ReadDepth, RDXplorer, LUMPY and Control-FREEC, benchmarking the applications by sensitivity, specificity and computational demands. Taking the DGV gold standard variants as a standard dataset, we evaluated the ten applications with real sequencing data at sequencing depths from 5X to 50X. Among the ten methods benchmarked, LUMPY performs the best for both high sensitivity and specificity at each sequencing depth. For the purpose of high specificity, Canvas is also a good choice. If high sensitivity is preferred, CNVnator and RDXplorer are better choices. Reference
Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments
Single cell RNA-sequencing (scRNA-seq) technology has undergone rapid development in recent years, leading to an explosion in the number of tailored data analysis methods. However, the current lack of gold-standard benchmark datasets makes it difficult for researchers to systematically compare the performance of the many methods available.
Here, we generated a realistic benchmark experiment that included single cells and admixtures of cells or RNA to create ‘pseudo cells’ from up to five distinct cancer cell lines. In total, 14 datasets were generated using both droplet and plate-based scRNA-seq protocols. We compared 3,913 combinations of data analysis methods for tasks ranging from normalization and imputation to clustering, trajectory analysis and data integration. Reference
Associating somatic mutations to clinical outcomes: a pan-cancer study of survival time
We developed subclone multiplicity allocation and somatic heterogeneity (SMASH), a new statistical method for intra-tumor heterogeneity (ITH) inference. SMASH is tailored to the purpose of large-scale association studies with one tumor sample per patient.
In a pan-cancer study of 14 cancer types, we studied the associations between survival time and ITH quantified by SMASH, together with other features of somatic mutations. Our results show that ITH is associated with survival time in several cancer types and its effect can be modified by other covariates, such as mutation burden. Reference
A few Ascomycota taxa dominate soil fungal communities worldwide
Although soil fungi have key functions in terrestrial ecosystems, information on the dominant taxa and their ecological preferences at the global scale is lacking. To fill this knowledge gap, we surveyed 235 soils from across the globe. Our findings indicate that 83 phylotypes (<0.1% of the retrieved fungi), mostly belonging to wind dispersed, generalist Ascomycota, dominate soils globally.
We identify patterns and ecological drivers of dominant soil fungal taxa occurrence, and present a map of their distribution in soils worldwide. Whole-genome comparisons with less dominant, generalist fungi point at a significantly higher number of genes related to stress-tolerance and resource uptake in the dominant fungi, suggesting that they might be better in colonising a wide range of environments. Reference
WhoGEM: an admixture-based prediction machine accurately predicts quantitative functional traits in plants
The explosive growth of genomic data provides an opportunity to make increased use of sequence variations for phenotype prediction. We have developed a prediction machine for quantitative phenotypes (WhoGEM) that overcomes some of the bottlenecks limiting the current methods.
We demonstrated its performance by predicting quantitative disease resistance and quantitative functional traits in the wild model plant species, Medicago truncatula, using geographical locations as covariates for admixture analysis. The method’s prediction reliability equals or outperforms all existing algorithms for quantitative phenotype prediction. WhoGEM analysis produces evidence that variation in genome admixture proportions explains most of the phenotypic variation for quantitative phenotypes. Reference
OSCA: a tool for omic-data-based complex trait analysis
The rapid increase of omic data has greatly facilitated the investigation of associations between omic profiles such as DNA methylation (DNAm) and complex traits in large cohorts.
Here, we propose a mixed-linear-model-based method called MOMENT that tests for association between a DNAm probe and trait with all other distal probes fitted in multiple random-effect components to account for unobserved confounders. We demonstrate by simulations that MOMENT shows a lower false positive rate and more robustness than existing methods. MOMENT has been implemented in a versatile software package called OSCA together with a number of other implementations for omic-data-based analyses. Reference
qDSB-Seq is a general method for genome-wide quantification of DNA double-strand breaks using sequencing
DNA double-strand breaks (DSBs) are among the most lethal types of DNA damage and frequently cause genome instability. Sequencing-based methods for mapping DSBs have been developed but they allow measurement only of relative frequencies of DSBs between loci, which limits our understanding of the physiological relevance of detected DSBs.
Here we propose quantitative DSB sequencing (qDSB-Seq), a method providing both DSB frequencies per cell and their precise genomic coordinates. We induce spike-in DSBs by a site-specific endonuclease and use them to quantify detected DSBs (labeled, e.g., using i-BLESS). Utilizing qDSB-Seq, we determine numbers of DSBs induced by a radiomimetic drug and replication stress, and reveal two orders of magnitude differences in DSB frequencies. Reference
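One way to read the spike-in idea is as proportional scaling: if the fraction of cells cut at the spike-in site is known, the reads observed there calibrate how many reads correspond to one break per cell. The sketch below shows that arithmetic under this assumption. The abstract does not state the formula, and the function and parameter names are hypothetical.

```python
def dsbs_per_cell(reads_of_interest, spike_in_reads,
                  spike_in_cut_fraction, spike_in_sites=1):
    """Convert labeled-read counts into an estimated number of DSBs per cell.

    reads_of_interest:     labeled reads at the loci being quantified
    spike_in_reads:        labeled reads at the spike-in cut site(s)
    spike_in_cut_fraction: fraction of cells cut at each spike-in site
                           (assumed to be measured independently)
    spike_in_sites:        number of spike-in sites per genome
    """
    if spike_in_reads <= 0 or spike_in_cut_fraction <= 0 or spike_in_sites <= 0:
        raise ValueError("spike-in reads, cut fraction and site count must be positive")
    # Average number of spike-in breaks present in each cell.
    spike_in_dsbs_per_cell = spike_in_cut_fraction * spike_in_sites
    # Calibration factor: reads observed per one break per cell.
    reads_per_dsb = spike_in_reads / spike_in_dsbs_per_cell
    return reads_of_interest / reads_per_dsb
```

For instance, with 50,000 spike-in reads from a site cut in half of the cells, 1,000,000 reads at the loci of interest would correspond to about 10 breaks per cell under this scaling.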
Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data
We introduce quanTIseq, a method to quantify the fractions of ten immune cell types from bulk RNA-sequencing data. quanTIseq was extensively validated in blood and tumor samples using simulated, flow cytometry, and immunohistochemistry data.
quanTIseq analysis of 8000 tumor samples revealed that cytotoxic T cell infiltration is more strongly associated with the activation of the CXCR3/CXCL9 axis than with mutational load and that deconvolution-based cell scores have prognostic value in several solid cancers. Finally, we used quanTIseq to show how kinase inhibitors modulate the immune contexture and to reveal immune-cell types that underlie differential patients’ responses to checkpoint blockers. Reference
ChiCMaxima: a robust and simple pipeline for detection and visualization of chromatin looping in Capture Hi-C
Capture Hi-C (CHi-C) is a new technique for assessing genome organization based on chromosome conformation capture coupled to oligonucleotide capture of regions of interest, such as gene promoters.
Chromatin loop detection is challenging because existing Hi-C/4C-like tools, which make different assumptions about the technical biases presented, are often unsuitable. We describe a new approach, ChiCMaxima, which uses local maxima combined with limited filtering to detect DNA looping interactions, integrating information from biological replicates. ChiCMaxima shows more stringency and robustness compared to previously developed tools. Reference
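The local-maxima idea with replicate integration can be sketched as follows. This is a simplified illustration, not the ChiCMaxima code: the window size, the height filter, the replicate tolerance, and the toy interaction profiles are all hypothetical choices.

```python
def local_maxima(signal, window=2, min_height=0.0):
    """Return indices whose value is the strict maximum within +/- window bins."""
    peaks = []
    for i, value in enumerate(signal):
        lo, hi = max(0, i - window), min(len(signal), i + window + 1)
        neighbours = signal[lo:i] + signal[i + 1:hi]
        if value >= min_height and all(value > n for n in neighbours):
            peaks.append(i)
    return peaks


def replicated_peaks(peaks_a, peaks_b, tolerance=1):
    """Keep peaks from replicate A that have a partner in replicate B nearby."""
    return [p for p in peaks_a if any(abs(p - q) <= tolerance for q in peaks_b)]


# Toy interaction profiles (read counts per bin) from one bait, two replicates.
rep1 = [1, 3, 9, 4, 2, 2, 6, 3, 1, 1]
rep2 = [1, 2, 4, 8, 3, 2, 2, 2, 1, 5]
peaks1 = local_maxima(rep1, window=2, min_height=5)
peaks2 = local_maxima(rep2, window=2, min_height=5)
shared = replicated_peaks(peaks1, peaks2)
print(peaks1, peaks2, shared)
```

Requiring a nearby peak in the second replicate is one simple way to trade sensitivity for stringency.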
Genome-scale screens identify JNK–JUN signaling as a barrier for pluripotency exit and endoderm differentiation
Human embryonic stem cells (ESCs) and human induced pluripotent stem cells hold great promise for cell-based therapies and drug discovery. However, homogeneous differentiation remains a major challenge, highlighting the need for understanding developmental mechanisms.
We performed genome-scale CRISPR screens to uncover regulators of definitive endoderm (DE) differentiation, which unexpectedly uncovered five Jun N-terminal kinase (JNK)–JUN family genes as key barriers of DE differentiation. The JNK–JUN pathway does not act through directly inhibiting the DE enhancers. Instead, JUN co-occupies ESC enhancers with OCT4, NANOG, SMAD2 and SMAD3, and specifically inhibits the exit from the pluripotent state by impeding the decommissioning of ESC enhancers and inhibiting the reconfiguration of SMAD2 and SMAD3 chromatin binding from ESC to DE enhancers. Reference
Transcriptional cofactors display specificity for distinct types of core promoters
Transcriptional cofactors (COFs) communicate regulatory cues from enhancers to promoters and are central effectors of transcription activation and gene expression.
Although some COFs have been shown to prefer certain promoter types over others, the extent to which different COFs display intrinsic specificities for distinct promoters is unclear. Here we use a high-throughput promoter-activity assay in Drosophila melanogaster S2 cells to screen 23 COFs for their ability to activate 72,000 candidate core promoters (CPs). We observe differential activation of CPs, indicating distinct regulatory preferences or ‘compatibilities’ between COFs and specific types of CPs. Reference
A systems biology approach uncovers cell-specific gene regulatory effects of genetic associations in multiple sclerosis
Genome-wide association studies (GWAS) have identified more than 50,000 unique associations with common human traits. While this represents a substantial step forward, establishing the biology underlying these associations has proven extremely difficult.
Even determining which cell types and which particular gene(s) are relevant continues to be a challenge. Here, we conduct a cell-specific pathway analysis of the latest GWAS in multiple sclerosis (MS), which had analyzed a total of 47,351 cases and 68,284 healthy controls and found more than 200 non-MHC genome-wide associations. Our analysis identifies pan immune cell as well as cell-specific susceptibility genes in T cells, B cells and monocytes. Reference
Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures
Changes in bulk transcriptional profiles of heterogeneous samples often reflect changes in proportions of individual cell types. Several robust techniques have been developed to dissect the composition of such mixed samples given transcriptional signatures of the pure components or their proportions.
These approaches are insufficient, however, in situations when no information about individual mixture components is available. This problem is known as the complete deconvolution problem, where the composition is revealed without any a priori knowledge about cell types and their proportions. Here, we identify a previously unrecognized property of tissue-specific genes – their mutual linearity – and use it to reveal the structure of the topological space of mixed transcriptional profiles and provide a noise-robust approach to the complete deconvolution problem. Reference
Genomic signatures accompanying the dietary shift to phytophagy in polyphagan beetles
The diversity and evolutionary success of beetles (Coleoptera) are proposed to be related to the diversity of plants on which they feed. Indeed, the largest beetle suborder, Polyphaga, mostly includes plant eaters among its approximately 315,000 species.
We explore the genomic consequences of beetle-plant trophic interactions by performing comparative gene family analyses across 18 species representative of the two most species-rich beetle suborders. We contrast the gene contents of species from the mostly plant-eating suborder Polyphaga with those of the mainly predatory Adephaga. We find gene repertoire evolution to be more dynamic, with significantly more adaptive lineage-specific expansions, in the more speciose Polyphaga. Reference
Host diet and evolutionary history explain different aspects of gut microbiome diversity among vertebrate clades
Multiple factors modulate microbial community assembly in the vertebrate gut, though studies disagree as to their relative contribution. One cause may be a reliance on captive animals, which can have very different gut microbiomes compared to their wild counterparts.
To resolve this disagreement, we analyze a new, large, and highly diverse animal distal gut 16S rRNA microbiome dataset, which comprises 80% wild animals and includes members of Mammalia, Aves, Reptilia, Amphibia, and Actinopterygii. We decouple the effects of host evolutionary history and diet on gut microbiome diversity and show that each factor modulates different aspects of diversity. Moreover, we resolve particular microbial taxa associated with host phylogeny or diet and show that Mammalia have a stronger signal of cophylogeny. Reference
In-depth human plasma proteome analysis captures tissue proteins and transfer of protein variants across the placenta
Here, we present a method for in-depth human plasma proteome analysis based on high-resolution isoelectric focusing HiRIEF LC-MS/MS, demonstrating high proteome coverage, reproducibility and the potential for liquid biopsy protein profiling.
By integrating genomic sequence information to the MS-based plasma proteome analysis, we enable detection of single amino acid variants and for the first time demonstrate transfer of multiple protein variants between mother and fetus across the placenta. We further show that our method has the ability to detect both low abundance tissue-annotated proteins and phosphorylated proteins in plasma, as well as quantitate differences in plasma proteomes between the mother and the newborn as well as changes related to pregnancy. Reference
VULCAN integrates ChIP-seq with patient-derived co-expression networks to identify GRHL2 as a key co-regulator of ERa at enhancers in breast cancer
VirtUaL ChIP-seq Analysis through Networks (VULCAN) infers regulatory interactions of transcription factors by overlaying networks generated from publicly available tumor expression data onto ChIP-seq data.
We apply our method to dissect the regulation of estrogen receptor-alpha activation in breast cancer to identify potential co-regulators of the estrogen receptor’s transcriptional response. Reference
Association analyses identify 31 new risk loci for colorectal cancer susceptibility
Colorectal cancer (CRC) is a leading cause of cancer-related death worldwide, and has a strong heritable basis. We report a genome-wide association analysis of 34,627 CRC cases and 71,379 controls of European ancestry that identifies SNPs at 31 new CRC risk loci.
We also identify eight independent risk SNPs at the new and previously reported European CRC loci, and a further nine CRC SNPs at loci previously only identified in Asian populations. We use in situ promoter capture Hi-C (CHi-C), gene expression, and in silico annotation methods to identify likely target genes of CRC SNPs. Whilst these new SNP associations implicate target genes that are enriched for known CRC pathways such as Wnt and BMP, they also highlight novel pathways with no prior links to colorectal tumourigenesis. Reference
Transcriptomics-Based Screening Identifies Pharmacological Inhibition of Hsp90 as a Means to Defer Aging
Aging strongly influences human morbidity and mortality. Thus, aging-preventive compounds could greatly improve our health and lifespan. Here we screened for such compounds, known as geroprotectors, employing the power of transcriptomics to predict biological age.
Using age-stratified human tissue transcriptomes and machine learning, we generated age classifiers and applied these to transcriptomic changes induced by 1,309 different compounds in human cells, ranking these compounds by their ability to induce a “youthful” transcriptional state. Testing the top candidates in C. elegans, we identified two Hsp90 inhibitors, monorden and tanespimycin, which extended the animals’ lifespan and improved their health. Hsp90 inhibition induces expression of heat shock proteins known to improve protein homeostasis. Reference
Stem cell-associated heterogeneity in Glioblastoma results from intrinsic tumor plasticity shaped by the microenvironment
The identity and unique capacity of cancer stem cells (CSC) to drive tumor growth and resistance have been challenged in brain tumors. Here we report that cells expressing CSC-associated cell membrane markers in Glioblastoma (GBM) do not represent a clonal entity defined by distinct functional properties and transcriptomic profiles, but rather a plastic state that most cancer cells can adopt.
We show that phenotypic heterogeneity arises from non-hierarchical, reversible state transitions, instructed by the microenvironment and is predictable by mathematical modeling. Although functional stem cell properties were similar in vitro, accelerated reconstitution of heterogeneity provides a growth advantage in vivo, suggesting that tumorigenic potential is linked to intrinsic plasticity rather than CSC multipotency. Reference
Differential expression analysis of Trichoderma virens RNA reveals a dynamic transcriptome during colonization of Zea mays roots
Trichoderma spp. are mostly plant-beneficial symbionts that are widely used in agriculture as bio-control agents. Studying the mechanisms behind Trichoderma-derived plant benefits has yielded tangible bio-industrial products.
To take better advantage of this fungal-plant symbiosis, it is necessary to obtain detailed knowledge of which genes Trichoderma utilizes during interaction with its plant host. In this study, we explored the transcriptional activity of T. virens during two phases of symbiosis with maize: recognition of roots and the period after ingress into the root cortex. Reference
Interrogation of human hematopoiesis at single-cell and single-variant resolution
Widespread linkage disequilibrium and incomplete annotation of cell-to-cell state variation represent substantial challenges to elucidating mechanisms of trait-associated genetic variation.
Here we perform genetic fine-mapping for blood cell traits in the UK Biobank to identify putative causal variants. These variants are enriched in genes encoding proteins in trait-relevant biological pathways and in accessible chromatin of hematopoietic progenitors. For regulatory variants, we explore patterns of developmental enhancer activity, predict molecular mechanisms, and identify likely target genes. In several instances, we localize multiple independent variants to the same regulatory element or gene. We further observe that variants with pleiotropic effects preferentially act in common progenitor populations to direct the production of distinct lineages. Finally, we leverage fine-mapped variants in conjunction with continuous epigenomic annotations to identify trait–cell type enrichments within closely related populations and in single cells. Reference
Guidelines for using sigQC for systematic evaluation of gene signatures
With the increased use of next-generation sequencing generating large amounts of genomic data, gene expression signatures are becoming critically important tools for the interpretation of these data, and are poised to have a substantial effect on diagnosis, management, and prognosis for a number of diseases.
It is becoming crucial to establish whether the expression patterns and statistical properties of sets of genes, or gene signatures, are conserved across independent datasets. Conversely, it is necessary to compare established signatures on the same dataset to better understand how they capture different clinical or biological characteristics. Here we describe how to use sigQC, a tool that enables a streamlined, systematic approach for the evaluation of previously obtained gene signatures across multiple gene expression datasets. We implemented sigQC in an R package, making it accessible to users who have knowledge of file input/output and matrix manipulation in R and a moderate grasp of core statistical principles. Reference
Prioritization of cancer therapeutic targets using CRISPR–Cas9 screens
Functional genomics approaches can overcome limitations—such as the lack of identification of robust targets and poor clinical efficacy—that hamper cancer drug development.
Here we performed genome-scale CRISPR–Cas9 screens in 324 human cancer cell lines from 30 cancer types and developed a data-driven framework to prioritize candidates for cancer therapeutics. We integrated cell fitness effects with genomic biomarkers and target tractability for drug development to systematically prioritize new targets in defined tissues and genotypes. We verified one of our most promising dependencies, the Werner syndrome ATP-dependent helicase, as a synthetic lethal target in tumours from multiple cancer types with microsatellite instability. Reference
Comparative analysis of sequencing technologies for single-cell transcriptomics
Single-cell RNA-seq technologies require library preparation prior to sequencing. Here, we present the first report to compare the cheaper BGISEQ-500 platform to the Illumina HiSeq platform for scRNA-seq.
We generate a resource of 468 single cells and 1297 matched single cDNA samples, performing SMARTer and Smart-seq2 protocols on two cell lines with RNA spike-ins. We sequence these libraries on both platforms using single- and paired-end reads. The platforms have comparable sensitivity and accuracy in terms of quantification of gene expression, and low technical variability. Our study provides a standardized scRNA-seq resource to benchmark new scRNA-seq library preparation protocols and sequencing platforms. Reference
A Bayesian mixture model for clustering droplet-based single-cell transcriptomic data from population studies
The recently developed droplet-based single-cell transcriptome sequencing (scRNA-seq) technology makes it feasible to perform a population-scale scRNA-seq study, in which the transcriptome is measured for tens of thousands of single cells from multiple individuals.
Despite the advances of many clustering methods, there are few tailored methods for population-scale scRNA-seq studies. Here, we develop a Bayesian mixture model for single-cell sequencing (BAMM-SC) method to cluster scRNA-seq data from multiple individuals simultaneously. BAMM-SC takes raw count data as input and accounts for data heterogeneity and batch effect among multiple individuals in a unified Bayesian hierarchical model framework. Reference
Phenotypes associated with genes encoding drug targets are predictive of clinical trial side effects
Only a small fraction of early drug programs progress to the market, due to safety and efficacy failures, despite extensive efforts to predict safety. Characterizing the effect of natural variation in the genes encoding drug targets should present a powerful approach to predict side effects arising from drugging particular proteins.
In this retrospective analysis, we report a correlation between the organ systems affected by genetic variation in drug targets and the organ systems in which side effects are observed. Across 1819 drugs and 21 phenotype categories analyzed, drug side effects are more likely to occur in organ systems where there is genetic evidence of a link between the drug target and a phenotype involving that organ system, compared to when there is no such genetic evidence (30.0 vs 19.2%; OR = 1.80). Reference
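The reported odds ratio can be checked directly from the two percentages. The short calculation below uses only the figures quoted above (30.0% and 19.2%); the function names are illustrative.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


def odds_ratio(p_with_evidence, p_without_evidence):
    """Ratio of the odds of a side effect with vs. without genetic evidence."""
    return odds(p_with_evidence) / odds(p_without_evidence)


or_value = odds_ratio(0.300, 0.192)
print(f"OR = {or_value:.2f}")
```

Because the inputs are rounded percentages, the result agrees with the reported value only to the precision quoted in the abstract.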
Conbase: a software for unsupervised discovery of clonal somatic mutations in single cells through read phasing
Accurate variant calling and genotyping represent major limiting factors for downstream applications of single-cell genomics. Here, we report Conbase for the identification of somatic mutations in single-cell DNA sequencing data.
Conbase leverages phased read data from multiple samples in a dataset to achieve increased confidence in somatic variant calls and genotype predictions. Comparing the performance of Conbase to three other methods, we find that Conbase performs best in terms of false discovery rate and specificity and provides superior robustness on simulated data, in vitro expanded fibroblasts and clonal lymphocyte populations isolated directly from a healthy human donor. Reference
Meta-analysis of genome-wide association studies provides insights into genetic control of tomato flavor
Tomato flavor has changed over the course of long-term domestication and intensive breeding. To understand the genetic control of flavor, we report the meta-analysis of genome-wide association studies (GWAS) using 775 tomato accessions and 2,316,117 SNPs from three GWAS panels.
We discover 305 significant associations for the contents of sugars, acids, amino acids, and flavor-related volatiles. We demonstrate that fruit citrate and malate contents have been impacted by selection during domestication and improvement, while sugar content has undergone less stringent selection. We suggest that it may be possible to significantly increase volatiles that positively contribute to consumer preferences while reducing unpleasant volatiles, by selection of the relevant allele combinations. Reference
A cheminformatics approach to characterize metabolomes in stable-isotope-labeled organisms
We report a computational approach (implemented in MS-DIAL 3.0; http://prime.psc.riken.jp/) for metabolite structure characterization using fully 13C-labeled and non-labeled plants and LC–MS/MS. Our approach facilitates carbon number determination and metabolite classification for unknown molecules.
Applying our method to 31 tissues from 12 plant species, we assigned 1,092 structures and 344 formulae to 3,604 carbon-determined metabolite ions, 69 of which were found to represent structures currently not listed in metabolome databases. Reference
Crizotinib-induced immunogenic cell death in non-small cell lung cancer
Immunogenic cell death (ICD) converts dying cancer cells into a therapeutic vaccine and stimulates antitumor immune responses. Here we unravel the results of an unbiased screen identifying high-dose (10 µM) crizotinib as an ICD-inducing tyrosine kinase inhibitor that has exceptional antineoplastic activity when combined with non-ICD inducing chemotherapeutics like cisplatin.
The combination of cisplatin and high-dose crizotinib induces ICD in non-small cell lung carcinoma (NSCLC) cells and effectively controls the growth of distinct (transplantable, carcinogen- or oncogene induced) orthotopic NSCLC models. These anticancer effects are linked to increased T lymphocyte infiltration and are abolished by T cell depletion or interferon-γ neutralization. Crizotinib plus cisplatin leads to an increase in the expression of PD-1 and PD-L1 in tumors, coupled to a strong sensitization of NSCLC to immunotherapy with PD-1 antibodies. Reference
Learning protein constitutive motifs from sequence data
Statistical analysis of evolutionary-related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information.
We here apply RBM to 20 protein families, and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residue-residue tertiary contacts, extended secondary motifs (α-helixes and β-sheets) and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and ‘turning up’ or ‘turning down’ the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype–phenotype relationship for protein families. Reference
The anti-cancer drugs curaxins target spatial genome organization
Recently we characterized a class of anti-cancer agents (curaxins) that disturbs DNA/histone interactions within nucleosomes.
Here, using a combination of genomic and in vitro approaches, we demonstrate that curaxins strongly affect spatial genome organization and compromise enhancer-promoter communication, which is necessary for the expression of several oncogenes, including MYC. We further show that curaxins selectively inhibit enhancer-regulated transcription of chromatinized templates in cell-free conditions. Genomic studies also suggest that curaxins induce partial depletion of CTCF from its binding sites, which contributes to the observed changes in genome topology. Thus, curaxins can be classified as epigenetic drugs that target the 3D genome organization. Reference
Developing a network view of type 2 diabetes risk pathways through integration of genetic, genomic and functional data
Genome-wide association studies (GWAS) have identified several hundred susceptibility loci for type 2 diabetes (T2D). One critical, but unresolved, issue concerns the extent to which the mechanisms through which these diverse signals influence T2D predisposition converge on a limited set of biological processes.
However, the causal variants identified by GWAS mostly fall into a non-coding sequence, complicating the task of defining the effector transcripts through which they operate. Reference
Systematic benchmarking of omics computational tools
Computational omics methods packaged as software have become essential to modern biological research. The increasing dependence of scientists on these powerful software tools creates a need for systematic assessment of these methods, known as benchmarking.
Adopting a standardized benchmarking practice could help researchers who use omics data to better leverage recent technological innovations. Our review summarizes benchmarking practices from 25 recent studies and discusses the challenges, advantages, and limitations of benchmarking across various domains of biology. We also propose principles that can make computational biology benchmarking studies more sustainable and reproducible, ultimately increasing the transparency of biomedical data and results. Reference
Topconfects: a package for confident effect sizes in differential expression analysis provides a more biologically useful ranked gene list
Differential gene expression analysis may discover a set of genes too large to easily investigate, so a means of ranking genes by biological interest level is desired. p values are frequently abused for this purpose.
As an alternative, we propose a method of ranking by confidence bounds on the log fold change, based on the previously developed TREAT test. These confidence bounds provide guaranteed false discovery rate and false coverage-statement rate control. When applied to a breast cancer dataset, the top-ranked genes by Topconfects emphasize markedly different biological processes compared to the top-ranked genes by p value. Reference
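To see why ranking by a confidence bound differs from ranking by p value, consider the simplified sketch below. It uses a plain normal-approximation lower bound on the absolute log fold change. This is an assumption made for illustration: it is not the TREAT-based procedure and carries none of its false discovery rate or false coverage-statement rate guarantees. The gene names, effect sizes, and standard errors are invented.

```python
from statistics import NormalDist


def lower_bound_abs_lfc(lfc, se, confidence=0.95):
    """Normal-approximation lower confidence bound on |log fold change|."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return max(0.0, abs(lfc) - z * se)


# Hypothetical genes: (log fold change, standard error).
genes = {"A": (4.0, 2.0), "B": (1.5, 0.2), "C": (-3.0, 0.5)}

# Rank by how large the effect confidently is.
by_bound = sorted(genes, key=lambda g: lower_bound_abs_lfc(*genes[g]), reverse=True)
# Rank by the z statistic, which gives the same order as ranking by p value.
by_z = sorted(genes, key=lambda g: abs(genes[g][0]) / genes[g][1], reverse=True)

print("by confidence bound:", by_bound)
print("by p value:", by_z)
```

Gene B has the smallest p value because it is measured precisely, but gene C is ranked first by the bound because its effect is confidently larger.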
EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data
Droplet-based single-cell RNA sequencing protocols have dramatically increased the throughput of single-cell transcriptomics studies. A key computational challenge when processing these data is to distinguish libraries for real cells from empty droplets.
Here, we describe a new statistical method for calling cells from droplet-based data, based on detecting significant deviations from the expression profile of the ambient solution. Using simulations, we demonstrate that EmptyDrops has greater power than existing approaches while controlling the false discovery rate among detected cells. Our method also retains distinct cell types that would have been discarded by existing methods in several real data sets. Reference
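The idea of testing each barcode for deviation from the ambient profile can be sketched with a Monte Carlo test. The example below is an illustration under stated assumptions: it uses a plain multinomial likelihood and a fixed, known ambient profile, which may differ from the model used by EmptyDrops. All counts and proportions are invented.

```python
import math
import random
from collections import Counter


def log_multinomial(counts, probs):
    """Log-probability of a count vector under a multinomial model."""
    ll = math.lgamma(sum(counts) + 1)
    for k, p in zip(counts, probs):
        ll += k * math.log(p) - math.lgamma(k + 1)
    return ll


def ambient_p_value(counts, ambient, n_sim=2000, seed=0):
    """Monte Carlo p value for 'these counts were drawn from the ambient pool'."""
    rng = random.Random(seed)
    total = sum(counts)
    observed = log_multinomial(counts, ambient)
    genes = range(len(ambient))
    as_extreme = 0
    for _ in range(n_sim):
        # Simulate a droplet with the same total count from the ambient profile.
        draw = Counter(rng.choices(genes, weights=ambient, k=total))
        simulated = [draw.get(g, 0) for g in genes]
        if log_multinomial(simulated, ambient) <= observed:
            as_extreme += 1
    return (as_extreme + 1) / (n_sim + 1)


ambient = [0.5, 0.3, 0.15, 0.05]   # assumed ambient gene proportions
empty_like = [52, 29, 14, 5]       # resembles the ambient profile
cell_like = [10, 12, 18, 60]       # strongly deviates from it
p_empty = ambient_p_value(empty_like, ambient)
p_cell = ambient_p_value(cell_like, ambient)
print(p_empty, p_cell)
```

In a real analysis the p values would then be corrected for multiple testing across barcodes to control the false discovery rate.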
Aberrant FGFR signaling mediates resistance to CDK4/6 inhibitors in ER+ breast cancer
Using an ORF kinome screen in MCF-7 cells treated with the CDK4/6 inhibitor ribociclib plus fulvestrant, we identified FGFR1 as a mechanism of drug resistance. FGFR1-amplified/ER+ breast cancer cells and MCF-7 cells transduced with FGFR1 were resistant to fulvestrant ± ribociclib or palbociclib.
This resistance was abrogated by treatment with the FGFR tyrosine kinase inhibitor (TKI) lucitanib. Addition of the FGFR TKI erdafitinib to palbociclib/fulvestrant induced complete responses of FGFR1-amplified/ER+ patient-derived-xenografts. Next generation sequencing of circulating tumor DNA (ctDNA) in 34 patients after progression on CDK4/6 inhibitors identified FGFR1/2 amplification or activating mutations in 14/34 (41%) post-progression specimens. Reference
Dissecting heterogeneity in malignant pleural mesothelioma through histo-molecular gradients for clinical applications
Malignant pleural mesothelioma (MPM) is recognized as heterogeneous based both on histology and molecular profiling. Histology addresses inter-tumor and intra-tumor heterogeneity in MPM and describes three major types: epithelioid, sarcomatoid and biphasic, a combination of the former two types.
Molecular profiling studies have not addressed intra-tumor heterogeneity in MPM to date. Here, we use a deconvolution approach and show that molecular gradients shed new light on the intra-tumor heterogeneity of MPM, leading to a reconsideration of MPM molecular classifications. We show that each tumor can be decomposed as a combination of epithelioid-like and sarcomatoid-like components whose proportions are highly associated with the prognosis. Reference
Identification of pathways associated with chemosensitivity through network embedding
Basal gene expression levels have been shown to be predictive of cellular response to cytotoxic treatments. However, such analyses do not fully reveal complex genotype-phenotype relationships, which are partly encoded in highly interconnected molecular networks. Biological pathways provide a complementary way of understanding drug response variation among individuals.
In this study, we integrate chemosensitivity data from a large-scale pharmacogenomics study with basal gene expression data from the CCLE project and prior knowledge of molecular networks to identify specific pathways mediating chemical response. We first develop a computational method called PACER, which ranks pathways for enrichment in a given set of genes using a novel network embedding method. It examines a molecular network that encodes known gene-gene as well as gene-pathway relationships, and determines a vector representation of each gene and pathway in the same low-dimensional vector space. The relevance of a pathway to the given gene set is then captured by the similarity between the pathway vector and gene vectors. Reference
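The final scoring step, in which pathways are ranked by similarity between pathway and gene vectors, can be sketched as below. The embedding itself is not reproduced here: the two-dimensional vectors are made up, and the use of mean cosine similarity is an assumption, since the abstract does not specify the similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_pathways(gene_vectors, pathway_vectors, gene_set):
    """Rank pathways by mean cosine similarity to the genes in gene_set."""
    scores = {}
    for name, pvec in pathway_vectors.items():
        sims = [cosine(pvec, gene_vectors[g]) for g in gene_set]
        scores[name] = sum(sims) / len(sims)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embeddings of genes and pathways in a shared 2-D space.
gene_vectors = {"g1": (1.0, 0.0), "g2": (0.9, 0.1), "g3": (0.0, 1.0)}
pathway_vectors = {"pathway_a": (1.0, 0.05), "pathway_b": (0.1, 1.0)}

ranking = rank_pathways(gene_vectors, pathway_vectors, gene_set=["g1", "g2"])
print(ranking)
```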
Neoantigen-directed immune escape in lung cancer evolution
The interplay between an evolving cancer and a dynamic immune microenvironment remains unclear. Here we analyse 258 regions from 88 early-stage, untreated non-small-cell lung cancers using RNA sequencing and histopathology-assessed tumour-infiltrating lymphocyte estimates.
Immune infiltration varied both between and within tumours, with different mechanisms of neoantigen presentation dysfunction enriched in distinct immune microenvironments. Sparsely infiltrated tumours exhibited a waning of neoantigen editing during tumour evolution, indicative of historical immune editing, or copy-number loss of previously clonal neoantigens. Immune-infiltrated tumour regions exhibited ongoing immunoediting, with either loss of heterozygosity in human leukocyte antigens or depletion of expressed neoantigens. We identified promoter hypermethylation of genes that contain neoantigenic mutations as an epigenetic mechanism of immunoediting. Reference
Melissa: Bayesian clustering and imputation of single-cell methylomes
Measurements of single-cell methylation are revolutionizing our understanding of epigenetic control of gene expression, yet the intrinsic data sparsity limits the scope for quantitative analysis of such data.
Here, we introduce Melissa (MEthyLation Inference for Single cell Analysis), a Bayesian hierarchical method to cluster cells based on local methylation patterns, discovering patterns of epigenetic variability between cells. The clustering also acts as an effective regularization for data imputation on unassayed CpG sites, enabling transfer of information between individual cells. We show both on simulated and real data sets that Melissa provides accurate and biologically meaningful clusterings and state-of-the-art imputation performance. Reference
Measuring the reproducibility and quality of Hi-C data
Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease.
However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study. Reference
MGSEA – a multivariate Gene set enrichment analysis
Gene Set Enrichment Analysis (GSEA) is a powerful tool to identify enriched functional categories of informative biomarkers. Canonical GSEA takes one-dimensional feature scores derived from the data of one platform as inputs.
Numerous extensions of GSEA handling multimodal OMIC data are proposed, yet none of them explicitly captures combinatorial relations of feature scores from multiple platforms. Reference
A reference-grade wild soybean genome
Efficient crop improvement depends on the application of accurate genetic information contained in diverse germplasm resources.
Here we report a reference-grade genome of wild soybean accession W05, with a final assembled genome size of 1013.2 Mb and a contig N50 of 3.3 Mb. The analytical power of the W05 genome is demonstrated by several examples. First, we identify an inversion at the locus determining seed coat color during domestication. Second, a translocation event between chromosomes 11 and 13 of some genotypes is shown to interfere with the assignment of QTLs. Third, we find a region containing copy number variations of the Kunitz trypsin inhibitor (KTI) genes. Reference
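For readers unfamiliar with the contig N50 statistic quoted above, it is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of the standard calculation, using made-up contig lengths:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the cumulative sum to half
        # of the total assembly size defines the N50.
        if running >= half:
            return length
    raise ValueError("need at least one contig")


toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]  # hypothetical lengths, total 32
print(n50(toy_contigs))
```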
RnBeads 2.0: comprehensive analysis of DNA methylation data
DNA methylation is a widely investigated epigenetic mark with important roles in development and disease. High-throughput assays enable genome-scale DNA methylation analysis in large numbers of samples.
Here, we describe a new version of our RnBeads software – an R/Bioconductor package that implements start-to-finish analysis workflows for Infinium microarrays and various types of bisulfite sequencing. RnBeads 2.0 (https://rnbeads.org/) provides additional data types and analysis methods, new functionality for interpreting DNA methylation differences, improved usability with a novel graphical user interface, and better use of computational resources. We demonstrate RnBeads 2.0 in four re-runnable use cases focusing on cell differentiation and cancer. Reference
Network-based prediction of drug combinations
Drug combinations, offering increased therapeutic efficacy and reduced toxicity, play an important role in treating multiple complex diseases. Yet, our ability to identify and validate effective combinations is limited by a combinatorial explosion, driven by both the large number of drug pairs as well as dosage combinations.
Here we propose a network-based methodology to identify clinically efficacious drug combinations for specific diseases. By quantifying the network-based relationship between drug targets and disease proteins in the human protein–protein interactome, we show the existence of six distinct classes of drug–drug–disease combinations. Reference
Best practices for benchmarking germline small-variant calls in human genomes
Standardized benchmarking approaches are required to assess the accuracy of variants called from sequence data. Although variant-calling tools and the metrics used to assess their performance continue to improve, important challenges remain.
Here, as part of the Global Alliance for Genomics and Health (GA4GH), we present a benchmarking framework for variant calling. We provide guidance on how to match variant calls with different representations, define standard performance metrics, and stratify performance by variant type and genome context. We describe limitations of high-confidence calls and regions that can be used as truth sets (for example, single-nucleotide variant concordance of two methods is 99.7% inside versus 76.5% outside high-confidence regions). Reference
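Once variant calls have been matched against a truth set, performance is commonly summarized with precision, recall, and their harmonic mean (F1). The sketch below computes these from true-positive, false-positive, and false-negative counts. The function name, the returned dictionary keys, and the example counts are assumptions for illustration, and the sketch does not reproduce the framework's variant-matching or stratification steps.

```python
def benchmark_metrics(tp, fp, fn):
    """Summarize variant-calling accuracy from match counts.

    tp: calls that match a truth-set variant (true positives).
    fp: calls with no matching truth-set variant (false positives).
    fn: truth-set variants that were not called (false negatives).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, e.g. for one variant type within one genome context.
metrics = benchmark_metrics(tp=90, fp=10, fn=30)
```

Computing the same counts separately per variant type and genome context gives the stratified view described above.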
Interrogation of human hematopoiesis at single-cell and single-variant resolution
Widespread linkage disequilibrium and incomplete annotation of cell-to-cell state variation represent substantial challenges to elucidating mechanisms of trait-associated genetic variation.
Here we perform genetic fine-mapping for blood cell traits in the UK Biobank to identify putative causal variants. These variants are enriched in genes encoding proteins in trait-relevant biological pathways and in accessible chromatin of hematopoietic progenitors. For regulatory variants, we explore patterns of developmental enhancer activity, predict molecular mechanisms, and identify likely target genes. In several instances, we localize multiple independent variants to the same regulatory element or gene. We further observe that variants with pleiotropic effects preferentially act in common progenitor populations to direct the production of distinct lineages. Reference
Whole-genome resequencing reveals Brassica napus origin and genetic loci involved in its improvement
Brassica napus (2n = 4x = 38, AACC) is an important allopolyploid crop derived from interspecific crosses between Brassica rapa (2n = 2x = 20, AA) and Brassica oleracea (2n = 2x = 18, CC). However, no truly wild B. napus populations are known; its origin and improvement processes remain unclear.
Here, we resequence 588 B. napus accessions. We find that the A subgenome may have evolved from the ancestor of European turnip and that the C subgenome may have evolved from the common ancestor of kohlrabi, cauliflower, broccoli, and Chinese kale. Additionally, winter oilseed may be the original form of B. napus. Subgenome-specific selection of defense-response genes has contributed to environmental adaptation after formation of the species, whereas asymmetrical subgenomic selection has led to ecotype change. Reference
Topological scoring of protein interaction networks
It remains a significant challenge to define individual protein associations within networks where an individual protein can directly interact with other proteins and/or be part of large complexes, which contain functional modules.
Here we demonstrate the topological scoring (TopS) algorithm for the analysis of quantitative proteomic datasets from affinity purifications. Data is analyzed in a parallel fashion where a prey protein is scored in an individual affinity purification by aggregating information from the entire dataset. Topological scores span a broad range of values indicating the enrichment of an individual protein in every bait protein purification. TopS is applied to interaction networks derived from human DNA repair proteins and yeast chromatin remodeling complexes. Reference
I-Boost: an integrative boosting approach for predicting survival time with multiple genomics platforms
We propose a statistical boosting method, termed I-Boost, to integrate multiple types of high-dimensional genomics data with clinical data for predicting survival time. I-Boost provides substantially higher prediction accuracy than existing methods.
By applying I-Boost to The Cancer Genome Atlas, we show that the integration of multiple genomics platforms with clinical variables improves the prediction of survival time over the use of clinical variables alone; gene expression values are typically more prognostic of survival time than other genomics data types; and gene modules/signatures are at least as prognostic as the collection of individual gene expression data. Reference
Osteogenesis depends on commissioning of a network of stem cell transcription factors that act as repressors of adipogenesis
Mesenchymal (stromal) stem cells (MSCs) constitute populations of mesodermal multipotent cells involved in tissue regeneration and homeostasis in many different organs.
Here we performed comprehensive characterization of the transcriptional and epigenomic changes associated with osteoblast and adipocyte differentiation of human MSCs. We demonstrate that adipogenesis is driven by considerable remodeling of the chromatin landscape and de novo activation of enhancers, whereas osteogenesis involves activation of preestablished enhancers. Using machine learning algorithms for in silico modeling of transcriptional regulation, we identify a large and diverse transcriptional network of pro-osteogenic and antiadipogenic transcription factors. Reference
GWAS identifies genetic loci for self-reported habitual sleep duration supported by accelerometer-derived estimates
Sleep is an essential state of decreased activity and alertness but molecular factors regulating sleep duration remain unknown. Through genome-wide association analysis in 446,118 adults of European ancestry from the UK Biobank, we identify 78 loci for self-reported habitual sleep duration (p < 5 × 10⁻⁸; 43 loci at p < 6 × 10⁻⁹).
Replication is observed for PAX8, VRK2, and FBXL12/UBL5/PIN1 loci in the CHARGE study (n = 47,180; p < 6.3 × 10⁻⁴), and 55 signals show sign-concordant effects. The 78 loci further associate with accelerometer-derived sleep duration, daytime inactivity, sleep efficiency and number of sleep bouts in secondary analysis (n = 85,499). Loci are enriched for pathways including striatum and subpallium development, mechanosensory response, dopamine binding, synaptic neurotransmission and plasticity, among others. Reference
Genome-scale network model of metabolism and histone acetylation reveals metabolic dependencies of histone deacetylase inhibitors
Histone acetylation plays a central role in gene regulation and is sensitive to the levels of metabolic intermediates. However, predicting the impact of metabolic alterations on acetylation in pathological conditions is a significant challenge.
Here, we present a genome-scale network model that predicts the impact of nutritional environment and genetic alterations on histone acetylation. It identifies cell types that are sensitive to histone deacetylase inhibitors based on their metabolic state, and we validate metabolites that alter drug sensitivity. Our model provides a mechanistic framework for predicting how metabolic perturbations contribute to epigenetic changes and sensitivity to deacetylase inhibitors. Reference
A genome-wide association analysis identifies 16 novel susceptibility loci for carpal tunnel syndrome
Carpal tunnel syndrome (CTS) is a common and disabling condition of the hand caused by entrapment of the median nerve at the level of the wrist. It is the most common entrapment neuropathy, with prevalence estimates ranging from 5% to 10%.
Here, we undertake a genome-wide association study (GWAS) of an entrapment neuropathy, using 12,312 CTS cases and 389,344 controls identified in UK Biobank. We discover 16 susceptibility loci for CTS with p < 5 × 10⁻⁸. We identify likely causal genes in the pathogenesis of CTS, including ADAMTS17, ADAMTS10 and EFEMP1, and using RNA sequencing demonstrate expression of these genes in surgically resected tenosynovium from CTS patients. We perform Mendelian randomisation and demonstrate a causal relationship between short stature and higher risk of CTS. Reference
Prioritizing Parkinson’s disease genes using population-scale transcriptomic data
Genome-wide association studies (GWAS) have identified over 41 susceptibility loci associated with Parkinson’s Disease (PD) but identifying putative causal genes and the underlying mechanisms remains challenging.
Here, we leverage large-scale transcriptomic datasets to prioritize genes that are likely to affect PD by using a transcriptome-wide association study (TWAS) approach. Using this approach, we identify 66 gene associations whose predicted expression or splicing levels in dorsolateral prefrontal cortex (DLPFC) and peripheral monocytes are significantly associated with PD risk. We uncover many novel genes associated with PD but also novel mechanisms for known associations such as MAPT, for which we find that variation in exon 3 splicing explains the common genetic association. Reference
MMSplice: modular modeling improves the predictions of genetic variant effects on splicing
Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing) with which we built the winning model of the CAGI5 exon skipping prediction challenge.
The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinct large-scale genomics datasets. These modules are combined to predict effects of variants on exon skipping, splice site choice, splicing efficiency, and pathogenicity, with performance matching or exceeding the state of the art. Our models, available in the repository Kipoi, apply to variants including indels directly from VCF files. Reference
Colonic epithelial cell diversity in health and inflammatory bowel disease
The colonic epithelium facilitates host–microorganism interactions to control mucosal immunity, coordinate nutrient recycling and form a mucus barrier. Breakdown of the epithelial barrier underpins inflammatory bowel disease (IBD). However, the specific contributions of each epithelial-cell subtype to this process are unknown.
Here we profile single colonic epithelial cells from patients with IBD and unaffected controls. We identify previously unknown cellular subtypes, including gradients of progenitor cells, colonocytes and goblet cells within intestinal crypts. At the top of the crypts, we find a previously unknown absorptive cell, expressing the proton channel OTOP2 and the satiety peptide uroguanylin, that senses pH and is dysregulated in inflammation and cancer. In IBD, we observe a positional remodelling of goblet cells that coincides with downregulation of WFDC2—an antiprotease molecule that we find to be expressed by goblet cells and that inhibits bacterial growth. Reference
An atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics
Aging promotes lung function decline and susceptibility to chronic lung diseases, which are the third leading cause of death worldwide. Here, we use single cell transcriptomics and mass spectrometry-based proteomics to quantify changes in cellular activity states across 30 cell types and chart the lung proteome of young and old mice.
We show that aging leads to increased transcriptional noise, indicating deregulated epigenetic control. We observe cell type-specific effects of aging, uncovering increased cholesterol biosynthesis in type-2 pneumocytes and lipofibroblasts and altered relative frequency of airway epithelial cells as hallmarks of lung aging. Reference
A network-centric approach to drugging TNF-induced NF-κB signaling
Target-centric drug development strategies prioritize single-target potency in vitro and do not account for connectivity and multi-target effects within a signal transduction network.
Here, we present a systems biology approach that combines transcriptomic and structural analyses with live-cell imaging to predict small molecule inhibitors of TNF-induced NF-κB signaling and elucidate the network response. We identify two first-in-class small molecules that inhibit the NF-κB signaling pathway by preventing the maturation of a rate-limiting multiprotein complex necessary for IKK activation. Our findings suggest that a network-centric drug discovery approach is a promising strategy to evaluate the impact of pharmacologic intervention in signaling. Reference
Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modeling
DNase-seq and ATAC-seq are broadly used methods to assay open chromatin regions genome-wide. The single nucleotide resolution of DNase-seq has been further exploited to infer transcription factor binding sites (TFBSs) in regulatory regions through footprinting.
Here, we undertake a systematic comparison of the two methods and show that a modification to the ATAC-seq protocol increases its yield and its agreement with DNase-seq data from the same cell line. We demonstrate that the two methods have distinct sequence biases and correct for these protocol-specific biases when performing footprinting. Reference
Ediacaran biozones identified with network analysis provide evidence for pulsed extinctions of early complex life
Rocks of Ediacaran age (~635–541 Ma) contain the oldest fossils of large, complex organisms and their behaviors. These fossils document developmental and ecological innovations, and suggest that extinctions helped to shape the trajectory of early animal evolution.
Conventional methods divide Ediacaran macrofossil localities into taxonomically distinct clusters, which may represent evolutionary, environmental, or preservational variation. Here, we investigate these possibilities with network analysis of body and trace fossil occurrences. By partitioning multipartite networks of taxa, paleoenvironments, and geologic formations into community units, we distinguish between biostratigraphic zones and paleoenvironmentally restricted biotopes, and provide empirically robust and statistically significant evidence for a global, cosmopolitan assemblage unique to terminal Ediacaran strata. Reference
Epigenetic signatures associated with imprinted paternally expressed genes in the Arabidopsis endosperm
Imprinted genes are epigenetically modified during gametogenesis and maintain the established epigenetic signatures after fertilization, causing parental-specific gene expression.
In this study, we show that imprinted paternally expressed genes (PEGs) in the Arabidopsis endosperm are marked by an epigenetic signature of Polycomb Repressive Complex2 (PRC2)-mediated H3K27me3 together with heterochromatic H3K9me2 and CHG methylation, which specifically mark the silenced maternal alleles of PEGs. Reference
Integrated analysis of population genomics, transcriptomics and virulence provides novel insights into Streptococcus pyogenes pathogenesis
Streptococcus pyogenes causes 700 million human infections annually worldwide, yet, despite a century of intensive effort, there is no licensed vaccine against this bacterium.
Although a number of large-scale genomic studies of bacterial pathogens have been published, the relationships among the genome, transcriptome, and virulence in large bacterial populations remain poorly understood. We sequenced the genomes of 2,101 emm28 S. pyogenes invasive strains, from which we selected 492 phylogenetically diverse strains for transcriptome analysis and 50 strains for virulence assessment. Data integration provided a novel understanding of the virulence mechanisms of this model organism. Reference
Genomic analyses of an extensive collection of wild and cultivated accessions provide new insights into peach breeding history
Human selection has a long history of transforming crop genomes. Peach (Prunus persica) has undergone more than 5000 years of domestication that led to remarkable changes in a series of agronomically important traits, but genetic bases underlying these changes and the effects of artificial selection on genomic diversity are not well understood.
Here, we report a comprehensive analysis of peach evolution based on genome sequences of 480 wild and cultivated accessions. By focusing on a set of quantitative trait loci (QTLs), we provide evidence that distinct phases of domestication and improvement increased fruit size, improved taste, and extended the geographic distribution of peach. Reference
Precise tuning of gene expression levels in mammalian cells
Precise, analogue regulation of gene expression is critical for cellular function in mammals. In contrast, widely employed experimental and therapeutic approaches such as knock-in/out strategies are more suitable for binary control of gene activity.
Here we report on a method for precise control of gene expression levels in mammalian cells using engineered microRNA response elements (MREs). First, we measure the efficacy of thousands of synthetic MRE variants under the control of an endogenous microRNA by high-throughput sequencing. Guided by this data, we establish a library of microRNA silencing-mediated fine-tuners (miSFITs) of varying strength that can be employed to precisely control the expression of user-specified genes. We apply this technology to tune the T-cell co-inhibitory receptor PD-1 and to explore how antigen expression influences T-cell activation and tumour growth. Finally, we employ CRISPR/Cas9 mediated homology directed repair to introduce miSFITs into the BRCA1 3′UTR, demonstrating that this versatile tool can be used to tune endogenous genes. Reference
Multi-omic measurements of heterogeneity in HeLa cells across laboratories
Reproducibility in research can be compromised by both biological and technical variation, but most of the focus is on removing the latter. Here we investigate the effects of biological variation in HeLa cell lines using a systems-wide approach.
We determine the degree of molecular and phenotypic variability across 14 stock HeLa samples from 13 international laboratories. We cultured cells in uniform conditions and profiled genome-wide copy numbers, mRNAs, proteins and protein turnover rates in each cell line. We discovered substantial heterogeneity between HeLa variants, especially between lines of the CCL2 and Kyoto varieties, and observed progressive divergence within a specific cell line over 50 successive passages. Reference
Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data
t-distributed stochastic neighbor embedding (t-SNE) is widely used for visualizing single-cell RNA-sequencing (scRNA-seq) data, but it scales poorly to large datasets.
We dramatically accelerate t-SNE, obviating the need for data downsampling, and hence allowing visualization of rare cell populations. Furthermore, we implement a heatmap-style visualization for scRNA-seq based on one-dimensional t-SNE for simultaneously visualizing the expression patterns of thousands of genes. Reference
A mathematical-descriptor of tumor-mesoscopic-structure from CT images annotates prognostic- and molecular-phenotypes of epithelial ovarian cancer
The five-year survival rate of epithelial ovarian cancer (EOC) is approximately 35–40% despite maximal treatment efforts, highlighting a need for stratification biomarkers for personalized treatment.
Here we extract 657 quantitative mathematical descriptors from the preoperative CT images of 364 EOC patients at their initial presentation. Using machine learning, we derive a non-invasive summary-statistic of the primary ovarian tumor based on 4 descriptors, which we name “Radiomic Prognostic Vector” (RPV). RPV reliably identifies the 5% of patients with median overall survival less than 2 years, significantly improves established prognostic methods, and is validated in two independent, multi-center cohorts. Reference