qDSB-Seq is a general method for genome-wide quantification of DNA double-strand breaks using sequencing
DNA double-strand breaks (DSBs) are among the most lethal types of DNA damage and frequently cause genome instability. Sequencing-based methods for mapping DSBs have been developed, but they measure only the relative frequencies of DSBs between loci, which limits our understanding of the physiological relevance of detected DSBs.
Here we propose quantitative DSB sequencing (qDSB-Seq), a method providing both DSB frequencies per cell and their precise genomic coordinates. We induce spike-in DSBs by a site-specific endonuclease and use them to quantify detected DSBs (labeled, e.g., using i-BLESS). Utilizing qDSB-Seq, we determine numbers of DSBs induced by a radiomimetic drug and replication stress, and reveal two orders of magnitude differences in DSB frequencies. Reference
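The spike-in idea can be illustrated with a small proportional calculation. This is a minimal sketch and not the qDSB-Seq implementation: it assumes that labeled read counts scale linearly with break frequency and that the cutting frequency at the enzyme sites has been measured independently. The function name, parameters, and numbers are hypothetical.

```python
def dsbs_per_cell(reads_genome, reads_spikein, spikein_cut_freq, n_spikein_sites):
    """Estimate DSBs per cell by scaling to spike-in breaks (illustrative only).

    reads_genome     -- labeled reads mapped outside the spike-in sites
    reads_spikein    -- labeled reads mapped at the enzyme cut sites
    spikein_cut_freq -- assumed fraction of cut sites broken per cell (0..1)
    n_spikein_sites  -- number of enzyme recognition sites in the genome
    """
    if reads_spikein <= 0:
        raise ValueError("spike-in reads are needed for calibration")
    # Expected number of spike-in DSBs in one cell.
    spikein_dsbs = spikein_cut_freq * n_spikein_sites
    # Reads observed per DSB, calibrated on the spike-in sites.
    reads_per_dsb = reads_spikein / spikein_dsbs
    return reads_genome / reads_per_dsb


# Hypothetical numbers, chosen only to show the arithmetic.
estimate = dsbs_per_cell(reads_genome=50_000, reads_spikein=10_000,
                         spikein_cut_freq=0.5, n_spikein_sites=4)
print(estimate)
```

With these made-up inputs, two spike-in breaks per cell account for 10,000 reads, so 50,000 genome-wide reads correspond to ten breaks per cell.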
Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data
We introduce quanTIseq, a method to quantify the fractions of ten immune cell types from bulk RNA-sequencing data. quanTIseq was extensively validated in blood and tumor samples using simulated, flow cytometry, and immunohistochemistry data.
quanTIseq analysis of 8000 tumor samples revealed that cytotoxic T cell infiltration is more strongly associated with the activation of the CXCR3/CXCL9 axis than with mutational load and that deconvolution-based cell scores have prognostic value in several solid cancers. Finally, we used quanTIseq to show how kinase inhibitors modulate the immune contexture and to reveal immune-cell types that underlie differential patients’ responses to checkpoint blockers. Reference
ChiCMaxima: a robust and simple pipeline for detection and visualization of chromatin looping in Capture Hi-C
Capture Hi-C (CHi-C) is a new technique for assessing genome organization based on chromosome conformation capture coupled to oligonucleotide capture of regions of interest, such as gene promoters.
Chromatin loop detection is challenging because existing Hi-C/4C-like tools, which make different assumptions about the technical biases presented, are often unsuitable. We describe a new approach, ChiCMaxima, which uses local maxima combined with limited filtering to detect DNA looping interactions, integrating information from biological replicates. ChiCMaxima shows more stringency and robustness compared to previously developed tools. Reference
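The notion of calling interactions as local maxima that recur across replicates can be sketched as follows. This is not the ChiCMaxima code: the window size, height filter, and bin tolerance are assumed parameters, the profiles are invented, and no smoothing is applied.

```python
def local_maxima(signal, window=2, min_height=0.0):
    """Return indices whose value is the strict maximum within +/- window bins."""
    peaks = []
    for i, value in enumerate(signal):
        if value < min_height:
            continue  # limited filtering: ignore weak bins
        lo = max(0, i - window)
        hi = min(len(signal), i + window + 1)
        neighbours = signal[lo:i] + signal[i + 1:hi]
        if neighbours and all(value > other for other in neighbours):
            peaks.append(i)
    return peaks


def replicated_peaks(replicates, window=2, min_height=0.0, tolerance=1):
    """Keep peaks of the first replicate that recur, within `tolerance` bins, in all others."""
    per_replicate = [local_maxima(r, window, min_height) for r in replicates]
    first, rest = per_replicate[0], per_replicate[1:]
    return [p for p in first
            if all(any(abs(p - q) <= tolerance for q in peaks) for peaks in rest)]


# Hypothetical interaction counts per bin for two biological replicates.
rep1 = [1, 2, 9, 3, 1, 1, 4, 1, 1, 7, 2]
rep2 = [1, 3, 8, 2, 1, 1, 1, 1, 2, 6, 1]
print(replicated_peaks([rep1, rep2], window=2, min_height=3))  # [2, 9]
```

The peak at bin 6 appears only in the first replicate and is dropped, which is the kind of stringency that replicate integration is meant to provide.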
Genome-scale screens identify JNK–JUN signaling as a barrier for pluripotency exit and endoderm differentiation
Human embryonic stem cells (ESCs) and human induced pluripotent stem cells hold great promise for cell-based therapies and drug discovery. However, homogeneous differentiation remains a major challenge, highlighting the need for understanding developmental mechanisms.
We performed genome-scale CRISPR screens to uncover regulators of definitive endoderm (DE) differentiation, which unexpectedly uncovered five Jun N-terminal kinase (JNK)–JUN family genes as key barriers of DE differentiation. The JNK–JUN pathway does not act through directly inhibiting the DE enhancers. Instead, JUN co-occupies ESC enhancers with OCT4, NANOG, SMAD2 and SMAD3, and specifically inhibits the exit from the pluripotent state by impeding the decommissioning of ESC enhancers and inhibiting the reconfiguration of SMAD2 and SMAD3 chromatin binding from ESC to DE enhancers. Reference
Transcriptional cofactors display specificity for distinct types of core promoters
Transcriptional cofactors (COFs) communicate regulatory cues from enhancers to promoters and are central effectors of transcription activation and gene expression.
Although some COFs have been shown to prefer certain promoter types over others, the extent to which different COFs display intrinsic specificities for distinct promoters is unclear. Here we use a high-throughput promoter-activity assay in Drosophila melanogaster S2 cells to screen 23 COFs for their ability to activate 72,000 candidate core promoters (CPs). We observe differential activation of CPs, indicating distinct regulatory preferences or ‘compatibilities’ between COFs and specific types of CPs. Reference
A systems biology approach uncovers cell-specific gene regulatory effects of genetic associations in multiple sclerosis
Genome-wide association studies (GWAS) have identified more than 50,000 unique associations with common human traits. While this represents a substantial step forward, establishing the biology underlying these associations has proven extremely difficult.
Even determining which cell types and which particular gene(s) are relevant continues to be a challenge. Here, we conduct a cell-specific pathway analysis of the latest GWAS in multiple sclerosis (MS), which had analyzed a total of 47,351 cases and 68,284 healthy controls and found more than 200 non-MHC genome-wide associations. Our analysis identifies pan immune cell as well as cell-specific susceptibility genes in T cells, B cells and monocytes. Reference
Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures
Changes in bulk transcriptional profiles of heterogeneous samples often reflect changes in proportions of individual cell types. Several robust techniques have been developed to dissect the composition of such mixed samples given transcriptional signatures of the pure components or their proportions.
These approaches are insufficient, however, in situations when no information about individual mixture components is available. This problem is known as the complete deconvolution problem, where the composition is revealed without any a priori knowledge about cell types and their proportions. Here, we identify a previously unrecognized property of tissue-specific genes – their mutual linearity – and use it to reveal the structure of the topological space of mixed transcriptional profiles and provide a noise-robust approach to the complete deconvolution problem. Reference
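A toy example may help show what mutual linearity means under a linear mixing model. The signatures, gene names, and mixing fractions below are hypothetical and noise-free; the sketch only shows that two genes specific to the same cell type stay proportional across mixtures, and it is not the deconvolution method itself.

```python
# Hypothetical noise-free signatures: expression of three genes in two pure cell types.
SIGNATURES = {
    "geneA1": {"typeA": 10.0, "typeB": 0.0},
    "geneA2": {"typeA": 20.0, "typeB": 0.0},
    "geneB1": {"typeA": 0.0, "typeB": 5.0},
}


def mix(gene, fraction_a):
    """Bulk expression of `gene` in a mixture containing `fraction_a` of type A."""
    sig = SIGNATURES[gene]
    return fraction_a * sig["typeA"] + (1.0 - fraction_a) * sig["typeB"]


fractions = [0.2, 0.5, 0.8]
# Two genes specific to type A keep a constant ratio across all mixtures.
ratios = [mix("geneA2", f) / mix("geneA1", f) for f in fractions]
print(ratios)
```

Because both type-A genes scale with the same proportion, their ratio does not depend on the mixture, whereas a type-A gene and a type-B gene would not share this property.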
Genomic signatures accompanying the dietary shift to phytophagy in polyphagan beetles
The diversity and evolutionary success of beetles (Coleoptera) are proposed to be related to the diversity of plants on which they feed. Indeed, the largest beetle suborder, Polyphaga, mostly includes plant eaters among its approximately 315,000 species.
We explore the genomic consequences of beetle-plant trophic interactions by performing comparative gene family analyses across 18 species representative of the two most species-rich beetle suborders. We contrast the gene contents of species from the mostly plant-eating suborder Polyphaga with those of the mainly predatory Adephaga. We find gene repertoire evolution to be more dynamic, with significantly more adaptive lineage-specific expansions, in the more speciose Polyphaga. Reference
Host diet and evolutionary history explain different aspects of gut microbiome diversity among vertebrate clades
Multiple factors modulate microbial community assembly in the vertebrate gut, though studies disagree as to their relative contribution. One cause may be a reliance on captive animals, which can have very different gut microbiomes compared to their wild counterparts.
To resolve this disagreement, we analyze a new, large, and highly diverse animal distal gut 16S rRNA microbiome dataset, which comprises 80% wild animals and includes members of Mammalia, Aves, Reptilia, Amphibia, and Actinopterygii. We decouple the effects of host evolutionary history and diet on gut microbiome diversity and show that each factor modulates different aspects of diversity. Moreover, we resolve particular microbial taxa associated with host phylogeny or diet and show that Mammalia have a stronger signal of cophylogeny. Reference
In-depth human plasma proteome analysis captures tissue proteins and transfer of protein variants across the placenta
Here, we present a method for in-depth human plasma proteome analysis based on high-resolution isoelectric focusing (HiRIEF) LC-MS/MS, demonstrating high proteome coverage, reproducibility and the potential for liquid biopsy protein profiling.
By integrating genomic sequence information into the MS-based plasma proteome analysis, we enable detection of single amino acid variants and for the first time demonstrate transfer of multiple protein variants between mother and fetus across the placenta. We further show that our method can detect both low-abundance tissue-annotated proteins and phosphorylated proteins in plasma, and can quantitate differences in plasma proteomes between the mother and the newborn as well as changes related to pregnancy. Reference
VULCAN integrates ChIP-seq with patient-derived co-expression networks to identify GRHL2 as a key co-regulator of ERa at enhancers in breast cancer
VirtUaL ChIP-seq Analysis through Networks (VULCAN) infers regulatory interactions of transcription factors by overlaying networks generated from publicly available tumor expression data onto ChIP-seq data.
We apply our method to dissect the regulation of estrogen receptor-alpha activation in breast cancer to identify potential co-regulators of the estrogen receptor’s transcriptional response. Reference
Association analyses identify 31 new risk loci for colorectal cancer susceptibility
Colorectal cancer (CRC) is a leading cause of cancer-related death worldwide, and has a strong heritable basis. We report a genome-wide association analysis of 34,627 CRC cases and 71,379 controls of European ancestry that identifies SNPs at 31 new CRC risk loci.
We also identify eight independent risk SNPs at the new and previously reported European CRC loci, and a further nine CRC SNPs at loci previously only identified in Asian populations. We use in situ promoter capture Hi-C (CHi-C), gene expression, and in silico annotation methods to identify likely target genes of CRC SNPs. Whilst these new SNP associations implicate target genes that are enriched for known CRC pathways such as Wnt and BMP, they also highlight novel pathways with no prior links to colorectal tumourigenesis. Reference
Transcriptomics-Based Screening Identifies Pharmacological Inhibition of Hsp90 as a Means to Defer Aging
Aging strongly influences human morbidity and mortality. Thus, aging-preventive compounds could greatly improve our health and lifespan. Here we screened for such compounds, known as geroprotectors, employing the power of transcriptomics to predict biological age.
Using age-stratified human tissue transcriptomes and machine learning, we generated age classifiers and applied these to transcriptomic changes induced by 1,309 different compounds in human cells, ranking these compounds by their ability to induce a “youthful” transcriptional state. Testing the top candidates in C. elegans, we identified two Hsp90 inhibitors, monorden and tanespimycin, which extended the animals’ lifespan and improved their health. Hsp90 inhibition induces expression of heat shock proteins known to improve protein homeostasis. Reference
Stem cell-associated heterogeneity in Glioblastoma results from intrinsic tumor plasticity shaped by the microenvironment
The identity and unique capacity of cancer stem cells (CSC) to drive tumor growth and resistance have been challenged in brain tumors. Here we report that cells expressing CSC-associated cell membrane markers in Glioblastoma (GBM) do not represent a clonal entity defined by distinct functional properties and transcriptomic profiles, but rather a plastic state that most cancer cells can adopt.
We show that phenotypic heterogeneity arises from non-hierarchical, reversible state transitions, instructed by the microenvironment and is predictable by mathematical modeling. Although functional stem cell properties were similar in vitro, accelerated reconstitution of heterogeneity provides a growth advantage in vivo, suggesting that tumorigenic potential is linked to intrinsic plasticity rather than CSC multipotency. Reference
Differential expression analysis of Trichoderma virens RNA reveals a dynamic transcriptome during colonization of Zea mays roots
Trichoderma spp. are predominantly plant-beneficial symbionts widely used in agriculture as bio-control agents. Studying the mechanisms behind Trichoderma-derived plant benefits has yielded tangible bio-industrial products.
To better take advantage of this fungal-plant symbiosis, it is necessary to obtain detailed knowledge of which genes Trichoderma utilizes during interaction with its plant host. In this study, we explored the transcriptional activity of T. virens during two phases of symbiosis with maize: recognition of roots and the period after ingress into the root cortex. Reference
Interrogation of human hematopoiesis at single-cell and single-variant resolution
Widespread linkage disequilibrium and incomplete annotation of cell-to-cell state variation represent substantial challenges to elucidating mechanisms of trait-associated genetic variation.
Here we perform genetic fine-mapping for blood cell traits in the UK Biobank to identify putative causal variants. These variants are enriched in genes encoding proteins in trait-relevant biological pathways and in accessible chromatin of hematopoietic progenitors. For regulatory variants, we explore patterns of developmental enhancer activity, predict molecular mechanisms, and identify likely target genes. In several instances, we localize multiple independent variants to the same regulatory element or gene. We further observe that variants with pleiotropic effects preferentially act in common progenitor populations to direct the production of distinct lineages. Finally, we leverage fine-mapped variants in conjunction with continuous epigenomic annotations to identify trait–cell type enrichments within closely related populations and in single cells. Reference
Guidelines for using sigQC for systematic evaluation of gene signatures
With the increased use of next-generation sequencing generating large amounts of genomic data, gene expression signatures are becoming critically important tools for the interpretation of these data, and are poised to have a substantial effect on diagnosis, management, and prognosis for a number of diseases.
It is becoming crucial to establish whether the expression patterns and statistical properties of sets of genes, or gene signatures, are conserved across independent datasets. Conversely, it is necessary to compare established signatures on the same dataset to better understand how they capture different clinical or biological characteristics. Here we describe how to use sigQC, a tool that enables a streamlined, systematic approach for the evaluation of previously obtained gene signatures across multiple gene expression datasets. We implemented sigQC in an R package, making it accessible to users who have knowledge of file input/output and matrix manipulation in R and a moderate grasp of core statistical principles. Reference
Prioritization of cancer therapeutic targets using CRISPR–Cas9 screens
Functional genomics approaches can overcome limitations—such as the lack of identification of robust targets and poor clinical efficacy—that hamper cancer drug development.
Here we performed genome-scale CRISPR–Cas9 screens in 324 human cancer cell lines from 30 cancer types and developed a data-driven framework to prioritize candidates for cancer therapeutics. We integrated cell fitness effects with genomic biomarkers and target tractability for drug development to systematically prioritize new targets in defined tissues and genotypes. We verified one of our most promising dependencies, the Werner syndrome ATP-dependent helicase, as a synthetic lethal target in tumours from multiple cancer types with microsatellite instability. Reference
Comparative analysis of sequencing technologies for single-cell transcriptomics
Single-cell RNA-seq technologies require library preparation prior to sequencing. Here, we present the first report to compare the cheaper BGISEQ-500 platform to the Illumina HiSeq platform for scRNA-seq.
We generate a resource of 468 single cells and 1297 matched single cDNA samples, performing SMARTer and Smart-seq2 protocols on two cell lines with RNA spike-ins. We sequence these libraries on both platforms using single- and paired-end reads. The platforms have comparable sensitivity and accuracy in terms of quantification of gene expression, and low technical variability. Our study provides a standardized scRNA-seq resource to benchmark new scRNA-seq library preparation protocols and sequencing platforms. Reference
A Bayesian mixture model for clustering droplet-based single-cell transcriptomic data from population studies
The recently developed droplet-based single-cell transcriptome sequencing (scRNA-seq) technology makes it feasible to perform a population-scale scRNA-seq study, in which the transcriptome is measured for tens of thousands of single cells from multiple individuals.
Despite the advances of many clustering methods, there are few tailored methods for population-scale scRNA-seq studies. Here, we develop a Bayesian mixture model for single-cell sequencing (BAMM-SC) method to cluster scRNA-seq data from multiple individuals simultaneously. BAMM-SC takes raw count data as input and accounts for data heterogeneity and batch effect among multiple individuals in a unified Bayesian hierarchical model framework. Reference
Phenotypes associated with genes encoding drug targets are predictive of clinical trial side effects
Only a small fraction of early drug programs progress to the market, due to safety and efficacy failures, despite extensive efforts to predict safety. Characterizing the effect of natural variation in the genes encoding drug targets should present a powerful approach to predict side effects arising from drugging particular proteins.
In this retrospective analysis, we report a correlation between the organ systems affected by genetic variation in drug targets and the organ systems in which side effects are observed. Across 1819 drugs and 21 phenotype categories analyzed, drug side effects are more likely to occur in organ systems where there is genetic evidence of a link between the drug target and a phenotype involving that organ system, compared to when there is no such genetic evidence (30.0 vs 19.2%; OR = 1.80). Reference
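The reported odds ratio can be checked approximately from the two rounded percentages. This is a back-of-the-envelope sketch; the published value was presumably computed from the underlying counts, which are not given here.

```python
def odds_ratio(p_with_evidence, p_without_evidence):
    """Odds ratio from two proportions (each strictly between 0 and 1)."""
    odds_with = p_with_evidence / (1 - p_with_evidence)
    odds_without = p_without_evidence / (1 - p_without_evidence)
    return odds_with / odds_without


# 30.0% of side effects with genetic evidence vs 19.2% without.
or_from_percentages = odds_ratio(0.300, 0.192)
print(round(or_from_percentages, 2))  # 1.8
```

The rounded percentages give an odds ratio of about 1.80, in agreement with the figure quoted above.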
Conbase: a software for unsupervised discovery of clonal somatic mutations in single cells through read phasing
Accurate variant calling and genotyping represent major limiting factors for downstream applications of single-cell genomics. Here, we report Conbase for the identification of somatic mutations in single-cell DNA sequencing data.
Conbase leverages phased read data from multiple samples in a dataset to achieve increased confidence in somatic variant calls and genotype predictions. Comparing the performance of Conbase to three other methods, we find that Conbase performs best in terms of false discovery rate and specificity and provides superior robustness on simulated data, in vitro expanded fibroblasts and clonal lymphocyte populations isolated directly from a healthy human donor. Reference
Meta-analysis of genome-wide association studies provides insights into genetic control of tomato flavor
Tomato flavor has changed over the course of long-term domestication and intensive breeding. To understand the genetic control of flavor, we report the meta-analysis of genome-wide association studies (GWAS) using 775 tomato accessions and 2,316,117 SNPs from three GWAS panels.
We discover 305 significant associations for the contents of sugars, acids, amino acids, and flavor-related volatiles. We demonstrate that fruit citrate and malate contents have been impacted by selection during domestication and improvement, while sugar content has undergone less stringent selection. We suggest that it may be possible to significantly increase volatiles that positively contribute to consumer preferences while reducing unpleasant volatiles, by selection of the relevant allele combinations. Reference
A cheminformatics approach to characterize metabolomes in stable-isotope-labeled organisms
We report a computational approach (implemented in MS-DIAL 3.0; http://prime.psc.riken.jp/) for metabolite structure characterization using fully 13C-labeled and non-labeled plants and LC–MS/MS. Our approach facilitates carbon number determination and metabolite classification for unknown molecules.
Applying our method to 31 tissues from 12 plant species, we assigned 1,092 structures and 344 formulae to 3,604 carbon-determined metabolite ions, 69 of which were found to represent structures currently not listed in metabolome databases. Reference
Crizotinib-induced immunogenic cell death in non-small cell lung cancer
Immunogenic cell death (ICD) converts dying cancer cells into a therapeutic vaccine and stimulates antitumor immune responses. Here we unravel the results of an unbiased screen identifying high-dose (10 µM) crizotinib as an ICD-inducing tyrosine kinase inhibitor that has exceptional antineoplastic activity when combined with non-ICD inducing chemotherapeutics like cisplatin.
The combination of cisplatin and high-dose crizotinib induces ICD in non-small cell lung carcinoma (NSCLC) cells and effectively controls the growth of distinct (transplantable, carcinogen- or oncogene induced) orthotopic NSCLC models. These anticancer effects are linked to increased T lymphocyte infiltration and are abolished by T cell depletion or interferon-γ neutralization. Crizotinib plus cisplatin leads to an increase in the expression of PD-1 and PD-L1 in tumors, coupled to a strong sensitization of NSCLC to immunotherapy with PD-1 antibodies. Reference
Learning protein constitutive motifs from sequence data
Statistical analysis of evolutionary-related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information.
We here apply RBM to 20 protein families, and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residue-residue tertiary contacts, extended secondary motifs (α-helixes and β-sheets) and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and ‘turning up’ or ‘turning down’ the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype–phenotype relationship for protein families. Reference
The anti-cancer drugs curaxins target spatial genome organization
Recently we characterized a class of anti-cancer agents (curaxins) that disturbs DNA/histone interactions within nucleosomes.
Here, using a combination of genomic and in vitro approaches, we demonstrate that curaxins strongly affect spatial genome organization and compromise enhancer-promoter communication, which is necessary for the expression of several oncogenes, including MYC. We further show that curaxins selectively inhibit enhancer-regulated transcription of chromatinized templates in cell-free conditions. Genomic studies also suggest that curaxins induce partial depletion of CTCF from its binding sites, which contributes to the observed changes in genome topology. Thus, curaxins can be classified as epigenetic drugs that target the 3D genome organization. Reference
Developing a network view of type 2 diabetes risk pathways through integration of genetic, genomic and functional data
Genome-wide association studies (GWAS) have identified several hundred susceptibility loci for type 2 diabetes (T2D). One critical, but unresolved, issue concerns the extent to which the mechanisms through which these diverse signals influence T2D predisposition converge on a limited set of biological processes.
However, the causal variants identified by GWAS mostly fall within non-coding sequence, complicating the task of defining the effector transcripts through which they operate. Reference
Systematic benchmarking of omics computational tools
Computational omics methods packaged as software have become essential to modern biological research. The increasing dependence of scientists on these powerful software tools creates a need for systematic assessment of these methods, known as benchmarking.
Adopting a standardized benchmarking practice could help researchers who use omics data to better leverage recent technological innovations. Our review summarizes benchmarking practices from 25 recent studies and discusses the challenges, advantages, and limitations of benchmarking across various domains of biology. We also propose principles that can make computational biology benchmarking studies more sustainable and reproducible, ultimately increasing the transparency of biomedical data and results. Reference
Topconfects: a package for confident effect sizes in differential expression analysis provides a more biologically useful ranked gene list
Differential gene expression analysis may discover a set of genes too large to easily investigate, so a means of ranking genes by biological interest level is desired. p values are frequently abused for this purpose.
As an alternative, we propose a method of ranking by confidence bounds on the log fold change, based on the previously developed TREAT test. These confidence bounds provide guaranteed false discovery rate and false coverage-statement rate control. When applied to a breast cancer dataset, the top-ranked genes by Topconfects emphasize markedly different biological processes compared to the top-ranked genes by p value. Reference
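The contrast between ranking by significance and ranking by a confidence bound on the effect size can be shown with a deliberately simplified sketch. It is not the Topconfects algorithm: it uses a fixed normal multiplier (z = 1.96) instead of the TREAT test and provides no false discovery rate control. The gene names, estimates, and standard errors are hypothetical.

```python
def confident_lfc(lfc, se, z=1.96):
    """Shrink an estimated log fold change toward zero by z standard errors."""
    bound = max(0.0, abs(lfc) - z * se)
    return bound if lfc >= 0 else -bound


# Hypothetical genes: (estimated log fold change, standard error).
genes = {"A": (4.0, 1.5), "B": (1.0, 0.1), "C": (3.0, 0.5)}

# Rank by how large the effect confidently is.
by_bound = sorted(genes, key=lambda g: abs(confident_lfc(*genes[g])), reverse=True)
# Rank by a significance proxy (|estimate| / standard error), as a p value would.
by_significance = sorted(genes, key=lambda g: abs(genes[g][0]) / genes[g][1], reverse=True)

print(by_bound)         # ['C', 'A', 'B']
print(by_significance)  # ['B', 'C', 'A']
```

Gene B is the most significant but has a small effect, so it falls to the bottom once genes are ordered by a conservative bound on effect size.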
EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data
Droplet-based single-cell RNA sequencing protocols have dramatically increased the throughput of single-cell transcriptomics studies. A key computational challenge when processing these data is to distinguish libraries for real cells from empty droplets.
Here, we describe a new statistical method for calling cells from droplet-based data, based on detecting significant deviations from the expression profile of the ambient solution. Using simulations, we demonstrate that EmptyDrops has greater power than existing approaches while controlling the false discovery rate among detected cells. Our method also retains distinct cell types that would have been discarded by existing methods in several real data sets. Reference
Aberrant FGFR signaling mediates resistance to CDK4/6 inhibitors in ER+ breast cancer
Using an ORF kinome screen in MCF-7 cells treated with the CDK4/6 inhibitor ribociclib plus fulvestrant, we identified FGFR1 as a mechanism of drug resistance. FGFR1-amplified/ER+ breast cancer cells and MCF-7 cells transduced with FGFR1 were resistant to fulvestrant ± ribociclib or palbociclib.
This resistance was abrogated by treatment with the FGFR tyrosine kinase inhibitor (TKI) lucitanib. Addition of the FGFR TKI erdafitinib to palbociclib/fulvestrant induced complete responses of FGFR1-amplified/ER+ patient-derived-xenografts. Next generation sequencing of circulating tumor DNA (ctDNA) in 34 patients after progression on CDK4/6 inhibitors identified FGFR1/2 amplification or activating mutations in 14/34 (41%) post-progression specimens. Reference
Dissecting heterogeneity in malignant pleural mesothelioma through histo-molecular gradients for clinical applications
Malignant pleural mesothelioma (MPM) is recognized as heterogeneous based both on histology and molecular profiling. Histology addresses inter-tumor and intra-tumor heterogeneity in MPM and describes three major types: epithelioid, sarcomatoid and biphasic, a combination of the former two types.
Molecular profiling studies have not addressed intra-tumor heterogeneity in MPM to date. Here, we use a deconvolution approach and show that molecular gradients shed new light on the intra-tumor heterogeneity of MPM, leading to a reconsideration of MPM molecular classifications. We show that each tumor can be decomposed as a combination of epithelioid-like and sarcomatoid-like components whose proportions are highly associated with the prognosis. Reference
Identification of pathways associated with chemosensitivity through network embedding
Basal gene expression levels have been shown to be predictive of cellular response to cytotoxic treatments. However, such analyses do not fully reveal complex genotype-phenotype relationships, which are partly encoded in highly interconnected molecular networks. Biological pathways provide a complementary way of understanding drug response variation among individuals.
In this study, we integrate chemosensitivity data from a large-scale pharmacogenomics study with basal gene expression data from the CCLE project and prior knowledge of molecular networks to identify specific pathways mediating chemical response. We first develop a computational method called PACER, which ranks pathways for enrichment in a given set of genes using a novel network embedding method. It examines a molecular network that encodes known gene-gene as well as gene-pathway relationships, and determines a vector representation of each gene and pathway in the same low-dimensional vector space. The relevance of a pathway to the given gene set is then captured by the similarity between the pathway vector and gene vectors. Reference
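Scoring a pathway by the similarity between its vector and the vectors of a gene set can be sketched as below. The abstract does not specify the similarity measure or how per-gene scores are combined, so cosine similarity and a simple mean are assumptions here, and the two-dimensional embeddings are invented instead of being learned from a network.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_pathways(pathway_vecs, gene_vecs, gene_set):
    """Order pathways by mean cosine similarity to the genes in `gene_set`."""
    scores = {
        name: sum(cosine(vec, gene_vecs[g]) for g in gene_set) / len(gene_set)
        for name, vec in pathway_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embeddings sharing one low-dimensional space.
gene_vecs = {"g1": (1.0, 0.0), "g2": (0.9, 0.1), "g3": (0.0, 1.0)}
pathway_vecs = {"P1": (1.0, 0.05), "P2": (0.1, 1.0)}

ranking = rank_pathways(pathway_vecs, gene_vecs, gene_set=["g1", "g2"])
print(ranking)  # ['P1', 'P2']
```

Placing genes and pathways in the same space is what makes this direct comparison possible; any other similarity function could be substituted for `cosine`.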
Neoantigen-directed immune escape in lung cancer evolution
The interplay between an evolving cancer and a dynamic immune microenvironment remains unclear. Here we analyse 258 regions from 88 early-stage, untreated non-small-cell lung cancers using RNA sequencing and histopathology-assessed tumour-infiltrating lymphocyte estimates.
Immune infiltration varied both between and within tumours, with different mechanisms of neoantigen presentation dysfunction enriched in distinct immune microenvironments. Sparsely infiltrated tumours exhibited a waning of neoantigen editing during tumour evolution, indicative of historical immune editing, or copy-number loss of previously clonal neoantigens. Immune-infiltrated tumour regions exhibited ongoing immunoediting, with either loss of heterozygosity in human leukocyte antigens or depletion of expressed neoantigens. We identified promoter hypermethylation of genes that contain neoantigenic mutations as an epigenetic mechanism of immunoediting. Reference
Melissa: Bayesian clustering and imputation of single-cell methylomes
Measurements of single-cell methylation are revolutionizing our understanding of epigenetic control of gene expression, yet the intrinsic data sparsity limits the scope for quantitative analysis of such data.
Here, we introduce Melissa (MEthyLation Inference for Single cell Analysis), a Bayesian hierarchical method to cluster cells based on local methylation patterns, discovering patterns of epigenetic variability between cells. The clustering also acts as an effective regularization for data imputation on unassayed CpG sites, enabling transfer of information between individual cells. We show both on simulated and real data sets that Melissa provides accurate and biologically meaningful clusterings and state-of-the-art imputation performance. Reference
Measuring the reproducibility and quality of Hi-C data
Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease.
However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study. Reference
MGSEA – a multivariate Gene set enrichment analysis
Gene Set Enrichment Analysis (GSEA) is a powerful tool to identify enriched functional categories of informative biomarkers. Canonical GSEA takes one-dimensional feature scores derived from the data of one platform as inputs.
Numerous extensions of GSEA handling multimodal OMIC data have been proposed, yet none of them explicitly captures combinatorial relations of feature scores from multiple platforms. Reference
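For orientation, the canonical one-dimensional GSEA statistic mentioned above can be sketched as a weighted running sum over a ranked gene list. This is a minimal sketch of the standard single-platform enrichment score, not of the multivariate MGSEA method; the function name, the weighting exponent `p`, and the toy gene names are illustrative assumptions.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Weighted running-sum enrichment score for one gene set.

    ranked_genes: gene identifiers sorted by decreasing feature score.
    scores: the feature scores, in the same order as ranked_genes.
    gene_set: set of gene identifiers to test.
    p: weighting exponent (assumed default; p=0 gives the unweighted statistic).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_total = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0  # signed maximum deviation of the running sum from zero
    for s, h in zip(scores, hits):
        if h:
            running += abs(s) ** p / hit_total  # step up at a set member
        else:
            running -= 1.0 / n_miss             # step down otherwise
        if abs(running) > abs(best):
            best = running
    return best


# Toy data (hypothetical): a set concentrated at the top of the ranking
genes = ["g1", "g2", "g3", "g4"]
values = [4.0, 3.0, 2.0, 1.0]
print(enrichment_score(genes, values, {"g1", "g2"}))
```

A set concentrated at the top of the ranking yields a score near +1, and one concentrated at the bottom a score near -1. Significance would normally be assessed by permutation, which this sketch omits.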
A reference-grade wild soybean genome
Efficient crop improvement depends on the application of accurate genetic information contained in diverse germplasm resources.
Here we report a reference-grade genome of wild soybean accession W05, with a final assembled genome size of 1013.2 Mb and a contig N50 of 3.3 Mb. The analytical power of the W05 genome is demonstrated by several examples. First, we identify an inversion at the locus determining seed coat color during domestication. Second, a translocation event between chromosomes 11 and 13 of some genotypes is shown to interfere with the assignment of QTLs. Third, we find a region containing copy number variations of the Kunitz trypsin inhibitor (KTI) genes. Reference
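The contig N50 quoted above is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The sketch below shows that standard definition on made-up contig lengths; the function name and the toy values are illustrative and are not taken from the W05 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (arbitrary units), total = 300
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # → 70
```

In the toy example the two longest contigs (80 + 70 = 150) reach half of the 300-unit total, so the N50 is 70.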
RnBeads 2.0: comprehensive analysis of DNA methylation data
DNA methylation is a widely investigated epigenetic mark with important roles in development and disease. High-throughput assays enable genome-scale DNA methylation analysis in large numbers of samples.
Here, we describe a new version of our RnBeads software – an R/Bioconductor package that implements start-to-finish analysis workflows for Infinium microarrays and various types of bisulfite sequencing. RnBeads 2.0 (https://rnbeads.org/) provides additional data types and analysis methods, new functionality for interpreting DNA methylation differences, improved usability with a novel graphical user interface, and better use of computational resources. We demonstrate RnBeads 2.0 in four re-runnable use cases focusing on cell differentiation and cancer. Reference
Network-based prediction of drug combinations
Drug combinations, offering increased therapeutic efficacy and reduced toxicity, play an important role in treating multiple complex diseases. Yet, our ability to identify and validate effective combinations is limited by a combinatorial explosion, driven by both the large number of drug pairs as well as dosage combinations.
Here we propose a network-based methodology to identify clinically efficacious drug combinations for specific diseases. By quantifying the network-based relationship between drug targets and disease proteins in the human protein–protein interactome, we show the existence of six distinct classes of drug–drug–disease combinations. Reference
Best practices for benchmarking germline small-variant calls in human genomes
Standardized benchmarking approaches are required to assess the accuracy of variants called from sequence data. Although variant-calling tools and the metrics used to assess their performance continue to improve, important challenges remain.
Here, as part of the Global Alliance for Genomics and Health (GA4GH), we present a benchmarking framework for variant calling. We provide guidance on how to match variant calls with different representations, define standard performance metrics, and stratify performance by variant type and genome context. We describe limitations of high-confidence calls and regions that can be used as truth sets (for example, single-nucleotide variant concordance of two methods is 99.7% inside versus 76.5% outside high-confidence regions). Reference
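The idea of standard performance metrics stratified by variant type can be illustrated with a small sketch. It assumes that variant matching has already produced true-positive, false-positive, and false-negative counts for each stratum; the function name, the stratum labels, and all counts are hypothetical and are not results from the framework described above.

```python
def benchmark_metrics(tp, fp, fn):
    """Return (precision, recall, F1) from matched variant counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts per stratum: (TP, FP, FN)
counts_by_stratum = {
    "SNV": (9900, 50, 100),
    "indel": (900, 60, 150),
}

for stratum, (tp, fp, fn) in counts_by_stratum.items():
    precision, recall, f1 = benchmark_metrics(tp, fp, fn)
    print(f"{stratum}: precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Reporting the metrics per stratum rather than pooled keeps a large, easy class of variants from masking weaker performance on a smaller, harder one.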
Interrogation of human hematopoiesis at single-cell and single-variant resolution
Widespread linkage disequilibrium and incomplete annotation of cell-to-cell state variation represent substantial challenges to elucidating mechanisms of trait-associated genetic variation.
Here we perform genetic fine-mapping for blood cell traits in the UK Biobank to identify putative causal variants. These variants are enriched in genes encoding proteins in trait-relevant biological pathways and in accessible chromatin of hematopoietic progenitors. For regulatory variants, we explore patterns of developmental enhancer activity, predict molecular mechanisms, and identify likely target genes. In several instances, we localize multiple independent variants to the same regulatory element or gene. We further observe that variants with pleiotropic effects preferentially act in common progenitor populations to direct the production of distinct lineages. Reference
Whole-genome resequencing reveals Brassica napus origin and genetic loci involved in its improvement
Brassica napus (2n = 4x = 38, AACC) is an important allopolyploid crop derived from interspecific crosses between Brassica rapa (2n = 2x = 20, AA) and Brassica oleracea (2n = 2x = 18, CC). However, no truly wild B. napus populations are known; its origin and improvement processes remain unclear.
Here, we resequence 588 B. napus accessions. We find that the A subgenome may have evolved from the ancestor of European turnip and the C subgenome may have evolved from the common ancestor of kohlrabi, cauliflower, broccoli, and Chinese kale. Additionally, winter oilseed may be the original form of B. napus. Subgenome-specific selection of defense-response genes has contributed to environmental adaptation after formation of the species, whereas asymmetrical subgenomic selection has led to ecotype change. Reference
Topological scoring of protein interaction networks
It remains a significant challenge to define individual protein associations within networks where an individual protein can directly interact with other proteins and/or be part of large complexes, which contain functional modules.
Here we demonstrate the topological scoring (TopS) algorithm for the analysis of quantitative proteomic datasets from affinity purifications. Data is analyzed in a parallel fashion where a prey protein is scored in an individual affinity purification by aggregating information from the entire dataset. Topological scores span a broad range of values indicating the enrichment of an individual protein in every bait protein purification. TopS is applied to interaction networks derived from human DNA repair proteins and yeast chromatin remodeling complexes. Reference
I-Boost: an integrative boosting approach for predicting survival time with multiple genomics platforms
We propose a statistical boosting method, termed I-Boost, to integrate multiple types of high-dimensional genomics data with clinical data for predicting survival time. I-Boost provides substantially higher prediction accuracy than existing methods.
By applying I-Boost to The Cancer Genome Atlas, we show that the integration of multiple genomics platforms with clinical variables improves the prediction of survival time over the use of clinical variables alone; gene expression values are typically more prognostic of survival time than other genomics data types; and gene modules/signatures are at least as prognostic as the collection of individual gene expression data. Reference
Osteogenesis depends on commissioning of a network of stem cell transcription factors that act as repressors of adipogenesis
Mesenchymal (stromal) stem cells (MSCs) constitute populations of mesodermal multipotent cells involved in tissue regeneration and homeostasis in many different organs.
Here we performed comprehensive characterization of the transcriptional and epigenomic changes associated with osteoblast and adipocyte differentiation of human MSCs. We demonstrate that adipogenesis is driven by considerable remodeling of the chromatin landscape and de novo activation of enhancers, whereas osteogenesis involves activation of preestablished enhancers. Using machine learning algorithms for in silico modeling of transcriptional regulation, we identify a large and diverse transcriptional network of pro-osteogenic and antiadipogenic transcription factors. Reference
GWAS identifies genetic loci for self-reported habitual sleep duration supported by accelerometer-derived estimates
Sleep is an essential state of decreased activity and alertness but molecular factors regulating sleep duration remain unknown. Through genome-wide association analysis in 446,118 adults of European ancestry from the UK Biobank, we identify 78 loci for self-reported habitual sleep duration (p < 5 × 10⁻⁸; 43 loci at p < 6 × 10⁻⁹).
Replication is observed for PAX8, VRK2, and FBXL12/UBL5/PIN1 loci in the CHARGE study (n = 47,180; p < 6.3 × 10⁻⁴), and 55 signals show sign-concordant effects. The 78 loci further associate with accelerometer-derived sleep duration, daytime inactivity, sleep efficiency and number of sleep bouts in secondary analysis (n = 85,499). Loci are enriched for pathways including striatum and subpallium development, mechanosensory response, dopamine binding, synaptic neurotransmission and plasticity, among others. Reference
Genome-scale network model of metabolism and histone acetylation reveals metabolic dependencies of histone deacetylase inhibitors
Histone acetylation plays a central role in gene regulation and is sensitive to the levels of metabolic intermediates. However, predicting the impact of metabolic alterations on acetylation in pathological conditions is a significant challenge.
Here, we present a genome-scale network model that predicts the impact of nutritional environment and genetic alterations on histone acetylation. It identifies cell types that are sensitive to histone deacetylase inhibitors based on their metabolic state, and we validate metabolites that alter drug sensitivity. Our model provides a mechanistic framework for predicting how metabolic perturbations contribute to epigenetic changes and sensitivity to deacetylase inhibitors. Reference
A genome-wide association analysis identifies 16 novel susceptibility loci for carpal tunnel syndrome
Carpal tunnel syndrome (CTS) is a common and disabling condition of the hand caused by entrapment of the median nerve at the level of the wrist. It is the commonest entrapment neuropathy, with estimates of prevalence ranging between 5–10%.
Here, we undertake a genome-wide association study (GWAS) of an entrapment neuropathy, using 12,312 CTS cases and 389,344 controls identified in UK Biobank. We discover 16 susceptibility loci for CTS with p < 5 × 10⁻⁸. We identify likely causal genes in the pathogenesis of CTS, including ADAMTS17, ADAMTS10 and EFEMP1, and using RNA sequencing demonstrate expression of these genes in surgically resected tenosynovium from CTS patients. We perform Mendelian randomisation and demonstrate a causal relationship between short stature and higher risk of CTS. Reference
Prioritizing Parkinson’s disease genes using population-scale transcriptomic data
Genome-wide association studies (GWAS) have identified over 41 susceptibility loci associated with Parkinson’s Disease (PD) but identifying putative causal genes and the underlying mechanisms remains challenging.
Here, we leverage large-scale transcriptomic datasets to prioritize genes that are likely to affect PD by using a transcriptome-wide association study (TWAS) approach. Using this approach, we identify 66 gene associations whose predicted expression or splicing levels in dorsolateral prefrontal cortex (DLPFC) and peripheral monocytes are significantly associated with PD risk. We uncover many novel genes associated with PD but also novel mechanisms for known associations such as MAPT, for which we find that variation in exon 3 splicing explains the common genetic association. Reference
MMSplice: modular modeling improves the predictions of genetic variant effects on splicing
Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing) with which we built the winning model of the CAGI5 exon skipping prediction challenge.
The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinct large-scale genomics datasets. These modules are combined to predict effects of variants on exon skipping, splice site choice, splicing efficiency, and pathogenicity, with matched or higher performance than state-of-the-art. Our models, available in the repository Kipoi, apply to variants including indels directly from VCF files. Reference
Colonic epithelial cell diversity in health and inflammatory bowel disease
The colonic epithelium facilitates host–microorganism interactions to control mucosal immunity, coordinate nutrient recycling and form a mucus barrier. Breakdown of the epithelial barrier underpins inflammatory bowel disease (IBD). However, the specific contributions of each epithelial-cell subtype to this process are unknown.
Here we profile single colonic epithelial cells from patients with IBD and unaffected controls. We identify previously unknown cellular subtypes, including gradients of progenitor cells, colonocytes and goblet cells within intestinal crypts. At the top of the crypts, we find a previously unknown absorptive cell, expressing the proton channel OTOP2 and the satiety peptide uroguanylin, that senses pH and is dysregulated in inflammation and cancer. In IBD, we observe a positional remodelling of goblet cells that coincides with downregulation of WFDC2—an antiprotease molecule that we find to be expressed by goblet cells and that inhibits bacterial growth. Reference
An atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics
Aging promotes lung function decline and susceptibility to chronic lung diseases, which are the third leading cause of death worldwide. Here, we use single cell transcriptomics and mass spectrometry-based proteomics to quantify changes in cellular activity states across 30 cell types and chart the lung proteome of young and old mice.
We show that aging leads to increased transcriptional noise, indicating deregulated epigenetic control. We observe cell type-specific effects of aging, uncovering increased cholesterol biosynthesis in type-2 pneumocytes and lipofibroblasts and altered relative frequency of airway epithelial cells as hallmarks of lung aging. Reference
A network-centric approach to drugging TNF-induced NF-κB signaling
Target-centric drug development strategies prioritize single-target potency in vitro and do not account for connectivity and multi-target effects within a signal transduction network.
Here, we present a systems biology approach that combines transcriptomic and structural analyses with live-cell imaging to predict small molecule inhibitors of TNF-induced NF-κB signaling and elucidate the network response. We identify two first-in-class small molecules that inhibit the NF-κB signaling pathway by preventing the maturation of a rate-limiting multiprotein complex necessary for IKK activation. Our findings suggest that a network-centric drug discovery approach is a promising strategy to evaluate the impact of pharmacologic intervention in signaling. Reference
Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modeling
DNase-seq and ATAC-seq are broadly used methods to assay open chromatin regions genome-wide. The single nucleotide resolution of DNase-seq has been further exploited to infer transcription factor binding sites (TFBSs) in regulatory regions through footprinting.
Here, we undertake a systematic comparison of the two methods and show that a modification to the ATAC-seq protocol increases its yield and its agreement with DNase-seq data from the same cell line. We demonstrate that the two methods have distinct sequence biases and correct for these protocol-specific biases when performing footprinting. Reference
Ediacaran biozones identified with network analysis provide evidence for pulsed extinctions of early complex life
Rocks of Ediacaran age (~635–541 Ma) contain the oldest fossils of large, complex organisms and their behaviors. These fossils document developmental and ecological innovations, and suggest that extinctions helped to shape the trajectory of early animal evolution.
Conventional methods divide Ediacaran macrofossil localities into taxonomically distinct clusters, which may represent evolutionary, environmental, or preservational variation. Here, we investigate these possibilities with network analysis of body and trace fossil occurrences. By partitioning multipartite networks of taxa, paleoenvironments, and geologic formations into community units, we distinguish between biostratigraphic zones and paleoenvironmentally restricted biotopes, and provide empirically robust and statistically significant evidence for a global, cosmopolitan assemblage unique to terminal Ediacaran strata. Reference
Epigenetic signatures associated with imprinted paternally expressed genes in the Arabidopsis endosperm
Imprinted genes are epigenetically modified during gametogenesis and maintain the established epigenetic signatures after fertilization, causing parental-specific gene expression.
In this study, we show that imprinted paternally expressed genes (PEGs) in the Arabidopsis endosperm are marked by an epigenetic signature of Polycomb Repressive Complex2 (PRC2)-mediated H3K27me3 together with heterochromatic H3K9me2 and CHG methylation, which specifically mark the silenced maternal alleles of PEGs. Reference
Integrated analysis of population genomics, transcriptomics and virulence provides novel insights into Streptococcus pyogenes pathogenesis
Streptococcus pyogenes causes 700 million human infections annually worldwide, yet, despite a century of intensive effort, there is no licensed vaccine against this bacterium.
Although a number of large-scale genomic studies of bacterial pathogens have been published, the relationships among the genome, transcriptome, and virulence in large bacterial populations remain poorly understood. We sequenced the genomes of 2,101 emm28 S. pyogenes invasive strains, from which we selected 492 phylogenetically diverse strains for transcriptome analysis and 50 strains for virulence assessment. Data integration provided a novel understanding of the virulence mechanisms of this model organism. Reference
Genomic analyses of an extensive collection of wild and cultivated accessions provide new insights into peach breeding history
Human selection has a long history of transforming crop genomes. Peach (Prunus persica) has undergone more than 5000 years of domestication that led to remarkable changes in a series of agronomically important traits, but genetic bases underlying these changes and the effects of artificial selection on genomic diversity are not well understood.
Here, we report a comprehensive analysis of peach evolution based on genome sequences of 480 wild and cultivated accessions. By focusing on a set of quantitative trait loci (QTLs), we provide evidence that distinct phases of domestication and improvement have led to an increase in fruit size and taste and extended its geographic distribution. Reference
Precise tuning of gene expression levels in mammalian cells
Precise, analogue regulation of gene expression is critical for cellular function in mammals. In contrast, widely employed experimental and therapeutic approaches such as knock-in/out strategies are more suitable for binary control of gene activity.
Here we report on a method for precise control of gene expression levels in mammalian cells using engineered microRNA response elements (MREs). First, we measure the efficacy of thousands of synthetic MRE variants under the control of an endogenous microRNA by high-throughput sequencing. Guided by this data, we establish a library of microRNA silencing-mediated fine-tuners (miSFITs) of varying strength that can be employed to precisely control the expression of user-specified genes. We apply this technology to tune the T-cell co-inhibitory receptor PD-1 and to explore how antigen expression influences T-cell activation and tumour growth. Finally, we employ CRISPR/Cas9 mediated homology directed repair to introduce miSFITs into the BRCA1 3′UTR, demonstrating that this versatile tool can be used to tune endogenous genes. Reference
Multi-omic measurements of heterogeneity in HeLa cells across laboratories
Reproducibility in research can be compromised by both biological and technical variation, but most of the focus is on removing the latter. Here we investigate the effects of biological variation in HeLa cell lines using a systems-wide approach.
We determine the degree of molecular and phenotypic variability across 14 stock HeLa samples from 13 international laboratories. We cultured cells in uniform conditions and profiled genome-wide copy numbers, mRNAs, proteins and protein turnover rates in each cell line. We discovered substantial heterogeneity between HeLa variants, especially between lines of the CCL2 and Kyoto varieties, and observed progressive divergence within a specific cell line over 50 successive passages. Reference
Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data
t-distributed stochastic neighbor embedding (t-SNE) is widely used for visualizing single-cell RNA-sequencing (scRNA-seq) data, but it scales poorly to large datasets.
We dramatically accelerate t-SNE, obviating the need for data downsampling, and hence allowing visualization of rare cell populations. Furthermore, we implement a heatmap-style visualization for scRNA-seq based on one-dimensional t-SNE for simultaneously visualizing the expression patterns of thousands of genes. Reference
A mathematical-descriptor of tumor-mesoscopic-structure from CT images annotates prognostic- and molecular-phenotypes of epithelial ovarian cancer
The five-year survival rate of epithelial ovarian cancer (EOC) is approximately 35–40% despite maximal treatment efforts, highlighting a need for stratification biomarkers for personalized treatment.
Here we extract 657 quantitative mathematical descriptors from the preoperative CT images of 364 EOC patients at their initial presentation. Using machine learning, we derive a non-invasive summary-statistic of the primary ovarian tumor based on 4 descriptors, which we name “Radiomic Prognostic Vector” (RPV). RPV reliably identifies the 5% of patients with median overall survival less than 2 years, significantly improves established prognostic methods, and is validated in two independent, multi-center cohorts. Reference