A cellular census of human lungs identifies novel cell states in health and in asthma
Human lungs enable efficient gas exchange and form an interface with the environment, which depends on mucosal immunity for protection against infectious agents. Tightly controlled interactions between structural and immune cells are required to maintain lung homeostasis.
Here, we use single-cell transcriptomics to chart the cellular landscape of upper and lower airways and lung parenchyma in healthy lungs, and lower airways in asthmatic lungs. We report location-dependent airway epithelial cell states and a novel subset of tissue-resident memory T cells. In the lower airways of patients with asthma, mucous cell hyperplasia is shown to stem from a novel mucous ciliated cell state, as well as goblet cell hyperplasia. Reference
Cold stress induces enhanced chromatin accessibility and bivalent histone modifications H3K4me3 and H3K27me3 of active genes in potato
Cold stress can greatly affect plant growth and development. Plants have developed special systems to respond to and tolerate cold stress. While plant scientists have discovered numerous genes involved in responses to cold stress, few studies have been dedicated to investigation of genome-wide chromatin dynamics induced by cold or other abiotic stresses.
Genomic regions containing active cis-regulatory DNA elements can be identified as DNase I hypersensitive sites (DHSs). We develop high-resolution DHS maps in potato (Solanum tuberosum) using chromatin isolated from tubers stored under room (22 °C) and cold (4 °C) conditions. We find that cold stress induces a large number of DHSs enriched in genic regions which are frequently associated with differential gene expression in response to temperature variation. Reference
Predicting three-dimensional genome organization with chromatin states
We introduce a computational model to simulate chromatin structure and dynamics. Starting from one-dimensional genomics and epigenomics data that are available for hundreds of cell types, this model enables de novo prediction of chromatin structures at five-kilo-base resolution.
Simulated chromatin structures recapitulate known features of genome organization, including the formation of chromatin loops, topologically associating domains (TADs) and compartments, and are in quantitative agreement with chromosome conformation capture experiments and super-resolution microscopy measurements. Detailed characterization of the predicted structural ensemble reveals the dynamical flexibility of chromatin loops and the presence of cross-talk among neighboring TADs. Analysis of the model’s energy function uncovers distinct mechanisms for chromatin folding at various length scales and suggests a need to go beyond simple A/B compartment types to predict specific contacts between regulatory elements using polymer simulations. Reference
HumanMycobiomeScan: a new bioinformatics tool for the characterization of the fungal fraction in metagenomic samples
Modern metagenomic analysis of complex microbial communities produces large amounts of sequence data containing information on the microbiome in terms of bacterial, archaeal, viral and eukaryotic composition.
HumanMycobiomeScan is a bioinformatics tool for the taxonomic profiling of the mycobiome directly from raw next-generation sequencing data. The tool uses hierarchical databases of fungi to unambiguously assign reads to fungal species more accurately and more than 10,000 times faster than other comparable approaches. HumanMycobiomeScan was validated using in silico generated synthetic communities and then applied to metagenomic data to characterize the intestinal fungal components in subjects adhering to different subsistence strategies. Reference
Invasive DNA elements modify the nuclear architecture of their insertion site by KNOT-linked silencing in Arabidopsis thaliana
The three-dimensional (3D) organization of chromosomes is linked to epigenetic regulation and transcriptional activity. However, only a few functional features of 3D chromatin architecture have been described to date.
Here, we report the KNOT’s involvement in regulating invasive DNA elements. Transgenes can specifically interact with the KNOT, leading to perturbations of 3D nuclear organization, which correlates with the transgene’s expression: high KNOT interaction frequencies are associated with transgene silencing. KNOT-linked silencing (KLS) cannot readily be connected to canonical silencing mechanisms, such as RNA-directed DNA methylation and post-transcriptional gene silencing, as both cytosine methylation and small RNA abundance do not correlate with KLS. Reference
COMPASS for rapid combinatorial optimization of biochemical pathways based on artificial transcription factors
Balanced expression of multiple genes is central for establishing new biosynthetic pathways or multiprotein cellular complexes. Methods for efficient combinatorial assembly of regulatory sequences (promoters) and protein coding sequences are therefore in high demand.
Here, we report a high-throughput cloning method, called COMPASS for COMbinatorial Pathway ASSembly, for the balanced expression of multiple genes in Saccharomyces cerevisiae. COMPASS employs orthogonal, plant-derived artificial transcription factors (ATFs) and homologous recombination-based cloning for the generation of thousands of individual DNA constructs in parallel. Reference
Codon usage optimization in pluripotent embryonic stem cells
The uneven use of synonymous codons in the transcriptome regulates the efficiency and fidelity of protein translation rates. Yet, the importance of this codon bias in regulating cell state-specific expression programmes is currently debated.
Here, we ask whether different codon usage controls gene expression programmes in self-renewing and differentiating embryonic stem cells. Using ribosome and transcriptome profiling, we identify distinct codon signatures during human embryonic stem cell differentiation. We find that cell state-specific codon bias is determined by the guanine-cytosine (GC) content of differentially expressed genes. Reference
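The GC-content measure mentioned above can be illustrated with a minimal sketch. The function names `gc_content` and `gc3_content` are hypothetical, and the third-codon-position variant (often called GC3) is an assumption added for illustration; the abstract does not say which GC measure the authors used.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in a nucleotide sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def gc3_content(cds: str) -> float:
    """GC fraction at the third position of each complete codon.

    Assumes `cds` is an in-frame coding sequence starting at codon position 1.
    """
    # Indices 2, 5, 8, ... are the third bases of complete codons.
    third_positions = cds.upper()[2::3]
    if not third_positions:
        raise ValueError("sequence shorter than one codon")
    return sum(base in "GC" for base in third_positions) / len(third_positions)


if __name__ == "__main__":
    toy_cds = "ATGGCCAAGTAA"  # toy sequence: ATG GCC AAG TAA
    print(gc_content(toy_cds), gc3_content(toy_cds))
```

Comparing such values between gene sets (for example, genes up-regulated in self-renewing versus differentiating cells) is one simple way to look for a GC-driven codon signature.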
A high-density BAC physical map covering the entire MHC region of addax antelope genome
The mammalian major histocompatibility complex (MHC) harbours clusters of genes associated with the immunological defence of animals against infectious pathogens. At present, no complete MHC physical map is available for any of the wild ruminant species in the world.
The high-density physical map is composed of two contigs of 47 overlapping bacterial artificial chromosome (BAC) clones, with an average of 115 Kb for each BAC, covering the entire addax MHC genome. The first contig has 40 overlapping BAC clones covering an approximately 2.9 Mb region of MHC class I, class III, and class IIa, and the second contig has 7 BAC clones covering an approximately 500 Kb genomic region that harbours MHC class IIb. Reference
Comprehensive study of nuclear receptor DNA binding provides a revised framework for understanding receptor specificity
The type II nuclear receptors (NRs) function as heterodimeric transcription factors with the retinoid X receptor (RXR) to regulate diverse biological processes in response to endogenous ligands and therapeutic drugs. DNA-binding specificity has been proposed as a primary mechanism for NR gene regulatory specificity.
Here we use protein-binding microarrays (PBMs) to comprehensively analyze the DNA binding of 12 NR:RXRα dimers. We find more promiscuous NR-DNA binding than has been reported, challenging the view that NR binding specificity is defined by half-site spacing. We show that NRs bind DNA using two distinct modes, explaining widespread NR binding to half-sites in vivo. Finally, we show that the current models of NR specificity better reflect binding-site activity rather than binding-site affinity. Reference
Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer
In most cases of sporadic colorectal cancers, tumorigenesis is a multistep process, involving genomic alterations in parallel with morphologic changes. In addition, accumulating evidence suggests that the human gut microbiome is linked to the development of colorectal cancer.
Here we performed fecal metagenomic and metabolomic studies on samples from a large cohort of 616 participants who underwent colonoscopy to assess taxonomic and functional characteristics of gut microbiota and metabolites. Microbiome and metabolome shifts were apparent in cases of multiple polypoid adenomas and intramucosal carcinomas, in addition to more advanced lesions. We found two distinct patterns of microbiome elevations. Reference
The Genomic and Immune Landscapes of Lethal Metastatic Breast Cancer
The detailed molecular characterization of lethal cancers is a prerequisite to understanding resistance to therapy and escape from cancer immunoediting. We performed extensive multi-platform profiling of multi-regional metastases in autopsies from 10 patients with therapy-resistant breast cancer.
The integrated genomic and immune landscapes show that metastases propagate and evolve as communities of clones, reveal their predicted neo-antigen landscapes, and show that they can accumulate HLA loss of heterozygosity (LOH). The data further identify variable tumor microenvironments and reveal, through analyses of T cell receptor repertoires, that adaptive immune responses appear to co-evolve with the metastatic genomes. These findings reveal in fine detail the landscapes of lethal metastatic breast cancer. Reference
Diabetes causes marked inhibition of mitochondrial metabolism in pancreatic β-cells
Diabetes is a global health problem caused primarily by the inability of pancreatic β-cells to secrete adequate levels of insulin. The molecular mechanisms underlying the progressive failure of β-cells to respond to glucose in type-2 diabetes remain unresolved.
Using a combination of transcriptomics and proteomics, we find significant dysregulation of major metabolic pathways in islets of diabetic βV59M mice, a non-obese, eulipidaemic diabetes model. Multiple genes/proteins involved in glycolysis/gluconeogenesis are upregulated, whereas those involved in oxidative phosphorylation are downregulated. In isolated islets, glucose-induced increases in NADH and ATP are impaired and both oxidative and glycolytic glucose metabolism are reduced. Reference
Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data
DNA base modifications, such as C5-methylcytosine (5mC) and N6-methyldeoxyadenosine (6mA), are important types of epigenetic regulation. Short-read bisulfite sequencing and long-read PacBio sequencing have inherent limitations in detecting DNA modifications.
Here, using raw electric signals of Oxford Nanopore long-read sequencing data, we design DeepMod, a bidirectional recurrent neural network (RNN) with long short-term memory (LSTM) to detect DNA modifications. We sequence a human genome HX1 and a Chlamydomonas reinhardtii genome using Nanopore sequencing, and then evaluate DeepMod on three types of genomes (Escherichia coli, Chlamydomonas reinhardtii and human genomes). Reference
A practical guide to methods controlling false discoveries in computational biology
In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as informative covariates to prioritize, weight, and group hypotheses.
However, there is currently no consensus on how the modern methods compare to one another. We investigate the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology. Reference
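As an illustration of a classic FDR method that takes only p values as input, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. The abstract does not name the two classic methods that were benchmarked, so the choice of Benjamini-Hochberg is an assumption, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans, True where a hypothesis is rejected.

    Step-up rule: find the largest rank k such that the k-th smallest
    p value is <= (k / m) * alpha, then reject the k smallest p values.
    """
    m = len(pvalues)
    # Indices of the hypotheses, ordered by increasing p value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042,
                   0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(toy_pvalues, alpha=0.05))
```

The covariate-aware methods described above differ in that they use additional per-hypothesis information to weight or group the tests, rather than treating all p values exchangeably as this sketch does.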
Comprehensively benchmarking applications for detecting copy number variation
Recently, copy number variation (CNV) has gained considerable interest as a type of genomic variation that plays an important role in complex phenotypes and disease susceptibility. Since a number of CNV detection methods have recently been developed, it is necessary to help investigators choose suitable methods for CNV detection depending on their objectives.
For this reason, this study compared ten commonly used CNV detection applications, including CNVnator, ReadDepth, RDXplorer, LUMPY and Control-FREEC, benchmarking the applications by sensitivity, specificity and computational demands. Taking the DGV gold standard variants as a standard dataset, we evaluated the ten applications with real sequencing data at sequencing depths from 5X to 50X. Among the ten methods benchmarked, LUMPY performs best, combining high sensitivity with high specificity at each sequencing depth. When high specificity is the priority, Canvas is also a good choice. If high sensitivity is preferred, CNVnator and RDXplorer are better choices. Reference
Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments
Single cell RNA-sequencing (scRNA-seq) technology has undergone rapid development in recent years, leading to an explosion in the number of tailored data analysis methods. However, the current lack of gold-standard benchmark datasets makes it difficult for researchers to systematically compare the performance of the many methods available.
Here, we generated a realistic benchmark experiment that included single cells and admixtures of cells or RNA to create ‘pseudo cells’ from up to five distinct cancer cell lines. In total, 14 datasets were generated using both droplet and plate-based scRNA-seq protocols. We compared 3,913 combinations of data analysis methods for tasks ranging from normalization and imputation to clustering, trajectory analysis and data integration. Reference
Associating somatic mutations to clinical outcomes: a pan-cancer study of survival time
We developed subclone multiplicity allocation and somatic heterogeneity (SMASH), a new statistical method for intra-tumor heterogeneity (ITH) inference. SMASH is tailored to the purpose of large-scale association studies with one tumor sample per patient.
In a pan-cancer study of 14 cancer types, we studied the associations between survival time and ITH quantified by SMASH, together with other features of somatic mutations. Our results show that ITH is associated with survival time in several cancer types and its effect can be modified by other covariates, such as mutation burden. Reference
A few Ascomycota taxa dominate soil fungal communities worldwide
Although soil fungi have key functions in terrestrial ecosystems, information on the dominant soil fungi and their ecological preferences at the global scale is lacking. To fill this knowledge gap, we surveyed 235 soils from across the globe. Our findings indicate that 83 phylotypes (<0.1% of the retrieved fungi), mostly belonging to wind-dispersed, generalist Ascomycota, dominate soils globally.
We identify patterns and ecological drivers of dominant soil fungal taxa occurrence, and present a map of their distribution in soils worldwide. Whole-genome comparisons with less dominant, generalist fungi point to a significantly higher number of genes related to stress-tolerance and resource uptake in the dominant fungi, suggesting that they might be better at colonising a wide range of environments. Reference
WhoGEM: an admixture-based prediction machine accurately predicts quantitative functional traits in plants
The explosive growth of genomic data provides an opportunity to make increased use of sequence variations for phenotype prediction. We have developed a prediction machine for quantitative phenotypes (WhoGEM) that overcomes some of the bottlenecks limiting the current methods.
We demonstrated its performance by predicting quantitative disease resistance and quantitative functional traits in the wild model plant species, Medicago truncatula, using geographical locations as covariates for admixture analysis. The method’s prediction reliability equals or outperforms all existing algorithms for quantitative phenotype prediction. WhoGEM analysis produces evidence that variation in genome admixture proportions explains most of the phenotypic variation for quantitative phenotypes. Reference
OSCA: a tool for omic-data-based complex trait analysis
The rapid increase of omic data has greatly facilitated the investigation of associations between omic profiles such as DNA methylation (DNAm) and complex traits in large cohorts.
Here, we propose a mixed-linear-model-based method called MOMENT that tests for association between a DNAm probe and trait with all other distal probes fitted in multiple random-effect components to account for unobserved confounders. We demonstrate by simulations that MOMENT shows a lower false positive rate and more robustness than existing methods. MOMENT has been implemented in a versatile software package called OSCA together with a number of other implementations for omic-data-based analyses. Reference
qDSB-Seq is a general method for genome-wide quantification of DNA double-strand breaks using sequencing
DNA double-strand breaks (DSBs) are among the most lethal types of DNA damage and frequently cause genome instability. Sequencing-based methods for mapping DSBs have been developed but they allow measurement only of relative frequencies of DSBs between loci, which limits our understanding of the physiological relevance of detected DSBs.
Here we propose quantitative DSB sequencing (qDSB-Seq), a method providing both DSB frequencies per cell and their precise genomic coordinates. We induce spike-in DSBs by a site-specific endonuclease and use them to quantify detected DSBs (labeled, e.g., using i-BLESS). Utilizing qDSB-Seq, we determine numbers of DSBs induced by a radiomimetic drug and replication stress, and reveal two orders of magnitude differences in DSB frequencies. Reference
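The spike-in idea above can be sketched as simple arithmetic. This is a minimal illustration under stated assumptions, not the published qDSB-Seq pipeline: it assumes that labeled read counts scale linearly with DSB numbers, and that the fraction of cells cut at each enzyme site is known from an independent measurement. The function name and all parameter names are hypothetical.

```python
def dsbs_per_cell(reads_genome, reads_spike_in, cut_fraction, n_spike_sites):
    """Estimate the absolute number of studied DSBs per cell.

    reads_genome    labeled reads from the DSBs of interest (spike-in sites excluded)
    reads_spike_in  labeled reads at the endonuclease (spike-in) sites
    cut_fraction    assumed fraction of cells cut at each spike-in site (0..1)
    n_spike_sites   number of endonuclease recognition sites in the genome
    """
    if reads_spike_in <= 0:
        raise ValueError("need at least one spike-in read")
    # Known absolute number of spike-in breaks per cell.
    spike_dsbs_per_cell = cut_fraction * n_spike_sites
    # Scale the relative read ratio by the known spike-in break number.
    return reads_genome / reads_spike_in * spike_dsbs_per_cell


if __name__ == "__main__":
    # Toy numbers, chosen only for illustration.
    print(dsbs_per_cell(200_000, 50_000, cut_fraction=0.5, n_spike_sites=2))
```

The design point is that the spike-in converts a relative quantity (reads at one locus versus another) into an absolute one (breaks per cell), which is what makes comparisons across treatments possible.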
Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data
We introduce quanTIseq, a method to quantify the fractions of ten immune cell types from bulk RNA-sequencing data. quanTIseq was extensively validated in blood and tumor samples using simulated, flow cytometry, and immunohistochemistry data.
quanTIseq analysis of 8000 tumor samples revealed that cytotoxic T cell infiltration is more strongly associated with the activation of the CXCR3/CXCL9 axis than with mutational load and that deconvolution-based cell scores have prognostic value in several solid cancers. Finally, we used quanTIseq to show how kinase inhibitors modulate the immune contexture and to reveal immune-cell types that underlie differential patients’ responses to checkpoint blockers. Reference
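The abstract does not describe how quanTIseq estimates cell fractions. A common way to pose bulk deconvolution is as a non-negative least-squares problem, in which a bulk expression profile is modelled as a weighted sum of cell-type signature profiles. The sketch below solves that problem with projected gradient descent in pure Python; the solver, the function name `deconvolve`, and the toy signature matrix are all assumptions made for illustration and are not the quanTIseq implementation.

```python
def deconvolve(signature, mixture, n_iter=500):
    """Estimate non-negative cell-type fractions f such that signature @ f ~ mixture.

    signature  list of rows, one per gene; each row holds one value per cell type
    mixture    bulk expression value per gene
    """
    n_genes = len(signature)
    n_types = len(signature[0])
    # Step size 1 / trace(S^T S), which never exceeds 1 / (largest eigenvalue),
    # so plain gradient descent on the squared error cannot diverge.
    step = 1.0 / sum(v * v for row in signature for v in row)
    fractions = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [
            sum(signature[g][t] * fractions[t] for t in range(n_types)) - mixture[g]
            for g in range(n_genes)
        ]
        gradient = [
            sum(signature[g][t] * residual[g] for g in range(n_genes))
            for t in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        fractions = [max(0.0, fractions[t] - step * gradient[t]) for t in range(n_types)]
    return fractions


if __name__ == "__main__":
    toy_signature = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]  # 3 genes x 2 cell types
    toy_mixture = [3.0, 7.0, 5.0]                            # built from fractions 0.3 and 0.7
    print(deconvolve(toy_signature, toy_mixture))
```

In practice a dedicated solver would replace the hand-written loop, and the estimated fractions need not sum to one when part of the sample consists of cell types absent from the signature matrix.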
ChiCMaxima: a robust and simple pipeline for detection and visualization of chromatin looping in Capture Hi-C
Capture Hi-C (CHi-C) is a new technique for assessing genome organization based on chromosome conformation capture coupled to oligonucleotide capture of regions of interest, such as gene promoters.
Chromatin loop detection is challenging because existing Hi-C/4C-like tools, which make different assumptions about the technical biases presented, are often unsuitable. We describe a new approach, ChiCMaxima, which uses local maxima combined with limited filtering to detect DNA looping interactions, integrating information from biological replicates. ChiCMaxima shows more stringency and robustness compared to previously developed tools. Reference
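The local-maxima idea can be illustrated with a minimal sketch on a one-dimensional interaction profile (signal per restriction fragment around a bait). This is not the ChiCMaxima implementation: the window size, the height filter, and the function name `local_maxima` are assumptions made for illustration, and smoothing and replicate integration are omitted.

```python
def local_maxima(signal, window=1, min_height=0.0):
    """Return indices whose value is strictly greater than every neighbour
    within `window` positions on either side and is at least `min_height`."""
    signal = list(signal)
    peaks = []
    for i, value in enumerate(signal):
        lo = max(0, i - window)
        hi = min(len(signal), i + window + 1)
        neighbours = signal[lo:i] + signal[i + 1:hi]
        if value >= min_height and neighbours and all(value > n for n in neighbours):
            peaks.append(i)
    return peaks


if __name__ == "__main__":
    toy_profile = [1, 3, 2, 5, 4, 4, 6, 1]  # toy read counts per fragment
    print(local_maxima(toy_profile, window=1))
    print(local_maxima(toy_profile, window=2))
```

A wider window makes the call more stringent because a candidate must dominate a larger neighbourhood; one simple way to integrate replicates would be to keep only peaks found at the same or nearby positions in each replicate.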
Genome-scale screens identify JNK–JUN signaling as a barrier for pluripotency exit and endoderm differentiation
Human embryonic stem cells (ESCs) and human induced pluripotent stem cells hold great promise for cell-based therapies and drug discovery. However, homogeneous differentiation remains a major challenge, highlighting the need for understanding developmental mechanisms.
We performed genome-scale CRISPR screens to uncover regulators of definitive endoderm (DE) differentiation, which unexpectedly uncovered five Jun N-terminal kinase (JNK)–JUN family genes as key barriers of DE differentiation. The JNK–JUN pathway does not act through directly inhibiting the DE enhancers. Instead, JUN co-occupies ESC enhancers with OCT4, NANOG, SMAD2 and SMAD3, and specifically inhibits the exit from the pluripotent state by impeding the decommissioning of ESC enhancers and inhibiting the reconfiguration of SMAD2 and SMAD3 chromatin binding from ESC to DE enhancers. Reference
Transcriptional cofactors display specificity for distinct types of core promoters
Transcriptional cofactors (COFs) communicate regulatory cues from enhancers to promoters and are central effectors of transcription activation and gene expression.
Although some COFs have been shown to prefer certain promoter types over others, the extent to which different COFs display intrinsic specificities for distinct promoters is unclear. Here we use a high-throughput promoter-activity assay in Drosophila melanogaster S2 cells to screen 23 COFs for their ability to activate 72,000 candidate core promoters (CPs). We observe differential activation of CPs, indicating distinct regulatory preferences or ‘compatibilities’ between COFs and specific types of CPs. Reference
A systems biology approach uncovers cell-specific gene regulatory effects of genetic associations in multiple sclerosis
Genome-wide association studies (GWAS) have identified more than 50,000 unique associations with common human traits. While this represents a substantial step forward, establishing the biology underlying these associations has proven extremely difficult.
Even determining which cell types and which particular gene(s) are relevant continues to be a challenge. Here, we conduct a cell-specific pathway analysis of the latest GWAS in multiple sclerosis (MS), which had analyzed a total of 47,351 cases and 68,284 healthy controls and found more than 200 non-MHC genome-wide associations. Our analysis identifies pan immune cell as well as cell-specific susceptibility genes in T cells, B cells and monocytes. Reference
Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures
Changes in bulk transcriptional profiles of heterogeneous samples often reflect changes in proportions of individual cell types. Several robust techniques have been developed to dissect the composition of such mixed samples given transcriptional signatures of the pure components or their proportions.
These approaches are insufficient, however, in situations when no information about individual mixture components is available. This problem is known as the complete deconvolution problem, where the composition is revealed without any a priori knowledge about cell types and their proportions. Here, we identify a previously unrecognized property of tissue-specific genes – their mutual linearity – and use it to reveal the structure of the topological space of mixed transcriptional profiles and provide a noise-robust approach to the complete deconvolution problem. Reference
Genomic signatures accompanying the dietary shift to phytophagy in polyphagan beetles
The diversity and evolutionary success of beetles (Coleoptera) are proposed to be related to the diversity of plants on which they feed. Indeed, the largest beetle suborder, Polyphaga, mostly includes plant eaters among its approximately 315,000 species.
We explore the genomic consequences of beetle-plant trophic interactions by performing comparative gene family analyses across 18 species representative of the two most species-rich beetle suborders. We contrast the gene contents of species from the mostly plant-eating suborder Polyphaga with those of the mainly predatory Adephaga. We find gene repertoire evolution to be more dynamic, with significantly more adaptive lineage-specific expansions, in the more speciose Polyphaga. Reference
Host diet and evolutionary history explain different aspects of gut microbiome diversity among vertebrate clades
Multiple factors modulate microbial community assembly in the vertebrate gut, though studies disagree as to their relative contribution. One cause may be a reliance on captive animals, which can have very different gut microbiomes compared to their wild counterparts.
To resolve this disagreement, we analyze a new, large, and highly diverse animal distal gut 16S rRNA microbiome dataset, which comprises 80% wild animals and includes members of Mammalia, Aves, Reptilia, Amphibia, and Actinopterygii. We decouple the effects of host evolutionary history and diet on gut microbiome diversity and show that each factor modulates different aspects of diversity. Moreover, we resolve particular microbial taxa associated with host phylogeny or diet and show that Mammalia have a stronger signal of cophylogeny. Reference
In-depth human plasma proteome analysis captures tissue proteins and transfer of protein variants across the placenta
Here, we present a method for in-depth human plasma proteome analysis based on high-resolution isoelectric focusing HiRIEF LC-MS/MS, demonstrating high proteome coverage, reproducibility and the potential for liquid biopsy protein profiling.
By integrating genomic sequence information with the MS-based plasma proteome analysis, we enable detection of single amino acid variants and for the first time demonstrate transfer of multiple protein variants between mother and fetus across the placenta. We further show that our method can detect both low-abundance tissue-annotated proteins and phosphorylated proteins in plasma, and can quantitate differences in plasma proteomes between the mother and the newborn as well as changes related to pregnancy. Reference
VULCAN integrates ChIP-seq with patient-derived co-expression networks to identify GRHL2 as a key co-regulator of ERα at enhancers in breast cancer
VirtUaL ChIP-seq Analysis through Networks (VULCAN) infers regulatory interactions of transcription factors by overlaying networks generated from publicly available tumor expression data onto ChIP-seq data.
We apply our method to dissect the regulation of estrogen receptor-alpha activation in breast cancer to identify potential co-regulators of the estrogen receptor’s transcriptional response. Reference
Association analyses identify 31 new risk loci for colorectal cancer susceptibility
Colorectal cancer (CRC) is a leading cause of cancer-related death worldwide, and has a strong heritable basis. We report a genome-wide association analysis of 34,627 CRC cases and 71,379 controls of European ancestry that identifies SNPs at 31 new CRC risk loci.
We also identify eight independent risk SNPs at the new and previously reported European CRC loci, and a further nine CRC SNPs at loci previously only identified in Asian populations. We use in situ promoter capture Hi-C (CHi-C), gene expression, and in silico annotation methods to identify likely target genes of CRC SNPs. Whilst these new SNP associations implicate target genes that are enriched for known CRC pathways such as Wnt and BMP, they also highlight novel pathways with no prior links to colorectal tumourigenesis. Reference
Transcriptomics-Based Screening Identifies Pharmacological Inhibition of Hsp90 as a Means to Defer Aging
Aging strongly influences human morbidity and mortality. Thus, aging-preventive compounds could greatly improve our health and lifespan. Here we screened for such compounds, known as geroprotectors, employing the power of transcriptomics to predict biological age.
Using age-stratified human tissue transcriptomes and machine learning, we generated age classifiers and applied these to transcriptomic changes induced by 1,309 different compounds in human cells, ranking these compounds by their ability to induce a “youthful” transcriptional state. Testing the top candidates in C. elegans, we identified two Hsp90 inhibitors, monorden and tanespimycin, which extended the animals’ lifespan and improved their health. Hsp90 inhibition induces expression of heat shock proteins known to improve protein homeostasis. Reference
Stem cell-associated heterogeneity in Glioblastoma results from intrinsic tumor plasticity shaped by the microenvironment
The identity and unique capacity of cancer stem cells (CSC) to drive tumor growth and resistance have been challenged in brain tumors. Here we report that cells expressing CSC-associated cell membrane markers in Glioblastoma (GBM) do not represent a clonal entity defined by distinct functional properties and transcriptomic profiles, but rather a plastic state that most cancer cells can adopt.
We show that phenotypic heterogeneity arises from non-hierarchical, reversible state transitions, instructed by the microenvironment and is predictable by mathematical modeling. Although functional stem cell properties were similar in vitro, accelerated reconstitution of heterogeneity provides a growth advantage in vivo, suggesting that tumorigenic potential is linked to intrinsic plasticity rather than CSC multipotency. Reference
Differential expression analysis of Trichoderma virens RNA reveals a dynamic transcriptome during colonization of Zea mays roots
Trichoderma spp. are predominantly plant-beneficial symbionts widely used in agriculture as bio-control agents. Studying the mechanisms behind Trichoderma-derived plant benefits has yielded tangible bio-industrial products.
To better take advantage of this fungal-plant symbiosis, it is necessary to obtain detailed knowledge of which genes Trichoderma utilizes during interaction with its plant host. In this study, we explored the transcriptional activity of T. virens during two phases of symbiosis with maize: recognition of roots, and after ingress into the root cortex. Reference
Interrogation of human hematopoiesis at single-cell and single-variant resolution
Widespread linkage disequilibrium and incomplete annotation of cell-to-cell state variation represent substantial challenges to elucidating mechanisms of trait-associated genetic variation.
Here we perform genetic fine-mapping for blood cell traits in the UK Biobank to identify putative causal variants. These variants are enriched in genes encoding proteins in trait-relevant biological pathways and in accessible chromatin of hematopoietic progenitors. For regulatory variants, we explore patterns of developmental enhancer activity, predict molecular mechanisms, and identify likely target genes. In several instances, we localize multiple independent variants to the same regulatory element or gene. We further observe that variants with pleiotropic effects preferentially act in common progenitor populations to direct the production of distinct lineages. Finally, we leverage fine-mapped variants in conjunction with continuous epigenomic annotations to identify trait–cell type enrichments within closely related populations and in single cells. Reference
Guidelines for using sigQC for systematic evaluation of gene signatures
With the increased use of next-generation sequencing generating large amounts of genomic data, gene expression signatures are becoming critically important tools for the interpretation of these data, and are poised to have a substantial effect on diagnosis, management, and prognosis for a number of diseases.
It is becoming crucial to establish whether the expression patterns and statistical properties of sets of genes, or gene signatures, are conserved across independent datasets. Conversely, it is necessary to compare established signatures on the same dataset to better understand how they capture different clinical or biological characteristics. Here we describe how to use sigQC, a tool that enables a streamlined, systematic approach for the evaluation of previously obtained gene signatures across multiple gene expression datasets. We implemented sigQC in an R package, making it accessible to users who have knowledge of file input/output and matrix manipulation in R and a moderate grasp of core statistical principles. Reference
Prioritization of cancer therapeutic targets using CRISPR–Cas9 screens
Functional genomics approaches can overcome limitations—such as the lack of identification of robust targets and poor clinical efficacy—that hamper cancer drug development.
Here we performed genome-scale CRISPR–Cas9 screens in 324 human cancer cell lines from 30 cancer types and developed a data-driven framework to prioritize candidates for cancer therapeutics. We integrated cell fitness effects with genomic biomarkers and target tractability for drug development to systematically prioritize new targets in defined tissues and genotypes. We verified one of our most promising dependencies, the Werner syndrome ATP-dependent helicase, as a synthetic lethal target in tumours from multiple cancer types with microsatellite instability. Reference
Comparative analysis of sequencing technologies for single-cell transcriptomics
Single-cell RNA-seq technologies require library preparation prior to sequencing. Here, we present the first report to compare the cheaper BGISEQ-500 platform to the Illumina HiSeq platform for scRNA-seq.
We generate a resource of 468 single cells and 1297 matched single cDNA samples, performing SMARTer and Smart-seq2 protocols on two cell lines with RNA spike-ins. We sequence these libraries on both platforms using single- and paired-end reads. The platforms have comparable sensitivity and accuracy in terms of quantification of gene expression, and low technical variability. Our study provides a standardized scRNA-seq resource to benchmark new scRNA-seq library preparation protocols and sequencing platforms. Reference
A Bayesian mixture model for clustering droplet-based single-cell transcriptomic data from population studies
The recently developed droplet-based single-cell transcriptome sequencing (scRNA-seq) technology makes it feasible to perform a population-scale scRNA-seq study, in which the transcriptome is measured for tens of thousands of single cells from multiple individuals.
Despite the advances of many clustering methods, there are few tailored methods for population-scale scRNA-seq studies. Here, we develop a Bayesian mixture model for single-cell sequencing (BAMM-SC) method to cluster scRNA-seq data from multiple individuals simultaneously. BAMM-SC takes raw count data as input and accounts for data heterogeneity and batch effect among multiple individuals in a unified Bayesian hierarchical model framework. Reference
Phenotypes associated with genes encoding drug targets are predictive of clinical trial side effects
Only a small fraction of early drug programs progress to the market, due to safety and efficacy failures, despite extensive efforts to predict safety. Characterizing the effect of natural variation in the genes encoding drug targets should present a powerful approach to predict side effects arising from drugging particular proteins.
In this retrospective analysis, we report a correlation between the organ systems affected by genetic variation in drug targets and the organ systems in which side effects are observed. Across 1819 drugs and 21 phenotype categories analyzed, drug side effects are more likely to occur in organ systems where there is genetic evidence of a link between the drug target and a phenotype involving that organ system, compared to when there is no such genetic evidence (30.0% vs 19.2%; OR = 1.80). Reference
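The reported odds ratio can be checked from the two percentages alone. The following is a minimal sketch, assuming both figures are proportions of drug and organ-system pairs in which a side effect was observed; the helper name `odds_ratio` is illustrative and does not come from the paper.

```python
def odds_ratio(p_with_evidence: float, p_without_evidence: float) -> float:
    """Odds ratio from two proportions, each strictly between 0 and 1."""
    odds_with = p_with_evidence / (1.0 - p_with_evidence)
    odds_without = p_without_evidence / (1.0 - p_without_evidence)
    return odds_with / odds_without


# 30.0% of pairs with genetic evidence vs 19.2% of pairs without it
print(round(odds_ratio(0.300, 0.192), 2))
```

The result agrees with the reported OR of 1.80 to within rounding of the published percentages.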
Conbase: a software for unsupervised discovery of clonal somatic mutations in single cells through read phasing
Accurate variant calling and genotyping represent major limiting factors for downstream applications of single-cell genomics. Here, we report Conbase for the identification of somatic mutations in single-cell DNA sequencing data.
Conbase leverages phased read data from multiple samples in a dataset to achieve increased confidence in somatic variant calls and genotype predictions. Comparing the performance of Conbase to three other methods, we find that Conbase performs best in terms of false discovery rate and specificity and provides superior robustness on simulated data, in vitro expanded fibroblasts and clonal lymphocyte populations isolated directly from a healthy human donor. Reference
Meta-analysis of genome-wide association studies provides insights into genetic control of tomato flavor
Tomato flavor has changed over the course of long-term domestication and intensive breeding. To understand the genetic control of flavor, we report the meta-analysis of genome-wide association studies (GWAS) using 775 tomato accessions and 2,316,117 SNPs from three GWAS panels.
We discover 305 significant associations for the contents of sugars, acids, amino acids, and flavor-related volatiles. We demonstrate that fruit citrate and malate contents have been impacted by selection during domestication and improvement, while sugar content has undergone less stringent selection. We suggest that it may be possible to significantly increase volatiles that positively contribute to consumer preferences while reducing unpleasant volatiles, by selection of the relevant allele combinations. Reference
A cheminformatics approach to characterize metabolomes in stable-isotope-labeled organisms
We report a computational approach (implemented in MS-DIAL 3.0; http://prime.psc.riken.jp/) for metabolite structure characterization using fully 13C-labeled and non-labeled plants and LC–MS/MS. Our approach facilitates carbon number determination and metabolite classification for unknown molecules.
Applying our method to 31 tissues from 12 plant species, we assigned 1,092 structures and 344 formulae to 3,604 carbon-determined metabolite ions, 69 of which were found to represent structures currently not listed in metabolome databases. Reference
Crizotinib-induced immunogenic cell death in non-small cell lung cancer
Immunogenic cell death (ICD) converts dying cancer cells into a therapeutic vaccine and stimulates antitumor immune responses. Here we report the results of an unbiased screen identifying high-dose (10 µM) crizotinib as an ICD-inducing tyrosine kinase inhibitor that has exceptional antineoplastic activity when combined with non-ICD-inducing chemotherapeutics like cisplatin.
The combination of cisplatin and high-dose crizotinib induces ICD in non-small cell lung carcinoma (NSCLC) cells and effectively controls the growth of distinct (transplantable, carcinogen- or oncogene-induced) orthotopic NSCLC models. These anticancer effects are linked to increased T lymphocyte infiltration and are abolished by T cell depletion or interferon-γ neutralization. Crizotinib plus cisplatin leads to an increase in the expression of PD-1 and PD-L1 in tumors, coupled to a strong sensitization of NSCLC to immunotherapy with PD-1 antibodies. Reference
Learning protein constitutive motifs from sequence data
Statistical analysis of evolutionary-related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information.
Here we apply RBM to 20 protein families and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residue-residue tertiary contacts, extended secondary motifs (α-helices and β-sheets) and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and ‘turning up’ or ‘turning down’ the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype–phenotype relationship for protein families. Reference
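For orientation, the textbook RBM with binary visible units v and hidden units h defines a joint distribution through an energy function. The symbols a, b, W, and Z below are the standard ones and are not taken from this abstract; the model the authors use for amino-acid sequences may rely on different unit types.

```latex
E(\mathbf{v}, \mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j,
\qquad
P(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}
```

Here Z is the normalizing constant obtained by summing the exponential of the negative energy over all configurations, and each hidden unit h_j couples to the visible units through column j of W, which is what makes individual hidden units readable as sequence features.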
The anti-cancer drugs curaxins target spatial genome organization
Recently we characterized a class of anti-cancer agents (curaxins) that disturbs DNA/histone interactions within nucleosomes.
Here, using a combination of genomic and in vitro approaches, we demonstrate that curaxins strongly affect spatial genome organization and compromise enhancer-promoter communication, which is necessary for the expression of several oncogenes, including MYC. We further show that curaxins selectively inhibit enhancer-regulated transcription of chromatinized templates in cell-free conditions. Genomic studies also suggest that curaxins induce partial depletion of CTCF from its binding sites, which contributes to the observed changes in genome topology. Thus, curaxins can be classified as epigenetic drugs that target the 3D genome organization. Reference
Developing a network view of type 2 diabetes risk pathways through integration of genetic, genomic and functional data
Genome-wide association studies (GWAS) have identified several hundred susceptibility loci for type 2 diabetes (T2D). One critical, but unresolved, issue concerns the extent to which the mechanisms through which these diverse signals influence T2D predisposition converge on a limited set of biological processes.
However, the causal variants identified by GWAS mostly fall into non-coding sequence, complicating the task of defining the effector transcripts through which they operate. Reference
Systematic benchmarking of omics computational tools
Computational omics methods packaged as software have become essential to modern biological research. The increasing dependence of scientists on these powerful software tools creates a need for systematic assessment of these methods, known as benchmarking.
Adopting a standardized benchmarking practice could help researchers who use omics data to better leverage recent technological innovations. Our review summarizes benchmarking practices from 25 recent studies and discusses the challenges, advantages, and limitations of benchmarking across various domains of biology. We also propose principles that can make computational biology benchmarking studies more sustainable and reproducible, ultimately increasing the transparency of biomedical data and results. Reference
Topconfects: a package for confident effect sizes in differential expression analysis provides a more biologically useful ranked gene list
Differential gene expression analysis may discover a set of genes too large to easily investigate, so a means of ranking genes by biological interest level is desired. p values are frequently abused for this purpose.
As an alternative, we propose a method of ranking by confidence bounds on the log fold change, based on the previously developed TREAT test. These confidence bounds provide guaranteed false discovery rate and false coverage-statement rate control. When applied to a breast cancer dataset, the top-ranked genes by Topconfects emphasize markedly different biological processes compared to the top-ranked genes by p value. Reference
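The contrast between ranking by p value and ranking by a confidence bound can be illustrated with a simplified sketch. This is not the Topconfects procedure: it uses a per-gene one-sided normal bound on the log fold change and omits the TREAT-based multiple-testing adjustment that gives the stated error-rate guarantees. The gene names, estimates, and standard errors are invented for illustration.

```python
import math
from statistics import NormalDist


def confident_effect(lfc: float, se: float, level: float = 0.95) -> float:
    """Signed magnitude the effect confidently exceeds; 0.0 if the bound reaches zero."""
    z = NormalDist().inv_cdf(level)
    bound = abs(lfc) - z * se
    return math.copysign(max(bound, 0.0), lfc)


def rank_by_confident_effect(genes: dict) -> list:
    """genes maps name -> (log fold change, standard error); largest bound first."""
    scored = {name: confident_effect(lfc, se) for name, (lfc, se) in genes.items()}
    return sorted(scored, key=lambda name: abs(scored[name]), reverse=True)


def rank_by_z(genes: dict) -> list:
    """Ranking by |lfc / se|, which orders genes the same way a two-sided p value does."""
    return sorted(genes, key=lambda name: abs(genes[name][0] / genes[name][1]), reverse=True)


genes = {
    "geneA": (4.0, 2.0),   # large but noisy estimate
    "geneB": (1.5, 0.1),   # modest but precise estimate
    "geneC": (0.2, 0.01),  # tiny effect with the smallest p value
}
print(rank_by_confident_effect(genes))
print(rank_by_z(genes))
```

In this toy input the gene with the smallest p value has a negligible effect size, so it drops to the bottom of the confidence-bound ranking while staying at the top of the p value ranking.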
EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data
Droplet-based single-cell RNA sequencing protocols have dramatically increased the throughput of single-cell transcriptomics studies. A key computational challenge when processing these data is to distinguish libraries for real cells from empty droplets.
Here, we describe a new statistical method for calling cells from droplet-based data, based on detecting significant deviations from the expression profile of the ambient solution. Using simulations, we demonstrate that EmptyDrops has greater power than existing approaches while controlling the false discovery rate among detected cells. Our method also retains distinct cell types that would have been discarded by existing methods in several real data sets. Reference
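The idea of testing each barcode for deviation from the ambient profile can be sketched as a Monte Carlo test. This is a simplified illustration under stated assumptions, not the EmptyDrops implementation: a plain multinomial stands in for the published likelihood model, the ambient profile is pooled from barcodes below an arbitrary `max_total` cutoff, and no multiple-testing correction is applied. All function names and counts are hypothetical.

```python
import math
import random


def ambient_profile(barcodes, max_total):
    """Pool barcodes with low total counts (assumed empty) into gene proportions."""
    pooled = [1.0] * len(barcodes[0])  # pseudocount keeps every proportion positive
    for counts in barcodes:
        if sum(counts) <= max_total:
            for gene, count in enumerate(counts):
                pooled[gene] += count
    total = sum(pooled)
    return [value / total for value in pooled]


def log_likelihood(counts, profile):
    """Multinomial log-probability of one barcode's counts under the profile."""
    ll = math.lgamma(sum(counts) + 1)
    for count, prob in zip(counts, profile):
        ll += count * math.log(prob) - math.lgamma(count + 1)
    return ll


def empty_pvalue(counts, profile, n_sim=1000, seed=0):
    """Fraction of simulated ambient droplets of equal depth that are at least as unlikely."""
    rng = random.Random(seed)
    observed = log_likelihood(counts, profile)
    genes = range(len(profile))
    hits = 0
    for _ in range(n_sim):
        simulated = [0] * len(profile)
        for gene in rng.choices(genes, weights=profile, k=sum(counts)):
            simulated[gene] += 1
        if log_likelihood(simulated, profile) <= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)


# Rows are barcodes, columns are genes; the first three rows are treated as empty droplets.
barcodes = [[5, 4, 1, 0], [6, 3, 1, 0], [4, 5, 0, 1], [19, 15, 4, 2], [0, 2, 3, 35]]
profile = ambient_profile(barcodes, max_total=10)
for counts in barcodes[3:]:
    print(counts, empty_pvalue(counts, profile))
```

The fourth barcode has many counts but the same composition as the ambient pool, so it receives a large p value. The fifth has a distinct composition and receives a small one, which is the behavior that lets a profile-based test separate cells from deeply sequenced empty droplets.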
Aberrant FGFR signaling mediates resistance to CDK4/6 inhibitors in ER+ breast cancer
Using an ORF kinome screen in MCF-7 cells treated with the CDK4/6 inhibitor ribociclib plus fulvestrant, we identified FGFR1 as a mechanism of drug resistance. FGFR1-amplified/ER+ breast cancer cells and MCF-7 cells transduced with FGFR1 were resistant to fulvestrant ± ribociclib or palbociclib.
This resistance was abrogated by treatment with the FGFR tyrosine kinase inhibitor (TKI) lucitanib. Addition of the FGFR TKI erdafitinib to palbociclib/fulvestrant induced complete responses of FGFR1-amplified/ER+ patient-derived-xenografts. Next generation sequencing of circulating tumor DNA (ctDNA) in 34 patients after progression on CDK4/6 inhibitors identified FGFR1/2 amplification or activating mutations in 14/34 (41%) post-progression specimens. Reference
Dissecting heterogeneity in malignant pleural mesothelioma through histo-molecular gradients for clinical applications
Malignant pleural mesothelioma (MPM) is recognized as heterogeneous based both on histology and molecular profiling. Histology addresses inter-tumor and intra-tumor heterogeneity in MPM and describes three major types: epithelioid, sarcomatoid and biphasic, a combination of the former two types.
Molecular profiling studies have not addressed intra-tumor heterogeneity in MPM to date. Here, we use a deconvolution approach and show that molecular gradients shed new light on the intra-tumor heterogeneity of MPM, leading to a reconsideration of MPM molecular classifications. We show that each tumor can be decomposed as a combination of epithelioid-like and sarcomatoid-like components whose proportions are highly associated with the prognosis. Reference
Identification of pathways associated with chemosensitivity through network embedding
Basal gene expression levels have been shown to be predictive of cellular response to cytotoxic treatments. However, such analyses do not fully reveal complex genotype-phenotype relationships, which are partly encoded in highly interconnected molecular networks. Biological pathways provide a complementary way of understanding drug response variation among individuals.
In this study, we integrate chemosensitivity data from a large-scale pharmacogenomics study with basal gene expression data from the CCLE project and prior knowledge of molecular networks to identify specific pathways mediating chemical response. We first develop a computational method called PACER, which ranks pathways for enrichment in a given set of genes using a novel network embedding method. It examines a molecular network that encodes known gene-gene as well as gene-pathway relationships, and determines a vector representation of each gene and pathway in the same low-dimensional vector space. The relevance of a pathway to the given gene set is then captured by the similarity between the pathway vector and gene vectors. Reference
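The final scoring step, comparing a pathway vector with gene vectors in a shared space, can be sketched as follows. The embeddings here are invented two-dimensional toy vectors, and the choice of cosine similarity averaged over the gene set is an assumption. The abstract does not state which similarity PACER uses or how the embedding is learned.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_pathways(pathway_vecs, gene_vecs, gene_set):
    """Score each pathway by its mean similarity to the genes in gene_set, best first."""
    scores = {}
    for name, pathway_vec in pathway_vecs.items():
        sims = [cosine(pathway_vec, gene_vecs[gene]) for gene in gene_set]
        scores[name] = sum(sims) / len(sims)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical embeddings: G1 and G2 lie near pathway_X, G3 lies near pathway_Y.
gene_vecs = {"G1": [1.0, 0.1], "G2": [0.9, 0.2], "G3": [0.1, 1.0]}
pathway_vecs = {"pathway_X": [1.0, 0.0], "pathway_Y": [0.0, 1.0]}

for name, score in rank_pathways(pathway_vecs, gene_vecs, ["G1", "G2"]):
    print(name, round(score, 3))
```

Because genes and pathways share one vector space, ranking pathways for any new gene set needs only vector comparisons and no re-analysis of the network.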
Neoantigen-directed immune escape in lung cancer evolution
The interplay between an evolving cancer and a dynamic immune microenvironment remains unclear. Here we analyse 258 regions from 88 early-stage, untreated non-small-cell lung cancers using RNA sequencing and histopathology-assessed tumour-infiltrating lymphocyte estimates.
Immune infiltration varied both between and within tumours, with different mechanisms of neoantigen presentation dysfunction enriched in distinct immune microenvironments. Sparsely infiltrated tumours exhibited a waning of neoantigen editing during tumour evolution, indicative of historical immune editing, or copy-number loss of previously clonal neoantigens. Immune-infiltrated tumour regions exhibited ongoing immunoediting, with either loss of heterozygosity in human leukocyte antigens or depletion of expressed neoantigens. We identified promoter hypermethylation of genes that contain neoantigenic mutations as an epigenetic mechanism of immunoediting. Reference
Melissa: Bayesian clustering and imputation of single-cell methylomes
Measurements of single-cell methylation are revolutionizing our understanding of epigenetic control of gene expression, yet the intrinsic data sparsity limits the scope for quantitative analysis of such data.
Here, we introduce Melissa (MEthyLation Inference for Single cell Analysis), a Bayesian hierarchical method to cluster cells based on local methylation patterns, discovering patterns of epigenetic variability between cells. The clustering also acts as an effective regularization for data imputation on unassayed CpG sites, enabling transfer of information between individual cells. We show both on simulated and real data sets that Melissa provides accurate and biologically meaningful clusterings and state-of-the-art imputation performance. Reference
Measuring the reproducibility and quality of Hi-C data
Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease.
However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study. Reference
MGSEA – a multivariate Gene set enrichment analysis
Gene Set Enrichment Analysis (GSEA) is a powerful tool to identify enriched functional categories of informative biomarkers. Canonical GSEA takes one-dimensional feature scores derived from the data of one platform as inputs.
Numerous extensions of GSEA that handle multimodal OMIC data have been proposed, yet none of them explicitly captures combinatorial relations of feature scores from multiple platforms. Reference
A reference-grade wild soybean genome
Efficient crop improvement depends on the application of accurate genetic information contained in diverse germplasm resources.
Here we report a reference-grade genome of wild soybean accession W05, with a final assembled genome size of 1013.2 Mb and a contig N50 of 3.3 Mb. The analytical power of the W05 genome is demonstrated by several examples. First, we identify an inversion at the locus determining seed coat color during domestication. Second, a translocation event between chromosomes 11 and 13 of some genotypes is shown to interfere with the assignment of QTLs. Third, we find a region containing copy number variations of the Kunitz trypsin inhibitor (KTI) genes. Reference
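For readers unfamiliar with the contig N50 statistic quoted above, a minimal sketch of its usual definition follows: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The contig lengths in the example are invented and unrelated to the W05 assembly.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Toy assembly of 300 units: the two longest contigs (80 + 70) reach half of it.
print(n50([80, 70, 50, 40, 30, 20, 10]))
```

A higher N50 means the assembly is broken into fewer, longer pieces, which is why it is commonly reported as a contiguity measure alongside total assembly size.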
RnBeads 2.0: comprehensive analysis of DNA methylation data
DNA methylation is a widely investigated epigenetic mark with important roles in development and disease. High-throughput assays enable genome-scale DNA methylation analysis in large numbers of samples.
Here, we describe a new version of our RnBeads software – an R/Bioconductor package that implements start-to-finish analysis workflows for Infinium microarrays and various types of bisulfite sequencing. RnBeads 2.0 (https://rnbeads.org/) provides additional data types and analysis methods, new functionality for interpreting DNA methylation differences, improved usability with a novel graphical user interface, and better use of computational resources. We demonstrate RnBeads 2.0 in four re-runnable use cases focusing on cell differentiation and cancer. Reference
Network-based prediction of drug combinations
Drug combinations, offering increased therapeutic efficacy and reduced toxicity, play an important role in treating multiple complex diseases. Yet, our ability to identify and validate effective combinations is limited by a combinatorial explosion, driven by both the large number of drug pairs as well as dosage combinations.
Here we propose a network-based methodology to identify clinically efficacious drug combinations for specific diseases. By quantifying the network-based relationship between drug targets and disease proteins in the human protein–protein interactome, we show the existence of six distinct classes of drug–drug–disease combinations. Reference
Best practices for benchmarking germline small-variant calls in human genomes
Standardized benchmarking approaches are required to assess the accuracy of variants called from sequence data. Although variant-calling tools and the metrics used to assess their performance continue to improve, important challenges remain.
Here, as part of the Global Alliance for Genomics and Health (GA4GH), we present a benchmarking framework for variant calling. We provide guidance on how to match variant calls with different representations, define standard performance metrics, and stratify performance by variant type and genome context. We describe limitations of high-confidence calls and regions that can be used as truth sets (for example, single-nucleotide variant concordance of two methods is 99.7% inside versus 76.5% outside high-confidence regions). Reference
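The standard metrics and the stratification step can be sketched in a few lines. This sketch assumes that matching of variant representations has already been done upstream, so each call arrives labelled as a true positive, false positive, or false negative. The record format, function names, and counts are hypothetical and are not taken from the GA4GH tooling.

```python
from collections import Counter


def metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from confusion counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def stratified_metrics(records) -> dict:
    """records: iterable of (stratum, outcome) with outcome in {"TP", "FP", "FN"}."""
    counts = {}
    for stratum, outcome in records:
        counts.setdefault(stratum, Counter())[outcome] += 1
    return {s: metrics(c["TP"], c["FP"], c["FN"]) for s, c in counts.items()}


# Made-up outcomes, stratified by variant type.
records = (
    [("SNV", "TP")] * 98 + [("SNV", "FP")] * 1 + [("SNV", "FN")] * 2
    + [("indel", "TP")] * 8 + [("indel", "FP")] * 2 + [("indel", "FN")] * 2
)
for stratum, result in stratified_metrics(records).items():
    print(stratum, {name: round(value, 3) for name, value in result.items()})
```

Reporting the metrics per stratum rather than in aggregate keeps a large, easy class such as SNVs from masking weaker performance on a smaller, harder class such as indels.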
Whole-genome resequencing reveals Brassica napus origin and genetic loci involved in its improvement
Brassica napus (2n = 4x = 38, AACC) is an important allopolyploid crop derived from interspecific crosses between Brassica rapa (2n = 2x = 20, AA) and Brassica oleracea (2n = 2x = 18, CC). However, no truly wild B. napus populations are known; its origin and improvement processes remain unclear.
Here, we resequence 588 B. napus accessions. We find that the A subgenome may have evolved from the ancestor of European turnip and that the C subgenome may have evolved from the common ancestor of kohlrabi, cauliflower, broccoli, and Chinese kale. Additionally, winter oilseed may be the original form of B. napus. Subgenome-specific selection of defense-response genes has contributed to environmental adaptation after formation of the species, whereas asymmetrical subgenomic selection has led to ecotype change. Reference
Topological scoring of protein interaction networks
It remains a significant challenge to define individual protein associations within networks where an individual protein can directly interact with other proteins and/or be part of large complexes, which contain functional modules.
Here we demonstrate the topological scoring (TopS) algorithm for the analysis of quantitative proteomic datasets from affinity purifications. Data is analyzed in a parallel fashion where a prey protein is scored in an individual affinity purification by aggregating information from the entire dataset. Topological scores span a broad range of values indicating the enrichment of an individual protein in every bait protein purification. TopS is applied to interaction networks derived from human DNA repair proteins and yeast chromatin remodeling complexes. Reference
I-Boost: an integrative boosting approach for predicting survival time with multiple genomics platforms
We propose a statistical boosting method, termed I-Boost, to integrate multiple types of high-dimensional genomics data with clinical data for predicting survival time. I-Boost provides substantially higher prediction accuracy than existing methods.
By applying I-Boost to The Cancer Genome Atlas, we show that the integration of multiple genomics platforms with clinical variables improves the prediction of survival time over the use of clinical variables alone; gene expression values are typically more prognostic of survival time than other genomics data types; and gene modules/signatures are at least as prognostic as the collection of individual gene expression data. Reference
Osteogenesis depends on commissioning of a network of stem cell transcription factors that act as repressors of adipogenesis
Mesenchymal (stromal) stem cells (MSCs) constitute populations of mesodermal multipotent cells involved in tissue regeneration and homeostasis in many different organs.
Here we performed comprehensive characterization of the transcriptional and epigenomic changes associated with osteoblast and adipocyte differentiation of human MSCs. We demonstrate that adipogenesis is driven by considerable remodeling of the chromatin landscape and de novo activation of enhancers, whereas osteogenesis involves activation of preestablished enhancers. Using machine learning algorithms for in silico modeling of transcriptional regulation, we identify a large and diverse transcriptional network of pro-osteogenic and antiadipogenic transcription factors. Reference
GWAS identifies genetic loci for self-reported habitual sleep duration supported by accelerometer-derived estimates
Sleep is an essential state of decreased activity and alertness but molecular factors regulating sleep duration remain unknown. Through genome-wide association analysis in 446,118 adults of European ancestry from the UK Biobank, we identify 78 loci for self-reported habitual sleep duration (p < 5 × 10⁻⁸; 43 loci at p < 6 × 10⁻⁹).
Replication is observed for PAX8, VRK2, and FBXL12/UBL5/PIN1 loci in the CHARGE study (n = 47,180; p < 6.3 × 10⁻⁴), and 55 signals show sign-concordant effects. The 78 loci further associate with accelerometer-derived sleep duration, daytime inactivity, sleep efficiency and number of sleep bouts in secondary analysis (n = 85,499). Loci are enriched for pathways including striatum and subpallium development, mechanosensory response, dopamine binding, synaptic neurotransmission and plasticity, among others. Reference
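The first threshold quoted above, p < 5 × 10⁻⁸, is the conventional genome-wide significance level. It is commonly motivated as a Bonferroni correction of a 0.05 error rate for roughly one million independent common-variant tests. That count is a rule-of-thumb approximation and is not a figure reported by this study.

```latex
\alpha_{\text{genome-wide}} = \frac{0.05}{10^{6}} = 5 \times 10^{-8}
```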
Genome-scale network model of metabolism and histone acetylation reveals metabolic dependencies of histone deacetylase inhibitors
Histone acetylation plays a central role in gene regulation and is sensitive to the levels of metabolic intermediates. However, predicting the impact of metabolic alterations on acetylation in pathological conditions is a significant challenge.
Here, we present a genome-scale network model that predicts the impact of nutritional environment and genetic alterations on histone acetylation. It identifies cell types that are sensitive to histone deacetylase inhibitors based on their metabolic state, and we validate metabolites that alter drug sensitivity. Our model provides a mechanistic framework for predicting how metabolic perturbations contribute to epigenetic changes and sensitivity to deacetylase inhibitors. Reference
A genome-wide association analysis identifies 16 novel susceptibility loci for carpal tunnel syndrome
Carpal tunnel syndrome (CTS) is a common and disabling condition of the hand caused by entrapment of the median nerve at the level of the wrist. It is the commonest entrapment neuropathy, with estimates of prevalence ranging from 5% to 10%.
Here, we undertake a genome-wide association study (GWAS) of an entrapment neuropathy, using 12,312 CTS cases and 389,344 controls identified in UK Biobank. We discover 16 susceptibility loci for CTS with p < 5 × 10⁻⁸. We identify likely causal genes in the pathogenesis of CTS, including ADAMTS17, ADAMTS10 and EFEMP1, and using RNA sequencing demonstrate expression of these genes in surgically resected tenosynovium from CTS patients. We perform Mendelian randomisation and demonstrate a causal relationship between short stature and higher risk of CTS. Reference
Prioritizing Parkinson’s disease genes using population-scale transcriptomic data
Genome-wide association studies (GWAS) have identified over 41 susceptibility loci associated with Parkinson’s Disease (PD) but identifying putative causal genes and the underlying mechanisms remains challenging.
Here, we leverage large-scale transcriptomic datasets to prioritize genes that are likely to affect PD by using a transcriptome-wide association study (TWAS) approach. Using this approach, we identify 66 gene associations whose predicted expression or splicing levels in dorsolateral prefrontal cortex (DLPFC) and peripheral monocytes are significantly associated with PD risk. We uncover many novel genes associated with PD but also novel mechanisms for known associations such as MAPT, for which we find that variation in exon 3 splicing explains the common genetic association. Reference
MMSplice: modular modeling improves the predictions of genetic variant effects on splicing
Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing) with which we built the winning model of the CAGI5 exon skipping prediction challenge.
The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinct large-scale genomics datasets. These modules are combined to predict effects of variants on exon skipping, splice site choice, splicing efficiency, and pathogenicity, with performance matching or exceeding the state of the art. Our models, available in the repository Kipoi, apply to variants including indels directly from VCF files. Reference
Colonic epithelial cell diversity in health and inflammatory bowel disease
The colonic epithelium facilitates host–microorganism interactions to control mucosal immunity, coordinate nutrient recycling and form a mucus barrier. Breakdown of the epithelial barrier underpins inflammatory bowel disease (IBD). However, the specific contributions of each epithelial-cell subtype to this process are unknown.
Here we profile single colonic epithelial cells from patients with IBD and unaffected controls. We identify previously unknown cellular subtypes, including gradients of progenitor cells, colonocytes and goblet cells within intestinal crypts. At the top of the crypts, we find a previously unknown absorptive cell, expressing the proton channel OTOP2 and the satiety peptide uroguanylin, that senses pH and is dysregulated in inflammation and cancer. In IBD, we observe a positional remodelling of goblet cells that coincides with downregulation of WFDC2—an antiprotease molecule that we find to be expressed by goblet cells and that inhibits bacterial growth. Reference
An atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics
Aging promotes lung function decline and susceptibility to chronic lung diseases, which are the third leading cause of death worldwide. Here, we use single cell transcriptomics and mass spectrometry-based proteomics to quantify changes in cellular activity states across 30 cell types and chart the lung proteome of young and old mice.
We show that aging leads to increased transcriptional noise, indicating deregulated epigenetic control. We observe cell type-specific effects of aging, uncovering increased cholesterol biosynthesis in type-2 pneumocytes and lipofibroblasts and altered relative frequency of airway epithelial cells as hallmarks of lung aging. Reference
A network-centric approach to drugging TNF-induced NF-κB signaling
Target-centric drug development strategies prioritize single-target potency in vitro and do not account for connectivity and multi-target effects within a signal transduction network.
Here, we present a systems biology approach that combines transcriptomic and structural analyses with live-cell imaging to predict small molecule inhibitors of TNF-induced NF-κB signaling and elucidate the network response. We identify two first-in-class small molecules that inhibit the NF-κB signaling pathway by preventing the maturation of a rate-limiting multiprotein complex necessary for IKK activation. Our findings suggest that a network-centric drug discovery approach is a promising strategy to evaluate the impact of pharmacologic intervention in signaling. Reference
Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modeling
DNase-seq and ATAC-seq are broadly used methods to assay open chromatin regions genome-wide. The single nucleotide resolution of DNase-seq has been further exploited to infer transcription factor binding sites (TFBSs) in regulatory regions through footprinting.
Here, we undertake a systematic comparison of the two methods and show that a modification to the ATAC-seq protocol increases its yield and its agreement with DNase-seq data from the same cell line. We demonstrate that the two methods have distinct sequence biases and correct for these protocol-specific biases when performing footprinting. Reference
Ediacaran biozones identified with network analysis provide evidence for pulsed extinctions of early complex life
Rocks of Ediacaran age (~635–541 Ma) contain the oldest fossils of large, complex organisms and their behaviors. These fossils document developmental and ecological innovations, and suggest that extinctions helped to shape the trajectory of early animal evolution.
Conventional methods divide Ediacaran macrofossil localities into taxonomically distinct clusters, which may represent evolutionary, environmental, or preservational variation. Here, we investigate these possibilities with network analysis of body and trace fossil occurrences. By partitioning multipartite networks of taxa, paleoenvironments, and geologic formations into community units, we distinguish between biostratigraphic zones and paleoenvironmentally restricted biotopes, and provide empirically robust and statistically significant evidence for a global, cosmopolitan assemblage unique to terminal Ediacaran strata. Reference
Epigenetic signatures associated with imprinted paternally expressed genes in the Arabidopsis endosperm
Imprinted genes are epigenetically modified during gametogenesis and maintain the established epigenetic signatures after fertilization, causing parental-specific gene expression.
In this study, we show that imprinted paternally expressed genes (PEGs) in the Arabidopsis endosperm are marked by an epigenetic signature of Polycomb Repressive Complex2 (PRC2)-mediated H3K27me3 together with heterochromatic H3K9me2 and CHG methylation, which specifically mark the silenced maternal alleles of PEGs. Reference
Integrated analysis of population genomics, transcriptomics and virulence provides novel insights into Streptococcus pyogenes pathogenesis
Streptococcus pyogenes causes 700 million human infections annually worldwide, yet, despite a century of intensive effort, there is no licensed vaccine against this bacterium.
Although a number of large-scale genomic studies of bacterial pathogens have been published, the relationships among the genome, transcriptome, and virulence in large bacterial populations remain poorly understood. We sequenced the genomes of 2,101 emm28 S. pyogenes invasive strains, from which we selected 492 phylogenetically diverse strains for transcriptome analysis and 50 strains for virulence assessment. Data integration provided a novel understanding of the virulence mechanisms of this model organism. Reference
Genomic analyses of an extensive collection of wild and cultivated accessions provide new insights into peach breeding history
Human selection has a long history of transforming crop genomes. Peach (Prunus persica) has undergone more than 5000 years of domestication that led to remarkable changes in a series of agronomically important traits, but genetic bases underlying these changes and the effects of artificial selection on genomic diversity are not well understood.
Here, we report a comprehensive analysis of peach evolution based on genome sequences of 480 wild and cultivated accessions. By focusing on a set of quantitative trait loci (QTLs), we provide evidence that distinct phases of domestication and improvement increased fruit size, improved taste and extended the crop's geographic distribution. Reference
Precise tuning of gene expression levels in mammalian cells
Precise, analogue regulation of gene expression is critical for cellular function in mammals. In contrast, widely employed experimental and therapeutic approaches such as knock-in/out strategies are more suitable for binary control of gene activity.
Here we report on a method for precise control of gene expression levels in mammalian cells using engineered microRNA response elements (MREs). First, we measure the efficacy of thousands of synthetic MRE variants under the control of an endogenous microRNA by high-throughput sequencing. Guided by these data, we establish a library of microRNA silencing-mediated fine-tuners (miSFITs) of varying strength that can be employed to precisely control the expression of user-specified genes. We apply this technology to tune the T-cell co-inhibitory receptor PD-1 and to explore how antigen expression influences T-cell activation and tumour growth. Finally, we employ CRISPR/Cas9-mediated homology-directed repair to introduce miSFITs into the BRCA1 3′UTR, demonstrating that this versatile tool can be used to tune endogenous genes. Reference
Multi-omic measurements of heterogeneity in HeLa cells across laboratories
Reproducibility in research can be compromised by both biological and technical variation, but most of the focus is on removing the latter. Here we investigate the effects of biological variation in HeLa cell lines using a systems-wide approach.
We determine the degree of molecular and phenotypic variability across 14 stock HeLa samples from 13 international laboratories. We culture cells in uniform conditions and profile genome-wide copy numbers, mRNAs, proteins and protein turnover rates in each cell line. We discover substantial heterogeneity between HeLa variants, especially between lines of the CCL2 and Kyoto varieties, and observe progressive divergence within a specific cell line over 50 successive passages. Reference
Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data
t-distributed stochastic neighbor embedding (t-SNE) is widely used for visualizing single-cell RNA-sequencing (scRNA-seq) data, but it scales poorly to large datasets.
We dramatically accelerate t-SNE, obviating the need for data downsampling, and hence allowing visualization of rare cell populations. Furthermore, we implement a heatmap-style visualization for scRNA-seq based on one-dimensional t-SNE for simultaneously visualizing the expression patterns of thousands of genes. Reference
A mathematical-descriptor of tumor-mesoscopic-structure from CT images annotates prognostic- and molecular-phenotypes of epithelial ovarian cancer
The five-year survival rate of epithelial ovarian cancer (EOC) is approximately 35–40% despite maximal treatment efforts, highlighting a need for stratification biomarkers for personalized treatment.
Here we extract 657 quantitative mathematical descriptors from the preoperative CT images of 364 EOC patients at their initial presentation. Using machine learning, we derive a non-invasive summary statistic of the primary ovarian tumor based on 4 descriptors, which we name “Radiomic Prognostic Vector” (RPV). RPV reliably identifies the 5% of patients with a median overall survival of less than 2 years, significantly improves on established prognostic methods, and is validated in two independent, multi-center cohorts. Reference