Science This Week (March 2019)

Updated on: March 22, 2019

Neoantigen-directed immune escape in lung cancer evolution

The interplay between an evolving cancer and a dynamic immune microenvironment remains unclear. Here we analyse 258 regions from 88 early-stage, untreated non-small-cell lung cancers using RNA sequencing and histopathology-assessed tumour-infiltrating lymphocyte estimates.

Immune infiltration varied both between and within tumours, with different mechanisms of neoantigen presentation dysfunction enriched in distinct immune microenvironments. Sparsely infiltrated tumours exhibited a waning of neoantigen editing during tumour evolution, indicative of historical immune editing, or copy-number loss of previously clonal neoantigens. Immune-infiltrated tumour regions exhibited ongoing immunoediting, with either loss of heterozygosity in human leukocyte antigens or depletion of expressed neoantigens. We identified promoter hypermethylation of genes that contain neoantigenic mutations as an epigenetic mechanism of immunoediting. Reference

Melissa: Bayesian clustering and imputation of single-cell methylomes

Measurements of single-cell methylation are revolutionizing our understanding of epigenetic control of gene expression, yet the intrinsic data sparsity limits the scope for quantitative analysis of such data.

Here, we introduce Melissa (MEthyLation Inference for Single cell Analysis), a Bayesian hierarchical method to cluster cells based on local methylation patterns, discovering patterns of epigenetic variability between cells. The clustering also acts as an effective regularization for data imputation on unassayed CpG sites, enabling transfer of information between individual cells. We show both on simulated and real data sets that Melissa provides accurate and biologically meaningful clusterings and state-of-the-art imputation performance. Reference

Measuring the reproducibility and quality of Hi-C data

Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease.

However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study. Reference

MGSEA – a multivariate Gene set enrichment analysis

Gene Set Enrichment Analysis (GSEA) is a powerful tool to identify enriched functional categories of informative biomarkers. Canonical GSEA takes one-dimensional feature scores derived from the data of one platform as inputs.

Numerous extensions of GSEA that handle multimodal OMIC data have been proposed, yet none of them explicitly captures combinatorial relations of feature scores from multiple platforms. Reference
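For readers unfamiliar with the canonical, one-dimensional case that MGSEA extends, here is a minimal sketch of an unweighted GSEA-style running-sum enrichment score. The function name, the gene identifiers, and the choice of the unweighted variant are assumptions for illustration; this is not the MGSEA method itself.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (Kolmogorov-Smirnov style).

    ranked_genes: genes ordered from highest to lowest feature score.
    gene_set: the functional category being tested.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up on a set member, step down otherwise; both walks sum to zero.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        # Keep the largest deviation from zero, preserving its sign.
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list and gene sets.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_set = {"g1", "g2"}
bottom_set = {"g5", "g6"}
print(enrichment_score(ranked, top_set))
print(enrichment_score(ranked, bottom_set))
```

A set concentrated at the top of the ranking gives a positive score, and one concentrated at the bottom gives a negative score. In practice, significance would be assessed by permutation, which is omitted here.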

A reference-grade wild soybean genome

Efficient crop improvement depends on the application of accurate genetic information contained in diverse germplasm resources.

Here we report a reference-grade genome of wild soybean accession W05, with a final assembled genome size of 1013.2 Mb and a contig N50 of 3.3 Mb. The analytical power of the W05 genome is demonstrated by several examples. First, we identify an inversion at the locus determining seed coat color during domestication. Second, a translocation event between chromosomes 11 and 13 of some genotypes is shown to interfere with the assignment of QTLs. Third, we find a region containing copy number variations of the Kunitz trypsin inhibitor (KTI) genes. Reference
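The contig N50 quoted above is a standard contiguity statistic: the length of the contig at which the cumulative length, counting from the longest contig down, first reaches half of the total assembly. A minimal sketch with made-up contig lengths (not W05 data):

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (arbitrary units), total = 300.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # prints 70: 80 + 70 = 150 reaches half of 300
```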

RnBeads 2.0: comprehensive analysis of DNA methylation data

DNA methylation is a widely investigated epigenetic mark with important roles in development and disease. High-throughput assays enable genome-scale DNA methylation analysis in large numbers of samples.

Here, we describe a new version of our RnBeads software – an R/Bioconductor package that implements start-to-finish analysis workflows for Infinium microarrays and various types of bisulfite sequencing. RnBeads 2.0 (https://rnbeads.org/) provides additional data types and analysis methods, new functionality for interpreting DNA methylation differences, improved usability with a novel graphical user interface, and better use of computational resources. We demonstrate RnBeads 2.0 in four re-runnable use cases focusing on cell differentiation and cancer. Reference

Network-based prediction of drug combinations

Drug combinations, offering increased therapeutic efficacy and reduced toxicity, play an important role in treating multiple complex diseases. Yet, our ability to identify and validate effective combinations is limited by a combinatorial explosion, driven by both the large number of drug pairs as well as dosage combinations.

Here we propose a network-based methodology to identify clinically efficacious drug combinations for specific diseases. By quantifying the network-based relationship between drug targets and disease proteins in the human protein–protein interactome, we show the existence of six distinct classes of drug–drug–disease combinations. Reference
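The summary does not spell out how the network relationship is quantified. One measure commonly used in network medicine is a separation score, s_AB = ⟨d_AB⟩ − (⟨d_AA⟩ + ⟨d_BB⟩)/2, built from shortest-path distances between two node sets. The sketch below assumes that measure and uses a toy path graph; the graph, node sets, and function names are all hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path lengths from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def nearest_distances(graph, sources, targets, skip_self=False):
    """For each source, distance to its closest reachable target."""
    result = []
    for s in sources:
        dist = bfs_distances(graph, s)
        result.append(min(dist[t] for t in targets
                          if t in dist and not (skip_self and t == s)))
    return result


def separation(graph, set_a, set_b):
    """s_AB < 0 suggests overlapping neighbourhoods, s_AB > 0 separated ones.

    Assumes each set has at least two nodes and the graph is connected.
    """
    def mean(values):
        return sum(values) / len(values)

    d_aa = nearest_distances(graph, set_a, set_a, skip_self=True)
    d_bb = nearest_distances(graph, set_b, set_b, skip_self=True)
    d_ab = (nearest_distances(graph, set_a, set_b)
            + nearest_distances(graph, set_b, set_a))
    return mean(d_ab) - (mean(d_aa) + mean(d_bb)) / 2


# Toy interactome: a path 1-2-3-4-5-6 (adjacency lists).
toy_graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
print(separation(toy_graph, {1, 2}, {5, 6}))  # far apart: positive
print(separation(toy_graph, {2, 3}, {3, 4}))  # overlapping: negative
```

On a real interactome the same calculation would be applied to drug-target sets and disease-protein sets, typically with a comparison against randomized node sets.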

Best practices for benchmarking germline small-variant calls in human genomes

Standardized benchmarking approaches are required to assess the accuracy of variants called from sequence data. Although variant-calling tools and the metrics used to assess their performance continue to improve, important challenges remain.

Here, as part of the Global Alliance for Genomics and Health (GA4GH), we present a benchmarking framework for variant calling. We provide guidance on how to match variant calls with different representations, define standard performance metrics, and stratify performance by variant type and genome context. We describe limitations of high-confidence calls and regions that can be used as truth sets (for example, single-nucleotide variant concordance of two methods is 99.7% inside versus 76.5% outside high-confidence regions). Reference
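To make "standard performance metrics" and "stratify by variant type" concrete, here is a small sketch that computes precision, recall and F1 from true-positive, false-positive and false-negative counts per stratum. The counts and stratum names are invented for illustration, and the framework's own tooling and exact metric definitions are not reproduced here.

```python
def performance(tp, fp, fn):
    """Precision, recall and F1 from comparison counts against a truth set."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, stratified by variant type.
toy_counts = {
    "SNV": {"tp": 990, "fp": 10, "fn": 10},
    "INDEL": {"tp": 80, "fp": 20, "fn": 20},
}

by_stratum = {name: performance(**c) for name, c in toy_counts.items()}
for name, m in by_stratum.items():
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Reporting the strata separately keeps a large, easy class (here SNVs) from masking weaker performance on a smaller, harder one.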

Interrogation of human hematopoiesis at single-cell and single-variant resolution

Widespread linkage disequilibrium and incomplete annotation of cell-to-cell state variation represent substantial challenges to elucidating mechanisms of trait-associated genetic variation.

Here we perform genetic fine-mapping for blood cell traits in the UK Biobank to identify putative causal variants. These variants are enriched in genes encoding proteins in trait-relevant biological pathways and in accessible chromatin of hematopoietic progenitors. For regulatory variants, we explore patterns of developmental enhancer activity, predict molecular mechanisms, and identify likely target genes. In several instances, we localize multiple independent variants to the same regulatory element or gene. We further observe that variants with pleiotropic effects preferentially act in common progenitor populations to direct the production of distinct lineages. Reference

Whole-genome resequencing reveals Brassica napus origin and genetic loci involved in its improvement

Brassica napus (2n = 4x = 38, AACC) is an important allopolyploid crop derived from interspecific crosses between Brassica rapa (2n = 2x = 20, AA) and Brassica oleracea (2n = 2x = 18, CC). However, no truly wild B. napus populations are known; its origin and improvement processes remain unclear.

Here, we resequence 588 B. napus accessions. We find that the A subgenome may have evolved from the ancestor of European turnip and the C subgenome may have evolved from the common ancestor of kohlrabi, cauliflower, broccoli, and Chinese kale. Additionally, winter oilseed may be the original form of B. napus. Subgenome-specific selection of defense-response genes has contributed to environmental adaptation after formation of the species, whereas asymmetrical subgenomic selection has led to ecotype change. Reference

Topological scoring of protein interaction networks

It remains a significant challenge to define individual protein associations within networks where an individual protein can directly interact with other proteins and/or be part of large complexes, which contain functional modules.

Here we demonstrate the topological scoring (TopS) algorithm for the analysis of quantitative proteomic datasets from affinity purifications. Data is analyzed in a parallel fashion where a prey protein is scored in an individual affinity purification by aggregating information from the entire dataset. Topological scores span a broad range of values indicating the enrichment of an individual protein in every bait protein purification. TopS is applied to interaction networks derived from human DNA repair proteins and yeast chromatin remodeling complexes. Reference
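The summary does not give the scoring formula. The sketch below assumes a likelihood-ratio style score, observed × ln(observed / expected), where the expected value for a prey in a bait purification comes from the row and column totals of the whole table. Treat it as an illustration of "aggregating information from the entire dataset", with hypothetical counts, rather than as the published definition.

```python
import math


def topological_scores(counts):
    """Score each (prey, bait) cell against its expectation from table margins.

    counts[i][j]: quantitative value (e.g. spectral count) of prey i in bait j.
    Assumed score: q * ln(q / expected), with 0 for unobserved cells.
    """
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    total = sum(row_sums)
    scores = []
    for i, row in enumerate(counts):
        scored_row = []
        for j, q in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            scored_row.append(q * math.log(q / expected) if q > 0 else 0.0)
        scores.append(scored_row)
    return scores


# Hypothetical 2 preys x 2 baits table; every cell has expected value 5.
toy_counts = [[8, 2],
              [2, 8]]
toy_scores = topological_scores(toy_counts)
for row in toy_scores:
    print([round(s, 2) for s in row])
```

Cells observed above expectation score positive (enriched in that purification) and cells below expectation score negative, which is how a single table can yield the broad range of values described above.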

I-Boost: an integrative boosting approach for predicting survival time with multiple genomics platforms

We propose a statistical boosting method, termed I-Boost, to integrate multiple types of high-dimensional genomics data with clinical data for predicting survival time. I-Boost provides substantially higher prediction accuracy than existing methods.

By applying I-Boost to The Cancer Genome Atlas, we show that the integration of multiple genomics platforms with clinical variables improves the prediction of survival time over the use of clinical variables alone; gene expression values are typically more prognostic of survival time than other genomics data types; and gene modules/signatures are at least as prognostic as the collection of individual gene expression data. Reference

Osteogenesis depends on commissioning of a network of stem cell transcription factors that act as repressors of adipogenesis

Mesenchymal (stromal) stem cells (MSCs) constitute populations of mesodermal multipotent cells involved in tissue regeneration and homeostasis in many different organs.

Here we performed comprehensive characterization of the transcriptional and epigenomic changes associated with osteoblast and adipocyte differentiation of human MSCs. We demonstrate that adipogenesis is driven by considerable remodeling of the chromatin landscape and de novo activation of enhancers, whereas osteogenesis involves activation of preestablished enhancers. Using machine learning algorithms for in silico modeling of transcriptional regulation, we identify a large and diverse transcriptional network of pro-osteogenic and antiadipogenic transcription factors. Reference

GWAS identifies genetic loci for self-reported habitual sleep duration supported by accelerometer-derived estimates

Sleep is an essential state of decreased activity and alertness but molecular factors regulating sleep duration remain unknown. Through genome-wide association analysis in 446,118 adults of European ancestry from the UK Biobank, we identify 78 loci for self-reported habitual sleep duration (p < 5 × 10⁻⁸; 43 loci at p < 6 × 10⁻⁹).

Replication is observed for PAX8, VRK2, and FBXL12/UBL5/PIN1 loci in the CHARGE study (n = 47,180; p < 6.3 × 10⁻⁴), and 55 signals show sign-concordant effects. The 78 loci further associate with accelerometer-derived sleep duration, daytime inactivity, sleep efficiency and number of sleep bouts in secondary analysis (n = 85,499). Loci are enriched for pathways including striatum and subpallium development, mechanosensory response, dopamine binding, synaptic neurotransmission and plasticity, among others. Reference

Genome-scale network model of metabolism and histone acetylation reveals metabolic dependencies of histone deacetylase inhibitors

Histone acetylation plays a central role in gene regulation and is sensitive to the levels of metabolic intermediates. However, predicting the impact of metabolic alterations on acetylation in pathological conditions is a significant challenge.

Here, we present a genome-scale network model that predicts the impact of nutritional environment and genetic alterations on histone acetylation. It identifies cell types that are sensitive to histone deacetylase inhibitors based on their metabolic state, and we validate metabolites that alter drug sensitivity. Our model provides a mechanistic framework for predicting how metabolic perturbations contribute to epigenetic changes and sensitivity to deacetylase inhibitors. Reference

A genome-wide association analysis identifies 16 novel susceptibility loci for carpal tunnel syndrome

Carpal tunnel syndrome (CTS) is a common and disabling condition of the hand caused by entrapment of the median nerve at the level of the wrist. It is the commonest entrapment neuropathy, with prevalence estimates ranging from 5% to 10%.

Here, we undertake a genome-wide association study (GWAS) of an entrapment neuropathy, using 12,312 CTS cases and 389,344 controls identified in UK Biobank. We discover 16 susceptibility loci for CTS with p < 5 × 10⁻⁸. We identify likely causal genes in the pathogenesis of CTS, including ADAMTS17, ADAMTS10 and EFEMP1, and using RNA sequencing demonstrate expression of these genes in surgically resected tenosynovium from CTS patients. We perform Mendelian randomisation and demonstrate a causal relationship between short stature and higher risk of CTS. Reference

Prioritizing Parkinson’s disease genes using population-scale transcriptomic data

Genome-wide association studies (GWAS) have identified over 41 susceptibility loci associated with Parkinson’s Disease (PD) but identifying putative causal genes and the underlying mechanisms remains challenging.

Here, we leverage large-scale transcriptomic datasets to prioritize genes that are likely to affect PD by using a transcriptome-wide association study (TWAS) approach. Using this approach, we identify 66 gene associations whose predicted expression or splicing levels in dorsolateral prefrontal cortex (DLPFC) and peripheral monocytes are significantly associated with PD risk. We uncover many novel genes associated with PD but also novel mechanisms for known associations such as MAPT, for which we find that variation in exon 3 splicing explains the common genetic association. Reference

MMSplice: modular modeling improves the predictions of genetic variant effects on splicing

Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing) with which we built the winning model of the CAGI5 exon skipping prediction challenge.

The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinct large-scale genomics datasets. These modules are combined to predict effects of variants on exon skipping, splice site choice, splicing efficiency, and pathogenicity, with matched or higher performance than state-of-the-art. Our models, available in the repository Kipoi, apply to variants including indels directly from VCF files. Reference

Colonic epithelial cell diversity in health and inflammatory bowel disease

The colonic epithelium facilitates host–microorganism interactions to control mucosal immunity, coordinate nutrient recycling and form a mucus barrier. Breakdown of the epithelial barrier underpins inflammatory bowel disease (IBD). However, the specific contributions of each epithelial-cell subtype to this process are unknown.

Here we profile single colonic epithelial cells from patients with IBD and unaffected controls. We identify previously unknown cellular subtypes, including gradients of progenitor cells, colonocytes and goblet cells within intestinal crypts. At the top of the crypts, we find a previously unknown absorptive cell, expressing the proton channel OTOP2 and the satiety peptide uroguanylin, that senses pH and is dysregulated in inflammation and cancer. In IBD, we observe a positional remodelling of goblet cells that coincides with downregulation of WFDC2—an antiprotease molecule that we find to be expressed by goblet cells and that inhibits bacterial growth. Reference

An atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics

Aging promotes lung function decline and susceptibility to chronic lung diseases, which are the third leading cause of death worldwide. Here, we use single cell transcriptomics and mass spectrometry-based proteomics to quantify changes in cellular activity states across 30 cell types and chart the lung proteome of young and old mice.

We show that aging leads to increased transcriptional noise, indicating deregulated epigenetic control. We observe cell type-specific effects of aging, uncovering increased cholesterol biosynthesis in type-2 pneumocytes and lipofibroblasts and altered relative frequency of airway epithelial cells as hallmarks of lung aging. Reference

A network-centric approach to drugging TNF-induced NF-κB signaling

Target-centric drug development strategies prioritize single-target potency in vitro and do not account for connectivity and multi-target effects within a signal transduction network.

Here, we present a systems biology approach that combines transcriptomic and structural analyses with live-cell imaging to predict small molecule inhibitors of TNF-induced NF-κB signaling and elucidate the network response. We identify two first-in-class small molecules that inhibit the NF-κB signaling pathway by preventing the maturation of a rate-limiting multiprotein complex necessary for IKK activation. Our findings suggest that a network-centric drug discovery approach is a promising strategy to evaluate the impact of pharmacologic intervention in signaling. Reference

Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modeling

DNase-seq and ATAC-seq are broadly used methods to assay open chromatin regions genome-wide. The single nucleotide resolution of DNase-seq has been further exploited to infer transcription factor binding sites (TFBSs) in regulatory regions through footprinting.

Here, we undertake a systematic comparison of the two methods and show that a modification to the ATAC-seq protocol increases its yield and its agreement with DNase-seq data from the same cell line. We demonstrate that the two methods have distinct sequence biases and correct for these protocol-specific biases when performing footprinting. Reference

Ediacaran biozones identified with network analysis provide evidence for pulsed extinctions of early complex life

Rocks of Ediacaran age (~635–541 Ma) contain the oldest fossils of large, complex organisms and their behaviors. These fossils document developmental and ecological innovations, and suggest that extinctions helped to shape the trajectory of early animal evolution.

Conventional methods divide Ediacaran macrofossil localities into taxonomically distinct clusters, which may represent evolutionary, environmental, or preservational variation. Here, we investigate these possibilities with network analysis of body and trace fossil occurrences. By partitioning multipartite networks of taxa, paleoenvironments, and geologic formations into community units, we distinguish between biostratigraphic zones and paleoenvironmentally restricted biotopes, and provide empirically robust and statistically significant evidence for a global, cosmopolitan assemblage unique to terminal Ediacaran strata. Reference

Epigenetic signatures associated with imprinted paternally expressed genes in the Arabidopsis endosperm

Imprinted genes are epigenetically modified during gametogenesis and maintain the established epigenetic signatures after fertilization, causing parental-specific gene expression.

In this study, we show that imprinted paternally expressed genes (PEGs) in the Arabidopsis endosperm are marked by an epigenetic signature of Polycomb Repressive Complex2 (PRC2)-mediated H3K27me3 together with heterochromatic H3K9me2 and CHG methylation, which specifically mark the silenced maternal alleles of PEGs. Reference

Integrated analysis of population genomics, transcriptomics and virulence provides novel insights into Streptococcus pyogenes pathogenesis

Streptococcus pyogenes causes 700 million human infections annually worldwide, yet, despite a century of intensive effort, there is no licensed vaccine against this bacterium.

Although a number of large-scale genomic studies of bacterial pathogens have been published, the relationships among the genome, transcriptome, and virulence in large bacterial populations remain poorly understood. We sequenced the genomes of 2,101 emm28 S. pyogenes invasive strains, from which we selected 492 phylogenetically diverse strains for transcriptome analysis and 50 strains for virulence assessment. Data integration provided a novel understanding of the virulence mechanisms of this model organism. Reference

Genomic analyses of an extensive collection of wild and cultivated accessions provide new insights into peach breeding history

Human selection has a long history of transforming crop genomes. Peach (Prunus persica) has undergone more than 5000 years of domestication that led to remarkable changes in a series of agronomically important traits, but genetic bases underlying these changes and the effects of artificial selection on genomic diversity are not well understood.

Here, we report a comprehensive analysis of peach evolution based on genome sequences of 480 wild and cultivated accessions. By focusing on a set of quantitative trait loci (QTLs), we provide evidence that distinct phases of domestication and improvement have led to an increase in fruit size and taste and extended its geographic distribution. Reference

Precise tuning of gene expression levels in mammalian cells

Precise, analogue regulation of gene expression is critical for cellular function in mammals. In contrast, widely employed experimental and therapeutic approaches such as knock-in/out strategies are more suitable for binary control of gene activity.

Here we report on a method for precise control of gene expression levels in mammalian cells using engineered microRNA response elements (MREs). First, we measure the efficacy of thousands of synthetic MRE variants under the control of an endogenous microRNA by high-throughput sequencing. Guided by this data, we establish a library of microRNA silencing-mediated fine-tuners (miSFITs) of varying strength that can be employed to precisely control the expression of user-specified genes. We apply this technology to tune the T-cell co-inhibitory receptor PD-1 and to explore how antigen expression influences T-cell activation and tumour growth. Finally, we employ CRISPR/Cas9 mediated homology directed repair to introduce miSFITs into the BRCA1 3′UTR, demonstrating that this versatile tool can be used to tune endogenous genes. Reference

Multi-omic measurements of heterogeneity in HeLa cells across laboratories

Reproducibility in research can be compromised by both biological and technical variation, but most of the focus is on removing the latter. Here we investigate the effects of biological variation in HeLa cell lines using a systems-wide approach.

We determine the degree of molecular and phenotypic variability across 14 stock HeLa samples from 13 international laboratories. We cultured cells in uniform conditions and profiled genome-wide copy numbers, mRNAs, proteins and protein turnover rates in each cell line. We discovered substantial heterogeneity between HeLa variants, especially between lines of the CCL2 and Kyoto varieties, and observed progressive divergence within a specific cell line over 50 successive passages. Reference

Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data

t-distributed stochastic neighbor embedding (t-SNE) is widely used for visualizing single-cell RNA-sequencing (scRNA-seq) data, but it scales poorly to large datasets.

We dramatically accelerate t-SNE, obviating the need for data downsampling, and hence allowing visualization of rare cell populations. Furthermore, we implement a heatmap-style visualization for scRNA-seq based on one-dimensional t-SNE for simultaneously visualizing the expression patterns of thousands of genes. Reference

A mathematical-descriptor of tumor-mesoscopic-structure from CT images annotates prognostic- and molecular-phenotypes of epithelial ovarian cancer

The five-year survival rate of epithelial ovarian cancer (EOC) is approximately 35–40% despite maximal treatment efforts, highlighting a need for stratification biomarkers for personalized treatment.

Here we extract 657 quantitative mathematical descriptors from the preoperative CT images of 364 EOC patients at their initial presentation. Using machine learning, we derive a non-invasive summary-statistic of the primary ovarian tumor based on 4 descriptors, which we name “Radiomic Prognostic Vector” (RPV). RPV reliably identifies the 5% of patients with median overall survival less than 2 years, significantly improves established prognostic methods, and is validated in two independent, multi-center cohorts. Reference

Single cell functional genomics reveals the importance of mitochondria in cell-to-cell phenotypic variation

Mutations frequently have outcomes that differ across individuals, even when these individuals are genetically identical and share a common environment.

Moreover, individual microbial and mammalian cells can vary substantially in their proliferation rates, stress tolerance, and drug resistance, with important implications for the treatment of infections and cancer. To investigate the causes of cell-to-cell variation in proliferation, we used a high-throughput automated microscopy assay to quantify the impact of deleting >1500 genes in yeast. Mutations affecting mitochondria were particularly variable in their outcome. In both mutant and wild-type cells mitochondrial membrane potential – but not amount – varied substantially across individual cells and predicted cell-to-cell variation in proliferation, mutation outcome, stress tolerance, and resistance to a clinically used anti-fungal drug. These results suggest an important role for cell-to-cell variation in the state of an organelle in single cell phenotypic variation. Reference

Mismatch repair-signature mutations activate gene enhancers across human colorectal cancer epigenomes

Commonly-mutated genes have been found for many cancers, but less is known about mutations in cis-regulatory elements. We leverage gains in tumor-specific enhancer activity, coupled with allele-biased mutation detection from H3K27ac ChIP-seq data, to pinpoint potential enhancer-activating mutations in colorectal cancer (CRC).

Analysis of a genetically-diverse cohort of CRC specimens revealed that microsatellite instable (MSI) samples have a high indel rate within active enhancers. Enhancers with indels show evidence of positive selection, increased target gene expression, and a subset is highly recurrent. The indels affect short homopolymer tracts of A/T and increase affinity for FOX transcription factors. We further demonstrate that signature mismatch-repair (MMR) mutations activate enhancers using a xenograft tumor metastasis model, where mutations are induced naturally via CRISPR/Cas9 inactivation of MLH1 prior to tumor cell injection. Our results suggest that MMR signature mutations activate enhancers in CRC tumor epigenomes to provide a selective advantage. Reference

WGS identifies ADGRG6 enhancer mutations and FRS2 duplications as angiogenesis-related drivers in bladder cancer

Bladder cancer is one of the most common and highly vascularized cancers. To better understand its genomic structure and underlying etiology, we conduct whole-genome and targeted sequencing in urothelial bladder carcinomas (UBCs, the most common type of bladder cancer).

Recurrent mutations in noncoding regions affecting gene regulatory elements and structural variations (SVs) leading to gene disruptions are prevalent. Notably, we find recurrent ADGRG6 enhancer mutations and FRS2 duplications which are associated with higher protein expression in the tumor and poor prognosis. Functional assays demonstrate that depletion of ADGRG6 or FRS2 expression in UBC cells compromises their ability to recruit endothelial cells and induce tube formation. Reference

Modeling double strand break susceptibility to interrogate structural variation in cancer

Structural variants (SVs) are known to play important roles in a variety of cancers, but their origins and functional consequences are still poorly understood.

Many SVs are thought to emerge from errors in the repair processes following DNA double strand breaks (DSBs). We used experimentally quantified DSB frequencies in cell lines with matched chromatin and sequence features to derive the first quantitative genome-wide models of DSB susceptibility. These models are accurate and provide novel insights into the mutational mechanisms generating DSBs. Models trained in one cell type can be successfully applied to others, but a substantial proportion of DSBs appear to reflect cell type-specific processes. Reference

Functional genomics reveal gene regulatory mechanisms underlying schizophrenia risk

Genome-wide association studies (GWASs) have identified over 180 independent schizophrenia risk loci. Nevertheless, how the risk variants in the reported loci confer schizophrenia susceptibility remains largely unknown.

Here we systematically investigate the gene regulatory mechanisms underpinning schizophrenia risk through integrating data from functional genomics (including 30 ChIP-Seq experiments) and position weight matrix (PWM). We identify 132 risk single nucleotide polymorphisms (SNPs) that disrupt transcription factor binding and we find that 97 of the 132 TF binding-disrupting SNPs are associated with gene expression in human brain tissues. Reference
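As a rough illustration of how a position weight matrix can flag a binding-disrupting SNP, the sketch below scores the reference and alternate alleles of a short site with a log-odds PWM and reports the difference. The 4-bp matrix, the uniform background, and the sequences are invented for the example and are not taken from the study.

```python
import math

# Hypothetical PWM for a 4-bp motif with consensus ACGT:
# one dict of base probabilities per motif position.
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def pwm_score(sequence, pwm=TOY_PWM):
    """Log2-odds score of a sequence whose length matches the motif."""
    if len(sequence) != len(pwm):
        raise ValueError("sequence length must equal motif length")
    return sum(math.log2(column[base] / BACKGROUND)
               for base, column in zip(sequence, pwm))


def allele_effect(ref_site, alt_site):
    """Score change caused by the SNP; negative values weaken the match."""
    return pwm_score(alt_site) - pwm_score(ref_site)


# Reference allele matches the consensus; the SNP changes C to T at position 2.
delta = allele_effect("ACGT", "ATGT")
print(round(delta, 3))
```

A real analysis would scan every motif window overlapping the SNP on both strands and choose a score-change threshold; those steps are left out to keep the sketch short.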

Multiplexed profiling of RNA and protein expression signatures in individual cells using flow or mass cytometry

Advances in single-cell analysis technologies are providing novel insights into phenotypic and functional heterogeneity within seemingly identical cell populations. RNA within single cells can be analyzed using unbiased sequencing protocols or through more targeted approaches using in situ hybridization (ISH).

The proximity ligation assay for RNA (PLAYR) approach is a sensitive and high-throughput technique that relies on in situ and proximal ligation to measure at least 27 specific RNAs by flow or mass cytometry. We provide detailed instructions for combining this technique with antibody-based detection of surface/internal protein, allowing simultaneous highly multiplexed profiling of RNA and protein expression at single-cell resolution. PLAYR overcomes limitations on multiplexing seen in previous branching DNA–based RNA detection techniques by integration of a transcript-specific oligonucleotide sequence within a rolling-circle amplification (RCA). Reference

A human gut bacterial genome and culture collection for improved metagenomic analyses

Understanding gut microbiome functions requires cultivated bacteria for experimental validation and reference bacterial genome sequences to interpret metagenome datasets and guide functional analyses.

We present the Human Gastrointestinal Bacteria Culture Collection (HBC), a comprehensive set of 737 whole-genome-sequenced bacterial isolates, representing 273 species (105 novel species) from 31 families found in the human gastrointestinal microbiota. The HBC increases the number of bacterial genomes derived from human gastrointestinal microbiota by 37%. The resulting global Human Gastrointestinal Bacteria Genome Collection (HGG) classifies 83% of genera by abundance across 13,490 shotgun-sequenced metagenomic samples, improves taxonomic classification by 61% compared to the Human Microbiome Project (HMP) genome collection and achieves subspecies-level classification for almost 50% of sequences. Reference

Neutrophils escort circulating tumour cells to enable cell cycle progression

A better understanding of the features that define the interaction between cancer cells and immune cells is important for the development of new cancer therapies.

However, focus is often given to interactions that occur within the primary tumour and its microenvironment, whereas the role of immune cells during cancer dissemination in patients remains largely uncharacterized. Circulating tumour cells (CTCs) are precursors of metastasis in several types of cancer, and are occasionally found within the bloodstream in association with non-malignant cells such as white blood cells (WBCs). The identity and function of these CTC-associated WBCs, as well as the molecular features that define the interaction between WBCs and CTCs, are unknown. Here we isolate and characterize individual CTC-associated WBCs, as well as corresponding cancer cells within each CTC–WBC cluster, from patients with breast cancer and from mouse models. Reference

Ancient human genome-wide data from a 3000-year interval in the Caucasus corresponds with eco-geographic regions

Archaeogenetic studies have described the formation of Eurasian ‘steppe ancestry’ as a mixture of Eastern and Caucasus hunter-gatherers.

However, it remains unclear when and where this ancestry arose and whether it was related to a horizon of cultural innovations in the 4th millennium BCE that subsequently facilitated the advance of pastoral societies in Eurasia. Here we generated genome-wide SNP data from 45 prehistoric individuals along a 3000-year temporal transect in the North Caucasus. We observe a genetic separation between the groups of the Caucasus and those of the adjacent steppe. The northern Caucasus groups are genetically similar to contemporaneous populations south of it, suggesting human movement across the mountain range during the Bronze Age. Reference

A comparative evaluation of hybrid error correction methods for error-prone long reads

Third-generation sequencing technologies have advanced biological research by generating reads that are substantially longer than those of second-generation sequencing technologies.

However, their notoriously high error rate impedes straightforward data analysis and limits their application. Here, we present a comparative performance assessment of ten state-of-the-art error-correction methods for long reads. We established a common set of benchmarks for performance assessment, including sensitivity, accuracy, output rate, alignment rate, output read length, run time, and memory usage, as well as the effects of error correction on two downstream applications of long reads: de novo assembly and resolving haplotype sequences. Reference

Trait-based community assembly and succession of the infant gut microbiome

The human gut microbiome develops over early childhood and aids in food digestion and immunomodulation, but the mechanisms driving its development remain elusive.

Here we use data curated from literature and online repositories to examine trait-based patterns of gut microbiome succession in 56 infants over their first three years of life. We also develop a new phylogeny-based approach of inferring trait values that can extend readily to other microbial systems and questions. Trait-based patterns suggest that infant gut succession begins with a functionally variable cohort of taxa, adept at proliferating rapidly within hosts, which gradually matures into a more functionally uniform cohort of taxa adapted to thrive in the anoxic gut and disperse between anoxic patches as oxygen-tolerant spores. Reference

Gene editing of the multi-copy H2A.B gene and its importance for fertility

Altering the biochemical makeup of chromatin by the incorporation of histone variants during development represents a key mechanism in regulating gene expression.

The histone variant H2A.B (H2A.B.3 in mice) appeared late in evolution and is most highly expressed in the testis. In the mouse, it is encoded by three different genes. H2A.B expression is spatially and temporally regulated during spermatogenesis, being most highly expressed in the haploid round spermatid stage. Active genes gain H2A.B where it directly interacts with polymerase II and RNA processing factors within splicing speckles. However, the importance of H2A.B for gene expression and fertility is unknown. Reference

Pipelines for cross-species and genome-wide prediction of long noncoding RNA binding

Abundant long, noncoding RNAs (lncRNAs) in mammals can bind to DNA sequences and recruit histone- and DNA-modifying enzymes to binding sites to epigenetically regulate target genes.

However, most lncRNAs’ binding motifs and target sites are unknown. The large numbers of lncRNAs and target sites in the whole genome make it infeasible to examine lncRNA binding to DNA purely experimentally. Here, we report a protocol for lncRNA/DNA-binding analysis that is built upon a database containing the GENCODE-annotated human and mouse lncRNAs, the orthologs of these lncRNAs in 17 mammals, and the genome sequences of the 17 mammals. Cross-species and genome-wide lncRNA/DNA-binding analysis begins with and is driven by database search. Reference

A quantitative approach for measuring the reservoir of latent HIV-1 proviruses

A stable latent reservoir for HIV-1 in resting CD4+ T cells is the principal barrier to a cure. Curative strategies that target the reservoir are being tested and require accurate, scalable reservoir assays.

The reservoir was defined with quantitative viral outgrowth assays for cells that release infectious virus after one round of T cell activation. However, these quantitative outgrowth assays and newer assays for cells that produce viral RNA after activation may underestimate the reservoir size because one round of activation does not induce all proviruses. Many studies rely on simple assays based on polymerase chain reaction to detect proviral DNA regardless of transcriptional status, but the clinical relevance of these assays is unclear, as the vast majority of proviruses are defective. Here we describe a more accurate method of measuring the HIV-1 reservoir that separately quantifies intact and defective proviruses. Reference

Meta-Research: Incidences of problematic cell lines are lower in papers that use RRIDs to identify cell lines

The use of misidentified and contaminated cell lines continues to be a problem in biomedical research. Research Resource Identifiers (RRIDs) should reduce the prevalence of misidentified and contaminated cell lines in the literature by alerting researchers to cell lines that are on the list of problematic cell lines, which is maintained by the International Cell Line Authentication Committee (ICLAC) and the Cellosaurus database.

To test this assertion, we text-mined the methods sections of about two million papers in PubMed Central, identifying 305,161 unique cell-line names in 150,459 articles. We estimate that 8.6% of these cell lines were on the list of problematic cell lines, whereas only 3.3% of the cell lines in the 634 papers that included RRIDs were on the problematic list. This suggests that the use of RRIDs is associated with a lower reported use of problematic cell lines. Reference
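As a rough illustration of the kind of text-mining comparison described above, the sketch below pulls cell-line RRIDs out of a methods passage and computes the share that appears on a problematic-lines list. It is not the authors' pipeline: the regular expression assumes Cellosaurus-style accessions of the form `CVCL_` plus four characters, and the sample text, the function names, and the one-entry list are hypothetical placeholders.

```python
import re

# Assumption: cell-line RRIDs look like "RRID:CVCL_0030" (Cellosaurus-style accession).
RRID_CELL_LINE = re.compile(r"RRID:\s*(CVCL_[A-Z0-9]{4})")


def find_cell_line_rrids(methods_text):
    """Return the cell-line accessions cited via RRID in a methods passage."""
    return RRID_CELL_LINE.findall(methods_text)


def problematic_fraction(accessions, problematic):
    """Fraction of cited accessions that appear in the problematic set."""
    if not accessions:
        return 0.0
    flagged = sum(1 for acc in accessions if acc in problematic)
    return flagged / len(accessions)


# Hypothetical inputs for illustration only.
sample_text = "Cells (RRID:CVCL_0030) and (RRID: CVCL_0045) were cultured as described."
placeholder_list = {"CVCL_0030"}  # stand-in set, not the actual ICLAC register

found = find_cell_line_rrids(sample_text)
share = problematic_fraction(found, placeholder_list)
```

Run over many papers, the same two steps would give one share for papers with RRIDs and one for papers without, which is the comparison behind the 3.3% and 8.6% figures.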

Multiple-gene targeting and mismatch tolerance can confound analysis of genome-wide pooled CRISPR screens

Genome-wide loss-of-function screens using the CRISPR/Cas9 system allow the efficient discovery of cancer cell vulnerabilities. While several studies have focused on correcting for DNA cleavage toxicity biases associated with copy number alterations, the effects of sgRNAs co-targeting multiple genomic loci in CRISPR screens have not been discussed.

In this work, we analyze CRISPR essentiality screen data from 391 cancer cell lines to characterize biases induced by multi-target sgRNAs. We investigate two types of multi-targets: on-targets predicted through perfect sequence complementarity and off-targets predicted through sequence complementarity with up to two nucleotide mismatches. Reference
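The two categories of multi-target can be sketched with a simple mismatch count. This is a toy illustration under strong assumptions, not the method used in the study: it compares a guide against pre-extracted candidate sites of equal length, ignores PAM sequences, strand, and genome alignment, and all sequences and gene names are invented.

```python
def mismatches(guide, site):
    """Count positions at which two equal-length sequences differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    return sum(a != b for a, b in zip(guide, site))


def classify_sites(guide, sites, max_mismatches=2):
    """Split candidate sites into perfect matches and near matches.

    Perfect matches have zero mismatches; near matches have between one
    and max_mismatches mismatches. Sites with more mismatches are dropped.
    """
    perfect, near = [], []
    for name, seq in sites.items():
        d = mismatches(guide, seq)
        if d == 0:
            perfect.append(name)
        elif d <= max_mismatches:
            near.append(name)
    return perfect, near


# Invented 20-nt guide and candidate sites, for illustration only.
guide = "GACGTTACCGGATTCAAGCT"
sites = {
    "geneA": "GACGTTACCGGATTCAAGCT",  # intended target
    "geneB": "GACGTTACCGGATTCAAGCT",  # identical site in a second gene
    "geneC": "TACGTTACCGGATTCAAGCA",  # two mismatches
    "geneD": "TTGGTTACCGGATTCAAGCT",  # three mismatches
}
perfect, near = classify_sites(guide, sites)
```

Here the guide has two perfect-match sites and one site within two mismatches, so a depletion signal attributed to `geneA` alone could also reflect cutting at `geneB` or `geneC`.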

Single-cell analysis reveals congruence between kidney organoids and human fetal kidney

Human kidney organoids hold promise for studying development, disease modelling and drug screening. However, the utility of stem cell-derived kidney tissues will depend on how faithfully these replicate normal fetal development at the level of cellular identity and complexity.

Here, we present an integrated analysis of single cell datasets from human kidney organoids and human fetal kidney to assess similarities and differences between the component cell types. Reference

The genome of broomcorn millet

Broomcorn millet (Panicum miliaceum L.) is the most water-efficient cereal and one of the earliest domesticated plants. Here we report its high-quality, chromosome-scale genome assembly using a combination of short-read sequencing, single-molecule real-time sequencing, Hi-C, and a high-density genetic map.

Phylogenetic analyses reveal two sets of homologous chromosomes that may have merged ~5.6 million years ago, both of which exhibit strong synteny with other grass species. Broomcorn millet contains 55,930 protein-coding genes and 339 microRNA genes. We find Paniceae-specific expansion in several subfamilies of the BTB (broad complex/tramtrack/bric-a-brac) subunit of ubiquitin E3 ligases, suggesting enhanced regulation of protein dynamics may have contributed to the evolution of broomcorn millet. Reference

Diverse motif ensembles specify non-redundant DNA binding activities of AP-1 family members in macrophages

Mechanisms by which members of the AP-1 family of transcription factors play non-redundant biological roles despite recognizing the same DNA sequence remain poorly understood.

To address this question, here we investigate the molecular functions and genome-wide DNA binding patterns of AP-1 family members in primary and immortalized mouse macrophages. ChIP-sequencing shows overlapping and distinct binding profiles for each factor that were remodeled following TLR4 ligation. Development of a machine learning approach that jointly weighs hundreds of DNA recognition elements yields dozens of motifs predicted to drive factor-specific binding profiles. Machine learning-based predictions are confirmed by analysis of the effects of mutations in genetically diverse mice and by loss of function experiments. Reference

Prediction of functional microRNA targets by integrative modeling of microRNA binding and target expression data

We perform a large-scale RNA sequencing study to experimentally identify genes that are downregulated by 25 miRNAs.

This RNA-seq dataset is combined with public miRNA target binding data to systematically identify miRNA targeting features that are characteristic of both miRNA binding and target downregulation. By integrating these common features in a machine learning framework, we develop and validate an improved computational model for genome-wide miRNA target prediction. Reference

Population structure of human gut bacteria in a diverse cohort from rural Tanzania and Botswana

Gut microbiota from individuals in rural, non-industrialized societies differ from those in individuals from industrialized societies.

Here, we use 16S rRNA sequencing to survey the gut bacteria of seven non-industrialized populations from Tanzania and Botswana. These include populations practicing traditional hunter-gatherer, pastoralist, and agropastoralist subsistence lifestyles and a comparative urban cohort from the greater Philadelphia region. Reference

Microbial network disturbances in relapsing refractory Crohn’s disease

Inflammatory bowel diseases (IBD) can be broadly divided into Crohn’s disease (CD) and ulcerative colitis (UC) on the basis of their clinical phenotypes.

Over 150 host susceptibility genes have been described, although most overlap between CD, UC and their subtypes, and they do not adequately account for the overall incidence or the highly variable severity of disease. Replicating key findings between two long-term IBD cohorts, we have defined distinct networks of taxa associations within intestinal biopsies of CD and UC patients. Disturbances in an association network containing taxa of the Lachnospiraceae and Ruminococcaceae families, typically producing short chain fatty acids, characterize frequently relapsing disease and poor responses to treatment with anti-TNF-α therapeutic antibodies. Reference

Reconstruction of full-length circular RNAs enables isoform-level quantification

Currently, circRNA studies are shifting from the identification of circular transcripts to understanding their biological functions. However, such endeavors have been limited by the difficulty of determining full-length sequences at large scale and by the inability to quantify circRNAs accurately at the isoform level.

Here, we propose a new feature, reverse overlap (RO), for circRNA detection, which outperforms back-splice junction (BSJ)-based methods in identifying low-abundance circRNAs. By combining RO and BSJ features, we present a novel approach for effective reconstruction of full-length circRNAs and isoform-level quantification from the transcriptome. We systematically compared the difference between the BSJ-level and isoform-level differential expression analyses using human liver tumor and normal tissues and highlight the necessity of deepening circRNA studies to the isoform-level resolution. Reference

Aberrant enhancer hypomethylation contributes to hepatic carcinogenesis through global transcriptional reprogramming

Hepatocellular carcinomas (HCC) exhibit distinct promoter hypermethylation patterns, but the epigenetic regulation and function of transcriptional enhancers remain unclear. Here, our affinity- and bisulfite-based whole-genome sequencing analyses reveal global enhancer hypomethylation in human HCCs.

Integrative epigenomic characterization further pinpoints a recurrent hypomethylated enhancer of CCAAT/enhancer-binding protein-beta (C/EBPβ) which correlates with C/EBPβ over-expression and poorer prognosis of patients. Demethylation of C/EBPβ enhancer reactivates a self-reinforcing enhancer-target loop via direct transcriptional up-regulation of enhancer RNA. Conversely, deletion of this enhancer via CRISPR/Cas9 reduces C/EBPβ expression and its genome-wide co-occupancy with BRD4 at H3K27ac-marked enhancers and super-enhancers, leading to drastic suppression of driver oncogenes and HCC tumorigenicity. Reference

iGUIDE: an improved pipeline for analyzing CRISPR cleavage specificity

Genome engineering methods have advanced greatly with the development of programmable nucleases, but methods for quantifying on- and off-target cleavage sites and associated deletions remain nascent.

Here, we report an improvement of the GUIDE-seq method, iGUIDE, which allows filtering of mispriming events to clarify the true cleavage signal. Using iGUIDE, we specify the locations of Cas9-guided cleavage for four guide RNAs, characterize associated deletions, and show that naturally occurring background DNA double-strand breaks are associated with open chromatin, gene dense regions, and chromosomal fragile sites. Reference

An automated Bayesian pipeline for rapid analysis of single-molecule binding data

Single-molecule binding assays enable the study of how molecular machines assemble and function. Current algorithms can identify and locate individual molecules, but require tedious manual validation of each spot.

Moreover, no solution for high-throughput analysis of single-molecule binding data exists. Here, we describe an automated pipeline to analyze single-molecule data over a wide range of experimental conditions. In addition, our method enables state estimation on multivariate Gaussian signals. We validate our approach using simulated data, and benchmark the pipeline by measuring the binding properties of the well-studied, DNA-guided DNA endonuclease, TtAgo, an Argonaute protein from the Eubacterium Thermus thermophilus. Reference

Tumor mutational load predicts survival after immunotherapy across multiple cancer types

Immune checkpoint inhibitor (ICI) treatments benefit some patients with metastatic cancers, but predictive biomarkers are needed. Findings in selected cancer types suggest that tumor mutational burden (TMB) may predict clinical response to ICI.

To examine this association more broadly, we analyzed the clinical and genomic data of 1,662 advanced cancer patients treated with ICI, and 5,371 non-ICI-treated patients, whose tumors underwent targeted next-generation sequencing (MSK-IMPACT). Among all patients, higher somatic TMB (highest 20% in each histology) was associated with better overall survival. For most cancer histologies, an association between higher TMB and improved survival was observed. The TMB cutpoints associated with improved survival varied markedly between cancer types. Reference
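The "highest 20% in each histology" definition means the TMB cutpoint is computed separately for each cancer type. The sketch below shows one simple way to do that; the rank-based percentile rule, the function names, and the TMB values are assumptions for illustration and are not taken from the study.

```python
from collections import defaultdict


def top_fraction_cutoffs(patients, top_percent=20):
    """Per-histology TMB cutoff; values strictly above it fall in the top group.

    patients is an iterable of (histology, tmb) pairs.
    """
    by_histology = defaultdict(list)
    for histology, tmb in patients:
        by_histology[histology].append(tmb)

    cutoffs = {}
    for histology, values in by_histology.items():
        values.sort()
        n = len(values)
        # Size of the lower group, rounded up, using integer arithmetic.
        k = (n * (100 - top_percent) + 99) // 100
        cutoffs[histology] = values[k - 1]
    return cutoffs


def is_high_tmb(histology, tmb, cutoffs):
    """True if this TMB value is in the top group for its own histology."""
    return tmb > cutoffs[histology]


# Invented TMB values for two hypothetical histologies.
patients = [("type_A", t) for t in [2, 5, 9, 14, 40]]
patients += [("type_B", t) for t in [1, 1, 2, 3, 6]]
cutoffs = top_fraction_cutoffs(patients)
```

With these invented values, a TMB of 6 counts as high in `type_B` but not in `type_A`, which mirrors the point that cutpoints vary between cancer types.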

A macrophage-based screen identifies antibacterial compounds selective for intracellular Salmonella Typhimurium

Salmonella Typhimurium (S. Tm) establishes systemic infection in susceptible hosts by evading the innate immune response and replicating within host phagocytes.

Here, we sought to identify inhibitors of intracellular S. Tm replication by conducting parallel chemical screens against S. Tm growing in macrophage-mimicking media and within macrophages. We identify several compounds that inhibit Salmonella growth in the intracellular environment and in acidic, ion-limited media. We report on the antimicrobial activity of the psychoactive drug metergoline, which is specific against intracellular S. Tm. Screening an S. Tm deletion library in the presence of metergoline reveals hypersensitization of outer membrane mutants to metergoline activity. Reference

A gene expression map of shoot domains reveals regulatory mechanisms

Gene regulatory networks control development via domain-specific gene expression. In seed plants, self-renewing stem cells located in the shoot apical meristem (SAM) produce leaves from the SAM peripheral zone. After initiation, leaves develop polarity patterns to form a planar shape.

Here we compare translating RNAs among SAM and leaf domains. Using translating ribosome affinity purification and RNA sequencing to quantify gene expression in target domains, we generate a domain-specific translatome map covering representative vegetative stage SAM and leaf domains. We discuss the predicted cellular functions of these domains and provide evidence that some seemingly unrelated domains utilize common regulatory modules. Experimental follow-up shows that the RABBIT EARS and HANABA TARANU transcription factors have roles in axillary meristem initiation. Reference