Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data
t-distributed stochastic neighbor embedding (t-SNE) is widely used for visualizing single-cell RNA-sequencing (scRNA-seq) data, but it scales poorly to large datasets.
We dramatically accelerate t-SNE, obviating the need for data downsampling, and hence allowing visualization of rare cell populations. Furthermore, we implement a heatmap-style visualization for scRNA-seq based on one-dimensional t-SNE for simultaneously visualizing the expression patterns of thousands of genes. Reference
A mathematical-descriptor of tumor-mesoscopic-structure from CT images annotates prognostic- and molecular-phenotypes of epithelial ovarian cancer
The five-year survival rate of epithelial ovarian cancer (EOC) is approximately 35–40% despite maximal treatment efforts, highlighting a need for stratification biomarkers for personalized treatment.
Here we extract 657 quantitative mathematical descriptors from the preoperative CT images of 364 EOC patients at their initial presentation. Using machine learning, we derive a non-invasive summary-statistic of the primary ovarian tumor based on 4 descriptors, which we name “Radiomic Prognostic Vector” (RPV). RPV reliably identifies the 5% of patients with median overall survival less than 2 years, significantly improves established prognostic methods, and is validated in two independent, multi-center cohorts. Reference
Single cell functional genomics reveals the importance of mitochondria in cell-to-cell phenotypic variation
Mutations frequently have outcomes that differ across individuals, even when these individuals are genetically identical and share a common environment.
Moreover, individual microbial and mammalian cells can vary substantially in their proliferation rates, stress tolerance, and drug resistance, with important implications for the treatment of infections and cancer. To investigate the causes of cell-to-cell variation in proliferation, we used a high-throughput automated microscopy assay to quantify the impact of deleting >1500 genes in yeast. Mutations affecting mitochondria were particularly variable in their outcome. In both mutant and wild-type cells mitochondrial membrane potential – but not amount – varied substantially across individual cells and predicted cell-to-cell variation in proliferation, mutation outcome, stress tolerance, and resistance to a clinically used anti-fungal drug. These results suggest an important role for cell-to-cell variation in the state of an organelle in single cell phenotypic variation. Reference
Mismatch repair-signature mutations activate gene enhancers across human colorectal cancer epigenomes
Commonly-mutated genes have been found for many cancers, but less is known about mutations in cis-regulatory elements. We leverage gains in tumor-specific enhancer activity, coupled with allele-biased mutation detection from H3K27ac ChIP-seq data, to pinpoint potential enhancer-activating mutations in colorectal cancer (CRC).
Analysis of a genetically-diverse cohort of CRC specimens revealed that microsatellite instable (MSI) samples have a high indel rate within active enhancers. Enhancers with indels show evidence of positive selection, increased target gene expression, and a subset is highly recurrent. The indels affect short homopolymer tracts of A/T and increase affinity for FOX transcription factors. We further demonstrate that signature mismatch-repair (MMR) mutations activate enhancers using a xenograft tumor metastasis model, where mutations are induced naturally via CRISPR/Cas9 inactivation of MLH1 prior to tumor cell injection. Our results suggest that MMR signature mutations activate enhancers in CRC tumor epigenomes to provide a selective advantage. Reference
WGS identifies ADGRG6 enhancer mutations and FRS2 duplications as angiogenesis-related drivers in bladder cancer
Bladder cancer is one of the most common and highly vascularized cancers. To better understand its genomic structure and underlying etiology, we conduct whole-genome and targeted sequencing in urothelial bladder carcinomas (UBCs, the most common type of bladder cancer).
Recurrent mutations in noncoding regions affecting gene regulatory elements and structural variations (SVs) leading to gene disruptions are prevalent. Notably, we find recurrent ADGRG6 enhancer mutations and FRS2 duplications, which are associated with higher protein expression in the tumor and poor prognosis. Functional assays demonstrate that depletion of ADGRG6 or FRS2 expression in UBC cells compromises their ability to recruit endothelial cells and induce tube formation. Reference
Modeling double strand break susceptibility to interrogate structural variation in cancer
Structural variants (SVs) are known to play important roles in a variety of cancers, but their origins and functional consequences are still poorly understood.
Many SVs are thought to emerge from errors in the repair processes following DNA double strand breaks (DSBs). We used experimentally quantified DSB frequencies in cell lines with matched chromatin and sequence features to derive the first quantitative genome-wide models of DSB susceptibility. These models are accurate and provide novel insights into the mutational mechanisms generating DSBs. Models trained in one cell type can be successfully applied to others, but a substantial proportion of DSBs appear to reflect cell type-specific processes. Reference
Functional genomics reveal gene regulatory mechanisms underlying schizophrenia risk
Genome-wide association studies (GWASs) have identified over 180 independent schizophrenia risk loci. Nevertheless, how the risk variants in the reported loci confer schizophrenia susceptibility remains largely unknown.
Here we systematically investigate the gene regulatory mechanisms underpinning schizophrenia risk by integrating functional genomics data (including 30 ChIP-Seq experiments) with position weight matrix (PWM) analysis. We identify 132 risk single nucleotide polymorphisms (SNPs) that disrupt transcription factor (TF) binding, and we find that 97 of these 132 TF binding-disrupting SNPs are associated with gene expression in human brain tissues. Reference
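As a rough illustration of how a PWM can flag a binding-disrupting SNP, the sketch below scores the reference and alternate alleles of a short site against a log-odds matrix and reports the change. The three-position motif, the uniform background of 0.25, and the function names are hypothetical choices made for illustration; the abstract does not describe the scoring procedure the authors used.

```python
import math

def log_odds_pwm(freqs, background=0.25):
    """Convert per-position base frequencies into log2-odds scores."""
    return [{base: math.log2(p / background) for base, p in column.items()}
            for column in freqs]

def score(pwm, seq):
    """Sum the log-odds scores of seq, which must span the whole motif."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal motif length")
    return sum(column[base] for column, base in zip(pwm, seq))

def snp_effect(pwm, ref_seq, alt_seq):
    """Score change caused by the SNP; negative values mean weaker predicted binding."""
    return score(pwm, alt_seq) - score(pwm, ref_seq)

# Hypothetical toy motif preferring A, then C, then G (not a real TF matrix).
toy_freqs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
toy_pwm = log_odds_pwm(toy_freqs)

# A C-to-T change at the second position of the site.
delta = snp_effect(toy_pwm, ref_seq="ACG", alt_seq="ATG")
print(f"score change: {delta:.3f}")
```

In practice a SNP would be scored in every motif window that overlaps it and on both strands; this sketch covers a single window only.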
Multiplexed profiling of RNA and protein expression signatures in individual cells using flow or mass cytometry
Advances in single-cell analysis technologies are providing novel insights into phenotypic and functional heterogeneity within seemingly identical cell populations. RNA within single cells can be analyzed using unbiased sequencing protocols or through more targeted approaches using in situ hybridization (ISH).
The proximity ligation assay for RNA (PLAYR) approach is a sensitive and high-throughput technique that relies on in situ and proximal ligation to measure at least 27 specific RNAs by flow or mass cytometry. We provide detailed instructions for combining this technique with antibody-based detection of surface/internal protein, allowing simultaneous highly multiplexed profiling of RNA and protein expression at single-cell resolution. PLAYR overcomes limitations on multiplexing seen in previous branching DNA–based RNA detection techniques by integration of a transcript-specific oligonucleotide sequence within a rolling-circle amplification (RCA). Reference
A human gut bacterial genome and culture collection for improved metagenomic analyses
Understanding gut microbiome functions requires cultivated bacteria for experimental validation and reference bacterial genome sequences to interpret metagenome datasets and guide functional analyses.
We present the Human Gastrointestinal Bacteria Culture Collection (HBC), a comprehensive set of 737 whole-genome-sequenced bacterial isolates, representing 273 species (105 novel species) from 31 families found in the human gastrointestinal microbiota. The HBC increases the number of bacterial genomes derived from human gastrointestinal microbiota by 37%. The resulting global Human Gastrointestinal Bacteria Genome Collection (HGG) classifies 83% of genera by abundance across 13,490 shotgun-sequenced metagenomic samples, improves taxonomic classification by 61% compared to the Human Microbiome Project (HMP) genome collection and achieves subspecies-level classification for almost 50% of sequences. Reference
Neutrophils escort circulating tumour cells to enable cell cycle progression
A better understanding of the features that define the interaction between cancer cells and immune cells is important for the development of new cancer therapies.
However, focus is often given to interactions that occur within the primary tumour and its microenvironment, whereas the role of immune cells during cancer dissemination in patients remains largely uncharacterized. Circulating tumour cells (CTCs) are precursors of metastasis in several types of cancer, and are occasionally found within the bloodstream in association with non-malignant cells such as white blood cells (WBCs). The identity and function of these CTC-associated WBCs, as well as the molecular features that define the interaction between WBCs and CTCs, are unknown. Here we isolate and characterize individual CTC-associated WBCs, as well as corresponding cancer cells within each CTC–WBC cluster, from patients with breast cancer and from mouse models. Reference
Ancient human genome-wide data from a 3000-year interval in the Caucasus corresponds with eco-geographic regions
Archaeogenetic studies have described the formation of Eurasian ‘steppe ancestry’ as a mixture of Eastern and Caucasus hunter-gatherers.
However, it remains unclear when and where this ancestry arose and whether it was related to a horizon of cultural innovations in the 4th millennium BCE that subsequently facilitated the advance of pastoral societies in Eurasia. Here we generated genome-wide SNP data from 45 prehistoric individuals along a 3000-year temporal transect in the North Caucasus. We observe a genetic separation between the groups of the Caucasus and those of the adjacent steppe. The northern Caucasus groups are genetically similar to contemporaneous populations south of it, suggesting human movement across the mountain range during the Bronze Age. Reference
A comparative evaluation of hybrid error correction methods for error-prone long reads
Third-generation sequencing technologies have advanced biological research by generating reads that are substantially longer than those of second-generation sequencing technologies.
However, their notorious high error rate impedes straightforward data analysis and limits their application. Here, we present a comparative performance assessment of ten state-of-the-art error-correction methods for long reads. We established a common set of benchmarks for performance assessment, including sensitivity, accuracy, output rate, alignment rate, output read length, run time, and memory usage, as well as the effects of error correction on two downstream applications of long reads: de novo assembly and resolving haplotype sequences. Reference
Trait-based community assembly and succession of the infant gut microbiome
The human gut microbiome develops over early childhood and aids in food digestion and immunomodulation, but the mechanisms driving its development remain elusive.
Here we use data curated from literature and online repositories to examine trait-based patterns of gut microbiome succession in 56 infants over their first three years of life. We also develop a new phylogeny-based approach of inferring trait values that can extend readily to other microbial systems and questions. Trait-based patterns suggest that infant gut succession begins with a functionally variable cohort of taxa, adept at proliferating rapidly within hosts, which gradually matures into a more functionally uniform cohort of taxa adapted to thrive in the anoxic gut and disperse between anoxic patches as oxygen-tolerant spores. Reference
Gene editing of the multi-copy H2A.B gene and its importance for fertility
Altering the biochemical makeup of chromatin by the incorporation of histone variants during development represents a key mechanism in regulating gene expression.
The histone variant H2A.B (H2A.B.3 in mice) appeared late in evolution and is most highly expressed in the testis. In the mouse, it is encoded by three different genes. H2A.B expression is spatially and temporally regulated during spermatogenesis, being most highly expressed in the haploid round spermatid stage. Active genes gain H2A.B, where it directly interacts with polymerase II and RNA processing factors within splicing speckles. However, the importance of H2A.B for gene expression and fertility is unknown. Reference
Pipelines for cross-species and genome-wide prediction of long noncoding RNA binding
Abundant long, noncoding RNAs (lncRNAs) in mammals can bind to DNA sequences and recruit histone- and DNA-modifying enzymes to binding sites to epigenetically regulate target genes.
However, most lncRNAs’ binding motifs and target sites are unknown. The large numbers of lncRNAs and target sites in the whole genome make it infeasible to examine lncRNA binding to DNA purely experimentally. Here, we report a protocol for lncRNA/DNA-binding analysis that is built upon a database containing the GENCODE-annotated human and mouse lncRNAs, the orthologs of these lncRNAs in 17 mammals, and the genome sequences of the 17 mammals. Cross-species and genome-wide lncRNA/DNA-binding analysis begins with and is driven by database search. Reference
A quantitative approach for measuring the reservoir of latent HIV-1 proviruses
A stable latent reservoir for HIV-1 in resting CD4+ T cells is the principal barrier to a cure. Curative strategies that target the reservoir are being tested and require accurate, scalable reservoir assays.
The reservoir was defined with quantitative viral outgrowth assays for cells that release infectious virus after one round of T cell activation. However, these quantitative outgrowth assays and newer assays for cells that produce viral RNA after activation may underestimate the reservoir size because one round of activation does not induce all proviruses. Many studies rely on simple assays based on polymerase chain reaction to detect proviral DNA regardless of transcriptional status, but the clinical relevance of these assays is unclear, as the vast majority of proviruses are defective. Here we describe a more accurate method of measuring the HIV-1 reservoir that separately quantifies intact and defective proviruses. Reference
Meta-Research: Incidences of problematic cell lines are lower in papers that use RRIDs to identify cell lines
The use of misidentified and contaminated cell lines continues to be a problem in biomedical research. Research Resource Identifiers (RRIDs) should reduce the prevalence of misidentified and contaminated cell lines in the literature by alerting researchers to cell lines that are on the list of problematic cell lines, which is maintained by the International Cell Line Authentication Committee (ICLAC) and the Cellosaurus database.
To test this assertion, we text-mined the methods sections of about two million papers in PubMed Central, identifying 305,161 unique cell-line names in 150,459 articles. We estimate that 8.6% of these cell lines were on the list of problematic cell lines, whereas only 3.3% of the cell lines in the 634 papers that included RRIDs were on the problematic list. This suggests that the use of RRIDs is associated with a lower reported use of problematic cell lines. Reference
Multiple-gene targeting and mismatch tolerance can confound analysis of genome-wide pooled CRISPR screens
Genome-wide loss-of-function screens using the CRISPR/Cas9 system allow the efficient discovery of cancer cell vulnerabilities. While several studies have focused on correcting for DNA cleavage toxicity biases associated with copy number alterations, the effects of sgRNAs co-targeting multiple genomic loci in CRISPR screens have not been discussed.
In this work, we analyze CRISPR essentiality screen data from 391 cancer cell lines to characterize biases induced by multi-target sgRNAs. We investigate two types of multi-targets: on-targets predicted through perfect sequence complementarity and off-targets predicted through sequence complementarity with up to two nucleotide mismatches. Reference
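The two classes of multi-targets described above can be illustrated with a minimal sketch that counts mismatches between an sgRNA spacer and candidate genomic sites. The sequences, site names, and function names are hypothetical, and the sketch deliberately ignores PAM requirements, strand, and mismatch position, all of which a real prediction pipeline would likely consider.

```python
def count_mismatches(spacer: str, site: str) -> int:
    """Hamming distance between two equal-length nucleotide sequences."""
    if len(spacer) != len(site):
        raise ValueError("spacer and site must have the same length")
    return sum(a != b for a, b in zip(spacer, site))

def classify_sites(spacer, candidate_sites, max_mismatches=2):
    """Split candidate sites into perfect matches and mismatch-tolerant matches."""
    perfect, tolerated = [], []
    for name, site in candidate_sites.items():
        mismatches = count_mismatches(spacer, site)
        if mismatches == 0:
            perfect.append(name)
        elif mismatches <= max_mismatches:
            tolerated.append(name)
    return perfect, tolerated

# Hypothetical 20-nt spacer and candidate sites, for illustration only.
spacer = "ACGTACGTACGTACGTACGT"
candidate_sites = {
    "siteA": "ACGTACGTACGTACGTACGT",  # identical to the spacer
    "siteB": "ACGTACGTACGTACGTACAA",  # two mismatches at the 3' end
    "siteC": "TTTTACGTACGTACGTACGT",  # three mismatches at the 5' end
}

perfect, tolerated = classify_sites(spacer, candidate_sites)
print("perfect matches:", perfect)
print("matches with up to two mismatches:", tolerated)
```

An sgRNA whose `perfect` or `tolerated` lists contain sites in more than one gene would be a candidate multi-target guide under this simplified definition.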
Single-cell analysis reveals congruence between kidney organoids and human fetal kidney
Human kidney organoids hold promise for studying development, disease modelling and drug screening. However, the utility of stem cell-derived kidney tissues will depend on how faithfully these replicate normal fetal development at the level of cellular identity and complexity.
Here, we present an integrated analysis of single cell datasets from human kidney organoids and human fetal kidney to assess similarities and differences between the component cell types. Reference
The genome of broomcorn millet
Broomcorn millet (Panicum miliaceum L.) is the most water-efficient cereal and one of the earliest domesticated plants. Here we report its high-quality, chromosome-scale genome assembly using a combination of short-read sequencing, single-molecule real-time sequencing, Hi-C, and a high-density genetic map.
Phylogenetic analyses reveal two sets of homologous chromosomes that may have merged ~5.6 million years ago, both of which exhibit strong synteny with other grass species. Broomcorn millet contains 55,930 protein-coding genes and 339 microRNA genes. We find Paniceae-specific expansion in several subfamilies of the BTB (broad complex/tramtrack/bric-a-brac) subunit of ubiquitin E3 ligases, suggesting enhanced regulation of protein dynamics may have contributed to the evolution of broomcorn millet. Reference
Diverse motif ensembles specify non-redundant DNA binding activities of AP-1 family members in macrophages
Mechanisms by which members of the AP-1 family of transcription factors play non-redundant biological roles despite recognizing the same DNA sequence remain poorly understood.
To address this question, here we investigate the molecular functions and genome-wide DNA binding patterns of AP-1 family members in primary and immortalized mouse macrophages. ChIP-sequencing shows overlapping and distinct binding profiles for each factor that were remodeled following TLR4 ligation. Development of a machine learning approach that jointly weighs hundreds of DNA recognition elements yields dozens of motifs predicted to drive factor-specific binding profiles. Machine learning-based predictions are confirmed by analysis of the effects of mutations in genetically diverse mice and by loss of function experiments. Reference
Prediction of functional microRNA targets by integrative modeling of microRNA binding and target expression data
We perform a large-scale RNA sequencing study to experimentally identify genes that are downregulated by 25 miRNAs.
This RNA-seq dataset is combined with public miRNA target binding data to systematically identify miRNA targeting features that are characteristic of both miRNA binding and target downregulation. By integrating these common features in a machine learning framework, we develop and validate an improved computational model for genome-wide miRNA target prediction. Reference
Population structure of human gut bacteria in a diverse cohort from rural Tanzania and Botswana
Gut microbiota from individuals in rural, non-industrialized societies differ from those in individuals from industrialized societies.
Here, we use 16S rRNA sequencing to survey the gut bacteria of seven non-industrialized populations from Tanzania and Botswana. These include populations practicing traditional hunter-gatherer, pastoralist, and agropastoralist subsistence lifestyles and a comparative urban cohort from the greater Philadelphia region. Reference
Microbial network disturbances in relapsing refractory Crohn’s disease
Inflammatory bowel diseases (IBD) can be broadly divided into Crohn’s disease (CD) and ulcerative colitis (UC) from their clinical phenotypes.
Over 150 host susceptibility genes have been described, although most overlap between CD, UC and their subtypes, and they do not adequately account for the overall incidence or the highly variable severity of disease. Replicating key findings between two long-term IBD cohorts, we have defined distinct networks of taxa associations within intestinal biopsies of CD and UC patients. Disturbances in an association network containing taxa of the Lachnospiraceae and Ruminococcaceae families, typically producing short chain fatty acids, characterize frequently relapsing disease and poor responses to treatment with anti-TNF-α therapeutic antibodies. Reference
Reconstruction of full-length circular RNAs enables isoform-level quantification
Currently, circRNA studies are shifting from the identification of circular transcripts to understanding their biological functions. However, such endeavors have been limited by the difficulty of determining full-length circRNA sequences at large scale and by the inability to quantify circRNAs accurately at the isoform level.
Here, we propose a new feature, reverse overlap (RO), for circRNA detection, which outperforms back-splice junction (BSJ)-based methods in identifying low-abundance circRNAs. By combining RO and BSJ features, we present a novel approach for effective reconstruction of full-length circRNAs and isoform-level quantification from the transcriptome. We systematically compared the difference between the BSJ-level and isoform-level differential expression analyses using human liver tumor and normal tissues and highlight the necessity of deepening circRNA studies to the isoform-level resolution. Reference
Aberrant enhancer hypomethylation contributes to hepatic carcinogenesis through global transcriptional reprogramming
Hepatocellular carcinomas (HCC) exhibit distinct promoter hypermethylation patterns, but the epigenetic regulation and function of transcriptional enhancers remain unclear. Here, our affinity- and bisulfite-based whole-genome sequencing analyses reveal global enhancer hypomethylation in human HCCs.
Integrative epigenomic characterization further pinpoints a recurrent hypomethylated enhancer of CCAAT/enhancer-binding protein-beta (C/EBPβ) which correlates with C/EBPβ over-expression and poorer prognosis of patients. Demethylation of C/EBPβ enhancer reactivates a self-reinforcing enhancer-target loop via direct transcriptional up-regulation of enhancer RNA. Conversely, deletion of this enhancer via CRISPR/Cas9 reduces C/EBPβ expression and its genome-wide co-occupancy with BRD4 at H3K27ac-marked enhancers and super-enhancers, leading to drastic suppression of driver oncogenes and HCC tumorigenicity. Reference
iGUIDE: an improved pipeline for analyzing CRISPR cleavage specificity
Genome engineering methods have advanced greatly with the development of programmable nucleases, but methods for quantifying on- and off-target cleavage sites and associated deletions remain nascent.
Here, we report an improvement of the GUIDE-seq method, iGUIDE, which allows filtering of mispriming events to clarify the true cleavage signal. Using iGUIDE, we specify the locations of Cas9-guided cleavage for four guide RNAs, characterize associated deletions, and show that naturally occurring background DNA double-strand breaks are associated with open chromatin, gene dense regions, and chromosomal fragile sites. Reference
An automated Bayesian pipeline for rapid analysis of single-molecule binding data
Single-molecule binding assays enable the study of how molecular machines assemble and function. Current algorithms can identify and locate individual molecules, but require tedious manual validation of each spot.
Moreover, no solution for high-throughput analysis of single-molecule binding data exists. Here, we describe an automated pipeline to analyze single-molecule data over a wide range of experimental conditions. In addition, our method enables state estimation on multivariate Gaussian signals. We validate our approach using simulated data, and benchmark the pipeline by measuring the binding properties of the well-studied, DNA-guided DNA endonuclease, TtAgo, an Argonaute protein from the Eubacterium Thermus thermophilus. Reference
Tumor mutational load predicts survival after immunotherapy across multiple cancer types
Immune checkpoint inhibitor (ICI) treatments benefit some patients with metastatic cancers, but predictive biomarkers are needed. Findings in selected cancer types suggest that tumor mutational burden (TMB) may predict clinical response to ICI.
To examine this association more broadly, we analyzed the clinical and genomic data of 1,662 advanced cancer patients treated with ICI, and 5,371 non-ICI-treated patients, whose tumors underwent targeted next-generation sequencing (MSK-IMPACT). Among all patients, higher somatic TMB (highest 20% in each histology) was associated with better overall survival. For most cancer histologies, an association between higher TMB and improved survival was observed. The TMB cutpoints associated with improved survival varied markedly between cancer types. Reference
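Because the cutpoint is defined within each histology rather than across the whole cohort, the same absolute TMB can be "high" in one cancer type and not in another. The sketch below shows one way such a per-histology top-20% flag could be computed. The patient values and histology labels are made up, and the use of an interpolated 80th percentile with a strict "greater than" rule is an assumption; the abstract does not specify how ties or percentile interpolation were handled.

```python
from collections import defaultdict
from statistics import quantiles

def tmb_cutoffs(patients):
    """Return the 80th-percentile TMB for each histology.

    patients is a list of (histology, tmb) pairs.
    """
    by_histology = defaultdict(list)
    for histology, tmb in patients:
        by_histology[histology].append(tmb)
    # quantiles(..., n=5) yields the 20/40/60/80th percentiles; keep the last.
    return {h: quantiles(values, n=5, method="inclusive")[-1]
            for h, values in by_histology.items()}

def flag_high_tmb(patients):
    """Flag each patient whose TMB exceeds the cutoff of their own histology."""
    cutoffs = tmb_cutoffs(patients)
    return [tmb > cutoffs[histology] for histology, tmb in patients]

# Made-up cohort: histology "B" has tenfold higher TMB values than "A".
patients = [("A", v) for v in range(1, 11)] + [("B", 10 * v) for v in range(1, 11)]

cutoffs = tmb_cutoffs(patients)
flags = flag_high_tmb(patients)
print(cutoffs)
print(sum(flags), "of", len(patients), "patients flagged as high TMB")
```

With these toy values, a TMB of 10 is flagged in histology "A" but not in histology "B", which mirrors the observation that cutpoints vary markedly between cancer types.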
A macrophage-based screen identifies antibacterial compounds selective for intracellular Salmonella Typhimurium
Salmonella Typhimurium (S. Tm) establishes systemic infection in susceptible hosts by evading the innate immune response and replicating within host phagocytes.
Here, we sought to identify inhibitors of intracellular S. Tm replication by conducting parallel chemical screens against S. Tm growing in macrophage-mimicking media and within macrophages. We identify several compounds that inhibit Salmonella growth in the intracellular environment and in acidic, ion-limited media. We report on the antimicrobial activity of the psychoactive drug metergoline, which is specific against intracellular S. Tm. Screening an S. Tm deletion library in the presence of metergoline reveals hypersensitization of outer membrane mutants to metergoline activity. Reference
A gene expression map of shoot domains reveals regulatory mechanisms
Gene regulatory networks control development via domain-specific gene expression. In seed plants, self-renewing stem cells located in the shoot apical meristem (SAM) produce leaves from the SAM peripheral zone. After initiation, leaves develop polarity patterns to form a planar shape.
Here we compare translating RNAs among SAM and leaf domains. Using translating ribosome affinity purification and RNA sequencing to quantify gene expression in target domains, we generate a domain-specific translatome map covering representative vegetative stage SAM and leaf domains. We discuss the predicted cellular functions of these domains and provide evidence that some seemingly unrelated domains utilize common regulatory modules. Experimental follow-up shows that the RABBIT EARS and HANABA TARANU transcription factors have roles in axillary meristem initiation. Reference
High-performance medicine: the convergence of human and artificial intelligence
The use of artificial intelligence, and the deep-learning subtype in particular, has been enabled by the use of labeled big data, along with markedly enhanced computing power and cloud storage, across all sectors.
In medicine, this is beginning to have an impact at three levels: for clinicians, predominantly via rapid, accurate image interpretation; for health systems, by improving workflow and the potential for reducing medical errors; and for patients, by enabling them to process their own data to promote health. The current limitations, including bias, privacy and security, and lack of transparency, along with the future directions of these applications will be discussed in this article. Over time, marked improvements in accuracy, productivity, and workflow will likely be actualized, but whether that will be used to improve the patient–doctor relationship or facilitate its erosion remains to be seen. Reference
NG-TAS: an optimised protocol and computational pipeline for cost-effective profiling of circulating tumour DNA
Circulating tumour DNA (ctDNA) detection and monitoring have enormous potential clinical utility in oncology.
We describe here a fast, flexible and cost-effective method to profile multiple genes simultaneously in low input cell-free DNA (cfDNA): Next Generation-Targeted Amplicon Sequencing (NG-TAS). We designed a panel of 377 amplicons spanning 20 cancer genes and tested the NG-TAS pipeline using cell-free DNA from two HapMap lymphoblastoid cell lines. NG-TAS consistently detected mutations in cfDNA when mutation allele fraction was > 1%. We applied NG-TAS to a clinical cohort of metastatic breast cancer patients, demonstrating its potential in monitoring the disease. Reference
Accurate prediction of cell type-specific transcription factor binding
Prediction of cell type-specific, in vivo transcription factor binding sites is one of the central challenges in regulatory genomics. Here, we present our approach that earned a shared first rank in the “ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge” in 2017.
In post-challenge analyses, we benchmark the influence of different feature sets and find that chromatin accessibility and binding motifs are sufficient to yield state-of-the-art performance. Finally, we provide 682 lists of predicted peaks for a total of 31 transcription factors in 22 primary cell types and tissues and a user-friendly version of our approach, Catchitt, for download. Reference
Genome-wide profiling of adenine base editor specificity by EndoV-seq
The adenine base editor (ABE), capable of catalyzing A•T to G•C conversions, is an important gene editing tool. Here, we systematically evaluate genome-wide off-target deamination by ABEs using the EndoV-seq platform we developed.
EndoV-seq utilizes Endonuclease V to nick the inosine-containing DNA strand of genomic DNA deaminated by ABE in vitro. The treated DNA is then whole-genome sequenced to identify off-target sites. Of the eight gRNAs we tested with ABE, 2–19 (with an average of 8.0) off-target sites are found, significantly fewer than those found for canonical Cas9 nuclease (7–320, 160.7 on average). In vivo off-target deamination is further validated through target site deep sequencing. Moreover, we demonstrated that six different ABE-gRNA complexes could be examined in a single EndoV-seq assay. Reference
An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar
How viruses evolve within hosts can dictate infection outcomes; however, reconstructing this process is challenging. We evaluate our multiplexed amplicon approach, PrimalSeq, to demonstrate how virus concentration, sequencing coverage, primer mismatches, and replicates influence the accuracy of measuring intrahost virus diversity.
We develop an experimental protocol and computational tool, iVar, for using PrimalSeq to measure virus diversity using Illumina and compare the results to Oxford Nanopore sequencing. We demonstrate the utility of PrimalSeq by measuring Zika and West Nile virus diversity from varied sample types and show that the accumulation of genetic diversity is influenced by experimental and biological systems. Reference
The Network of Cancer Genes (NCG): a comprehensive catalogue of known and candidate cancer genes from cancer sequencing screens
The Network of Cancer Genes (NCG) is a manually curated repository of 2372 genes whose somatic modifications have known or predicted cancer driver roles.
These genes were collected from 275 publications, including two sources of known cancer genes and 273 cancer sequencing screens of more than 100 cancer types from 34,905 cancer donors and multiple primary sites. This represents a more than 1.5-fold content increase compared to the previous version. NCG also annotates properties of cancer genes, such as duplicability, evolutionary origin, RNA and protein expression, miRNA and protein interactions, and protein function and essentiality. Reference
Cell-type-specific metabolic labeling, detection and identification of nascent proteomes in vivo
A big challenge in proteomics is the identification of cell-type-specific proteomes in vivo. This protocol describes how to label, purify and identify cell-type-specific proteomes in living mice.
To make this possible, we created a Cre-recombinase-inducible mouse line expressing a mutant methionyl-tRNA synthetase (L274G), which enables the labeling of nascent proteins with the non-canonical amino acid azidonorleucine (ANL). This amino acid can be conjugated to different affinity tags by click chemistry. After affinity purification (AP), the labeled proteins can be identified by tandem mass spectrometry (MS/MS). With this method, it is possible to identify cell-type-specific proteomes derived from living animals, which was not possible with any previously published method. Reference
Age-related remodelling of oesophageal epithelia by mutated cancer drivers
Clonal expansion in aged normal tissues has been implicated in the development of cancer. However, the chronology and risk dependence of the expansion are poorly understood.
Here we intensively sequence 682 micro-scale oesophageal samples and show, in physiologically normal oesophageal epithelia, the progressive age-related expansion of clones that carry mutations in driver genes (predominantly NOTCH1), which is substantially accelerated by alcohol consumption and by smoking. Driver-mutated clones emerge multifocally from early childhood and increase their number and size with ageing, and ultimately replace almost the entire oesophageal epithelium in the extremely elderly. Reference
Circadian oscillations of cytosine modification in humans contribute to epigenetic variability, aging, and complex disease
Maintenance of physiological circadian rhythm plays a crucial role in human health. Numerous studies have shown that disruption of circadian rhythm may increase risk for malignant, psychiatric, metabolic, and other diseases.
Extending our recent findings of oscillating cytosine modifications (osc-modCs) in mice, in this study, we show that osc-modCs are also prevalent in human neutrophils. Osc-modCs may play a role in gene regulation, can explain parts of intra- and inter-individual epigenetic variation, and are signatures of aging. Finally, we show that osc-modCs are linked to three complex diseases and provide a new interpretation of cross-sectional epigenome-wide association studies. Reference
Integration of DNA methylation patterns and genetic variation in human pediatric tissues
The widespread use of accessible peripheral tissues for epigenetic analyses has prompted increasing interest in the study of tissue-specific DNA methylation (DNAm) variation in human populations.
To date, characterizations of inter-individual DNAm variability and DNAm concordance across tissues have been largely performed in adult tissues and therefore are limited in their relevance to DNAm profiles from pediatric samples. Buccal epithelial cells (BECs) had greater inter-individual DNAm variability than peripheral blood mononuclear cells (PBMCs), and highly variable CpGs were more likely to be positively correlated between the matched tissues than less variable CpGs. These sites were enriched for CpGs under genetic influence, suggesting that a substantial proportion of DNAm covariation between tissues can be attributed to genetic variation. Finally, we demonstrated the relevance of our findings to human epigenetic studies by categorizing CpGs from published DNAm association studies of pediatric BECs and peripheral blood. Reference
FKL-Spa-LapRLS: an accurate method for identifying human microRNA-disease association
MicroRNAs (miRNAs), which regulate gene expression post-transcriptionally, are closely related to various complex human diseases. Traditional verification methods for miRNA-disease associations are time-consuming and expensive, so it is especially important to design computational methods for detecting potential associations.
To overcome the limitations of previous computational methods for predicting potential miRNA-disease associations, we develop the FKL-Spa-LapRLS model (Fast Kernel Learning Sparse kernel Laplacian Regularized Least Squares). Reference
Genome-wide analyses supported by RNA-Seq reveal non-canonical splice sites in plant genomes
Most eukaryotic genes comprise exons and introns, thus requiring the precise removal of introns from pre-mRNAs to enable protein biosynthesis. U2 and U12 spliceosomes catalyze this step by recognizing motifs on the transcript in order to remove the introns. This process depends on the precise definition of exon-intron borders by splice sites, which are consequently highly conserved across species.
Only very few combinations of terminal dinucleotides are frequently observed at intron ends, dominated by the canonical GT-AG splice sites on the DNA level. Here we investigate the occurrence of diverse combinations of dinucleotides at predicted splice sites. Analyzing 121 plant genome sequences based on their annotation revealed strong splice site conservation across species, annotation errors, and true biological divergence from canonical splice sites. Reference
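As a rough illustration of what counting terminal dinucleotide combinations at intron ends can look like, the sketch below tallies donor/acceptor pairs over a handful of intron sequences. The function name `splice_site_pairs` and the toy sequences are hypothetical and are not taken from the study; a real analysis would extract introns from genome annotations.

```python
from collections import Counter

def splice_site_pairs(introns):
    """Count (donor, acceptor) terminal dinucleotide pairs across intron sequences."""
    counts = Counter()
    for seq in introns:
        seq = seq.upper()
        if len(seq) < 4:
            continue  # too short to carry separate donor and acceptor dinucleotides
        counts[(seq[:2], seq[-2:])] += 1
    return counts

# Hypothetical toy introns (5'->3', DNA level), invented for this example.
toy_introns = ["GTAAGTCTTTCAG", "GTATGTTTTGCAG", "GCAAGTATTTTAG", "ATATCCTTTTCAC"]
pair_counts = splice_site_pairs(toy_introns)

canonical = pair_counts[("GT", "AG")]               # canonical GT-AG introns
non_canonical = sum(pair_counts.values()) - canonical
```

Keeping the donor and acceptor as a pair, rather than counting each end separately, makes it possible to distinguish combinations such as GC-AG from GT-AG.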
Time-resolved mapping of genetic interactions to model rewiring of signaling pathways
Context-dependent changes in genetic interactions are an important feature of cellular pathways and their varying responses under different environmental conditions. However, methodological frameworks to investigate the plasticity of genetic interaction networks over time or in response to external stresses are largely lacking.
To analyze the plasticity of genetic interactions, we performed a combinatorial RNAi screen in Drosophila cells at multiple time points and after pharmacological inhibition of Ras signaling activity. Using an image-based morphology assay to capture a broad range of phenotypes, we assessed the effect of 12,768 pairwise RNAi perturbations in six different conditions. We found that genetic interactions form in different trajectories and developed an algorithm, termed MODIFI, to analyze how genetic interactions rewire over time. Using this framework, we identified more statistically significant interactions compared to end-point assays and further observed several examples of context-dependent crosstalk between signaling pathways, such as an interaction between Ras and Rel that depends on MEK activity. Reference
Genome-wide quantification of the effects of DNA methylation on human gene regulation
Changes in DNA methylation are involved in development, disease, and the response to environmental conditions. However, not all regulatory elements are functionally methylation-dependent (MD).
Here, we report a method, mSTARR-seq, that assesses the causal effects of DNA methylation on regulatory activity at hundreds of thousands of fragments (millions of CpG sites) simultaneously. Using mSTARR-seq, we identify thousands of MD regulatory elements in the human genome. MD activity is partially predictable using sequence and chromatin state information, and distinct transcription factors are associated with higher activity in unmethylated versus methylated DNA. Further, pioneer TFs linked to higher activity in the methylated state appear to drive demethylation of experimentally methylated sites. Reference
RNA G-quadruplexes at upstream open reading frames cause DHX36- and DHX9-dependent translation of human mRNAs
RNA secondary structures in the 5′-untranslated regions (5′-UTR) of mRNAs are key to the post-transcriptional regulation of gene expression.
While it is evident that non-canonical Hoogsteen-paired G-quadruplex (rG4) structures somehow contribute to the regulation of translation initiation, the nature and extent of human mRNAs that are regulated by rG4s are not known. Here, we provide new insights into a mechanism by which rG4 formation modulates translation. Reference
A test metric for assessing single-cell RNA-seq batch correction
Single-cell transcriptomics is a versatile tool for exploring heterogeneous cell populations, but as with all genomics experiments, batch effects can hamper data integration and interpretation.
The success of batch-effect correction is often evaluated by visual inspection of low-dimensional embeddings, which are inherently imprecise. Here we present a user-friendly, robust and sensitive k-nearest-neighbor batch-effect test (kBET) for quantification of batch effects. We used kBET to assess commonly used batch-regression and normalization approaches, and to quantify the extent to which they remove batch effects while preserving biological variability. Reference
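The idea of a k-nearest-neighbor batch-effect test can be sketched as follows: for each cell, compare the batch composition of its k nearest neighbors with the global batch composition and record how often the two differ significantly. This is a simplified two-batch illustration that assumes a Pearson chi-squared test with one degree of freedom; it is not the kBET software, and the function name, parameters, and toy data are hypothetical.

```python
import math
import random

def knn_batch_rejection_rate(points, batches, k=10, alpha=0.05):
    """Fraction of cells whose k-nearest-neighbor batch mix differs from the global mix.

    Simplified sketch for exactly two batches (chi-squared test, 1 degree of freedom).
    """
    labels = sorted(set(batches))
    assert len(labels) == 2, "sketch assumes exactly two batches"
    n = len(points)
    global_frac = [batches.count(lab) / n for lab in labels]
    expected = [f * k for f in global_frac]
    rejected = 0
    for i, p in enumerate(points):
        # Brute-force neighbors by Euclidean distance, excluding the cell itself.
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: math.dist(p, points[j]))[:k]
        observed = [sum(batches[j] == lab for j in neighbors) for lab in labels]
        stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        p_value = math.erfc(math.sqrt(stat / 2))  # chi-squared survival function, df = 1
        if p_value < alpha:
            rejected += 1
    return rejected / n

# Hypothetical toy embedding: the same cells with and without a large batch shift.
rng = random.Random(0)
mixed = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
batch_labels = ["a", "b"] * 50
separated = [(x + (100 if lab == "b" else 0), y)
             for (x, y), lab in zip(mixed, batch_labels)]

rate_mixed = knn_batch_rejection_rate(mixed, batch_labels)
rate_separated = knn_batch_rejection_rate(separated, batch_labels)
```

A low rejection rate suggests that batches are well mixed locally, while a rate near 1 indicates that neighborhoods are dominated by a single batch.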
Human genome-wide measurement of drug-responsive regulatory activity
Environmental stimuli commonly act via changes in gene regulation. Human-genome-scale assays to measure such responses are indirect or require knowledge of the transcription factors (TFs) involved.
Here, we present the use of human genome-wide high-throughput reporter assays to measure environmentally-responsive regulatory element activity. We focus on responses to glucocorticoids (GCs), an important class of pharmaceuticals and a paradigmatic genomic response model. We assay GC-responsive regulatory activity across >10^8 unique DNA fragments, covering the human genome at >50×. Those assays directly detected thousands of GC-responsive regulatory elements genome-wide. Reference
Genomic analysis identifies frequent deletions of Dystrophin in olfactory neuroblastoma
Olfactory neuroblastoma (ONB) is a rare malignant neoplasm arising in the upper portion of the sinonasal cavity. To better understand the genetic bases for ONB,
here we perform whole exome and whole genome sequencing as well as single nucleotide polymorphism array analyses in a series of ONB patient samples. Deletions involving the dystrophin (DMD) locus are found in 12 of 14 (86%) tumors. Interestingly, one of the remaining tumors has a deletion in LAMA2, bringing the number of ONBs with deletions of genes involved in the development of muscular dystrophies to 13 of 14 (93%). Reference
Meta-analysis of Immunochip data of four autoimmune diseases
In recent years, research has consistently proven the occurrence of genetic overlap across autoimmune diseases, which supports the existence of common pathogenic mechanisms in autoimmunity. The objective of this study was to further investigate this shared genetic component.
For this purpose, we performed a cross-disease meta-analysis of Immunochip data from 37,159 patients diagnosed with a seropositive autoimmune disease (11,489 celiac disease (CeD), 15,523 rheumatoid arthritis (RA), 3477 systemic sclerosis (SSc), and 6670 type 1 diabetes (T1D)) and 22,308 healthy controls of European origin using the R package ASSET. We identified 38 risk variants shared by at least two of the conditions analyzed, five of which represent new pleiotropic loci in autoimmunity. Reference
An integrative approach for building personalized gene regulatory networks for precision medicine
Only a small fraction of patients respond to the drug prescribed to treat their disease, which means that most are at risk of unnecessary exposure to side effects through ineffective drugs.
This inter-individual variation in drug response is driven by differences in gene interactions caused by each patient’s genetic background, environmental exposures, and the proportions of specific cell types involved in disease. These gene interactions can now be captured by building gene regulatory networks, by taking advantage of RNA velocity (the time derivative of the gene expression state), the ability to study hundreds of thousands of cells simultaneously, and the falling price of single-cell sequencing. Here, we propose an integrative approach that leverages these recent advances in single-cell data with the sensitivity of bulk data to enable the reconstruction of personalized, cell-type- and context-specific gene regulatory networks. Reference
Predicting age from the transcriptome of human dermal fibroblasts
Biomarkers of aging can be used to assess the health of individuals and to study aging and age-related diseases. We generate a large dataset of genome-wide RNA-seq profiles of human dermal fibroblasts from 133 people aged 1 to 94 years old to test whether signatures of aging are encoded within the transcriptome.
We develop an ensemble machine learning method that predicts age to a median error of 4 years, outperforming previous methods used to predict age. The ensemble was further validated by testing it on ten progeria patients, and our method is the only one that predicts accelerated aging in these patients. Reference
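To make the reported accuracy figure concrete, the sketch below computes a median error between true and predicted ages. It assumes that "median error" means the median absolute difference in years, which the abstract does not spell out; the ages and the function name are hypothetical and serve only as an illustration.

```python
from statistics import median

def median_absolute_error(true_ages, predicted_ages):
    """Median of the absolute differences between true and predicted ages (years)."""
    return median(abs(t - p) for t, p in zip(true_ages, predicted_ages))

# Hypothetical example values, not data from the study.
true_ages = [1, 23, 45, 67, 94]
predicted_ages = [3, 20, 49, 60, 90]

error_years = median_absolute_error(true_ages, predicted_ages)
```

The median is less sensitive to a few badly predicted individuals than the mean, which is one reason it is often reported for age predictors.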
Local mutational diversity drives intratumoral immune heterogeneity in non-small cell lung cancer
Combining whole exome sequencing, transcriptome profiling, and T cell repertoire analysis, we investigate the spatial features of surgically-removed biopsies from multiple loci in tumor masses of 15 patients with non-small cell lung cancer (NSCLC).
This revealed that the immune microenvironment has high spatial heterogeneity such that intratumoral regional variation is as large as inter-personal variation. While the local total mutational burden (TMB) is associated with local T-cell clonal expansion, local anti-tumor cytotoxicity does not directly correlate with neoantigen abundance. Reference
FORGe: prioritizing variants for graph genomes
There is growing interest in using genetic variants to augment the reference genome into a graph genome, with alternative sequences, to improve read alignment accuracy and reduce allelic bias.
While adding a variant has the positive effect of removing an undesirable alignment score penalty, it also increases both the ambiguity of the reference genome and the cost of storing and querying the genome index. We introduce methods and a software tool called FORGe for modeling these effects and prioritizing variants accordingly. We show that FORGe enables a range of advantageous and measurable trade-offs between accuracy and computational overhead. Reference
Pan-cancer analysis of transcriptional metabolic dysregulation using The Cancer Genome Atlas
Understanding metabolic dysregulation in different disease settings is vital for the safe and effective incorporation of metabolism-targeted therapeutics in the clinic.
Here, using transcriptomic data for 10,704 tumor and normal samples from The Cancer Genome Atlas, across 26 disease sites, we present a novel bioinformatics pipeline that distinguishes tumor from normal tissues, based on differential gene expression for 114 metabolic pathways. We confirm pathway dysregulation in separate patient populations, demonstrating the robustness of our approach. Bootstrapping simulations were then applied to assess the biological significance of these alterations. We provide distinct examples of the types of analysis that can be accomplished with this tool to understand cancer specific metabolic dysregulation, highlighting novel pathways of interest, and patterns of metabolic flux, in both common and rare disease sites. Reference
Drug and disease signature integration identifies synergistic combinations in glioblastoma
Glioblastoma (GBM) is the most common primary adult brain tumor. Despite extensive efforts, the median survival for GBM patients is approximately 14 months. GBM therapy could benefit greatly from patient-specific targeted therapies that maximize treatment efficacy.
Here we report a platform termed SynergySeq to identify drug combinations for the treatment of GBM by integrating information from The Cancer Genome Atlas (TCGA) and the Library of Integrated Network-Based Cellular Signatures (LINCS). We identify differentially expressed genes in GBM samples and devise a consensus gene expression signature for each compound using LINCS L1000 transcriptional profiling data. Reference
A comprehensive pipeline for translational top-down proteomics from a single blood draw
Top-down proteomics (TDP) by mass spectrometry (MS) is a technique by which intact proteins are analyzed. It has become increasingly popular in translational research because of the value of characterizing distinct proteoforms of intact proteins.
Compared to bottom-up proteomics (BUP) strategies, which measure digested peptide mixtures, TDP provides highly specific molecular information that avoids the bioinformatic challenge of protein inference. However, the technique has been difficult to implement widely because of inherent limitations of existing sample preparation methods and instrumentation. Reference