Unsupervised clustering and epigenetic classification of single cells
Characterizing epigenetic heterogeneity at the cellular level is a critical problem in the modern genomics era. Assays such as single cell ATAC-seq (scATAC-seq) offer an opportunity to interrogate cellular level epigenetic heterogeneity through patterns of variability in open chromatin.
However, these assays exhibit technical variability that complicates clear classification and cell type identification in heterogeneous populations. We present scABC, an R package for the unsupervised clustering of single-cell epigenetic data, to classify scATAC-seq data and discover regions of open chromatin specific to cell identity. Reference
GeNets: a unified web platform for network-based genomic analyses
Functional genomics networks are widely used to identify unexpected pathway relationships in large genomic datasets.
However, it is challenging to compare the signal-to-noise ratios of different networks and to identify the optimal network with which to interpret a particular genetic dataset. We present GeNets, a platform in which users can train a machine-learning model (Quack) to carry out these comparisons and execute, store, and share analyses of genetic and RNA-sequencing datasets. Reference
A modular transcriptional signature identifies phenotypic heterogeneity of human tuberculosis infection
Whole blood transcriptional signatures distinguishing active tuberculosis patients from asymptomatic latently infected individuals exist. Consensus has not been achieved regarding the optimal reduced gene sets as diagnostic biomarkers that also achieve discrimination from other diseases.
Here we show a blood transcriptional signature of active tuberculosis using RNA-Seq, confirming microarray results, that discriminates active tuberculosis from latently infected and healthy individuals, validating this signature in an independent cohort. Using an advanced modular approach, we utilise the information from the entire transcriptome, which includes overabundance of type I interferon-inducible genes and underabundance of IFNG and TBX21, to develop a signature that discriminates active tuberculosis patients from latently infected individuals or those with acute viral and bacterial infections. Reference
Classification of red blood cell shapes in flow using outlier tolerant machine learning
The manual evaluation, classification and counting of biological objects demands an enormous expenditure of time, and subjective human input may be a source of error. Investigating the shape of red blood cells (RBCs) in microcapillary Poiseuille flow, we overcome this drawback by introducing a convolutional neural regression network for automatic, outlier-tolerant shape classification.
From our experiments we expect two stable geometries, the so-called ‘slipper’ and ‘croissant’ shapes, depending on the prevailing flow conditions and the cell-intrinsic parameters. Whereas croissants mostly occur at low shear rates, slippers evolve at higher flow velocities. With our method, we are able to find the transition point between the two ‘phases’ of stable shapes, which is of high interest to ensuing theoretical studies and numerical simulations. Using statistically based thresholds, we obtain so-called phase diagrams from our data, which are compared to manual evaluations. Reference
KLRD1-expressing natural killer cells predict influenza susceptibility
Influenza infects tens of millions of people every year in the USA. Other than notable risk groups, such as children and the elderly, it is difficult to predict what subpopulations are at higher risk of infection.
Viral challenge studies, where healthy human volunteers are inoculated with live influenza virus, provide a unique opportunity to study infection susceptibility. Biomarkers predicting influenza susceptibility would be useful for identifying risk groups and designing vaccines. We applied cell mixture deconvolution to estimate immune cell proportions from whole blood transcriptome data in four independent influenza challenge studies. Reference
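The idea behind cell mixture deconvolution can be sketched as follows: a bulk expression profile is modelled as a weighted sum of cell-type signature profiles, and the weights (cell fractions) are estimated under a non-negativity constraint. This is a minimal sketch under stated assumptions, not the method used in the study: the function name `deconvolve`, the two cell types, and all expression values are hypothetical, and projected gradient descent stands in for whatever solver a real deconvolution tool would use.

```python
def deconvolve(signature, mixture, steps=2000, learning_rate=0.05):
    """Estimate non-negative cell-type fractions f so that signature * f ~ mixture.

    signature: list of rows, one per gene, each holding one value per cell type.
    mixture:   list holding one bulk expression value per gene.
    The step size must be small relative to the scale of the signature values.
    """
    n_types = len(signature[0])
    fractions = [1.0 / n_types] * n_types
    for _ in range(steps):
        # Residual of the current fit for every gene.
        residuals = [
            sum(s * f for s, f in zip(row, fractions)) - m
            for row, m in zip(signature, mixture)
        ]
        # Gradient of the squared error with respect to each fraction.
        gradient = [
            2.0 * sum(row[k] * r for row, r in zip(signature, residuals))
            for k in range(n_types)
        ]
        # Gradient step, then projection back onto non-negative values.
        fractions = [
            max(0.0, f - learning_rate * g) for f, g in zip(fractions, gradient)
        ]
    total = sum(fractions)
    # Rescale so the fractions sum to one (skipped if every fraction is zero).
    return [f / total for f in fractions] if total > 0 else fractions


# Hypothetical signature matrix: 3 genes (rows) x 2 cell types (columns).
signature = [
    [1.0, 0.1],
    [0.2, 1.0],
    [0.5, 0.5],
]
# Bulk profile simulated from fractions 0.7 and 0.3 of the two cell types.
true_fractions = [0.7, 0.3]
mixture = [sum(s * f for s, f in zip(row, true_fractions)) for row in signature]

estimated = deconvolve(signature, mixture)
```

On this noise-free toy input the estimate recovers the simulated fractions; with real whole-blood data the signature matrix would come from purified cell-type profiles and the fit would be approximate.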
Genome-wide analysis of long non-coding RNAs affecting roots development at an early stage in the rice response to cadmium stress
Long non-coding RNAs (lncRNAs) play a vital role in several gene regulatory networks involved in stress-related biological processes in plants. However, lncRNAs expressed in rice under cadmium (Cd) stress have seldom been studied systematically.
Here, we characterize lncRNAs and their expression in rice root development at an early stage of the response to Cd stress. Deep sequencing of lncRNAs revealed differentially expressed lncRNAs between Cd stress and normal conditions. In the Cd stress group, 69 lncRNAs were up-regulated and 75 lncRNAs were down-regulated. Reference
Whole-genome resequencing reveals world-wide ancestry and adaptive introgression events of domesticated cattle in East Asia
Cattle domestication and the complex histories of East Asian cattle breeds warrant further investigation. By analysing the genomes of 49 modern breeds and eight East Asian ancient samples, we consistently classify worldwide cattle into five continental groups based on Y-chromosome haplotypes and autosomal variants.
We find that East Asian cattle populations are mainly composed of three distinct ancestries, including an earlier East Asian taurine ancestry that reached China at least ~3.9 kya, a later introduced Eurasian taurine ancestry, and a novel Chinese indicine ancestry that diverged from Indian indicine approximately 36.6–49.6 kya. Reference
Diversification and independent domestication of Asian and European pears
Pear (Pyrus) is a globally grown fruit, with thousands of cultivars in five domesticated species and dozens of wild species. However, little is known about the evolutionary history of these pear species and what has contributed to the distinct phenotypic traits between Asian pears and European pears.
We report the genome resequencing of 113 pear accessions from worldwide collections, representing both cultivated and wild pear species. Based on 18,302,883 identified SNPs, we conduct phylogenetics, population structure, gene flow, and selective sweep analyses. Furthermore, we propose a model for the divergence, dissemination, and independent domestication of Asian and European pears in which pear, after originating in southwest China and then being disseminated throughout central Asia, has eventually spread to western Asia, and then on to Europe. Reference
Association analyses of more than 140,000 men identify 63 new prostate cancer susceptibility loci
Genome-wide association studies (GWAS) and fine-mapping efforts to date have identified more than 100 prostate cancer (PrCa)-susceptibility loci.
We meta-analyzed genotype data from a custom high-density array of 46,939 PrCa cases and 27,910 controls of European ancestry with previously genotyped data of 32,255 PrCa cases and 33,202 controls of European ancestry. Our analysis identified 62 novel loci associated (P < 5.0 × 10^−8) with PrCa and one locus significantly associated with early-onset PrCa (≤55 years). Our findings include missense variants rs1800057 (odds ratio (OR) = 1.16; P = 8.2 × 10^−9; G>C, p.Pro1054Arg) in ATM and rs2066827 (OR = 1.06; P = 2.3 × 10^−9; T>G, p.Val109Gly) in CDKN1B. The combination of all loci captured 28.4% of the PrCa familial relative risk, and a polygenic risk score conferred an elevated PrCa risk for men in the ninetieth to ninety-ninth percentiles (relative risk = 2.69; 95% confidence interval (CI): 2.55–2.82) and first percentile (relative risk = 5.71; 95% CI: 5.04–6.48) risk stratum compared with the population average. Reference
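As a rough illustration of how a polygenic risk score combines per-variant effects, the sketch below sums log odds ratios weighted by risk-allele counts. It is a minimal sketch under a simple additive (log-multiplicative) model, not the study's scoring pipeline: the two odds ratios are taken from the abstract above, while the function name `polygenic_risk_score` and the example genotypes are hypothetical.

```python
import math

# Odds ratios reported in the abstract for two missense variants.
ODDS_RATIOS = {
    "rs1800057": 1.16,  # ATM
    "rs2066827": 1.06,  # CDKN1B
}


def polygenic_risk_score(genotype, odds_ratios=ODDS_RATIOS):
    """Sum of log(OR) * risk-allele count (0, 1 or 2) over the scored variants.

    Variants missing from `genotype` are treated as carrying zero risk
    alleles, which is an assumption made for this illustration.
    """
    score = 0.0
    for variant, odds_ratio in odds_ratios.items():
        count = genotype.get(variant, 0)
        if count not in (0, 1, 2):
            raise ValueError(f"allele count for {variant} must be 0, 1 or 2")
        score += count * math.log(odds_ratio)
    return score


# Hypothetical individuals: risk-allele counts per variant.
carrier = {"rs1800057": 1, "rs2066827": 2}
non_carrier = {"rs1800057": 0, "rs2066827": 0}

carrier_score = polygenic_risk_score(carrier)
non_carrier_score = polygenic_risk_score(non_carrier)
```

Under this model, `math.exp(carrier_score)` is the combined odds ratio of the carrier relative to someone with no risk alleles at the scored variants. A real score would include all associated loci and be calibrated against a reference population.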
Tensorial blind source separation for improved analysis of multi-omic data
There is an increased need for integrative analyses of multi-omic data. We present and benchmark a novel tensorial independent component analysis (tICA) algorithm against current state-of-the-art methods.
We find that tICA outperforms competing methods in identifying biological sources of data variation at a reduced computational cost. On epigenetic data, tICA can identify methylation quantitative trait loci at high sensitivity. In the cancer context, tICA identifies gene modules whose expression variation across tumours is driven by copy-number or DNA methylation changes, but whose deregulation relative to normal tissue is independent of such alterations, a result we validate by direct analysis of individual data types. Reference
SINC-seq: correlation of transient gene expressions between nucleus and cytoplasm reflects single-cell physiology
We report a microfluidic system that physically separates nuclear RNA (nucRNA) and cytoplasmic RNA (cytRNA) from a single cell and enables single-cell integrated nucRNA and cytRNA-sequencing (SINC-seq).
SINC-seq constructs two individual RNA-seq libraries, nucRNA and cytRNA, per cell, quantifies gene expression in the subcellular compartments, and combines them to create novel single-cell RNA-seq data. Leveraging SINC-seq, we discover distinct natures of correlation among cytRNA and nucRNA that reflect the transient physiological state of single cells. These data provide unique insights into the regulatory network of messenger RNA from the nucleus toward the cytoplasm at the single-cell level. Reference
A computational tool to detect DNA alterations tailored to formalin-fixed paraffin-embedded samples in cancer clinical sequencing
Advanced cancer genomics technologies are now being employed in clinical sequencing, where next-generation sequencers are used to simultaneously identify multiple types of DNA alterations for prescription of molecularly targeted drugs.
However, no computational tool is available to accurately detect DNA alterations in formalin-fixed paraffin-embedded (FFPE) samples commonly used in hospitals. Here, we developed a computational tool tailored to the detection of single nucleotide variations, indels, fusions, and copy number alterations in FFPE samples. Elaborated multilayer noise filters reduced the inherent noise while maintaining high sensitivity, as evaluated in tumor-unmatched normal samples using orthogonal technologies. Reference
Identification of Drivers of Aneuploidy in Breast Tumors
Although aneuploidy is found in the majority of tumors, the degree of aneuploidy varies widely. It is unclear how cancer cells become aneuploid or how highly aneuploid tumors are different from those of more normal ploidy.
We developed a simple computational method that measures the degree of aneuploidy or structural rearrangements of large chromosome regions of 522 human breast tumors from The Cancer Genome Atlas (TCGA). Highly aneuploid tumors overexpress activators of mitotic transcription and the genes encoding proteins that segregate chromosomes. Overexpression of three mitotic transcriptional regulators, E2F1, MYBL2, and FOXM1, is sufficient to increase the rate of lagging anaphase chromosomes in a non-transformed vertebrate tissue, demonstrating that this event can initiate aneuploidy. Highly aneuploid human breast tumors are also enriched in TP53 mutations. TP53 mutations co-associate with the overexpression of mitotic transcriptional activators, suggesting that these events work together to provide fitness to breast tumors. Reference
MICMIC: identification of DNA methylation of distal regulatory regions with causal effects on tumorigenesis
Aberrant promoter methylation is a common mechanism for tumor suppressor inactivation in cancer. We develop MICMIC, a set of tools to identify genome-wide DNA methylation in distal regions with causal effects on tumorigenesis.
Many predictions are directly validated by dCas9-based epigenetic editing to support the accuracy and efficiency of our tool. Oncogenic and lineage-specific transcription factors are shown to aberrantly shape the methylation landscape by modifying tumor-subtype core regulatory circuitry. Notably, the gene regulatory networks orchestrated by enhancer methylation across different cancer types are seen to converge on a common architecture. Reference
Comprehensive comparative analysis of 5′-end RNA-sequencing methods
Specialized RNA-seq methods are required to identify the 5′ ends of transcripts, which are critical for studies of gene regulation, but these methods have not been systematically benchmarked.
We directly compared six such methods, including the performance of five methods on a single human cellular RNA sample and a new spike-in RNA assay that helps circumvent challenges resulting from uncertainties in annotation and RNA processing. We found that the ‘cap analysis of gene expression’ (CAGE) method performed best for mRNA and that most of its unannotated peaks were supported by evidence from other genomic methods. We applied CAGE to eight brain-related samples and determined sample-specific transcription start site (TSS) usage, as well as a transcriptome-wide shift in TSS usage between fetal and adult brain. Reference
Four evolutionary trajectories underlie genetic intratumoral variation in childhood cancer
A major challenge to personalized oncology is that driver mutations vary among cancer cells inhabiting the same tumor. Whether this reflects principally disparate patterns of Darwinian evolution in different tumor regions has remained unexplored.
We mapped the prevalence of genetically distinct clones over 250 regions in 54 childhood cancers. This showed that primary tumors can simultaneously follow up to four evolutionary trajectories over different anatomic areas. The most common pattern consists of subclones with very few mutations confined to a single tumor region. The second most common is a stable coexistence, over vast areas, of clones characterized by changes in chromosome numbers. Reference
Comparative genomics reveals phylogenetic distribution patterns of secondary metabolites in Amycolatopsis species
Genome mining tools have enabled us to predict biosynthetic gene clusters that might encode compounds with valuable functions for industrial and medical applications. With the continuously increasing number of genomes sequenced, we are confronted with an overwhelming number of predicted clusters.
Here, we provide a comprehensive analysis of the model actinobacterial genus Amycolatopsis and its potential for the production of secondary metabolites. A phylogenetic characterization, together with a pan-genome analysis showed that within this highly diverse genus, four major lineages could be distinguished which differed in their potential to produce secondary metabolites. Reference
Vex-seq: high-throughput identification of the impact of genetic variation on pre-mRNA splicing efficiency
Understanding the functional impact of genomic variants is a major goal of modern genetics and personalized medicine. Although many synonymous and non-coding variants act through altering the efficiency of pre-mRNA splicing, it is difficult to predict how these variants impact pre-mRNA splicing.
Here, we describe a massively parallel approach we use to test the impact on pre-mRNA splicing of 2059 human genetic variants spanning 110 alternative exons. This method, called variant exon sequencing (Vex-seq), yields data that reinforce known mechanisms of pre-mRNA splicing, identifies variants that impact pre-mRNA splicing, and will be useful for increasing our understanding of genome function. Reference
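One common way to quantify a variant's effect on splicing is the change in percent spliced in (PSI) of an exon between the variant and the reference allele. The abstract does not name the metric used, so the sketch below is an assumption for illustration only; the function names and read counts are hypothetical.

```python
def percent_spliced_in(inclusion_reads, skipping_reads):
    """Percentage of reads supporting exon inclusion among all informative reads."""
    total = inclusion_reads + skipping_reads
    if total == 0:
        raise ValueError("no reads support either isoform")
    return 100.0 * inclusion_reads / total


def delta_psi(variant_counts, reference_counts):
    """PSI of the variant allele minus PSI of the reference allele.

    Each argument is an (inclusion_reads, skipping_reads) pair.
    """
    return percent_spliced_in(*variant_counts) - percent_spliced_in(*reference_counts)


# Hypothetical read counts for one exon: (inclusion, skipping).
reference = (80, 20)
variant = (30, 70)

change = delta_psi(variant, reference)
```

A negative value means the variant reduces exon inclusion relative to the reference; in practice such estimates would be computed per variant across replicates before calling an effect.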
PanDrugs: a novel method to prioritize anticancer drug treatments according to individual genomic data
Large-scale cancer genome sequencing projects have shown that tumors have thousands of molecular alterations and that their frequency is highly heterogeneous. In such scenarios, physicians and oncologists routinely face lists of cancer genomic alterations in which only a minority are relevant biomarkers to drive clinical decision-making.
We present PanDrugs, a new computational methodology to guide the selection of personalized treatments in cancer patients using the variant lists provided by genome-wide sequencing analyses. PanDrugs offers the largest database of drug-target associations available, from well-known targeted therapies to preclinical drugs. By scoring data-driven gene cancer relevance and drug feasibility, PanDrugs interprets genomic alterations and provides a prioritized evidence-based list of anticancer therapies. Reference
An optimized library for reference-based deconvolution of whole-blood biospecimens assayed
Genome-wide methylation arrays are powerful tools for assessing cell composition of complex mixtures. We compare three approaches to select reference libraries for deconvoluting neutrophil, monocyte, B-lymphocyte, natural killer, and CD4+ and CD8+ T-cell fractions based on blood-derived DNA methylation signatures assayed using the Illumina HumanMethylationEPIC array.
The IDOL algorithm identifies a library of 450 CpGs, resulting in an average R² = 99.2 across cell types when applied to EPIC methylation data collected on artificial mixtures constructed from the above cell types. Of the 450 CpGs, 69% are unique to EPIC. This library has the potential to reduce unintended technical differences across array platforms. Reference
The fecal metabolome as a functional readout of the gut microbiome
The human gut microbiome plays a key role in human health [1], but 16S characterization lacks quantitative functional annotation [2]. The fecal metabolome provides a functional readout of microbial activity and can be used as an intermediate phenotype mediating host–microbiome interactions [3].
In this comprehensive description of the fecal metabolome, examining 1,116 metabolites from 786 individuals from a population-based twin study (TwinsUK), the fecal metabolome was found to be only modestly influenced by host genetics (heritability (H²) = 17.9%). One replicated locus at the NAT2 gene was associated with fecal metabolic traits. The fecal metabolome largely reflects gut microbial composition, explaining on average 67.7% (±18.8%) of its variance. It is strongly associated with visceral-fat mass, thereby illustrating potential mechanisms underlying the well-established microbial influence on abdominal obesity. Reference
Analysis of the androgen receptor–regulated lncRNA landscape identifies a role for ARLNC1 in prostate cancer progression
The androgen receptor (AR) plays a critical role in the development of the normal prostate as well as prostate cancer.
Using an integrative transcriptomic analysis of prostate cancer cell lines and tissues, we identified ARLNC1 (AR-regulated long noncoding RNA 1) as an important long noncoding RNA that is strongly associated with AR signaling in prostate cancer progression. Not only was ARLNC1 induced by the AR protein, but ARLNC1 stabilized the AR transcript via RNA–RNA interaction. ARLNC1 knockdown suppressed AR expression, global AR signaling and prostate cancer growth in vitro and in vivo. Reference
The genomic landscape of TERT promoter wildtype-IDH wildtype glioblastoma
The majority of glioblastomas can be classified into molecular subgroups based on mutations in the TERT promoter (TERTp) and isocitrate dehydrogenase 1 or 2 (IDH). These molecular subgroups utilize distinct genetic mechanisms of telomere maintenance, either TERTp mutation leading to telomerase activation or ATRX-mutation leading to an alternative lengthening of telomeres phenotype (ALT).
However, about 20% of glioblastomas lack alterations in TERTp and IDH. These tumors, designated TERTpWT-IDHWT glioblastomas, do not have well-established genetic biomarkers or defined mechanisms of telomere maintenance. Here we report the genetic landscape of TERTpWT-IDHWT glioblastoma and identify SMARCAL1 inactivating mutations as a novel genetic mechanism of ALT. Furthermore, we identify a novel mechanism of telomerase activation in glioblastomas that occurs via chromosomal rearrangements upstream of TERT. Reference
Copy-number analysis and inference of subclonal populations in cancer genomes using Sclust
The genomes of cancer cells constantly change during pathogenesis. This evolutionary process can lead to the emergence of drug-resistant mutations in subclonal populations, which can hinder therapeutic intervention in patients. Data derived from massively parallel sequencing can be used to infer these subclonal populations using tumor-specific point mutations.
The accurate determination of copy-number changes and tumor impurity is necessary to reliably infer subclonal populations by mutational clustering. This protocol describes how to use Sclust, a copy-number analysis method with a recently developed mutational clustering approach. In a series of simulations and comparisons with alternative methods, we have previously shown that Sclust accurately determines copy-number states and subclonal populations. Reference
Integrated time course omics analysis distinguishes immediate therapeutic response from acquired resistance
Targeted therapies specifically act by blocking the activity of proteins that are encoded by genes critical for tumorigenesis. However, most cancers acquire resistance and long-term disease remission is rarely observed.
Understanding the time course of molecular changes responsible for the development of acquired resistance could enable optimization of patients’ treatment options. Clinically, acquired therapeutic resistance can only be studied at a single time point in resistant tumors. To determine the dynamics of these molecular changes, we obtained high throughput omics data (RNA-sequencing and DNA methylation) weekly during the development of cetuximab resistance in a head and neck cancer in vitro model. The CoGAPS unsupervised algorithm was used to determine the dynamics of the molecular changes associated with resistance during the time course of resistance development. Reference
Mapping the physical network of cellular interactions
A cell’s function is influenced by the environment, or niche, in which it resides. Studies of niches usually require assumptions about the cell types present, which impedes the discovery of new cell types or interactions.
Here we describe ProximID, an approach for building a cellular network based on physical cell interaction and single-cell mRNA sequencing, and show that it can be used to discover new preferential cellular interactions without prior knowledge of component cell types. ProximID found specific interactions between megakaryocytes and mature neutrophils and between plasma cells and myeloblasts and/or promyelocytes (precursors of neutrophils) in mouse bone marrow, and it identified a Tac1+ enteroendocrine cell–Lgr5+ stem cell interaction in small intestine crypts. Reference