NanoSatellite: accurate characterization of expanded tandem repeat length and sequence through whole genome long-read sequencing on PromethION
Technological limitations have hindered the large-scale genetic investigation of tandem repeats in disease. We show that long-read sequencing with a single Oxford Nanopore Technologies PromethION flow cell per individual achieves 30× human genome coverage and enables accurate assessment of tandem repeats including the 10,000-bp Alzheimer’s disease-associated ABCA7 VNTR.
The Guppy “flip-flop” base caller and tandem-genotypes tandem repeat caller are efficient for large-scale tandem repeat assessment, but base calling and alignment challenges persist. We present NanoSatellite, which analyzes tandem repeats directly on electric current data and improves calling of GC-rich tandem repeats, expanded alleles, and motif interruptions. Reference
A Bayesian mixture model for the analysis of allelic expression in single cells
Allele-specific expression (ASE) at single-cell resolution is a critical tool for understanding the stochastic and dynamic features of gene expression. However, low read coverage and high biological variability present challenges for analyzing ASE. We demonstrate that discarding multi-mapping reads leads to higher variability in estimates of allelic proportions and an increased frequency of sampling zeros, and can lead to spurious findings of dynamic and monoallelic gene expression.
Here, we report a method for ASE analysis from single-cell RNA-Seq data that accurately classifies allelic expression states and improves estimation of allelic proportions by pooling information across cells. We further demonstrate that combining information across cells using a hierarchical mixture model reduces sampling variability without sacrificing cell-to-cell heterogeneity. We applied our approach to re-evaluate the statistical independence of allelic bursting and track changes in the allele-specific expression patterns of cells sampled over a developmental time course. Reference
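The idea of pooling information across cells can be illustrated with a much simpler stand-in than the hierarchical mixture model described above. The sketch below is an assumption-laden illustration, not the authors' method: it uses a single beta-binomial component whose prior mean is the proportion pooled over all cells, and the function name `shrunken_proportions` and the `prior_strength` parameter are hypothetical.

```python
def shrunken_proportions(counts, prior_strength=4.0):
    """Shrink per-cell allelic proportions toward the pooled proportion.

    counts: list of (allele_a_reads, total_reads) tuples, one per cell.
    prior_strength: assumed pseudo-read count carried by the shared prior.
    """
    total_a = sum(a for a, _ in counts)
    total_n = sum(n for _, n in counts)
    pooled = total_a / total_n  # proportion pooled over all cells

    # Beta(alpha, beta) prior centred on the pooled proportion.
    alpha = prior_strength * pooled
    beta = prior_strength * (1.0 - pooled)

    # Posterior mean of a beta-binomial model for each cell.
    return [(a + alpha) / (n + alpha + beta) for a, n in counts]


# Toy read counts (illustrative only): two low-coverage and two
# high-coverage cells.
toy_counts = [(1, 1), (0, 2), (50, 100), (90, 100)]
estimates = shrunken_proportions(toy_counts)
```

In this toy example the two low-coverage cells, whose raw proportions would be exactly 1 and 0, are pulled strongly toward the pooled value, while the cell with 90 of 100 reads stays close to 0.9. That is the sense in which pooling reduces sampling variability while leaving well-supported cell-to-cell differences largely intact.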
FreeHi-C simulates high-fidelity Hi-C data for benchmarking and data augmentation
The ability to simulate high-throughput chromatin conformation (Hi-C) data is foundational for benchmarking Hi-C data analysis methods.
Here we present a nonparametric strategy named FreeHi-C to simulate Hi-C data from the interacting genome fragments. Data from FreeHi-C exhibit high fidelity to biological Hi-C data. FreeHi-C boosts the precision and power of differential chromatin interaction detection through data augmentation under preserved false discovery rate control. Reference
OrthoFinder: phylogenetic orthology inference for comparative genomics
Here, we present a major advance of the OrthoFinder method. This extends OrthoFinder’s high accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomics statistics.
Each output is benchmarked on appropriate real or simulated datasets, and where comparable methods exist, OrthoFinder is equivalent to or outperforms these methods. Furthermore, OrthoFinder is the most accurate ortholog inference method on the Quest for Orthologs benchmark test. Finally, OrthoFinder’s comprehensive phylogenetic analysis is achieved with equivalent speed and scalability to the fastest, score-based heuristic methods. Reference
Super-enhancer-guided mapping of regulatory networks controlling mouse trophoblast stem cells
Trophectoderm (TE) lineage development is pivotal for proper implantation, placentation, and healthy pregnancy. However, only a few TE-specific transcription factors (TFs) have been systematically characterized, hindering our understanding of the process. To elucidate regulatory mechanisms underlying TE development, here we map super-enhancers (SEs) in trophoblast stem cells (TSCs) as a model.
We find that both prominent TE-specific master TFs (Cdx2, Gata3, and Tead4) and >150 TFs that had not previously been implicated in the TE lineage are SE-associated. Mapping targets of 27 SE-predicted TFs reveals a highly intertwined transcriptional regulatory circuitry. Intriguingly, SE-predicted TFs show 4 distinct expression patterns with dynamic alterations of their targets during TSC differentiation. Reference
Global impact of somatic structural variation on the DNA methylome of human cancers
Genomic rearrangements exert a heavy influence on the molecular landscape of cancer. New analytical approaches integrating somatic structural variants (SSVs) with altered gene features represent a framework by which we can assign global significance to a core set of genes, analogous to established methods that identify genes non-randomly targeted by somatic mutation or copy number alteration.
While recent studies have defined broad patterns of association involving gene transcription and nearby SSV breakpoints, global alterations in DNA methylation in the context of SSVs remain largely unexplored. Reference
Mapping 123 million neonatal, infant and child deaths between 2000 and 2017
Since 2000, many countries have achieved considerable success in improving child survival, but localized progress remains unclear. To inform efforts towards United Nations Sustainable Development Goal 3.2—to end preventable child deaths by 2030—we need consistently estimated data at the subnational level regarding child mortality rates and trends.
Here we quantified, for the period 2000–2017, the subnational variation in mortality rates and number of deaths of neonates, infants and children under 5 years of age within 99 low- and middle-income countries using a geostatistical survival model. We estimated that 32% of children under 5 in these countries lived in districts that had attained rates of 25 or fewer child deaths per 1,000 live births by 2017, and that 58% of child deaths between 2000 and 2017 in these countries could have been averted in the absence of geographical inequality. Reference
Single-cell transcriptomics of human T cells reveals tissue and activation signatures in health and disease
Human T cells coordinate adaptive immunity in diverse anatomic compartments through production of cytokines and effector molecules, but it is unclear how tissue site influences T cell persistence and function.
Here, we use single cell RNA-sequencing (scRNA-seq) to define the heterogeneity of human T cells isolated from lungs, lymph nodes, bone marrow and blood, and their functional responses following stimulation. Through analysis of >50,000 resting and activated T cells, we reveal tissue T cell signatures in mucosal and lymphoid sites, and lineage-specific activation states across all sites including distinct effector states for CD8+ T cells and an interferon-response state for CD4+ T cells. Comparing scRNA-seq profiles of tumor-associated T cells to our dataset reveals predominant activated CD8+ compared to CD4+ T cell states within multiple tumor types. Reference
MetaCell: analysis of single-cell RNA-seq data using K-nn graph partitions
scRNA-seq profiles each represent a highly partial sample of mRNA molecules from a unique cell that can never be resampled, and robust analysis must separate the sampling effect from biological variance.
We describe a methodology for partitioning scRNA-seq datasets into metacells: disjoint and homogeneous groups of profiles that could have been resampled from the same cell. Unlike clustering analysis, our algorithm specializes in obtaining granular as opposed to maximal groups. We show how to use metacells as building blocks for complex quantitative transcriptional maps while avoiding data smoothing. Our algorithms are implemented in the MetaCell R/C++ software package. Reference
Genome-wide association mapping of date palm fruit traits
Date palms (Phoenix dactylifera) are an important fruit crop of arid regions of the Middle East and North Africa. Despite its importance, few genomic resources exist for date palms, hampering evolutionary genomic studies of this perennial species.
Here we report an improved long-read genome assembly for P. dactylifera that is 772.3 Mb in length, with contig N50 of 897.2 Kb, and use this to perform genome-wide association studies (GWAS) of the sex determining region and 21 fruit traits. We find a fruit color GWAS peak at the R2R3-MYB transcription factor VIRESCENS gene and identify functional alleles that include a retrotransposon insertion and start codon mutation. We also find a GWAS peak for sugar composition spanning deletion polymorphisms in multiple linked invertase genes. Reference
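The contig N50 quoted for the assembly is the length L such that contigs of length L or longer together cover at least half of the total assembly. A minimal sketch of that calculation follows; the helper name `n50` is hypothetical and the contig lengths are invented for illustration, not taken from the date palm assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths in bp (illustrative only); the total is 300 bp.
toy_contigs = [100, 80, 60, 40, 20]
toy_n50 = n50(toy_contigs)  # 80: the 100 bp and 80 bp contigs cover 180 of 300 bp
```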
Relating the gut metagenome and metatranscriptome to immunotherapy responses in melanoma patients
Recent evidence suggests that immunotherapy efficacy in melanoma is modulated by gut microbiota. Few studies have examined this phenomenon in humans, and none have incorporated metatranscriptomics, important for determining expression of metagenomic functions in the microbial community.
In melanoma patients undergoing immunotherapy, gut microbiome was characterized in pre-treatment stool using 16S rRNA gene and shotgun metagenome sequencing (n = 27). Transcriptional expression of metagenomic pathways was confirmed with metatranscriptome sequencing in a subset of 17. We examined associations of taxa and metagenomic pathways with progression-free survival (PFS) using 500 × 10-fold cross-validated elastic-net penalized Cox regression. Reference
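Reading "500 × 10-fold" as 500 repetitions of 10-fold cross-validation (an interpretation, since the abstract does not spell it out), the resampling scheme can be sketched as below. Only the splitting is shown; the elastic-net penalized Cox model itself is omitted, and the function name `repeated_kfold` is hypothetical.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=500, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for repeat in range(n_repeats):
        # Reshuffle the samples before each repetition.
        shuffled = indices[:]
        rng.shuffle(shuffled)
        for fold in range(n_folds):
            # Every n_folds-th shuffled sample forms the held-out fold.
            test_idx = shuffled[fold::n_folds]
            held_out = set(test_idx)
            train_idx = [i for i in shuffled if i not in held_out]
            yield repeat, fold, train_idx, test_idx


# 27 samples, matching the cohort size given above.
splits = list(repeated_kfold(n_samples=27))
```

With 27 samples each held-out fold contains only two or three patients, which is one reason to repeat the procedure many times and average over the resulting 5,000 model fits.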
A systematic evaluation of single cell RNA-seq analysis pipelines
The recent rapid spread of single cell RNA sequencing (scRNA-seq) methods has created a large variety of experimental and computational pipelines for which best practices have not yet been established. Here, we use simulations based on five scRNA-seq library protocols in combination with nine realistic differential expression (DE) setups to systematically evaluate three mapping, four imputation, seven normalisation and four differential expression testing approaches resulting in ~3000 pipelines, allowing us to also assess interactions among pipeline steps.
We find that choices of normalisation and library preparation protocols have the biggest impact on scRNA-seq analyses. Specifically, we find that library preparation determines the ability to detect symmetric expression differences, while normalisation dominates pipeline performance in asymmetric DE-setups. Finally, we illustrate the importance of informed choices by showing that a good scRNA-seq pipeline can have the same impact on detecting a biological signal as quadrupling the sample size. Reference
DC3 is a method for deconvolution and coupled clustering from bulk and single-cell genomics data
Characterizing and interpreting heterogeneous mixtures at the cellular level is a critical problem in genomics. Single-cell assays offer an opportunity to resolve cellular level heterogeneity, e.g., scRNA-seq enables single-cell expression profiling, and scATAC-seq identifies active regulatory elements.
Furthermore, while scHi-C can measure the chromatin contacts (i.e., loops) between active regulatory elements and their target genes in single cells, bulk HiChIP can measure such contacts at a higher resolution. In this work, we introduce DC3 (De-Convolution and Coupled-Clustering) as a method for the joint analysis of various bulk and single-cell data such as HiChIP, RNA-seq and ATAC-seq from the same heterogeneous cell population. DC3 can simultaneously identify distinct subpopulations, assign single cells to the subpopulations (i.e., clustering) and de-convolve the bulk data into subpopulation-specific data. Reference
Identifying significantly impacted pathways: a comprehensive review and assessment
Many high-throughput experiments compare two phenotypes such as disease vs. healthy, with the goal of understanding the underlying biological phenomena characterizing the given phenotype. Because of the importance of this type of analysis, more than 70 pathway analysis methods have been proposed so far.
These can be categorized into two main categories: non-topology-based (non-TB) and topology-based (TB). Although some review papers discuss this topic from different aspects, there is no systematic, large-scale assessment of such methods. Furthermore, the majority of the pathway analysis approaches rely on the assumption of uniformity of p values under the null hypothesis, which is often not true. Reference
Convergence of human and Old World monkey gut microbiomes demonstrates the importance of human ecology over phylogeny
Comparative data from non-human primates provide insight into the processes that shaped the evolution of the human gut microbiome and highlight microbiome traits that differentiate humans from other primates.
Here, in an effort to improve our understanding of the human microbiome, we compare gut microbiome composition and functional potential in 14 populations of humans from ten nations and 18 species of wild, non-human primates. Contrary to expectations from host phylogenetics, we find that human gut microbiome composition and functional potential are more similar to those of cercopithecines, a subfamily of Old World monkey, particularly baboons, than to those of African apes. Reference
Transcriptional landscape and clinical utility of enhancer RNAs for eRNA-targeted therapy in cancer
Enhancer RNA (eRNA) is a type of noncoding RNA transcribed from the enhancer. Although critical roles of eRNA in gene transcription control have been increasingly realized, the systemic landscape and potential function of eRNAs in cancer remain largely unexplored.
Here, we report the integration of multi-omics and pharmacogenomics data across large-scale patient samples and cancer cell lines. We observe a cancer-/lineage-specificity of eRNAs, which may be largely driven by tissue-specific TFs. eRNAs are involved in multiple cancer signaling pathways through putatively regulating their target genes, including clinically actionable genes and immune checkpoints. Reference
Interplay between the human gut microbiome and host metabolism
The human gut is inhabited by a complex and metabolically active microbial ecosystem. While many studies focused on the effect of individual microbial taxa on human health, their overall metabolic potential has been under-explored.
Using whole-metagenome shotgun sequencing data in 1,004 twins, we first observed that unrelated subjects share, on average, almost twice as many metabolic pathways (82%) as species (43%). Then, using 673 blood and 713 faecal metabolites, we found metabolic pathways to be associated with 34% of blood and 95% of faecal metabolites, with over 18,000 significant associations, while species showed fewer than 3,000 associations. Finally, we estimated that the microbiome was involved in a dialogue between 71% of faecal, and 15% of blood, metabolites. Reference
A homology-guided, genome-based proteome for improved proteomics in the alloploid Nicotiana benthamiana
Nicotiana benthamiana is an important model organism of the Solanaceae (Nightshade) family. Several draft assemblies of the N. benthamiana genome have been generated, but many of the gene-models in these draft assemblies appear incorrect.
Here we present an improved proteome based on the Niben1.0.1 draft genome assembly guided by gene models from other Nicotiana species. Due to the fragmented nature of the Niben1.0.1 draft genome, many protein-encoding genes are missing or partial. We complement these missing proteins by similarly annotating other draft genome assemblies. This approach overcomes problems caused by mis-annotated exon-intron boundaries and mis-assigned short read transcripts to homeologs in polyploid genomes. Reference
Identifying Crohn’s disease signal from variome analysis
After years of concentrated research efforts, the exact cause of Crohn’s disease (CD) remains unknown. Its accurate diagnosis, however, helps in management and preventing the onset of disease. Genome-wide association studies have identified 241 CD loci, but these carry small log odds ratios and are thus diagnostically uninformative.
Here, we describe a machine learning method—AVA,Dx (Analysis of Variation for Association with Disease)—that uses exonic variants from whole exome or genome sequencing data to extract CD signal and predict CD status. Using the person-specific coding variation in genes from a panel of only 111 individuals, we built disease-prediction models informative of previously undiscovered disease genes. Reference
Target genes, variants, tissues and transcriptional pathways influencing human serum urate levels
Elevated serum urate levels cause gout and correlate with cardiometabolic diseases via poorly understood mechanisms.
We performed a trans-ancestry genome-wide association study of serum urate in 457,690 individuals, identifying 183 loci (147 previously unknown) that improve the prediction of gout in an independent cohort of 334,880 individuals. Serum urate showed significant genetic correlations with many cardiometabolic traits, with genetic causality analyses supporting a substantial role for pleiotropy. Enrichment analysis, fine-mapping of urate-associated loci and colocalization with gene expression in 47 tissues implicated the kidney and liver as the main target organs and prioritized potentially causal genes and variants, including the transcriptional master regulators in the liver and kidney, HNF1A and HNF4A. Reference
Multi-Cell ECM compaction is predictable via superposition of nonlinear cell dynamics linearized in augmented state space
Cells interacting through an extracellular matrix (ECM) exhibit emergent behaviors resulting from collective intercellular interaction. In wound healing and tissue development, characteristic compaction of ECM gel is induced by multiple cells that generate tensions in the ECM fibers and coordinate their actions with other cells. Computational prediction of collective cell-ECM interaction based on first principles is highly complex, especially as the number of cells increases.
Here, we introduce a computationally-efficient method for predicting nonlinear behaviors of multiple cells interacting mechanically through a 3-D ECM fiber network. The key enabling technique is superposition of single cell computational models to predict multicellular behaviors. While cell-ECM interactions are highly nonlinear, they can be linearized accurately with a unique method, termed Dual-Faceted Linearization. This method recasts the original nonlinear dynamics in an augmented space where the system behaves more linearly. The independent state variables are augmented by combining auxiliary variables that inform nonlinear elements involved in the system. This computational method involves a) expressing the original nonlinear state equations with two sets of linear dynamic equations, b) reducing the order of the augmented linear system via principal component analysis, and c) superposing individual single cell-ECM dynamics to predict collective behaviors of multiple cells. Reference
Transcriptome-wide association study of attention deficit hyperactivity disorder identifies associated genes and phenotypes
Attention deficit/hyperactivity disorder (ADHD) is a common neurodevelopmental psychiatric disorder. Genome-wide association studies (GWAS) have identified several loci associated with ADHD. However, understanding the biological relevance of these genetic loci has proven to be difficult.
Here, we conduct an ADHD transcriptome-wide association study (TWAS) consisting of 19,099 cases and 34,194 controls and identify 9 transcriptome-wide significant hits, of which 6 genes were not implicated in the original GWAS. We demonstrate that two of the previous GWAS hits can be largely explained by expression regulation. Probabilistic causal fine-mapping of TWAS signals prioritizes KAT2B with a posterior probability of 0.467 in the dorsolateral prefrontal cortex and TMEM161B with a posterior probability of 0.838 in the amygdala. Reference
microRNA arm-imbalance in part from complementary targets mediated decay promotes gastric cancer progression
Strand-selection is the final step of microRNA biogenesis, in which functional mature miRNAs are generated from one or both arms of the precursor. The preference of strand-selection is diverse during development and tissue formation; however, its pathological effect is still unknown.
Here we find that two miRNA arms from the same precursor, miR-574-5p and miR-574-3p, are inversely expressed and play exactly opposite roles in gastric cancer progression. A higher-5p with lower-3p expression pattern is significantly correlated with higher TNM stages and poor prognosis of gastric cancer patients. The increase of the miR-574-5p/-3p ratio, named miR-574 arm-imbalance, is partially due to the dynamic expression of their highly complementary targets in gastric carcinogenesis. Moreover, the arm-imbalance of miR-574 is in turn involved in, and further promotes, gastric cancer progression. Reference
A de novo evolved gene in the house mouse regulates female pregnancy cycles
The de novo emergence of new genes has been well documented through genomic analyses. However, a functional analysis, especially of very young protein-coding genes, is still largely lacking. Here, we identify a set of house mouse-specific protein-coding genes and assess their translation by ribosome profiling and mass spectrometry data.
We functionally analyze one of them, Gm13030, which is specifically expressed in females in the oviduct. The interruption of the reading frame affects the transcriptional network in the oviducts at a specific stage of the estrous cycle. This includes the upregulation of Dcpp genes, which are known to stimulate the growth of preimplantation embryos. As a consequence, knockout females have their second litters after shorter times and have a higher infanticide rate. Given that Gm13030 shows no signs of positive selection, our findings support the hypothesis that a de novo evolved gene can directly adopt a function without much sequence adaptation. Reference
Metabolomic adaptations and correlates of survival to immune checkpoint blockade
Despite remarkable success of immune checkpoint inhibitors, the majority of cancer patients have yet to receive durable benefits.
Here, in order to investigate the metabolic alterations in response to immune checkpoint blockade, we comprehensively profile serum metabolites in advanced melanoma and renal cell carcinoma patients treated with nivolumab, an antibody against programmed cell death protein 1 (PD1). We identify serum kynurenine/tryptophan ratio increases as an adaptive resistance mechanism associated with worse overall survival. This advocates for patient stratification and metabolic monitoring in immunotherapy clinical trials including those combining PD1 blockade with indoleamine 2,3-dioxygenase/tryptophan 2,3-dioxygenase (IDO/TDO) inhibitors. Reference
Divergent neuronal DNA methylation patterns across human cortical development reveal critical periods and a unique role of CpH methylation
DNA methylation (DNAm) is a critical regulator of both development and cellular identity and shows unique patterns in neurons. To better characterize maturational changes in DNAm patterns in these cells, we profile the DNAm landscape at single-base resolution across the first two decades of human neocortical development in NeuN+ neurons using whole-genome bisulfite sequencing and compare them to non-neurons (primarily glia) and prenatal homogenate cortex.
We show that DNAm changes more dramatically during the first 5 years of postnatal life than during the entire remaining period. We further refine global patterns of increasingly divergent neuronal CpG and CpH methylation (mCpG and mCpH) into six developmental trajectories and find that in contrast to genome-wide patterns, neighboring mCpG and mCpH levels within these regions are highly correlated. Reference
Massively parallel RNA device engineering in mammalian cells with RNA-Seq
Synthetic RNA-based genetic devices dynamically control a wide range of gene-regulatory processes across diverse cell types. However, the limited throughput of quantitative assays in mammalian cells has hindered fast iteration and interrogation of sequence space needed to identify new RNA devices.
Here we report developing a quantitative, rapid and high-throughput mammalian cell-based RNA-Seq assay to efficiently engineer RNA devices. We identify new ribozyme-based RNA devices that respond to theophylline, hypoxanthine, cyclic-di-GMP, and folinic acid from libraries of ~22,700 sequences in total. The small molecule responsive devices exhibit low basal expression and high activation ratios, significantly expanding our toolset of highly functional ribozyme switches. Reference
Genetic architecture of human plasma lipidome and its link to cardiovascular disease
Understanding the genetic architecture of the plasma lipidome could provide better insights into lipid metabolism and its link to cardiovascular diseases (CVDs).
Here, we perform genome-wide association analyses of 141 lipid species (n = 2,181 individuals), followed by phenome-wide scans with 25 CVD related phenotypes (n = 511,700 individuals). We identify 35 lipid-species-associated loci (P < 5 × 10⁻⁸), 10 of which associate with CVD risk, including five new loci: COL5A1, GLTPD2, SPTLC3, MBOAT7 and GALNT16 (false discovery rate < 0.05). We identify loci for lipid species that are shown to predict CVD, e.g., SPTLC3 for CER(d18:1/24:1). We show that lipoprotein lipase (LPL) may more efficiently hydrolyze medium length triacylglycerides (TAGs) than others. Polyunsaturated lipids have the highest heritability and genetic correlations, suggesting considerable genetic regulation at fatty acid levels. Reference
Genome-wide recombination map construction from single individuals using linked-read sequencing
Meiotic recombination rates vary across the genome, often involving localized crossover hotspots and coldspots. Studying the molecular basis and mechanisms underlying this variation has been challenging due to the high cost and effort required to construct individualized genome-wide maps of recombination crossovers.
Here we introduce a new method, called ReMIX, to detect crossovers from gamete DNA of a single individual using Illumina sequencing of 10X Genomics linked-read libraries. ReMIX reconstructs haplotypes and identifies the valuable rare molecules spanning crossover breakpoints, allowing quantification of the genomic location and intensity of meiotic recombination. Using a single mouse and stickleback fish, we demonstrate how ReMIX faithfully recovers recombination hotspots and landscapes that have previously been built using hundreds of offspring. Reference
Signatures of selection in the genome of Swedish warmblood horses selected for sport performance
A growing demand for improved physical skills and mental attitude in modern sport horses has led to strong selection for performance in many warmblood studbooks. The aim of this study was to detect genomic regions with low diversity, and therefore potentially under selection, in Swedish Warmblood horses (SWB) by analysing high-density SNP data.
To investigate if such signatures could be the result of selection for equestrian sport performance, we compared our SWB SNP data with those from Exmoor ponies, a horse breed not selected for sport performance traits. The genomic scan for homozygous regions identified long runs of homozygosity (ROH) shared by more than 85% of the genotyped SWB individuals. Such ROH were located on ECA4, ECA6, ECA7, ECA10 and ECA17. Reference
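A run of homozygosity is, at its simplest, a stretch of consecutive homozygous genotype calls. The sketch below shows only that core idea under stated assumptions: genotypes are coded as alternate-allele counts (0, 1 or 2), runs are defined by a minimum number of SNPs rather than physical length, and no heterozygous or missing calls are tolerated inside a run. Dedicated ROH tools apply more elaborate criteria, and the function name `find_roh` is hypothetical.

```python
def find_roh(genotypes, min_snps=3):
    """Return inclusive (start, end) index pairs of homozygous runs.

    genotypes: alternate-allele counts per SNP along one chromosome
               (0 or 2 = homozygous, 1 = heterozygous).
    min_snps:  assumed minimum number of consecutive homozygous SNPs.
    """
    runs = []
    start = None
    for i, g in enumerate(genotypes):
        if g in (0, 2):
            if start is None:
                start = i  # open a new candidate run
        else:
            # A heterozygous call closes the current run, if any.
            if start is not None and i - start >= min_snps:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last SNP.
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes) - 1))
    return runs


# Toy genotype vector (illustrative only).
toy_genotypes = [0, 0, 2, 1, 0, 1, 2, 2, 2, 2]
toy_runs = find_roh(toy_genotypes)  # [(0, 2), (6, 9)]
```

Running this per individual and counting how many individuals carry a run over the same markers gives the kind of sharing statistic referred to above.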
Discovering genetic interactions bridging pathways in genome-wide association studies
Genetic interactions have been reported to underlie phenotypes in a variety of systems, but the extent to which they contribute to complex disease in humans remains unclear.
In principle, genome-wide association studies (GWAS) provide a platform for detecting genetic interactions, but existing methods for identifying them from GWAS data tend to focus on testing individual locus pairs, which undermines statistical power. Importantly, a global genetic network mapped for a model eukaryotic organism revealed that genetic interactions often connect genes between compensatory functional modules in a highly coherent manner. Taking advantage of this expected structure, we developed a computational approach called BridGE that identifies pathways connected by genetic interactions from GWAS data. Reference
A compendium of promoter-centered long-range chromatin interactions in the human genome
A large number of putative cis-regulatory sequences have been annotated in the human genome, but the genes they control remain poorly defined. To bridge this gap, we generate maps of long-range chromatin interactions centered on 18,943 well-annotated promoters for protein-coding genes in 27 human cell/tissue types.
We use this information to infer the target genes of 70,329 candidate regulatory elements and suggest potential regulatory function for 27,325 noncoding sequence variants associated with 2,117 physiological traits and diseases. Integrative analysis of these promoter-centered interactome maps reveals widespread enhancer-like promoters involved in gene regulation and common molecular pathways underlying distinct groups of human traits and diseases. Reference
Quantitative MNase-seq accurately maps nucleosome occupancy levels
Micrococcal nuclease (MNase) is widely used to map nucleosomes. However, its aggressive endo-/exo-nuclease activities make MNase-seq unreliable for determining nucleosome occupancies, because cleavages within linker regions produce oligo- and mono-nucleosomes, whereas cleavages within nucleosomes destroy them.
Here, we introduce a theoretical framework for predicting nucleosome occupancies and an experimental protocol with appropriate spike-in normalization that confirms our theory and provides accurate occupancy levels over an MNase digestion time course. As with human cells, we observe no overall differences in nucleosome occupancies between Drosophila euchromatin and heterochromatin, which implies that heterochromatic compaction does not reduce MNase accessibility of linker DNA. Reference
Long-read assembly of the Chinese rhesus macaque genome and identification of ape-specific structural variants
We present a high-quality de novo genome assembly (rheMacS) of the Chinese rhesus macaque (Macaca mulatta) using long-read sequencing and multiplatform scaffolding approaches. Compared to the current Indian rhesus macaque reference genome (rheMac8), rheMacS increases sequence contiguity 75-fold, closing 21,940 of the remaining assembly gaps (60.8 Mbp).
We improve gene annotation by generating more than two million full-length transcripts from ten different tissues by long-read RNA sequencing. We sequence resolve 53,916 structural variants (96% novel) and identify 17,000 ape-specific structural variants (ASSVs) based on comparison to ape genomes. Many ASSVs map within ChIP-seq predicted enhancer regions where apes and macaque show diverged enhancer activity and gene expression. Reference
Consistent and correctable bias in metagenomic sequencing experiments
Marker-gene and metagenomic sequencing have profoundly expanded our ability to measure biological communities. But the measurements they provide differ from the truth, often dramatically, because these experiments are biased toward detecting some taxa over others. This experimental bias makes the taxon or gene abundances measured by different protocols quantitatively incomparable and can lead to spurious biological conclusions.
We propose a mathematical model for how bias distorts community measurements based on the properties of real experiments. We validate this model with 16S rRNA gene and shotgun metagenomics data from defined bacterial communities. Our model better fits the experimental data despite being simpler than previous models. We illustrate how our model can be used to evaluate protocols, to understand the effect of bias on downstream statistical analyses, and to measure and correct bias given suitable calibration controls. These results illuminate new avenues toward truly quantitative and reproducible metagenomics measurements. Reference
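The abstract does not state the form of the model. One simple form, assumed here purely for illustration, is multiplicative: each taxon has a fixed relative detection efficiency, and measured proportions are the actual proportions multiplied by those efficiencies and renormalized. The sketch below shows how, under that assumption, a calibration control of known composition lets the efficiencies be estimated and another sample corrected. All function names and numbers are hypothetical.

```python
def normalize(values):
    """Rescale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def measure(actual, efficiency):
    """Apply assumed multiplicative bias, then renormalize to proportions."""
    return normalize([a * e for a, e in zip(actual, efficiency)])


def estimate_efficiency(control_actual, control_observed):
    """Estimate efficiencies relative to the first taxon from a control.

    Requires every taxon to be present in the control (no zeros).
    """
    ratios = [o / a for o, a in zip(control_observed, control_actual)]
    return [r / ratios[0] for r in ratios]


def correct(observed, efficiency):
    """Undo the multiplicative bias and renormalize."""
    return normalize([o / e for o, e in zip(observed, efficiency)])


# Toy three-taxon example (all numbers invented).
true_efficiency = [1.0, 4.0, 0.5]
control_actual = [1 / 3, 1 / 3, 1 / 3]
control_observed = measure(control_actual, true_efficiency)
estimated = estimate_efficiency(control_actual, control_observed)

sample_actual = [0.6, 0.3, 0.1]
sample_observed = measure(sample_actual, true_efficiency)
sample_corrected = correct(sample_observed, estimated)
```

Under this assumed form only ratios between taxa are identifiable, so efficiencies are expressed relative to a reference taxon; the corrected proportions recover the toy sample's actual composition up to floating-point error.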
The mutational landscape of a prion-like domain
Insoluble protein aggregates are the hallmarks of many neurodegenerative diseases. For example, aggregates of TDP-43 occur in nearly all cases of amyotrophic lateral sclerosis (ALS). However, whether aggregates cause cellular toxicity is still not clear, even in simpler cellular systems.
We reasoned that deep mutagenesis might be a powerful approach to disentangle the relationship between aggregation and toxicity. We generated >50,000 mutations in the prion-like domain (PRD) of TDP-43 and quantified their toxicity in yeast cells. Surprisingly, mutations that increase hydrophobicity and aggregation strongly decrease toxicity. In contrast, toxic variants promote the formation of dynamic liquid-like condensates. Reference
Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling
Single-cell RNA sequencing has enabled the decomposition of complex tissues into functionally distinct cell types. Often, investigators wish to assign cells to cell types through unsupervised clustering followed by manual annotation or via ‘mapping’ to existing data.
However, manual interpretation scales poorly to large datasets, mapping approaches require purified or pre-annotated data and both are prone to batch effects. To overcome these issues, we present CellAssign, a probabilistic model that leverages prior knowledge of cell-type marker genes to annotate single-cell RNA sequencing data into predefined or de novo cell types. CellAssign automates the process of assigning cells in a highly scalable manner across large datasets while controlling for batch and sample effects. Reference
The first enhancer in an enhancer chain safeguards subsequent enhancer-promoter contacts from a distance
Robustness and evolutionary stability of gene expression in the human genome are established by an array of redundant enhancers. Using Hi-C data in multiple cell lines, we report a comprehensive map of promoters and active enhancers connected by chromatin contacts, spanning 9000 enhancer chains in 4 human cell lines associated with 2600 human genes.
We find that the first enhancer in a chain that directly contacts the target promoter is commonly located at a greater genomic distance from the promoter than the second enhancer in a chain, 96 kb vs. 45 kb, respectively. The first enhancer also features higher similarity to the promoter in terms of tissue specificity and higher enrichment of loop factors, suggestive of a stable primary contact with the promoter. Reference
CUT&RUNTools: a flexible pipeline for CUT&RUN processing and footprint analysis
We introduce CUT&RUNTools as a flexible, general pipeline for facilitating the identification of chromatin-associated protein binding and genomic footprinting analysis from antibody-targeted CUT&RUN primary cleavage data.
CUT&RUNTools extracts endonuclease cut site information from sequences of short-read fragments and produces single-locus binding estimates, aggregate motif footprints, and informative visualizations to support the high-resolution mapping capability of CUT&RUN. Reference
Comparative genomics reveals the origin of fungal hyphae and multicellularity
Hyphae represent a hallmark structure of multicellular fungi. The evolutionary origins of hyphae and of the underlying genes are, however, hardly known. By systematically analyzing 72 complete genomes, we here show that hyphae evolved early in fungal evolution probably via diverse genetic changes, including co-option and exaptation of ancient eukaryotic (e.g. phagocytosis-related) genes, the origin of new gene families, gene duplications and alterations of gene structure, among others.
Contrary to most multicellular lineages, the origin of filamentous fungi did not correlate with expansions of kinases, receptors or adhesive proteins. Co-option was probably the dominant mechanism for recruiting genes for hypha morphogenesis, while gene duplication was apparently less prevalent, except in transcriptional regulators and cell wall-related genes. We identified 414 novel gene families that show correlated evolution with hyphae and that may have contributed to its evolution. Reference
Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight adipocyte biology
Population-based biobanks with genomic and dense phenotype data provide opportunities for generating effective therapeutic hypotheses and understanding the genomic role in disease predisposition. To characterize latent components of genetic associations, we apply truncated singular value decomposition (DeGAs) to matrices of summary statistics derived from genome-wide association analyses across 2,138 phenotypes measured in 337,199 White British individuals in the UK Biobank study.
We systematically identify key components of genetic associations and the contributions of variants, genes, and phenotypes to each component. As an illustration of the utility of the approach to inform downstream experiments, we report putative loss of function variants, rs114285050 (GPR151) and rs150090666 (PDE3B), that substantially contribute to obesity-related traits and experimentally demonstrate the role of these genes in adipocyte biology. Reference
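As a rough illustration of the decomposition step, the sketch below computes a truncated singular value decomposition of a small phenotype-by-variant matrix and derives per-component contribution scores. It is not the DeGAs implementation: it uses plain power iteration with deflation so that it runs without external libraries, and defining a contribution as the squared loading is an assumption made here for illustration. All function names are hypothetical.

```python
import math
import random


def matvec(a, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in a]


def rmatvec(a, y):
    """Compute A^T y."""
    return [sum(a[i][j] * y[i] for i in range(len(a))) for j in range(len(a[0]))]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def truncated_svd(matrix, k, iters=500, seed=0):
    """Return up to k triplets (u, s, v) by power iteration with deflation."""
    rng = random.Random(seed)
    a = [row[:] for row in matrix]  # work on a copy; the input is not modified
    components = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in a[0]]
        n = norm(v)
        v = [x / n for x in v]
        for _ in range(iters):
            w = rmatvec(a, matvec(a, v))  # one step of power iteration on A^T A
            n = norm(w)
            if n == 0.0:
                break
            v = [x / n for x in w]
        av = matvec(a, v)
        s = norm(av)
        if s == 0.0:
            break  # nothing left to extract
        u = [x / s for x in av]
        components.append((u, s, v))
        # Deflate: subtract the rank-1 component just found.
        for i in range(len(a)):
            for j in range(len(a[0])):
                a[i][j] -= s * u[i] * v[j]
    return components


def contribution_scores(vector):
    """Squared loadings, normalized to sum to 1 (an illustrative definition)."""
    squared = [x * x for x in vector]
    total = sum(squared)
    return [x / total for x in squared]


# Toy summary-statistic matrix: 3 phenotypes (rows) x 3 variants (columns).
toy = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
components = truncated_svd(toy, k=2)
variant_scores = contribution_scores(components[0][2])
```

For matrices on the scale described in the abstract, a sparse or randomized solver from a numerical library would be the practical choice; the loop above is only meant to make the linear algebra explicit.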
SalMotifDB: a tool for analyzing putative transcription factor binding sites in salmonid genomes
Recently developed genome resources in salmonid fish provide tools for studying the genomics underlying a wide range of properties including life history trait variation in the wild, economically important traits in aquaculture and the evolutionary consequences of whole genome duplications.
We present SalMotifDB, a database and associated web and R interface for the analysis of transcription factors (TFs) and their cis-regulatory binding sites in five salmonid genomes. SalMotifDB integrates TF-binding site information for 3072 non-redundant DNA patterns (motifs) assembled from a large number of metazoan motif databases. Through motif matching and TF prediction, we have used these multi-species databases to construct putative regulatory networks in salmonid species. Reference
Gut microbiota confers host resistance to obesity by metabolizing dietary polyunsaturated fatty acids
Gut microbiota mediates the effects of diet, thereby modifying host metabolism and the incidence of metabolic disorders. Increased consumption of omega-6 polyunsaturated fatty acid (PUFA), which is abundant in the Western diet, contributes to obesity and related diseases.
Although gut-microbiota-related metabolic pathways of dietary PUFAs were recently elucidated, the effects on host physiological function remain unclear. Here, we demonstrate that gut microbiota confers host resistance to high-fat diet (HFD)-induced obesity by modulating dietary PUFA metabolism. Supplementation of 10-hydroxy-cis-12-octadecenoic acid (HYA), an initial linoleic acid-related gut-microbial metabolite, attenuates HFD-induced obesity in mice without eliciting arachidonic acid-mediated adipose inflammation and by improving metabolic condition via free fatty acid receptors. Moreover, Lactobacillus-colonized mice show similar effects with elevated HYA levels. Our findings illustrate the interplay between gut microbiota and host energy metabolism via the metabolites of dietary omega-6 FAs, thereby shedding light on the prevention and treatment of metabolic disorders by targeting gut microbial metabolites. Reference
TOAST: improving reference-free cell composition estimation by cross-cell type differential analysis
In the analysis of high-throughput data from complex samples, cell composition is an important factor that needs to be accounted for. Except for a limited number of tissues with known pure cell type profiles, the majority of genomics and epigenetics data analyses rely on “reference-free deconvolution” methods to estimate cell composition.
We develop a novel computational method to improve reference-free deconvolution, which iteratively searches for cell type-specific features and performs composition estimation. Simulation studies and applications to six real datasets including both DNA methylation and gene expression data demonstrate favorable performance of the proposed method. Reference
MITRE: inferring features from microbiota time-series data linked to host status
Longitudinal studies are crucial for discovering causal relationships between the microbiome and human disease. We present MITRE, the Microbiome Interpretable Temporal Rule Engine, a supervised machine learning method for microbiome time-series analysis that infers human-interpretable rules linking changes in abundance of clades of microbes over time windows to binary descriptions of host status, such as the presence/absence of disease.
We validate MITRE’s performance on semi-synthetic data and five real datasets. MITRE performs on par with or outperforms conventional difficult-to-interpret machine learning approaches, providing a powerful new tool enabling the discovery of biologically interpretable relationships between microbiome and human host. Reference
Extreme inbreeding in a European ancestry sample from the contemporary UK population
In most human societies, there are taboos and laws banning mating between first- and second-degree relatives, but actual prevalence and effects on health and fitness are poorly quantified. Here, we leverage a large observational study of ~450,000 participants of European ancestry from the UK Biobank (UKB) to quantify extreme inbreeding (EI) and its consequences.
We use genotyped SNPs to detect large runs of homozygosity (ROH) and call EI when >10% of an individual’s genome consists of ROHs. We estimate a prevalence of EI of ~0.03%, i.e., ~1/3652. EI cases have phenotypic means between 0.3 and 0.7 standard deviations below the population mean for 7 traits, including stature and cognitive ability, consistent with inbreeding depression estimated from individuals with low levels of inbreeding. Our study provides DNA-based quantification of the prevalence of EI in a European ancestry sample from the UK and measures its effects on health and fitness traits. Reference
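The calling rule and the prevalence figure above reduce to simple arithmetic, sketched below. Only the 10% threshold and the 1/3652 ratio come from the abstract; the genome length, the ROH segment lengths, and the function names are hypothetical values chosen for illustration, and ROH detection itself is not shown.

```python
def roh_fraction(roh_lengths_bp, genome_length_bp):
    """Fraction of the genome covered by runs of homozygosity.

    roh_lengths_bp: lengths (bp) of already-detected, non-overlapping ROHs.
    genome_length_bp: length (bp) of the genome covered by the SNP array.
    """
    if genome_length_bp <= 0:
        raise ValueError("genome_length_bp must be positive")
    return sum(roh_lengths_bp) / genome_length_bp


def is_extreme_inbreeding(roh_lengths_bp, genome_length_bp, threshold=0.10):
    """Apply the abstract's rule: call EI when more than 10% of the genome is in ROHs."""
    return roh_fraction(roh_lengths_bp, genome_length_bp) > threshold


def prevalence_percent(one_in_n):
    """Convert a '1 in N' prevalence into a percentage."""
    return 100.0 / one_in_n


# Hypothetical individual: four large ROHs on an assumed 2.8 Gbp genome.
genome_bp = 2_800_000_000
rohs_bp = [120_000_000, 95_000_000, 80_000_000, 40_000_000]

fraction = roh_fraction(rohs_bp, genome_bp)
called_ei = is_extreme_inbreeding(rohs_bp, genome_bp)

# 1/3652 is roughly 0.027%, which the abstract rounds to ~0.03%.
prevalence = prevalence_percent(3652)
```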
Comprehensive characterization of circular RNAs in ~1000 human cancer cell lines
Human cancer cell lines are fundamental models for cancer research and therapeutic strategy development. However, there is no characterization of circular RNAs (circRNAs) in a large number of cancer cell lines.
Here, we apply four circRNA identification algorithms to heuristically characterize the expression landscape of circRNAs across ~1000 human cancer cell lines from CCLE polyA-enriched RNA-seq data. By using integrative analysis and experimental approaches, we explore the expression landscape, biogenesis, functional consequences, and drug response of circRNAs across different cancer lineages. We revealed highly lineage-specific expression patterns of circRNAs, suggesting that circRNAs may be powerful diagnostic and/or prognostic markers in cancer treatment. We also identified key genes involved in circRNA biogenesis and confirmed that TGF-β signaling may promote biogenesis of circRNAs. Reference
Raptor genomes reveal evolutionary signatures of predatory and nocturnal lifestyles
Birds of prey (raptors) are dominant apex predators in terrestrial communities, with hawks (Accipitriformes) and falcons (Falconiformes) hunting by day and owls (Strigiformes) hunting by night.
Here, we report new genomes and transcriptomes for 20 species of birds, including 16 species of birds of prey, and high-quality reference genomes for the Eurasian eagle-owl (Bubo bubo), oriental scops owl (Otus sunia), eastern buzzard (Buteo japonicus), and common kestrel (Falco tinnunculus). Our extensive genomic analysis and comparisons with non-raptor genomes identify common molecular signatures that underpin anatomical structure and sensory, muscle, circulatory, and respiratory systems related to a predatory lifestyle. Reference
A phenotypic and genomics approach in a multi-ethnic cohort to subtype systemic lupus erythematosus
Systemic lupus erythematosus (SLE) is a heterogeneous autoimmune disease in which outcomes vary among different racial groups. Here, we aim to identify SLE subgroups within a multiethnic cohort using an unsupervised clustering approach based on the American College of Rheumatology (ACR) classification criteria.
We identify three patient clusters that vary according to disease severity. Methylation association analysis identifies a set of 256 differentially methylated CpGs across clusters, including 101 CpGs in genes in the Type I Interferon pathway, and we validate these associations in an external cohort. A cis-methylation quantitative trait loci analysis identifies 744 significant CpG-SNP pairs. The methylation signature is enriched for ethnic-associated CpGs suggesting that genetic and non-genetic factors may drive outcomes and ethnic-associated methylation differences. Reference
iTRAQ-based quantitative proteomic and physiological analysis of the response to N deficiency and the compensation effect in rice
The crop growth compensation effect is a natural biological phenomenon, and nitrogen (N) is essential for crop growth and development, especially for yield formation. Little is known about the molecular mechanism of N deficiency and N compensation in rice.
Thus, the N-sensitive stage of rice was selected to study N deficiency at the tillering stage and N compensation at the young panicle differentiation stage. In this study, a proteome analysis was performed to analyze leaf differentially expressed proteins (DEPs), and to investigate the leaf physiological characteristics and yield under N deficiency and after N compensation. Reference
Unique transcriptional and protein-expression signature in human lung tissue-resident NK cells
Human lung tissue-resident NK cells (trNK cells) are likely to play an important role in host responses towards viral infections, inflammatory conditions and cancer. However, detailed insights into these cells are still largely lacking.
Here we show, using RNA sequencing and flow cytometry-based analyses, that subsets of human lung CD69+CD16− NK cells display hallmarks of tissue-residency, including high expression of CD49a, CD103, and ZNF683, and reduced expression of SELL, S1PR5, and KLF2/3. CD49a+CD16− NK cells are functionally competent, and produce IFN-γ, TNF, MIP-1β, and GM-CSF. After stimulation with IL-15, they upregulate perforin, granzyme B, and Ki67 to a similar degree as CD49a−CD16− NK cells. Reference
Molecular profiling of tissue biopsies reveals unique signatures associated with streptococcal necrotizing soft tissue infections
Necrotizing soft tissue infections (NSTIs) are devastating infections caused by either a single pathogen, predominantly Streptococcus pyogenes, or by multiple bacterial species. A better understanding of the pathogenic mechanisms underlying these different NSTI types could facilitate faster diagnostic and more effective therapeutic strategies.
Here, we integrate microbial community profiling with host and pathogen(s) transcriptional analysis in patient biopsies to dissect the pathophysiology of streptococcal and polymicrobial NSTIs. We observe that the pathogenicity of polymicrobial communities is mediated by synergistic interactions between community members, fueling a cycle of bacterial colonization and inflammatory tissue destruction. In S. pyogenes NSTIs, expression of specialized virulence factors underlies infection pathophysiology. Reference
Variant Interpretation for Cancer (VIC): a computational tool for assessing clinical impacts of somatic variants
Clinical laboratories implement a variety of measures to classify somatic sequence variants and identify clinically significant variants to facilitate the implementation of precision medicine.
To standardize the interpretation process, the Association for Molecular Pathology (AMP), American Society of Clinical Oncology (ASCO), and College of American Pathologists (CAP) published guidelines for the interpretation and reporting of sequence variants in cancer in 2017. These guidelines classify somatic variants using a four-tiered system with ten criteria. Even with the standardized guidelines, assessing clinical impacts of somatic variants remains tedious. Additionally, manual implementation of the guidelines may vary among professionals and may lack reproducibility when the supporting evidence is not documented in a consistent manner. Reference
Regulation of rumen development in neonatal ruminants through microbial metagenomes and host transcriptomes
In ruminants, early rumen development is vital for efficient fermentation that converts plant materials to human-edible food such as milk and meat. Here, we investigate the extent and functional basis of host-microbial interactions regulating rumen development during the first 6 weeks of life.
The use of microbial metagenomics, together with quantification of volatile fatty acids (VFAs) and qPCR, reveals the colonization of an active bacterial community in the rumen at birth. Colonization of active complex carbohydrate fermenters and archaea with methyl-coenzyme M reductase activity was also observed from the first week of life in the absence of a solid diet. Integrating microbial metagenomics and host transcriptomics reveals that only 26.3% of mRNA transcripts and 46.4% of miRNAs were responsive to VFAs, while the others were ontogenic. Reference
Metabolic landscape of the tumor microenvironment at single cell resolution
The tumor milieu consists of numerous cell types each existing in a different environment. However, a characterization of metabolic heterogeneity at single-cell resolution is not established.
Here, we develop a computational pipeline to study metabolic programs in single cells. In two representative human cancers, melanoma and head and neck, we apply this algorithm to define the intratumor metabolic landscape. We report an overall discordance between analyses of single cells and those of bulk tumors with higher metabolic activity in malignant cells than previously appreciated. Variation in mitochondrial programs is found to be the major contributor to metabolic heterogeneity. Surprisingly, the expression of both glycolytic and mitochondrial programs strongly correlates with hypoxia in all cell types. Immune and stromal cells could also be distinguished by their metabolic features. Reference
AlleleAnalyzer: a tool for personalized and allele-specific sgRNA design
The CRISPR/Cas system is a highly specific genome editing tool capable of distinguishing alleles differing by even a single base pair. Target sites might carry genetic variations that are not distinguishable by sgRNA designing tools based on one reference genome.
AlleleAnalyzer is an open-source software tool that incorporates single-nucleotide variants and short insertions and deletions to design sgRNAs for precisely editing one or multiple haplotypes of a sequenced genome, currently supporting 11 Cas proteins. It also leverages patterns of shared genetic variation to optimize sgRNA design for different human populations. Reference
Accurate ethnicity prediction from placental DNA methylation data
The influence of genetics on variation in DNA methylation (DNAme) is well documented. Yet confounding from population stratification is often unaccounted for in DNAme association studies.
Existing approaches to address confounding by population stratification using DNAme data may not generalize to populations or tissues outside those in which they were developed. To aid future placental DNAme studies in assessing population stratification, we developed an ethnicity classifier, PlaNET (Placental DNAme Elastic Net Ethnicity Tool), using five cohorts with Infinium Human Methylation 450k BeadChip array (HM450k) data from placental samples that is also compatible with the newer EPIC platform. Reference
Machine learning-based microarray analyses indicate low-expression genes might collectively influence PAH disease
Accurately predicting and testing the type of pulmonary arterial hypertension (PAH) in each patient using cost-effective microarray-based expression data and machine learning algorithms could greatly help in either identifying the most targeted medicine or adopting other therapeutic measures that could correct or restore defective genetic signaling at an early stage.
Furthermore, the process of constructing the prediction model can also help identify highly informative genes controlling PAH, leading to enhanced understanding of the disease etiology and molecular pathways. In this study, we used several different gene filtering methods based on microarray expression data obtained from a high-quality patient PAH dataset. Following that, we proposed a novel feature selection and refinement algorithm in conjunction with well-known machine learning methods to identify a small set of highly informative genes. Results indicated that clusters of low-expression genes could be extremely informative at predicting and differentiating different forms of PAH. Additionally, our proposed novel feature refinement algorithm could lead to significant enhancement in model performance. Reference
T-Scan: A Genome-wide Method for the Systematic Discovery of T Cell Epitopes
T cell recognition of specific antigens mediates protection from pathogens and controls neoplasias, but can also cause autoimmunity. Our knowledge of T cell antigens and their implications for human health is constrained by the technical limitations of T cell profiling technologies. Here, we present T-Scan, a high-throughput platform for identification of antigens productively recognized by T cells.
T-Scan uses lentiviral delivery of antigen libraries into cells for endogenous processing and presentation on major histocompatibility complex (MHC) molecules. Target cells functionally recognized by T cells are isolated using a reporter for granzyme B activity, and the antigens mediating recognition are identified by next-generation sequencing. We show T-Scan correctly identifies cognate antigens of T cell receptors (TCRs) from viral and human genome-wide libraries. We apply T-Scan to discover new viral antigens, perform high-resolution mapping of TCR specificity, and characterize the reactivity of a tumor-derived TCR. T-Scan is a powerful approach for studying T cell responses. Reference
Single cell transcriptome analysis of developing arcuate nucleus neurons uncovers their key developmental regulators
Despite the crucial physiological processes governed by neurons in the hypothalamic arcuate nucleus (ARC), such as growth, reproduction and energy homeostasis, the developmental pathways and regulators for ARC neurons remain understudied. Our single cell RNA-seq analyses of mouse embryonic ARC revealed many cell type-specific markers for developing ARC neurons.
These markers include transcription factors whose expression is enriched in specific neuronal types and often depleted in other closely-related neuronal types, raising the possibility that these transcription factors play important roles in the fate commitment or differentiation of specific ARC neuronal types. We validated this idea with the two transcription factors, Foxp2 enriched for Ghrh-neurons and Sox14 enriched for Kisspeptin-neurons, using Foxp2- and Sox14-deficient mouse models. Reference
Single-cell DNA replication profiling identifies spatiotemporal developmental dynamics of chromosome organization
In mammalian cells, chromosomes are partitioned into megabase-sized topologically associating domains (TADs). TADs can be in either A (active) or B (inactive) subnuclear compartments, which exhibit early and late replication timing (RT), respectively.
Here, we show that A/B compartments change coordinately with RT changes genome wide during mouse embryonic stem cell (mESC) differentiation. While A to B compartment changes and early to late RT changes were temporally inseparable, B to A changes clearly preceded late to early RT changes and transcriptional activation. Compartments changed primarily by boundary shifting, altering the compartmentalization of TADs facing the A/B compartment interface, which was conserved during reprogramming and confirmed in individual cells by single-cell Repli-seq. Reference
A meta-analysis of genome-wide association studies identifies multiple longevity genes
Human longevity is heritable, but genome-wide association (GWA) studies have had limited success. Here, we perform two meta-analyses of GWA studies of a rigorous longevity phenotype definition including 11,262/3484 cases surviving at or beyond the age corresponding to the 90th/99th survival percentile, respectively, and 25,483 controls whose age at death or at last contact was at or below the age corresponding to the 60th survival percentile.
Consistent with previous reports, rs429358 (apolipoprotein E (ApoE) ε4) is associated with lower odds of surviving to the 90th and 99th percentile age, while rs7412 (ApoE ε2) shows the opposite. Reference
BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes
To fully utilize the power of single-cell RNA sequencing (scRNA-seq) technologies for identifying cell lineages and bona fide transcriptional signals, it is necessary to combine data from multiple experiments.
We present BERMUDA (Batch Effect ReMoval Using Deep Autoencoders), a novel transfer-learning-based method for batch effect correction in scRNA-seq data. BERMUDA effectively combines different batches of scRNA-seq data with vastly different cell population compositions and amplifies biological signals by transferring information among batches. We demonstrate that BERMUDA outperforms existing methods for removing batch effects and distinguishing cell types in multiple simulated and real scRNA-seq datasets. Reference
Insight into genetic predisposition to chronic lymphocytic leukemia from integrative epigenomics
Genome-wide association studies have provided evidence for inherited genetic predisposition to chronic lymphocytic leukemia (CLL). To gain insight into the mechanisms underlying CLL risk we analyze chromatin accessibility, active regulatory elements marked by H3K27ac, and DNA methylation at 42 risk loci in up to 486 primary CLLs.
We identify that risk loci are significantly enriched for active chromatin in CLL with evidence of being CLL-specific or differentially regulated in normal B-cell development. We then use in situ promoter capture Hi-C, in conjunction with gene expression data to reveal likely target genes of the risk loci. Candidate target genes are enriched for pathways related to B-cell development such as MYC and BCL2 signalling. At 14 loci the analysis highlights 63 variants as the probable functional basis of CLL risk. Reference
A systematic assessment of current genome-scale metabolic reconstruction tools
Several genome-scale metabolic reconstruction software platforms have been developed and are being continuously updated. These tools have been widely applied to reconstruct metabolic models for hundreds of microorganisms ranging from important human pathogens to species of industrial relevance.
However, these platforms, as yet, have not been systematically evaluated with respect to software quality, best potential uses and intrinsic capacity to generate high-quality, genome-scale metabolic models. It is therefore unclear for potential users which tool best fits the purpose of their research. Reference
Connecting signaling and metabolic pathways in EGF receptor-mediated oncogenesis of glioblastoma
As malignant transformation requires synchronization of growth-driving signaling (S) and metabolic (M) pathways, defining cancer-specific S-M interconnected networks (SMINs) could lead to better understanding of oncogenic processes.
In a systems-biology approach, we developed a mathematical model for SMINs in mutated EGF receptor (EGFRvIII) compared to wild-type EGF receptor (EGFRwt) expressing glioblastoma multiforme (GBM). Starting with experimentally validated human protein-protein interactome data for S-M pathways, and incorporating proteomic data for EGFRvIII and EGFRwt GBM cells and patient transcriptomic data, we designed a dynamic model for EGFR-driven GBM-specific information flow. Key nodes and paths identified by in silico perturbation were validated experimentally when inhibition of signaling pathway proteins altered expression of metabolic proteins as predicted by the model. Reference
Comprehensive transcriptomic analysis of cell lines as models of primary tumors across 22 tumor types
Cancer cell lines are a cornerstone of cancer research but previous studies have shown that not all cell lines are equal in their ability to model primary tumors.
Here we present a comprehensive pan-cancer analysis utilizing transcriptomic profiles from The Cancer Genome Atlas and the Cancer Cell Line Encyclopedia to evaluate cell lines as models of primary tumors across 22 tumor types. We perform correlation analysis and gene set enrichment analysis to understand the differences between cell lines and primary tumors. Additionally, we classify cell lines into tumor subtypes in 9 tumor types. We present our pancreatic cancer results as a case study and find that the commonly used cell line MIA PaCa-2 is transcriptionally unrepresentative of primary pancreatic adenocarcinomas. Reference
Proteogenomic landscape of squamous cell lung cancer
How genomic and transcriptomic alterations affect the functional proteome in lung cancer is not fully understood. Here, we integrate DNA copy number, somatic mutations, RNA-sequencing, and expression proteomics in a cohort of 108 squamous cell lung cancer (SCC) patients.
We identify three proteomic subtypes, two of which (Inflamed, Redox) comprise 87% of tumors. The Inflamed subtype is enriched with neutrophils, B-cells, and monocytes and expresses more PD-1. Redox tumours are enriched for oxidation-reduction and glutathione pathways and harbor more NFE2L2/KEAP1 alterations and copy gain in the 3q2 locus. Proteomic subtypes are not associated with patient survival. Reference
BART-Seq: cost-effective massively parallelized targeted sequencing for genomics, transcriptomics, and single-cell analysis
We describe a highly sensitive, quantitative, and inexpensive technique for targeted sequencing of transcript cohorts or genomic regions from thousands of bulk samples or single cells in parallel.
Multiplexing is based on a simple method that produces extensive matrices of diverse DNA barcodes attached to invariant primer sets, which are all pre-selected and optimized in silico. By applying the matrices in a novel workflow named Barcode Assembly foR Targeted Sequencing (BART-Seq), we analyze developmental states of thousands of single human pluripotent stem cells, either in different maintenance media or upon Wnt/β-catenin pathway activation, which identifies the mechanisms of differentiation induction. Reference
GWAS hints at pleiotropic roles for FLOWERING LOCUS T in flowering time and yield-related traits in canola
Transition to flowering at the right time is critical for local adaptation and to maximize grain yield in crops. Canola is an important oilseed crop with extensive variation in flowering time among varieties. However, our understanding of underlying genes and their role in canola productivity is limited.
We report our analyses of a diverse GWAS panel (300–368 accessions) of canola and identify SNPs that are significantly associated with variation in flowering time and response to photoperiod across multiple locations. We show that several of these associations map in the vicinity of FLOWERING LOCUS T (FT) paralogs and its known transcriptional regulators. Complementary QTL and eQTL mapping studies, conducted in an Australian doubled haploid population, also detected consistent genomic regions close to the FT paralogs associated with flowering time and yield-related traits. FT sequences vary between accessions. Reference
The existence of discrete phenotypic traits suggests that the complex regulatory processes which produce them are functionally modular. These processes are usually represented by networks. Only modular networks can be partitioned into intelligible subcircuits able to evolve relatively independently.
Traditionally, functional modularity is approximated by detection of modularity in network structure. However, the correlation between structure and function is loose. Many regulatory networks exhibit modular behaviour without structural modularity. Here we partition an experimentally tractable regulatory network—the gap gene system of dipteran insects—using an alternative approach. We show that this system, although not structurally modular, is composed of dynamical modules driving different aspects of whole-network behaviour. Reference
A genome-wide positioning systems network algorithm for in silico drug repurposing
Recent advances in DNA/RNA sequencing have made it possible to identify new targets rapidly and to repurpose approved drugs for treating heterogeneous diseases by the ‘precise’ targeting of individualized disease modules.
In this study, we develop a Genome-wide Positioning Systems network (GPSnet) algorithm for drug repurposing by specifically targeting disease modules derived from individual patient’s DNA and RNA sequencing profiles mapped to the human protein-protein interactome network. We investigate whole-exome sequencing and transcriptome profiles from ~5,000 patients across 15 cancer types from The Cancer Genome Atlas. We show that GPSnet-predicted disease modules can predict drug responses and prioritize new indications for 140 approved drugs. Reference
A comprehensive examination of Nanopore native RNA sequencing for characterization of complex transcriptomes
A platform for highly parallel direct sequencing of native RNA strands was recently described by Oxford Nanopore Technologies, but despite initial efforts it remains crucial to further investigate the technology for quantification of complex transcriptomes.
Here we undertake native RNA sequencing of polyA+ RNA from two human cell lines, analysing ~5.2 million aligned native RNA reads. To enable informative comparisons, we also perform relevant ONT direct cDNA- and Illumina-sequencing. We find that while native RNA sequencing does enable some of the anticipated advantages, key unexpected aspects currently hamper its performance, most notably the quite frequent inability to obtain full-length transcripts from single reads, as well as difficulty in unambiguously inferring their true transcript of origin. Reference
HUPAN: a pan-genome analysis pipeline for human genomes
The human reference genome is still incomplete, especially for those population-specific or individual-specific regions, which may have important functions.
Here, we developed a HUman Pan-genome ANalysis (HUPAN) system to build the human pan-genome. We applied it to 185 deep sequencing and 90 assembled Han Chinese genomes and detected 29.5 Mb novel genomic sequences and at least 188 novel protein-coding genes missing in the human reference genome (GRCh38). It can be an important resource for the human genome-related biomedical studies, such as cancer genome analysis. Reference
Spatial chromatin architecture alteration by structural variations in human genomes at the population scale
The number of reported examples of chromatin architecture alterations involved in the regulation of gene transcription and in disease is increasing. However, no genome-wide testing has been performed to assess the abundance of these events and their importance relative to other factors affecting genome regulation.
This is particularly interesting given that a vast majority of genetic variations identified in association studies are located outside coding sequences. This study attempts to address this lack by analyzing the impact on chromatin spatial organization of genetic variants identified in individuals from 26 human populations and in genome-wide association studies. Reference
Role of p110a subunit of PI3-kinase in skeletal muscle mitochondrial homeostasis and metabolism
Skeletal muscle insulin resistance, decreased phosphatidylinositol 3-kinase (PI3K) activation and altered mitochondrial function are hallmarks of type 2 diabetes. To determine the relationship between these abnormalities, we created mice with muscle-specific knockout of the p110α or p110β catalytic subunits of PI3K.
We find that mice with muscle-specific knockout of p110α, but not p110β, display impaired insulin signaling and reduced muscle size due to enhanced proteasomal and autophagic activity. Despite insulin resistance and muscle atrophy, M-p110αKO mice show decreased serum myostatin, increased mitochondrial mass, increased mitochondrial fusion, and increased PGC1α expression, especially PGC1α2 and PGC1α3. Reference
Hidden Markov models lead to higher resolution maps of mutation signature activity in cancer
Knowing the activity of the mutational processes shaping a cancer genome may provide insight into tumorigenesis and personalized therapy. It is thus important to characterize the signatures of active mutational processes in patients from their patterns of single base substitutions.
However, mutational processes do not act uniformly on the genome, leading to statistical dependencies among neighboring mutations. To account for such dependencies, we develop the first sequence-dependent model, SigMa, for mutation signatures. We apply SigMa to characterize genomic and other factors that influence the activity of mutation signatures in breast cancer. We show that SigMa outperforms previous approaches, revealing novel insights on signature etiology. Reference
RaPID: ultra-fast, powerful, and accurate detection of segments identical by descent (IBD) in biobank-scale cohorts
While genetic relatedness, usually manifested as segments identical by descent (IBD), is ubiquitous in modern large biobanks, current IBD detection methods are not efficient at such a scale.
Here, we describe an efficient method, RaPID, for detecting IBD segments in a panel with phased haplotypes. RaPID achieves a time and space complexity linear in the input size and the number of reported IBDs. With simulation, we showed that RaPID is orders of magnitude faster than existing methods while offering competitive power and accuracy. In UK Biobank, RaPID identified 3,335,807 IBDs with a length ≥ 10 cM among 223,507 male X chromosomes in 11 min. Reference
FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science
FDA proactively invests in tools to support innovation of emerging technologies, such as infectious disease next generation sequencing (ID-NGS). Here, we introduce FDA-ARGOS quality-controlled reference genomes as a public database for diagnostic purposes and demonstrate its utility on the example of two use cases.
We provide quality control metrics for the FDA-ARGOS genomic database resource and outline the need for genome quality gap filling in the public domain. In the first use case, we show more accurate microbial identification of Enterococcus avium from metagenomic samples with FDA-ARGOS reference genomes compared to non-curated GenBank genomes. Reference
Integrating Gene and Protein Expression Reveals Perturbed Functional Networks in Alzheimer’s Disease
Asymptomatic and symptomatic Alzheimer’s disease (AD) subjects may present with equivalent neuropathological burdens but have significantly different antemortem cognitive decline rates. Using the transcriptome as a proxy for functional state, we selected 414 expression profiles of symptomatic AD subjects and age-matched non-demented controls from a community-based neuropathological study.
By combining brain tissue-specific protein interactomes with gene networks, we identified functionally distinct composite clusters of genes that reveal extensive changes in expression levels in AD. Global expression for clusters broadly corresponding to synaptic transmission, metabolism, cell cycle, survival, and immune response were downregulated, while the upregulated cluster included largely uncharacterized processes. Reference
Landscape of transcriptomic interactions between breast cancer and its microenvironment
Solid tumours comprise mixtures of tumour cells (TCs) and tumour-adjacent cells (TACs), and the intricate interconnections between these diverse populations shape the tumour’s microenvironment. Despite this complexity, clinical genomic profiling is typically performed from bulk samples, without distinguishing TCs from TACs.
To better understand TC–TAC interactions, we computationally distinguish their transcriptomes in 1780 primary breast tumours. We show that TC and TAC mRNA abundances are divergently associated with clinical phenotypes, including tumour subtypes and patient survival. These differences reflect distinct responses of TCs and TACs to specific somatic driver mutations, particularly TP53. These data further elucidate how the molecular interplay between breast tumours and their microenvironment drives aggressive tumour phenotypes. Reference
CellSIUS provides sensitive and specific detection of rare cell populations from complex single-cell RNA-seq data
We develop CellSIUS (Cell Subtype Identification from Upregulated gene Sets) to fill a methodology gap for rare cell population identification for scRNA-seq data.
CellSIUS outperforms existing algorithms for specificity and selectivity for rare cell types and their transcriptomic signature identification in synthetic and complex biological data. Characterization of a human pluripotent cell differentiation protocol recapitulating deep-layer corticogenesis using CellSIUS reveals unrecognized complexity in human stem cell-derived cellular populations. CellSIUS enables identification of novel rare cell populations and their signature genes providing the means to study those populations in vitro in light of their role in health and disease. Reference
Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq
Identifying gene expression programs underlying both cell-type identity and cellular activities (e.g. life-cycle processes, responses to environmental cues) is crucial for understanding the organization of cells and tissues. Although single-cell RNA-Seq (scRNA-Seq) can quantify transcripts in individual cells, each cell’s expression profile may be a mixture of both types of programs, making them difficult to disentangle.
Here, we benchmark and enhance the use of matrix factorization to solve this problem. We show with simulations that a method we call consensus non-negative matrix factorization (cNMF) accurately infers identity and activity programs, including their relative contributions in each cell. To illustrate the insights this approach enables, we apply it to published brain organoid and visual cortex scRNA-Seq datasets; cNMF refines cell types and identifies both expected (e.g. cell cycle and hypoxia) and novel activity programs, including programs that may underlie a neurosecretory phenotype and synaptogenesis. Reference
Gene expression profile of human T cells following a single stimulation of peripheral blood mononuclear cells with anti-CD3 antibodies
Anti-CD3 immunotherapy was initially approved for clinical use for renal transplantation rejection prevention. Subsequently, new generations of anti-CD3 antibodies have entered clinical trials for a broader spectrum of therapeutic applications, including cancer and autoimmune diseases.
Despite their extensive use, little is known about the exact mechanism of these molecules, except that they are able to activate T cells, inducing an overall immunoregulatory and tolerogenic behavior. To better understand the effects of anti-CD3 antibodies on human T cells, we stimulated PBMCs and then performed RNA-seq assays of enriched T cells to assess changes in their gene expression profiles. Reference
Analysis of the equine “cumulome” reveals major metabolic aberrations after maturation in vitro
Maturation of oocytes under in vitro conditions (IVM) results in impaired developmental competence compared to oocytes matured in vivo.
As oocytes are closely coupled to their cumulus complex, elucidating aberrations in cumulus metabolism in vitro is important to bridge the gap towards more physiological maturation conditions. The aim of this study was to analyze the equine “cumulome” in a novel combination of proteomic (nano-HPLC MS/MS) and metabolomic (UPLC-nanoESI-MS) profiling of single cumulus complexes of metaphase II oocytes matured either in vivo (n = 8) or in vitro (n = 7). Reference
Genome and epigenome wide studies of neurological protein biomarkers in the Lothian Birth Cohort 1936
Although plasma proteins may serve as markers of neurological disease risk, the molecular mechanisms responsible for inter-individual variation in plasma protein levels are poorly understood.
Therefore, we conduct genome- and epigenome-wide association studies on the levels of 92 neurological proteins to identify genetic and epigenetic loci associated with their plasma concentrations (n = 750 healthy older adults). We identify 41 independent genome-wide significant (P < 5.4 × 10⁻¹⁰) loci for 33 proteins and 26 epigenome-wide significant (P < 3.9 × 10⁻¹⁰) sites associated with the levels of 9 proteins. Using this information, we identify biological pathways in which putative neurological biomarkers are implicated (neurological, immunological and extracellular matrix metabolic pathways). Reference
Membrane protein-regulated networks across human cancers
Alterations in membrane proteins (MPs) and their regulated pathways have been established as cancer hallmarks and extensively targeted in clinical applications. However, the analysis of MP-interacting proteins and downstream pathways across human malignancies remains challenging.
Here, we present a systematically integrated method to generate a resource of cancer membrane protein-regulated networks (CaMPNets), containing 63,746 high-confidence protein–protein interactions (PPIs) for 1962 MPs, using expression profiles from 5922 tumors with overall survival outcomes across 15 human cancers. Comprehensive analysis of CaMPNets links MP partner communities and regulated pathways to provide MP-based gene sets for identifying prognostic biomarkers and druggable targets. Reference
CONFINED: distinguishing biological from technical sources of variation by leveraging multiple methylation datasets
Methylation datasets are affected by innumerable sources of variability, both biological (cell-type composition, genetics) and technical (batch effects).
Here, we propose a reference-free method based on sparse canonical correlation analysis to separate the biological from technical sources of variability. We show through simulations and real data that our method, CONFINED, is not only more accurate than the state-of-the-art reference-free methods for capturing known, replicable biological variability, but it is also considerably more robust to dataset-specific technical variability than previous approaches. Reference
GEMINI: a variational Bayesian approach to identify genetic interactions from combinatorial CRISPR screens
Systems for CRISPR-based combinatorial perturbation of two or more genes are emerging as powerful tools for uncovering genetic interactions.
However, systematic identification of these relationships is complicated by sample, reagent, and biological variability. We develop a variational Bayes approach (GEMINI) that jointly analyzes all samples and reagents to identify genetic interactions in pairwise knockout screens. The improved accuracy and scalability of GEMINI enables the systematic analysis of combinatorial CRISPR knockout screens, regardless of design and dimension. Reference
The nasal methylome as a biomarker of asthma and airway inflammation in children
The nasal cellular epigenome may serve as biomarker of airway disease and environmental response. Here we collect nasal swabs from the anterior nares of 547 children (mean-age 12.9 y), and measure DNA methylation (DNAm) with the Infinium MethylationEPIC BeadChip.
We perform nasal Epigenome-Wide Association analyses (EWAS) of current asthma, allergen sensitization, allergic rhinitis, fractional exhaled nitric oxide (FeNO) and lung function. We find multiple differentially methylated CpGs (FDR < 0.05) and Regions (DMRs; ≥ 5-CpGs and FDR < 0.05) for asthma (285-CpGs), FeNO (8,372-CpGs; 191-DMRs), total IgE (3-CpGs; 3-DMRs), environment IgE (17-CpGs; 4-DMRs), allergic asthma (1,235-CpGs; 7-DMRs) and bronchodilator response (130-CpGs). Reference
SMURF-seq: efficient copy number profiling on long-read sequencers
We present SMURF-seq, a protocol to efficiently sequence short DNA molecules on a long-read sequencer by randomly ligating them to form long molecules. Applying SMURF-seq using the Oxford Nanopore MinION yields up to 30 fragments per read, providing an average of 6.2 and up to 7.5 million mappable fragments per run, increasing information throughput for read-counting applications.
We apply SMURF-seq on the MinION to generate copy number profiles. A comparison with profiles from Illumina sequencing reveals that SMURF-seq attains similar accuracy. More broadly, SMURF-seq expands the utility of long-read sequencers for read-counting applications. Reference
GWAS of peripheral artery disease in the Million Veteran Program
Peripheral artery disease (PAD) is a leading cause of cardiovascular morbidity and mortality; however, the extent to which genetic factors increase risk for PAD is largely unknown. Using electronic health record data, we performed a genome-wide association study in the Million Veteran Program testing ~32 million DNA sequence variants with PAD (31,307 cases and 211,753 controls) across veterans of European, African and Hispanic ancestry.
The results were replicated in an independent sample of 5,117 PAD cases and 389,291 controls from the UK Biobank. We identified 19 PAD loci, 18 of which have not been previously reported. Eleven of the 19 loci were associated with disease in three vascular beds (coronary, cerebral, peripheral), including LDLR, LPL and LPA, suggesting that therapeutic modulation of low-density lipoprotein cholesterol, the lipoprotein lipase pathway or circulating lipoprotein(a) may be efficacious for multiple atherosclerotic disease phenotypes. Reference
Metabolic network percolation quantifies biosynthetic capabilities across the human oral microbiome
The biosynthetic capabilities of microbes underlie their growth and interactions, playing a prominent role in microbial community structure. For large, diverse microbial communities, prediction of these capabilities is limited by uncertainty about metabolic functions and environmental conditions.
To address this challenge, we propose a probabilistic method, inspired by percolation theory, to computationally quantify how robustly a genome-derived metabolic network produces a given set of metabolites under an ensemble of variable environments. We used this method to compile an atlas of predicted biosynthetic capabilities for 97 metabolites across 456 human oral microbes. This atlas captures taxonomically-related trends in biomass composition, and makes it possible to estimate inter-microbial metabolic distances that correlate with microbial co-occurrences. Reference
Integrative analysis of vascular endothelial cell genomic features identifies AIDA as a coronary artery disease candidate gene
Genome-wide association studies (GWAS) have identified hundreds of loci associated with coronary artery disease (CAD) and blood pressure (BP) or hypertension. Many of these loci are not linked to traditional risk factors, nor do they include obvious candidate genes, complicating their functional characterization.
We hypothesize that many GWAS loci associated with vascular diseases modulate endothelial functions. Endothelial cells play critical roles in regulating vascular homeostasis, such as roles in forming a selective barrier, inflammation, hemostasis, and vascular tone, and endothelial dysfunction is a hallmark of atherosclerosis and hypertension. Reference
Evolving neoantigen profiles in colorectal cancers with DNA repair defects
Neoantigens that arise as a consequence of tumor-specific mutations can be recognized by T lymphocytes leading to effective immune surveillance. In colorectal cancer (CRC) and other tumor types, a high number of neoantigens is associated with patient response to immune therapies.
The molecular processes governing the generation of neoantigens and their turnover in cancer cells are poorly understood. We exploited CRC as a model system to understand how alterations in DNA repair pathways modulate neoantigen profiles over time. Reference
Multi-region exome sequencing reveals genomic evolution from preneoplasia to lung adenocarcinoma
There has been a dramatic increase in the detection of lung nodules, many of which are preneoplasia atypical adenomatous hyperplasia (AAH), adenocarcinoma in situ (AIS), minimally invasive adenocarcinoma (MIA) or invasive adenocarcinoma (ADC).
The molecular landscape and the evolutionary trajectory of lung preneoplasia have not been well defined. Here, we perform multi-region exome sequencing of 116 resected lung nodules including AAH (n = 22), AIS (n = 27), MIA (n = 54) and synchronous ADC (n = 13). Comparing AAH to AIS, MIA and ADC, we observe progressive genomic evolution at the single nucleotide level and demarcated evolution at the chromosomal level supporting the early lung carcinogenesis model from AAH to AIS, MIA and ADC. Reference
Personal clinical history predicts antibiotic resistance of urinary tract infections
Antibiotic resistance is prevalent among the bacterial pathogens causing urinary tract infections. However, antimicrobial treatment is often prescribed ‘empirically’, in the absence of antibiotic susceptibility testing, risking mismatched and therefore ineffective treatment.
Here, linking a 10-year longitudinal data set of over 700,000 community-acquired urinary tract infections with over 5,000,000 individually resolved records of antibiotic purchases, we identify strong associations of antibiotic resistance with the demographics, records of past urine cultures and history of drug purchases of the patients. When combined together, these associations allow for machine-learning-based personalized drug-specific predictions of antibiotic resistance, thereby enabling drug-prescribing algorithms that match an antibiotic treatment recommendation to the expected resistance of each sample. Reference
RNA-Seq in 296 phased trios provides a high-resolution map of genomic imprinting
Identification of imprinted genes, demonstrating a consistent preference towards the paternal or maternal allelic expression, is important for the understanding of gene expression regulation during embryonic development and of the molecular basis of developmental disorders with a parent-of-origin effect.
Combining allelic analysis of RNA-Seq data with phased genotypes in family trios provides a powerful method to detect parent-of-origin biases in gene expression. Reference
CDetection: CRISPR-Cas12b-based DNA detection with sub-attomolar sensitivity and single-base specificity
CRISPR-based nucleic acid detection methods are reported to facilitate rapid and sensitive DNA detection. However, precise DNA detection at the single-base resolution and its wide applications including high-fidelity SNP genotyping remain to be explored. Here we develop a Cas12b-mediated DNA detection (CDetection) strategy, which shows higher sensitivity on examined targets compared with the previously reported Cas12a-based detection platform.
Moreover, we show that CDetection can distinguish differences at the single-base level upon combining the optimized tuned guide RNA (tgRNA). Therefore, our findings highlight the high sensitivity and accuracy of CDetection, which provides an efficient and highly practical platform for DNA detection. Reference
miRkwood: a tool for the reliable identification of microRNAs in plant genomes
MicroRNAs (miRNAs) play crucial roles in post-transcriptional regulation of eukaryotic gene expression and are involved in many aspects of plant development. Although several prediction tools are available for metazoan genomes, the number of tools dedicated to plants is relatively limited.
Here, we present miRkwood, a user-friendly tool for the identification of miRNAs in plant genomes using small RNA sequencing data. Deep-sequencing data of Argonaute associated small RNAs showed that miRkwood is able to identify a large diversity of plant miRNAs and limits false positive predictions. Reference
Death effector domain-containing protein induces vulnerability to cell cycle inhibition in triple-negative breast cancer
Lacking targetable molecular drivers, triple-negative breast cancer (TNBC) is the most clinically challenging subtype of breast cancer. In this study, we reveal that Death Effector Domain-containing DNA-binding protein (DEDD), which is overexpressed in > 60% of TNBCs, drives a mitogen-independent G1/S cell cycle transition through cytoplasm localization. The gain of cytosolic DEDD enhances cyclin D1 expression by interacting with heat shock 71 kDa protein 8 (HSC70). Concurrently, DEDD interacts with Rb family proteins and promotes their proteasome-mediated degradation.
DEDD overexpression renders TNBCs vulnerable to cell cycle inhibition. Patients with TNBC have been excluded from CDK 4/6 inhibitor clinical trials due to the perceived high frequency of Rb-loss in TNBCs. Interestingly, our study demonstrated that, irrespective of Rb status, TNBCs with DEDD overexpression exhibit a DEDD-dependent vulnerability to combinatorial treatment with CDK4/6 inhibitor and EGFR inhibitor in vitro and in vivo. Thus, our study provided a rationale for the clinical application of CDK4/6 inhibitor combinatorial regimens for patients with TNBC. Reference
Performance of neural network basecalling tools for Oxford Nanopore sequencing
Basecalling, the computational process of translating raw electrical signal to nucleotide sequence, is of critical importance to the sequencing platforms produced by Oxford Nanopore Technologies (ONT).
Here, we examine the performance of different basecalling tools, looking at accuracy at the level of bases within individual reads and at majority-rule consensus basecalls in an assembly. We also investigate some additional aspects of basecalling: training using a taxon-specific dataset, using a larger neural network model and improving consensus basecalls in an assembly by additional signal-level analysis with Nanopolish. Reference
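As a rough illustration of the two accuracy measures named in the abstract above, the sketch below computes per-read identity against a known truth sequence and a majority-rule consensus across reads. It is not the study's pipeline: the function names `read_identity` and `majority_consensus` and the toy sequences are hypothetical, and the reads are assumed to be already aligned to equal length, whereas real reads would first need alignment or assembly.

```python
from collections import Counter


def read_identity(read: str, truth: str) -> float:
    """Fraction of positions where an aligned read matches the truth sequence."""
    if len(read) != len(truth):
        raise ValueError("read and truth must be aligned to the same length")
    matches = sum(r == t for r, t in zip(read, truth))
    return matches / len(truth)


def majority_consensus(aligned_reads: list[str]) -> str:
    """Pick the most common base in each column of equal-length aligned reads."""
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Toy data (assumed, not from the study): each read carries one error
# at a different position.
truth = "ACGTACGT"
reads = ["ACGTACGA", "ACCTACGT", "ACGTTCGT"]

per_read = [read_identity(r, truth) for r in reads]
consensus = majority_consensus(reads)
```

In this toy case every read is 87.5% identical to the truth, yet the consensus recovers the truth exactly because the errors fall in different columns. Errors that recur at the same position across reads would survive the vote, which is why the two measures are reported separately.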
Microbiota therapy acts via a regulatory T cell MyD88/RORγt pathway to suppress food allergy
The role of dysbiosis in food allergy (FA) remains unclear. We found that dysbiotic fecal microbiota in FA infants evolved compositionally over time and failed to protect against FA in mice. Infants and mice with FA had decreased IgA and increased IgE binding to fecal bacteria, indicative of a broader breakdown of oral tolerance than hitherto appreciated.
Therapy with Clostridiales species impacted by dysbiosis, either as a consortium or as monotherapy with Subdoligranulum variabile, suppressed FA in mice as did a separate immunomodulatory Bacteroidales consortium. Bacteriotherapy induced expression by regulatory T (Treg) cells of the transcription factor ROR-γt in a MyD88-dependent manner, which was deficient in FA infants and mice and ineffectively induced by their microbiota. Reference
Transcriptomic correlates of electrophysiological and morphological diversity within and across excitatory and inhibitory neuron classes
In order to further our understanding of how gene expression contributes to key functional properties of neurons, we combined publicly accessible gene expression, electrophysiology, and morphology measurements to identify cross-cell type correlations between these data modalities. Building on our previous work using a similar approach, we distinguished between correlations which were “class-driven,” meaning those that could be explained by differences between excitatory and inhibitory cell classes, and those that reflected graded phenotypic differences within classes. Taking cell class identity into account increased the degree to which our results replicated in an independent dataset as well as their correspondence with known modes of ion channel function based on the literature.
We also found a smaller set of genes whose relationships to electrophysiological or morphological properties appear to be specific to either excitatory or inhibitory cell types. Next, using data from PatchSeq experiments, allowing simultaneous single-cell characterization of gene expression and electrophysiology, we found that some of the gene-property correlations observed across cell types were further predictive of within-cell type heterogeneity. Reference
Meta-omics analysis of elite athletes identifies a performance-enhancing microbe that functions via lactate metabolism
The human gut microbiome is linked to many states of human health and disease. The metabolic repertoire of the gut microbiome is vast, but the health implications of these bacterial pathways are poorly understood. In this study, we identify a link between members of the genus Veillonella and exercise performance.
We observed an increase in Veillonella relative abundance in marathon runners postmarathon and isolated a strain of Veillonella atypica from stool samples. Inoculation of this strain into mice significantly increased exhaustive treadmill run time. Veillonella utilize lactate as their sole carbon source, which prompted us to perform a shotgun metagenomic analysis in a cohort of elite athletes, finding that every gene in a major pathway metabolizing lactate to propionate is at higher relative abundance postexercise. Using ¹³C₃-labeled lactate in mice, we demonstrate that serum lactate crosses the epithelial barrier into the lumen of the gut. Reference
Integrated analysis of long non-coding RNA and mRNA expression in different colored skin of koi carp
Long non-coding RNAs (lncRNAs) perform crucial roles in biological processes involving complex mechanisms. However, information regarding their abundance, characteristics and potential functions linked to fish skin color is limited. Herein, Illumina sequencing and bioinformatics were conducted on black, white, and red skin of Koi carp (Cyprinus carpio L.).
A total of 590,415,050 clean reads, 446,614 putative transcripts, 4252 known and 72,907 novel lncRNAs were simultaneously obtained, including 92 significant differentially expressed lncRNAs and 722 mRNAs. Ccr_lnc5622441 and Ccr_lnc765201 were up-regulated in black and red skin, Ccr_lnc14074601 and Ccr_lnc2382951 were up-regulated in white skin, and premelanosome protein a (Pmela), Pmelb and tyrosinase (Tyr) were up-regulated in black skin. Reference
MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices
Sample multiplexing facilitates scRNA-seq by reducing costs and identifying artifacts such as cell doublets. However, universal and scalable sample barcoding strategies have not been described. We therefore developed MULTI-seq: multiplexing using lipid-tagged indices for single-cell and single-nucleus RNA sequencing.
MULTI-seq reagents can barcode any cell type or nucleus from any species with an accessible plasma membrane. The method involves minimal sample processing, thereby preserving cell viability and endogenous gene expression patterns. When cells are classified into sample groups using MULTI-seq barcode abundances, data quality is improved through doublet identification and recovery of cells with low RNA content that would otherwise be discarded by standard quality-control workflows. Reference
Exome sequencing in routine diagnostics: a generic test for 254 patients with primary immunodeficiencies
Diagnosis of primary immunodeficiencies (PIDs) is complex and cumbersome yet important for the clinical management of the disease. Exome sequencing may provide a genetic diagnosis in a significant number of patients in a single genetic test.
In May 2013, we implemented exome sequencing in routine diagnostics for patients suffering from PIDs. This study reports the clinical utility and diagnostic yield for a heterogeneous group of 254 consecutively referred PID patients from 249 families. For the majority of patients, the clinical diagnosis was based on clinical criteria including rare and/or unusual severe bacterial, viral, or fungal infections, sometimes accompanied by autoimmune manifestations. Functional immune defects were interpreted in the context of aberrant immune cell populations, aberrant antibody levels, or combinations of these factors. Reference
Identification of metabolic vulnerabilities of receptor tyrosine kinases-driven cancer
One of the biggest hurdles for the development of metabolism-targeted therapies is to identify the responsive tumor subsets. However, the metabolic vulnerabilities for most human cancers remain unclear.
Establishing the link between metabolic signatures and the oncogenic alterations of receptor tyrosine kinases (RTK), the most well-defined cancer genotypes, may precisely direct metabolic intervention to a broad patient population. By integrating metabolomics and transcriptomics, we herein show that oncogenic RTK activation causes distinct metabolic preferences. Specifically, EGFR activation branches glycolysis to serine synthesis for nucleotide biosynthesis and redox homeostasis, whereas FGFR activation recycles lactate to fuel oxidative phosphorylation for energy generation. Reference
A cellular census of human lungs identifies novel cell states in health and in asthma
Human lungs enable efficient gas exchange and form an interface with the environment, which depends on mucosal immunity for protection against infectious agents. Tightly controlled interactions between structural and immune cells are required to maintain lung homeostasis.
Here, we use single-cell transcriptomics to chart the cellular landscape of upper and lower airways and lung parenchyma in healthy lungs, and lower airways in asthmatic lungs. We report location-dependent airway epithelial cell states and a novel subset of tissue-resident memory T cells. In the lower airways of patients with asthma, mucous cell hyperplasia is shown to stem from a novel mucous ciliated cell state, as well as goblet cell hyperplasia. Reference
Cold stress induces enhanced chromatin accessibility and bivalent histone modifications H3K4me3 and H3K27me3 of active genes in potato
Cold stress can greatly affect plant growth and development. Plants have developed special systems to respond to and tolerate cold stress. While plant scientists have discovered numerous genes involved in responses to cold stress, few studies have been dedicated to investigation of genome-wide chromatin dynamics induced by cold or other abiotic stresses.
Genomic regions containing active cis-regulatory DNA elements can be identified as DNase I hypersensitive sites (DHSs). We develop high-resolution DHS maps in potato (Solanum tuberosum) using chromatin isolated from tubers stored under room (22 °C) and cold (4 °C) conditions. We find that cold stress induces a large number of DHSs enriched in genic regions which are frequently associated with differential gene expression in response to temperature variation. Reference
Predicting three-dimensional genome organization with chromatin states
We introduce a computational model to simulate chromatin structure and dynamics. Starting from one-dimensional genomics and epigenomics data that are available for hundreds of cell types, this model enables de novo prediction of chromatin structures at five-kilo-base resolution.
Simulated chromatin structures recapitulate known features of genome organization, including the formation of chromatin loops, topologically associating domains (TADs) and compartments, and are in quantitative agreement with chromosome conformation capture experiments and super-resolution microscopy measurements. Detailed characterization of the predicted structural ensemble reveals the dynamical flexibility of chromatin loops and the presence of cross-talk among neighboring TADs. Analysis of the model’s energy function uncovers distinct mechanisms for chromatin folding at various length scales and suggests a need to go beyond simple A/B compartment types to predict specific contacts between regulatory elements using polymer simulations. Reference
HumanMycobiomeScan: a new bioinformatics tool for the characterization of the fungal fraction in metagenomic samples
Modern metagenomic analysis of complex microbial communities produces large amounts of sequence data containing information on the microbiome in terms of bacterial, archaeal, viral and eukaryotic composition.
HumanMycobiomeScan is a bioinformatics tool for the taxonomic profiling of the mycobiome directly from raw data of next-generation sequencing. The tool uses hierarchical databases of fungi in order to unambiguously assign reads to fungal species more accurately and >10,000 times faster than other comparable approaches. HumanMycobiomeScan was validated using in silico generated synthetic communities and then applied to metagenomic data, to characterize the intestinal fungal components in subjects adhering to different subsistence strategies. Reference
Invasive DNA elements modify the nuclear architecture of their insertion site by KNOT-linked silencing in Arabidopsis thaliana
The three-dimensional (3D) organization of chromosomes is linked to epigenetic regulation and transcriptional activity. However, only a few functional features of 3D chromatin architecture have been described to date.
Here, we report the KNOT’s involvement in regulating invasive DNA elements. Transgenes can specifically interact with the KNOT, leading to perturbations of 3D nuclear organization, which correlates with the transgene’s expression: high KNOT interaction frequencies are associated with transgene silencing. KNOT-linked silencing (KLS) cannot readily be connected to canonical silencing mechanisms, such as RNA-directed DNA methylation and post-transcriptional gene silencing, as both cytosine methylation and small RNA abundance do not correlate with KLS. Reference
COMPASS for rapid combinatorial optimization of biochemical pathways based on artificial transcription factors
Balanced expression of multiple genes is central for establishing new biosynthetic pathways or multiprotein cellular complexes. Methods for efficient combinatorial assembly of regulatory sequences (promoters) and protein coding sequences are therefore in high demand.
Here, we report a high-throughput cloning method, called COMPASS for COMbinatorial Pathway ASSembly, for the balanced expression of multiple genes in Saccharomyces cerevisiae. COMPASS employs orthogonal, plant-derived artificial transcription factors (ATFs) and homologous recombination-based cloning for the generation of thousands of individual DNA constructs in parallel. Reference
Codon usage optimization in pluripotent embryonic stem cells
The uneven use of synonymous codons in the transcriptome regulates the efficiency and fidelity of protein translation rates. Yet, the importance of this codon bias in regulating cell state-specific expression programmes is currently debated.
Here, we ask whether different codon usage controls gene expression programmes in self-renewing and differentiating embryonic stem cells. Using ribosome and transcriptome profiling, we identify distinct codon signatures during human embryonic stem cell differentiation. We find that cell state-specific codon bias is determined by the guanine-cytosine (GC) content of differentially expressed genes. Reference
A high-density BAC physical map covering the entire MHC region of addax antelope genome
The mammalian major histocompatibility complex (MHC) harbours clusters of genes associated with the immunological defence of animals against infectious pathogens. At present, no complete MHC physical map is available for any of the wild ruminant species in the world.
The high-density physical map is composed of two contigs of 47 overlapping bacterial artificial chromosome (BAC) clones, with an average of 115 kb for each BAC, covering the entire addax MHC genome. The first contig has 40 overlapping BAC clones covering an approximately 2.9 Mb region of MHC class I, class III, and class IIa, and the second contig has 7 BAC clones covering an approximately 500 kb genomic region that harbours MHC class IIb. Reference
Comprehensive study of nuclear receptor DNA binding provides a revised framework for understanding receptor specificity
The type II nuclear receptors (NRs) function as heterodimeric transcription factors with the retinoid X receptor (RXR) to regulate diverse biological processes in response to endogenous ligands and therapeutic drugs. DNA-binding specificity has been proposed as a primary mechanism for NR gene regulatory specificity.
Here we use protein-binding microarrays (PBMs) to comprehensively analyze the DNA binding of 12 NR:RXRα dimers. We find more promiscuous NR-DNA binding than has been reported, challenging the view that NR binding specificity is defined by half-site spacing. We show that NRs bind DNA using two distinct modes, explaining widespread NR binding to half-sites in vivo. Finally, we show that the current models of NR specificity better reflect binding-site activity rather than binding-site affinity. Reference
Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer
In most cases of sporadic colorectal cancers, tumorigenesis is a multistep process, involving genomic alterations in parallel with morphologic changes. In addition, accumulating evidence suggests that the human gut microbiome is linked to the development of colorectal cancer.
Here we performed fecal metagenomic and metabolomic studies on samples from a large cohort of 616 participants who underwent colonoscopy to assess taxonomic and functional characteristics of gut microbiota and metabolites. Microbiome and metabolome shifts were apparent in cases of multiple polypoid adenomas and intramucosal carcinomas, in addition to more advanced lesions. We found two distinct patterns of microbiome elevations. Reference
The Genomic and Immune Landscapes of Lethal Metastatic Breast Cancer
The detailed molecular characterization of lethal cancers is a prerequisite to understanding resistance to therapy and escape from cancer immunoediting. We performed extensive multi-platform profiling of multi-regional metastases in autopsies from 10 patients with therapy-resistant breast cancer.
The integrated genomic and immune landscapes show that metastases propagate and evolve as communities of clones, reveal their predicted neo-antigen landscapes, and show that they can accumulate HLA loss of heterozygosity (LOH). The data further identify variable tumor microenvironments and reveal, through analyses of T cell receptor repertoires, that adaptive immune responses appear to co-evolve with the metastatic genomes. These findings reveal in fine detail the landscapes of lethal metastatic breast cancer. Reference
Diabetes causes marked inhibition of mitochondrial metabolism in pancreatic β-cells
Diabetes is a global health problem caused primarily by the inability of pancreatic β-cells to secrete adequate levels of insulin. The molecular mechanisms underlying the progressive failure of β-cells to respond to glucose in type-2 diabetes remain unresolved.
Using a combination of transcriptomics and proteomics, we find significant dysregulation of major metabolic pathways in islets of diabetic βV59M mice, a non-obese, eulipidaemic diabetes model. Multiple genes/proteins involved in glycolysis/gluconeogenesis are upregulated, whereas those involved in oxidative phosphorylation are downregulated. In isolated islets, glucose-induced increases in NADH and ATP are impaired and both oxidative and glycolytic glucose metabolism are reduced. Reference
Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data
DNA base modifications, such as C5-methylcytosine (5mC) and N6-methyldeoxyadenosine (6mA), are important types of epigenetic regulation. Short-read bisulfite sequencing and long-read PacBio sequencing have inherent limitations in detecting DNA modifications.
Here, using raw electric signals of Oxford Nanopore long-read sequencing data, we design DeepMod, a bidirectional recurrent neural network (RNN) with long short-term memory (LSTM) to detect DNA modifications. We sequence a human genome HX1 and a Chlamydomonas reinhardtii genome using Nanopore sequencing, and then evaluate DeepMod on three types of genomes (Escherichia coli, Chlamydomonas reinhardtii and human genomes). Reference
A practical guide to methods controlling false discoveries in computational biology
In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as informative covariates to prioritize, weight, and group hypotheses.
However, there is currently no consensus on how the modern methods compare to one another. We investigate the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology. Reference
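As a concrete illustration of a classic, p-value-only FDR procedure, here is a minimal Python sketch of the Benjamini-Hochberg step-up method. The abstract does not name the methods benchmarked, so the choice of Benjamini-Hochberg is an assumption made for illustration, and the function name is hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return one boolean per hypothesis: True if rejected at FDR level alpha."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, ordered from smallest to largest p value
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


decisions = benjamini_hochberg([0.01, 0.04, 0.03, 0.5], alpha=0.05)
```

The procedure uses only the ranks of the p values. Covariate-aware methods of the kind the abstract calls "modern" differ by letting additional per-hypothesis information change how hypotheses are weighted or grouped.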
Comprehensively benchmarking applications for detecting copy number variation
Recently, copy number variation (CNV) has gained considerable interest as a type of genomic variation that plays an important role in complex phenotypes and disease susceptibility. Since a number of CNV detection methods have recently been developed, it is necessary to help investigators choose suitable methods for CNV detection depending on their objectives.
For this reason, this study compared ten commonly used CNV detection applications, including CNVnator, ReadDepth, RDXplorer, LUMPY and Control-FREEC, benchmarking the applications by sensitivity, specificity and computational demands. Taking the DGV gold standard variants as a standard dataset, we evaluated the ten applications with real sequencing data at sequencing depths from 5X to 50X. Among the ten methods benchmarked, LUMPY performs the best for both high sensitivity and specificity at each sequencing depth. For the purpose of high specificity, Canvas is also a good choice. If high sensitivity is preferred, CNVnator and RDXplorer are better choices. Reference
Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments
Single cell RNA-sequencing (scRNA-seq) technology has undergone rapid development in recent years, leading to an explosion in the number of tailored data analysis methods. However, the current lack of gold-standard benchmark datasets makes it difficult for researchers to systematically compare the performance of the many methods available.
Here, we generated a realistic benchmark experiment that included single cells and admixtures of cells or RNA to create ‘pseudo cells’ from up to five distinct cancer cell lines. In total, 14 datasets were generated using both droplet and plate-based scRNA-seq protocols. We compared 3,913 combinations of data analysis methods for tasks ranging from normalization and imputation to clustering, trajectory analysis and data integration. Reference
Associating somatic mutations to clinical outcomes: a pan-cancer study of survival time
We developed subclone multiplicity allocation and somatic heterogeneity (SMASH), a new statistical method for intra-tumor heterogeneity (ITH) inference. SMASH is tailored to the purpose of large-scale association studies with one tumor sample per patient.
In a pan-cancer study of 14 cancer types, we studied the associations between survival time and ITH quantified by SMASH, together with other features of somatic mutations. Our results show that ITH is associated with survival time in several cancer types and its effect can be modified by other covariates, such as mutation burden. Reference
A few Ascomycota taxa dominate soil fungal communities worldwide
Despite having key functions in terrestrial ecosystems, information on the dominant soil fungi and their ecological preferences at the global scale is lacking. To fill this knowledge gap, we surveyed 235 soils from across the globe. Our findings indicate that 83 phylotypes (<0.1% of the retrieved fungi), mostly belonging to wind dispersed, generalist Ascomycota, dominate soils globally.
We identify patterns and ecological drivers of dominant soil fungal taxa occurrence, and present a map of their distribution in soils worldwide. Whole-genome comparisons with less dominant, generalist fungi point to a significantly higher number of genes related to stress-tolerance and resource uptake in the dominant fungi, suggesting that they might be better at colonising a wide range of environments. Reference
WhoGEM: an admixture-based prediction machine accurately predicts quantitative functional traits in plants
The explosive growth of genomic data provides an opportunity to make increased use of sequence variations for phenotype prediction. We have developed a prediction machine for quantitative phenotypes (WhoGEM) that overcomes some of the bottlenecks limiting the current methods.
We demonstrated its performance by predicting quantitative disease resistance and quantitative functional traits in the wild model plant species, Medicago truncatula, using geographical locations as covariates for admixture analysis. The method’s prediction reliability equals or outperforms all existing algorithms for quantitative phenotype prediction. WhoGEM analysis produces evidence that variation in genome admixture proportions explains most of the phenotypic variation for quantitative phenotypes. Reference
OSCA: a tool for omic-data-based complex trait analysis
The rapid increase of omic data has greatly facilitated the investigation of associations between omic profiles such as DNA methylation (DNAm) and complex traits in large cohorts.
Here, we propose a mixed-linear-model-based method called MOMENT that tests for association between a DNAm probe and trait with all other distal probes fitted in multiple random-effect components to account for unobserved confounders. We demonstrate by simulations that MOMENT shows a lower false positive rate and more robustness than existing methods. MOMENT has been implemented in a versatile software package called OSCA together with a number of other implementations for omic-data-based analyses. Reference
qDSB-Seq is a general method for genome-wide quantification of DNA double-strand breaks using sequencing
DNA double-strand breaks (DSBs) are among the most lethal types of DNA damage and frequently cause genome instability. Sequencing-based methods for mapping DSBs have been developed but they allow measurement only of relative frequencies of DSBs between loci, which limits our understanding of the physiological relevance of detected DSBs.
Here we propose quantitative DSB sequencing (qDSB-Seq), a method providing both DSB frequencies per cell and their precise genomic coordinates. We induce spike-in DSBs by a site-specific endonuclease and use them to quantify detected DSBs (labeled, e.g., using i-BLESS). Utilizing qDSB-Seq, we determine numbers of DSBs induced by a radiomimetic drug and replication stress, and reveal two orders of magnitude differences in DSB frequencies. Reference
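The spike-in idea can be sketched as a simple scaling calculation. The sketch below assumes that the fraction of cells cut at the spike-in site is known from an independent measurement and that read counts scale linearly with DSB frequency. Both are assumptions made for illustration, the function and parameter names are hypothetical, and the published method may differ in detail.

```python
def dsbs_per_cell(reads_at_locus: float,
                  reads_at_spike_in: float,
                  spike_in_cut_fraction: float) -> float:
    """Convert a relative read count into an absolute DSB frequency per cell.

    reads_at_locus: labeled DSB reads at the locus of interest
    reads_at_spike_in: labeled DSB reads at the endonuclease (spike-in) site
    spike_in_cut_fraction: assumed fraction of cells cut at the spike-in site (0 < f <= 1)
    """
    if reads_at_spike_in <= 0:
        raise ValueError("spike-in read count must be positive")
    if not 0 < spike_in_cut_fraction <= 1:
        raise ValueError("cut fraction must be in (0, 1]")
    # The spike-in site carries f DSBs per cell and yields reads_at_spike_in reads,
    # so one DSB per cell corresponds to reads_at_spike_in / f reads.
    reads_per_dsb_per_cell = reads_at_spike_in / spike_in_cut_fraction
    return reads_at_locus / reads_per_dsb_per_cell


frequency = dsbs_per_cell(reads_at_locus=50, reads_at_spike_in=1000,
                          spike_in_cut_fraction=0.2)
```

With these illustrative numbers, a locus with 50 reads corresponds to 0.01 DSBs per cell, that is, a break in roughly 1% of cells.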
Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data
We introduce quanTIseq, a method to quantify the fractions of ten immune cell types from bulk RNA-sequencing data. quanTIseq was extensively validated in blood and tumor samples using simulated, flow cytometry, and immunohistochemistry data.
quanTIseq analysis of 8000 tumor samples revealed that cytotoxic T cell infiltration is more strongly associated with the activation of the CXCR3/CXCL9 axis than with mutational load and that deconvolution-based cell scores have prognostic value in several solid cancers. Finally, we used quanTIseq to show how kinase inhibitors modulate the immune contexture and to reveal immune-cell types that underlie differential patients’ responses to checkpoint blockers. Reference
ChiCMaxima: a robust and simple pipeline for detection and visualization of chromatin looping in Capture Hi-C
Capture Hi-C (CHi-C) is a new technique for assessing genome organization based on chromosome conformation capture coupled to oligonucleotide capture of regions of interest, such as gene promoters.
Chromatin loop detection is challenging because existing Hi-C/4C-like tools, which make different assumptions about the technical biases present, are often unsuitable. We describe a new approach, ChiCMaxima, which uses local maxima combined with limited filtering to detect DNA looping interactions, integrating information from biological replicates. ChiCMaxima shows more stringency and robustness compared to previously developed tools. Reference
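The general idea of calling interactions as local maxima and then requiring agreement between replicates can be sketched in Python as follows. This is an illustrative simplification, not the ChiCMaxima implementation: the window size, the replicate tolerance, and the function names are all hypothetical, and no smoothing or filtering step is included.

```python
from typing import List, Sequence


def local_maxima(values: Sequence[float], window: int) -> List[int]:
    """Indices whose value is strictly greater than every neighbour within `window` positions."""
    peaks = []
    n = len(values)
    for i in range(n):
        lo = max(0, i - window)
        hi = min(n, i + window + 1)
        neighbours = list(values[lo:i]) + list(values[i + 1:hi])
        if neighbours and all(values[i] > v for v in neighbours):
            peaks.append(i)
    return peaks


def shared_peaks(peaks_a: Sequence[int], peaks_b: Sequence[int], tolerance: int) -> List[int]:
    """Peaks from replicate A that have a peak in replicate B within `tolerance` positions."""
    return [p for p in peaks_a if any(abs(p - q) <= tolerance for q in peaks_b)]


# Toy interaction profiles for one bait in two replicates (read counts per fragment)
replicate_1 = [1, 2, 8, 2, 1, 1, 3, 9, 3, 1]
replicate_2 = [1, 2, 3, 8, 2, 1, 3, 9, 3, 1]
peaks_1 = local_maxima(replicate_1, window=2)   # [2, 7]
peaks_2 = local_maxima(replicate_2, window=2)   # [3, 7]
reproducible = shared_peaks(peaks_1, peaks_2, tolerance=1)  # [2, 7]
```

Allowing a small positional tolerance keeps peaks that shift by one fragment between replicates. Setting the tolerance to zero would retain only the peak at index 7 in this toy example.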
Genome-scale screens identify JNK–JUN signaling as a barrier for pluripotency exit and endoderm differentiation
Human embryonic stem cells (ESCs) and human induced pluripotent stem cells hold great promise for cell-based therapies and drug discovery. However, homogeneous differentiation remains a major challenge, highlighting the need for understanding developmental mechanisms.
We performed genome-scale CRISPR screens to uncover regulators of definitive endoderm (DE) differentiation, which unexpectedly uncovered five Jun N-terminal kinase (JNK)–JUN family genes as key barriers of DE differentiation. The JNK–JUN pathway does not act through directly inhibiting the DE enhancers. Instead, JUN co-occupies ESC enhancers with OCT4, NANOG, SMAD2 and SMAD3, and specifically inhibits the exit from the pluripotent state by impeding the decommissioning of ESC enhancers and inhibiting the reconfiguration of SMAD2 and SMAD3 chromatin binding from ESC to DE enhancers. Reference
Transcriptional cofactors display specificity for distinct types of core promoters
Transcriptional cofactors (COFs) communicate regulatory cues from enhancers to promoters and are central effectors of transcription activation and gene expression.
Although some COFs have been shown to prefer certain promoter types over others, the extent to which different COFs display intrinsic specificities for distinct promoters is unclear. Here we use a high-throughput promoter-activity assay in Drosophila melanogaster S2 cells to screen 23 COFs for their ability to activate 72,000 candidate core promoters (CPs). We observe differential activation of CPs, indicating distinct regulatory preferences or ‘compatibilities’ between COFs and specific types of CPs. Reference