GIVE: portable genome browsers for personal websites
Growing popularity and diversity of genomic data demand portable and versatile genome browsers.
Here, we present an open source programming library called GIVE that facilitates the creation of personalized genome browsers without requiring a system administrator. By inserting HTML tags, one can add to a personal webpage interactive visualization of multiple types of genomics data, including genome annotation, “linear” quantitative data, and genome interaction data. GIVE includes a graphical interface called HUG (HTML Universal Generator) that automatically generates HTML code for displaying user chosen data, which can be copy-pasted into user’s personal website or saved and shared with collaborators. Reference
Genomic inference of the metabolism and evolution of the archaeal phylum Aigarchaeota
Microbes of the phylum Aigarchaeota are widely distributed in geothermal environments, but their physiological and ecological roles are poorly understood.
Here we analyze six Aigarchaeota metagenomic bins from two circumneutral hot springs in Tengchong, China, to reveal that they are either strict or facultative anaerobes, and most are chemolithotrophs that can perform sulfide oxidation. Applying comparative genomics to the Thaumarchaeota and Aigarchaeota, we find that they both originated from thermal habitats, sharing 1154 genes with their common ancestor. Reference
Identification of RNA-binding protein targets with HyperTRIBE
RNA-binding proteins (RBPs) accompany RNA from birth to death, affecting RNA biogenesis and functions. Identifying RBP–RNA interactions is essential to understanding their complex roles in different cellular processes.
However, detecting in vivo RNA targets of RBPs, especially in a small number of discrete cells, has been a technically challenging task. We previously developed a novel technique called TRIBE (targets of RNA-binding proteins identified by editing) to overcome this problem. TRIBE expresses a fusion protein consisting of a queried RBP and the catalytic domain of the RNA-editing enzyme ADAR (adenosine deaminase acting on RNA) (ADARcd), which marks target RNA transcripts by converting adenosine to inosine near the RBP binding sites. These marks can be subsequently identified via high-throughput sequencing. Reference
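As an illustration of how such editing marks might be read out of sequencing data, here is a minimal sketch. It relies on the fact that inosine is read as guanosine during sequencing, so an edited site appears as an A-to-G mismatch against the reference. The function name, input layout, and both thresholds are hypothetical choices for this example and are not taken from the TRIBE or HyperTRIBE pipeline.

```python
from typing import Dict, List, Tuple


def call_editing_sites(
    ref_bases: Dict[int, str],
    base_counts: Dict[int, Dict[str, int]],
    min_coverage: int = 20,          # hypothetical cutoff, for illustration only
    min_edit_fraction: float = 0.1,  # hypothetical cutoff, for illustration only
) -> List[Tuple[int, float]]:
    """Return (position, editing fraction) for candidate A-to-I sites.

    Inosine is sequenced as G, so editing shows up as G reads at reference A.
    """
    sites = []
    for pos in sorted(base_counts):
        if ref_bases.get(pos) != "A":
            continue  # only reference adenosines can be A-to-I edited
        counts = base_counts[pos]
        a_reads = counts.get("A", 0)
        g_reads = counts.get("G", 0)
        coverage = a_reads + g_reads
        if coverage < min_coverage:
            continue  # too few reads to estimate an editing fraction
        fraction = g_reads / coverage
        if fraction >= min_edit_fraction:
            sites.append((pos, fraction))
    return sites


# Toy input: position -> reference base, and position -> observed read counts.
ref = {100: "A", 101: "C", 102: "A", 103: "A"}
counts = {
    100: {"A": 40, "G": 10},  # well covered, 20% G reads
    101: {"C": 50, "G": 5},   # reference is not A
    102: {"A": 49, "G": 1},   # editing fraction below threshold
    103: {"A": 5, "G": 5},    # coverage below threshold
}
edited = call_editing_sites(ref, counts)
```

In practice one would also compare against a control without the fusion protein to exclude genomic SNPs and endogenous editing; that step is omitted here.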
Integrated genetic and epigenetic analysis of myxofibrosarcoma
Myxofibrosarcoma (MFS) is a common adult soft tissue sarcoma characterized by an infiltrative growth pattern and a high local recurrence rate.
Here we report the genetic and epigenetic landscape of MFS based on the results of whole-exome sequencing (N = 41), RNA sequencing (N = 29), and methylation analysis (N = 41), using 41 MFSs as a discovery set, and subsequent targeted sequencing of 140 genes in the entire cohort of 99 MFSs and 17 MFSs’ data from TCGA. Fourteen driver genes are identified, including potentially actionable therapeutic targets seen in 37% of cases. Reference
Population genomics of hypervirulent Klebsiella pneumoniae clonal-group 23 reveals early emergence and rapid global dissemination
Severe liver abscess infections caused by hypervirulent clonal-group CG23 Klebsiella pneumoniae have been increasingly reported since the mid-1980s. Strains typically possess several virulence factors including an integrative, conjugative element ICEKp encoding the siderophore yersiniabactin and genotoxin colibactin.
Here we investigate CG23’s evolutionary history, showing several deep-branching sublineages associated with distinct ICEKp acquisitions. Over 80% of liver abscess isolates belong to sublineage CG23-I, which emerged in ~1928 following acquisition of ICEKp10 (encoding yersiniabactin and colibactin), and then disseminated globally within the human population. CG23-I’s distinguishing feature is the colibactin synthesis locus, which reportedly promotes gut colonisation and metastatic infection in murine models. Reference
Modelling how responsiveness to interferon improves interferon-free treatment of hepatitis C virus infection
Direct-acting antiviral agents (DAAs) for hepatitis C treatment tend to fare better in individuals who are also likely to respond well to interferon-alpha (IFN), a surprising correlation given that DAAs target specific viral proteins whereas IFN triggers a generic antiviral immune response. Here, we posit a causal relationship between IFN-responsiveness and DAA treatment outcome. IFN-responsiveness restricts viral replication, which would prevent the growth of viral variants resistant to DAAs and improve treatment outcome.
To test this hypothesis, we developed a multiscale mathematical model integrating IFN-responsiveness at the cellular level, viral kinetics and evolution leading to drug resistance at the individual level, and treatment outcome at the population level. Model predictions quantitatively captured data from over 50 clinical trials demonstrating poorer response to DAAs in previous non-responders to IFN than treatment-naïve individuals, presenting strong evidence supporting the hypothesis. Reference
Large-scale gene losses underlie the genome evolution of parasitic plant Cuscuta australis
Dodders (Cuscuta spp., Convolvulaceae) are root- and leafless parasitic plants. The physiology, ecology, and evolution of these obligate parasites are poorly understood. A high-quality reference genome of Cuscuta australis was assembled.
Our analyses reveal that Cuscuta experienced accelerated molecular evolution, and Cuscuta and the convolvulaceous morning glory (Ipomoea) shared a common whole-genome triplication event before their divergence. The C. australis genome harbors 19,671 protein-coding genes, and importantly, 11.7% of the conserved orthologs in autotrophic plants are lost in C. australis. Many of these gene loss events likely result from its parasitic lifestyle and the massive changes of its body plan. Reference
Prioritization and functional assessment of noncoding variants associated with complex diseases
Unraveling functional noncoding variants associated with complex diseases is still a great challenge.
We present a novel algorithm, Prioritization And Functional Assessment (PAFA), that prioritizes and assesses the functionality of genetic variants by introducing population differentiation measures and recalibrating training variants. Comprehensive evaluations demonstrate that PAFA exhibits much higher sensitivity and specificity in prioritizing noncoding risk variants than existing methods. PAFA achieves improved performance in distinguishing both common and rare recurrent variants from non-recurrent variants by integrating multiple annotations and metrics. Reference
Integrative genomic analysis of adult mixed phenotype acute leukemia delineates lineage associated molecular subtypes
Mixed phenotype acute leukemia (MPAL) is a rare subtype of acute leukemia characterized by leukemic blasts presenting myeloid and lymphoid markers.
Here we report data from integrated genomic analysis on 31 MPAL samples and compare molecular profiling with that from acute myeloid leukemia (AML), B cell acute lymphoblastic leukemia (B-ALL), and T cell acute lymphoblastic leukemia (T-ALL). Consistent with the mixed immunophenotype, both AML-type and ALL-type mutations are detected in MPAL. Myeloid-B and myeloid-T MPAL show distinct mutation and methylation signatures that are associated with differences in lineage-commitment gene expressions. Reference
Horizontal transfer of BovB and L1 retrotransposons in eukaryotes
Transposable elements (TEs) are mobile DNA sequences, colloquially known as jumping genes because of their ability to replicate to new genomic locations. TEs can jump between organisms or species when given a vector of transfer, such as a tick or virus, in a process known as horizontal transfer.
Here, we propose that LINE-1 (L1) and Bovine-B (BovB), the two most abundant TE families in mammals, were initially introduced as foreign DNA via ancient horizontal transfer events. Using analyses of 759 plant, fungal and animal genomes, we identify multiple possible L1 horizontal transfer events in eukaryotic species, primarily involving Tx-like L1s in marine eukaryotes. Reference
COBRAme: A computational framework for genome-scale models of metabolism and gene expression
Genome-scale models of metabolism and macromolecular expression (ME-models) explicitly compute the optimal proteome composition of a growing cell. ME-models expand upon the well-established genome-scale models of metabolism (M-models), and they enable a new fundamental understanding of cellular growth.
ME-models have increased predictive capabilities and accuracy due to their inclusion of the biosynthetic costs for the machinery of life, but they come with a significant increase in model size and complexity. This challenge results in models which are both difficult to compute and challenging to understand conceptually. As a result, ME-models exist for only two organisms (Escherichia coli and Thermotoga maritima) and are still used by relatively few researchers. To address these challenges, we have developed a new software framework called COBRAme for building and simulating ME-models. Reference
A large-scale WGS analysis reveals highly specific genome editing by both Cas9 and Cpf1 (Cas12a) nucleases in rice
Targeting specificity has been a barrier to applying genome editing systems in functional genomics, precision medicine and plant breeding.
We conduct a WGS analysis of 34 plants edited by Cas9 and 15 plants edited by Cpf1 in T0 and T1 generations along with 20 diverse control plants in rice. The sequencing depths range from 45× to 105× with read mapping rates above 96%. Our results clearly show that most mutations in edited plants are created by the tissue culture process, which causes approximately 102 to 148 single nucleotide variations (SNVs) and approximately 32 to 83 insertions/deletions (indels) per plant. Reference
Deep coverage whole genome sequences and plasma lipoprotein(a) in individuals of European and African ancestries
Lipoprotein(a), Lp(a), is a modified low-density lipoprotein particle that contains apolipoprotein(a), encoded by LPA, and is a highly heritable, causal risk factor for cardiovascular diseases that varies in concentrations across ancestries.
Here, we use deep-coverage whole genome sequencing in 8392 individuals of European and African ancestry to discover and interpret both single-nucleotide variants and copy number (CN) variation associated with Lp(a). We observe that Europeans and Africans have several unique genetic determinants. The common variant rs12740374 associated with Lp(a) cholesterol is an eQTL for SORT1 and independent of LDL cholesterol. Reference
Adaptation and conservation insights from the koala genome
The koala, the only extant species of the marsupial family Phascolarctidae, is classified as ‘vulnerable’ due to habitat loss and widespread disease. We sequenced the koala genome, producing a complete and contiguous marsupial reference genome, including centromeres.
We reveal that the koala’s ability to detoxify eucalypt foliage may be due to expansions within a cytochrome P450 gene family, and its ability to smell, taste and moderate ingestion of plant secondary metabolites may be due to expansions in the vomeronasal and taste receptors. We characterized novel lactation proteins that protect young in the pouch and annotated immune genes important for response to chlamydial disease. Reference
Modelling genotypes in their physical microenvironment to predict single- and multi-cellular behaviour
A cell’s phenotype is the set of observable characteristics resulting from the interaction of the genotype with the surrounding environment, determining cell behaviour.
Deciphering genotype-phenotype relationships has been crucial to understanding normal and disease biology. Analysis of molecular pathways has provided an invaluable tool for such understanding; however, it has typically lacked a component describing the physical context, which is a key determinant of phenotype. In this study, we present a novel modelling framework that enables the study of the link between genotype, signalling networks and cell behaviour in a 3D physical environment. To achieve this we bring together Agent Based Modelling, a powerful computational modelling technique, and gene networks. Reference
INSaFLU: an automated open web-based bioinformatics suite “from-reads” for influenza
A new era of flu surveillance has already started based on the genetic characterization and exploration of influenza virus evolution at whole-genome scale.
We developed and implemented INSaFLU (“INSide the FLU”), which is the first influenza-oriented bioinformatics free web-based suite that deals with primary NGS data (reads) towards the automatic generation of the output data that are actually the core first-line “genetic requests” for effective and timely influenza laboratory surveillance (e.g., type and sub-type, gene and whole-genome consensus sequences, variants’ annotation, alignments and phylogenetic trees). Reference
Transcriptional synergy as an emergent property defining cell subpopulation identity enables population shift
Single-cell RNA sequencing allows defining molecularly distinct cell subpopulations. However, the identification of specific sets of transcription factors (TFs) that define the identity of these subpopulations remains a challenge.
Here we propose that subpopulation identity emerges from the synergistic activity of multiple TFs. Based on this concept, we develop a computational platform (TransSyn) for identifying synergistic transcriptional cores that determine cell subpopulation identities. TransSyn leverages single-cell RNA-seq data, and performs a dynamic search for an optimal synergistic transcriptional core using an information theoretic measure of synergy. Reference
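To make the idea of an information-theoretic synergy measure concrete, here is a minimal sketch using one common definition: the information two variables jointly carry about a target, minus the information each carries alone. This is an assumption made for illustration; the abstract does not state which measure TransSyn uses, and all function names below are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def entropy(samples: Sequence[Hashable]) -> float:
    """Shannon entropy (bits) of the empirical distribution of the samples."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def synergy(x1: Sequence[Hashable], x2: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """I(X1,X2;Y) - I(X1;Y) - I(X2;Y); positive when the pair is more
    informative about Y than the two variables considered separately."""
    joint = list(zip(x1, x2))
    return (
        mutual_information(joint, y)
        - mutual_information(x1, y)
        - mutual_information(x2, y)
    )


# Toy example: two binarized TF activities whose XOR defines the cell state.
tf_a = [0, 0, 1, 1]
tf_b = [0, 1, 0, 1]
state = [0, 1, 1, 0]
xor_synergy = synergy(tf_a, tf_b, state)

# Contrast: the state simply copies tf_a, so tf_b adds nothing.
copy_synergy = synergy(tf_a, tf_b, tf_a)
```

In the XOR case neither factor alone says anything about the state, yet together they determine it, which is the kind of emergent behaviour the synergy concept is meant to capture. Real expression data would need discretization and many more cells for the entropy estimates to be reliable.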
HmmUFOtu: An HMM and phylogenetic placement based ultra-fast taxonomic assignment and OTU picking tool for microbiome amplicon sequencing studies
Culture-independent analysis of microbial communities frequently relies on amplification and sequencing of the prokaryotic 16S ribosomal RNA gene. Typical analysis pipelines group sequences into operational taxonomic units (OTUs) to infer taxonomic and phylogenetic relationships.
Here, we present HmmUFOtu, a novel tool for processing microbiome amplicon sequencing data, which performs rapid per-read phylogenetic placement, followed by phylogenetically informed clustering into OTUs and taxonomy assignment. Compared to standard pipelines, HmmUFOtu more accurately and reliably recapitulates microbial community diversity and composition in simulated and real datasets without relying on heuristics or sacrificing speed or accuracy. Reference
Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait-associated variation
We apply integrative approaches to expression quantitative trait loci (eQTLs) from 44 tissues from the Genotype-Tissue Expression project and genome-wide association study data. About 60% of known trait-associated loci are in linkage disequilibrium with a cis-eQTL, over half of which were not found in previous large-scale whole blood studies.
Applying polygenic analyses to metabolic, cardiovascular, anthropometric, autoimmune, and neurodegenerative traits, we find that eQTLs are significantly enriched for trait associations in relevant pathogenic tissues and explain a substantial proportion of the heritability (40–80%). For most traits, tissue-shared eQTLs underlie a greater proportion of trait associations, although tissue-specific eQTLs have a greater contribution to some traits, such as blood pressure. Reference
Molecular phenomics and metagenomics of hepatic steatosis in non-diabetic obese women
Hepatic steatosis is a multifactorial condition that is often observed in obese patients and is a prelude to non-alcoholic fatty liver disease.
Here, we combine shotgun sequencing of fecal metagenomes with molecular phenomics (hepatic transcriptome and plasma and urine metabolomes) in two well-characterized cohorts of morbidly obese women recruited to the FLORINASH study. We reveal molecular networks linking the gut microbiome and the host phenome to hepatic steatosis. Patients with steatosis have low microbial gene richness and increased genetic potential for the processing of dietary lipids and endotoxin biosynthesis (notably from Proteobacteria), hepatic inflammation and dysregulation of aromatic and branched-chain amino acid metabolism. Reference
Therapy-induced stress response is associated with downregulation of pre-mRNA splicing in cancer cells
Abnormal pre-mRNA splicing regulation is common in cancer, but the effects of chemotherapy on this process remain unclear.
To evaluate the effect of chemotherapy on splicing regulation, we performed meta-analyses of previously published transcriptomic, proteomic, phosphoproteomic, and secretome datasets. Our findings were verified by LC-MS/MS, western blotting, immunofluorescence, and FACS analyses of multiple cancer cell lines treated with cisplatin and pladienolide B. Our results revealed that different types of chemotherapy lead to similar changes in alternative splicing by inducing intron retention in multiple genes. Reference
Meta-analysis of GWAS for neuroticism in 449,484 individuals identifies novel genetic loci and pathways
Neuroticism is an important risk factor for psychiatric traits, including depression, anxiety, and schizophrenia. At the time of analysis, previous genome-wide association studies (GWAS) reported 16 genomic loci associated with neuroticism.
Here we conducted a large GWAS meta-analysis (n = 449,484) of neuroticism and identified 136 independent genome-wide significant loci (124 new at the time of analysis), which implicate 599 genes. Functional follow-up analyses showed enrichment in several brain regions and involvement of specific cell types, including dopaminergic neuroblasts (P = 3.49 × 10⁻⁸), medium spiny neurons (P = 4.23 × 10⁻⁸), and serotonergic neurons (P = 1.37 × 10⁻⁷). Gene set analyses implicated three specific pathways: neurogenesis (P = 4.43 × 10⁻⁹), behavioral response to cocaine processes (P = 1.84 × 10⁻⁷), and axon part (P = 5.26 × 10⁻⁸). Reference
DeepCRISPR: optimized CRISPR guide RNA design by deep learning
A major challenge for effective application of CRISPR systems is to accurately predict the single guide RNA (sgRNA) on-target knockout efficacy and off-target profile, which would facilitate the optimized design of sgRNAs with high sensitivity and specificity.
Here we present DeepCRISPR, a comprehensive computational platform to unify sgRNA on-target and off-target site prediction into one framework with deep learning, surpassing available state-of-the-art in silico tools. Reference
dropEst: pipeline for accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments
Recent single-cell RNA-seq protocols based on droplet microfluidics use massively multiplexed barcoding to enable simultaneous measurements of transcriptomes for thousands of individual cells. The increasing complexity of such data creates challenges for subsequent computational processing and troubleshooting of these experiments, with few software options currently available.
Here, we describe a flexible pipeline for processing droplet-based transcriptome data that implements barcode corrections, classification of cell quality, and diagnostic information about the droplet libraries. We introduce advanced methods for correcting composition bias and sequencing errors affecting cellular and molecular barcodes to provide more accurate estimates of molecular counts in individual cells. Reference
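As a rough illustration of what correcting sequencing errors in barcodes can involve, the sketch below merges a rare barcode into a much more abundant barcode that differs from it by a single base. This is a deliberately simplified stand-in, not the dropEst method, which the abstract describes only as "advanced methods"; the function names and the `min_ratio` parameter are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_barcodes(barcodes: Iterable[str], min_ratio: float = 5.0) -> Dict[str, str]:
    """Map each observed barcode to itself or to a likely error-free parent.

    A barcode is merged into an already accepted barcode when the two differ
    by exactly one base and the parent is at least `min_ratio` times more
    abundant (a hypothetical heuristic chosen for this example).
    """
    counts = Counter(barcodes)
    # Visit abundant barcodes first; break ties alphabetically for determinism.
    ordered = sorted(counts, key=lambda bc: (-counts[bc], bc))
    accepted: List[str] = []
    mapping: Dict[str, str] = {}
    for bc in ordered:
        target = bc
        for parent in accepted:
            if (
                len(parent) == len(bc)
                and hamming(parent, bc) == 1
                and counts[parent] >= min_ratio * counts[bc]
            ):
                target = parent
                break
        if target == bc:
            accepted.append(bc)  # keep as a distinct barcode
        mapping[bc] = target
    return mapping


# Toy data: "ACGA" looks like a sequencing error of the abundant "ACGT",
# while "TTTA" is too abundant relative to "TTTT" to be merged.
observed = ["ACGT"] * 10 + ["ACGA"] * 1 + ["TTTT"] * 4 + ["TTTA"] * 3
mapping = correct_barcodes(observed)
corrected_counts = Counter(mapping[bc] for bc in observed)
```

The abundance ratio guards against collapsing two genuinely distinct barcodes that happen to be one mismatch apart, at the cost of leaving some true errors uncorrected.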