Architecture of gene regulatory networks controlling flower development in Arabidopsis thaliana
Floral homeotic transcription factors (TFs) act in a combinatorial manner to specify the organ identities in the flower. However, the architecture and the function of the gene regulatory network (GRN) controlling floral organ specification are still poorly understood.
In particular, the interconnections of homeotic TFs, microRNAs (miRNAs) and other factors controlling organ initiation and growth have not been studied systematically so far. Here, using a combination of genome-wide TF binding, mRNA and miRNA expression data, we reconstruct the dynamic GRN controlling floral meristem development and organ differentiation. We identify prevalent feed-forward loops (FFLs) mediated by floral homeotic TFs and miRNAs that regulate common targets. Experimental validation of a coherent FFL shows that petal size is controlled by the SEPALLATA3-regulated miR319/TCP4 module. We further show that combinatorial DNA-binding of homeotic factors and selected other TFs is predictive of organ-specific patterns of gene expression. Reference
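To make the feed-forward loop (FFL) motif mentioned above concrete, the following is a minimal sketch of how FFLs could be enumerated in a signed regulatory network and labelled coherent or incoherent. The node names and edge signs are hypothetical placeholders, not results from the study, and the function `find_ffls` is an illustrative helper rather than the authors' method. An FFL is taken here to be a triple where X regulates Y, Y regulates Z, and X also regulates Z directly; it is called coherent when the sign of the indirect path matches the sign of the direct edge.

```python
from itertools import permutations

# Hypothetical signed edges: (regulator, target) -> +1 activation, -1 repression.
# Names and signs are illustrative placeholders, not data from the study.
edges = {
    ("TF_A", "miR_X"): +1,
    ("miR_X", "gene_T"): -1,
    ("TF_A", "gene_T"): -1,
    ("TF_A", "gene_U"): +1,
    ("TF_B", "miR_X"): +1,
}


def find_ffls(edges):
    """Return (x, y, z, coherent) for every loop x->y, y->z, x->z."""
    nodes = {n for pair in edges for n in pair}
    loops = []
    for x, y, z in permutations(sorted(nodes), 3):
        if (x, y) in edges and (y, z) in edges and (x, z) in edges:
            # Sign of the indirect path x -> y -> z.
            indirect = edges[(x, y)] * edges[(y, z)]
            # Coherent when the indirect and direct effects agree.
            coherent = indirect == edges[(x, z)]
            loops.append((x, y, z, coherent))
    return loops


ffls = find_ffls(edges)
for x, y, z, coherent in ffls:
    kind = "coherent" if coherent else "incoherent"
    print(f"{x} -> {y} -> {z} (direct {x} -> {z}): {kind}")
```

In this toy network only one triple closes into a loop; TF_B regulates the miRNA but has no direct edge to the shared target, so it does not form an FFL.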
omniCLIP: probabilistic identification of protein-RNA interactions from CLIP-seq data
CLIP-seq methods allow the generation of genome-wide maps of RNA binding protein – RNA interaction sites. However, due to differences between different CLIP-seq assays, existing computational approaches to analyze the data can only be applied to a subset of assays.
Here, we present a probabilistic model called omniCLIP that can detect regulatory elements in RNAs from data of all CLIP-seq assays. omniCLIP jointly models data across replicates and can integrate background information. Therefore, omniCLIP greatly simplifies the data analysis, increases the reliability of results and paves the way for integrative studies based on data from different assays. Reference
Network integration of multi-tumour omics data suggests novel targeting strategies
We characterize different tumour types in search of multi-tumour drug targets, in particular aiming for drug repurposing and novel drug combinations. Starting from 11 tumour types from The Cancer Genome Atlas, we obtain three clusters based on transcriptomic correlation profiles.
A network-based analysis, integrating gene expression profiles and protein interactions of cancer-related genes, allows us to define three cluster-specific signatures, with genes belonging to NF-κB signaling, chromosomal instability, ubiquitin-proteasome system, DNA metabolism, and apoptosis biological processes. These signatures have been characterized by different approaches based on mutational, pharmacological and clinical evidence, demonstrating the validity of our selection. Reference
Distinguishing genetic correlation from causation across 52 diseases and complex traits
Mendelian randomization, a method to infer causal relationships, is confounded by genetic correlations reflecting shared etiology.
We developed a model in which a latent causal variable mediates the genetic correlation; trait 1 is partially genetically causal for trait 2 if it is strongly genetically correlated with the latent causal variable, quantified using the genetic causality proportion. We fit this model using mixed fourth moments $E(\alpha_1^2 \alpha_1 \alpha_2)$ and $E(\alpha_2^2 \alpha_1 \alpha_2)$ of marginal effect sizes for each trait; if trait 1 is causal for trait 2, then SNPs affecting trait 1 (large $\alpha_1^2$) will have correlated effects on trait 2 (large $\alpha_1 \alpha_2$), but not vice versa. In simulations, our method avoided false positives due to genetic correlations, unlike Mendelian randomization. Across 52 traits (average n = 331,000), we identified 30 causal relationships with high genetic causality proportion estimates. Reference
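The asymmetry between the two mixed fourth moments can be illustrated with a toy simulation. This is only a sketch under strong assumptions, not the authors' estimator: effects are drawn directly (no linkage disequilibrium, no sampling noise), trait 1 effects are sparse with unit variance, and trait 2 effects equal a hypothetical causal coefficient `C` times the trait 1 effect plus independent Gaussian effects. All constants and function names are illustrative.

```python
import random

random.seed(0)

N_SNPS = 100_000   # number of simulated SNPs (assumed)
P_CAUSAL = 0.1     # fraction of SNPs affecting trait 1 (assumed)
C = 0.5            # hypothetical causal effect of trait 1 on trait 2


def simulate_effects():
    """Draw per-SNP effects for a model where trait 1 is causal for trait 2.

    alpha1 is sparse (mostly zero) and scaled to unit variance; alpha2 is
    C * alpha1 plus independent Gaussian effects, also with unit variance.
    """
    sd1 = (1.0 / P_CAUSAL) ** 0.5
    sd_e = (1.0 - C * C) ** 0.5
    alpha1, alpha2 = [], []
    for _ in range(N_SNPS):
        a1 = random.gauss(0.0, sd1) if random.random() < P_CAUSAL else 0.0
        a2 = C * a1 + random.gauss(0.0, sd_e)
        alpha1.append(a1)
        alpha2.append(a2)
    return alpha1, alpha2


def mixed_fourth_moments(alpha1, alpha2):
    """Return sample estimates of E(a1^2 * a1*a2) and E(a2^2 * a1*a2)."""
    n = len(alpha1)
    m1 = sum(a1 * a1 * a1 * a2 for a1, a2 in zip(alpha1, alpha2)) / n
    m2 = sum(a2 * a2 * a1 * a2 for a1, a2 in zip(alpha1, alpha2)) / n
    return m1, m2


alpha1, alpha2 = simulate_effects()
m1, m2 = mixed_fourth_moments(alpha1, alpha2)
print(f"E(a1^2 a1 a2) ~ {m1:.2f}, E(a2^2 a1 a2) ~ {m2:.2f}")
```

Under these assumptions the first moment exceeds the second because SNPs with large effects on trait 1 carry proportional effects on trait 2, whereas many SNPs with large effects on trait 2 have no effect on trait 1. The gap depends on the trait 1 effects being heavier-tailed than Gaussian; with purely Gaussian effects the two moments would coincide in expectation.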
Clust: automatic extraction of optimal co-expressed gene clusters from gene expression data
Identifying co-expressed gene clusters can provide evidence for genetic or physical interactions. Thus, co-expression clustering is a routine step in large-scale analyses of gene expression data.
We show that commonly used clustering methods produce results that substantially disagree and that do not match the biological expectations of co-expressed gene clusters. We present clust, a method that solves these problems by extracting clusters matching the biological expectations of co-expressed genes and outperforms widely used methods. Additionally, clust can simultaneously cluster multiple datasets, enabling users to leverage the large quantity of public expression data for novel comparative analysis. Reference
Large-scale reconstruction of cell lineages using single-cell readout of transcriptomes and CRISPR–Cas9 barcodes by scGESTALT
Lineage relationships among the large number of heterogeneous cell types generated during development are difficult to reconstruct in a high-throughput manner.
We recently established a method, scGESTALT, that combines cumulative editing of a lineage barcode array by CRISPR–Cas9 with large-scale transcriptional profiling using droplet-based single-cell RNA sequencing (scRNA-seq). The technique generates edits in the barcode array over multiple timepoints using Cas9 and pools of single-guide RNAs (sgRNAs) introduced during early and late zebrafish embryonic development, which distinguishes it from similar Cas9 lineage-tracing methods. Reference
An atlas of genetic associations in UK Biobank
Genome-wide association studies (GWAS) have identified many loci contributing to variation in complex traits, yet the majority of loci that contribute to the heritability of complex traits remain elusive.
Large study populations with sufficient statistical power are required to detect the small effect sizes of the yet unidentified genetic variants. However, the analysis of huge cohorts, like UK Biobank, is challenging. Here, we present an atlas of genetic associations for 118 non-binary and 660 binary traits of 452,264 UK Biobank participants of European ancestry. Results are compiled in a publicly accessible database that allows querying genome-wide association results for 9,113,133 genetic variants, as well as downloading GWAS summary statistics for over 30 million imputed genetic variants (>23 billion phenotype–genotype pairs). Reference
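As a quick consistency check on the figures quoted above, the number of phenotype–genotype pairs follows from the trait and variant counts. The sketch below uses 30 million as a lower bound for "over 30 million" imputed variants; the variable names are illustrative.

```python
# Trait counts taken from the abstract.
n_non_binary = 118
n_binary = 660
n_traits = n_non_binary + n_binary

# "Over 30 million" imputed variants, so this is a lower bound (assumed exact here).
n_imputed_variants = 30_000_000

# Each trait is tested against each variant.
pairs = n_traits * n_imputed_variants
print(f"{n_traits} traits x {n_imputed_variants:,} variants = {pairs:,} pairs")
```

The product is consistent with the stated figure of more than 23 billion pairs.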
The human gut microbiome in early-onset type 1 diabetes from the TEDDY study
Type 1 diabetes (T1D) is an autoimmune disease that targets pancreatic islet beta cells and incorporates genetic and environmental factors, including complex genetic elements, patient exposures and the gut microbiome.
Viral infections and broader gut dysbioses have been identified as potential causes or contributing factors; however, human studies have not yet identified microbial compositional or functional triggers that are predictive of islet autoimmunity or T1D. Here we analyse 10,913 metagenomes in stool samples from 783 mostly white, non-Hispanic children. The samples were collected monthly from three months of age until the clinical end point (islet autoimmunity or T1D) in The Environmental Determinants of Diabetes in the Young (TEDDY) study, to characterize the natural history of the early gut microbiome in connection to islet autoimmunity, T1D diagnosis, and other common early life events such as antibiotic treatments and probiotics. Reference
Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations
The liver is the largest solid organ in the body and is critical for metabolic and immune functions. However, little is known about the cells that make up the human liver and its immune microenvironment.
Here we report a map of the cellular landscape of the human liver using single-cell RNA sequencing. We provide the transcriptional profiles of 8444 parenchymal and non-parenchymal cells obtained from the fractionation of fresh hepatic tissue from five human livers. Using gene expression patterns, flow cytometry, and immunohistochemical examinations, we identify 20 discrete cell populations of hepatocytes, endothelial cells, cholangiocytes, hepatic stellate cells, B cells, conventional and non-conventional T cells, NK-like cells, and distinct intrahepatic monocyte/macrophage populations. Reference
Genotype effects contribute to variation in longitudinal methylome patterns in older people
DNA methylation levels change along with age, but few studies have examined the variation in the rate of such changes between individuals.
We performed a longitudinal analysis to quantify the variation in the rate of change of DNA methylation between individuals using whole blood DNA methylation array profiles collected at 2–4 time points (N = 2894) in 954 individuals (67–90 years). After stringent quality control, we identified 1507 DNA methylation CpG sites (rsCpGs) with statistically significant variation in the rate of change (random slope) of DNA methylation among individuals in a mixed linear model analysis. Genes in the vicinity of these rsCpGs were found to be enriched in Homeobox transcription factors and the Wnt signalling pathway, both of which are related to ageing processes. Reference
Large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across 31 human phenotypes
Genome-wide association studies (GWAS) aim to identify genetic factors associated with phenotypes. Standard analyses test variants for associations individually.
However, variant-level associations are hard to identify and can be difficult to interpret biologically. Enrichment analyses help address both problems by targeting sets of biologically related variants. Here we introduce a new model-based enrichment method that requires only GWAS summary statistics. Applying this method to interrogate 4,026 gene sets in 31 human phenotypes identifies many previously-unreported enrichments, including enrichments of endochondral ossification pathway for height, NFAT-dependent transcription pathway for rheumatoid arthritis, brain-related genes for coronary artery disease, and liver-related genes for Alzheimer’s disease. Reference
SeqOthello: querying RNA-seq experiments at scale
We present SeqOthello, an ultra-fast and memory-efficient indexing structure to support arbitrary sequence query against large collections of RNA-seq experiments.
It takes SeqOthello only 5 min and 19.1 GB memory to conduct a global survey of 11,658 fusion events against 10,113 TCGA Pan-Cancer RNA-seq datasets. The query recovers 92.7% of tier-1 fusions curated by TCGA Fusion Gene Database and reveals 270 novel occurrences, all of which are present as tumor-specific. By providing a reference-free, alignment-free, and parameter-free sequence search system, SeqOthello will enable large-scale integrative studies using sequence-level data, an undertaking not previously practicable for many individual labs. Reference
Discovery of potential causative mutations in human coding and noncoding genome with the interactive software BasePlayer
Next-generation sequencing (NGS) is routinely applied in life sciences and clinical practice, but interpretation of the massive quantities of genomic data produced has become a critical challenge.
The genome-wide mutation analyses enabled by NGS have had a revolutionary impact in revealing the predisposing and driving DNA alterations behind a multitude of disorders. The workflow to identify causative mutations from NGS data, for example in cancer and rare diseases, commonly involves phases such as quality filtering, case–control comparison, genome annotation, and visual validation, which require multiple processing steps and usage of various tools and scripts. To this end, we have introduced an interactive and user-friendly multi-platform-compatible software, BasePlayer, which allows scientists, regardless of bioinformatics training, to carry out variant analysis in disease genetics settings. Reference
Holo-Seq: single-cell sequencing of holo-transcriptome
Current single-cell RNA-seq approaches are hindered by preamplification bias, loss of strand of origin information, and the inability to observe small-RNA and mRNA dual transcriptomes.
Here, we introduce a single-cell holo-transcriptome sequencing method (Holo-Seq) that overcomes all three hurdles. Holo-Seq matches bulk RNA-seq in quantitative accuracy and uniformity of coverage while retaining complete strand-of-origin information. Most importantly, Holo-Seq can simultaneously observe small RNAs and mRNAs in a single cell. Reference
Phenome-wide association studies across large population cohorts support drug target validation
Phenome-wide association studies (PheWAS) have been proposed as a possible aid in drug development through elucidating mechanisms of action, identifying alternative indications, or predicting adverse drug events (ADEs).
Here, we select 25 single nucleotide polymorphisms (SNPs) linked through genome-wide association studies (GWAS) to 19 candidate drug targets for common disease indications. We interrogate these SNPs by PheWAS in four large cohorts with extensive health information (23andMe, UK Biobank, FINRISK, CHOP) for association with 1683 binary endpoints in up to 697,815 individuals and conduct meta-analyses for 145 mapped disease endpoints. Our analyses replicate 75% of known GWAS associations (P < 0.05) and identify nine study-wide significant novel associations (of 71 with FDR < 0.1). Reference
Identifying loci affecting trait variability and detecting interactions in genome-wide association studies
Identification of genetic variants with effects on trait variability can provide insights into the biological mechanisms that control variation and can identify potential interactions. We propose a two-degree-of-freedom test for jointly testing mean and variance effects to identify such variants.
We implement the test in a linear mixed model, for which we provide an efficient algorithm and software. To focus on biologically interesting settings, we develop a test for dispersion effects, that is, variance effects not driven solely by mean effects when the trait distribution is non-normal. We apply our approach to body mass index in the subsample of the UK Biobank population with British ancestry (n ~408,000) and show that our approach can increase the power to detect associated loci. Reference
Single-Cell Analysis of Quiescent HIV Infection Reveals Host Transcriptional Profiles that Regulate Proviral Latency
A detailed understanding of the mechanisms that establish or maintain the latent reservoir of HIV will guide approaches to eliminate persistent infection.
We used a cell line and primary cell models of HIV latency to investigate viral RNA (vRNA) expression and the role of the host transcriptome using single-cell approaches. Single-cell vRNA quantitation identified distinct populations of cells expressing various levels of vRNA, including completely silent populations. Strikingly, single-cell RNA-seq of latently infected primary cells demonstrated that HIV downregulation occurred in diverse transcriptomic environments but was significantly associated with expression of a specific set of cellular genes. In particular, latency was more frequent in cells expressing a transcriptional signature that included markers of naive and central memory T cells. These data reveal that expression of HIV proviruses within the latent reservoir is influenced by the host cell transcriptional program. Therapeutic modulation of these programs may reverse or enforce HIV latency. Reference
CRISPhieRmix: a hierarchical mixture model for CRISPR pooled screens
Pooled CRISPR screens allow researchers to interrogate genetic causes of complex phenotypes at the genome-wide scale and promise higher specificity and sensitivity compared to competing technologies.
Unfortunately, two problems exist, particularly for CRISPRi/a screens: variability in guide efficiency and large rare off-target effects. We present a method, CRISPhieRmix, that resolves these issues by using a hierarchical mixture model with a broad-tailed null distribution. We show that CRISPhieRmix allows for more accurate and powerful inferences in large-scale pooled CRISPRi/a screens. We discuss key issues in the analysis and design of screens, particularly the number of guides needed for faithful full discovery. Reference
The UK Biobank resource with deep phenotyping and genomic data
The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment.
The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Reference
PHLI-seq: constructing and visualizing cancer genomic maps in 3D by phenotype-based high-throughput laser-aided isolation and sequencing
Spatial mapping of genomic data to tissue context in a high-throughput and high-resolution manner has been challenging due to technical limitations.
Here, we describe PHLI-seq, a novel approach that enables high-throughput isolation and genome-wide sequence analysis of single cells or small numbers of cells to construct genomic maps within cancer tissue in relation to the images or phenotypes of the cells. By applying PHLI-seq, we reveal the heterogeneity of breast cancer tissues at a high resolution and map the genomic landscape of the cells to their corresponding spatial locations and phenotypes in the 3D tumor mass. Reference
Integrated systems analysis reveals conserved gene networks underlying response to spinal cord injury
Spinal cord injury (SCI) is a devastating neurological condition for which there are currently no effective treatment options to restore function.
A major obstacle to the development of new therapies is our fragmentary understanding of the coordinated pathophysiological processes triggered by damage to the human spinal cord. Here, we describe a systems biology approach to integrate decades of small-scale experiments with unbiased, genome-wide gene expression from the human spinal cord, revealing a gene regulatory network signature of the pathophysiological response to SCI. Our integrative analyses converge on an evolutionarily conserved gene subnetwork enriched for genes associated with the response to SCI by small-scale experiments, and whose expression is upregulated in a severity-dependent manner following injury and downregulated in functional recovery. Reference
Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps
We expanded GWAS discovery for type 2 diabetes (T2D) by combining data from 898,130 European-descent individuals (9% cases), after imputation to high-density reference panels.
With these data, we (i) extend the inventory of T2D-risk variants (243 loci, 135 newly implicated in T2D predisposition, comprising 403 distinct association signals); (ii) enrich discovery of lower-frequency risk alleles (80 index variants with minor allele frequency <5%, 14 with estimated allelic odds ratio >2); (iii) substantially improve fine-mapping of causal variants (at 51 signals, one variant accounted for >80% posterior probability of association (PPA)); (iv) extend fine-mapping through integration of tissue-specific epigenomic information (islet regulatory annotations extend the number of variants with PPA >80% to 73); (v) highlight validated therapeutic targets (18 genes with associations attributable to coding variants); and (vi) demonstrate enhanced potential for clinical translation (genome-wide chip heritability explains 18% of T2D risk; individuals in the extremes of a T2D polygenic risk score differ more than ninefold in prevalence). Reference
Modularity of genes involved in local adaptation to climate despite physical linkage
Linkage among genes experiencing different selection pressures can make natural selection less efficient. Theory predicts that when local adaptation is driven by complex and non-covarying stresses, increased linkage is favored for alleles with similar pleiotropic effects, with increased recombination favored among alleles with contrasting pleiotropic effects.
Here, we introduce a framework to test these predictions with a co-association network analysis, which clusters loci based on differing associations. We use this framework to study the genetic architecture of local adaptation to climate in lodgepole pine, Pinus contorta, based on associations with environments. Reference
Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris
Here we present a compendium of single-cell transcriptomic data from the model organism Mus musculus that comprises more than 100,000 cells from 20 organs and tissues.
These data represent a new resource for cell biology, reveal gene expression in poorly characterized cell populations and enable the direct and controlled comparison of gene expression in cell types that are shared between tissues, such as T lymphocytes and endothelial cells from different anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3′-end counting, enabled the survey of thousands of cells at relatively low coverage, whereas the other, full-length transcript analysis based on fluorescence-activated cell sorting, enabled the characterization of cell types with high sensitivity and coverage. The cumulative data provide the foundation for an atlas of transcriptomic cell biology. Reference
Genome-wide association study identifies two risk loci for tuberculosis in Han Chinese
Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis (Mtb), and remains a leading public health problem. Previous studies have identified host genetic factors that contribute to Mtb infection outcomes.
However, much of the heritability in TB remains unaccounted for and additional susceptibility loci most likely exist. We perform a multistage genome-wide association study on 2949 pulmonary TB patients and 5090 healthy controls (833 cases and 1220 controls were genome-wide genotyped) from the Han Chinese population. We discover two risk loci: 14q24.3 (rs12437118, $P_{\mathrm{combined}} = 1.72 \times 10^{-11}$, OR = 1.277, ESRRB) and 20p13 (rs6114027, $P_{\mathrm{combined}} = 2.37 \times 10^{-11}$, OR = 1.339, TGM6). Reference
Predicting microRNA targeting efficacy in Drosophila
MicroRNAs (miRNAs) are short regulatory RNAs that derive from hairpin precursors. Important for understanding the functional roles of miRNAs is the ability to predict the messenger RNA (mRNA) targets most responsive to each miRNA.
We acquired datasets suitable for the quantitative study of miRNA targeting in Drosophila. Analyses of these data expanded the types of regulatory sites known to be effective in flies, expanded the mRNA regions with detectable targeting to include 5′ untranslated regions, and identified features of site context that correlate with targeting efficacy in fly cells. Reference
Genetics of blood lipids among ~300,000 multi-ethnic participants of the Million Veteran Program
The Million Veteran Program (MVP) was established in 2011 as a national research initiative to determine how genetic variation influences the health of US military veterans.
Here we genotyped 312,571 MVP participants using a custom biobank array and linked the genetic data to laboratory and clinical phenotypes extracted from electronic health records covering a median of 10.0 years of follow-up. Among 297,626 veterans with at least one blood lipid measurement, including 57,332 black and 24,743 Hispanic participants, we tested up to around 32 million variants for association with lipid levels and identified 118 novel genome-wide significant loci after meta-analysis with data from the Global Lipids Genetics Consortium (total n > 600,000). Reference
Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects
Hundreds of thousands of human whole genome sequencing (WGS) datasets will be generated over the next few years. These data are more valuable in aggregate: joint analysis of genomes from many sources increases sample size and statistical power.
A central challenge for joint analysis is that different WGS data processing pipelines cause substantial differences in variant calling in combined datasets, necessitating computationally expensive reprocessing. This approach is no longer tenable given the scale of current studies and data volumes. Here, we define WGS data processing standards that allow different groups to produce functionally equivalent (FE) results, yet still innovate on data processing pipelines. Reference
A comprehensive analysis of 195 DNA methylomes reveals shared and cell-specific features of partially methylated domains
Partially methylated domains are extended regions in the genome exhibiting a reduced average DNA methylation level. They cover gene-poor and transcriptionally inactive regions and tend to be heterochromatic.
We present a comprehensive comparative analysis of partially methylated domains in human and mouse cells, to identify structural and functional features associated with them. Partially methylated domains are present in up to 75% of the genome in human and mouse cells irrespective of their tissue or cell origin. Each cell type has a distinct set of partially methylated domains, and genes expressed in such domains show a strong cell type effect. The methylation level varies between cell types with a more pronounced effect in differentiating and replicating cells. The lowest level of methylation is observed in highly proliferating and immortal cancer cell lines. Reference
Methylation of all BRCA1 copies predicts response to the PARP inhibitor rucaparib in ovarian carcinoma
Accurately identifying patients with high-grade serous ovarian carcinoma (HGSOC) who respond to poly(ADP-ribose) polymerase inhibitor (PARPi) therapy is of great clinical importance.
Here we show that quantitative BRCA1 methylation analysis provides new insight into PARPi response in preclinical models and ovarian cancer patients. The response of 12 HGSOC patient-derived xenografts (PDX) to the PARPi rucaparib was assessed, with variable dose-dependent responses observed in chemo-naive BRCA1/2-mutated PDX, and no responses in PDX lacking DNA repair pathway defects. Among BRCA1-methylated PDX, silencing of all BRCA1 copies predicts rucaparib response, whilst heterozygous methylation is associated with resistance. Reference
Epigenetic prediction of complex traits and death
Genome-wide DNA methylation (DNAm) profiling has allowed for the development of molecular predictors for a multitude of traits and diseases. Such predictors may be more accurate than the self-reported phenotypes and could have clinical applications.
Here, penalized regression models are used to develop DNAm predictors for ten modifiable health and lifestyle factors in a cohort of 5087 individuals. Using an independent test cohort comprising 895 individuals, the proportion of phenotypic variance explained in each trait is examined for DNAm-based and genetic predictors. Receiver operator characteristic curves are generated to investigate the predictive performance of DNAm-based predictors, using dichotomized phenotypes. The relationship between DNAm scores and all-cause mortality (n = 212 events) is assessed via Cox proportional hazards models. DNAm predictors for smoking, alcohol, education, and waist-to-hip ratio are shown to predict mortality in multivariate models. Reference