 Science in this Week (December, 2018)

Update on: December 14, 2018


A comprehensive pipeline for translational top-down proteomics from a single blood draw

Top-down proteomics (TDP) by mass spectrometry (MS) is a technique by which intact proteins are analyzed. It has become increasingly popular in translational research because of the value of characterizing distinct proteoforms of intact proteins.

Compared to bottom-up proteomics (BUP) strategies, which measure digested peptide mixtures, TDP provides highly specific molecular information that avoids the bioinformatic challenge of protein inference. However, the technique has been difficult to implement widely because of inherent limitations of existing sample preparation methods and instrumentation. Reference


Genomic landscape of oxidative DNA damage and repair reveals regioselective protection from mutagenesis

DNA is subject to constant chemical modification and damage, which eventually results in variable mutation rates throughout the genome. Although detailed molecular mechanisms of DNA damage and repair are well understood, damage impact and execution of repair across a genome remain poorly defined.

To bridge the gap between our understanding of DNA repair and mutation distributions, we developed a novel method, AP-seq, capable of mapping apurinic sites and 8-oxo-7,8-dihydroguanine bases at approximately 250-bp resolution on a genome-wide scale. We directly demonstrate that the accumulation rate of apurinic sites varies widely across the genome, with hot spots acquiring many times more damage than cold spots. Reference


The molecular landscape of glioma in patients with Neurofibromatosis 1

Neurofibromatosis type 1 (NF1) is a common tumor predisposition syndrome in which glioma is one of the prevalent tumors. Gliomagenesis in NF1 results in a heterogeneous spectrum of low- to high-grade neoplasms occurring during the entire lifespan of patients.

The pattern of genetic and epigenetic alterations of glioma that develops in NF1 patients and the similarities with sporadic glioma remain unknown. Here, we present the molecular landscape of low- and high-grade gliomas in patients affected by NF1 (NF1-glioma). We found that the predisposing germline mutation of the NF1 gene was frequently converted to homozygosity and the somatic mutational load of NF1-glioma was influenced by age and grade. Reference


Genome-wide study of hair colour in UK Biobank explains most of the SNP heritability

Natural hair colour within European populations is a complex genetic trait. Previous work has established that MC1R variants are the principal genetic cause of red hair colour, but with variable penetrance.

Here, we have extensively mapped the genes responsible for hair colour in the white, British ancestry, participants in UK Biobank. MC1R only explains 73% of the SNP heritability for red hair in UK Biobank, and in fact most individuals with two MC1R variants have blonde or light brown hair. We identify other genes contributing to red hair, the combined effect of which accounts for ~90% of the SNP heritability. Reference


Pan-cancer characterisation of microRNA across cancer hallmarks reveals microRNA-mediated downregulation of tumour suppressors

microRNAs are key regulators of the human transcriptome across a number of diverse biological processes, such as development, aging and cancer, where particular miRNAs have been identified as tumour suppressive and oncogenic.

In this work, we elucidate, in a comprehensive manner, across 15 epithelial cancer types comprising 7316 clinical samples from the Cancer Genome Atlas, the association of miRNA expression and target regulation with the phenotypic hallmarks of cancer. Utilising penalised regression techniques to integrate transcriptomic, methylation and mutation data, we find evidence for a complex map of interactions underlying the relationship of miRNA regulation and the hallmarks of cancer. This highlighted high redundancy for the oncomiR-1 cluster of oncogenic miRNAs, in particular hsa-miR-17-5p. Reference


Machine learning applied to enzyme turnover numbers reveals protein structural correlates and improves metabolic models

Knowing the catalytic turnover numbers of enzymes is essential for understanding the growth rate, proteome composition, and physiology of organisms, but experimental data on enzyme turnover numbers is sparse and noisy.

Here, we demonstrate that machine learning can successfully predict catalytic turnover numbers in Escherichia coli based on integrated data on enzyme biochemistry, protein structure, and network context. We identify a diverse set of features that are consistently predictive for both in vivo and in vitro enzyme turnover rates, revealing novel protein structural correlates of catalytic turnover. Reference


Single-cell mapping of lineage and identity in direct reprogramming

Direct lineage reprogramming involves the conversion of cellular identity. Single-cell technologies are useful for deconstructing the considerable heterogeneity that emerges during lineage conversion.

However, lineage relationships are typically lost during cell processing, complicating trajectory reconstruction. Here we present ‘CellTagging’, a combinatorial cell-indexing methodology that enables parallel capture of clonal history and cell identity, in which sequential rounds of cell labelling enable the construction of multi-level lineage trees. CellTagging and longitudinal tracking of fibroblast to induced endoderm progenitor reprogramming reveals two distinct trajectories: one leading to successfully reprogrammed cells, and one leading to a ‘dead-end’ state, paths determined in the earliest stages of lineage conversion. Reference


Integrated proteotranscriptomics of breast cancer

Transcriptome analysis of breast cancer discovered distinct disease subtypes of clinical significance.

However, it remains a challenge to define disease biology solely based on gene expression because tumor biology is often the result of protein function. Here, we measured global proteome and transcriptome expression in human breast tumors and adjacent non-cancerous tissue and performed an integrated proteotranscriptomic analysis. Reference


miRTrace reveals the organismal origins of microRNA sequencing data

We present here miRTrace, the first algorithm to trace microRNA sequencing data back to their taxonomic origins.

This is a challenge with profound implications for forensics, parasitology, food control, and research settings where cross-contamination can compromise results. miRTrace accurately (> 99%) assigns real and simulated data to 14 important animal and plant groups, sensitively detects parasitic infection in mammals, and discovers the primate origin of single cells. Applying our algorithm to over 700 public datasets, we find evidence that over 7% are cross-contaminated and present a novel solution to clean these computationally, even after sequencing has occurred. Reference


Spatially and functionally distinct subclasses of breast cancer-associated fibroblasts revealed by single cell RNA sequencing

Cancer-associated fibroblasts (CAFs) are a major constituent of the tumor microenvironment, although their origin and roles in shaping disease initiation, progression and treatment response remain unclear due to significant heterogeneity.

Here, following a negative selection strategy combined with single-cell RNA sequencing of 768 transcriptomes of mesenchymal cells from a genetically engineered mouse model of breast cancer, we define three distinct subpopulations of CAFs. Validation at the transcriptional and protein level in several experimental models of cancer and human tumors reveals spatial separation of the CAF subclasses attributable to different origins, including the peri-vascular niche, the mammary fat pad and the transformed epithelium. Gene profiles for each CAF subtype correlate to distinctive functional programs and hold independent prognostic capability in clinical cohorts by association to metastatic disease. Reference


Deep generative modeling for single-cell transcriptomics

Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses.

Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells (https://github.com/YosefLab/scVI). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task. Reference


The phylogeography and incidence of multi-drug resistant typhoid fever in sub-Saharan Africa

There is a paucity of data regarding the geographical distribution, incidence, and phylogenetics of multi-drug resistant (MDR) Salmonella Typhi in sub-Saharan Africa.

Here we present a phylogenetic reconstruction of 249 whole-genome-sequenced contemporaneous S. Typhi isolates collected between 2008 and 2015 in 11 sub-Saharan African countries, in the context of the 2,057 global S. Typhi genomic framework. Despite the broad genetic diversity, the majority of organisms (225/249; 90%) belong to only three genotypes, 4.3.1 (H58) (99/249; 40%), 3.1.1 (97/249; 39%), and 2.3.2 (29/249; 12%). Genotypes 4.3.1 and 3.1.1 are confined within East and West Africa, respectively. MDR phenotype is found in over 50% of organisms restricted within these dominant genotypes. Reference
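
As a quick arithmetic check on the genotype shares quoted above, the short Python sketch below recomputes each percentage from the reported counts. The variable and function names are illustrative only and do not come from the study.

```python
# Counts quoted in the summary above (isolates per genotype, out of 249).
TOTAL_ISOLATES = 249
genotype_counts = {"4.3.1 (H58)": 99, "3.1.1": 97, "2.3.2": 29}


def percent(count, total=TOTAL_ISOLATES):
    """Share of isolates as a whole-number percentage."""
    return round(100 * count / total)


# Total across the three dominant genotypes and each genotype's share.
dominant_total = sum(genotype_counts.values())
shares = {g: percent(c) for g, c in genotype_counts.items()}

for genotype, count in genotype_counts.items():
    print(f"{genotype}: {count}/{TOTAL_ISOLATES} = {shares[genotype]}%")
print(f"three dominant genotypes: {dominant_total}/{TOTAL_ISOLATES} "
      f"= {percent(dominant_total)}%")
```

The three counts sum to the 225 isolates reported for the dominant genotypes, and the rounded shares agree with the percentages given in the text.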


Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions

We introduce new statistical methods for analyzing genomic data sets that measure many effects in many conditions (for example, gene expression changes under many treatments).

These new methods improve on existing methods by allowing for arbitrary correlations in effect sizes among conditions. This flexible approach increases power, improves effect estimates and allows for more quantitative assessments of effect-size heterogeneity compared to simple shared or condition-specific assessments. We illustrate these features through an analysis of locally acting variants associated with gene expression (cis expression quantitative trait loci (eQTLs)) in 44 human tissues. Our analysis identifies more eQTLs than existing approaches, consistent with improved power. We show that although genetic effects on expression are extensively shared among tissues, effect sizes can still vary greatly among tissues. Reference


CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise

We assembled the sequences from deep RNA sequencing experiments by the Genotype-Tissue Expression (GTEx) project, to create a new catalog of human genes and transcripts, called CHESS.

The new database contains 42,611 genes, of which 20,352 are potentially protein-coding and 22,259 are noncoding, and a total of 323,258 transcripts. These include 224 novel protein-coding genes and 116,156 novel transcripts. We detected over 30 million additional transcripts at more than 650,000 genomic loci, nearly all of which are likely nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells. Reference


CODEX2: full-spectrum copy number variation detection by high-throughput DNA sequencing

High-throughput DNA sequencing enables detection of copy number variations (CNVs) on the genome-wide scale with finer resolution compared to array-based methods but suffers from biases and artifacts that lead to false discoveries and low sensitivity.

We describe CODEX2, a statistical framework for full-spectrum CNV profiling that is sensitive for variants with both common and rare population frequencies and that is applicable to study designs with and without negative control samples. We demonstrate and evaluate CODEX2 on whole-exome and targeted sequencing data, where biases are the most prominent. CODEX2 outperforms existing methods and, in particular, significantly improves sensitivity for common CNVs. Reference


GRIPT: a novel case-control analysis method for Mendelian disease gene discovery

Despite rapid progress of next-generation sequencing (NGS) technologies, the disease-causing genes underpinning about half of all Mendelian diseases remain elusive.

One main challenge is the high genetic heterogeneity of Mendelian diseases in which similar phenotypes are caused by different genes and each gene only accounts for a small proportion of the patients. To overcome this gap, we developed a novel method, the Gene Ranking, Identification and Prediction Tool (GRIPT), for performing case-control analysis of NGS data. Analyses of simulated and real datasets show that GRIPT is well-powered for disease gene discovery, especially for diseases with high locus heterogeneity. Reference


Architecture of gene regulatory networks controlling flower development in Arabidopsis thaliana

Floral homeotic transcription factors (TFs) act in a combinatorial manner to specify the organ identities in the flower. However, the architecture and the function of the gene regulatory network (GRN) controlling floral organ specification is still poorly understood.

In particular, the interconnections of homeotic TFs, microRNAs (miRNAs) and other factors controlling organ initiation and growth have not been studied systematically so far. Here, using a combination of genome-wide TF binding, mRNA and miRNA expression data, we reconstruct the dynamic GRN controlling floral meristem development and organ differentiation. We identify prevalent feed-forward loops (FFLs) mediated by floral homeotic TFs and miRNAs that regulate common targets. Experimental validation of a coherent FFL shows that petal size is controlled by the SEPALLATA3-regulated miR319/TCP4 module. We further show that combinatorial DNA-binding of homeotic factors and selected other TFs is predictive of organ-specific patterns of gene expression. Reference


omniCLIP: probabilistic identification of protein-RNA interactions from CLIP-seq data

CLIP-seq methods allow the generation of genome-wide maps of RNA binding protein – RNA interaction sites. However, due to differences between different CLIP-seq assays, existing computational approaches to analyze the data can only be applied to a subset of assays.

Here, we present a probabilistic model called omniCLIP that can detect regulatory elements in RNAs from data of all CLIP-seq assays. omniCLIP jointly models data across replicates and can integrate background information. Therefore, omniCLIP greatly simplifies the data analysis, increases the reliability of results and paves the way for integrative studies based on data from different assays. Reference


Network integration of multi-tumour omics data suggests novel targeting strategies

We characterize different tumour types in search of multi-tumour drug targets, in particular aiming for drug repurposing and novel drug combinations. Starting from 11 tumour types from The Cancer Genome Atlas, we obtain three clusters based on transcriptomic correlation profiles.

A network-based analysis, integrating gene expression profiles and protein interactions of cancer-related genes, allows us to define three cluster-specific signatures, with genes belonging to NF-κB signaling, chromosomal instability, ubiquitin-proteasome system, DNA metabolism, and apoptosis biological processes. These signatures have been characterized by different approaches based on mutational, pharmacological and clinical evidence, demonstrating the validity of our selection. Reference


Distinguishing genetic correlation from causation across 52 diseases and complex traits

Mendelian randomization, a method to infer causal relationships, is confounded by genetic correlations reflecting shared etiology.

We developed a model in which a latent causal variable mediates the genetic correlation; trait 1 is partially genetically causal for trait 2 if it is strongly genetically correlated with the latent causal variable, quantified using the genetic causality proportion. We fit this model using mixed fourth moments $E(\alpha_1^2 \alpha_1 \alpha_2)$ and $E(\alpha_2^2 \alpha_1 \alpha_2)$ of marginal effect sizes for each trait; if trait 1 is causal for trait 2, then SNPs affecting trait 1 (large $\alpha_1^2$) will have correlated effects on trait 2 (large $\alpha_1 \alpha_2$), but not vice versa. In simulations, our method avoided false positives due to genetic correlations, unlike Mendelian randomization. Across 52 traits (average n = 331,000), we identified 30 causal relationships with high genetic causality proportion estimates. Reference
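
To make the asymmetry of the two mixed fourth moments concrete, here is a toy Python simulation. It is a minimal sketch under stated assumptions, not the authors' estimator: trait 1 is taken to be fully causal for trait 2, so its per-SNP effects equal the latent variable's effects, which are drawn from a sparse, heavy-tailed distribution. The function name and the parameters `q2`, `p_causal`, and `n_snps` are hypothetical choices for illustration.

```python
import random


def mixed_fourth_moments(n_snps=200_000, q2=0.5, p_causal=0.05, seed=0):
    """Toy simulation (assumed setup, not the published method).

    a1: per-SNP effect on trait 1, sparse and heavy-tailed, variance 1.
    a2: per-SNP effect on trait 2 = q2 * a1 + independent Gaussian noise,
        scaled so that a2 also has variance 1.
    Returns sample estimates of E(a1^2 * a1 * a2) and E(a2^2 * a1 * a2).
    """
    rng = random.Random(seed)
    sd_effect = (1.0 / p_causal) ** 0.5   # keeps Var(a1) = 1
    sd_noise = (1.0 - q2 ** 2) ** 0.5     # keeps Var(a2) = 1
    m1 = m2 = 0.0
    for _ in range(n_snps):
        # Most SNPs have no effect on trait 1; a few have large effects.
        a1 = rng.gauss(0.0, sd_effect) if rng.random() < p_causal else 0.0
        a2 = q2 * a1 + rng.gauss(0.0, sd_noise)
        m1 += a1 ** 2 * a1 * a2
        m2 += a2 ** 2 * a1 * a2
    return m1 / n_snps, m2 / n_snps


m1, m2 = mixed_fourth_moments()
print(f"E(a1^2 a1 a2) ~ {m1:.2f}")
print(f"E(a2^2 a1 a2) ~ {m2:.2f}")
```

Under these assumptions the first moment exceeds the second whenever the trait 1 effects have excess kurtosis, because SNPs with large effects on trait 1 carry proportional effects on trait 2, while SNPs with large effects on trait 2 mostly reflect noise unrelated to trait 1.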


Clust: automatic extraction of optimal co-expressed gene clusters from gene expression data

Identifying co-expressed gene clusters can provide evidence for genetic or physical interactions. Thus, co-expression clustering is a routine step in large-scale analyses of gene expression data.

We show that commonly used clustering methods produce results that substantially disagree and that do not match the biological expectations of co-expressed gene clusters. We present clust, a method that solves these problems by extracting clusters matching the biological expectations of co-expressed genes and outperforms widely used methods. Additionally, clust can simultaneously cluster multiple datasets, enabling users to leverage the large quantity of public expression data for novel comparative analysis. Reference


Large-scale reconstruction of cell lineages using single-cell readout of transcriptomes and CRISPR–Cas9 barcodes by scGESTALT

Lineage relationships among the large number of heterogeneous cell types generated during development are difficult to reconstruct in a high-throughput manner.

We recently established a method, scGESTALT, that combines cumulative editing of a lineage barcode array by CRISPR–Cas9 with large-scale transcriptional profiling using droplet-based single-cell RNA sequencing (scRNA-seq). The technique generates edits in the barcode array over multiple timepoints using Cas9 and pools of single-guide RNAs (sgRNAs) introduced during early and late zebrafish embryonic development, which distinguishes it from similar Cas9 lineage-tracing methods. Reference


An atlas of genetic associations in UK Biobank

Genome-wide association studies (GWAS) have identified many loci contributing to variation in complex traits, yet the majority of loci that contribute to the heritability of complex traits remain elusive.

Large study populations with sufficient statistical power are required to detect the small effect sizes of the yet unidentified genetic variants. However, the analysis of huge cohorts, like UK Biobank, is challenging. Here, we present an atlas of genetic associations for 118 non-binary and 660 binary traits of 452,264 UK Biobank participants of European ancestry. Results are compiled in a publicly accessible database that allows querying genome-wide association results for 9,113,133 genetic variants, as well as downloading GWAS summary statistics for over 30 million imputed genetic variants (>23 billion phenotype–genotype pairs). Reference


The human gut microbiome in early-onset type 1 diabetes from the TEDDY study

Type 1 diabetes (T1D) is an autoimmune disease that targets pancreatic islet beta cells and incorporates genetic and environmental factors, including complex genetic elements, patient exposures and the gut microbiome.

Viral infections and broader gut dysbioses have been identified as potential causes or contributing factors; however, human studies have not yet identified microbial compositional or functional triggers that are predictive of islet autoimmunity or T1D. Here we analyse 10,913 metagenomes in stool samples from 783 mostly white, non-Hispanic children. The samples were collected monthly from three months of age until the clinical end point (islet autoimmunity or T1D) in The Environmental Determinants of Diabetes in the Young (TEDDY) study, to characterize the natural history of the early gut microbiome in connection to islet autoimmunity, T1D diagnosis, and other common early life events such as antibiotic treatments and probiotics. Reference


Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations

The liver is the largest solid organ in the body and is critical for metabolic and immune functions. However, little is known about the cells that make up the human liver and its immune microenvironment.

Here we report a map of the cellular landscape of the human liver using single-cell RNA sequencing. We provide the transcriptional profiles of 8444 parenchymal and non-parenchymal cells obtained from the fractionation of fresh hepatic tissue from five human livers. Using gene expression patterns, flow cytometry, and immunohistochemical examinations, we identify 20 discrete cell populations of hepatocytes, endothelial cells, cholangiocytes, hepatic stellate cells, B cells, conventional and non-conventional T cells, NK-like cells, and distinct intrahepatic monocyte/macrophage populations. Reference


Genotype effects contribute to variation in longitudinal methylome patterns in older people

DNA methylation levels change along with age, but few studies have examined the variation in the rate of such changes between individuals.

We performed a longitudinal analysis to quantify the variation in the rate of change of DNA methylation between individuals using whole blood DNA methylation array profiles collected at 2–4 time points (N = 2894) in 954 individuals (67–90 years). After stringent quality control, we identified 1507 DNA methylation CpG sites (rsCpGs) with statistically significant variation in the rate of change (random slope) of DNA methylation among individuals in a mixed linear model analysis. Genes in the vicinity of these rsCpGs were found to be enriched in Homeobox transcription factors and the Wnt signalling pathway, both of which are related to ageing processes. Reference


Large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across 31 human phenotypes

Genome-wide association studies (GWAS) aim to identify genetic factors associated with phenotypes. Standard analyses test variants for associations individually.

However, variant-level associations are hard to identify and can be difficult to interpret biologically. Enrichment analyses help address both problems by targeting sets of biologically related variants. Here we introduce a new model-based enrichment method that requires only GWAS summary statistics. Applying this method to interrogate 4,026 gene sets in 31 human phenotypes identifies many previously-unreported enrichments, including enrichments of endochondral ossification pathway for height, NFAT-dependent transcription pathway for rheumatoid arthritis, brain-related genes for coronary artery disease, and liver-related genes for Alzheimer’s disease. Reference


SeqOthello: querying RNA-seq experiments at scale

We present SeqOthello, an ultra-fast and memory-efficient indexing structure to support arbitrary sequence query against large collections of RNA-seq experiments.

It takes SeqOthello only 5 min and 19.1 GB memory to conduct a global survey of 11,658 fusion events against 10,113 TCGA Pan-Cancer RNA-seq datasets. The query recovers 92.7% of tier-1 fusions curated by TCGA Fusion Gene Database and reveals 270 novel occurrences, all of which are present as tumor-specific. By providing a reference-free, alignment-free, and parameter-free sequence search system, SeqOthello will enable large-scale integrative studies using sequence-level data, an undertaking not previously practicable for many individual labs. Reference


Discovery of potential causative mutations in human coding and noncoding genome with the interactive software BasePlayer

Next-generation sequencing (NGS) is routinely applied in life sciences and clinical practice, but interpretation of the massive quantities of genomic data produced has become a critical challenge.

The genome-wide mutation analyses enabled by NGS have had a revolutionary impact in revealing the predisposing and driving DNA alterations behind a multitude of disorders. The workflow to identify causative mutations from NGS data, for example in cancer and rare diseases, commonly involves phases such as quality filtering, case–control comparison, genome annotation, and visual validation, which require multiple processing steps and usage of various tools and scripts. To this end, we have introduced an interactive and user-friendly multi-platform-compatible software, BasePlayer, which allows scientists, regardless of bioinformatics training, to carry out variant analysis in disease genetics settings. Reference


Holo-Seq: single-cell sequencing of holo-transcriptome

Current single-cell RNA-seq approaches are hindered by preamplification bias, loss of strand of origin information, and the inability to observe small-RNA and mRNA dual transcriptomes.

Here, we introduce a single-cell holo-transcriptome sequencing method (Holo-Seq) that overcomes all three hurdles. Holo-Seq has the same quantitative accuracy and uniform coverage, with complete strand-of-origin information, as bulk RNA-seq. Most importantly, Holo-Seq can simultaneously observe small RNAs and mRNAs in a single cell. Reference


Phenome-wide association studies across large population cohorts support drug target validation

Phenome-wide association studies (PheWAS) have been proposed as a possible aid in drug development through elucidating mechanisms of action, identifying alternative indications, or predicting adverse drug events (ADEs).

Here, we select 25 single nucleotide polymorphisms (SNPs) linked through genome-wide association studies (GWAS) to 19 candidate drug targets for common disease indications. We interrogate these SNPs by PheWAS in four large cohorts with extensive health information (23andMe, UK Biobank, FINRISK, CHOP) for association with 1683 binary endpoints in up to 697,815 individuals and conduct meta-analyses for 145 mapped disease endpoints. Our analyses replicate 75% of known GWAS associations (P < 0.05) and identify nine study-wide significant novel associations (of 71 with FDR < 0.1). Reference


Identifying loci affecting trait variability and detecting interactions in genome-wide association studies

Identification of genetic variants with effects on trait variability can provide insights into the biological mechanisms that control variation and can identify potential interactions. We propose a two-degree-of-freedom test for jointly testing mean and variance effects to identify such variants.

We implement the test in a linear mixed model, for which we provide an efficient algorithm and software. To focus on biologically interesting settings, we develop a test for dispersion effects, that is, variance effects not driven solely by mean effects when the trait distribution is non-normal. We apply our approach to body mass index in the subsample of the UK Biobank population with British ancestry (n ~408,000) and show that our approach can increase the power to detect associated loci. Reference


Single-Cell Analysis of Quiescent HIV Infection Reveals Host Transcriptional Profiles that Regulate Proviral Latency

A detailed understanding of the mechanisms that establish or maintain the latent reservoir of HIV will guide approaches to eliminate persistent infection.

We used a cell line and primary cell models of HIV latency to investigate viral RNA (vRNA) expression and the role of the host transcriptome using single-cell approaches. Single-cell vRNA quantitation identified distinct populations of cells expressing various levels of vRNA, including completely silent populations. Strikingly, single-cell RNA-seq of latently infected primary cells demonstrated that HIV downregulation occurred in diverse transcriptomic environments but was significantly associated with expression of a specific set of cellular genes. In particular, latency was more frequent in cells expressing a transcriptional signature that included markers of naive and central memory T cells. These data reveal that expression of HIV proviruses within the latent reservoir is influenced by the host cell transcriptional program. Therapeutic modulation of these programs may reverse or enforce HIV latency. Reference


CRISPhieRmix: a hierarchical mixture model for CRISPR pooled screens

Pooled CRISPR screens allow researchers to interrogate genetic causes of complex phenotypes at the genome-wide scale and promise higher specificity and sensitivity compared to competing technologies.

Unfortunately, two problems exist, particularly for CRISPRi/a screens: variability in guide efficiency and large rare off-target effects. We present a method, CRISPhieRmix, that resolves these issues by using a hierarchical mixture model with a broad-tailed null distribution. We show that CRISPhieRmix allows for more accurate and powerful inferences in large-scale pooled CRISPRi/a screens. We discuss key issues in the analysis and design of screens, particularly the number of guides needed for faithful full discovery. Reference


The UK Biobank resource with deep phenotyping and genomic data

The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment.

The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Reference


PHLI-seq: constructing and visualizing cancer genomic maps in 3D by phenotype-based high-throughput laser-aided isolation and sequencing

Spatial mapping of genomic data to tissue context in a high-throughput and high-resolution manner has been challenging due to technical limitations.

Here, we describe PHLI-seq, a novel approach that enables high-throughput isolation and genome-wide sequence analysis of single cells or small numbers of cells to construct genomic maps within cancer tissue in relation to the images or phenotypes of the cells. By applying PHLI-seq, we reveal the heterogeneity of breast cancer tissues at a high resolution and map the genomic landscape of the cells to their corresponding spatial locations and phenotypes in the 3D tumor mass. Reference


Integrated systems analysis reveals conserved gene networks underlying response to spinal cord injury

Spinal cord injury (SCI) is a devastating neurological condition for which there are currently no effective treatment options to restore function.

A major obstacle to the development of new therapies is our fragmentary understanding of the coordinated pathophysiological processes triggered by damage to the human spinal cord. Here, we describe a systems biology approach to integrate decades of small-scale experiments with unbiased, genome-wide gene expression from the human spinal cord, revealing a gene regulatory network signature of the pathophysiological response to SCI. Our integrative analyses converge on an evolutionarily conserved gene subnetwork enriched for genes associated with the response to SCI by small-scale experiments, and whose expression is upregulated in a severity-dependent manner following injury and downregulated in functional recovery. Reference


Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps

We expanded GWAS discovery for type 2 diabetes (T2D) by combining data from 898,130 European-descent individuals (9% cases), after imputation to high-density reference panels.

With these data, we (i) extend the inventory of T2D-risk variants (243 loci, 135 newly implicated in T2D predisposition, comprising 403 distinct association signals); (ii) enrich discovery of lower-frequency risk alleles (80 index variants with minor allele frequency <5%, 14 with estimated allelic odds ratio >2); (iii) substantially improve fine-mapping of causal variants (at 51 signals, one variant accounted for >80% posterior probability of association (PPA)); (iv) extend fine-mapping through integration of tissue-specific epigenomic information (islet regulatory annotations extend the number of variants with PPA >80% to 73); (v) highlight validated therapeutic targets (18 genes with associations attributable to coding variants); and (vi) demonstrate enhanced potential for clinical translation (genome-wide chip heritability explains 18% of T2D risk; individuals in the extremes of a T2D polygenic risk score differ more than ninefold in prevalence). Reference
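As a rough illustration of what a posterior probability of association (PPA) means in fine-mapping, the sketch below converts per-variant log Bayes factors into PPAs and a credible set. It assumes a single causal variant per signal and equal prior weight for every variant, which is a common simplification and not necessarily the procedure used in the study; the annotation-informed step in (iv) would correspond to replacing the flat priors with annotation-derived ones. The variant names and Bayes factor values are hypothetical.

```python
import math


def posterior_probabilities(log_bayes_factors):
    """Convert per-variant log Bayes factors into PPAs that sum to 1.

    Assumes one causal variant per signal and a flat prior over variants.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_bayes_factors.values())
    weights = {v: math.exp(lbf - peak) for v, lbf in log_bayes_factors.items()}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


def credible_set(ppa, coverage=0.99):
    """Return the smallest set of variants whose cumulative PPA reaches `coverage`."""
    chosen, cumulative = [], 0.0
    for variant, p in sorted(ppa.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(variant)
        cumulative += p
        if cumulative >= coverage:
            break
    return chosen


# Hypothetical natural-log Bayes factors for four variants at one signal.
example_lbf = {"variant_a": 12.0, "variant_b": 9.5, "variant_c": 7.0, "variant_d": 2.0}
example_ppa = posterior_probabilities(example_lbf)
example_set = credible_set(example_ppa, coverage=0.99)
```

In this toy signal, `variant_a` alone carries more than 80% of the posterior probability, which is the kind of single-variant resolution counted in (iii).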


Modularity of genes involved in local adaptation to climate despite physical linkage

Linkage among genes experiencing different selection pressures can make natural selection less efficient. Theory predicts that when local adaptation is driven by complex and non-covarying stresses, increased linkage is favored for alleles with similar pleiotropic effects, with increased recombination favored among alleles with contrasting pleiotropic effects.

Here, we introduce a framework to test these predictions with a co-association network analysis, which clusters loci based on differing associations. We use this framework to study the genetic architecture of local adaptation to climate in lodgepole pine, Pinus contorta, based on associations with environments. Reference


Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris

Here we present a compendium of single-cell transcriptomic data from the model organism Mus musculus that comprises more than 100,000 cells from 20 organs and tissues.

These data represent a new resource for cell biology, reveal gene expression in poorly characterized cell populations and enable the direct and controlled comparison of gene expression in cell types that are shared between tissues, such as T lymphocytes and endothelial cells from different anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3′-end counting, enabled the survey of thousands of cells at relatively low coverage, whereas the other, full-length transcript analysis based on fluorescence-activated cell sorting, enabled the characterization of cell types with high sensitivity and coverage. The cumulative data provide the foundation for an atlas of transcriptomic cell biology. Reference


Genome-wide association study identifies two risk loci for tuberculosis in Han Chinese

Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis (Mtb), and remains a leading public health problem. Previous studies have identified host genetic factors that contribute to Mtb infection outcomes.

However, much of the heritability in TB remains unaccounted for and additional susceptibility loci most likely exist. We perform a multistage genome-wide association study on 2949 pulmonary TB patients and 5090 healthy controls (833 cases and 1220 controls were genome-wide genotyped) from the Han Chinese population. We discover two risk loci: 14q24.3 (rs12437118, P_combined = 1.72 × 10^−11, OR = 1.277, ESRRB) and 20p13 (rs6114027, P_combined = 2.37 × 10^−11, OR = 1.339, TGM6). Reference


Predicting microRNA targeting efficacy in Drosophila

MicroRNAs (miRNAs) are short regulatory RNAs that derive from hairpin precursors. Important for understanding the functional roles of miRNAs is the ability to predict the messenger RNA (mRNA) targets most responsive to each miRNA.

We acquired datasets suitable for the quantitative study of miRNA targeting in Drosophila. Analyses of these data expanded the types of regulatory sites known to be effective in flies, expanded the mRNA regions with detectable targeting to include 5′ untranslated regions, and identified features of site context that correlate with targeting efficacy in fly cells. Reference


Genetics of blood lipids among ~300,000 multi-ethnic participants of the Million Veteran Program

The Million Veteran Program (MVP) was established in 2011 as a national research initiative to determine how genetic variation influences the health of US military veterans.

Here we genotyped 312,571 MVP participants using a custom biobank array and linked the genetic data to laboratory and clinical phenotypes extracted from electronic health records covering a median of 10.0 years of follow-up. Among 297,626 veterans with at least one blood lipid measurement, including 57,332 black and 24,743 Hispanic participants, we tested up to around 32 million variants for association with lipid levels and identified 118 novel genome-wide significant loci after meta-analysis with data from the Global Lipids Genetics Consortium (total n > 600,000). Reference


Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects

Hundreds of thousands of human whole genome sequencing (WGS) datasets will be generated over the next few years. These data are more valuable in aggregate: joint analysis of genomes from many sources increases sample size and statistical power.

A central challenge for joint analysis is that different WGS data processing pipelines cause substantial differences in variant calling in combined datasets, necessitating computationally expensive reprocessing. This approach is no longer tenable given the scale of current studies and data volumes. Here, we define WGS data processing standards that allow different groups to produce functionally equivalent (FE) results, yet still innovate on data processing pipelines. Reference


A comprehensive analysis of 195 DNA methylomes reveals shared and cell-specific features of partially methylated domains

Partially methylated domains are extended regions in the genome exhibiting a reduced average DNA methylation level. They cover gene-poor and transcriptionally inactive regions and tend to be heterochromatic.

We present a comprehensive comparative analysis of partially methylated domains in human and mouse cells, to identify structural and functional features associated with them. Partially methylated domains are present in up to 75% of the genome in human and mouse cells irrespective of their tissue or cell origin. Each cell type has a distinct set of partially methylated domains, and genes expressed in such domains show a strong cell type effect. The methylation level varies between cell types with a more pronounced effect in differentiating and replicating cells. The lowest level of methylation is observed in highly proliferating and immortal cancer cell lines. Reference


Methylation of all BRCA1 copies predicts response to the PARP inhibitor rucaparib in ovarian carcinoma

Accurately identifying patients with high-grade serous ovarian carcinoma (HGSOC) who respond to poly(ADP-ribose) polymerase inhibitor (PARPi) therapy is of great clinical importance.

Here we show that quantitative BRCA1 methylation analysis provides new insight into PARPi response in preclinical models and ovarian cancer patients. The response of 12 HGSOC patient-derived xenografts (PDX) to the PARPi rucaparib was assessed, with variable dose-dependent responses observed in chemo-naive BRCA1/2-mutated PDX, and no responses in PDX lacking DNA repair pathway defects. Among BRCA1-methylated PDX, silencing of all BRCA1 copies predicts rucaparib response, whilst heterozygous methylation is associated with resistance. Reference


Epigenetic prediction of complex traits and death

Genome-wide DNA methylation (DNAm) profiling has allowed for the development of molecular predictors for a multitude of traits and diseases. Such predictors may be more accurate than the self-reported phenotypes and could have clinical applications.

Here, penalized regression models are used to develop DNAm predictors for ten modifiable health and lifestyle factors in a cohort of 5087 individuals. Using an independent test cohort comprising 895 individuals, the proportion of phenotypic variance explained in each trait is examined for DNAm-based and genetic predictors. Receiver operating characteristic curves are generated to investigate the predictive performance of DNAm-based predictors, using dichotomized phenotypes. The relationship between DNAm scores and all-cause mortality (n = 212 events) is assessed via Cox proportional hazards models. DNAm predictors for smoking, alcohol, education, and waist-to-hip ratio are shown to predict mortality in multivariate models. Reference
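To make the evaluation step concrete, the following minimal sketch computes the area under a ROC curve for a score against a dichotomized phenotype, using the pairwise-comparison definition (the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half). The function name, scores, and labels are hypothetical and are not taken from the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = case, 0 = control)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    # Count case/control pairs in which the case has the higher score.
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5  # ties contribute half a win
    return wins / (len(cases) * len(controls))


# Hypothetical predictor scores and dichotomized phenotype.
example_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
example_labels = [1, 1, 0, 1, 0]
example_auc = roc_auc(example_scores, example_labels)
```

An AUC of 0.5 corresponds to a predictor no better than chance, and 1.0 to perfect separation of cases from controls.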