Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations
The liver is the largest solid organ in the body and is critical for metabolic and immune functions. However, little is known about the cells that make up the human liver and its immune microenvironment.
Here we report a map of the cellular landscape of the human liver using single-cell RNA sequencing. We provide the transcriptional profiles of 8444 parenchymal and non-parenchymal cells obtained from the fractionation of fresh hepatic tissue from five human livers. Using gene expression patterns, flow cytometry, and immunohistochemical examinations, we identify 20 discrete cell populations of hepatocytes, endothelial cells, cholangiocytes, hepatic stellate cells, B cells, conventional and non-conventional T cells, NK-like cells, and distinct intrahepatic monocyte/macrophage populations. Reference
Genotype effects contribute to variation in longitudinal methylome patterns in older people
DNA methylation levels change along with age, but few studies have examined the variation in the rate of such changes between individuals.
We performed a longitudinal analysis to quantify the variation in the rate of change of DNA methylation between individuals using whole blood DNA methylation array profiles collected at 2–4 time points (N = 2894) in 954 individuals (67–90 years). After stringent quality control, we identified 1507 DNA methylation CpG sites (rsCpGs) with statistically significant variation in the rate of change (random slope) of DNA methylation among individuals in a mixed linear model analysis. Genes in the vicinity of these rsCpGs were found to be enriched in Homeobox transcription factors and the Wnt signalling pathway, both of which are related to ageing processes. Reference
Large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across 31 human phenotypes
Genome-wide association studies (GWAS) aim to identify genetic factors associated with phenotypes. Standard analyses test variants for associations individually.
However, variant-level associations are hard to identify and can be difficult to interpret biologically. Enrichment analyses help address both problems by targeting sets of biologically related variants. Here we introduce a new model-based enrichment method that requires only GWAS summary statistics. Applying this method to interrogate 4,026 gene sets in 31 human phenotypes identifies many previously-unreported enrichments, including enrichments of endochondral ossification pathway for height, NFAT-dependent transcription pathway for rheumatoid arthritis, brain-related genes for coronary artery disease, and liver-related genes for Alzheimer’s disease. Reference
SeqOthello: querying RNA-seq experiments at scale
We present SeqOthello, an ultra-fast and memory-efficient indexing structure to support arbitrary sequence query against large collections of RNA-seq experiments.
It takes SeqOthello only 5 min and 19.1 GB memory to conduct a global survey of 11,658 fusion events against 10,113 TCGA Pan-Cancer RNA-seq datasets. The query recovers 92.7% of tier-1 fusions curated by TCGA Fusion Gene Database and reveals 270 novel occurrences, all of which are present as tumor-specific. By providing a reference-free, alignment-free, and parameter-free sequence search system, SeqOthello will enable large-scale integrative studies using sequence-level data, an undertaking not previously practicable for many individual labs. Reference
Discovery of potential causative mutations in human coding and noncoding genome with the interactive software BasePlayer
Next-generation sequencing (NGS) is routinely applied in life sciences and clinical practice, but interpretation of the massive quantities of genomic data produced has become a critical challenge.
The genome-wide mutation analyses enabled by NGS have had a revolutionary impact in revealing the predisposing and driving DNA alterations behind a multitude of disorders. The workflow to identify causative mutations from NGS data, for example in cancer and rare diseases, commonly involves phases such as quality filtering, case–control comparison, genome annotation, and visual validation, which require multiple processing steps and usage of various tools and scripts. To this end, we have introduced an interactive and user-friendly multi-platform-compatible software, BasePlayer, which allows scientists, regardless of bioinformatics training, to carry out variant analysis in disease genetics settings. Reference
Holo-Seq: single-cell sequencing of holo-transcriptome
Current single-cell RNA-seq approaches are hindered by preamplification bias, loss of strand of origin information, and the inability to observe small-RNA and mRNA dual transcriptomes.
Here, we introduce a single-cell holo-transcriptome sequencing method (Holo-Seq) that overcomes all three hurdles. Holo-Seq matches bulk RNA-seq in quantitative accuracy and uniformity of coverage while retaining complete strand-of-origin information. Most importantly, Holo-Seq can simultaneously observe small RNAs and mRNAs in a single cell. Reference
Phenome-wide association studies across large population cohorts support drug target validation
Phenome-wide association studies (PheWAS) have been proposed as a possible aid in drug development through elucidating mechanisms of action, identifying alternative indications, or predicting adverse drug events (ADEs).
Here, we select 25 single nucleotide polymorphisms (SNPs) linked through genome-wide association studies (GWAS) to 19 candidate drug targets for common disease indications. We interrogate these SNPs by PheWAS in four large cohorts with extensive health information (23andMe, UK Biobank, FINRISK, CHOP) for association with 1683 binary endpoints in up to 697,815 individuals and conduct meta-analyses for 145 mapped disease endpoints. Our analyses replicate 75% of known GWAS associations (P < 0.05) and identify nine study-wide significant novel associations (of 71 with FDR < 0.1). Reference
Identifying loci affecting trait variability and detecting interactions in genome-wide association studies
Identification of genetic variants with effects on trait variability can provide insights into the biological mechanisms that control variation and can identify potential interactions. We propose a two-degree-of-freedom test for jointly testing mean and variance effects to identify such variants.
We implement the test in a linear mixed model, for which we provide an efficient algorithm and software. To focus on biologically interesting settings, we develop a test for dispersion effects, that is, variance effects not driven solely by mean effects when the trait distribution is non-normal. We apply our approach to body mass index in the subsample of the UK Biobank population with British ancestry (n ~408,000) and show that our approach can increase the power to detect associated loci. Reference
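The idea of a joint two-degree-of-freedom test can be illustrated with a much simpler setting than the linear mixed model described above. The sketch below is not the authors' method: it assumes a normally distributed trait, only two genotype groups (for example carriers and non-carriers), and no covariates or relatedness. It compares a null model with one shared mean and variance against an alternative with a separate mean and variance per group, which adds two parameters. The function name and the toy values are hypothetical.

```python
import math


def joint_mean_variance_lrt(group_a, group_b):
    """Likelihood-ratio test of equal mean AND equal variance in two groups.

    Simplifying assumptions (not from the source): normal trait, two groups,
    no covariates, non-zero variance within each group.
    Returns (statistic, p_value), with the statistic referred to a
    chi-square distribution with 2 degrees of freedom.
    """
    def mle_variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    n_a, n_b = len(group_a), len(group_b)
    pooled = list(group_a) + list(group_b)

    # 2 * (logLik_alternative - logLik_null) for normal models at their MLEs.
    stat = ((n_a + n_b) * math.log(mle_variance(pooled))
            - n_a * math.log(mle_variance(group_a))
            - n_b * math.log(mle_variance(group_b)))
    stat = max(stat, 0.0)  # guard against tiny negative rounding error

    # Survival function of chi-square with 2 df has the closed form exp(-x/2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


if __name__ == "__main__":
    non_carriers = [0.0, 0.1, -0.1, 0.05, -0.05]
    carriers = [5.0, 9.0, 1.0, 7.0, 3.0]
    print(joint_mean_variance_lrt(non_carriers, carriers))
```

Because the statistic responds to a shift in either the mean or the variance, a locus can be detected even when only one of the two effects is present, which is the motivation the abstract gives for testing them jointly.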
Single-Cell Analysis of Quiescent HIV Infection Reveals Host Transcriptional Profiles that Regulate Proviral Latency
A detailed understanding of the mechanisms that establish or maintain the latent reservoir of HIV will guide approaches to eliminate persistent infection.
We used a cell line and primary cell models of HIV latency to investigate viral RNA (vRNA) expression and the role of the host transcriptome using single-cell approaches. Single-cell vRNA quantitation identified distinct populations of cells expressing various levels of vRNA, including completely silent populations. Strikingly, single-cell RNA-seq of latently infected primary cells demonstrated that HIV downregulation occurred in diverse transcriptomic environments but was significantly associated with expression of a specific set of cellular genes. In particular, latency was more frequent in cells expressing a transcriptional signature that included markers of naive and central memory T cells. These data reveal that expression of HIV proviruses within the latent reservoir is influenced by the host cell transcriptional program. Therapeutic modulation of these programs may reverse or enforce HIV latency. Reference
CRISPhieRmix: a hierarchical mixture model for CRISPR pooled screens
Pooled CRISPR screens allow researchers to interrogate genetic causes of complex phenotypes at the genome-wide scale and promise higher specificity and sensitivity compared to competing technologies.
Unfortunately, two problems exist, particularly for CRISPRi/a screens: variability in guide efficiency and large rare off-target effects. We present a method, CRISPhieRmix, that resolves these issues by using a hierarchical mixture model with a broad-tailed null distribution. We show that CRISPhieRmix allows for more accurate and powerful inferences in large-scale pooled CRISPRi/a screens. We discuss key issues in the analysis and design of screens, particularly the number of guides needed for faithful full discovery. Reference
The UK Biobank resource with deep phenotyping and genomic data
The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment.
The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Reference
PHLI-seq: constructing and visualizing cancer genomic maps in 3D by phenotype-based high-throughput laser-aided isolation and sequencing
Spatial mapping of genomic data to tissue context in a high-throughput and high-resolution manner has been challenging due to technical limitations.
Here, we describe PHLI-seq, a novel approach that enables high-throughput isolation and genome-wide sequence analysis of single cells or small numbers of cells to construct genomic maps within cancer tissue in relation to the images or phenotypes of the cells. By applying PHLI-seq, we reveal the heterogeneity of breast cancer tissues at a high resolution and map the genomic landscape of the cells to their corresponding spatial locations and phenotypes in the 3D tumor mass. Reference
Integrated systems analysis reveals conserved gene networks underlying response to spinal cord injury
Spinal cord injury (SCI) is a devastating neurological condition for which there are currently no effective treatment options to restore function.
A major obstacle to the development of new therapies is our fragmentary understanding of the coordinated pathophysiological processes triggered by damage to the human spinal cord. Here, we describe a systems biology approach to integrate decades of small-scale experiments with unbiased, genome-wide gene expression from the human spinal cord, revealing a gene regulatory network signature of the pathophysiological response to SCI. Our integrative analyses converge on an evolutionarily conserved gene subnetwork enriched for genes associated with the response to SCI by small-scale experiments, and whose expression is upregulated in a severity-dependent manner following injury and downregulated in functional recovery. Reference
Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps
We expanded GWAS discovery for type 2 diabetes (T2D) by combining data from 898,130 European-descent individuals (9% cases), after imputation to high-density reference panels.
With these data, we (i) extend the inventory of T2D-risk variants (243 loci, 135 newly implicated in T2D predisposition, comprising 403 distinct association signals); (ii) enrich discovery of lower-frequency risk alleles (80 index variants with minor allele frequency <5%, 14 with estimated allelic odds ratio >2); (iii) substantially improve fine-mapping of causal variants (at 51 signals, one variant accounted for >80% posterior probability of association (PPA)); (iv) extend fine-mapping through integration of tissue-specific epigenomic information (islet regulatory annotations extend the number of variants with PPA >80% to 73); (v) highlight validated therapeutic targets (18 genes with associations attributable to coding variants); and (vi) demonstrate enhanced potential for clinical translation (genome-wide chip heritability explains 18% of T2D risk; individuals in the extremes of a T2D polygenic risk score differ more than ninefold in prevalence). Reference
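The posterior probability of association (PPA) used in points (iii) and (iv) can be sketched as follows. This is a minimal illustration, not the study's pipeline: it assumes exactly one causal variant per signal, takes per-variant log Bayes factors as given, and treats annotation information as a simple multiplicative prior weight. The function names, the Bayes factors, and the weights are all hypothetical.

```python
import math


def posterior_probabilities(log_bayes_factors, prior_weights=None):
    """PPA for each variant in one signal, assuming a single causal variant.

    prior_weights are hypothetical positive weights (for example, larger for
    variants inside a regulatory annotation); the default is a flat prior.
    """
    if prior_weights is None:
        prior_weights = [1.0] * len(log_bayes_factors)
    log_terms = [lbf + math.log(w)
                 for lbf, w in zip(log_bayes_factors, prior_weights)]
    peak = max(log_terms)  # subtract the maximum for numerical stability
    unnormalised = [math.exp(t - peak) for t in log_terms]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def credible_set(ppa, coverage=0.99):
    """Indices of the smallest set of variants whose PPA sums to >= coverage."""
    order = sorted(range(len(ppa)), key=lambda i: ppa[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += ppa[i]
        if cumulative >= coverage:
            break
    return chosen


if __name__ == "__main__":
    log_bf = [10.0, 9.0, 5.0]                 # made-up values for three variants
    flat = posterior_probabilities(log_bf)
    annotated = posterior_probabilities(log_bf, prior_weights=[3.0, 1.0, 1.0])
    print(flat[0], annotated[0])
    print(credible_set(flat))
```

In this toy signal the lead variant stays below 80% PPA under a flat prior and rises above it once the annotation weight is applied, which mirrors how annotation-informed priors can increase the number of variants passing a PPA threshold.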
Modularity of genes involved in local adaptation to climate despite physical linkage
Linkage among genes experiencing different selection pressures can make natural selection less efficient. Theory predicts that when local adaptation is driven by complex and non-covarying stresses, increased linkage is favored for alleles with similar pleiotropic effects, with increased recombination favored among alleles with contrasting pleiotropic effects.
Here, we introduce a framework to test these predictions with a co-association network analysis, which clusters loci based on differing associations. We use this framework to study the genetic architecture of local adaptation to climate in lodgepole pine, Pinus contorta, based on associations with environments. Reference
Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris
Here we present a compendium of single-cell transcriptomic data from the model organism Mus musculus that comprises more than 100,000 cells from 20 organs and tissues.
These data represent a new resource for cell biology, reveal gene expression in poorly characterized cell populations and enable the direct and controlled comparison of gene expression in cell types that are shared between tissues, such as T lymphocytes and endothelial cells from different anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3′-end counting, enabled the survey of thousands of cells at relatively low coverage, whereas the other, full-length transcript analysis based on fluorescence-activated cell sorting, enabled the characterization of cell types with high sensitivity and coverage. The cumulative data provide the foundation for an atlas of transcriptomic cell biology. Reference
Genome-wide association study identifies two risk loci for tuberculosis in Han Chinese
Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis (Mtb), and remains a leading public health problem. Previous studies have identified host genetic factors that contribute to Mtb infection outcomes.
However, much of the heritability in TB remains unaccounted for and additional susceptibility loci most likely exist. We perform a multistage genome-wide association study on 2949 pulmonary TB patients and 5090 healthy controls (833 cases and 1220 controls were genome-wide genotyped) from the Han Chinese population. We discover two risk loci: 14q24.3 (rs12437118, P_combined = 1.72 × 10^-11, OR = 1.277, ESRRB) and 20p13 (rs6114027, P_combined = 2.37 × 10^-11, OR = 1.339, TGM6). Reference
Predicting microRNA targeting efficacy in Drosophila
MicroRNAs (miRNAs) are short regulatory RNAs that derive from hairpin precursors. Important for understanding the functional roles of miRNAs is the ability to predict the messenger RNA (mRNA) targets most responsive to each miRNA.
We acquired datasets suitable for the quantitative study of miRNA targeting in Drosophila. Analyses of these data expanded the types of regulatory sites known to be effective in flies, expanded the mRNA regions with detectable targeting to include 5′ untranslated regions, and identified features of site context that correlate with targeting efficacy in fly cells. Reference
Genetics of blood lipids among ~300,000 multi-ethnic participants of the Million Veteran Program
The Million Veteran Program (MVP) was established in 2011 as a national research initiative to determine how genetic variation influences the health of US military veterans.
Here we genotyped 312,571 MVP participants using a custom biobank array and linked the genetic data to laboratory and clinical phenotypes extracted from electronic health records covering a median of 10.0 years of follow-up. Among 297,626 veterans with at least one blood lipid measurement, including 57,332 black and 24,743 Hispanic participants, we tested up to around 32 million variants for association with lipid levels and identified 118 novel genome-wide significant loci after meta-analysis with data from the Global Lipids Genetics Consortium (total n > 600,000). Reference
Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects
Hundreds of thousands of human whole genome sequencing (WGS) datasets will be generated over the next few years. These data are more valuable in aggregate: joint analysis of genomes from many sources increases sample size and statistical power.
A central challenge for joint analysis is that different WGS data processing pipelines cause substantial differences in variant calling in combined datasets, necessitating computationally expensive reprocessing. This approach is no longer tenable given the scale of current studies and data volumes. Here, we define WGS data processing standards that allow different groups to produce functionally equivalent (FE) results, yet still innovate on data processing pipelines. Reference
A comprehensive analysis of 195 DNA methylomes reveals shared and cell-specific features of partially methylated domains
Partially methylated domains are extended regions in the genome exhibiting a reduced average DNA methylation level. They cover gene-poor and transcriptionally inactive regions and tend to be heterochromatic.
We present a comprehensive comparative analysis of partially methylated domains in human and mouse cells, to identify structural and functional features associated with them. Partially methylated domains are present in up to 75% of the genome in human and mouse cells irrespective of their tissue or cell origin. Each cell type has a distinct set of partially methylated domains, and genes expressed in such domains show a strong cell type effect. The methylation level varies between cell types with a more pronounced effect in differentiating and replicating cells. The lowest level of methylation is observed in highly proliferating and immortal cancer cell lines. Reference
Methylation of all BRCA1 copies predicts response to the PARP inhibitor rucaparib in ovarian carcinoma
Accurately identifying patients with high-grade serous ovarian carcinoma (HGSOC) who respond to poly(ADP-ribose) polymerase inhibitor (PARPi) therapy is of great clinical importance.
Here we show that quantitative BRCA1 methylation analysis provides new insight into PARPi response in preclinical models and ovarian cancer patients. The response of 12 HGSOC patient-derived xenografts (PDX) to the PARPi rucaparib was assessed, with variable dose-dependent responses observed in chemo-naive BRCA1/2-mutated PDX, and no responses in PDX lacking DNA repair pathway defects. Among BRCA1-methylated PDX, silencing of all BRCA1 copies predicts rucaparib response, whilst heterozygous methylation is associated with resistance. Reference
Epigenetic prediction of complex traits and death
Genome-wide DNA methylation (DNAm) profiling has allowed for the development of molecular predictors for a multitude of traits and diseases. Such predictors may be more accurate than the self-reported phenotypes and could have clinical applications.
Here, penalized regression models are used to develop DNAm predictors for ten modifiable health and lifestyle factors in a cohort of 5087 individuals. Using an independent test cohort comprising 895 individuals, the proportion of phenotypic variance explained in each trait is examined for DNAm-based and genetic predictors. Receiver operator characteristic curves are generated to investigate the predictive performance of DNAm-based predictors, using dichotomized phenotypes. The relationship between DNAm scores and all-cause mortality (n = 212 events) is assessed via Cox proportional hazards models. DNAm predictors for smoking, alcohol, education, and waist-to-hip ratio are shown to predict mortality in multivariate models. Reference
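The area under a receiver operating characteristic curve for a dichotomized phenotype can be computed directly from predictor scores. The sketch below uses the rank-based (Mann-Whitney) definition: the AUC is the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half. The function name and the example scores are hypothetical and are not taken from the study.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels must be 1 for cases and 0 for controls; ties count as 0.5.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    correct = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                correct += 1.0
            elif case_score == control_score:
                correct += 0.5
    return correct / (len(cases) * len(controls))


if __name__ == "__main__":
    # Made-up predictor scores for two cases and two controls.
    print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

The pairwise loop is quadratic in sample size; it is written this way for clarity, and a sort-based rank sum would give the same value for larger cohorts.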
Parallel selection on a dormancy gene during domestication of crops from multiple families
Domesticated species often exhibit convergent phenotypic evolution, termed the domestication syndrome, of which loss of seed dormancy is a component.
To date, dormancy genes that contribute to parallel domestication across different families have not been reported. Here, we cloned the classical stay-green G gene from soybean and found that it controls seed dormancy and showed evidence of selection during soybean domestication. Moreover, orthologs in rice and tomato also showed evidence of selection during domestication. Analysis of transgenic plants confirmed that orthologs of G had conserved functions in controlling seed dormancy in soybean, rice, and Arabidopsis. Reference
Microenvironmental niche divergence shapes BRCA1-dysregulated ovarian cancer morphological plasticity
How tumor microenvironmental forces shape plasticity of cancer cell morphology is poorly understood. Here, we conduct automated histology image and spatial statistical analyses in 514 high grade serous ovarian samples to define cancer morphological diversification within the spatial context of the microenvironment.
Tumor spatial zones, where cancer cell nuclei diversify in shape, are mapped in each tumor. Integration of this spatially explicit analysis with omics and clinical data reveals a relationship between morphological diversification and the dysregulation of DNA repair, loss of nuclear integrity, and increased disease mortality. Within the Immunoreactive subtype, spatial analysis further reveals significantly lower lymphocytic infiltration within diversified zones compared with other tumor zones, suggesting that even immune-hot tumors contain cells capable of immune escape. Reference
ABLE: blockwise site frequency spectra for inferring complex population histories and recombination
We introduce ABLE (Approximate Blockwise Likelihood Estimation), a novel simulation-based composite likelihood method that uses the blockwise site frequency spectrum to jointly infer past demography and recombination.
ABLE is explicitly designed for a wide variety of data from unphased diploid genomes to genome-wide multi-locus data (for example, RADSeq) and can also accommodate arbitrarily large samples. We use simulations to demonstrate the accuracy of this method to infer complex histories of divergence and gene flow and reanalyze whole genome data from two species of orangutan. Reference
Mutational processes shape the landscape of TP53 mutations in human cancer
Unlike most tumor suppressor genes, the most common genetic alterations in tumor protein p53 (TP53) are missense mutations. Mutant p53 protein is often abundantly expressed in cancers and specific allelic variants exhibit dominant-negative or gain-of-function activities in experimental models.
To gain a systematic view of p53 function, we interrogated loss-of-function screens conducted in hundreds of human cancer cell lines and performed TP53 saturation mutagenesis screens in an isogenic pair of TP53 wild-type and null cell lines. We found that loss or dominant-negative inhibition of wild-type p53 function reliably enhanced cellular fitness. By integrating these data with the Catalog of Somatic Mutations in Cancer (COSMIC) mutational signatures database, we developed a statistical model that describes the TP53 mutational spectrum as a function of the baseline probability of acquiring each mutation and the fitness advantage conferred by attenuation of p53 activity. Reference
BayesCCE: a Bayesian framework for estimating cell-type composition from DNA methylation without the need for methylation reference
We introduce a Bayesian semi-supervised method for estimating cell counts from DNA methylation by leveraging an easily obtainable prior knowledge on the cell-type composition distribution of the studied tissue.
We show mathematically and empirically that alternative methods which attempt to infer cell counts without methylation reference only capture linear combinations of cell counts rather than provide one component per cell type. Our approach allows the construction of components such that each component corresponds to a single cell type, and provides a new opportunity to investigate cell compositions in genomic studies of tissues for which it was not possible before. Reference
Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning
Visual inspection of histopathology slides is one of the main methods used by pathologists to assess the stage, type and subtype of lung tumors.
Adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC) are the most prevalent subtypes of lung cancer, and their distinction requires visual inspection by an experienced pathologist. In this study, we trained a deep convolutional neural network (inception v3) on whole-slide images obtained from The Cancer Genome Atlas to accurately and automatically classify them into LUAD, LUSC or normal lung tissue. The performance of our method is comparable to that of pathologists, with an average area under the curve (AUC) of 0.97. Our model was validated on independent datasets of frozen tissues, formalin-fixed paraffin-embedded tissues and biopsies. Reference
Robust single-cell DNA methylome profiling with snmC-seq2
Single-cell DNA methylome profiling has enabled the study of epigenomic heterogeneity in complex tissues and during cellular reprogramming.
However, broader applications of the method have been impeded by the modest quality of sequencing libraries. Here we report snmC-seq2, which provides improved read mapping, reduced artifactual reads, enhanced throughput, as well as increased library complexity and coverage uniformity compared to snmC-seq. snmC-seq2 is an efficient strategy suited for large-scale single-cell epigenomic studies. Reference
Genomic history of the Sardinian population
The population of the Mediterranean island of Sardinia has made important contributions to genome-wide association studies of complex disease traits and, based on ancient DNA studies of mainland Europe, Sardinia is hypothesized to be a unique refuge for early Neolithic ancestry.
To provide new insights on the genetic history of this flagship population, we analyzed 3,514 whole-genome sequenced individuals from Sardinia. Sardinian samples show elevated levels of shared ancestry with Basque individuals, especially samples from the more historically isolated regions of Sardinia. Our analysis also uniquely illuminates how levels of genetic similarity with mainland ancient DNA samples vary subtly across the island. Together, our results indicate that within-island substructure and sex-biased processes have substantially impacted the genetic history of Sardinia. Reference
Origin of exon skipping-rich transcriptomes in animals driven by evolution of gene architecture
Alternative splicing, particularly through intron retention and exon skipping, is a major layer of pre-translational regulation in eukaryotes.
While intron retention is believed to be the most prevalent mode across non-animal eukaryotes, animals have unusually high rates of exon skipping. We used RNA-seq data to quantify exon skipping and intron retention frequencies across 65 eukaryotic species, with particular focus on early branching animals and unicellular holozoans. We found that only bilaterians have significantly increased their exon skipping frequencies compared to all other eukaryotic groups. Reference
Exome-wide analysis of bi-allelic alterations identifies a Lynch phenotype in The Cancer Genome Atlas
Cancer susceptibility germline variants generally require somatic alteration of the remaining allele to drive oncogenesis and, in some cases, tumor mutational profiles.
Whether combined germline and somatic bi-allelic alterations are universally required for germline variation to influence tumor mutational profile is unclear. Here, we performed an exome-wide analysis of the frequency and functional effect of bi-allelic alterations in The Cancer Genome Atlas (TCGA). We integrated germline variant, somatic mutation, somatic methylation, and somatic copy number loss data from 7790 individuals from TCGA to identify germline and somatic bi-allelic alterations in all coding genes. Reference
The gut microbiota promotes hepatic fatty acid desaturation and elongation in mice
Interactions between the gut microbial ecosystem and host lipid homeostasis are highly relevant to host physiology and metabolic diseases.
We present a comprehensive multi-omics view of the effect of intestinal microbial colonization on hepatic lipid metabolism, integrating transcriptomic, proteomic, phosphoproteomic, and lipidomic analyses of liver and plasma samples from germ-free and specific pathogen-free mice. Microbes induce monounsaturated fatty acid generation by stearoyl-CoA desaturase 1 and polyunsaturated fatty acid elongation by fatty acid elongase 5, leading to significant alterations in glycerophospholipid acyl-chain profiles. Reference
Accurate classification of BRCA1 variants with saturation genome editing
Variants of uncertain significance fundamentally limit the clinical utility of genetic information. The challenge they pose is epitomized by BRCA1, a tumour suppressor gene in which germline loss-of-function variants predispose women to breast and ovarian cancer.
Although BRCA1 has been sequenced in millions of women, the risk associated with most newly observed variants cannot be definitively assigned. Here we use saturation genome editing to assay 96.5% of all possible single-nucleotide variants (SNVs) in 13 exons that encode functionally critical domains of BRCA1. Functional effects for nearly 4,000 SNVs are bimodally distributed and almost perfectly concordant with established assessments of pathogenicity. Reference
DNA methylation footprints during soybean domestication and improvement
In addition to genetic variation, epigenetic variation plays an important role in determining various biological processes.
To understand the impact of epigenetics on crop domestication, we investigate the variation of DNA methylation during soybean domestication and improvement by whole-genome bisulfite sequencing of 45 soybean accessions, including wild soybeans, landraces, and cultivars. Through methylomic analysis, we identify 5412 differentially methylated regions (DMRs). These DMRs exhibit characteristics distinct from those of genetically selected regions. In particular, they have significantly higher genetic diversity. Reference
Integrative detection and analysis of structural variation in cancer genomes
Structural variants (SVs) can contribute to oncogenesis through a variety of mechanisms. Despite their importance, the identification of SVs in cancer genomes remains challenging.
Here, we present a framework that integrates optical mapping, high-throughput chromosome conformation capture (Hi-C), and whole-genome sequencing to systematically detect SVs in a variety of normal or cancer samples and cell lines. We identify the unique strengths of each method and demonstrate that only integrative approaches can comprehensively identify SVs in the genome. By combining Hi-C and optical mapping, we resolve complex SVs and phase multiple SV events to a single haplotype. Reference
Exome-wide analysis identifies three low-frequency missense variants associated with pancreatic cancer risk in Chinese populations
Germline coding variants have not been systematically investigated for pancreatic ductal adenocarcinoma (PDAC).
Here we report an exome-wide investigation using the Illumina Human Exome Beadchip with 943 PDAC cases and 3908 controls in the Chinese population, followed by two independent replication samples including 2142 cases and 4697 controls. We identify three low-frequency missense variants associated with PDAC risk: rs34309238 in PKN1 (OR = 1.77, 95% CI: 1.48–2.12, P = 5.35 × 10⁻¹⁰), rs2242241 in DOK2 (OR = 1.85, 95% CI: 1.50–2.27, P = 4.34 × 10⁻⁹), and rs183117027 in APOB (OR = 2.34, 95% CI: 1.72–3.16, P = 4.21 × 10⁻⁸). Functional analyses show that the PKN1 rs34309238 variant significantly increases the level of phosphorylated PKN1 and thus enhances PDAC cell proliferation by phosphorylating and activating the FAK/PI3K/AKT pathway. Reference
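The reported odds ratios, confidence intervals, and P values can be loosely cross-checked against one another. The sketch below assumes the interval is a standard Wald interval on the log-odds scale (an assumption; the abstract does not state how the interval was computed), recovers the standard error from the interval width, and derives an approximate two-sided P value. The function name `z_and_p_from_or_ci` is hypothetical.

```python
import math


def z_and_p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Approximate z statistic and two-sided P value from an OR and its 95% CI.

    Assumes a Wald interval on the log scale: log(OR) +/- z_crit * SE.
    """
    log_or = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_or / se
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Values reported for rs34309238 in PKN1.
z, p = z_and_p_from_or_ci(1.77, 1.48, 2.12)
print(f"z = {z:.2f}, approximate two-sided P = {p:.2e}")
```

Because the inputs are rounded to two decimals, the result should only be expected to agree with the reported P value in order of magnitude.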
Increased DNA methylation variability in rheumatoid arthritis-discordant monozygotic twins
Rheumatoid arthritis is a common autoimmune disorder influenced by both genetic and environmental factors.
Epigenome-wide association studies can identify environmentally mediated epigenetic changes such as altered DNA methylation, which may also be influenced by genetic factors. To investigate possible contributions of DNA methylation to the aetiology of rheumatoid arthritis with minimum confounding genetic heterogeneity, we investigated genome-wide DNA methylation in disease-discordant monozygotic twin pairs. Reference
Route of immunization defines multiple mechanisms of vaccine-mediated protection against SIV
Antibodies are the primary correlate of protection for most licensed vaccines; however, their mechanisms of protection may vary, ranging from physical blockade to clearance via the recruitment of innate immunity.
Here, we uncover striking functional diversity in vaccine-induced antibodies that is driven by immunization site and is associated with reduced risk of SIV infection in nonhuman primates. While equivalent levels of protection were observed following intramuscular (IM) and aerosol (AE) immunization with an otherwise identical DNA prime–Ad5 boost regimen, reduced risk of infection was associated with IgG-driven antibody-dependent monocyte-mediated phagocytosis in the IM vaccinees, but with vaccine-elicited IgA-driven neutrophil-mediated phagocytosis in AE-immunized animals. Thus, although route-independent correlates indicate a critical role for phagocytic Fc-effector activity in protection from SIV, the site of immunization may drive this Fc activity via distinct innate effector cells and antibody isotypes. Reference
Decreasing miRNA sequencing bias using a single adapter and circularization approach
The ability to accurately quantify all the microRNAs (miRNAs) in a sample is important for understanding miRNA biology and for development of new biomarkers and therapeutic targets.
We develop a new method for preparing miRNA sequencing libraries, RealSeq®-AC, that involves ligating the miRNAs with a single adapter and circularizing the ligation products. When compared to other methods, RealSeq®-AC provides greatly reduced miRNA sequencing bias and allows the identification of the largest variety of miRNAs in biological samples. This reduced bias also allows robust quantification of miRNAs present in samples across a wide range of RNA input levels. Reference
Co-activation of super-enhancer-driven CCAT1 by TP63 and SOX2 promotes squamous cancer progression
Squamous cell carcinomas (SCCs) are aggressive malignancies. A previous report demonstrated that the master transcription factors (TFs) TP63 and SOX2 exhibit overlapping genomic occupancy in SCCs.
However, the functional consequence of their frequent co-localization at super-enhancers remains incompletely understood. Here, epigenomic profiling of different types of SCCs reveals that TP63 and SOX2 cooperatively and lineage-specifically regulate expression of the long non-coding RNA (lncRNA) CCAT1 through activation of its super-enhancers and promoter. Reference
An orthogonal proteomic survey uncovers novel Zika virus host factors
Zika virus (ZIKV) has recently emerged as a global health concern owing to its widespread diffusion and its association with severe neurological symptoms and microcephaly in newborns.
However, the molecular mechanisms that are responsible for the pathogenicity of ZIKV remain largely unknown. Here we use human neural progenitor cells and the neuronal cell line SK-N-BE2 in an integrated proteomics approach to characterize the cellular responses to viral infection at the proteome and phosphoproteome level, and use affinity proteomics to identify cellular targets of ZIKV proteins. Using this approach, we identify 386 ZIKV-interacting proteins, ZIKV-specific and pan-flaviviral activities as well as host factors with known functions in neuronal development, retinal defects and infertility. Reference
Comparative transcriptomic analysis of hematopoietic system between human and mouse by Microwell-seq
The classical model of hematopoiesis is a branched tree, rooted in the long-term hematopoietic stem cell (LT-HSC) and followed by multipotent, oligopotent, and unipotent progenitor stages.
However, very few studies have used systematic methods to investigate the heterogeneity of this population. A cross-species comparison of the hematopoietic hierarchy is also lacking. Here, using Microwell-seq, a high-throughput and low-cost scRNA-seq platform [4], and a computational strategy based on canonical correlation analysis [5], we conducted a comparative transcriptomic analysis of the hematopoietic hierarchy in human and mouse. Reference
Detecting repeated cancer evolution from multi-region tumor sequencing data
Recurrent successions of genomic changes, both within and between patients, reflect repeated evolutionary processes that are valuable for the anticipation of cancer progression.
Multi-region sequencing allows the temporal order of some genomic changes in a tumor to be inferred, but the robust identification of repeated evolution across patients remains a challenge. We developed a machine-learning method based on transfer learning that allowed us to overcome the stochastic effects of cancer evolution and noise in data and identified hidden evolutionary patterns in cancer cohorts. When applied to multi-region sequencing datasets from lung, breast, renal, and colorectal cancer (768 samples from 178 patients), our method detected repeated evolutionary trajectories in subgroups of patients, which were reproduced in single-sample cohorts (n = 2,935). Reference
Comprehensive antibiotic-linked mutation assessment by resistance mutation sequencing (RM-seq)
Mutation acquisition is a major mechanism of bacterial antibiotic resistance that remains insufficiently characterised.
Here we present RM-seq, a new amplicon-based deep sequencing workflow based on a molecular barcoding technique adapted from Low Error Amplicon sequencing (LEA-seq). RM-seq allows detection and functional assessment of mutational resistance at high throughput from mixed bacterial populations. The sensitive detection of very low-frequency resistant sub-populations permits characterisation of antibiotic-linked mutational repertoires in vitro and detection of rare resistant populations during infections. RM-seq will facilitate comprehensive detection, characterisation and surveillance of resistant bacterial populations. Reference
A study paradigm integrating prospective epidemiologic cohorts and electronic health records to identify disease biomarkers
Defining the full spectrum of human disease associated with a biomarker is necessary to advance the biomarker into clinical practice.
We hypothesize that associating biomarker measurements with electronic health record (EHR) populations based on shared genetic architectures would establish the clinical epidemiology of the biomarker. We use Bayesian sparse linear mixed modeling to calculate SNP weightings for 53 biomarkers from the Atherosclerosis Risk in Communities study. We use the SNP weightings to compute predicted biomarker values in an EHR population and test associations with 1139 diagnoses. Here we report 116 associations meeting a Bonferroni level of significance. Reference
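To give a sense of scale for the Bonferroni criterion mentioned above, the sketch below computes the per-test threshold under two assumptions that the abstract does not state: a family-wise alpha of 0.05, and one test for every biomarker-diagnosis pair. The helper `passes_bonferroni` is hypothetical.

```python
ALPHA = 0.05          # assumed family-wise error rate (not stated in the abstract)
N_BIOMARKERS = 53     # from the abstract
N_DIAGNOSES = 1139    # from the abstract

# Assumption: every biomarker is tested against every diagnosis.
n_tests = N_BIOMARKERS * N_DIAGNOSES
bonferroni_threshold = ALPHA / n_tests


def passes_bonferroni(p_value, threshold=bonferroni_threshold):
    """Return True if a single test's P value meets the corrected threshold."""
    return p_value < threshold


print(f"{n_tests} tests, per-test threshold = {bonferroni_threshold:.2e}")
```

If the study corrected over a different number of tests, the threshold would change accordingly.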
MetaCyto: A Tool for Automated Meta-analysis of Mass and Flow Cytometry Data
While meta-analysis has demonstrated increased statistical power and more robust estimations in studies, the application of this commonly accepted methodology to cytometry data has been challenging. Different cytometry studies often involve diverse sets of markers.
Moreover, the detected values of the same marker are inconsistent between studies due to different experimental designs and cytometer configurations. As a result, the cell subsets identified by existing auto-gating methods cannot be directly compared across studies. We developed MetaCyto for automated meta-analysis of both flow and mass cytometry (CyTOF) data. By combining clustering methods with a silhouette scanning method, MetaCyto is able to identify commonly labeled cell subsets across studies, thus enabling meta-analysis. Applying MetaCyto across a set of ten heterogeneous cytometry studies totaling 2,926 samples enabled us to identify multiple cell populations exhibiting differences in abundance between demographic groups. Reference
XCMS-MRM and METLIN-MRM: a cloud library and public resource for targeted analysis of small molecules
We report XCMS-MRM and METLIN-MRM (http://xcmsonline-mrm.scripps.edu/ and http://metlin.scripps.edu/), a cloud-based data-analysis platform and a public multiple-reaction monitoring (MRM) transition repository for small-molecule quantitative tandem mass spectrometry.
This platform provides MRM transitions for more than 15,500 molecules and facilitates data sharing across different instruments and laboratories. Reference
A multi-cohort study of the immune factors associated with M. tuberculosis infection outcomes
Most infections with Mycobacterium tuberculosis (Mtb) manifest as a clinically asymptomatic, contained state, known as latent tuberculosis infection, that affects approximately one-quarter of the global population. Although fewer than one in ten individuals eventually progress to active disease, tuberculosis is a leading cause of death from infectious disease worldwide.
Despite intense efforts, immune factors that influence the infection outcomes remain poorly defined. Here we used integrated analyses of multiple cohorts to identify stage-specific host responses to Mtb infection. First, using high-dimensional mass cytometry analyses and functional assays of a cohort of South African adolescents, we show that latent tuberculosis is associated with enhanced cytotoxic responses, which are mostly mediated by CD16 (also known as FcγRIIIa) and natural killer cells, and continuous inflammation coupled with immune deviations in both T and B cell compartments. Reference
Interaction between the microbiome and TP53 in human lung cancer
Lung cancer is the leading cancer diagnosis worldwide and the number one cause of cancer deaths. Exposure to cigarette smoke, the primary risk factor in lung cancer, reduces epithelial barrier integrity and increases susceptibility to infections.
Herein, we hypothesize that somatic mutations together with cigarette smoke generate a dysbiotic microbiota that is associated with lung carcinogenesis. Using lung tissue from 33 controls and 143 cancer cases, we conduct 16S ribosomal RNA (rRNA) bacterial gene sequencing, with RNA-sequencing data from lung cancer cases in The Cancer Genome Atlas serving as the validation cohort. Reference
HiGlass: web-based visual exploration and analysis of genome interaction maps
We present HiGlass, an open source visualization tool built on web technologies that provides a rich interface for rapid, multiplex, and multiscale navigation of 2D genomic maps alongside 1D genomic tracks, allowing users to combine various data types, synchronize multiple visualization modalities, and share fully customizable views with others.
We demonstrate its utility in exploring different experimental conditions, comparing the results of analyses, and creating interactive snapshots to share with collaborators and the broader public. Reference
Insights from the annotated wheat genome
Wheat is one of the major sources of food for much of the world. However, because bread wheat’s genome is a large hybrid mix of three separate subgenomes, it has been difficult to produce a high-quality reference sequence.
Using recent advances in sequencing, the International Wheat Genome Sequencing Consortium presents an annotated reference genome with a detailed analysis of gene content among subgenomes and the structural organization for all the chromosomes. An annotated reference sequence representing the hexaploid bread wheat genome in the form of 21 chromosome-like sequence assemblies has now been delivered, giving access to 107,891 high-confidence genes, including their genomic context of regulatory sequences. This assembly enabled the discovery of tissue- and developmental stage–related gene coexpression networks using a transcriptome atlas representing all stages of wheat development. Reference
Robust prediction of response to immune checkpoint blockade therapy in metastatic melanoma
Immune checkpoint blockade (ICB) therapy provides remarkable clinical gains and has been very successful in the treatment of melanoma.
However, only a subset of patients with advanced tumors currently benefit from ICB therapies, which at times incur considerable side effects and costs. Constructing predictors of patient response has remained a serious challenge because of the complexity of the immune response and the shortage of large cohorts of ICB-treated patients that include both ‘omics’ and response data. Here we build the immuno-predictive score (IMPRES), a predictor of ICB response in melanoma that encompasses 15 pairwise transcriptomic relations between immune checkpoint genes. Reference
Selective gene dependencies in MYCN-amplified neuroblastoma
Childhood high-risk neuroblastomas with MYCN gene amplification are difficult to treat effectively. This has focused attention on tumor-specific gene dependencies that underlie tumorigenesis and thus provide valuable targets for the development of novel therapeutics.
Using unbiased genome-scale CRISPR–Cas9 approaches to detect genes involved in tumor cell growth and survival, we identified 147 candidate gene dependencies selective for MYCN-amplified neuroblastoma cell lines, compared to over 300 other human cancer cell lines. We then used genome-wide chromatin-immunoprecipitation coupled to high-throughput sequencing analysis to demonstrate that a small number of essential transcription factors—MYCN, HAND2, ISL1, PHOX2B, GATA3, and TBX2—are members of the transcriptional core regulatory circuitry (CRC) that maintains cell state in MYCN-amplified neuroblastoma. Reference
A simple genetic basis of adaptation to a novel thermal environment results in complex metabolic rewiring in Drosophila
Population genetic theory predicts that rapid adaptation is largely driven by complex traits encoded by many loci of small effect. Because large-effect loci are quickly fixed in natural populations, they should not contribute much to rapid adaptation.
To investigate the genetic architecture of thermal adaptation — a highly complex trait — we performed experimental evolution on a natural Drosophila simulans population. Transcriptome and respiration measurements reveal extensive metabolic rewiring after only approximately 60 generations in a hot environment. Reference
Decoding a cancer-relevant splicing decision in the RON proto-oncogene
Mutations causing aberrant splicing are frequently implicated in human diseases including cancer.
Here, we establish a high-throughput screen of randomly mutated minigenes to decode the cis-regulatory landscape that determines alternative splicing of exon 11 in the proto-oncogene MST1R (RON). Mathematical modelling of splicing kinetics enables us to identify more than 1000 mutations affecting RON exon 11 skipping, which corresponds to the pathological isoform RON∆165. Importantly, the effects correlate with RON alternative splicing in cancer patients bearing the same mutations. Reference
Linking the International Wheat Genome Sequencing Consortium bread wheat reference genome sequence to wheat genetic and phenomic data
The Wheat@URGI portal has been developed to provide the international community of researchers and breeders with access to the bread wheat reference genome sequence produced by the International Wheat Genome Sequencing Consortium.
Genome browsers, BLAST, and InterMine tools have been established for in-depth exploration of the genome sequence together with additional linked datasets including physical maps, sequence variations, gene expression, and genetic and phenomic data from other international collaborative projects already stored in the GnpIS information system. Reference
Discovery of cationic nonribosomal peptides as Gram-negative antibiotics through global genome mining
The worldwide prevalence of infections caused by antibiotic-resistant Gram-negative bacteria poses a serious threat to public health due to the limited therapeutic alternatives.
Cationic peptides represent a large family of antibiotics and have attracted interest due to their diverse chemical structures and potential for combating drug-resistant Gram-negative pathogens. Here, we analyze 7395 bacterial genomes to investigate their capacity for biosynthesis of cationic nonribosomal peptides with activity against Gram-negative bacteria. Reference
Population genomics and morphometric assignment of western honey bees
Apis mellifera scutellata and A.m. capensis (the Cape honey bee) are western honey bee subspecies indigenous to the Republic of South Africa (RSA). Both bees are important for biological and economic reasons. First, A.m. scutellata is the invasive “African honey bee” of the Americas and exhibits a number of traits that beekeepers consider undesirable.
They swarm excessively, are prone to absconding (vacating the nest entirely), usurp other honey bee colonies, and exhibit heightened defensiveness. Second, Cape honey bees are socially parasitic; the workers can reproduce thelytokously. The two subspecies are visually indistinguishable. Therefore, we employed Genotyping-by-Sequencing (GBS), wing geometry, and standard morphometric approaches to assess the genetic diversity and population structure of these bees and to search for diagnostic markers that can distinguish between the two subspecies. Reference