Update on: September 27, 2023
Genome-wide enhancer-gene regulatory maps
Genome-wide association studies have identified numerous variants associated with human complex traits, most of which reside in non-coding regions, but the underlying biological mechanisms remain unclear.

However, assigning function to non-coding elements remains challenging. Here we apply the Activity-by-Contact (ABC) model to evaluate enhancer-gene regulatory effects by integrating multi-omics data, and identify 544,849 connections across 20 cancer types. The ABC model outperforms previous approaches in linking regulatory variants to target genes. Furthermore, we identify over 30,000 enhancer-gene connections in colorectal cancer (CRC) tissues.
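As a rough illustration of how an ABC-style score ranks candidate enhancers for one gene, the sketch below uses the commonly cited form of the score: an element's activity multiplied by its contact frequency with the gene's promoter, divided by the sum of that product over all candidate elements near the gene. The formula, the function name `abc_scores`, and the input numbers are assumptions for illustration; none of them is taken from the text above.

```python
def abc_scores(elements):
    """Return an ABC-style score for each candidate element of one gene.

    `elements` maps an element name to (activity, contact), where activity
    stands for an enhancer activity signal and contact for the contact
    frequency with the gene promoter. All values here are hypothetical.
    """
    products = {name: a * c for name, (a, c) in elements.items()}
    total = sum(products.values())
    if total == 0:
        return {name: 0.0 for name in elements}
    # Each score is the element's share of the total activity-by-contact.
    return {name: p / total for name, p in products.items()}


# Hypothetical candidates for a single gene.
candidates = {
    "enhancer_A": (8.0, 0.50),
    "enhancer_B": (4.0, 0.25),
    "enhancer_C": (1.0, 1.00),
}
scores = abc_scores(candidates)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Because the scores are shares of a total, they sum to 1 across the candidates for a gene, which makes a fixed cutoff comparable between genes.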
By integrating large-scale population cohorts (23,813 cases and 29,973 controls) and multipronged functional assays, we demonstrate that an ABC regulatory variant, rs4810856, is associated with CRC risk (odds ratio = 1.11, 95% CI = 1.05–1.16, P = 4.02 × 10⁻⁵) by acting as an allele-specific enhancer to distally facilitate PREX1, CSE1L and STAU1 expression, which synergistically activate p-AKT signaling. Our study provides comprehensive regulation maps and illuminates a single variant regulating multiple genes, providing insights into cancer etiology. Reference
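The reported odds ratio, confidence interval and P value can be checked against one another. The sketch below assumes a Wald-type normal approximation on the log-odds scale; the text does not say how the P value was actually computed, so this is a plausibility check and not a reconstruction of the authors' test. The function name is hypothetical.

```python
import math


def two_sided_p_from_or(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Approximate z and two-sided P from an odds ratio and its 95% CI.

    Assumes the CI is symmetric on the log scale (Wald interval).
    """
    log_or = math.log(odds_ratio)
    # Half the CI width on the log scale equals z_crit * standard error.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_or / se
    # erfc(|z| / sqrt(2)) equals the two-sided tail of a standard normal.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


z, p = two_sided_p_from_or(1.11, 1.05, 1.16)
print(f"z = {z:.2f}, P = {p:.2e}")
```

Under this approximation the P value comes out on the order of 10⁻⁵, in line with the reported figure; small differences are expected because the published odds ratio and interval are rounded.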
Accurate proteome-wide missense variant effect prediction with AlphaMissense
Genome sequencing has revealed extensive genetic variation in human populations. Missense variants are genetic variants that alter the amino acid sequence of proteins. Pathogenic missense variants disrupt protein function and reduce organismal fitness, while benign missense variants have limited effect.

Classifying these variants is an important ongoing challenge in human genetics. Of more than 4 million observed missense variants, only an estimated 2% have been clinically classified as pathogenic or benign, while the vast majority of them are of unknown clinical significance. This limits the diagnosis of rare diseases, as well as the development or application of clinical treatments that target the underlying genetic cause. Machine learning approaches could close the variant interpretation gap by exploiting patterns in biological data to predict the pathogenicity of unannotated variants.
We developed AlphaMissense to leverage advances on multiple fronts: (i) unsupervised protein language modeling to learn amino acid distributions conditioned on sequence context; (ii) incorporating structural context by using an AlphaFold-derived system; and (iii) fine-tuning on weak labels from population frequency data, thereby avoiding bias from human-curated annotations. AlphaMissense achieves state-of-the-art missense pathogenicity predictions in clinical annotation, de novo disease variants, and experimental assay benchmarks without explicitly training on such data. Reference
A robust deep learning workflow to predict CD8+ T-cell epitopes
T-cells play a crucial role in the adaptive immune system by triggering responses against cancer cells and pathogens, while maintaining tolerance against self-antigens, which has sparked interest in the development of various T-cell-focused immunotherapies.

However, the identification of antigens recognised by T-cells is low-throughput and laborious. To overcome some of these limitations, computational methods for predicting CD8+ T-cell epitopes have emerged. Despite recent developments, most immunogenicity algorithms struggle to learn features of peptide immunogenicity from small datasets, suffer from HLA bias and are unable to reliably predict pathology-specific CD8+ T-cell epitopes.
TRAP, the deep learning workflow presented here, was used to identify epitopes from glioblastoma patients as well as SARS-CoV-2 peptides, and it outperformed other algorithms in both cancer and pathogenic settings. TRAP was especially effective at extracting immunogenicity-associated properties from restricted data of emerging pathogens and translating them onto related species, as well as minimising the loss of likely epitopes in imbalanced datasets. Reference
An immune cell atlas reveals the dynamics of human macrophage specification during prenatal development
Macrophages are heterogeneous and play critical roles in development and disease, but their diversity, function, and specification remain inadequately understood during human development.

We generated a single-cell RNA sequencing map of the dynamics of human macrophage specification from postconception weeks (PCW) 4–26 across 19 tissues. Among 15 macrophage subtypes, we identified a microglia-like population and a proangiogenic population. Microglia-like cells, molecularly and morphologically similar to microglia in the CNS, are present in the fetal epidermis, testicle, and heart. They are the major immune population in the early epidermis, exhibit a polarized distribution along the dorsal-lateral-ventral axis, and interact with neural crest cells, modulating their differentiation along the melanocyte lineage.
Through spatial and differentiation trajectory analysis, we also showed that proangiogenic macrophages are perivascular across fetal organs and, like microglia, are likely yolk-sac-derived. Our study provides a comprehensive map of the heterogeneity and developmental dynamics of human macrophages and unravels their diverse functions during development. Reference
GWAS of random glucose in 476,326 individuals
Conventional measurements of fasting and postprandial blood glucose levels investigated in genome-wide association studies (GWAS) cannot capture the effects of DNA variability on ‘around the clock’ glucoregulatory processes.

Here we show that GWAS meta-analysis of glucose measurements under nonstandardized conditions (random glucose (RG)) in 476,326 individuals of diverse ancestries and without diabetes enables locus discovery and innovative pathophysiological observations.
We discovered 120 RG loci represented by 150 distinct signals, including 13 with sex-dimorphic effects, two cross-ancestry and seven rare frequency signals. Of these, 44 loci are new for glycemic traits. Regulatory, glycosylation and metagenomic annotations highlight ileum and colon tissues, indicating an underappreciated role of the gastrointestinal tract in controlling blood glucose. Reference
Spatial multimodal analysis of transcriptomes and metabolomes in tissues
We present a spatial omics approach that combines histology, mass spectrometry imaging and spatial transcriptomics to facilitate precise measurements of mRNA transcripts and low-molecular-weight metabolites across tissue regions.

The workflow is compatible with commercially available Visium glass slides. We demonstrate the potential of our method using mouse and human brain samples in the context of dopamine and Parkinson’s disease. Reference
The Oncology Biomarker Discovery framework reveals cetuximab and bevacizumab response patterns in metastatic colorectal cancer
Precision medicine has revolutionised cancer treatments; however, actionable biomarkers remain scarce. To address this, we develop the Oncology Biomarker Discovery (OncoBird) framework for analysing the molecular and biomarker landscape of randomised controlled clinical trials.

OncoBird identifies biomarkers based on single genes or mutually exclusive genetic alterations in isolation or in the context of tumour subtypes, and finally, assesses predictive components by their treatment interactions. Here, we utilise the open-label, randomised phase III trial (FIRE-3, AIO KRK-0306) in metastatic colorectal carcinoma patients, who received either cetuximab or bevacizumab in combination with 5-fluorouracil, folinic acid and irinotecan (FOLFIRI).
We systematically identify five biomarkers with predictive components, e.g., patients with tumours that carry chr20q amplifications or lack mutually exclusive ERK signalling mutations benefited from cetuximab compared to bevacizumab. In summary, OncoBird characterises the molecular landscape and outlines actionable biomarkers, which generalises to any molecularly characterised randomised controlled trial. Reference
A pan-cancer single-cell panorama of human natural killer cells
Natural killer (NK) cells play indispensable roles in innate immune responses against tumor progression. To depict their phenotypic and functional diversities in the tumor microenvironment, we perform integrative single-cell RNA sequencing analyses on NK cells from 716 patients with cancer, covering 24 cancer types.

We observed heterogeneity in NK cell composition in a tumor-type-specific manner. Notably, we have identified a group of tumor-associated NK cells that are enriched in tumors, show impaired anti-tumor functions, and are associated with unfavorable prognosis and resistance to immunotherapy.
Specific myeloid cell subpopulations, in particular LAMP3+ dendritic cells, appear to mediate the regulation of NK cell anti-tumor immunity. Our study provides insights into NK-cell-based cancer immunity and highlights potential clinical utilities of NK cell subsets as therapeutic targets. Reference
Epitope editing enables targeted immunotherapy of acute myeloid leukaemia
Despite the considerable efficacy observed when targeting a dispensable lineage antigen, such as CD19 in B cell acute lymphoblastic leukaemia, the broader applicability of adoptive immunotherapies is hampered by the absence of tumour-restricted antigens.

Acute myeloid leukaemia immunotherapies target genes expressed by haematopoietic stem/progenitor cells (HSPCs) or differentiated myeloid cells, resulting in intolerable on-target/off-tumour toxicity. Here we show that epitope engineering of donor HSPCs used for bone marrow transplantation endows haematopoietic lineages with selective resistance to chimeric antigen receptor (CAR) T cells or monoclonal antibodies, without affecting protein function or regulation. This strategy enables the targeting of genes that are essential for leukaemia survival regardless of shared expression on HSPCs, reducing the risk of tumour immune escape.
By performing epitope mapping and library screenings, we identified amino acid changes that abrogate the binding of therapeutic monoclonal antibodies targeting FLT3, CD123 and KIT, and optimized a base-editing approach to introduce them into CD34+ HSPCs, which retain long-term engraftment and multilineage differentiation ability. Reference
The complete sequence of a human Y chromosome
The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications.

As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region.
We have combined T2T-Y with a previous assembly of the CHM13 genome and mapped available population variation, clinical variants and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes. Reference
BamQuery: a proteogenomic tool to explore the immunopeptidome
MHC-I-associated peptides deriving from non-coding genomic regions and mutations can generate tumor-specific antigens, including neoantigens.

Quantifying tumor-specific antigens’ RNA expression in malignant and benign tissues is critical for discriminating actionable targets. We present BamQuery, a tool attributing an exhaustive RNA expression to MHC-I-associated peptides of any origin from bulk and single-cell RNA-sequencing data.
We show that many cryptic and mutated tumor-specific antigens can derive from multiple discrete genomic regions that are abundantly expressed in normal tissues. BamQuery can also be used to predict the immunogenicity of MHC-I-associated peptides and to identify actionable tumor-specific antigens de novo. Reference
A visual–language foundation model for pathology image analysis using medical Twitter
The lack of annotated publicly available medical images is a major barrier for computational research and education innovations. At the same time, many de-identified images and much knowledge are shared by clinicians on public forums such as medical Twitter.

Here we harness these crowd platforms to curate OpenPath, a large dataset of 208,414 pathology images paired with natural language descriptions. We demonstrate the value of this resource by developing pathology language–image pretraining (PLIP), a multimodal artificial intelligence with both image and text understanding, which is trained on OpenPath.
PLIP achieves state-of-the-art performance for classifying new pathology images across four external datasets: for zero-shot classification, PLIP achieves F1 scores of 0.565–0.832, compared to F1 scores of 0.030–0.481 for a previous contrastive language–image pretrained model. Training a simple supervised classifier on top of PLIP embeddings also achieves a 2.5% improvement in F1 scores compared to using other supervised model embeddings. Reference
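Zero-shot classification with a contrastive language-image model generally works by embedding the image and one text prompt per class into a shared space, then picking the class whose prompt is most similar to the image. The sketch below shows only that final comparison step with cosine similarity. It does not call PLIP; the three-dimensional embeddings, labels and function names are made up for illustration, and a real encoder would produce much longer vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_label(image_embedding, text_embeddings):
    """Return the label whose text embedding is closest to the image."""
    sims = {
        label: cosine(image_embedding, vec)
        for label, vec in text_embeddings.items()
    }
    return max(sims, key=sims.get), sims


# Hypothetical embeddings standing in for encoder outputs.
text_embeddings = {
    "benign tissue": [0.9, 0.1, 0.0],
    "malignant tissue": [0.1, 0.9, 0.2],
}
image_embedding = [0.2, 0.8, 0.1]

label, sims = zero_shot_label(image_embedding, text_embeddings)
print(label)
```

No labelled training images for the target classes are needed in this setup, which is why the approach is called zero-shot.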
Pan-cancer analysis of post-translational modifications reveals shared patterns of protein regulation
Post-translational modifications (PTMs) play key roles in regulating cell signaling and physiology in both normal and cancer cells.

Advances in mass spectrometry enable high-throughput, accurate, and sensitive measurement of PTM levels to better understand their role, prevalence, and crosstalk. Here, we analyze the largest collection of proteogenomics data from 1,110 patients with PTM profiles across 11 cancer types (10 from the National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium [CPTAC]). Our study reveals pan-cancer patterns of changes in protein acetylation and phosphorylation involved in hallmark cancer processes.
These patterns revealed subsets of tumors, from different cancer types, including those with dysregulated DNA repair driven by phosphorylation, altered metabolic regulation associated with immune response driven by acetylation, affected kinase specificity by crosstalk between acetylation and phosphorylation, and modified histone regulation. Overall, this resource highlights the rich biology governed by PTMs and exposes potential new therapeutic avenues. Reference
COMPASS: joint copy number and mutation phylogeny reconstruction from amplicon single-cell sequencing data
Reconstructing the history of somatic DNA alterations can help understand the evolution of a tumor and predict its resistance to treatment.

Single-cell DNA sequencing (scDNAseq) can be used to investigate clonal heterogeneity and to inform phylogeny reconstruction. However, most existing phylogenetic methods for scDNAseq data are designed either for single nucleotide variants (SNVs) or for large copy number alterations (CNAs), or are not applicable to targeted sequencing. Here, we develop COMPASS, a computational method for inferring the joint phylogeny of SNVs and CNAs from targeted scDNAseq data.
We evaluate COMPASS on simulated data and apply it to several datasets including a cohort of 123 patients with acute myeloid leukemia. COMPASS detected clonal CNAs that could be orthogonally validated with bulk data, in addition to subclonal ones that require single-cell resolution, some of which point toward convergent evolution. Reference
Proteogenomic analysis of chemo-refractory high-grade serous ovarian cancer

Copy number architectures define treatment-mediated selection of lethal prostate cancer clones
Despite initial responses to hormone treatment, metastatic prostate cancer invariably evolves to a lethal state.

To characterize the intra-patient evolutionary relationships of metastases that evade treatment, we perform genome-wide copy number profiling and bespoke approaches targeting the androgen receptor (AR) on 167 metastatic regions from 11 organs harvested post-mortem from 10 men who died from prostate cancer. We identify diverse and patient-unique alterations clustering around the AR in metastases from every patient, with evidence of independent acquisition of related genomic changes within an individual and, in some patients, the co-existence of AR-neutral clones.
Using the genomic boundaries of pan-autosome copy number changes, we confirm a common clone of origin across metastases and diagnostic biopsies and identify, in individual patients, clusters of metastases occupied by dominant clones with diverged autosomal copy number alterations. Reference
Machine learning for genetics-based classification and treatment response prediction in cancer of unknown primary
Cancer of unknown primary (CUP) is a type of cancer that cannot be traced back to its primary site and accounts for 3–5% of all cancers. Established targeted therapies are lacking for CUP, leading to generally poor outcomes.

We developed OncoNPC (Oncology NGS-based primary cancer-type classifier), a machine-learning classifier trained on targeted next-generation sequencing (NGS) data from 36,445 tumors across 22 cancer types from three institutions. OncoNPC achieved a weighted F1 score of 0.942 for high-confidence predictions (≥0.9) on held-out tumor samples; these predictions made up 65.2% of all the held-out samples. When applied to 971 CUP tumors collected at the Dana-Farber Cancer Institute, OncoNPC predicted primary cancer types with high confidence in 41.2% of the tumors.
OncoNPC also identified CUP subgroups with significantly higher polygenic germline risk for the predicted cancer types and with significantly different survival outcomes. Notably, patients with CUP who received first palliative intent treatments concordant with their OncoNPC-predicted cancers had significantly better outcomes (hazard ratio (HR) = 0.348; 95% confidence interval (CI) = 0.210–0.570; P = 2.32 × 10⁻⁵). Reference
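To make the "weighted F1 on high-confidence predictions" metric concrete, the sketch below filters predictions by a confidence threshold and then computes a support-weighted F1 score over the retained samples. It is a generic illustration and not the OncoNPC evaluation code; the records, cancer-type labels and function names are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    labels = set(y_true) | set(y_pred)
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * support[label] / total
    return score


def high_confidence_subset(records, threshold=0.9):
    """Keep records whose top predicted probability meets the threshold."""
    kept = [r for r in records if r[2] >= threshold]
    return kept, len(kept) / len(records)


# Hypothetical records: (true type, predicted type, top predicted probability).
records = [
    ("lung", "lung", 0.97),
    ("lung", "lung", 0.95),
    ("breast", "breast", 0.93),
    ("breast", "lung", 0.91),
    ("colorectal", "colorectal", 0.99),
    ("colorectal", "breast", 0.55),
    ("lung", "breast", 0.60),
    ("breast", "breast", 0.72),
]
kept, fraction = high_confidence_subset(records)
f1_high = weighted_f1([r[0] for r in kept], [r[1] for r in kept])
print(f"{fraction:.1%} of samples kept, weighted F1 = {f1_high:.3f}")
```

Reporting the retained fraction alongside the score matters: a stricter threshold tends to raise the F1 score while shrinking the share of samples that receive a prediction.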
Genomic insight into domestication of rubber tree
Understanding the genetic basis of rubber tree (Hevea brasiliensis) domestication is crucial for further improving natural rubber production to meet its increasing demand worldwide.

Here we provide a high-quality H. brasiliensis genome assembly (1.58 Gb, contig N50 of 11.21 megabases), present a map of genome variations by resequencing 335 accessions and reveal domestication-related molecular signals and a major domestication trait, the higher number of laticifer rings. We further show that HbPSK5, encoding the small-peptide hormone phytosulfokine (PSK), is a key domestication gene and closely correlated with the major domestication trait.
The transcriptional activation of HbPSK5 by myelocytomatosis (MYC) members links PSK signaling to jasmonates in regulating the laticifer differentiation in rubber tree. Heterologous overexpression of HbPSK5 in Russian dandelion (Taraxacum kok-saghyz) can increase rubber content by promoting laticifer formation. Our results provide an insight into target genes for improving rubber tree and accelerating the domestication of other rubber-producing plants. Reference
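The contig N50 quoted for the assembly is a standard contiguity statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of that calculation follows; the contig lengths are invented for illustration and have nothing to do with the rubber tree assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths (arbitrary units), total = 300.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))  # 80 + 70 = 150 covers half of 300, so N50 is 70
```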
Effective methods for bulk RNA-seq deconvolution using scnRNA-seq transcriptomes
RNA profiling technologies at single-cell resolutions, including single-cell and single-nuclei RNA sequencing (scRNA-seq and snRNA-seq, scnRNA-seq for short), can help characterize the composition of tissues and reveal cells that influence key functions in both healthy and disease tissues.

However, the use of these technologies is operationally challenging because of high costs and stringent sample-collection requirements. Computational deconvolution methods that infer the composition of bulk-profiled samples using scnRNA-seq-characterized cell types can broaden scnRNA-seq applications, but their effectiveness remains controversial.
We produced the first systematic evaluation of deconvolution methods on datasets with either known or scnRNA-seq-estimated compositions. Our analyses revealed biases that are common to scnRNA-seq 10X Genomics assays and illustrated the importance of accurate and properly controlled data preprocessing and method selection and optimization. Reference
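For readers unfamiliar with deconvolution, the underlying idea is to model a bulk expression profile as a non-negative mixture of cell-type signature profiles and solve for the mixing proportions. The sketch below does this with projected gradient descent on a least-squares objective. It is a generic textbook formulation, not one of the methods evaluated in the study, and the signature matrix, bulk profile and function name are hypothetical.

```python
def deconvolve(signature, bulk, iterations=2000):
    """Estimate cell-type proportions by non-negative least squares.

    signature[g][k] is the expression of gene g in cell type k;
    bulk[g] is the bulk expression of gene g.
    """
    n_genes = len(signature)
    n_types = len(signature[0])
    # The sum of squared entries bounds the curvature, so this step is safe.
    step = 1.0 / sum(x * x for row in signature for x in row)
    weights = [1.0 / n_types] * n_types
    for _ in range(iterations):
        residual = [
            sum(signature[g][k] * weights[k] for k in range(n_types)) - bulk[g]
            for g in range(n_genes)
        ]
        gradient = [
            sum(signature[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto non-negative values.
        weights = [max(0.0, w - step * d) for w, d in zip(weights, gradient)]
    total = sum(weights)
    return [w / total for w in weights] if total else weights


# Hypothetical signatures for 4 genes and 2 cell types.
signature = [[10.0, 1.0], [2.0, 8.0], [5.0, 5.0], [0.0, 6.0]]
# Bulk profile built as a noise-free 30% / 70% mixture of the two types.
bulk = [3.7, 6.2, 5.0, 4.2]
proportions = deconvolve(signature, bulk)
print([round(p, 3) for p in proportions])
```

Real data add noise, platform differences between bulk and single-cell assays, and missing cell types, so estimates there are far less clean than in this noise-free toy example.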
A reinforcement learning model for AI-based decision support in skin cancer
We investigated whether human preferences hold the potential to improve diagnostic artificial intelligence (AI)-based decision support using skin cancer diagnosis as a use case.

We utilized nonuniform rewards and penalties based on expert-generated tables, balancing the benefits and harms of various diagnostic errors, which were applied using reinforcement learning. Compared with supervised learning, the reinforcement learning model improved the sensitivity for melanoma from 61.4% to 79.5% (95% confidence interval (CI): 73.5–85.6%) and for basal cell carcinoma from 79.4% to 87.1% (95% CI: 80.3–93.9%). AI overconfidence was also reduced while simultaneously maintaining accuracy.
Reinforcement learning increased the rate of correct diagnoses made by dermatologists by 12.0% (95% CI: 8.8–15.1%) and improved the rate of optimal management decisions from 57.4% to 65.3% (95% CI: 61.7–68.9%). We further demonstrated that the reward-adjusted reinforcement learning model and a threshold-based model outperformed naïve supervised learning in various clinical scenarios. Our findings suggest the potential for incorporating human preferences into image-based diagnostic algorithms. Reference
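The effect of nonuniform rewards and penalties can be illustrated without any learning at all. In the sketch below, a classifier's class probabilities are combined with a reward table, and the diagnosis with the highest expected reward is chosen in place of the most probable class. This is a simplified decision rule, not the reinforcement learning model described above, and the reward values, class names and probabilities are invented for illustration.

```python
def expected_reward_choice(probabilities, reward_table):
    """Pick the diagnosis with the highest expected reward.

    probabilities: {true_class: model probability}
    reward_table: {predicted_class: {true_class: reward}}
    """
    expected = {
        predicted: sum(probabilities[true] * r for true, r in row.items())
        for predicted, row in reward_table.items()
    }
    return max(expected, key=expected.get), expected


# Hypothetical table: missing a melanoma is penalised far more heavily
# than a false alarm on a benign nevus.
reward_table = {
    "melanoma": {"melanoma": 5.0, "nevus": -1.0},
    "nevus": {"melanoma": -10.0, "nevus": 1.0},
}
probabilities = {"melanoma": 0.3, "nevus": 0.7}

naive = max(probabilities, key=probabilities.get)
choice, expected = expected_reward_choice(probabilities, reward_table)
print(naive, choice)
```

With these made-up numbers the most probable class is the benign one, yet the asymmetric penalties shift the decision toward melanoma. That shift is the kind of sensitivity gain for dangerous diagnoses that asymmetric rewards are meant to produce.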
Genetic history of East-Central Europe in the first millennium CE
The appearance of Slavs in East-Central Europe has been the subject of an over 200-year debate driven by two conflicting hypotheses. The first assumes that Slavs came to the territory of contemporary Poland no earlier than the sixth century CE; the second postulates that they already inhabited this region in the Iron Age (IA).

To address this problem, we determined the genetic makeup of representatives of the IA Wielbark-associated and Middle Ages (MA) Slav-associated cultures from the territory of present-day Poland. The study involved 474 individuals buried in 27 cemeteries. For 197 of them, genome-wide data were obtained. We found close genetic affinities between the IA Wielbark culture-associated individuals and both contemporaneous and older northern European populations. Further, we observed that the IA individuals had genetic components that were indispensable for modeling the MA population. Reference
Genotyping and population characteristics of the China Kadoorie Biobank
The China Kadoorie Biobank (CKB) is a population-based prospective cohort of >512,000 adults recruited from 2004 to 2008 from 10 geographically diverse regions across China.

Detailed data from questionnaires and physical measurements were collected at baseline, with additional measurements at three resurveys involving ∼5% of surviving participants. Analyses of genome-wide genotyping, for >100,000 participants using custom-designed Axiom arrays, reveal extensive relatedness, recent consanguinity, and signatures reflecting large-scale population movements from recent Chinese history. Systematic genome-wide association studies of incident disease, captured through electronic linkage to death and disease registries and to the national health insurance system, replicate established disease loci and identify 14 novel disease associations.
Together with studies of candidate drug targets and disease risk factors and contributions to international genetics consortia, these demonstrate the breadth, depth, and quality of the CKB data. Ongoing high-throughput omics assays of collected biosamples and planned whole-genome sequencing will further enhance the scientific value of this biobank. Reference
Genetic variation in the immunoglobulin heavy chain locus shapes the human antibody repertoire
Variation in the antibody response has been linked to differential outcomes in disease, and suboptimal vaccine and therapeutic responsiveness, the determinants of which have not been fully elucidated.

Countering models that presume antibodies are generated largely by stochastic processes, we demonstrate that polymorphisms within the immunoglobulin heavy chain locus (IGH) impact the naive and antigen-experienced antibody repertoire, indicating that genetics predisposes individuals to mount qualitatively and quantitatively different antibody responses. We pair recently developed long-read genomic sequencing methods with antibody repertoire profiling to comprehensively resolve IGH genetic variation, including novel structural variants, single nucleotide variants, and genes and alleles.
We show that IGH germline variants determine the presence and frequency of antibody genes in the expressed repertoire, including those enriched in functional elements linked to V(D)J recombination, and overlapping disease-associated variants. These results illuminate the power of leveraging IGH genetics to better understand the regulation, function, and dynamics of the antibody response in disease. Reference
Depression pathophysiology, risk prediction of recurrence and comorbid psychiatric disorders
Depression is a common psychiatric disorder and a leading cause of disability worldwide.

Here we conducted a genome-wide association study meta-analysis of six datasets, including >1.3 million individuals (371,184 with depression) and identified 243 risk loci. Overall, 64 loci were new, including genes encoding glutamate and GABA receptors, which are targets for antidepressant drugs. Intersection with functional genomics data prioritized likely causal genes and revealed new enrichment of prenatal GABAergic neurons, astrocytes and oligodendrocyte lineages.
We found depression to be highly polygenic, with ~11,700 variants explaining 90% of the single-nucleotide polymorphism heritability. We estimate that >95% of risk variants for other psychiatric disorders (anxiety, schizophrenia, bipolar disorder and attention deficit hyperactivity disorder) influence depression risk when both concordant and discordant variants are considered, and that nearly all depression risk variants influence educational attainment. Reference
Single-cell spatial transcriptome reveals cell-type organization in the macaque cortex
Elucidating the cellular organization of the cerebral cortex is critical for understanding brain structure and function.

Using large-scale single-nucleus RNA sequencing and spatial transcriptomic analysis of 143 macaque cortical regions, we obtained a comprehensive atlas of 264 transcriptome-defined cortical cell types and mapped their spatial distribution across the entire cortex. We characterized the cortical layer and region preferences of glutamatergic, GABAergic, and non-neuronal cell types, as well as regional differences in cell-type composition and neighborhood complexity. Notably, we discovered a relationship between the regional distribution of various cell types and the region’s hierarchical level in the visual and somatosensory systems.
Cross-species comparison of transcriptomic data from human, macaque, and mouse cortices further revealed primate-specific cell types that are enriched in layer 4, with their marker genes expressed in a region-dependent manner. Our data provide a cellular and molecular basis for understanding the evolution, development, aging, and pathogenesis of the primate brain. Reference
Spatially resolved multiomics of human cardiac niches
The function of a cell is defined by its intrinsic characteristics and its niche: the tissue microenvironment in which it dwells.

Here we combine single-cell and spatial transcriptomics data to discover cellular niches within eight regions of the human heart. We map cells to microanatomical locations and integrate knowledge-based and unsupervised structural annotations.
We also profile the cells of the human cardiac conduction system. The results revealed their distinctive repertoire of ion channels, G-protein-coupled receptors (GPCRs) and regulatory networks, and implicated FOXP2 in the pacemaker phenotype. We show that the sinoatrial node is compartmentalized, with a core of pacemaker cells, fibroblasts and glial cells supporting glutamatergic signalling. Using a custom CellPhoneDB.org module, we identify trans-synaptic pacemaker cell interactions with glia.
We introduce a druggable target prediction tool, drug2cell, which leverages single-cell profiles and drug–target interactions to provide mechanistic insights into the chronotropic effects of drugs, including GLP-1 analogues. In the epicardium, we show enrichment of both IgG+ and IgA+ plasma cells forming immune niches that may contribute to infection defence. Overall, we provide new clarity to cardiac electro-anatomy and immunology, and our suite of computational approaches can be applied to other tissues and organs. Reference
De novo design of protein structure and function with RFdiffusion
There has been considerable recent progress in designing new proteins using deep learning methods. Despite this progress, a general deep learning framework for protein design that enables solution of a wide range of design challenges, including de novo binder design and design of higher order symmetric architectures, has yet to be described.

Diffusion models have had considerable success in image and language generative modeling but limited success when applied to protein modeling, likely due to the complexity of protein backbone geometry and sequence-structure relationships. Here we show that by fine tuning the RoseTTAFold structure prediction network on protein structure denoising tasks, we obtain a generative model of protein backbones that achieves outstanding performance on unconditional and topology-constrained protein monomer design, protein binder design, symmetric oligomer design, enzyme active site scaffolding, and symmetric motif scaffolding for therapeutic and metal-binding protein design.
We demonstrate the power and generality of the method, called RoseTTAFold Diffusion (RFdiffusion), by experimentally characterizing the structures and functions of hundreds of designed symmetric assemblies, metal binding proteins and protein binders. Reference
Characterization of large-scale genomic differences in the first complete human genome
The first telomere-to-telomere (T2T) human genome assembly (T2T-CHM13) release is a milestone in human genomics. The T2T-CHM13 genome assembly extends our understanding of telomeres, centromeres, segmental duplication, and other complex regions.

The current human genome reference (GRCh38) has been widely used in various human genomic studies. However, the large-scale genomic differences between these two important genome assemblies have not yet been characterized in detail. Here, in addition to the previously reported “non-syntenic” regions, we find 67 additional large-scale discrepant regions and precisely categorize them into four structural types with a newly developed web tool called SynPlotter.
The discrepant regions (~21.6 Mbp), excluding telomeric and centromeric regions, are highly structurally polymorphic in humans, and their deletions or duplications are likely associated with various human diseases, such as immune and neurodevelopmental disorders. Reference