Helmholtz Gemeinschaft

Gaps and complex structurally variant loci in phased genome assemblies

PDF (Original Article), 13MB
Other (Supplemental Material), 20MB

Item Type: Article
Title: Gaps and complex structurally variant loci in phased genome assemblies
Creators Name: Porubsky, D., Vollger, M.R., Harvey, W.T., Rozanski, A.N., Ebert, P., Hickey, G., Hasenfeld, P., Sanders, A.D., Stober, C., Korbel, J.O., Paten, B., Marschall, T. and Eichler, E.E.
Abstract: There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still contains more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6-7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.
Keywords: DNA Sequence Analysis, Genetic Polymorphism, Genomic Segmental Duplications, Haplotypes, Satellite DNA
Source: Genome Research
ISSN: 1088-9051
Publisher: Cold Spring Harbor Laboratory Press
Volume: 33
Number: 4
Page Range: 496-510
Date: April 2023
Official Publication: https://doi.org/10.1101/gr.277334.122

Open Access
MDC Library