Abstract: Approximately 3% of the human genome consists of repetitive elements called tandem repeats (TRs), which include short tandem repeats (STRs) of 1-6 bp motifs and variable number tandem repeats (VNTRs) of 7+ bp motifs. TR variants contribute to several dozen mono- and polygenic diseases but remain understudied and 'enigmatic,' particularly relative to single nucleotide variants. It remains comparatively challenging to interpret the clinical significance of TR variants. Although existing resources provide portions of the data needed for interpretation at disease-associated loci, it is currently difficult or impossible to efficiently access the additional details critical to proper interpretation, such as motif pathogenicity, disease penetrance, and age of onset distributions. It is also often unclear how to apply population information to analyses. We present STRchive (S-T-archive, http://strchive.org/), a dynamic resource consolidating information on TR disease loci in humans from research literature, up-to-date clinical resources, and large-scale genomic databases, with the goal of streamlining TR variant interpretation at disease-associated loci. We apply STRchive (including pathogenic thresholds, motif classification, and clinical phenotypes) to a gnomAD cohort of ~18.5k individuals genotyped at 60 disease-associated loci. Through detailed literature curation, we demonstrate that the majority of TR diseases affect children despite being thought of as adult diseases. Additionally, we show that pathogenic genotypes can be found within gnomAD that do not necessarily overlap with known disease prevalence, and we leverage STRchive to interpret locus-specific findings therein. We apply a diagnostic blueprint empowered by STRchive to relevant clinical vignettes, highlighting possible pitfalls in TR variant interpretation. As a living resource, STRchive is maintained by experts, takes community contributions, and will evolve as understanding of TR diseases progresses.

Analysis and benchmarking of small and large genomic variants across tandem repeats

Defining a tandem repeat catalog and variation clusters for genome-wide analyses and population databases

Characterization and visualization of tandem repeats at genome scale

Genome-wide profiling of genetic variation at tandem repeat from long reads

TRCompDB: A reference of human tandem repeat sequence and composition variation from long-read assemblies

Characterizing tandem repeat complexities across long-read sequencing platforms with TREAT and otter

TRcaller: a novel tool for precise and ultrafast tandem repeat variant genotyping in massively parallel sequencing reads

TRGT-denovo: accurate detection of de novo tandem repeat mutations

A genome-wide spectrum of tandem repeat expansions in 338,963 humans

Sequencing and characterizing short tandem repeats in the human genome

STRchive: a dynamic resource detailing population-level and locus-specific insights at tandem repeat disease loci

Curated variation benchmarks for challenging medically relevant autosomal genes

A comprehensive tandem repeat catalog of the human genome

Short Tandem Repeats in the era of next-generation sequencing: from historical loci to population databases

A comparison of software for analysis of rare and common short tandem repeat (STR) variation using human genome sequences from clinical and population-based samples

The Platinum Pedigree: A long-read benchmark for genetic variants

Polymorphic tandem repeats shape single-cell gene expression across the immune landscape

A phenome-wide association study of tandem repeat variation in 168,554 individuals from the UK Biobank

Assessing structural variation in a personal genome—towards a human reference diploid genome