Zimin patterns in genomes

Nikol Chantzi, Ioannis Mouratidis, Ilias Georgakopoulos-Soares
2024-10-17
Abstract: Zimin words are words that have the same prefix and suffix. They are unavoidable patterns: every sufficiently long string contains them. Here, we examine for the first time the presence of k-mers that do not contain any Zimin patterns, defined hereafter as Zimin avoidmers, in the human genome. We report that in the reference human genome all k-mers longer than 104 base pairs contain Zimin words. We find that Zimin avoidmers are most enriched in coding and Human Satellite 1 regions of the human genome. Zimin avoidmers display a depletion of germline insertions and deletions relative to surrounding genomic areas. We also apply our methodology to the genomes of another eight model organisms from all three domains of life, finding large differences in their Zimin avoidmer frequencies and their genomic localization preferences. We observe that Zimin avoidmers exhibit the highest genomic density in prokaryotic organisms, with E. coli showing particularly high levels, while the lowest density is found in eukaryotic organisms, with D. rerio having the lowest. Among the studied genomes, the longest k-mer length at which Zimin avoidmers are observed is 115 base pairs, in S. cerevisiae. We conclude that Zimin avoidmers display inhomogeneous distributions in organismal genomes, have intricate properties including lower insertion and deletion rates, and disappear at shorter k-mer lengths than theoretically expected across the organismal genomes studied.
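To make the notion of a Zimin avoidmer concrete, the following is a minimal brute-force sketch, not the authors' pipeline. It assumes the standard combinatorial definition of Zimin patterns (Z_1 = A, Z_2 = ABA, Z_3 = ABACABA, and in general Z_n = Z_{n-1} x Z_{n-1} for a new letter x), where a word contains Z_n if some substring has the form u v u with v non-empty and u itself an instance of Z_{n-1}. The abstract does not state which order n the study uses, so `n` is left as a parameter. The function names `is_zimin_instance`, `contains_zimin` and `is_zimin_avoidmer` are hypothetical and chosen for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def is_zimin_instance(s: str, n: int) -> bool:
    """Return True if s is an instance of the Zimin pattern Z_n.

    Assumed definition: any non-empty string is an instance of Z_1;
    s is an instance of Z_n (n > 1) if s = u + v + u with v non-empty
    and u an instance of Z_{n-1}.
    """
    if n == 1:
        return len(s) >= 1
    # Try every border length b that leaves a non-empty middle part v.
    for b in range(1, (len(s) - 1) // 2 + 1):
        if s[:b] == s[-b:] and is_zimin_instance(s[:b], n - 1):
            return True
    return False


def contains_zimin(word: str, n: int) -> bool:
    """Return True if any substring of word is an instance of Z_n."""
    min_len = 2 ** n - 1  # shortest possible instance of Z_n
    for start in range(len(word)):
        for end in range(start + min_len, len(word) + 1):
            if is_zimin_instance(word[start:end], n):
                return True
    return False


def is_zimin_avoidmer(kmer: str, n: int) -> bool:
    """A k-mer is an avoidmer (for order n) if it contains no instance of Z_n."""
    return not contains_zimin(kmer, n)


if __name__ == "__main__":
    print(is_zimin_avoidmer("ACGT", 2))      # True: no letter repeats
    print(is_zimin_avoidmer("ACGTA", 2))     # False: A + CGT + A
    print(is_zimin_avoidmer("ACAGACA", 3))   # False: ACA + G + ACA
    print(is_zimin_avoidmer("ACGTACGT", 3))  # True
```

This exhaustive check costs at least quadratic time in the k-mer length and is meant only for short examples; scanning whole genomes would need a more efficient approach.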
Genomics