Squeakuences: a portable tool for formatting squeaky-clean sequences to eliminate bioinformatic software incompatibilities

Evan S Sullivan Forsythe
DOI: https://doi.org/10.1101/2024.11.01.621607
2024-11-03
Abstract: Computational analysis of biological sequences is the cornerstone of modern bioinformatics research. Complex processing and interpretation of data often entails multi-step workflows. The specific requirements and limitations of individual applications can require laborious reformatting and piecemeal data-wrangling to produce a satisfactory input for each step in a pipeline. We present Squeakuences, a command line tool developed to simplify and automate FASTA file preparation for applications such as phylogenetics, gene annotation, and genome analysis. Implemented in a lightweight Python script, Squeakuences identifies and removes potentially problematic elements in sequence identifiers, such as non-alphanumeric characters, white space, and excessive character count. Squeakuences outputs a new clean version of the sequence file for analysis alongside metadata files to track changes. The user can customize Squeakuences behavior using optional arguments to meet individual processing and formatting requirements. We tested the performance of Squeakuences on molecular data from the human reference genome and found that runtime correlates with the number of sequences processed but not with file size. We expect Squeakuences to save time and manual effort when analyzing sequence data. Squeakuences code is freely available at https://github.com/EvanForsythe/Squeakuences.
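The kind of identifier cleaning described in the abstract can be illustrated with a minimal Python sketch. This is not the Squeakuences implementation: the function names `clean_identifier` and `clean_fasta`, the underscore substitution for whitespace, and the default length limit of 50 characters are all assumptions chosen for illustration. It only shows the general idea of removing non-alphanumeric characters, collapsing whitespace, and truncating long identifiers while recording what changed.

```python
import re


def clean_identifier(identifier: str, max_length: int = 50) -> str:
    """Simplify a FASTA identifier (hypothetical rules, not Squeakuences' own)."""
    # Collapse each run of whitespace into a single underscore.
    cleaned = re.sub(r"\s+", "_", identifier.strip())
    # Drop every character that is not a letter, digit, or underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "", cleaned)
    # Truncate identifiers that exceed the assumed length limit.
    return cleaned[:max_length]


def clean_fasta(lines, max_length: int = 50):
    """Clean the header lines of a FASTA file given as an iterable of lines.

    Returns the cleaned lines and a list of (original_id, new_id) pairs,
    loosely analogous to a metadata file that tracks changes.
    """
    cleaned_lines = []
    changes = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            original = line[1:]
            new = clean_identifier(original, max_length)
            changes.append((original, new))
            cleaned_lines.append(">" + new)
        else:
            # Sequence lines pass through unchanged.
            cleaned_lines.append(line)
    return cleaned_lines, changes


if __name__ == "__main__":
    example = [">sp|P12345| Homo sapiens (human)\n", "ACGTACGT\n"]
    cleaned, log = clean_fasta(example)
    print("\n".join(cleaned))
    print(log)
```

Note that truncating identifiers in this naive way can produce duplicates; a real tool would need some strategy for keeping identifiers unique, which this sketch does not attempt.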
Bioinformatics