Abstract: Small proteins of fewer than 100 amino acids, and particularly those of fewer than 50, remain largely unexplored. Nonetheless, they represent an essential part of the often neglected genetic repertoire of bacteria. In recent years, the development of ribosome profiling protocols has led to the detection of an increasing number of previously unknown small proteins. Even so, automated genome annotation pipelines overlook them in many cases, and functional descriptions often cannot be assigned because known homologs are lacking. To understand and overcome these limitations, the current abundance of small proteins in existing databases was evaluated, and a new dedicated database for small proteins and their potential functions, called 'sORFdb', was created. To this end, small proteins were extracted from annotated bacterial genomes in the GenBank database. They were then quality-filtered, compared, and complemented with proteins from Swiss-Prot, UniProt, and SmProt to ensure reliable identification and characterization of small proteins. Families of similar small proteins were created using bidirectional best BLAST hits followed by Markov clustering. Analysis of small proteins in public databases revealed that their number is still limited due to historical and technical constraints. Additionally, functional descriptions were often missing despite the presence of potential homologs. As expected, a taxonomic bias toward over-represented, clinically relevant bacteria was evident. The new, comprehensive database is accessible via a feature-rich website that provides specialized search features for high-quality sORFs and small proteins. Small protein families with Hidden Markov Models, along with information on taxonomic distribution and physicochemical properties, are also available.
In conclusion, the novel small protein database sORFdb is a specialized, taxonomy-independent database that improves the findability and classification of sORFs, small proteins, and their functions in bacteria, thereby supporting their future detection and consistent annotation. All sORFdb data is freely accessible via https://sorfdb.computational.bio.
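
The family-building step named in the abstract (bidirectional best BLAST hits followed by Markov clustering) can be illustrated with a minimal Python sketch. This is not the sORFdb implementation: the input format (tuples of query, subject, and bit score, as might be parsed from BLAST tabular output), the function names, and the tie-breaking rule are all assumptions made for illustration, and the Markov clustering step itself is not implemented. The resulting pairs would serve as graph edges handed to a separate clustering tool.

```python
# Hypothetical sketch: length filter plus bidirectional best hits (BBH).
# Names and input format are assumptions, not taken from sORFdb.

MAX_LEN = 100  # cutoff from the abstract: fewer than 100 amino acids


def is_small(seq, max_len=MAX_LEN):
    """Return True if a protein sequence is under the small-protein cutoff."""
    return 0 < len(seq) < max_len


def best_hits(hits):
    """Map each query to its highest-scoring non-self subject.

    hits: iterable of (query, subject, bitscore) tuples.
    On equal scores the first hit seen is kept (an arbitrary choice).
    """
    best = {}
    for query, subject, score in hits:
        if query == subject:
            continue  # ignore self-hits
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


def bidirectional_best_hits(hits):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    best = best_hits(hits)
    pairs = set()
    for query, (subject, _) in best.items():
        if subject in best and best[subject][0] == query:
            pairs.add(tuple(sorted((query, subject))))
    return sorted(pairs)
```

For example, given hits in which `a` and `b` are each other's top match while `c` only matches `a` weakly, only the pair `("a", "b")` is returned. Requiring reciprocity in this way discards one-sided similarities before any clustering is attempted.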

D-sORF: Accurate Ab Initio Classification of Experimentally Detected Small Open Reading Frames (sORFs) Associated with Translational Machinery

csORF-finder: an effective ensemble learning framework for accurate identification of multi-species coding short open reading frames

Accurate detection of short and long active ORFs using Ribo-seq data

sORFdb – A database for sORFs, small proteins, and small protein families in bacteria

sOCP: a framework predicting smORF coding potential based on TIS and in-frame features and effectively applied in the human genome

Comprehensive evaluation of protein-coding sORFs prediction based on a random sequence strategy

InteractORF, predictions of human sORF functions from an interactome study

No country for old methods: New tools for studying microproteins

Improved Identification of Small Open Reading Frames Encoded Peptides by Top-Down Proteomic Approaches and De Novo Sequencing

misORFPred: A Novel Method to Mine Translatable sORFs in Plant Pri-miRNAs Using Enhanced Scalable k-mer and Dynamic Ensemble Voting Strategy

sPepFinder expedites genome-wide identification of small proteins in bacteria

Smorfunction: a Tool for Predicting Functions of Small Open Reading Frames and Microproteins

Mapping start codons of small open reading frames by N-terminomics approach

A comprehensive catalog of predicted functional upstream open reading frames in humans

OCCAM: prediction of small ORFs in bacterial genomes by means of a target-decoy database approach and machine learning techniques

RiboNT: A Noise-Tolerant Predictor of Open Reading Frames from Ribosome-Protected Footprints

Identification of short open reading frames in plant genomes

De novo Identification of Actively Translated Open Reading Frames with Ribosome Profiling Data

Comparison of software packages for detecting unannotated translated small open reading frames by Ribo-seq

Mutational Constraint Analysis Workflow for Overlapping Short Open Reading Frames and Genomic Neighbours