BioNumPy: array programming for biology

Knut Dagestad Rand, Ivar Grytten, Milena Pavlović, Chakravarthi Kanduri, Geir Kjetil Sandve
DOI: https://doi.org/10.1038/s41592-024-02483-4
IF: 48
2024-10-18
Nature Methods
Abstract: Python is a widely used programming language for scientific computing, in large part due to the powerful array programming library NumPy [1], which makes it easy to write clean, vectorized and computationally efficient code for handling large datasets. A challenge with using array programming in biology is that the data are often non-numeric and of variable length (for example, DNA sequences), which hinders out-of-the-box use of standard array programming techniques. This may push bioinformaticians to rely instead on complex, custom pipelines of UNIX commands that are non-transparent and error-prone. Furthermore, the difficulty of writing efficient code directly in high-level languages like Python has led tool developers to rely almost exclusively on low-level languages like C and C++ (or on hybrid implementations using, for example, Cython [2] or Numba [3]), making it harder for computational biologists to understand and contribute to core methods in the field. We present the BioNumPy package, which enables efficient and intuitive array programming on biological data in Python. Internally, this is handled by a ragged data structure (similar to that in ref. 4) that numerically encodes variable-length sequence data in continuous memory blocks, along with arrays describing the sequence lengths and encoding (see Supplementary Material). BioNumPy supports a broad range of bioinformatics analyses, with the main philosophy being that data structures should behave as closely as possible to standard numeric NumPy arrays. This means that BioNumPy is easy to learn for users familiar with NumPy or with array programming languages like R and Matlab.
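The ragged layout described in the abstract can be illustrated with plain NumPy. The following is a minimal sketch, not BioNumPy's actual implementation or API: the names `encode_ragged` and `gc_content`, the `ACGT` to 0..3 encoding, and the restriction to non-empty sequences are all assumptions made for illustration. It stores every sequence in one flat encoded buffer plus a lengths array, then computes per-sequence GC content without a Python-level loop over bases.

```python
import numpy as np

# Hypothetical encoding for illustration: A=0, C=1, G=2, T=3; 255 marks invalid bytes.
ALPHABET = "ACGT"
LOOKUP = np.full(256, 255, dtype=np.uint8)
LOOKUP[np.frombuffer(ALPHABET.encode("ascii"), dtype=np.uint8)] = np.arange(
    len(ALPHABET), dtype=np.uint8
)


def encode_ragged(sequences):
    """Pack variable-length sequences into one flat encoded array plus lengths."""
    lengths = np.array([len(s) for s in sequences], dtype=np.int64)
    if (lengths == 0).any():
        # Assumption of this sketch: empty sequences are not supported.
        raise ValueError("empty sequences are not supported in this sketch")
    raw = np.frombuffer("".join(sequences).encode("ascii"), dtype=np.uint8)
    data = LOOKUP[raw]  # vectorized byte -> code translation
    if (data == 255).any():
        raise ValueError("sequence contains a character outside ACGT")
    return data, lengths


def gc_content(data, lengths):
    """Fraction of G/C bases per sequence, computed on the flat buffer."""
    is_gc = ((data == 1) | (data == 2)).astype(np.int64)
    starts = np.cumsum(lengths) - lengths  # offset of each row in the flat buffer
    counts = np.add.reduceat(is_gc, starts)  # segment-wise sum per sequence
    return counts / lengths


data, lengths = encode_ragged(["ACGT", "GG", "ATATAT"])
print(gc_content(data, lengths))
```

Keeping the data in a single contiguous buffer is what lets ordinary vectorized operations (comparisons, lookups, segment reductions) apply to sequences of different lengths.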
biochemical research methods