Feature selection by replicate reproducibility and non-redundancy

Tümay Capraz, Wolfgang Huber
DOI: https://doi.org/10.1093/bioinformatics/btae548
IF: 5.8
2024-09-02
Bioinformatics
Abstract:
Motivation: A fundamental step in many analyses of high-dimensional data is dimension reduction. Two basic approaches are the introduction of new synthetic coordinates and the selection of extant features. Advantages of the latter include interpretability, simplicity, transferability, and modularity. A common criterion for unsupervised feature selection is variance or dynamic range. In practice, however, high-variance features can be noisy, important features can have low variance, and variances may simply not be comparable across features because they are measured on unrelated numeric scales or in different physical units. Moreover, users may want to incorporate measures of signal-to-noise ratio and non-redundancy into feature selection.

Results: Here, we introduce the RNR algorithm, which selects features based on (i) the reproducibility of their signal across replicates and (ii) their non-redundancy, measured by linear dependence. It takes as input a typically large set of features measured on a collection of objects with two or more replicates per object. It returns an ordered list of features, i1, i2, …, ik, where feature i1 is the one with the highest reproducibility across replicates, i2 is the one with the highest reproducibility across replicates after projecting out the dimension spanned by i1, and so on. Applications to microscopy-based imaging of cells and to proteomics highlight the benefits of the approach.

Availability and implementation: The RNR method is available via Bioconductor (Huber W, Carey VJ, Gentleman R et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods 2015;12:115-21) in the R package FeatSeekR. Its source code is also available at https://github.com/tcapraz/FeatSeekR under the GPL-3 open source license.
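
The greedy procedure described under Results can be illustrated with a minimal sketch. This is not the FeatSeekR implementation or its API: the function names `reproducibility` and `select_features`, the input layout (features × objects × replicates), and the use of a one-way ANOVA-style ratio of between-object to within-object variance as the reproducibility score are all assumptions made for illustration. The abstract only states that features are ranked by replicate reproducibility after projecting out the dimensions spanned by the features already selected.

```python
import numpy as np


def reproducibility(x, eps=1e-12):
    """Score one feature, given as an (n_objects, n_replicates) array.

    Assumed score (not taken from the paper): ratio of between-object
    to within-object variance, as in a one-way ANOVA F statistic.
    """
    n_objects, n_reps = x.shape
    object_means = x.mean(axis=1)
    grand_mean = x.mean()
    between = n_reps * np.sum((object_means - grand_mean) ** 2) / (n_objects - 1)
    within = np.sum((x - object_means[:, None]) ** 2) / (n_objects * (n_reps - 1))
    return between / (within + eps)


def select_features(X, k, tol=1e-10):
    """Greedy selection sketch for X of shape (n_features, n_objects, n_replicates).

    Returns up to k feature indices, in order of selection.
    """
    X = np.asarray(X, dtype=float)
    n_features, n_objects, n_reps = X.shape
    # Work on centered copies; R holds the current residual of every feature.
    R = X - X.mean(axis=(1, 2), keepdims=True)
    total_ss = (R ** 2).sum(axis=(1, 2))
    selected, remaining = [], list(range(n_features))
    for _ in range(min(k, n_features)):
        # Pick the remaining feature whose residual is most reproducible.
        scores = [reproducibility(R[j]) for j in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
        # Project the chosen direction out of every remaining feature.
        v = R[best].ravel()
        vv = v @ v
        if vv == 0.0:
            continue
        for j in remaining:
            w = R[j].ravel()
            res = w - (w @ v) / vv * v
            # Treat numerically negligible residuals as fully redundant.
            if res @ res < tol * total_ss[j]:
                res = np.zeros_like(res)
            R[j] = res.reshape(n_objects, n_reps)
    return selected
```

Calling `select_features(X, k)` on an array with, say, 4 features, 50 objects and 3 replicates returns a list of feature indices in selection order. Because each selected residual is projected out of the remaining features, a feature that is a linear copy of an earlier pick is left with little or no reproducible signal and is ranked low, which is the non-redundancy behavior the abstract describes.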