Abstract:

Motivation: "Molecular signatures" or "gene-expression signatures" are used to predict patients' characteristics using data from coexpressed genes. Signatures can enhance understanding of biological mechanisms and have diagnostic use. However, available methods to search for signatures fail to address key requirements of signatures, especially the discovery of sets of tightly coexpressed genes.

Results: After suggesting an operational definition of signature, we develop a method that fulfills these requirements, returning sets of tightly coexpressed genes with good predictive performance. This method can also identify when the data are inconsistent with the hypothesis of a few, stable, easily interpretable sets of coexpressed genes. Identification of molecular signatures in some widely used data sets is questionable under this simple model, which emphasizes the need for further work on the operationalization of the biological model and the assessment of the stability of putative signatures.

Availability: The code (R with C++) is available from http://www.ligarto.org/rdiaz/Software/Software.html under the GNU GPL.

What problem does this paper attempt to address?

The paper addresses the following problem: existing methods for finding "molecular signatures" or "gene-expression signatures" fail to meet key requirements of signatures, in particular the discovery of sets of tightly co-expressed genes. The author proposes a new statistical method that aims to overcome these limitations by returning sets of tightly co-expressed genes that also have good predictive performance.

### Specific description of the problem

1. **Limitations of existing methods**:
   - **Loose co-expression between genes**: Most existing methods do not require the genes within a signature component to be tightly co-expressed.
   - **Difficult to interpret**: With principal component analysis (PCA) or partial least squares (PLS), every gene in the analysis receives a loading on each component, which makes the components hard to interpret.
   - **Dependent variable information not used**: Many PCA or gene clustering methods do not use the dependent variable when searching for components.
   - **Single task type**: Most methods apply only to a specific type of task (such as classification or survival analysis) and are difficult to extend to other types of dependent variables.
2. **Goals of the new method**:
   - **Sets of tightly co-expressed genes**: Ensure that the genes within each signature component show tight co-expression.
   - **Good predictive performance**: Ensure that the signature components perform well in prediction.
   - **Easy to interpret**: Simplify interpretation by building each component as a weighted average of a subset of tightly co-expressed genes.
   - **Applicable to different types of tasks**: Can be used with different types of dependent variables (continuous, classification, survival, etc.).

### Key elements of the new method

1. **Selection of seed genes**: Start from a seed gene and construct an initial signature component, ensuring that the genes within the component are tightly co-expressed and that the prediction error is acceptable.
2. **Gradual optimization**: Repeat the process of selecting seed genes and constructing signature components until no new components are required.
3. **Geometric interpretation**: Each signature component points in a direction similar to that of the genes it contains, which preserves the co-expression relationship.
4. **Classifier selection**: Use simple classifiers, such as diagonal linear discriminant analysis (DLDA) and k-nearest neighbors (KNN), for prediction.

### Conclusion

The method proposed in this paper aims to find sets of tightly co-expressed genes that meet the conditions above, thereby improving predictive performance and simplifying biological interpretation. The reported results indicate that, on multiple data sets, the method's performance is close to or better than that of existing classification methods, with particular advantages in interpretability and in the number of genes.
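To make the key elements above more concrete, here is a minimal Python sketch of a seed-based search followed by a DLDA classifier. It is an illustration under stated assumptions, not the author's R/C++ implementation. The function names (`build_components`, `dlda_fit_predict`), the seed score, the correlation threshold `min_corr`, and the use of a plain (unweighted) average are all hypothetical choices. The sketch also omits the prediction-error check that the paper uses when accepting a component.

```python
import numpy as np

def build_components(X, y, min_corr=0.8, max_components=3, min_size=2):
    """Greedy, simplified sketch of seed-based signature components.

    X: (n_samples, n_genes) expression matrix; y: binary labels (0/1).
    Returns a list of (gene_indices, component_scores) tuples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0                      # constant genes: avoid division by zero
    Z = (X - X.mean(axis=0)) / sd          # standardized genes
    # Assumed seed score: absolute difference of class means, in SD units.
    score = np.abs(Z[y == 1].mean(axis=0) - Z[y == 0].mean(axis=0))
    available = np.ones(X.shape[1], dtype=bool)
    components = []
    while len(components) < max_components and available.any():
        # Seed = best-scoring gene that has not been used yet.
        seed = int(np.argmax(np.where(available, score, -np.inf)))
        # Pearson correlation of every gene with the seed.
        corr = Z.T @ Z[:, seed] / Z.shape[0]
        members = np.flatnonzero(available & (corr >= min_corr))
        available[members] = False
        available[seed] = False            # guarantees the loop terminates
        if members.size < min_size:
            continue                       # seed has no tight partners
        # Component = plain average of its tightly co-expressed genes.
        components.append((members, Z[:, members].mean(axis=1)))
    return components

def dlda_fit_predict(train, y_train, test):
    """Diagonal LDA with equal priors: class means, one pooled variance per feature."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    means = np.array([train[y_train == c].mean(axis=0) for c in classes])
    # Pooled within-class variance per feature; covariances are ignored.
    resid = train - means[np.searchsorted(classes, y_train)]
    var = (resid ** 2).sum(axis=0) / (len(y_train) - len(classes))
    # Variance-scaled squared distance from each test sample to each class mean.
    dist = (((test[:, None, :] - means[None, :, :]) ** 2) / var).sum(axis=2)
    return classes[np.argmin(dist, axis=1)]

# Tiny synthetic example: genes 0-2 are co-expressed and track the class label,
# genes 3-4 are unrelated patterns.
base = np.array([0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 1.0])
alt = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
blk = np.array([1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
X_demo = np.column_stack([base, base + 0.01 * alt, 2 * base + 3, alt, blk])
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])

components = build_components(X_demo, y_demo)
features = np.column_stack([scores for _, scores in components])
predictions = dlda_fit_predict(features, y_demo, features)
```

In this sketch, each component is summarized by a single score per sample, so the classifier works with a handful of interpretable features rather than with loadings on every gene. Predicting on the training samples, as done here, only shows the mechanics. An honest error estimate would need cross-validation that repeats the component search inside each fold.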

Molecular Signatures from Gene Expression Data
