Abstract: Housekeeping protein-coding genes are stably expressed genes in cells and tissues that are thought to be engaged in fundamental cellular biological functions. They are often utilized as normalization references in molecular biology research and are especially important in integrated bioinformatic investigations. Prior studies have examined human housekeeping protein-coding genes by analyzing various gene expression datasets. The inclusion of different tissue types significantly impacted the discovery of housekeeping genes. In this report, we investigated particularly individual human subject expression differences in protein-coding genes across different tissue types. We used GTEx V8 gene expression datasets obtained from more than 16,000 human normal tissue samples. Furthermore, the Gini index is utilized to investigate the expression variations of protein-coding genes between tissue and individual donor subjects. Housekeeping protein-coding genes found using Gini index profiles may vary depending on the tissue subtypes investigated, particularly given the diverse sample size collections across the GTEx tissue subtypes. We subsequently selected major tissues and identified subsets of housekeeping genes with stable expression levels among human donors within those tissues. In this work, we provide alternative sets of housekeeping protein-coding genes that show more consistent expression patterns in human subjects across major solid organs. Weblink: https://hpsv.ibms.sinica.edu.tw.
What problem does this paper attempt to address?
This paper asks how to identify and define housekeeping protein-coding genes that are stably expressed both across tissue types and across individuals. Specifically, the researchers analyze large-scale expression data from normal human tissue samples to characterize expression differences between individuals and among tissues, with the aim of providing more reliable criteria for selecting housekeeping genes.
### Research Background
Housekeeping genes are generally considered to be stably expressed genes that are involved in basic cellular functions. They are often used as normalization references in molecular biology research and are especially important in integrative bioinformatics analyses. However, previous studies have mainly focused on average gene expression across tissue types, ignoring variability between individuals and the influence of tissue specificity. As a result, the definition and selection of housekeeping genes remain disputed.
### Research Objectives
1. **Explore the expression differences between individuals and among tissues**: By analyzing the GTEx V8 gene expression dataset (from more than 16,000 normal human tissue samples), the researchers aim to characterize how gene expression differs between individuals and among tissues.
2. **Use the Gini index to evaluate expression variation**: The Gini index is a non-parametric measure, originally used to quantify economic income inequality and later applied to the study of gene expression distributions. The researchers use it to evaluate gene expression variation across tissues and across individuals.
3. **Identify stably expressed housekeeping genes**: Based on the Gini index analysis, the researchers aim to identify a set of housekeeping genes that are stably expressed across tissue types and individuals, providing reliable reference genes for subsequent research.
### Main Methods
- **Data Source**: The GTEx V8 gene expression dataset covers 54 tissue subtypes. After excluding the two cell line datasets, 52 normal tissue subtypes were retained.
- **Gini Index Calculation**:
  - **Gini index-subject**: The Gini index of each gene computed over all 16,704 samples, measuring expression variation between individuals.
  - **Gini index-tissue**: The Gini index of each gene computed over the individual samples within each tissue subtype, measuring expression variation within that tissue.
  - **Gini index-TPM**: The Gini index of each gene computed over the average TPM values of the 52 tissue subtypes, measuring expression variation between tissues.
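The three variants differ only in which values are fed into the Gini calculation. The following Python sketch illustrates this for a single gene. It is not the authors' pipeline: the function names, the toy TPM values, and the handling of degenerate input are assumptions made for illustration, and the Gini formula assumes the usual normalization by total expression.

```python
from collections import defaultdict
from statistics import mean


def gini_index(values):
    """Sample Gini index of non-negative values (0 = uniform expression)."""
    x = sorted(values)  # ascending order
    n, total = len(x), sum(x)
    if n < 2 or total == 0:
        return 0.0  # assumption: degenerate input is reported as 0
    weighted = sum((2 * i - n - 1) * xi for i, xi in enumerate(x, start=1))
    return weighted / ((n - 1) * total)


# Toy input (made-up numbers): one (tissue_subtype, TPM) pair per sample,
# all for the same hypothetical gene.
toy_samples = [
    ("Liver", 10.0), ("Liver", 12.0), ("Liver", 11.0),
    ("Lung", 30.0), ("Lung", 28.0),
    ("Testis", 9.0), ("Testis", 9.0),
]


def group_by_tissue(samples):
    """Collect the TPM values of each tissue subtype into a list."""
    groups = defaultdict(list)
    for tissue, tpm in samples:
        groups[tissue].append(tpm)
    return groups


def gini_subject(samples):
    """Gini index over all samples pooled, regardless of tissue."""
    return gini_index([tpm for _, tpm in samples])


def gini_tissue(samples):
    """One Gini index per tissue subtype, over that subtype's samples."""
    return {t: gini_index(v) for t, v in group_by_tissue(samples).items()}


def gini_tpm(samples):
    """Gini index over the per-subtype mean TPM values."""
    return gini_index([mean(v) for v in group_by_tissue(samples).values()])


print(gini_subject(toy_samples))
print(gini_tissue(toy_samples))
print(gini_tpm(toy_samples))
```

In this toy case the gene is almost uniformly expressed within each tissue (small Gini index-tissue values) but differs markedly between tissues (larger Gini index-TPM), which is the kind of distinction the three measures are meant to separate.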
### Key Findings
- **Definition of Housekeeping Genes**: The researchers selected 20 housekeeping genes from 19,273 protein-coding genes using the criterion that the Gini index-subject is less than 0.2. These genes show low Gini index-tissue values in most, but not all, of the 52 tissue subtypes.
- **Influence of Tissue Subtypes**: The number of samples and the expression levels of different tissue subtypes have a significant impact on the calculation of the Gini index. For example, in the testis and cerebellum, more than 10,000 genes have Gini index-tissue values below 0.2, while in blood tissue only 35 genes meet this criterion.
- **Housekeeping Genes across Tissue Subtypes**: Further analysis shows that only 4 genes (SHARPIN, TMEM219, ZNF768 and CTDNEP1) show low Gini index-tissue values in as many as 49 tissue subtypes.
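The counts reported above (genes passing the threshold per tissue subtype, and tissue subtypes passed per gene) can be derived from a table of Gini index-tissue values. The sketch below shows one way to do this in Python. The gene names `GENE_A` to `GENE_C`, the three tissue labels, and all numeric values are hypothetical placeholders, not results from the paper; only the 0.2 cutoff is taken from the text.

```python
THRESHOLD = 0.2  # cutoff quoted in the findings above

# Hypothetical Gini index-tissue values: gene -> {tissue subtype: Gini index}
gini_by_tissue = {
    "GENE_A": {"Liver": 0.10, "Lung": 0.15, "Whole Blood": 0.35},
    "GENE_B": {"Liver": 0.05, "Lung": 0.08, "Whole Blood": 0.12},
    "GENE_C": {"Liver": 0.40, "Lung": 0.18, "Whole Blood": 0.50},
}


def stable_tissue_counts(table, threshold=THRESHOLD):
    """For each gene, count the tissue subtypes with a Gini index below the threshold."""
    return {
        gene: sum(g < threshold for g in per_tissue.values())
        for gene, per_tissue in table.items()
    }


def stable_gene_counts(table, threshold=THRESHOLD):
    """For each tissue subtype, count the genes with a Gini index below the threshold."""
    counts = {}
    for per_tissue in table.values():
        for tissue, g in per_tissue.items():
            counts[tissue] = counts.get(tissue, 0) + int(g < threshold)
    return counts


per_gene = stable_tissue_counts(gini_by_tissue)
per_tissue = stable_gene_counts(gini_by_tissue)

# Genes that are stable in the largest number of tissue subtypes
best = max(per_gene.values())
top_genes = sorted(gene for gene, count in per_gene.items() if count == best)

print(per_gene)
print(per_tissue)
print(best, top_genes)
```

Ranking genes by the number of subtypes they pass, rather than requiring all subtypes, mirrors the observation that no gene was stable in every one of the 52 subtypes.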
### Conclusion
This study proposes more accurate criteria for the definition of housekeeping genes through comprehensive analysis of the expression differences between individuals and among tissues. The research results show that the number of samples and expression levels of different tissue subtypes have an important impact on the selection of housekeeping genes, and future research should consider these factors to improve the accuracy and applicability of housekeeping genes.
### Formula Representation
- **Gini Index Formula**:
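\[
G=\frac{\sum_{i = 1}^{n}(2i - n - 1)x_i}{(n - 1)\sum_{i = 1}^{n}x_i}
\]
where \(G\) represents the Gini index, \(n\) is the number of samples, and \(x_i\) is the \(i\)-th expression value after sorting in ascending order. Dividing by the total expression \(\sum_{i = 1}^{n}x_i\) makes \(G\) independent of the expression scale, so that it ranges from 0 to 1.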
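A Gini index near 0 indicates nearly uniform expression across the samples considered, whereas a value near 1 indicates expression concentrated in a few samples. Low values are therefore used to flag candidate housekeeping genes.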
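As a minimal sketch, the Gini index of one gene can be computed in a few lines of Python. The version below assumes the weighted sum is normalized by the total expression, as in the usual sample Gini index. The function name `gini_index` and the choice to return 0 for fewer than two samples or all-zero expression are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence


def gini_index(values: Sequence[float]) -> float:
    """Sample Gini index of non-negative expression values (e.g. TPM)."""
    x = sorted(values)  # ascending order, so x[i - 1] plays the role of x_i
    n = len(x)
    total = sum(x)
    if n < 2 or total == 0:
        return 0.0  # assumption: treat degenerate input as "no inequality"
    # Weighted sum with weights (2i - n - 1) for i = 1..n
    weighted = sum((2 * i - n - 1) * xi for i, xi in enumerate(x, start=1))
    return weighted / ((n - 1) * total)


print(gini_index([5, 5, 5, 5]))   # uniform expression -> 0.0
print(gini_index([0, 0, 0, 10]))  # all expression in one sample -> 1.0
print(gini_index([4, 1, 3, 2]))   # input order does not matter
```

Because the values are sorted inside the function, the result does not depend on the order in which samples are supplied.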