Bioinformatic Practices: How to conduct Enrichment Analysis with TCGA-BRCA RNA-seq (FPKM)?
Tissue:
Processed TCGA-BRCA RNAseq data in FPKM (2 excel files) have been given (normal and primary tumour; brca-rsem-fpkm-tcga.xlsx and brca-rsem-fpkm-tcga-t.xlsx, respectively). How to perform analysis to find out the the most enriched and downregulated genes and gene sets in primary breast tumours?
©️ Copyright 2023 @ Authors
Author:
Mujie Lin 📨
日期:2023-12-19
共享协议:本作品采用知识共享署名-非商业性使用-相同方式共享 4.0 国际许可协议进行许可。
快速开始:点击上方的 开始连接 按钮,选择 bio-r-notebook:v4镜像 和任意内存大于4G的机型即可开始。
1 Analysis Differential Expression Genes by Wilcoxon Test
The two most popular methods in statistical approaches and software for RNA-seq data analysis are DESeq2 and edgeR. Both are built on the core assumption that sequencing data follow a negative binomial distribution, and they apply their own logic to normalize the raw read counts. If samples lack replicates, the negative binomial distribution may be noisy, in which case one might consider DEGseq based on the Poisson distribution. However, an underlying assumption that must be accepted for software like DESeq2 and edgeR to function properly is that there are no absolute differences between two groups of samples, or in other words, the expression of the vast majority of genes is the same across both groups.
A recent study[1] has indicated that once the analysis scales to large cohorts in population-level RNA-seq studies, with sample sizes ranging from several dozen to hundreds or thousands, the assumption of no differential gene expression across the majority of genes no longer holds. In population-level studies that have a larger sample size (at least several dozen), parameter assumptions often become unnecessary. Furthermore, with a larger number of samples, outliers are more likely to occur, violating assumptions and leading to unstable p-values, which in turn can invalidate the false discovery rate (FDR).
To explore how sample size affects the performance of six analytical methods, researchers conducted subsampling on each dataset, resulting in datasets with sample sizes ranging from 2 to 100. From the results shown above, it is evident that at a 1% FDR threshold, as long as the sample size per condition exceeds eight, the Wilcoxon rank-sum test performs comparably or better than the three parametric methods (DESeq2, edgeR, and limma-voom) and other non-parametric methods.
Therefore, the Wilcoxon rank-sum test should be employed for differential expression analysis in this case.
Attaching package: ‘dplyr’ The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union Loading required package: limma
Unlike DESeq2, edgeR, and limma, as a non-regression-based method, the Wilcoxon rank-sum test cannot adjust for various potential confounding factors, such as differences in sequencing depth. Therefore, to utilize the Wilcoxon rank-sum test, RNA-seq samples should first be normalized to eliminate batch effects before analysis. Given that the provided type of scRNAseq is in FPKM format, which stands for fragments per kilobase of exon per million fragments mapped, FPKM cannot be used directly for differential expression analysis and should first be converted into another form. Here, we choose to transform it into TPM (Transcripts Per Million) because we will use the Wilcoxon Test to conduct a differential expression analysis.
Therefore, I have constructed a function that converts the expression matrix from FPKM to TPM.
Thus, we will obtain a table listing the results of the differential expression analysis.
2 Enrichment Analysis and Visualization
Next, let's conduct enrichment analysis, including GO, KEGG, and GSEA analysis.
Loading required package: BiocGenerics Attaching package: ‘BiocGenerics’ The following object is masked from ‘package:gridExtra’: combine The following objects are masked from ‘package:stats’: IQR, mad, sd, var, xtabs The following objects are masked from ‘package:base’: Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted, lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min Loading required package: graph Attaching package: ‘graph’ The following object is masked from ‘package:stringr’: boundary Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Loading required package: GO.db Loading required package: AnnotationDbi Loading required package: stats4 Loading required package: IRanges Loading required package: S4Vectors Attaching package: ‘S4Vectors’ The following object is masked from ‘package:clusterProfiler’: rename The following object is masked from ‘package:utils’: findMatches The following objects are masked from ‘package:base’: I, expand.grid, unname Attaching package: ‘IRanges’ The following object is masked from ‘package:clusterProfiler’: slice Attaching package: ‘AnnotationDbi’ The following object is masked from ‘package:clusterProfiler’: select Loading required package: SparseM Attaching package: ‘SparseM’ The following object is masked from ‘package:base’: backsolve groupGOTerms: GOBPTerm, GOMFTerm, GOCCTerm environments built. Attaching package: ‘topGO’ The following object is masked from ‘package:IRanges’: members ======================================== circlize version 0.4.15 CRAN page: https://cran.r-project.org/package=circlize Github page: https://github.com/jokergoo/circlize Documentation: https://jokergoo.github.io/circlize_book/book/ If you use it in published research, please cite: Gu, Z. circlize implements and enhances circular visualization in R. Bioinformatics 2014. This message can be suppressed by: suppressPackageStartupMessages(library(circlize)) ======================================== Attaching package: ‘circlize’ The following object is masked from ‘package:graph’: degree Loading required package: grid Attaching package: ‘grid’ The following object is masked from ‘package:topGO’: depth ======================================== ComplexHeatmap version 2.16.0 Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/ Github page: https://github.com/jokergoo/ComplexHeatmap Documentation: http://jokergoo.github.io/ComplexHeatmap-reference If you use it in published research, please cite either one: - Gu, Z. Complex Heatmap Visualization. iMeta 2022. - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 2016. The new InteractiveComplexHeatmap package can directly export static complex heatmaps into an interactive Shiny app with zero effort. Have a try! This message can be suppressed by: suppressPackageStartupMessages(library(ComplexHeatmap)) ========================================
Load in the differential expression data.
First, we need to specify the species databases for enrichment analysis: the species database for GO analysis is 'org.Hs.eg.db', and for KEGG analysis, it is 'hsa'.
'select()' returned 1:many mapping between keys and columns Warning message in bitr(info$gene_symbol, fromType = "SYMBOL", toType = "ENTREZID", : “11.39% of input gene IDs are fail to map...”
Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"... Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...
Warning message in merge.data.frame(info, gene, by = "SYMBOL"): “column names ‘NA’, ‘NA’ are duplicated in the result”
preparing geneSet collections... GSEA analysis... Warning message in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : “There are ties in the preranked stats (0.09% of the list). The order of those tied genes will be arbitrary, which may produce unexpected results.” Warning message in fgseaMultilevel(pathways = pathways, stats = stats, minSize = minSize, : “For some pathways, in reality P-values are less than 1e-10. You can set the `eps` argument to zero for better estimation.” leading edge analysis... done...
# # Gene Set Enrichment Analysis # #...@organism hsa #...@setType KEGG #...@keytype kegg #...@geneList Named num [1:17493] 7.68 6.52 5.99 5.81 5.78 ... - attr(*, "names")= chr [1:17493] "1300" "1469" "4320" "1301" ... #...nPerm #...pvalues adjusted by 'BH' with cutoff <0.05 #...49 enriched terms found 'data.frame': 49 obs. of 11 variables: $ ID : chr "hsa04110" "hsa00982" "hsa04152" "hsa04923" ... $ Description : chr "Cell cycle" "Drug metabolism - cytochrome P450" "AMPK signaling pathway" "Regulation of lipolysis in adipocytes" ... $ setSize : int 154 68 120 54 74 49 42 73 68 65 ... $ enrichmentScore: num 0.603 -0.684 -0.566 -0.685 -0.642 ... $ NES : num 2.42 -2.22 -1.98 -2.13 -2.11 ... $ pvalue : num 1.00e-10 2.74e-08 2.40e-07 3.85e-07 4.40e-07 ... $ p.adjust : num 3.38e-08 4.64e-06 2.71e-05 2.98e-05 2.98e-05 ... $ qvalue : num 2.34e-08 3.20e-06 1.87e-05 2.06e-05 2.06e-05 ... $ rank : num 1298 3057 1227 1253 1640 ... $ leading_edge : chr "tags=27%, list=7%, signal=25%" "tags=46%, list=17%, signal=38%" "tags=19%, list=7%, signal=18%" "tags=54%, list=7%, signal=50%" ... $ core_enrichment: chr "9088/995/5347/991/699/9133/10403/9212/113130/701/9232/8318/890/9700/983/7272/990/9134/157570/1870/81620/1869/89"| __truncated__ "2941/2330/2329/2952/27306/2328/131/1577/128/130/54577/7363/2950/2326/4257/7364/1576/1548/54490/2946/10720/4129/"| __truncated__ "81617/5519/5521/10891/8660/5209/2308/5295/9586/2998/3479/5207/3953/5468/32/148/6517/3991/948/5105/1149/9370/3952" "7253/5290/51099/134/114/112/109/155/3667/10000/4886/57104/8660/5733/5295/196883/153/11343/111/2770/5743/5140/15"| __truncated__ ... #...Citation T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
The GO terms are grouped into different biological processes, including gland development, small GTPase mediated signal transduction, embryonic organ development, mononuclear cell differentiation, and others. The presence of these terms suggests these biological processes are potentially involved in the differences between normal and tumor samples in breast cancer. For instance, GO terms such as "gland development" and "embryonic organ development" with a high number of counts and very low p-values (dark red bars) suggest that these processes are highly relevant in the context of the tumor's biology and are significantly different compared to normal samples.
Similarly, terms like "DNA-binding transcription factor activity" and "RNA polymerase II-specific DNA-binding transcription factor binding" have a high count, which implies a significant change in gene expression related to transcription regulation between normal and tumor tissues. The blue bar at the bottom represents "GTPase regulator activity," which has a substantial number of genes involved but with a less significant p-value (as indicated by the lighter shade of red).
Due to this chart, the PI3K-Akt signaling pathway has the highest count of differentially expressed genes, suggesting that this pathway plays a significant role in the molecular events of BRCA tumor development or progression.
Pathways such as MAPK signaling, Cytokine-cytokine receptor interaction, and Calcium signaling also have high counts, underscoring their potential involvement in breast cancer pathophysiology. The Breast cancer pathway itself is highlighted, which is expected since the dataset specifically concerns breast cancer.
The color coding shows that pathways related to cancer, immune response, and cell signaling (such as Hepatocellular carcinoma, Human T-cell leukemia virus 1 infection, and Chemokine signaling pathway) are highly significant, as indicated by the deeper red colors.
Interestingly, pathways like Neuroactive ligand-receptor interaction and the Neurotrophin signaling pathway are also represented, which could suggest a role for these neuro-related pathways in breast cancer, a connection that has been explored in recent research.
The presence of non-cancer-specific pathways (such as Salmonella infection or AMPK signaling pathway) may reflect broader cellular processes affected by cancer or might be incidental findings due to the wide-ranging impact of the disease on cellular metabolism.
Processes like "small GTPase mediated signal transduction" and "mononuclear cell differentiation" have a relatively high Gene Ratio and a significant number of genes involved (larger red dots), indicating these are likely important processes in the tumor samples.
"Gland development" and "embryonic organ development" also show significant enrichment, which is expected in the context of breast cancer, as these processes are related to tissue and organ formation, and their dysregulation might play a role in tumorigenesis.
Some terms like "DNA-binding transcription factor activity" and "actin binding" appear less significant (indicated by blue dots) and involve fewer genes (smaller dots), which may suggest a lesser role in the specific aspects of breast cancer pathology being studied.
Overall, the dot plot visualizes which biological processes are most significantly associated with the genes that are differentially expressed between normal and breast cancer tumor samples.
Due to this dotplot, the Neuroactive ligand-receptor interaction and PI3K-Akt signaling pathway appear to be the most significantly enriched pathways, as indicated by their position towards the top of the plot and their red color.
MAPK signaling and Cytokine-cytokine receptor interaction pathways also show significant enrichment, suggesting these signaling pathways are potentially involved in the pathophysiology of breast cancer.
The presence of pathways such as Human papillomavirus infection may reflect underlying processes related to viral mechanisms that could be implicated in cancer or might be associated with the wider effects of tumor biology on cellular function.
The Calcium signaling pathway and cAMP signaling pathway are less significantly enriched but still noteworthy, as indicated by their purple and blue colors.
The size of the dots, particularly for pathways like Neuroactive ligand-receptor interaction and PI3K-Akt signaling, suggests a substantial number of genes are involved, highlighting the importance of these pathways in the context of BRCA tumor biology.
Now, let's discover the correlation between GO pathways.
Warning message in emapplot.enrichResult(x, showCategory = showCategory, ...): “Use 'layout.params = list(layout = your_value)' instead of 'layout'. The layout parameter will be removed in the next version.”
The network plot visualizes the relationships and overlap between different GO biological process categories. Each node (circle) represents a GO term, and the size of the node reflects the number of genes associated with that term—larger nodes indicate more genes are involved. The color of the nodes corresponds to the adjusted p-value (p.adjust) of the enrichment, with the gradient from blue to red indicating less to more significant terms, respectively.
Central nodes, like "small GTPase mediated signal transduction" and "cell junction assembly," suggest they are pivotal biological processes with a larger number of genes involved, making them potentially crucial in the context of the pathology of breast cancer. Highly connected nodes indicate GO terms that share a significant number of genes, suggesting related or overlapping biological functions.
Red nodes, such as "response to metal ion" and "synapse organization," are among the most significantly enriched processes, which might be particularly relevant to the pathophysiology of breast cancer.
Peripheral nodes with fewer connections and smaller size may represent more specific processes that are altered between normal and tumor samples but are not as broadly represented across the differentially expressed genes.
The plot provides a comprehensive overview of the biological processes that are most impacted in breast cancer tumor tissue compared to normal breast tissue.
Furthermore, we can make a directed acyclic graph (DAG) representing a hierarchy of Biological Process (BP) Gene Ontology (GO) terms derived from a GO enrichment analysis.
Warning message in emapplot.enrichResult(x, showCategory = showCategory, ...): “Use 'layout.params = list(layout = your_value)' instead of 'layout'. The layout parameter will be removed in the next version.”
And let's discover the correlation between KEGG pathways.
Warning message in emapplot.enrichResult(x, showCategory = showCategory, ...): “Use 'layout.params = list(layout = your_value)' instead of 'layout'. The layout parameter will be removed in the next version.” Warning message: “ggrepel: 17 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
Central and larger nodes like those representing the Cell cycle and MAPK signaling pathway are significant and involve a high number of genes, suggesting these pathways play a central role in the biology of breast cancer presented in the data. Pathways such as Estrogen signaling, PI3K-Akt signaling, and Insulin signaling are also highlighted. These signaling pathways are known to be crucial in breast cancer development and progression, affecting cell growth, metabolism, and survival.
The presence of infection-related pathways like Human papillomavirus infection, Salmonella infection, and notably Coronavirus disease – COVID-19 in the context of breast cancer analysis might reflect the impact of these infections on cellular processes or could be due to the interplay between infectious agents and cancer pathways.
Peripheral nodes such as Peroxisome, Endocytosis, and Lysosome might indicate ancillary pathways that are less central but still impacted by the differential gene expression between normal and tumor samples.
The network layout with densely interconnected nodes in the center and less connected nodes on the periphery helps identify the core biological processes and secondary pathways that may contribute to the phenotype observed.
Warning message in emapplot.enrichResult(x, showCategory = showCategory, ...): “Use 'layout.params = list(layout = your_value)' instead of 'layout'. The layout parameter will be removed in the next version.”
groupGOTerms: GOBPTerm, GOMFTerm, GOCCTerm environments built. Building most specific GOs ..... ( 15741 GO terms found. ) Build GO DAG topology .......... ( 15741 GO terms and 35553 relations. ) Annotating nodes ............... ( 18614 genes annotated to the GO terms. ) Loading required package: Rgraphviz Attaching package: ‘Rgraphviz’ The following objects are masked from ‘package:IRanges’: from, to The following objects are masked from ‘package:S4Vectors’: from, to
$dag A graphNEL graph with directed edges Number of Nodes = 51 Number of Edges = 72 $complete.dag [1] "A graph with 51 nodes."
The central node colored in red, "GO:0042127 - regulation of cell proliferation", is a critical biological process often implicated in cancer, indicating that regulation of cell proliferation is significantly affected in BRCA tumor samples.
The nodes in orange may represent processes that are important but have a slightly lower significance level compared to the red node. They include processes such as "GO:0051270 - regulation of cell motion" and "GO:0030334 - regulation of cell migration", which are also relevant to cancer progression as they involve the movement and spread of cells.
Yellow nodes, such as "GO:0007155 - cell adhesion", could be indicative of processes that are important in the context of the pathology, like the adhesion of cancer cells within the tumor microenvironment or to distant tissues during metastasis.
groupGOTerms: GOBPTerm, GOMFTerm, GOCCTerm environments built. Building most specific GOs ..... ( 15741 GO terms found. ) Build GO DAG topology .......... ( 15741 GO terms and 35553 relations. ) Annotating nodes ............... ( 18614 genes annotated to the GO terms. )
$dag A graphNEL graph with directed edges Number of Nodes = 51 Number of Edges = 72 $complete.dag [1] "A graph with 51 nodes."
groupGOTerms: GOBPTerm, GOMFTerm, GOCCTerm environments built. Building most specific GOs ..... ( 1919 GO terms found. ) Build GO DAG topology .......... ( 1919 GO terms and 3154 relations. ) Annotating nodes ............... ( 19518 genes annotated to the GO terms. )
$dag A graphNEL graph with directed edges Number of Nodes = 38 Number of Edges = 54 $complete.dag [1] "A graph with 38 nodes."
groupGOTerms: GOBPTerm, GOMFTerm, GOCCTerm environments built. Building most specific GOs ..... ( 1919 GO terms found. ) Build GO DAG topology .......... ( 1919 GO terms and 3154 relations. ) Annotating nodes ............... ( 19518 genes annotated to the GO terms. )
$dag A graphNEL graph with directed edges Number of Nodes = 38 Number of Edges = 54 $complete.dag [1] "A graph with 38 nodes."
Picking joint bandwidth of 0.424
Picking joint bandwidth of 0.424
Attaching package: ‘ggstance’ The following objects are masked from ‘package:ggplot2’: GeomErrorbarh, geom_errorbarh
Warning message: “The `add` argument of `group_by()` is deprecated as of dplyr 1.0.0. ℹ Please use the `.add` argument instead. ℹ The deprecated feature was likely used in the dplyr package. Please report the issue at <https://github.com/tidyverse/dplyr/issues>.”
The gradient from blue to red indicates increasing significance, with red being the most statistically significant (lowest q-value). Positive NES (to the right) indicates gene sets that are overrepresented at the top of the ranked list, suggesting a potential upregulation or activation in the condition being studied. And negative NES (to the left) indicates gene sets that are overrepresented at the bottom of the ranked list, suggesting potential downregulation or deactivation.
From this plot, we can infer which biological processes or pathways might be relevant in the condition or phenotype being studied. For example, gene sets related to "Antigen processing and presentation," "Nucleocytoplasmic transport", and "Motor proteins" show a positive NES, suggesting they might be more active. On the other hand, gene sets such as "ABC transporters" and "Insulin resistance" have a negative NES, indicating they might be less active or downregulated in the condition under study.
Pathways on the far right with positive NES values and red bars, such as Antigen processing and presentation and Nucleocytoplasmic transport, are significantly enriched in the tumor samples. This suggests these pathways are upregulated in the primary tumor samples compared to the normal samples and could be critical in the cancer’s biology.
The PI3K-Akt signaling pathway, often implicated in cancer, shows a moderate NES and a highly significant q-value. This pathway's role in cell survival, growth, and metabolism is well-documented in oncology, and its enrichment here supports its potential involvement in breast cancer tumorigenesis.
Pathways with negative NES values, depicted by blue bars, such as ABC transporters and Insulin resistance, are potentially downregulated in the tumor samples. These pathways might be less active in tumor cells or could be associated with the tumor microenvironment’s metabolic changes. The q-value color coding indicates that all the depicted pathways are significantly enriched, but those with the reddest bars, such as Nucleocytoplasmic transport and Antigen processing and presentation, are the most statistically significant.
References:
[1] Li, Yumei, Xinzhou Ge, Fanglue Peng, Wei Li, and Jingyi Jessica Li. “Exaggerated False Positives by Popular Differential Expression Methods When Analyzing Human Population Samples.” Genome Biology 23, no. 1 (March 15, 2022): 79. https://doi.org/10.1186/s13059-022-02648-4.