空间站广场

论文

Notebooks

比赛

课程

Apps

我的主页

我的Notebooks

我的论文库

我的足迹

我的工作空间

任务

节点

文件

数据集

镜像

项目

数据库

公开

有参考基因组的Bulk RNA-Seq数据分析

生物信息学

rnaseq

生物信息学rnaseq

Julia_Xiang

发布于 2023-07-17

数据集

AI4SCUP-CNS-BBB(v1)

有参考基因组的Bulk RNA-Seq数据分析

📖 上手指南
本文档可在 Bohrium Notebook 上直接运行。你可以点击界面上方按钮开始连接，选择 rnaseq-0714-final:rnaseq 镜像及 c32_m128_cpu 节点配置，稍等片刻即可运行。

代码

文本

代码

文本

RNA-Seq是目前生信用的非常多的测序数据。本教程内容主要包括以下几个部分：

上游分析 (Linux Shell Script)

注：由于上游分析非常耗时，所以本教程已经提前将上游分析代码运行后的所有结果存储在镜像中。若读者想独立运行上游代码，建议提前准备100G内存，16个CPU的服务器。并且预留充足的时间(例如1-2天)，然后再去除代码前的注释运行代码。

使用fasterq-dump从SRA数据库下载RNA-Seq fastq格式的Raw data数据
使用fastqc和multiqc检查数据质量，使用fastp软件对fastq数据进行质量控制(Quality Control), 过滤低质量reads
使用常见的比对软件STAR将RNA-Seq测得的reads比对到Human Genome上
然后使用featureCounts将比对得到的结果，转换为基因表达量raw counts(这就是行为基因，列为样本的基因表达矩阵)

下游分析 (R语言)

使用edgeR将raw counts标准化为TMM
使用edgeR针对实验组和对照组进行差异基因表达分析
EnhancedVolcano对差异基因结果可视化（绘制火山图）

代码

文本

Step0: 运行上游代码请提前将Kernel切换为`rnaseq`

kernel选rnaseq.png

Step1: 下载用于分析的RNA-Seq数据、参考基因组、GTF注释文件

注：由于数据已下载在镜像中, 读者不必重复下载, 此处只是展示下载数据所需软件、数据集链接和代码
首先从SRA数据库下载我们需要分析的RNA-Seq原始序列数据, 这里选择ZIKA数据集; 选择前四套数据集(感染前/后，且双端测序(PAIRED))
然后从GENCODE下载人类基因组序列和基因组注释文件GTF

数据集信息.png

代码

文本

[15]

# 创建文件夹

#!mkdir ref rawdata tools fastp_output star_output fc_output deg_output

# 本地下载sra-toolkit

#!wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.5/sratoolkit.3.0.5-ubuntu64.tar.gz

#!tar -xvzf sratoolkit.3.0.5-ubuntu64.tar.gz

# 下载ZIKA RNA-Seq测序数据

#!./tools/sratoolkit.3.0.5-ubuntu64/bin/fasterq-dump SRR3191542 -O ./rawdata

#!./tools/sratoolkit.3.0.5-ubuntu64/bin/fasterq-dump SRR3191543 -O ./rawdata

#!./tools/sratoolkit.3.0.5-ubuntu64/bin/fasterq-dump SRR3191544 -O ./rawdata

#!./tools/sratoolkit.3.0.5-ubuntu64/bin/fasterq-dump SRR3191545 -O ./rawdata

# 下载人类参考基因组Hg38和对应的注释文件

#!cd ./ref/

#!wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/GRCh38.p13.genome.fa.gz

#!wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.basic.annotation.gtf.gz

#!cd ..

代码

文本

Step2: RNA-Seq reads的质量控制

RNA-seq reads由于测序平台的原因, 可能会有质量不过关的reads, 因此需要用fastqc和multiqc检查RNA-seq序列的质量, 然后用fastp过滤低质量序列。

代码

文本

[2]

# 质控

#!fastqc ./rawdata/*.fastq

#!multiqc ./rawdata/

代码

文本

如下图所示 multiqc整合了fastqc对多个RNA-Seq样本的质控结果。首先可以整体看一下样本中reads的测序分布情况。

代码

文本

RNA-Seq质控会主要关注四个内容：

测序Reads的碱基质量, 一般需要选Q30以上
Per Base Sequence Content, 可以看到下图序列中的前15个碱基出现了较大的波动, 这是由于测序仪刚刚开机运行产生的噪声会影响reads的测定, 之后需要切掉这些碱基
Overpresented sequences, 测序的序列里面可能会混杂其它污染物的序列, 这些序列也需要检测后去除。我们的序列中没有太多的reads是过表达的, 所以暂时可以忽视
Adapter, 测序的时候每个序列会加一个Adapter, 所以也需要检查测序得到的文库里面Adapter比例情况。具体解释可以参考这里

代码

文本

完成质控之后, 使用fastp来对测序的Reads做质量控制

代码

文本

[ ]

# FASTQ quality control, trimming, filtering

#!for fq in SRR3191542 SRR3191543 SRR3191544 SRR3191545;

#!do

#!fastp -w 16 -i ./rawdata/${fq}_1.fastq -I ./rawdata/${fq}_2.fastq -f 15 -F 15 -o ./fastp_output/${fq}.R1.fq.gz -O ./fastp_output/${fq}.R2.fq.gz

#!done

代码

文本

Step3: 序列比对和定量

完成RNA-Seq序列质控之后, 重要的一步就是将测序得到的序列比对到基因组上，这也是最耗时的过程。目前有多款用于序列比对的软件, 包括: STAR、HISAT2、Bowtie2/Bowtie1、BWA等等。这些软件整体在二代测序reads比对这个任务上, 都有不错的效果, 但是各自也有各自的特点。例如Bowtie 1就更适用于长度<50bp的短reads比对, 而Bowtie2可用于50-1000bp长度的reads比对。STAR在可变剪切任务中做比对有更好的效果。在本次教程中就使用STAR作为主要的比对器。做序列比对的流程首先是对参考基因组建立索引, 然后再将reads比对到参考基因组上, 建立索引的步骤是为了后续加快比对速度。
序列比对的输出是SAM或者SAM的二进制格式文件BAM; 我们使用samtools来处理这种格式的文件。
完成序列比对之后, 重要的是了解各个基因的表达量。因此, 需要使用定量软件来获取每个基因的表达值, 我们使用最流行的软件featureCounts完成这个步骤。

代码

文本

[ ]

# STAR构建基因组索引

#!STAR --runThreadN 16 \

# --runMode genomeGenerate \

# --genomeDir ./ref/hg38_index \

# --genomeFastaFiles ./ref/GRCh38.p13.genome.fa \

# --sjdbGTFfile ./ref/gencode.v43.basic.annotation.gtf \

# --sjdbOverhang 74

# tips:在STAR建立索引的时候，sjdbOverhang这个参数值=测序得到的Reads长度-1，

# 本教程所用数据测序得到的单条reads长度是75bp, 因此--sjdbOverhang设置为75-1=74

# 按道理来说, 每换一次不同长度reads的样品做比对的时候, 那么就需要用STAR重新构建一次index (有点麻烦, 当然, 用默认99也行)

代码

文本

[ ]

# STAR完成序列比对

#!for fq in SRR3191542 SRR3191543 SRR3191544 SRR3191545;

#!do

#!STAR --genomeDir ./ref/hg38_index --sjdbGTFfile ./ref/gencode.v43.basic.annotation.gtf --runThreadN 16 \

# --outFileNamePrefix ./star_output/${fq}.star.sam \

# --readFilesIn ./fastp_output/${fq}.R1.fq.gz ./fastp_output/${fq}.R2.fq.gz \

# --readFilesCommand zcat \

# --outSAMtype BAM SortedByCoordinate \

# --outReadsUnmapped Fastx \

# --outSAMmode Full

#!done

# --readFilesCommand zcat 代表input data是.gz的压缩文件

# --outSAMtype BAM SortedByCoordinate output sorted by coordinate Aligned.sortedByCoord.out.bam file, similar to samtools sort command.

# --outReadsUnmapped Fastx output of unmapped and partially mapped (i.e. mapped only one mate of a paired end read) reads in separate file(s).

代码

文本

[ ]

# featureCounts完成定量

#!featureCounts -T 16 -p -t exon -g gene_id -a ./ref/gencode.v43.basic.annotation.gtf -o ./fc_output/ZIKA.featureCounts ./star_output/*.star.samAligned.sortedByCoord.out.bam

代码

文本

featureCounts结果.png 如图所示是featureCounts的定量结果：

Geneid: 基因ID
Chr: 该基因所处染色体编号
Start/End: 该基因在染色体上的起点与结束点
Strand: 正链OR负链
Length: 该基因长度
SRR3191542/SRR3191542/SRR3191542/SRR3191542 这四列就是我们的样本了, 这四列的值代表该基因在该样本的Raw Counts. Raw Counts后续标准化之后，就能视作基因的表达量

Step4: 差异基因表达分析

注意：从Step4开始, 后续所有分析都要用R语言完成, 请注意切换右上角的Kernel为R噢

featureCounts的输出数据主要是raw counts, 为了使不同样本之间的基因表达量是可比较的, 我们需要将counts进行标准化。目前有多种标准化的方法, 例如TPM/FPKM/RPKM等等。
The counts of mapped reads for each gene is proportional to the expression of RNA (“interesting”) in addition to many other factors (“uninteresting”). Normalization is the process of scaling raw count values to account for the “uninteresting” factors. In this way the expression levels are more comparable between and within samples.
标准化过程中经常考虑的一些主要因素是：
- 测序深度：考虑测序深度对于比较样本之间的基因表达是必要的。在下面的示例中，相对于样品B，样品A中每个基因的表达似乎加倍，但这是样品A测序深度加倍的结果。
- Gene length: Accounting for gene length is necessary for comparing expression between different genes within the same sample. In the example, Gene X and Gene Y have similar levels of expression, but the number of reads mapped to Gene X would be many more than the number mapped to Gene Y because Gene X is longer.

代码

文本

此处不讨论Normalization的细节，主要使用edgeR软件包来完成RNA-Seq下游的差异基因表达分析。

开始做下游分析的时候请注意将Kernel切换成R

kernel选R.png

代码

文本

[16]

library(edgeR)

## 读入raw Counts数据, 并构建分组信息

counts = read.table("../root/fc_output/ZIKA.featureCounts", sep="\t", header=T)

rownames(counts)=counts[,1]

counts = counts[,c(7:10)]

colnames(counts)=c("normal_1","normal_2","infective_1","infective_2")

groups = c(rep("normal",2),rep("infective",2)) #注意这里的顺序需要和counts表格里面列的分组顺序保持一致

# Step1: 构建DGEList对象

y <- DGEList(counts=counts, group=groups)

# Step2: 将Raw counts数据标准化为TMM

y <- calcNormFactors(y, method="TMM")

# Step3: 差异基因表达分析

design <- model.matrix(~groups)

#计算基因表达值的离散度

d1 <- estimateDisp(y, design, robust = TRUE)

fit <- exactTest(d1)

lrt <- topTags(fit, n = nrow(y$counts))

# Step4:提取差异表达基因

deseq_res = as.data.frame(lrt)

up_diff_result<-deseq_res[which(deseq_res$FDR < 0.05 & deseq_res$logFC > 2),] #上调差异表达基因

down_diff_result<-deseq_res[which(deseq_res$FDR < 0.05 & deseq_res$logFC < -2),] #下调差异表达基因

代码

文本

差异基因表达分析结果解释

logFC (log2FoldChange) log2 fold change between the groups. E.g. value 2 means that the expression has increased 4-fold. It tells us how much the gene’s expression seems to have changed due to treatment in comparison to untreated samples. In another words, This value is reported on a logarithmic scale to base 2. for example, a log2 fold change of 1.5 means that the gene’s expression is increased by a multiplicative factor of 21.5≈2.82. A positive log2 fold-change values show that the gene is more highly expressed in the first group and is labelled as ‘up-regulated’.
logCPM the average log2-counts-per-million (CPM是counts标准化后的格式之一).
p-value and FDR(adjusted p-value) From a statistical point of view, for each gene we are testing the null hypothesis that there is no differential expression across the sample groups. This may represent thousands of tests. The more genes we test, the more we inflate the false positive rate. This is the multiple testing problem. For example, the p-value with a significance cut-off of 0.05 means there is a 5% chance of error. If we test 20,000 genes for differential expression, at p < 0.05 we would expect to find 1,000 genes by chance. If we found 3000 genes to be differentially expressed total, roughly one third of our genes are false positives. We would not want to sift through our “significant” genes to identify which ones are true positives. The common approaches for multiple test correction are Bonferroni, FDR/Benjamini-Hochberg, Q-value / Storey method. edgeR uses FDR by default. Remember that a p-value indicates the probability that a fold change as strong as the observed one, or even stronger, would be seen under the situation described by the null hypothesis.

参考：https://chipster.csc.fi/manual/ngs-dea-edger-RNA.html

代码

文本

[17]

# 展示差异分析得到的上调基因

head(up_diff_result)

A data.frame: 6 × 4
	logFC	logCPM	PValue	FDR
	<dbl>	<dbl>	<dbl>	<dbl>
ENSG00000171848.16	2.620488	5.927174	6.167350e-118	4.296793e-114
ENSG00000137558.9	2.126719	6.049292	2.954778e-89	8.421519e-86
ENSG00000075290.8	2.147723	6.006580	5.640962e-89	1.537849e-85
ENSG00000228716.7	2.353406	5.567130	1.060697e-84	2.463292e-81
ENSG00000179388.9	2.105424	5.082693	1.961905e-54	1.685169e-51
ENSG00000160298.18	2.273785	4.799834	7.642453e-54	6.305325e-51

代码

文本

火山图可视化

The volcano plot visualize the log of the p-value on Y-axis and log fold change of samples on X-axis.

The dashed line is the cutoff to annotate differentially expressed genes, here the cut-off for log2 fold-change is >|1.5|; the cut-off for the adjusted P value is 0.05.

This results in data points with low p values (highly significant) appearing toward the top of the plot. Points that are found toward the top of the plot that are far to either the left- or right-hand sides represent values that display large magnitude fold changes (hence being left or right of center) as well as high statistical significance.

代码

文本

[18]

top_up <- rownames(up_diff_result)[1:5]

top_down <- rownames(down_diff_result)[1:5]

library(EnhancedVolcano)

EnhancedVolcano(deseq_res,

lab = rownames(deseq_res),

x = 'logFC',

y = 'FDR',

selectLab = c(top_up,top_down),

pCutoff = 0.05,

FCcutoff = 1.5,

)

代码

文本

生物信息学

rnaseq

生物信息学rnaseq

已赞5

本文被以下合集收录

生物信息学 Notebooks Collection

liyongge

更新于 2024-09-13

33 篇75 人关注

多细胞转录组

liyongge

更新于 2023-11-17

2 篇0 人关注

有参考基因组的Bulk RNA-Seq数据分析

上游分析 (Linux Shell Script)

下游分析 (R语言)

Step0: 运行上游代码请提前将Kernel切换为rnaseq

Step1: 下载用于分析的RNA-Seq数据、参考基因组、GTF注释文件

Step2: RNA-Seq reads的质量控制

Step3: 序列比对和定量

Step4: 差异基因表达分析

开始做下游分析的时候请注意将Kernel切换成R

火山图可视化

Step0: 运行上游代码请提前将Kernel切换为`rnaseq`