RNAseq---Hisat2 标准输出中比对率信息解读

本文详细解析了HISAT2在RNA-Seq数据比对后的输出信息,包括比对率、唯一映射和多重映射的计算方法,并给出了实例分析。通过比对率和不同类型的比对情况,可以评估比对质量。同时,讨论了concordant和discordant配对的概念及其在寻找结构变异中的作用。

RNA-Seq—Hisat2 标准输出中比对率信息解读

本文具体解释部分(一)中内容复制自Biostar内容,后面附上我实际的例子,二者略有不同,整体理解上没大问题,有不合适之处可联系我。

(一)Biostar上解释的例子

HISAT2是速度、敏感性和比对率方面比较优秀的比对软件,通常用于比对二代测序产生的RNA-Seq数据。在实际使用时需要统计HISAT2需要对结果中的比对率统计,HISAT2完成比对后标准输出如下:

HISAT2 summary stats:
            Total pairs: 11587225
                    Aligned concordantly or discordantly 0 time: 4464083 (38.53%)
                    Aligned concordantly 1 time: 2195620 (18.95%)
                    Aligned concordantly >1 times: 4877336 (42.09%)
                    Aligned discordantly 1 time: 50186 (0.43%)
            Total unpaired reads: 8928166
                    Aligned 0 time: 8019048 (89.82%)
                    Aligned 1 time: 304653 (3.41%)
                    Aligned >1 times: 604465 (6.77%)
            Overall alignment rate: 65.40%

具体解释如下:

1. Total pairs: 11587225

Total reads = 11587225 * 2 = 23174450 (matches total number of reads in the sample)

2. Aligned concordantly or discordantly 0 time: 4464083 (38.53%)

These are unmapped reads : 4464083 * 2 (paired end) = 8928166

( 8928166 / 23174450 (Total reads) ) * 100 ~ 38.53%

3. Aligned concordantly 1 time: 2195620 (18.95%)

These are uniquely mapped reads : 2195620 * 2 (paired end) = 4391240

( 4391240 / 23174450 (Total reads) ) * 100 ~ 18.95%

4. Aligned concordantly >1 times: 4877336 (42.09%)

These are multi mapped reads : 4877336 * 2 = 9754672

( 9754672 / 23174450 (Total reads) ) * 100 ~ 42.09%

5.Aligned discordantly 1 time: 50186 (0.43%)

Discordant aligned : 50186 * 2 = 100372

( 100372 / 23174450 (Total reads) ) * 100 ~ 0.43%

6. Total unpaired reads: 8928166

These are not paired reads

Aligned 0 time: 8019048 (89.82%)

(8019048 / 8928166 ) * 100 = 89.82% i.e. 89% of the unpaired reads did not align at all

Aligned 1 time: 304653 (3.41%)

(304653 / 8928166 ) * 100 = 3.41% i.e. 3.41% of the unpaired reads aligned once

Aligned >1 times: 604465 (6.77%)

(604465 / 8928166 ) * 100 = 6.77% i.e. 6.77% of the unpaired reads are multi mapped

7. Overall alignment rate: 65.40%

Calculation as explained below

Paired Reads
Aligned concordantly 1 time: (2195620 * 2 = 4391240)
Aligned concordantly >1 times: (4877336 * 2 = 9754672)
Aligned discordantly 1 time: (50186 * 2 = 100372)

Unpaired Reads
Aligned 1 time: 304653
Aligned >1 times: 604465

Total = 4391240 + 9754672 + 100372 + 304653 + 604465 = 15155402

Overall Alignment Rate = (15155402 / 23174450) * 100 = 65.40%

(二)实际运行的结果

基于上面的解释,以下面的结果为例,二者略有不同,可能是版本的问题:
目前我使用的版本是hisat2 version 2.1.0

(1)双端测序数据比对

23116294 reads; of these:
  23116294 (100.00%) were paired; of these:
    2117146 (9.16%) aligned concordantly 0 times
    18275745 (79.06%) aligned concordantly exactly 1 time
    2723403 (11.78%) aligned concordantly >1 times
    ----
    2117146 pairs aligned concordantly 0 times; of these:
      449895 (21.25%) aligned discordantly 1 time
    ----
    1667251 pairs aligned 0 times concordantly or discordantly; of these:
      3334502 mates make up the pairs; of these:
        2176277 (65.27%) aligned 0 times
        889188 (26.67%) aligned exactly 1 time
        269037 (8.07%) aligned >1 times
95.29% overall alignment rate

我们计算总的比对率0.9529276,与上面结果一致:

Overall alignment rate比例为

(18275745 * 2 + 2723403 * 2 + 449895 * 2 + 889188 + 269037 )/(23116294 * 2) = 0.9529276

Unique mapping 比例为

(18275745 * 2 + 449895 * 2 + 889188)/(23116294 * 2)
[1] 0.8292953

Multiple mapping比例为

(2723403 * 2 + 269037)/(23116294 * 2)
[1] 0.1236323

(2)单端测序数据比对

14479369 reads; of these:
  14479369 (100.00%) were unpaired; of these:
    4838655 (33.42%) aligned 0 times
    8260600 (57.05%) aligned exactly 1 time
    1380114 (9.53%) aligned >1 times
66.58% overall alignment rate

关于上面的Concordant 解释如下(bowtie2 官网):

Concordant pairs match pair expectations, discordant pairs don’t

A pair that aligns with the expected relative mate orientation and with the expected range of distances between mates is said to align “concordantly”. If both mates have unique alignments, but the alignments do not match paired-end expectations (i.e. the mates aren’t in the expected relative orientation, or aren’t within the expected distance range, or both), the pair is said to align “discordantly”. Discordant alignments may be of particular interest, for instance, when seeking structural variants.

The expected relative orientation of the mates is set using the --ff, --fr, or --rf options. The expected range of inter-mates distances (as measured from the furthest extremes of the mates; also called “outer distance”) is set with the -I and -X options. Note that setting -I and -X far apart makes Bowtie 2 slower. See documentation for -I and -X.

To declare that a pair aligns discordantly, Bowtie 2 requires that both mates align uniquely. This is a conservative threshold, but this is often desirable when seeking structural variants.

By default, Bowtie 2 searches for both concordant and discordant alignments, though searching for discordant alignments can be disabled with the --no-discordantoption.

此外,还得考虑的是如何从SAM/BAM结果文件中将unique mapping reads提取出来,以便于后续分析,具体可以参考链接2。

有一个问题是,看到有的文章是使用concordant unique alignments for paired-end reads进行表达分析,这样的比对数据在我的数据集是19%,不知道这部分数据的实际定量效果如何,目前还没有和全部的数据进行比较。之前我计算表达量都未按SAM tag去提取。

参考:

  1. https://www.biostars.org/p/395017/
  2. https://www.jianshu.com/p/bc3751b72f0c
  3. http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#concordant-pairs-match-pair-expectations-discordant-pairs-dont
  4. https://www.cnblogs.com/leezx/p/8540862.html
我在常规转录组分析中,已经把亲本和杂交种都比对到参考基因组了,生成了对应的bam文件。我要对所有的杂交种组合进行等位基因特异性表达(ASE)分析,鉴定出正常和处理条件下的ASE基因,严格按照这篇文献的方法。 9个亲本为:A632、C72、Dan340、J724、J92、Mo17、PH207、Ye478、Z58 7个杂交种为:A632_PH207、A632_Ye478、C72_Ye478、C72_Z58、Dan340_J724、Dan340_J92、Mo17_Ye478 1、我有比对好的亲本和杂交种的bam文件,存放在/public/home/xumiaoyun/wy/cxgg/rnaseq/4_hisat2_results文件夹下面,包括正常(文件名包括-C-)和盐碱处理后的(文件名包括-T-),分别有1~2个生物学重复,我的文件夹/public/home/xumiaoyun/wy/cxgg/rnaseq/4_hisat2_results内容如下: A632-C-1.Hisat_aln.sorted.bam A632-T-1.Hisat_aln.sorted.bam C72-C-1.Hisat_aln.sorted.bam C72_Ye478-T-2.Hisat_aln.sorted.bam Dan340_J724-C-1.Hisat_aln.sorted.bam Dan340_J92-T-1.Hisat_aln.sorted.bam J92-C-1.Hisat_aln.sorted.bam Mo17_Ye478-C-1.Hisat_aln.sorted.bam PH207-T-1.Hisat_aln.sorted.bam Z58-C-1.Hisat_aln.sorted.bam A632-C-2.Hisat_aln.sorted.bam A632-T-2.Hisat_aln.sorted.bam C72-T-1.Hisat_aln.sorted.bam C72_Z58-C-1.Hisat_aln.sorted.bam Dan340_J724-C-2.Hisat_aln.sorted.bam Dan340_J92-T-2.Hisat_aln.sorted.bam J92-T-1.Hisat_aln.sorted.bam Mo17_Ye478-C-2.Hisat_aln.sorted.bam PH207-T-2.Hisat_aln.sorted.bam Z58-C-2.Hisat_aln.sorted.bam A632_PH207-C-1.Hisat_aln.sorted.bam A632_Ye478-C-1.Hisat_aln.sorted.bam C72-T-2.Hisat_aln.sorted.bam C72_Z58-C-2.Hisat_aln.sorted.bam Dan340_J724-T-1.Hisat_aln.sorted.bam Dan340-T-1.Hisat_aln.sorted.bam Mo17-C-1.Hisat_aln.sorted.bam Mo17_Ye478-T-1.Hisat_aln.sorted.bam Ye478-C-1.Hisat_aln.sorted.bam Z58-T-1.Hisat_aln.sorted.bam A632_PH207-C-2.Hisat_aln.sorted.bam A632_Ye478-C-2.Hisat_aln.sorted.bam C72_Ye478-C-1.Hisat_aln.sorted.bam C72_Z58-T-1.Hisat_aln.sorted.bam Dan340_J724-T-2.Hisat_aln.sorted.bam J724-C-1.Hisat_aln.sorted.bam Mo17-C-2.Hisat_aln.sorted.bam Mo17_Ye478-T-2.Hisat_aln.sorted.bam Ye478-C-2.Hisat_aln.sorted.bam Z58-T-2.Hisat_aln.sorted.bam A632_PH207-T-1.Hisat_aln.sorted.bam A632_Ye478-T-1.Hisat_aln.sorted.bam C72_Ye478-C-2.Hisat_aln.sorted.bam C72_Z58-T-2.Hisat_aln.sorted.bam Dan340_J92-C-1.Hisat_aln.sorted.bam J724-T-1.Hisat_aln.sorted.bam Mo17-T-1.Hisat_aln.sorted.bam PH207-C-1.Hisat_aln.sorted.bam Ye478-T-1.Hisat_aln.sorted.bam A632_PH207-T-2.Hisat_aln.sorted.bam A632_Ye478-T-2.Hisat_aln.sorted.bam C72_Ye478-T-1.Hisat_aln.sorted.bam Dan340-C-1.Hisat_aln.sorted.bam Dan340_J92-C-2.Hisat_aln.sorted.bam J724-T-2.Hisat_aln.sorted.bam Mo17-T-2.Hisat_aln.sorted.bam PH207-C-2.Hisat_aln.sorted.bam Ye478-T-2.Hisat_aln.sorted.bam 2、参考的基因组文件路径为/public/home/xumiaoyun/wy/cxgg/rnaseq/maize_v5_data/Zm-B73-REFERENCE-NAM-5.0.fa,内容格式如下: >chr1 TCATGGCTATTTTCATAAAAAATGGGGGTTGTGTGGCCATTTATCATCGACTAGAGGCTCATAAACCTCACCCCACATAT GTTTCCTTGCCATAGATTACATTCTTGGATTTCTGGTGGAAACCATTTCTTGCTTAAAAACTCGTACGTGTTAGCCTTCG GTATTATTGAAAATGGTCATTCATGGCTATTTTTCGGCAAAATGGGGGTTGTGTGGCCATTGATCGTCGACCAGAGGCTC ATACACCTCACCCCACATATGTTTCCTTGTCGTAGATCACATTCTTGGATTTCTGGTGGAGACCATTTCTTGGTCAGAAA TCCGTAGGTGTTAGCCTTCGATATTATTGAAAATGGTCGTTCATGGCTATTTTCGACAAAAATGGGGGTTGTGTGGCCAT TGATCATCGACCAGAGGCTCATACACCTCACCCCACATATGTTTCCTTGCCATAGATCACATTCTTGGATTTCTGGTGGA GACCATTTCTTGGTCAAAAATCCGTAGGTGTTAGCCTTCGGTATTATTGTAAATGGTCATTCATGGCTATTTTCGACAAA AATGGGGGTTGTGTGGCCATTGATCATCGACCAGAGGCTCATACACCTCACCCCACATATGTTTCCTTGCCATAGATCAC ATTCTTGGATTTATGGTGGAGACCATTTCTTGGTCAAAAATCCGTAGGTGTTAGCCTTCGGTATTATTGTAAATGGTCAT 3、基因组GTF文件路径为/public/home/xumiaoyun/wy/cxgg/rnaseq/maize_v5_data/maize_v5.gtf,内容格式如下: chr1 NAM transcript 34617 40204 . + . transcript_id "Zm00001eb000010_T001"; gene_id "Zm00001eb000010" chr1 NAM exon 34617 35318 . + . transcript_id "Zm00001eb000010_T001"; gene_id "Zm00001eb000010"; chr1 NAM exon 36037 36174 . + . transcript_id "Zm00001eb000010_T001"; gene_id "Zm00001eb000010"; chr1 NAM exon 36259 36504 . + . transcript_id "Zm00001eb000010_T001"; gene_id "Zm00001eb000010"; chr1 NAM exon 36600 36713 . + . transcript_id "Zm00001eb000010_T001"; gene_id "Zm00001eb000010"; chr1 NAM exon 36822 37004 . + . transcript_id "Zm00001eb000010_T001"; gene_id "Zm00001eb000010"; chr1 NAM exon 37416 37633 . + . transcript_id "Zm00001eb000010_T001"; gene_id "Zm00001eb000010"; chr1 NAM exon 38021 38482 . + . transcript_id "Zm00001eb000010_T001"; gene_id "Zm00001eb000010"; chr1 NAM exon 38571 39618 . + . transcript_id "Zm00001eb000010_T001"; gene_id "Zm00001eb000010"; chr1 NAM exon 39701 40204 . + . transcript_id "Zm00001eb000010_T001"; gene_id "Zm00001eb000010"; 请给我生成对应的shell分析脚本和R可视化脚本,要求简洁易懂,运行报错,效高且无误。 注意: 软件直接进行调用即可,无需加载; shell脚本要求运行效高,批运行,节约时间,输出R可视化所需的文件即可。 R可视化要求输出的基因list格式规范,输出的图片均为PDF发表级图片(无网格线)。 /public/home/xumiaoyun/wy/cxgg/rnaseq/maize_v5_data/Zm-B73-REFERENCE-NAM-5.0.fa.fai是基因组索引文件。 请新建并生成结果文件在/public/home/xumiaoyun/wy/cxgg/rnaseq/aseg_analysis/results中。
最新发布
08-07
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值