bam 共11列+12列的可选tag。每一列的意义见 此前的总结, sam格式详解。
1. bam 中第10列保存的序列是原始fastq中的序列吗?
做一个测试,
(1) 一段序列,及其反向互补序列
$ vim test.fq
@seq001_plus
CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF,FFFFFFFFFFFFFFF:FFFFFFFF,F:FFFF:F:FF:F
@seq001_minus
CTTTCTCTTTCTTTTTTTTTTTTTACTTAAACATTTTTATGATGATTTAGATGGAAGTTGTTCTTCGTCACTTAATGTTGGTTCCAGTTCTTCAACTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFF:,FF,,,:,,:F,,,,,F,:,,,,F,FF,,F,F:,F:F:,:F:FFF,,,F
压缩 $ gzip test.fq
(2) 比对
比对,从 snakemake 复制的:
$ mkdir map
$ STAR --runThreadN 1 \
--genomeDir /home/wangjl/data/ref/hg38/gencode/index/STAR/ \
--readFilesIn test.fq.gz \
--readFilesCommand zcat \
--outFilterMultimapNmax 1 \
--outSAMtype BAM SortedByCoordinate \
--outFilterMatchNminOverLread 0.13 --outFilterScoreMinOverLread 0.13 \
--outFileNamePrefix map/A_
耗时 2min,载入基因组占了1分多。
(3) 查看结果
查看结果:
$ samtools view map/A_Aligned.sortedByCoord.out.bam
seq001_plus 0 chr14 69380255 255 98M * 0 0 CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF,FFFFFFFFFFFFFFF:FFFFFFFF,F:FFFF:F:FF:F NH:i:1 HI:i:1 AS:i:94 nM:i:1
seq001_minus 16 chr14 69380255 255 98M * 0 0 CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG F,,,FFF:F:,:F:F,:F,F,,FF,F,,,,:,F,,,,,F:,,:,,,FF,:FFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:94 nM:i:1
结论: bam中保存的是参考基因组fa中的正向序列。
- 也就是说,对比到
+链上的序列,bam中第10列的序列就是fastq的原始序列。 - 比对到
-链的序列,bam中的序列是fq的反向互补序列。
2. pysam 从bam怎么获取序列才能和fastq中一致?
继续测试,使用上文输出的bam
$ samtools index /home/wangjl/data/MDA/test2/mapTest/map/A_Aligned.sortedByCoord.out.bam
inputFile="/home/wangjl/data/MDA/test2/mapTest/map/A_Aligned.sortedByCoord.out.bam"
import pysam
samfile = pysam.AlignmentFile(inputFile, "rb")
for line in samfile:
print(line)
print(line.get_forward_sequence()) #和fq一致
print(line.seq) #仅仅是bam中的序列
print()
samfile.close()
输出:
seq001_plus 0 #13 69380255 255 98M * 0 0 CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 37, 37, 37, 25, 37, 25, 37, 37, 25, 37]) [('NH', 1), ('HI', 1), ('AS', 94), ('nM', 1)]
CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG
CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG
seq001_minus 16 #13 69380255 255 98M * 0 0 CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG array('B', [37, 11, 11, 11, 37, 37, 37, 25, 37, 25, 11, 25, 37, 25, 37, 11, 25, 37, 11, 37, 11, 11, 37, 37, 11, 37, 11, 11, 11, 11, 25, 11, 37, 11, 11, 11, 11, 11, 37, 25, 11, 11, 25, 11, 11, 11, 37, 37, 11, 25, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37]) [('NH', 1), ('HI', 1), ('AS', 94), ('nM', 1)]
CTTTCTCTTTCTTTTTTTTTTTTTACTTAAACATTTTTATGATGATTTAGATGGAAGTTGTTCTTCGTCACTTAATGTTGGTTCCAGTTCTTCAACTG
CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG
结论: 也就是说,我们要使用 line.get_forward_sequence() 获取序列,才是 fastq 中的序列。
3. bam 中第5列MAPQ255(STAR输出)时第二列Flag1024的意义?
flag1024 表示该序列为PCR重复,但是如果 mapq255呢?
$ samtools view B04_bam_pA/pA_AGCCGGTGCGATAC-1.bam |awk '{print $2"\t"$5}' |sort|uniq -c
588 0 255
84 1024 255
87 1040 255
685 16 255
解释第二列Flag的意义:
- https://broadinstitute.github.io/picard/explain-flags.html
- -F 3844 过滤掉质量低的reads,对于没有UMI的测序,使用比对层面判断是否PCR重复。
- -F 2820 对于有UMI的reads,按照UMI去重,而不是依赖sam比对层面判断的是否PCR重复。
Ref
- http://events.jianshu.io/p/3a56ab001a83
本文探讨了BAM文件格式中序列的存储方式,特别是在面对正反向比对时序列的变化,并介绍了如何通过pysam库正确获取与原始FASTQ文件一致的序列。此外,还解析了BAM文件中特定标志位的意义。
6168

被折叠的 条评论
为什么被折叠?



