(How to) Call somatic mutations using GATK4 Mutect2

本文详细介绍了如何利用GATK4 Mutect2进行体细胞突变鉴定,包括原始数据检查、数据修剪、数据预处理、PON构建、污染计算、肿瘤-正常模式下的新过滤策略等关键步骤,旨在提高变异检测的准确性和可靠性。

Rawdata

  • 对rawdata的fastq进行fastqc检查,决定是否cut某些位置,对所有数据进行了cut前10bp的操作

Data trimming

fastp --thread=16 -q 20 -l 50 -f 10 -F 10 -i tumor_R1.fastq.gz -o 
tumor_R1_clean.fastq -I tumor_R2.fastq.gz -O tumor_R2_clean.fastq

-l参数对长度低于50的reads进行过滤
-h -j可改变质控结果文件名

  • 质量过滤
    -q, --qualified_quality_phred碱基质量值不小于多少时为合格碱基,默认碱基质量值15,默认碱基质量>=15是合格碱基;
    -u, --unqualified_percent_limit允许不合格碱基的占比为多少时去掉这条read,默认不合格碱基占比>40%时,去掉该read;
  • 整体切除
    -f -F对R1和R1分别cut开端多少bp
    -t -T对R1和R1分别cut尾部多少bp

Prepipline 数据预处理 bwa-bqsr

  • BWA
bwa mem -t 24 -R "@RG\tID:tumor\tSM:tumor\tLB:WES\tPL:Illumina" /data2/references/Homo_sapiens/hg38.genomic.fa /data1/01.projects/USER006/cleandata/tumor_R1_clean.fastq 
/data1/01.projects/USER006/cleandata/tumor_R2_clean.fastq > /data1/01.projects/USER006/mapping/tumor.sam
  • SortSam
samtools view -bS tumor.sam > tumore.bam
samtools sort -@ 5 tumor.bam -o tumor.sorted.bam
samtools index tumor.sorted.bam
  • MarkDuplicates
    MarkDuplicates单核运行,MarkDuplicatesSpark多核运行。
gatk4 --java-options "-Xmx20G -Djava.io.tmpdir=./" MarkDuplicates -I tumor.sorted.bam
 --REMOVE_DUPLICATES true -O tumor.sorted.rmdup.bam -M tumor.metrics

–REMOVE_DUPLICATES 默认false,选择true是去重复而不是标记。

  • BaseRecalibrator
gatk4 --java-options "-Xmx20G -Djava.io.tmpdir=./" BaseRecalibrator \
-R hg38.genomic.fa \
-I tumor.sorted.rmdup.bam -O tumor.recal_data.table \
--known-sites dbsnp_146.hg38.vcf.gz \
-known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
  • ApplyBQSR
gatk4 --java-options "-Xmx20G -Djava.io.tmpdir=./" ApplyBQSR \
-R /data2/references/Homo_sapiens/hg38.genomic.fa \
-I tumor.sorted.rmdup.bam \
--bqsr-recal-file tumor.recal_data.table \
-O tumor.sorted.rmdup.BQSR.bam
  • Reads with mapping quality below 30 in the BAM files were filtered out before mutation calling.
samtools view -b -q 30 tumor.sorted.rmdup.BQSR.bam >tumor.allready.bam
samtools index tumor.allready.bam

PON构建

the panel of normals workflow now optionally excludes germline events from its output, keeping only technical artifacts.
不是同一批样本,但是在同一测序仪上进行测序的,也可以作为PON来过滤掉一些测序仪以及实验人为误差。

  • Step 1: Run Mutect2 in tumor-only mode for each normal sample:
gatk4 --java-options "-Xmx20G" Mutect2 \
--native-pair-hmm-threads 16 \
-R /data2/references/Homo_sapiens/hg38.genomic.fa \
-I ./mapping/normal1.sorted.rmdup.BQSR.bam \
-L target_interval_hg38.bed \
--max-mnp-distance 0 \
--independent-mates true \
-O normal1.vcf.gz
etc
  • Step 2: Create a GenomicsDB from the normal Mutect2 calls:
time gatk4 --java-options "-Xmx20G -Djava.io.tmpdir=./" GenomicsDBImport \
-R /data2/references/Homo_sapiens/hg38.genomic.fa \
-L target_interval_hg38.bed \
--reader-threads 16 \#How many simultaneous threads to use when opening VCFs in batches; higher values may improve performance when network latency is an issue
--genomicsdb-workspace-path pon_db \
-V normal1.vcf.gz \
-V normal2.vcf.gz \
-V normal3.vcf.gz \
-V normal4.vcf.gz \
-V nomrmal2.vcf.gz
  • Step 3: Combine the normal calls using CreateSomaticPanelOfNormals:
time gatk4 --java-options "-Xmx20G -Djava.io.tmpdir=./" CreateSomaticPanelOfNormals \
-R /data2/references/Homo_sapiens/hg38.genomic.fa \
--germline-resource af-only-gnomad-hg38.vcf.gz \
-V gendb://pon_db \
-O five_normal_pon.vcf.gz

Calculate contamination

  • 对于现在的Mutect2的流程,GATK只考虑了normal contamination of tumor这种情况
  • 生成待用文件,只包含common biallelic variants,biallelic site are only two possible alleles-- a deletion(other), or the reference allele. Multiallelic sites are not observed very frequently unless you look at very large cohorts, so they are often taken as a sign of a noisy region where artifacts are likely.
gatk4 SelectVariants \
-R /data2/references/Homo_sapiens/hg38.genomic.fa \
-V af-only-gnomad-hg38.vcf.gz \
--select-type-to-include SNP \
--restrict-alleles-to BIALLELIC \
-O ./af-only-gnomad.hg38.SNP_biallelic.vcf
  • GetPileupSummaries
    用GetPileupSummaries计算resource site位点上的count数,这样并不是计算所有的位点,而是用AF参数过滤后的(大于0.01并小于0.2)的homozygous纯合 alternate sites,同时也可以用-L参数指定区域,分别对Normal和Tumor样本计算.
    If the homozygous alternate site has a rare allele, we are more likely to observe the presence of REF or other more common alleles if there is cross-sample contamination. This allows us to measure contamination more accurately.
gatk4 GetPileupSummaries \
-I normal.allready.bam \
-L target_interval.bed \
-V af-only-gnomad.hg38.SelectCommonBiallelicSNPs.mutect2-contamination.vcf \
-O ./contamination/normal.pileups.table
gatk4 GetPileupSummaries \
-I tumor.allready.bam \
-L target_interval.bed \
-V af-only-gnomad.hg38.SelectCommonBiallelicSNPs.mutect2-contamination.vcf \
-O ./contamination/tumor.pileups.table
  • CalculateContamination
gatk4 CalculateContamination \
-I ./contamination/tumor.pileups.table \
-matched ./contamination/normal.pileups.table \
-O ./contamination/tumor.calculatecontamination.table
  • 比如Primary-IIIG样本的污染评估是0.00131826,那么在过滤后的VCF里,AF低于0.00131826的位点则会被标注上contamination

Tumor-Normal 模式

time gatk4 --java-options "-Xmx40G -Djava.io.tmpdir=./tmp/mutect2-T-N-tmp/" Mutect2 \
-R /data2/references/Homo_sapiens/hg38.genomic.fa \
-I /data1/01.projects/USER006/mapping/pa07_tumor4.allready.bam \
-tumor pa07_tumor4 \
-I /data1/01.projects/USER006/mapping/pa07_normal.allready.bam \
-normal pa07_normal \
-A FisherStrand \
-A QualByDepth \
-A StrandBiasBySample \
-A StrandOddsRatio \
-A RMSMappingQuality \
-A MappingQualityRankSumTest \
-A ReadPosRankSumTest \
-pon ./five_normal_pon.vcf.gz \
--germline-resource /data1/01.projects/USER006/af-only-gnomad-hg38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--independent-mates true \
-L /data1/01.projects/USER006/target_interval_hg38-del2row.bed \
-O ./Mutect2-T-N/pa07_tumor4.mutect2.vcf \
-bamout ./Mutect2-T-N/pa07_tumor4_normal.bam

New Filtering Strategy in FilterMutectCalls

  • FilterMutectCalls now filters based on a single quantity, the probability that a variant is a somatic mutation.
    F score,灵敏度和精密度的调和平均值。 -f-score-beta默认值为1,侧重于sensitivity则设置为大于1,侧重于precision则小于1.
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值