Rawdata
- 对rawdata的fastq进行fastqc检查,决定是否cut某些位置,对所有数据进行了cut前10bp的操作
Data trimming
- 所有fastp参数详解 fastp-parameters
fastp --thread=16 -q 20 -l 50 -f 10 -F 10 -i tumor_R1.fastq.gz -o
tumor_R1_clean.fastq -I tumor_R2.fastq.gz -O tumor_R2_clean.fastq
-l参数对长度低于50的reads进行过滤
-h -j可改变质控结果文件名
- 质量过滤
-q, --qualified_quality_phred碱基质量值不小于多少时为合格碱基,默认碱基质量值15,默认碱基质量>=15是合格碱基;
-u, --unqualified_percent_limit允许不合格碱基的占比为多少时去掉这条read,默认不合格碱基占比>40%时,去掉该read; - 整体切除
-f -F对R1和R1分别cut开端多少bp
-t -T对R1和R1分别cut尾部多少bp
Prepipline 数据预处理 bwa-bqsr
- BWA
bwa mem -t 24 -R "@RG\tID:tumor\tSM:tumor\tLB:WES\tPL:Illumina" /data2/references/Homo_sapiens/hg38.genomic.fa /data1/01.projects/USER006/cleandata/tumor_R1_clean.fastq
/data1/01.projects/USER006/cleandata/tumor_R2_clean.fastq > /data1/01.projects/USER006/mapping/tumor.sam
- SortSam
samtools view -bS tumor.sam > tumore.bam
samtools sort -@ 5 tumor.bam -o tumor.sorted.bam
samtools index tumor.sorted.bam
- MarkDuplicates
MarkDuplicates单核运行,MarkDuplicatesSpark多核运行。
gatk4 --java-options "-Xmx20G -Djava.io.tmpdir=./" MarkDuplicates -I tumor.sorted.bam
--REMOVE_DUPLICATES true -O tumor.sorted.rmdup.bam -M tumor.metrics
–REMOVE_DUPLICATES 默认false,选择true是去重复而不是标记。
- BaseRecalibrator
gatk4 --java-options "-Xmx20G -Djava.io.tmpdir=./" BaseRecalibrator \
-R hg38.genomic.fa \
-I tumor.sorted.rmdup.bam -O tumor.recal_data.table \
--known-sites dbsnp_146.hg38.vcf.gz \
-known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
- ApplyBQSR
gatk4 --java-options "-Xmx20G -Djava.io.tmpdir=./" ApplyBQSR \
-R /data2/references/Homo_sapiens/hg38.genomic.fa \
-I tumor.sorted.rmdup.bam \
--bqsr-recal-file tumor.recal_data.table \
-O tumor.sorted.rmdup.BQSR.bam
- Reads with mapping quality below 30 in the BAM files were filtered out before mutation calling.
samtools view -b -q 30 tumor.sorted.rmdup.BQSR.bam >tumor.allready.bam
samtools index tumor.allready.bam
PON构建
the panel of normals workflow now optionally excludes germline events from its output, keeping only technical artifacts.
不是同一批样本,但是在同一测序仪上进行测序的,也可以作为PON来过滤掉一些测序仪以及实验人为误差。
- Step 1: Run Mutect2 in tumor-only mode for each normal sample:
gatk4 --java-options "-Xmx20G" Mutect2 \
--native-pair-hmm-threads 16 \
-R /data2/references/Homo_sapiens/hg38.genomic.fa \
-I ./mapping/normal1.sorted.rmdup.BQSR.bam \
-L target_interval_hg38.bed \
--max-mnp-distance 0 \
--independent-mates true \
-O normal1.vcf.gz
etc
- Step 2: Create a GenomicsDB from the normal Mutect2 calls:
time gatk4 --java-options "-Xmx20G -Djava.io.tmpdir=./" GenomicsDBImport \
-R /data2/references/Homo_sapiens/hg38.genomic.fa \
-L target_interval_hg38.bed \
--reader-threads 16 \#How many simultaneous threads to use when opening VCFs in batches; higher values may improve performance when network latency is an issue
--genomicsdb-workspace-path pon_db \
-V normal1.vcf.gz \
-V normal2.vcf.gz \
-V normal3.vcf.gz \
-V normal4.vcf.gz \
-V nomrmal2.vcf.gz
- Step 3: Combine the normal calls using CreateSomaticPanelOfNormals:
time gatk4 --java-options "-Xmx20G -Djava.io.tmpdir=./" CreateSomaticPanelOfNormals \
-R /data2/references/Homo_sapiens/hg38.genomic.fa \
--germline-resource af-only-gnomad-hg38.vcf.gz \
-V gendb://pon_db \
-O five_normal_pon.vcf.gz
Calculate contamination
- 对于现在的Mutect2的流程,GATK只考虑了normal contamination of tumor这种情况
- 生成待用文件,只包含common biallelic variants,biallelic site are only two possible alleles-- a deletion(other), or the reference allele. Multiallelic sites are not observed very frequently unless you look at very large cohorts, so they are often taken as a sign of a noisy region where artifacts are likely.
gatk4 SelectVariants \
-R /data2/references/Homo_sapiens/hg38.genomic.fa \
-V af-only-gnomad-hg38.vcf.gz \
--select-type-to-include SNP \
--restrict-alleles-to BIALLELIC \
-O ./af-only-gnomad.hg38.SNP_biallelic.vcf
- GetPileupSummaries
用GetPileupSummaries计算resource site位点上的count数,这样并不是计算所有的位点,而是用AF参数过滤后的(大于0.01并小于0.2)的homozygous纯合 alternate sites,同时也可以用-L参数指定区域,分别对Normal和Tumor样本计算.
If the homozygous alternate site has a rare allele, we are more likely to observe the presence of REF or other more common alleles if there is cross-sample contamination. This allows us to measure contamination more accurately.
gatk4 GetPileupSummaries \
-I normal.allready.bam \
-L target_interval.bed \
-V af-only-gnomad.hg38.SelectCommonBiallelicSNPs.mutect2-contamination.vcf \
-O ./contamination/normal.pileups.table
gatk4 GetPileupSummaries \
-I tumor.allready.bam \
-L target_interval.bed \
-V af-only-gnomad.hg38.SelectCommonBiallelicSNPs.mutect2-contamination.vcf \
-O ./contamination/tumor.pileups.table
- CalculateContamination
gatk4 CalculateContamination \
-I ./contamination/tumor.pileups.table \
-matched ./contamination/normal.pileups.table \
-O ./contamination/tumor.calculatecontamination.table
- 比如Primary-IIIG样本的污染评估是0.00131826,那么在过滤后的VCF里,AF低于0.00131826的位点则会被标注上contamination
Tumor-Normal 模式
time gatk4 --java-options "-Xmx40G -Djava.io.tmpdir=./tmp/mutect2-T-N-tmp/" Mutect2 \
-R /data2/references/Homo_sapiens/hg38.genomic.fa \
-I /data1/01.projects/USER006/mapping/pa07_tumor4.allready.bam \
-tumor pa07_tumor4 \
-I /data1/01.projects/USER006/mapping/pa07_normal.allready.bam \
-normal pa07_normal \
-A FisherStrand \
-A QualByDepth \
-A StrandBiasBySample \
-A StrandOddsRatio \
-A RMSMappingQuality \
-A MappingQualityRankSumTest \
-A ReadPosRankSumTest \
-pon ./five_normal_pon.vcf.gz \
--germline-resource /data1/01.projects/USER006/af-only-gnomad-hg38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--independent-mates true \
-L /data1/01.projects/USER006/target_interval_hg38-del2row.bed \
-O ./Mutect2-T-N/pa07_tumor4.mutect2.vcf \
-bamout ./Mutect2-T-N/pa07_tumor4_normal.bam
New Filtering Strategy in FilterMutectCalls
- FilterMutectCalls now filters based on a single quantity, the probability that a variant is a somatic mutation.
F score,灵敏度和精密度的调和平均值。-f-score-beta默认值为1,侧重于sensitivity则设置为大于1,侧重于precision则小于1.
本文详细介绍了如何利用GATK4 Mutect2进行体细胞突变鉴定,包括原始数据检查、数据修剪、数据预处理、PON构建、污染计算、肿瘤-正常模式下的新过滤策略等关键步骤,旨在提高变异检测的准确性和可靠性。
1143

被折叠的 条评论
为什么被折叠?



