Advanced Bioinformatics 5 - A Snakemake Pipeline for Whole-Exome Variant Calling and ANNOVAR Annotation

The pipeline is driven by a YAML (or INI) configuration file that holds, for example, the sample name, the reference genome build, the exon-capture BED file path, the reference genome path, and the ANNOVAR annotation databases.

Starting from the FASTQ files of a whole-exome sequencing run, the pipeline carries you through to the final variant VCF files.
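As an illustration, a minimal config.yaml covering the fields referenced in the Snakefile below might look like this (all paths are placeholders; adapt them to your environment):

```yaml
# Illustrative config.yaml - field names match the config[...] lookups in the Snakefile
sample: "test_sample"
genome: "hg19"
project_data_dir: "/data/raw"
project_run_dir: "/data/run"
hg_fa: "/reference/hg19/ucsc.hg19.fa"
hg_dict: "/reference/hg19/ucsc.hg19.dict"
regions_bed: "/GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed"
vcf_dbsnp: "/GATK_bundle/hg19/dbsnp_138.hg19.vcf"
vcf_1000G_phase1_indels_sites: "/GATK_bundle/hg19/1000G_phase1.indels.hg19.sites.vcf"
vcf_Mills_and_1000G_gold_standard_indels_sites: "/GATK_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf"
```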

1. Installing the base software for the Snakemake pipeline

# install via conda
conda install -c bioconda snakemake -y
conda install -c bioconda fastqc -y
conda install -c bioconda trim-galore -y
conda install -c bioconda bwa -y
conda install -c bioconda samtools -y

# GATK4 (note: the bioconda package `gatk` provides GATK 3; for GATK 4.x install `gatk4`)
conda install -c bioconda gatk4 -y

For ANNOVAR installation and usage, see:
https://www.jianshu.com/p/461f2cd47564

Annotating whole exomes with ANNOVAR has a few pitfalls and commonly reported errors;
I plan to cover ANNOVAR installation and usage for whole-exome annotation in a later post.

2. Agilent SureSelect exon capture workflow

The basic principle of exon capture is illustrated in the figure below.
[Figure: Agilent SureSelect exon capture workflow]

3. Basic GATK SNP and INDEL calling workflow

[Figure: GATK SNP and INDEL calling workflow]

4. The Snakemake whole-exome analysis pipeline

Here `regions_bed` is configured as the path to the BED file matching the exon-capture kit used.

##################### parser configfile #########################
import os
configfile: "config.yaml"
# configfile: "config.ini"

sample = config["sample"]
genome = config["genome"]
#################################################################
PROJECT_DATA_DIR = config["project_data_dir"]
PROJECT_RUN_DIR = config["project_run_dir"]

os.makedirs(PROJECT_RUN_DIR, exist_ok=True)

######################### Output #################################
ALL_FASTQC_RAW_ZIP1 = PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_1_fastqc.zip"
ALL_FASTQC_RAW_HTML1 = PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_1_fastqc.html"
ALL_FASTQC_RAW_ZIP2 = PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_2_fastqc.zip"
ALL_FASTQC_RAW_HTML2 = PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_2_fastqc.html"

ALL_CLEAN_FASTQ1 = PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_1_val_1.fq.gz"
ALL_CLEAN_FASTQ2 = PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_2_val_2.fq.gz"

ALL_SORTED_BAM = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
ALL_SORTED_BAM_BAI1 = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam.bai"
ALL_BAM_FLAGSTAT = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam.flagstat"

ALL_EXON_INTERVAL_BED = PROJECT_RUN_DIR + "/gatk/" + sample + ".Exon.Interval.bed"
ALL_STAT_TXT = PROJECT_RUN_DIR + "/gatk/" + sample + ".stat.txt"
ALL_MKDUP_BAM = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam"
ALL_METRICS = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.bam.metrics"
ALL_SORTED_BAM_BAI2 = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam.bai"

ALL_INSERT_SIZE_HISTOGRAM_pdf = PROJECT_RUN_DIR + "/gatk/" + sample + ".insert_size_histogram.pdf"

ALL_RECAL_DATA_TABLE = PROJECT_RUN_DIR + "/gatk/" + sample + ".recal_data.table"
ALL_BQSR_SORTED_BAM = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.BQSR.bam"
ALL_depthOfcoverage = PROJECT_RUN_DIR + "/gatk/depthOfcoverage"

ALL_RAW_VCF = PROJECT_RUN_DIR + "/vcf/" + sample + ".raw.vcf"
All_SNP_SELECT = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.vcf"
ALL_SNP_FILTER_VCF = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.filter.vcf"
ALL_SNP_FILTER_VCF_STAT = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.filter.vcf_stat.csv"

All_INDEL_SELECT = PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.vcf"
ALL_INDEL_FILTER_VCF = PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.filter.vcf"

ALL_SNP_AVINPUT = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.avinput"
ALL_INDEL_AVINPUT = PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.avinput"

ALL_MERG_FILTERE_VCF = PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.filter.vcf"

ALL_SNPANNO = PROJECT_RUN_DIR + "/vcf/" + sample + "_snpanno." + genome + "_multianno.csv"
ALL_INDELANNO = PROJECT_RUN_DIR + "/vcf/" + sample + "_indelanno." + genome + "_multianno.csv"
ALL_ALLANNO = PROJECT_RUN_DIR + "/vcf/" + sample + "_allanno." + genome + "_multianno.csv"

QUALIMAP_REPORT = PROJECT_RUN_DIR + "/gatk/bam_QC/" + sample + "_bamQC_report.pdf"
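The long block of string concatenations above follows one pattern (run dir + subdir + sample + suffix); a small hypothetical helper (`run_path` is not part of the original Snakefile) could generate them:

```python
import os

def run_path(run_dir, subdir, sample, suffix):
    """Build one pipeline output path, e.g. <run_dir>/vcf/<sample>.raw.vcf."""
    return os.path.join(run_dir, subdir, sample + suffix)

# Reproduce two of the paths defined above
raw_vcf = run_path("/data/run", "vcf", "test_sample", ".raw.vcf")
sorted_bam = run_path("/data/run", "align", "test_sample", ".sorted.bam")
```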


######################## Workflow #################################
rule all:
    input:
        ALL_FASTQC_RAW_ZIP1,
        ALL_FASTQC_RAW_HTML1,
        ALL_FASTQC_RAW_ZIP2,
        ALL_FASTQC_RAW_HTML2,
        ALL_SORTED_BAM,
        ALL_SORTED_BAM_BAI1,
        ALL_BAM_FLAGSTAT,
        ALL_EXON_INTERVAL_BED,
        ALL_STAT_TXT,
        ALL_MKDUP_BAM,
        ALL_METRICS,
        ALL_SORTED_BAM_BAI2,
        ALL_RECAL_DATA_TABLE,
        ALL_BQSR_SORTED_BAM,
        ALL_RAW_VCF,
        All_SNP_SELECT,
        All_INDEL_SELECT,
        ALL_SNP_FILTER_VCF,
        ALL_INDEL_FILTER_VCF,
        ALL_MERG_FILTERE_VCF,
        ALL_SNPANNO,
        ALL_INDELANNO,
        ALL_ALLANNO
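This Snakefile is written for a single sample. To scale `rule all` to many samples, Snakemake's `expand()` is the usual tool; the pure-Python stand-in below (`expand_targets` is a hypothetical sketch, not the Snakemake API) shows what it does:

```python
def expand_targets(pattern, samples):
    """Minimal stand-in for Snakemake's expand(): fill {sample} into a
    path pattern once per sample."""
    return [pattern.format(sample=s) for s in samples]

targets = expand_targets("run/vcf/{sample}.raw.vcf", ["s1", "s2"])
```

In a real Snakefile you would write `expand(PROJECT_RUN_DIR + "/vcf/{sample}.raw.vcf", sample=config["samples"])` directly in the `rule all` input.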


rule fastqc_raw:
    input:
        r1 = PROJECT_DATA_DIR + "/" + sample + "_1.fq.gz",
        r2 = PROJECT_DATA_DIR + "/" + sample + "_2.fq.gz"
    output:
        PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_1_fastqc.zip",
        PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_2_fastqc.zip",
        PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_1_fastqc.html",
        PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_2_fastqc.html"
    threads: 8
    params:
        raw_data_dir = PROJECT_RUN_DIR + "/fastqc/raw_data",
        clean_data_dir = PROJECT_RUN_DIR + "/fastqc/clean_data",
        align_dir = PROJECT_RUN_DIR + "/align",
        gatk_dir = PROJECT_RUN_DIR + "/gatk",
        vcf_dir = PROJECT_RUN_DIR + "/vcf",
        tmp_dir = PROJECT_RUN_DIR + "/tmp",
        fastqc_report_dir = PROJECT_RUN_DIR + "/report/fastqc_report",
        align_report_dir = PROJECT_RUN_DIR + "/report/align_report"
    shell:
        """
        mkdir -p {params.raw_data_dir}
        mkdir -p {params.clean_data_dir}
        mkdir -p {params.align_dir}
        mkdir -p {params.gatk_dir}
        mkdir -p {params.vcf_dir}
        mkdir -p {params.tmp_dir}
        fastqc -f fastq --extract -t {threads} -o {params.raw_data_dir} {input.r1} {input.r2}
        """

rule fastqc_clean:
    input:
        r1 = PROJECT_DATA_DIR + "/" + sample + "_1.fq.gz",
        r2 = PROJECT_DATA_DIR + "/" + sample + "_2.fq.gz"
    output:
        temp(PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_1_val_1.fq.gz"),
        temp(PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_2_val_2.fq.gz")
    threads: 8
    params:
        quality = 20,
        length = 20,
        clean_data_dir = PROJECT_RUN_DIR + "/fastqc/clean_data"
    shell:
        """
        trim_galore --paired --quality {params.quality} --length {params.length} -o {params.clean_data_dir} {input.r1} {input.r2}
	    """

# BWA-MEM alignment
rule bwa_mem:
    input:
        reads1 = PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_1_val_1.fq.gz",
        reads2 = PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_2_val_2.fq.gz"
    output:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
    threads: 8
    params:
        hg = config["hg_fa"]
    shell:
        """
        bwa mem -t {threads} -M -R "@RG\\tID:{sample}\\tPL:ILLUMINA\\tLB:{sample}\\tSM:{sample}" {params.hg} {input.reads1} {input.reads2}|samtools sort -@ {threads} -o {output}
        """

rule bam_index_1:
    input:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
    output:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam.bai"
    threads: 2
    shell:
        "samtools index {input} > {output}"

rule flagstat:
    input:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
    output:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam.flagstat"
    threads: 1
    shell:
        "samtools flagstat {input} > {output}"

# Create the reference FASTA sequence dictionary (required by GATK)
# rule CreateSequenceDictionary:
#     input:
#         "/reference/hg19/ucsc.hg19.fa"
#     output:
#         PROJECT_RUN_DIR + "/" + sample + "/gatk/" + sample + ".hg19.dict"
#         # /GATK_bundle/hg19/ucsc.hg19.dict
#     threads: 1
#     params:
#         gatk4 = "/public/software/gatk-4.0.6.0/gatk",
#         gatk_dir = PROJECT_RUN_DIR + "/" + sample + "/gatk"
#     shell:
#         """
#         mkdir -p {params.gatk_dir}
#         gatk CreateSequenceDictionary -R {input} -O {output}
#         """ 

rule CreateExonIntervalBed:
    input:
        regions_bed = config["regions_bed"],
        hg_dict = config["hg_dict"]
    output:
        Exon_Interval_bed = PROJECT_RUN_DIR + "/gatk/" + sample + ".Exon.Interval.bed"
    threads: 1
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk BedToIntervalList -I {input.regions_bed} -O {output.Exon_Interval_bed} -SD {input.hg_dict}
        """ 

# Coverage of the exon target regions
rule exon_region:
    input:
        bam = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam",
        Exon_Interval_bed = PROJECT_RUN_DIR + "/gatk/" + sample + ".Exon.Interval.bed"
    output:
        stat_txt = PROJECT_RUN_DIR + "/gatk/" + sample + ".stat.txt"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk CollectHsMetrics -BI {input.Exon_Interval_bed} -TI {input.Exon_Interval_bed} -I {input.bam} -O {output.stat_txt}
        """ 

# Mark PCR duplicates
rule mark_duplicates:
    input:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
    output:
        mkdup_bam = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam",
        metrics = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.bam.metrics"
    threads: 4
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" MarkDuplicates -I {input} -O {output.mkdup_bam} -M {output.metrics}
        """

rule bam_index_2:
    input:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam"
    output:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam.bai"
    threads: 2
    shell:
        "samtools index {input} > {output}"

# Insert-size distribution statistics
#rule InsertSizeStat:
#    input:
#        PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.bam"
#    output:
#        insert_size_metrics_txt = temp(PROJECT_RUN_DIR + "/gatk/" + sample + ".insert_size_metrics.txt"),
#        insert_size_histogram_pdf = PROJECT_RUN_DIR + "/gatk/" + sample + ".insert_size_histogram.pdf"
#    threads:1
#    params:
#        picard = "/public/software/picard.jar",
#        M = 0.5
#    shell:
#        """
#       java -jar {params.picard} CollectInsertSizeMetrics \
#        I={input} \
#        O={output.insert_size_metrics_txt} \
#        H={output.insert_size_histogram_pdf} \
#        M=0.5
#        """

# Base quality score recalibration (BQSR)
# BaseRecalibrator computes the recalibration features over all reads and writes a recalibration table (.recal_data.table)
rule BaseRecalibrator:
    input:
        mkdup_bam = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam",
        mkdup_bam_index = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam.bai",
        vcf_1000G_phase1_indels_sites = config["vcf_1000G_phase1_indels_sites"],
        vcf_Mills_and_1000G_gold_standard_indels_sites = config["vcf_Mills_and_1000G_gold_standard_indels_sites"],
        vcf_dbsnp = config["vcf_dbsnp"]
    output:
        PROJECT_RUN_DIR + "/gatk/" + sample + ".recal_data.table"
    threads: 8
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        hg_fa = config["hg_fa"],
        regions_bed = config["regions_bed"]
    shell:
        """
        gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" BaseRecalibrator \
        -R {params.hg_fa} \
        -I {input.mkdup_bam} \
        --known-sites {input.vcf_1000G_phase1_indels_sites} \
        --known-sites {input.vcf_Mills_and_1000G_gold_standard_indels_sites} \
        --known-sites {input.vcf_dbsnp} \
        -L {params.regions_bed} \
        -O {output}
        """
        
# gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" BaseRecalibrator -R genome/hg19.fa -I ./TKQX.sorted.mkdup.bam --known-sites /GATK_bundle/hg19/1000G_phase1.indels.hg19.sites.vcf --known-sites /GATK_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf --known-sites /GATK_bundle/hg19/dbsnp_138.hg19.vcf -L /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed -O ./TKQX.recal_data.table
# java -jar CreateSequenceDictionary.jar R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
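Conceptually, BQSR compares the reported base qualities against empirical error rates per covariate bin, using the Phred relation Q = -10·log10(error). The toy sketch below illustrates only that relation, not GATK's full covariate model:

```python
import math

def empirical_quality(mismatches, observations):
    """Phred-scaled empirical quality for one covariate bin:
    Q = -10 * log10(error rate). The +1/+2 pseudocounts avoid log(0)."""
    error = (mismatches + 1) / (observations + 2)
    return -10 * math.log10(error)

# 10 mismatches in 10,000 observed bases -> roughly Q29.6
q = empirical_quality(10, 10000)
```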

# ApplyBQSR uses the recalibration table to adjust the base quality values in the original BAM and writes a new BAM with the recalibrated qualities
rule ApplyBQSR:
    input:
        mkdup_bam = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam",
        recal_data = PROJECT_RUN_DIR + "/gatk/" + sample + ".recal_data.table",
        regions_bed = config["regions_bed"]
        #regions_bed = "/GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed"
    output:
        PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.BQSR.bam"
    threads: 4
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        hg_fa = config["hg_fa"]
        #hg_fa = "genome/hg19.fa"
    shell:
        """
        gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" ApplyBQSR \
        -R {params.hg_fa} \
        -I {input.mkdup_bam} \
        -bqsr {input.recal_data} \
        -L {input.regions_bed} \
        -O {output}
        """
# gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" ApplyBQSR -R genome/hg19.fa -I ./test_sample.sorted.mkdup.bam -bqsr ./test_sample.recal_data.table -L /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed -O ./test_sample.sorted.mkdup.BQSR.bam

# # Base quality recalibration plots
# rule recal_data_plot:
#     input:
#         PROJECT_RUN_DIR + "/" + sample + "/gatk/" + sample + ".recal_data.table"
#     output:
#         PROJECT_RUN_DIR + "/" + sample + "/gatk/" + sample + ".recal_data.table.plot"
#     threads: 1
#     params:
#         gatk4 = "/public/software/gatk-4.0.6.0/gatk",
#         gatk = "/public/software/anaconda3/envs/PGS_1M/bin/gatk"
#     shell:
#         """
#         gatk AnalyzeCovariates -bqsr {input} -plots {output}
#         """
# gatk AnalyzeCovariates -bqsr ./test_sample.recal_data.table -plots ./test_sample.recal_data.table.plot

# Sequencing depth and coverage statistics
#rule depthOfcoverage:
#    input:
#        BQSR_mkdup_bam = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.BQSR.bam"
#    output:
#        PROJECT_RUN_DIR + "/gatk/depthOfcoverage"
#    threads:1
#    params:
#        #hg19_fa = "genome/hg19.fa",
#        hg_fa = config["hg_fa"],
#        bamdst = "/public/software/bamdst/bamdst",
#        regions_bed = config["regions_bed"]
#        #regions_bed = "/GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed"
#    shell:
#        """
#        {params.bamdst} -p {params.regions_bed} -o {output} {input.BQSR_mkdup_bam}
#        """
# """
# gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" DepthOfCoverage  \
# -R {params.hg19} \
# -o {output} \
# -I {input.BQSR_mkdup_bam} \
# -L {input.regions_bed} \
# --omitDepthOutputAtEachBase --omitIntervalStatistics \
# -ct 1 -ct 5 -ct 10 -ct 20 -ct 30 -ct 50 -ct 100
# """
#gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" DepthOfCoverage -R genome/hg19.fa -o ./TKQX -I ./TKQX.sorted.mkdup.BQSR.bam -L /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed --omitDepthOutputAtEachBase --omitIntervalStatistics -ct 1 -ct 5 -ct 10 -ct 20 -ct 30 -ct 50 -ct 100

# /public/software/bamdst/bamdst -p /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed -o ./depth_coverage ./TKQX.sorted.mkdup.BQSR.bam

# Single-sample VCF calling
rule vcf_calling:
    input:
        BQSR_mkdup_bam = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.BQSR.bam"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".raw.vcf"
    threads: 4
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        hg_fa = config["hg_fa"],
        vcf_dbsnp = config["vcf_dbsnp"],
        regions_bed = config["regions_bed"]
    shell:
        """
        gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" HaplotypeCaller \
        -R {params.hg_fa} \
        -I {input.BQSR_mkdup_bam} \
        -D {params.vcf_dbsnp} \
        -L {params.regions_bed} \
        -O {output}
        """
# gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" HaplotypeCaller -R genome/hg19.fa -I ./TKQX.sorted.mkdup.BQSR.bam -D /GATK_bundle/hg19/dbsnp_138.hg19.vcf -L /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed -O ./TKQX.raw.vcf

# Variant QC and filtering: flag likely false-positive calls in the raw VCF
rule vcf_select_SNP:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".raw.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.vcf"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk SelectVariants -select-type SNP -V {input} -O {output}
        """

rule vcf_filter_SNP:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.filter.vcf"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        QD = 2.0,
        MQ = 40.0,
        FS = 60.0,
        SOR = 3.0,
        MQRankSum = -12.5,
        ReadPosRankSum = -8.0
    shell:
        """
        gatk VariantFiltration \
        -V {input} \
        --filter-expression "QD < {params.QD} || MQ < {params.MQ} || FS > {params.FS} || SOR > {params.SOR} || MQRankSum < {params.MQRankSum} || ReadPosRankSum < {params.ReadPosRankSum}" --filter-name "SNP_filter" \
        -O {output}
        """
# --filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "SNP_filter"
# gatk VariantFiltration -V TKQX.snp.vcf --filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "SNP_filter" -O TKQX.snp.filter.vcf
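Note the VariantFiltration semantics: sites matching `--filter-expression` are labeled with the `--filter-name`, while passing sites get `PASS`, so the name should describe failure rather than be "PASS". A toy Python version of the SNP hard-filter logic (hypothetical sketch; GATK's JEXL engine by default does not fail sites on missing annotations, which the sketch mirrors by skipping them):

```python
SNP_THRESHOLDS = {  # annotation -> (comparison, cutoff), matching the rule above
    "QD":  ("lt", 2.0),
    "MQ":  ("lt", 40.0),
    "FS":  ("gt", 60.0),
    "SOR": ("gt", 3.0),
    "MQRankSum": ("lt", -12.5),
    "ReadPosRankSum": ("lt", -8.0),
}

def fails_hard_filter(info, thresholds=SNP_THRESHOLDS):
    """True if any annotation violates its cutoff (an OR of all terms,
    as in the --filter-expression). Missing annotations are skipped."""
    for key, (op, cutoff) in thresholds.items():
        if key not in info:
            continue
        value = info[key]
        if op == "lt" and value < cutoff:
            return True
        if op == "gt" and value > cutoff:
            return True
    return False
```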

# rule snp_frequency_stat:
#     input:
#         PROJECT_RUN_DIR + "/" + sample + "/vcf/" + sample + ".snp.filter.vcf"
#     output:
#         PROJECT_RUN_DIR + "/" + sample + "/vcf/" + sample + ".snp.filter.vcf_stat.csv"
#     threads: 1
#     params:
#         gatk4 = "/public/software/gatk-4.0.6.0/gatk",
#         snp_frequency = "/public/analysis/pipeline/WES/scripts/snp_frequency.py"
#     shell:
#         """
#         python {params.snp_frequency} {input}
#         """

rule vcf_select_InDel:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".raw.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.vcf"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk SelectVariants -select-type INDEL -V {input} -O {output}
        """

rule vcf_filter_InDel:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.filter.vcf"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        QD = 2.0,
        FS = 200.0,
        SOR = 10.0,
        MQRankSum = -12.5,
        ReadPosRankSum = -8.0
    shell:
        """
        gatk VariantFiltration \
        -V {input} \
        --filter-expression "QD < {params.QD} || FS > {params.FS} || SOR > {params.SOR} || MQRankSum < {params.MQRankSum} || ReadPosRankSum < {params.ReadPosRankSum}" --filter-name "INDEL_filter" \
        -O {output}
        """
# --filter-expression "QD < 2.0 || FS > 200.0 || SOR > 10.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "INDEL_filter"

rule merge_vcf:
    input:
        snp_filter_vcf = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.filter.vcf",
        indel_filter_vcf = PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.filter.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.filter.vcf"
    threads: 1
    params:
        vcf_dir = PROJECT_RUN_DIR + "/vcf"
    shell:
        """
        cd {params.vcf_dir}
        gatk MergeVcfs \
        -I {input.snp_filter_vcf} \
        -I {input.indel_filter_vcf} \
        -O {output}
        """

rule transfer_avinput:
    input:
        snp = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.filter.vcf",
        indel= PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.filter.vcf",
        merge = PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.filter.vcf"
    output:
        snp = temp(PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.avinput"),
        indel = temp(PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.avinput"),
        merge = temp(PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.avinput")
    threads: 4
    params:
        convert2annovar = "/public/software/annovar/convert2annovar.pl"
    shell:
        """
        perl {params.convert2annovar} -format vcf4 {input.snp} > {output.snp}
        perl {params.convert2annovar} -format vcf4 {input.indel} > {output.indel}
        perl {params.convert2annovar} -format vcf4 {input.merge} > {output.merge}
        """
# SNP annotation
rule snp_annotate:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.avinput"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + "_snpanno." + genome + "_multianno.csv"
    threads: 2
    params:
        table_annovar = "/public/software/annovar/table_annovar.pl",
        humandb = "/public/software/annovar/humandb/",
        prefix = sample + "_snpanno",
        genome = config["genome"],
        vcf_dir = PROJECT_RUN_DIR + "/vcf"
    shell:
        """
        cd {params.vcf_dir}
        perl {params.table_annovar} {input} {params.humandb} -buildver {params.genome} -out {params.prefix} \
        -remove -protocol refGene,cytoBand,clinvar_20221231,avsnp147,exac03,gnomad211_exome,SAS.sites.2015_08,esp6500siv2_all,intervar_20180118,dbnsfp42c \
        -operation g,r,f,f,f,f,f,f,f,f -nastring . -csvout
        """
        
# INDEL annotation
rule indel_annotate:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.avinput"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + "_indelanno." + genome + "_multianno.csv"
    threads: 2
    params:
        table_annovar = "/public/software/annovar/table_annovar.pl",
        humandb = "/public/software/annovar/humandb/",
        prefix = sample + "_indelanno",
        genome = config["genome"],
        vcf_dir = PROJECT_RUN_DIR + "/vcf",
        perl = "/home/miniconda3/pkgs/perl-5.26.2-h36c2ea0_1008/bin/perl"
    shell:
        """
        cd {params.vcf_dir}
        perl {params.table_annovar} {input} {params.humandb} -buildver {params.genome} -out {params.prefix} \
        -remove -protocol refGene,cytoBand,clinvar_20221231,gnomad211_exome,exac03,esp6500siv2_all,ALL.sites.2015_08 \
        -operation g,r,f,f,f,f,f -nastring . -csvout
        """

# Annotation of the merged SNP + INDEL call set
rule merge_annotate:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.avinput"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + "_allanno." + genome + "_multianno.csv"
    threads: 2
    params:
        table_annovar = "/public/software/annovar/table_annovar.pl",
        humandb = "/public/software/annovar/humandb/",
        prefix = sample + "_allanno",
        genome = config["genome"],
        vcf_dir = PROJECT_RUN_DIR + "/vcf"
    shell:
        """
        cd {params.vcf_dir}
        perl {params.table_annovar} {input} {params.humandb} -buildver {params.genome} -out {params.prefix} \
        -remove -protocol refGene,cytoBand,clinvar_20221231,avsnp147,exac03,gnomad211_exome,SAS.sites.2015_08,esp6500siv2_all,intervar_20180118,dbnsfp42c \
        -operation g,r,f,f,f,f,f,f,f,f -nastring . -csvout
        """

Recommended posts in this series

Advanced Bioinformatics 1 - Extracting HLA-region reads and converting BAM to FASTQ for HLA analysis

Advanced Bioinformatics 2 - Correcting read counts with loess regression on GC content

Advanced Bioinformatics 3 - Advanced pysam tricks for counting unique and mapped reads in BAM files

Advanced Bioinformatics 4 - FLAG and CIGAR fields in alignments, and extracting specified regions from BAM files
