A complete metagenome analysis workflow based on Snakemake and Python scripts

The following software has already been installed in a conda environment on a Linux system (conda activate snakemake):

Software      Function
Trimmomatic   Quality control
megahit       Assembly
prodigal      Gene prediction
linclust      Clustering
salmon        Quantification
diamond       Annotation
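
These tools are all available from bioconda, so an equivalent environment could be set up roughly as follows (a sketch only; package names assumed from bioconda and versions unpinned — the workflow below actually calls local copies of Trimmomatic, FastQC and MEGAN by absolute path):

conda create -n snakemake -c conda-forge -c bioconda snakemake fastp trimmomatic megahit prodigal mmseqs2 cd-hit salmon diamond seqkit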

Preparation:

Place the raw reads in the ./raw_fq directory.

Create mapping.txt (/ZYdata2/ys2022/metegenome/script/mapping.txt): leave the header line unchanged, one sample name per line.

mapping.txt


#SampleID
VAB
VCD

Note: ./ refers to the workdir set in project_config.yaml.

project_config.yaml


# == current & fastq & mapping file location ==
workdir: /ZYdata2/ys2022/metegenome/V24_LZmedium2
Fqlocation: /ZYdata2/ys2022/metegenome/V24_LZmedium2/raw_fq
mapping_file: /ZYdata2/ys2022/metegenome/script/mapping.txt
# == Software parameters ==
trim:
    method: fastp
    fastp:
        # parameter: -5 -W 5 -M 20 -q 15 -u 40 -l 50  # parameters used by Zhong Shengji (South China Normal University); this protocol uses the fastp defaults, which are stricter than Zhong's settings
        threads: 120

assembly:
    method: megahit
    threads: 40
    megahit:
        parameter: --k-min 35 --k-max 95 --k-step 20 --min-contig-len 500 -m 0.8

taxa:
    topN:
        10

All scripts live in /ZYdata2/ys2022/metegenome/script; run the pipeline from that path.

1. Quality control

  • Method 1: fastp

Run the following qc.smk file:

snakemake -s qc.smk --core 2 -n  --scheduler greedy

In May 2025 the conda snakemake environment broke and commands would not parse without adding --scheduler greedy; use it as a temporary workaround for now, without further tuning.

Note: --core here is the number of cores available for parallel jobs; set it to the number of samples. It does not conflict with a tool's internal --thread option, which is the number of cores a single task uses.

fastp defaults: -W 4 (window size shared by cut_front, cut_tail and cut_sliding; range 1~1000, default 4), -M 20 (mean quality requirement shared by cut_front, cut_tail and cut_sliding; range 1~36, default Q20), -q 15 (quality value for a base to count as qualified; phred >= Q15), -u 40 (percentage of bases allowed to be unqualified, 0~100; default 40%), -l 15 (reads shorter than length_required are discarded; default 15).

--cut_front is off by default; it is enabled here with -5.


import os  # import the os module
from os.path import join  # import join
# load the config file
configfile:'project_config.yaml'
mapping_file = config["mapping_file"]
Sample = []
with open (mapping_file,'r') as mf:
    for line in mf:
        Sample.append(line.split('\n')[0])
Sample = Sample[1:]
Workdir = config["workdir"]
Fqlocation = config["Fqlocation"]
os.chdir(Workdir)



rule all:  # the global targets, i.e. the final output files
    input:
        expand("1.clean_data/{sample}/clean_R1.fq.gz",sample = Sample),
        expand("1.clean_data/{sample}/clean_R2.fq.gz",sample = Sample)

rule trim:  # quality-control rule
        input:  # location of the raw paired-end reads; {sample} expands to every sample name
            fq1 = join(Fqlocation,"{sample}.R1.fq.gz"),  # note the trailing ","
            fq2 = join(Fqlocation,"{sample}.R2.fq.gz")  # no trailing "," on the last item
        output:  # output paths and file names
            "1.clean_data/{sample}/clean_R1.fq.gz",
            "1.clean_data/{sample}/clean_R2.fq.gz"
        params:
            threads = config["trim"]["fastp"]["threads"]
        shell:  # run the tool with its parameters; the backslash is a line continuation
            "fastp --in1 {input.fq1} --in2 {input.fq2} --out1 {output[0]} --out2 {output[1]} --thread {params.threads} -5 -W 5 -M 20 -q 15 -u 40 -l 50 \
            -j ./1.clean_data/{wildcards.sample}/{wildcards.sample}_fastp.json \
            -h ./1.clean_data/{wildcards.sample}/{wildcards.sample}_fastp.html"
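
Note that qc.smk passes the fastp thread count through params, which Snakemake does not count against --core when scheduling. If the scheduler should reserve those cores, a variant using the threads directive would look like this (a sketch, not the script actually used above):

rule trim:
    input:
        fq1 = join(Fqlocation, "{sample}.R1.fq.gz"),
        fq2 = join(Fqlocation, "{sample}.R2.fq.gz")
    output:
        "1.clean_data/{sample}/clean_R1.fq.gz",
        "1.clean_data/{sample}/clean_R2.fq.gz"
    threads: config["trim"]["fastp"]["threads"]  # counted against --core by the scheduler
    shell:
        "fastp --in1 {input.fq1} --in2 {input.fq2} --out1 {output[0]} --out2 {output[1]} "
        "--thread {threads} -5 -W 5 -M 20 -q 15 -u 40 -l 50 "
        "-j ./1.clean_data/{wildcards.sample}/{wildcards.sample}_fastp.json "
        "-h ./1.clean_data/{wildcards.sample}/{wildcards.sample}_fastp.html"
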
  • Method 2: Trimmomatic (this is the method adopted; the output data quality is a bit higher)

conda activate snakemake

snakemake -s qctrim.smk --core 100 -p


import os  # import the os module
from os.path import join
# load the config file
configfile:'project_config.yaml'
mapping_file = config["mapping_file"]
Sample = []
with open (mapping_file,'r') as mf:
    for line in mf:
        Sample.append(line.split('\n')[0])
Sample = Sample[1:]
Workdir = config["workdir"]
Fqlocation = config["Fqlocation"]
os.chdir(Workdir)



rule all:  # the global targets, i.e. the final output files
    input:
        expand("1.clean_data/{sample}/clean_R1.fq.gz",sample = Sample),
        expand("1.clean_data/{sample}/clean_R2.fq.gz",sample = Sample),
        expand("1.clean_data/{sample}/output_forward_unpaired.fq.gz",sample = Sample),
        expand("1.clean_data/{sample}/output_reverse_unpaired.fq.gz",sample = Sample)

rule trim:  # quality-control rule
        input:  # location of the raw paired-end reads; {sample} expands to every sample name
            fq1 = join(Fqlocation,"{sample}.R1.fq.gz"),  # note the trailing ","
            fq2 = join(Fqlocation,"{sample}.R2.fq.gz")  # no trailing "," on the last item
        output:  # output paths and file names
            "1.clean_data/{sample}/clean_R1.fq.gz",
            "1.clean_data/{sample}/clean_R2.fq.gz",
            "1.clean_data/{sample}/output_forward_unpaired.fq.gz",
            "1.clean_data/{sample}/output_reverse_unpaired.fq.gz"
        shell:  # run the tool with its parameters
            "java -jar /ZYdata2/ys2022/metegenome/script/Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 -threads 10 {input.fq1} {input.fq2} {output[0]} {output[2]} {output[1]} {output[3]} ILLUMINACLIP:/ZYdata2/ys2022/metegenome/script/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 SLIDINGWINDOW:5:20 MINLEN:50"
# check the cleaning quality with FastQC
# note: this rule is not listed in rule all, so Snakemake will not run it automatically
rule FastQC:
        input:
            "1.clean_data/{sample}/clean_R1.fq.gz",
            "1.clean_data/{sample}/clean_R2.fq.gz"
        output:"1.clean_data/{sample}/FastQC"
        shell:
            "/ZYdata2/ys2022/metegenome/script/FastQC/fastqc {input[0]} {input[1]} -o {output} -t 10"

If the FastQC step did not run on its own, run it manually: go into 1.clean_data/<sample>, create a FastQC directory, and run:

/ZYdata2/ys2022/metegenome/script/FastQC/fastqc clean_R1.fq.gz clean_R2.fq.gz -o ./FastQC -t 10
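
To do this for every sample in one go, a small loop over the sample names in mapping.txt could look like this (a sketch, run from the workdir; paths as configured above):

while read s; do
    [ "$s" = "#SampleID" ] && continue          # skip the header line
    mkdir -p 1.clean_data/${s}/FastQC
    /ZYdata2/ys2022/metegenome/script/FastQC/fastqc \
        1.clean_data/${s}/clean_R1.fq.gz 1.clean_data/${s}/clean_R2.fq.gz \
        -o 1.clean_data/${s}/FastQC -t 10
done < /ZYdata2/ys2022/metegenome/script/mapping.txt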

2. Assembly — megahit

snakemake -s assambly.smk --core 40 -n

The -t option of megahit is the core count for a single sample, while Snakemake's --core is the total core count across all samples. For example, with -t 20 and --core 100, five samples run at the same time with 20 cores each while the rest wait in the queue.


import os
from os.path import join
# load the config file
configfile:'project_config.yaml'
mapping_file = config["mapping_file"]
Sample = []
with open (mapping_file,'r') as mf:
    for line in mf:
        Sample.append(line.split('\n')[0])
Sample = Sample[1:]
Workdir = config["workdir"]
Fqlocation = config["Fqlocation"]
os.chdir(Workdir)

rule all:
    input:
        expand("2.assembly/{sample}",sample = Sample)

rule megahit:
        input:
            r1 = "1.clean_data/{sample}/clean_R1.fq.gz",#Fqlocation中要以/结尾
            r2 = "1.clean_data/{sample}/clean_R2.fq.gz"
        output:directory("2.assembly/{sample}")
        params:
            param = config['assembly']['megahit']["parameter"]
        threads:config["assembly"]["threads"]
        shell:
            "megahit -1 {input.r1} -2 {input.r2} -o {output} {params.param} -t {threads}"

Run RenameContig.py to prepend the sample name to every contig name:

python RenameContig.py


import os

print('Reading sample names')
# read the sample names from mapping.txt
Sample = []
with open('mapping.txt', 'r') as mf:
    for line in mf:
        Sample.append(line.split('\n')[0])
Sample = Sample[1:]
print('Reading the work directory and switching to it')
# read the workdir from project_config.yaml and change the current working directory
with open('project_config.yaml', 'r') as pj:
    for line in pj:
        if 'workdir' in line:
            line = line.strip()
            workdir = line.split(': ')[1]
os.chdir(workdir)

print('Renaming contigs, please wait..................')
# prepend the sample label to every contig header
for sample in Sample:
    infile = os.path.join(os.getcwd(), '2.assembly', sample, "final.contigs.fa")
    outfile = os.path.join(os.getcwd(), '2.assembly', sample, "final.contigsRename.fa")
    with open(infile, 'r') as file, open(outfile, 'w') as Newfile:
        for line in file:
            if line.startswith('>'):
                line = line.replace('>', ">" + sample + '_')
            Newfile.write(line)
print('Contig names updated (sample label prepended); saved as 2.assembly/<sample>/final.contigsRename.fa')
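
A quick way to spot-check the result is to look at the first header of one renamed assembly; it should now carry the sample prefix (the rest of the header is whatever megahit wrote):

head -n 1 2.assembly/VAB/final.contigsRename.fa
# expected to start with something like ">VAB_k95_..." (exact header depends on the megahit run)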

      3. Gene prediction — prodigal; clustering — cd-hit / linclust; quantification — salmon

      Method 1: gene prediction — prodigal, clustering — cd-hit, quantification — salmon (abandoned: cd-hit collapses the genes too aggressively)

      snakemake -s gene.smk --core 120 -p

      
      import os
      from os.path import join
      # load the config file
      configfile:'project_config.yaml'
      mapping_file = config["mapping_file"]
      Sample = []
      with open (mapping_file,'r') as mf:
          for line in mf:
              Sample.append(line.split('\n')[0])
      Sample = Sample[1:]
      Workdir = config["workdir"]
      os.chdir(Workdir)
      
      rule all:
          input:
              expand("3.gene/prodigal/{sample}.gff",sample = Sample),
              expand("3.gene/prodigal/{sample}_prot.faa",sample = Sample),
              expand("3.gene/prodigal/{sample}_nucl.fna",sample = Sample),
              expand("3.gene/prodigal/log/{sample}.log",sample = Sample),
              expand("3.gene/prodigal/all_nucl.fna"),
              expand("3.gene/uniq_genes_nucl.fna"),
              expand("3.gene/uniq_genes_prot.faa"),
              expand("3.gene/salmon/index"),
              expand("3.gene/salmon/{sample}.quant",sample = Sample),
              expand("3.gene/gene.TPM"),
              expand("3.gene/gene.count")
      # predict genes with prodigal
      rule prodigal:
          input:"2.assembly/{sample}/final.contigsRename.fa"
          output:
              gff = ("3.gene/prodigal/{sample}.gff"),
              faa = ("3.gene/prodigal/{sample}_prot.faa"),
              fna = ("3.gene/prodigal/{sample}_nucl.fna"),
              log = ("3.gene/prodigal/log/{sample}.log")
          shell:
              "prodigal -i {input} -o {output.gff} -d {output.fna} -a {output.faa} -p meta  &> {output.log}"
      
      # merge the gene nucleotide sequences of all samples into one file
      rule get_all_nucl:
          input:
              expand("3.gene/prodigal/{sample}_nucl.fna",sample=Sample)
          output:
              "3.gene/prodigal/all_nucl.fna"
          shell:
              "cat {input} > {output}"
      # cluster to obtain the non-redundant gene set
      # -aS coverage of the shorter sequence, -c identity, -G 0 local alignment, -g 0 fast cluster assignment, -T threads, -M memory (0 = unlimited)
      rule cluster:
          input: "3.gene/prodigal/all_nucl.fna"
          output: "3.gene/uniq_genes_nucl.fna"
          shell:
              "cd-hit-est -i {input} -o {output} -aS 0.9 -c 0.95 -G 0 -g 0 -T 0 -M 0"
      # translate the non-redundant genes into protein sequences
      rule trans_seq:
          input: rules.cluster.output
          output: "3.gene/uniq_genes_prot.faa"
          shell:
              "seqkit translate {input} -T 11 > {output}"
      # build the salmon index
      rule salmon_index:
          input: "3.gene/uniq_genes_nucl.fna"
          output: directory("3.gene/salmon/index")
          shell:
              "salmon index -t {input} -p 9 -i {output}"
      
      # quantification: -l A auto-detects the library type, -p threads, --meta metagenome mode; two jobs run two samples in parallel
      rule salmon_quant:
          input:
              fq1 = "1.clean_data/{sample}/clean_R1.fq.gz",
              fq2 = "1.clean_data/{sample}/clean_R2.fq.gz",
              index = "3.gene/salmon/index"
          output: directory("3.gene/salmon/{sample}.quant")
          shell:
              "salmon quant -i {input.index} -l A -p 8 --meta \
              -1 {input.fq1} -2 {input.fq2} -o {output}"
      
      rule salmon_quantmerge:
          input: expand("3.gene/salmon/{sample}.quant",sample=Sample)
          output: TPM = "3.gene/gene.TPM",
                  Count = "3.gene/gene.count"
          shell:'''
              salmon quantmerge --quants 3.gene/salmon/*.quant -o {output.TPM}
              salmon quantmerge --quants 3.gene/salmon/*.quant --column NumReads -o {output.Count}
      '''
      • Method 2: gene prediction — prodigal, clustering — linclust, quantification — salmon

      snakemake -s geneLinclust1.smk -p --core 100

      
      import os
      from os.path import join
      
      # load the config file
      configfile: 'project_config.yaml'
      mapping_file = config["mapping_file"]
      Sample = []
      with open(mapping_file, 'r') as mf:
          for line in mf:
              Sample.append(line.strip())
      Sample = Sample[1:]
      Workdir = config["workdir"]
      os.chdir(Workdir)
      
      # debug info
      print("Current working directory:", os.getcwd())
      print("Sample list:", Sample)
      
      rule all:
          input:
              expand("3.gene/prodigal/{sample}.gff", sample=Sample),
              expand("3.gene/prodigal/{sample}_prot.faa", sample=Sample),
              expand("3.gene/prodigal/{sample}_nucl.fna", sample=Sample),
              expand("3.gene/prodigal/log/{sample}.log", sample=Sample),
              "3.gene/prodigal/all_nucl.fna",
      
      
      # predict genes with prodigal
      rule prodigal:
          input: "2.assembly/{sample}/final.contigsRename.fa"
          output:
              gff = "3.gene/prodigal/{sample}.gff",
              faa = "3.gene/prodigal/{sample}_prot.faa",
              fna = "3.gene/prodigal/{sample}_nucl.fna",
              log = "3.gene/prodigal/log/{sample}.log"
          shell:
              "prodigal -i {input} -o {output.gff} -d {output.fna} -a {output.faa} -p meta &> {output.log}"
      
      # merge the gene nucleotide sequences of all samples into one file
      rule get_all_nucl:
          input:
              expand("3.gene/prodigal/{sample}_nucl.fna", sample=Sample)
          output:
              "3.gene/prodigal/all_nucl.fna"
          shell:
              "cat {input} > {output}"
      
      

      snakemake -s geneLinclust2.smk -p --core 100 -n

      This dry run is only used to print the commands; each command is then run manually.

       
      import os
      from os.path import join
       
      # load the config file
      configfile: 'project_config.yaml'
      mapping_file = config["mapping_file"]
      Sample = []
      with open(mapping_file, 'r') as mf:
          for line in mf:
              Sample.append(line.strip())
      Sample = Sample[1:]
      Workdir = config["workdir"]
      os.chdir(Workdir)
       
      # debug info
      print("Current working directory:", os.getcwd())
      print("Sample list:", Sample)
       
      rule all:
          input:
              "3.gene/linclust/all_nucl.linclustDB",
              "3.gene/linclust/all_nucl.linclust_DB_clu",
              "3.gene/linclust/tmp",
              "3.gene/linclust/DB_clu_re",
              "3.gene/uniq_genes_nucl.fna",
              "3.gene/uniq_genes_prot.faa",
              directory("3.gene/salmon/index"),
              expand("3.gene/salmon/{sample}.quant", sample=Sample),
              "3.gene/gene.TPM",
              "3.gene/gene.count"
       
      # cluster to obtain the non-redundant gene set
      # convert all_nucl.fna into MMseqs2 DB format
      rule DB:
          input: "3.gene/prodigal/all_nucl.fna"
          output: "3.gene/linclust/all_nucl.linclustDB"
          shell:
              "mmseqs createdb {input} {output}"
       
      # linclust clustering
      # run manually
      rule cluster:
          input: "3.gene/linclust/all_nucl.linclustDB"
          output:
              DB_clu = "3.gene/linclust/all_nucl.linclust_DB_clu",
              tmp = "3.gene/linclust/tmp"
          shell:
              "mmseqs linclust {input} {output.DB_clu} {output.tmp} -e 0.001 --min-seq-id 0.9 -c 0.80 --threads 100"
       
      # extract the representative sequences
      rule DB_clu_re:
          input:
              DB = "3.gene/linclust/all_nucl.linclustDB",
              DB_clu = "3.gene/linclust/all_nucl.linclust_DB_clu"
          output: "3.gene/linclust/DB_clu_re"
          shell:
              "mmseqs createsubdb {input.DB_clu} {input.DB} {output}"
       
      # convert DB_clu_re back to fasta format
      rule refasta:
          input: "3.gene/linclust/DB_clu_re"
          output: "3.gene/uniq_genes_nucl.fna"
          shell:
              "mmseqs convert2fasta {input} {output}"
       
      # translate the non-redundant genes into protein sequences
      rule trans_seq:
          input: "3.gene/uniq_genes_nucl.fna"
          output: "3.gene/uniq_genes_prot.faa"
          shell:
              "seqkit translate {input} -T 11 > {output}"
       
      # build the salmon index
      rule salmon_index:
          input: "3.gene/uniq_genes_nucl.fna"
          output: directory("3.gene/salmon/index")
          shell:
              "salmon index -t {input} -p 10 -i {output}"
       
      # quantification: -l A auto-detects the library type, -p threads, --meta metagenome mode
      rule salmon_quant:
          input:
              fq1 = "1.clean_data/{sample}/clean_R1.fq.gz",
              fq2 = "1.clean_data/{sample}/clean_R2.fq.gz",
              index = "3.gene/salmon/index"
          output: directory("3.gene/salmon/{sample}.quant")
          shell:
              "salmon quant -i {input.index} -l A -p 10 --meta -1 {input.fq1} -2 {input.fq2} -o {output}"
       
      rule salmon_quantmerge:
          input: expand("3.gene/salmon/{sample}.quant", sample=Sample)
          output:
              TPM = "3.gene/gene.TPM",
              Count = "3.gene/gene.count"
          shell: '''
              salmon quantmerge --quants 3.gene/salmon/*.quant -o {output.TPM}
              salmon quantmerge --quants 3.gene/salmon/*.quant --column NumReads -o {output.Count}
          '''

        In May 2025 the salmon quantification step started erroring out in the conda snakemake environment; re-updating salmon inside that environment did not help, because the environment carries too many dependencies.

        A new conda environment was therefore created: salmon2025.

        After printing the commands with geneLinclust2.smk as above, run the quantification steps in the salmon2025 environment.
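
        Put together from the rules above, the manual command sequence looks roughly like this (a sketch assembled from the rules; thread counts as in the rules, and the salmon steps run after conda activate salmon2025):

        mmseqs createdb 3.gene/prodigal/all_nucl.fna 3.gene/linclust/all_nucl.linclustDB
        mmseqs linclust 3.gene/linclust/all_nucl.linclustDB 3.gene/linclust/all_nucl.linclust_DB_clu 3.gene/linclust/tmp -e 0.001 --min-seq-id 0.9 -c 0.80 --threads 100
        mmseqs createsubdb 3.gene/linclust/all_nucl.linclust_DB_clu 3.gene/linclust/all_nucl.linclustDB 3.gene/linclust/DB_clu_re
        mmseqs convert2fasta 3.gene/linclust/DB_clu_re 3.gene/uniq_genes_nucl.fna
        seqkit translate 3.gene/uniq_genes_nucl.fna -T 11 > 3.gene/uniq_genes_prot.faa
        conda activate salmon2025
        salmon index -t 3.gene/uniq_genes_nucl.fna -p 10 -i 3.gene/salmon/index
        salmon quant -i 3.gene/salmon/index -l A -p 10 --meta -1 1.clean_data/VAB/clean_R1.fq.gz -2 1.clean_data/VAB/clean_R2.fq.gz -o 3.gene/salmon/VAB.quant   # repeat for every sample
        salmon quantmerge --quants 3.gene/salmon/*.quant -o 3.gene/gene.TPM
        salmon quantmerge --quants 3.gene/salmon/*.quant --column NumReads -o 3.gene/gene.count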

        NR taxonomic annotation — diamond

        Method 1 (not recommended)

        This method uses diamond's built-in simplified LCA algorithm, most likely a naive LCA.

        snakemake -s annotionNR.smk --core 120 -p

        
        import os
        from os.path import join
        
        # load the config file
        configfile:'project_config.yaml'
        mapping_file = config["mapping_file"]
        Sample = []
        with open (mapping_file,'r') as mf:
            for line in mf:
                Sample.append(line.split('\n')[0])
        Sample = Sample[1:]
        Workdir = config["workdir"]
        os.chdir(Workdir)
        
        
        rule all:
            input:
               expand("4.annotion/NR/nr.blast.LCA")
        
        rule NR_align:
            input:
                gene = "3.gene/uniq_genes_prot.faa",
                nr_dmnd = "/ZYdata1/ys2022/metegenome/Database/NRtaxonomy.dmnd"
            output:"4.annotion/NR/nr.blast.LCA"
            shell:'''
            diamond blastp -d {input.nr_dmnd} -q {input.gene} -f 102 -p 120 -e 0.00001 -o {output} -b 10 -c 1
            '''

        Converting the Taxonomy IDs from the diamond LCA output into lineages

        The lookup table /ZYdata1/ys2022/metegenome/Database/NRfromYs/ncbi_lineages_2022-12-20.csv is generated as described at GitHub - zyxue/ncbitax2lin (Convert NCBI taxonomy dump into lineages).
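
        For reference, regenerating that table with ncbitax2lin looks roughly like this (a sketch based on the project's README at the time of writing; verify the flags against the installed version — by default it writes a dated ncbi_lineages_*.csv.gz, which can be gunzipped to the csv used above):

        wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
        mkdir -p taxdump && tar -zxf taxdump.tar.gz -C taxdump
        pip install -U ncbitax2lin
        ncbitax2lin --nodes-file taxdump/nodes.dmp --names-file taxdump/names.dmp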

        python TaxonomyIDtoLineages.py

        
        import pandas as pd
        import os
        
        print('Reading sample names')
        # read the sample names from mapping.txt
        Sample = []
        with open('mapping.txt', 'r') as mf:
            for line in mf:
                Sample.append(line.split('\n')[0])
        Sample = Sample[1:]
        print('Reading the work directory and switching to it')
        # read the workdir from project_config.yaml and change the current working directory
        with open('project_config.yaml', 'r') as pj:
            for line in pj:
                if 'workdir' in line:
                    line = line.strip()
                    workdir = line.split(': ')[1]
        os.chdir(workdir)
        
        # step 1: add a header
        # read the result of the diamond-vs-NCBI LCA run
        request = pd.read_csv('./4.annotion/NR/nr.blast.LCA',
                              sep='\t', header=None, names=['Unigene_ID', 'Taxonomy_ID', 'e-value'])
        # sort by Taxonomy_ID so identical IDs sit next to each other and the lookup below can be reused
        request = request.sort_values(by='Taxonomy_ID').reset_index(drop=True)
        # add the new (empty) lineage columns
        request[['superkingdom','phylum','class','order',
                   'family','genus','species']]=request.apply(lambda x:('','','','','','',''),axis=1,result_type='expand')
        # read the ncbi_lineages lookup table
        data = pd.read_csv('/ZYdata1/ys2022/metegenome/Database/NRfromYs/ncbi_lineages_2022-12-20.csv',
                usecols=['tax_id','superkingdom','phylum','class','order','family','genus','species'],
                low_memory=False, index_col='tax_id')
        # look up the lineage of every Taxonomy ID
        row_num = len(request)
        i = 0
        ID1 = ''
        
        # IDs that cannot be found are written to a double-check file; if it is not empty,
        # look those IDs up in the NCBI Taxonomy database, patch ncbi_lineages_2022-12-20.csv, and rerun this step
        check = open('./4.annotion/NR/neededDoubleCheck', 'w')
        
        while i < row_num:
            ID2 = request.loc[i, 'Taxonomy_ID']
            if ID2 != ID1:  # is this a new Taxonomy ID?
                try:  # if the lookup succeeds, the except branch is skipped
                    # look ID2 up in data and turn the result into a one-row dataframe
                    s = data.loc[ID2]
                    df = s.to_frame().T
                    a = df.reset_index(drop=True)
                    # copy the lookup result into the request dataframe
                    request.loc[i, 'superkingdom'] = a.iloc[0, 0]
                    request.loc[i, 'phylum'] = a.iloc[0, 1]
                    request.loc[i, 'class'] = a.iloc[0, 2]
                    request.loc[i, 'order'] = a.iloc[0, 3]
                    request.loc[i, 'family'] = a.iloc[0, 4]
                    request.loc[i, 'genus'] = a.iloc[0, 5]
                    request.loc[i, 'species'] = a.iloc[0, 6]
                except:  # the Taxonomy ID is missing from data, record it for a double-check
                    check.write(str(ID2) + '\n')
            else:  # same ID as the previous row, so reuse the previous lookup result
                request.loc[i, 'superkingdom'] = a.iloc[0, 0]
                request.loc[i, 'phylum'] = a.iloc[0, 1]
                request.loc[i, 'class'] = a.iloc[0, 2]
                request.loc[i, 'order'] = a.iloc[0, 3]
                request.loc[i, 'family'] = a.iloc[0, 4]
                request.loc[i, 'genus'] = a.iloc[0, 5]
                request.loc[i, 'species'] = a.iloc[0, 6]
            ID1 = ID2  # remember the ID just processed, then move to the next row
            i += 1
        check.close()
        request.to_csv('./4.annotion/NR/nr.blast.LCA2Lineages.csv', index=False)  # write the output (mode 'w', i.e. the file is overwritten)
        
        # check whether any IDs need a double-check
        check = open('./4.annotion/NR/neededDoubleCheck', 'r')
        num = len(check.readlines())
        if num == 0:
            print('All IDs matched; continue with the Lineages2Quant step')
        else:
            print(str(num) + ' IDs need a double-check')
            print('Look them up in the NCBI Taxonomy database, add them to ncbi_lineages_2022-12-20.csv, then rerun this step')
        check = open('./4.annotion/NR/neededDoubleCheck', 'r')
        for line in check:
            line = line.strip()
            print(str(line) + ' needs a double-check')
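
        The row-by-row lookup above works but is slow when there are millions of genes. An equivalent, usually much faster approach is a single pandas merge on the taxonomy ID; a minimal sketch under the same file layout:

        import pandas as pd
        
        lca = pd.read_csv('./4.annotion/NR/nr.blast.LCA', sep='\t', header=None,
                          names=['Unigene_ID', 'Taxonomy_ID', 'e-value'])
        lineages = pd.read_csv('/ZYdata1/ys2022/metegenome/Database/NRfromYs/ncbi_lineages_2022-12-20.csv',
                               usecols=['tax_id', 'superkingdom', 'phylum', 'class', 'order',
                                        'family', 'genus', 'species'],
                               low_memory=False)
        merged = lca.merge(lineages, left_on='Taxonomy_ID', right_on='tax_id', how='left')
        # taxonomy IDs missing from the lineages table still need the manual double-check described above
        merged.loc[merged['tax_id'].isna(), 'Taxonomy_ID'].drop_duplicates().to_csv(
            './4.annotion/NR/neededDoubleCheck', index=False, header=False)
        merged.drop(columns='tax_id').to_csv('./4.annotion/NR/nr.blast.LCA2Lineages.csv', index=False)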

        Merging the lineages with the quantification results

        python Lineages2Quant.py

        
        import pandas as pd
        import os
        
        print('Reading sample names')
        # read the sample names from mapping.txt
        Sample = []
        with open ('mapping.txt','r') as mf:
            for line in mf:
                Sample.append(line.split('\n')[0])
        Sample = Sample[1:]
        print('Reading the work directory and switching to it')
        # read the workdir from project_config.yaml and change the current working directory
        with open ('project_config.yaml','r') as pj:
            for line in pj:
                if 'workdir' in line:
                    line = line.strip()
                    workdir = line.split(': ')[1]
        os.chdir(workdir)
        
        
        # step 3: merge in the quantification
        print('Reading nr.blast.LCA2Lineages.csv, please wait................')
        request = pd.read_csv('./4.annotion/NR/nr.blast.LCA2Lineages.csv',sep=',')
        
        # add new columns
        # build the new column names as a list
        new_TMP_col = []
        new_count_col = []
        kongzhi = []
        for name in Sample:
            new_TMP_col.append(name + '.gene.TPM')
            kongzhi.append('')
            new_count_col.append(name + '.gene.count')
            kongzhi.append('')
        # add the new columns
        request[new_TMP_col + new_count_col]=request.apply(lambda x:(kongzhi),axis=1,result_type='expand')
        print('New columns added to request')
        
        # read the quantification files
        print('Reading the quantification files, please wait...............')
        TPM = pd.read_csv('./3.gene/gene.TPM',sep='\t',index_col='Name')
        count = pd.read_csv('./3.gene/gene.count',sep='\t',index_col='Name')
        
        
        row_num = len(request)
        i = 0
        print('Matching, please wait.................')
        while i < row_num:
            ID = request.loc[i,'Unigene_ID']  # read the Unigene_ID column of nr.blast.LCA2Lineages.csv row by row
            sTMP = TPM.loc[ID]  # look the ID up in the TPM quantification table
            # turn the lookup result into a dataframe
            dfTMP = sTMP.to_frame().T
            a = dfTMP.reset_index(drop=True)
            # append the TPM columns to nr.blast.LCA2Lineages
            j = 0
            for new_TMP in new_TMP_col:
                request.loc[i, new_TMP] = a.iloc[0, j]
                j += 1
            # merge in the count results
            scount = count.loc[ID]
            dfcount = scount.to_frame().T
            b = dfcount.reset_index(drop=True)
            k = 0
            for new_count in new_count_col:
                request.loc[i, new_count] = b.iloc[0, k]
                k += 1
            i += 1
        print('Matching finished; saving the file, please wait..............')
        request.to_csv('./4.annotion/NR/lineages2geneQuant.csv',index=False)
        print('NR.LCA2Lineages has been merged with the quantification files; output: ./4.annotion/NR/lineages2geneQuant.csv')
        
        # step 4: split the result file if necessary
        # so that Windows Excel (max ~1.04M rows) can open it, save lineages2geneQuant.csv in chunks of 1,000,000 rows
        lineages2geneQuant = pd.read_csv('./4.annotion/NR/lineages2geneQuant.csv',sep=',',low_memory=False)
        row_num = len(lineages2geneQuant)
        step = 1000000
        
        for start in range(0, row_num, step):
            stop = start + step
            filename = './4.annotion/NR/lineages2geneQuant_{}-{}.csv'.format(start,stop)
            d = lineages2geneQuant[start : stop]
            print("Saving file : " + filename + ", data size : " + str(len(d)))
            d.to_csv(filename,index=0)
        
        

          Method 2: diamond (alignment) + MEGAN (taxon assignment with weighted LCA)

          diamond alignment, writing a .daa file

          snakemake -s annotionNRdaa.smk --core 120 -p

          
          import os
          from os.path import join
          
          # load the config file
          configfile:'project_config.yaml'
          mapping_file = config["mapping_file"]
          Sample = []
          with open (mapping_file,'r') as mf:
              for line in mf:
                  Sample.append(line.split('\n')[0])
          Sample = Sample[1:]
          Workdir = config["workdir"]
          os.chdir(Workdir)
          
          
          rule all:
              input:
                 expand("4.annotion/NR/nr.blast.daa")
          
          rule NR_align:
              input:
                  gene = "3.gene/uniq_genes_prot.faa",
                  nr_dmnd = "/ZYdata1/ys2022/metegenome/Database/NRfromYs/NR/NR.dmnd"
              output:"4.annotion/NR/nr.blast.daa"
              shell:'''
              diamond blastp -d {input.nr_dmnd} -q {input.gene} -f 100 -p 120 -e 0.00001 -o {output} -b 10 -c 1
              '''

          Convert .daa to .rma. Run the following in the directory ./4.annotion/NR:

          /home/ys202108/megan/tools/daa2rma -i nr.blast.daa -mdb /ZYdata1/ys2022/metegenome/Database/MeganData/megan-map-Feb2022.db -alg weighted -t 50 -o ./nr.blast.rma

          Continue by extracting the taxon assignments:

          /home/ys202108/megan/tools/rma2info -i ./nr.blast.rma -r2c Taxonomy -n true --paths true --ranks true -mro true -v > ./TaxonomyMegan.txt
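
          Each line of TaxonomyMegan.txt is tab-separated into gene ID, rank letter and lineage path, roughly like the hypothetical example below (inferred from how MeganSpilt.py parses it; the exact output depends on the MEGAN version and options):

          VAB_k95_123_1	S	[D] Bacteria; [P] Proteobacteria; [C] Gammaproteobacteria; [O] Enterobacterales; [F] Enterobacteriaceae; [G] Escherichia; [S] Escherichia coli;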

          Run the Python script to tidy the taxonomy table (method 1). (This script runs too slowly; it also turned out that every row with a [K] kingdom rank is a virus, so those rows are set aside first.) (Method abandoned.)

          python MeganSpilt.py

          
          import pandas as pd
          import os
          
          print('Reading sample names')
          # read the sample names from mapping.txt
          Sample = []
          with open ('mapping.txt','r') as mf:
              for line in mf:
                  Sample.append(line.split('\n')[0])
          Sample = Sample[1:]
          print('Reading the work directory and switching to it')
          # read the workdir from project_config.yaml and change the current working directory
          with open ('project_config.yaml','r') as pj:
              for line in pj:
                  if 'workdir' in line:
                      line = line.strip()
                      workdir = line.split(': ')[1]
          os.chdir(workdir)
          
          request = pd.read_csv(('./4.annotion/NR/TaxonomyMegan.txt'),
                                sep='\t',header=None,names=['Unigene_ID','Level','Taxonomy'])
          
          request[['superkingdom','kingdom','phylum','class','order',
                     'family','genus','species']]=request.apply(lambda x:('','','','','','','',''),axis=1,result_type='expand')
          
          request.sort_values(by='Level',ignore_index = True,inplace= True)
          
          row_num = len(request)
          i = 0
          j = 0
          k = 0
          while i < row_num:
              ID2 = request.loc[i, 'Taxonomy']
              if isinstance(ID2, float):
                  k += 1
              i += 1
          i = k
          
          while i < row_num:
              ID2 = request.loc[i, 'Taxonomy']
              toxonomy = ID2.split(';')
              long = len(toxonomy)
              for tox in toxonomy:
                  level = tox.split('] ')
                  if level != ['']:
                      #print(level[0].split('[')[1])
                      a =  (level[0].split('[')[1])
                      if a == 'D':
                          request.loc[i, 'superkingdom'] = level[1]
                      if a == 'K':
                          request.loc[i, 'kingdom'] = level[1]
                      if a == 'P':
                          request.loc[i, 'phylum'] = level[1]
                      if a == 'C':
                          request.loc[i, 'class'] = level[1]
                      if a == 'O':
                          request.loc[i, 'order'] = level[1]
                      if a == 'F':
                          request.loc[i, 'family'] = level[1]
                      if a == 'G':
                          request.loc[i, 'genus'] = level[1]
                      if a == 'S':
                          request.loc[i, 'species'] = level[1]
                  j += 1
              i += 1
          request.to_csv('./4.annotion/NR/TaxonomyMeganOutput.csv',index=False)

          Run the Python script to tidy the taxonomy table (method 2):

          python MeganSpilt.py

          
          import pandas as pd
          import os
          
          print('Reading sample names')
          # read the sample names from mapping.txt
          Sample = []
          with open ('mapping.txt','r') as mf:
              for line in mf:
                  Sample.append(line.split('\n')[0])
          Sample = Sample[1:]
          print('Reading the work directory and switching to it')
          # read the workdir from project_config.yaml and change the current working directory
          with open ('project_config.yaml','r') as pj:
              for line in pj:
                  if 'workdir' in line:
                      line = line.strip()
                      workdir = line.split(': ')[1]
          os.chdir(workdir)
          # add a header and save as csv
          print('Adding a header and saving as csv')
          request = pd.read_csv(('./4.annotion/NR/TaxonomyMeganWeigh5.txt'),
                                sep='\t',header=None,names=['Unigene_ID','Level','Taxonomy'])
          request.to_csv('./4.annotion/NR/TaxonomyMegan.csv',index=False)
          # separate out the rows that contain a kingdom level
          print('Separating rows that contain a kingdom level and saving them')
          NoK = open('./4.annotion/NR/TaxonomyMeganNoK.csv','w')
          K = open('./4.annotion/NR/TaxonomyMeganK.csv','w')
          with open('./4.annotion/NR/TaxonomyMegan.csv','r') as request:
              for line in request:
                  if '[K]' in line:
                      K.write(line)
                  if 'Unigene_ID' in line:
                      NoK.write('Unigene_ID,Level,superkingdom,phylum,class,order,family,genus,species' + '\n')
                  if '[K]' not in line:
                      if 'Unigene_ID' in line:
                          pass
                      else:
                          line = line.replace('; ',',')
                          line = line.replace(';', ',')
                          NoK.write(line)

          In the ./4.annotion/NR/ directory run:

          cat TaxonomyMeganK.csv >> TaxonomyMeganNoK.csv

          rm TaxonomyMeganK.csv TaxonomyMegan.csv

          mv TaxonomyMeganNoK.csv TaxonomyMegan.csv

          KEGG functional annotation — diamond

          snakemake -s annotionKegg.smk --core 120 -p

          
          import os
          from os.path import join
          # load the config file
          configfile:'project_config.yaml'
          mapping_file = config["mapping_file"]
          Sample = []
          with open (mapping_file,'r') as mf:
              for line in mf:
                  Sample.append(line.split('\n')[0])
          Sample = Sample[1:]
          Workdir = config["workdir"]
          os.chdir(Workdir)
          
          
          rule all:
              input:
                  expand('4.annotion/kegg/kegg.blast.txt')
          
          rule kegg_align:
              input:
                  gene = "3.gene/uniq_genes_prot.faa",
                  kegg_dmnd = "/ZYdata1/ys2022/metegenome/Database/kegg.dmnd"
              output:"4.annotion/kegg/kegg.blast.txt"
              shell:'''
              diamond blastp -d {input.kegg_dmnd} -q {input.gene} -f 6 -p 120 -e 0.00001 -o {output} --outfmt 6 qseqid qlen sseqid stitle slen pident length mismatch gapopen qstart qend sstart send evalue bitscore
              '''

          Take the hit with the smallest e-value as the annotation for each gene:

          python lowestevalueKegg.py

          
          import pandas as pd
          import os
          
          print('Reading sample names')
          # read the sample names from mapping.txt
          Sample = []
          with open ('mapping.txt','r') as mf:
              for line in mf:
                  Sample.append(line.split('\n')[0])
          Sample = Sample[1:]
          sample = Sample
          print('Reading the work directory and switching to it')
          # read the workdir from project_config.yaml and change the current working directory
          with open ('project_config.yaml','r') as pj:
              for line in pj:
                  if 'workdir' in line:
                      line = line.strip()
                      workdir = line.split(': ')[1]
          os.chdir(workdir)
          
          # there are three steps in total; run each step separately
          # step 1
          # read the diamond-vs-KEGG result
          print('Reading the KEGG result, please wait.............')
          request = pd.read_csv(os.path.join('./4.annotion/kegg/kegg.blast.txt'),
                                        sep='\t',header=None,names=['qseqid','qlen','sseqid','stitle','slen','pident','length','mismatch','gapopen',
                                            'qstart','qend','sstart','send','evalue','bitscore'])
          # step 2
          # approach 1
          # keep the first hit per qseqid (diamond lists the best-scoring hit first for each query)
          print('Taking the smallest e-value per gene, please wait.............')
          request.drop_duplicates(subset='qseqid',keep='first',inplace=True)
          request.to_csv('./4.annotion/kegg/kegg.bestevalue.csv',index=False)
          print('Done; the best-e-value hits are saved in ./4.annotion/kegg/kegg.bestevalue.csv')
          '''
          # approach 2; approach 1 was used, so there is no need to run this
          file = open('kegg.blast.csv','r')
          out = open('kegg.bestevalue.csv','w')
          qseqid1 = ''
          for line in file:
              qseqid2 = line.split(",")[0]
              if qseqid2 != qseqid1:
                  out.write(line)
              else:
                  pass                              
              qseqid1 = qseqid2
          '''
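
          drop_duplicates(keep='first') relies on diamond listing the best hit first for every query. If that ordering is ever in doubt, the minimum e-value can be taken explicitly; a minimal sketch:

          import pandas as pd
          
          cols = ['qseqid','qlen','sseqid','stitle','slen','pident','length','mismatch',
                  'gapopen','qstart','qend','sstart','send','evalue','bitscore']
          hits = pd.read_csv('./4.annotion/kegg/kegg.blast.txt', sep='\t', header=None, names=cols)
          # for every query, keep the row with the smallest e-value
          best = hits.loc[hits.groupby('qseqid')['evalue'].idxmin()]
          best.to_csv('./4.annotion/kegg/kegg.bestevalue.csv', index=False)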
          
          

          Merging the KEGG and NR results

          Method 1: merge kegg.bestevalue.csv with gene.TPM and gene.count (not recommended; step abandoned)

          python kegg2quant.py

          
          import pandas as pd
          import os
          
          print('Reading sample names')
          # read the sample names from mapping.txt
          Sample = []
          with open ('mapping.txt','r') as mf:
              for line in mf:
                  Sample.append(line.split('\n')[0])
          Sample = Sample[1:]
          print('Reading the work directory and switching to it')
          # read the workdir from project_config.yaml and change the current working directory
          with open ('project_config.yaml','r') as pj:
              for line in pj:
                  if 'workdir' in line:
                      line = line.strip()
                      workdir = line.split(': ')[1]
          os.chdir(workdir)
          
          # merge kegg.bestevalue.csv with gene.TPM and gene.count
          request = pd.read_csv('./4.annotion/kegg/kegg.bestevalue.csv',sep=',',low_memory=False)
          
          # add new columns
          # build the new column names as a list
          new_TMP_col = []
          new_count_col = []
          kongzhi = []
          for name in Sample:
              new_TMP_col.append(name + '.gene.TPM')
              kongzhi.append('')
              new_count_col.append(name + '.gene.count')
              kongzhi.append('')
          
          request[new_TMP_col + new_count_col]=request.apply(lambda x:(kongzhi),axis=1,result_type='expand')
          print('New columns added to request')
          
          print('Reading the quantification files, please wait...............')
          TPM = pd.read_csv('./3.gene/gene.TPM',sep='\t',index_col='Name')
          count = pd.read_csv('./3.gene/gene.count',sep='\t',index_col='Name')
          
          row_num = len(request)
          i = 0
          print('Matching, please wait.................')
          while i < row_num:
              ID = request.loc[i,'qseqid']
              sTMP = TPM.loc[ID]
              dfTMP = sTMP.to_frame().T
              a = dfTMP.reset_index(drop=True)
              j = 0
              for new_TMP in new_TMP_col:
                  request.loc[i, new_TMP] = a.iloc[0, j]
                  j += 1
              scount = count.loc[ID]
              dfcount = scount.to_frame().T
              b = dfcount.reset_index(drop=True)
              k = 0
              for new_count in new_count_col:
                  request.loc[i, new_count] = b.iloc[0, k]
                  k += 1
              i += 1
          request.to_csv('./4.annotion/kegg/kegg.bestevalue2geneQuant.csv',index=False)
          print('kegg2quant finished; saved in ./4.annotion/kegg/kegg.bestevalue2geneQuant.csv')
          
          # step 4
          # so that Windows Excel (max ~1.04M rows) can open it, save kegg.bestevalue2geneQuant.csv in chunks of 1,000,000 rows
          request = pd.read_csv('./4.annotion/kegg/kegg.bestevalue2geneQuant.csv',sep=',',low_memory=False)
          row_num = len(request)
          step = 1000000
          
          for start in range(0, row_num, step):
              stop = start + step
              filename = './4.annotion/kegg/kegg.bestevalue2geneQuant_{}-{}.csv'.format(start,stop)
              d = request[start : stop]
              print("Saving file : " + filename + ", data size : " + str(len(d)))
              d.to_csv(filename,index=0)

            Merging NR_taxonomy and kegg2quant with NRandKeggMerge.py (step abandoned)

            Known problem: kegg2quant.py (step 5) and NRandKeggMerge.py (step 6) take the KEGG annotation as the backbone, but KEGG usually annotates fewer genes than NR, so part of the NR annotations would end up without quantification. Both steps are therefore dropped in favour of the approach below.

            With this approach a gene is kept in the merged table whether or not it was annotated by NR or KEGG.

            
            import pandas as pd
            import os
            
            
            print('Reading sample names')
            # read the sample names from mapping.txt
            Sample = []
            with open ('mapping.txt','r') as mf:
                for line in mf:
                    Sample.append(line.split('\n')[0])
            Sample = Sample[1:]
            print('Reading the work directory and switching to it')
            # read the workdir from project_config.yaml and change the current working directory
            with open ('project_config.yaml','r') as pj:
                for line in pj:
                    if 'workdir' in line:
                        line = line.strip()
                        workdir = line.split(': ')[1]
            os.chdir(workdir)
            
            def merge(file1,file2):
                data1 = pd.read_csv(file1,sep=',',low_memory=False)
                data2 = pd.read_csv(file2, sep=',', low_memory=False)
                result = pd.merge(data1,data2,left_on= ['Unigene_ID'],right_on=['qseqid'], how= 'outer')
                result.to_csv('./4.annotion/NRandKegg/NR2keggQuant.csv',index=False)
            
            NR = r'./4.annotion/NR/TaxonomyMegan.csv'
            kegg = r'./4.annotion/kegg/kegg.bestevalue.csv'
            merge(NR,kegg)
            
            # step 4
            # so that Windows Excel (max ~1.04M rows) can open it, save the merged table in chunks of 1,000,000 rows
            request = pd.read_csv('./4.annotion/NRandKegg/NR2keggQuant.csv',sep=',',low_memory=False)
            row_num = len(request)
            step = 1000000
            
            for start in range(0, row_num, step):
                stop = start + step
                filename = './4.annotion/NRandKegg/NR2keggQuant_{}-{}.csv'.format(start,stop)
                d = request[start : stop]
                print("Saving file : " + filename + ", data size : " + str(len(d)))
                d.to_csv(filename,index=0)

            Merging the NR, KEGG, TPM and count results (this is the method actually used)

            Approach:

            1. Take the union of TPM and count (taking count turned out to be unnecessary and was dropped)

            2. Take the union of the NR and KEGG results

            3. Take the union of the TPM table and union 2

            With this approach a gene is kept whether or not it was annotated by NR or KEGG.

            Remember to rename the key column of every input table to Unigene_ID (a renaming sketch follows the script below).

            python allMerge.py

            
            import pandas as pd
            import os
            
            print('Reading sample names')
            # read the sample names from mapping.txt
            Sample = []
            with open ('mapping.txt','r') as mf:
                for line in mf:
                    Sample.append(line.split('\n')[0])
            Sample = Sample[1:]
            print('Reading the work directory and switching to it')
            # read the workdir from project_config.yaml and change the current working directory
            with open ('project_config.yaml','r') as pj:
                for line in pj:
                    if 'workdir' in line:
                        line = line.strip()
                        workdir = line.split(': ')[1]
            os.chdir(workdir)
            
            # remember to rename the key column of every input table to Unigene_ID first
            print('merge NR kegg')
            def merge(file1,file2):
                largest_column_count =0
                with open(file1, 'r') as temp_f:
                    lines = temp_f.readlines()
                    for l in lines:
                        column_count = len(l.split(',')) + 1
                    # track the largest number of columns seen (the NR rows are ragged)
                        largest_column_count = column_count if largest_column_count < column_count else largest_column_count
                temp_f.close()
                # expand column_names to the largest column count
                column_names = [i for i in range(0, largest_column_count)]
                data1 = pd.read_csv(file1, header=None, delimiter=',', names=column_names)
            #    data1 = pd.read_csv(file1,sep=',',low_memory=False)
                # reading with header=None numbers the columns; the first column holds the gene ID,
                # so rename it to Unigene_ID so the merge below has a key to join on
                data1 = data1.rename(columns={0: 'Unigene_ID'})
                data2 = pd.read_csv(file2, sep=',', low_memory=False)
                result = pd.merge(data1,data2,on= 'Unigene_ID', how= 'outer')
                result.to_csv('./4.annotion/NRandKegg/NR2kegg.csv',index=False)
            
            NR = r'./4.annotion/NR/TaxonomyMeganSort.csv'
            kegg = r'./4.annotion/kegg/kegg.bestevalue.csv'
            merge(NR,kegg)
            
            NR2kegg = pd.read_csv('./4.annotion/NRandKegg/NR2kegg.csv',sep=',',low_memory=False)
            
            # remember to rename the key column of every input table to Unigene_ID first
            print('merge NRkegg TMP')
            def merge(file1,file2):
                data1 = pd.read_csv(file1,sep=',',low_memory=False)
                data2 = pd.read_csv(file2, sep='\t', low_memory=False)
                result = pd.merge(data1,data2,on= 'Unigene_ID', how= 'outer')
                result.to_csv('./4.annotion/NRandKegg/NR2kegg2TMP.csv',index=False)
            
            NRkegg = r'./4.annotion/NRandKegg/NR2kegg.csv'
            TMP = r'./3.gene/gene.TPM'
            merge(NRkegg,TMP)
            
            # step 4
            # so that Windows Excel (max ~1.04M rows) can open it, save NR2kegg2TMP.csv in chunks of 1,000,000 rows
            request = pd.read_csv('./4.annotion/NRandKegg/NR2kegg2TMP.csv',sep=',',low_memory=False)
            row_num = len(request)
            step = 1000000
            
            for start in range(0, row_num, step):
                stop = start + step
                filename = './4.annotion/NRandKegg/NR2kegg2TMP_{}-{}.csv'.format(start,stop)
                d = request[start : stop]
                print("Saving file : " + filename + ", data size : " + str(len(d)))
                d.to_csv(filename,index=0)
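
            As noted above, every table being merged needs the same key column name. In the files produced earlier the gene ID column is called Name in gene.TPM (from salmon quantmerge) and qseqid in kegg.bestevalue.csv, so a minimal renaming sketch (an assumption about the headers; adjust to the actual files) is:

            import pandas as pd
            
            tpm = pd.read_csv('./3.gene/gene.TPM', sep='\t')
            tpm.rename(columns={'Name': 'Unigene_ID'}).to_csv('./3.gene/gene.TPM', sep='\t', index=False)
            
            kegg = pd.read_csv('./4.annotion/kegg/kegg.bestevalue.csv')
            kegg.rename(columns={'qseqid': 'Unigene_ID'}).to_csv('./4.annotion/kegg/kegg.bestevalue.csv', index=False)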
