微生物组基因分析流程shotgene安装与nextflow初探

3. 基于GeneShot的分析

参考:

  1. Minot, S.S., Barry, K.C., Kasman, C. et al. geneshot: gene-level metagenomics identifies genome islands associated with immunotherapy response. Genome Biol 22, 135 (2021). https://doi.org/10.1186/s13059-021-02355-6
  2. Github代码:https://github.com/Golob-Minot/geneshot
3.1 geneshot文档研究

参考:

  1. 生信自动化流程搭建 01 | Nextflow的介绍与安装: https://cloud.tencent.com/developer/article/1771426
  2. 生信自动化流程搭建 02 | 脚本: https://cloud.tencent.com/developer/article/1771429
  3. 生信自动化流程搭建 03 | 输入 input: https://cloud.tencent.com/developer/article/1771432
  4. 生信自动化流程搭建 04 | 输出 output: https://cloud.tencent.com/developer/article/1771434
  5. 生信自动化流程搭建 05 | 通道 Channels: https://cloud.tencent.com/developer/article/1771437
  6. 生信自动化流程搭建 06 | 指令: https://cloud.tencent.com/developer/article/1771439
  7. 生信自动化流程搭建 07 | 配置文件: https://cloud.tencent.com/developer/article/1771441
  8. Nextflow官方文档:https://www.nextflow.io/docs/latest/index.html
3.1.1 Get started

基于conda的geneshot安装:https://rbutleriii.github.io/microbial/installing-geneshot.html

conda activate geneshot
conda install -c bioconda cutadapt bwa megahit prodigal metaphlan2 diamond   biopython mmseqs2==7.4e23d zarr pytables pyarrow

##单独安装samtools
conda install -c conda-forge -c bioconda samtools bzip2

##保留已经安装的包,继续安装上述包。
conda install --freeze-installed -c bioconda cutadapt bwa megahit prodigal metaphlan2 diamond   biopython mmseqs2==7.4e23d zarr pytables pyarrow

##分别安装trim-galore,bwa,megahit,prodigal,metaphlan2,
conda install -c bioconda trim-galore
conda install -c bioconda bwa
conda install -c bioconda megahit
conda install -c bioconda prodigal
conda install -c bioconda metaphlan2
conda install -c bioconda diamond ##这一步要安装boost-cpp包,非常慢。
conda install -c bioconda samtools==1.14 ##安装samtools.

上述文档提供了conda安装策略:

#!/usr/bin/env bash
exec 1>> command.log 2>&1
set -ex

# build conda environment (r-breakaway built for 3.6.3)
conda create -n geneshot -y -vv python=3 nextflow r-tidyverse r-devtools \
  r-vroom bioconductor-phyloseq cutadapt bwa megahit prodigal metaphlan2 diamond \
  biopython mmseqs2==7.4e23d zarr pytables pyarrow
# --max-seqs option dropped in mmseqs2 >7
# pyarrow downgrades diamond quite a bit

# activate
source $(conda info --base)/etc/profile.d/conda.sh
conda activate geneshot || { echo "geneshot Conda environment not activated"; exit; }

# install non-conda components (required older python-dateutil for famli)
pip install barcodecop
pip install git+https://github.com/FredHutch/FAMLI.git@v1.5
pip install git+https://github.com/FredHutch/find-cags.git@v0.13.0

git clone https://github.com/Golob-Minot/geneshot.git -b v0.8.6
# need some fastatools to run, put them in the conda env bin folder
git clone https://github.com/Golob-Minot/fastatools.git
chmod a+x fastatools/*.py
for k in fastatools/*.py
  do
    j=$(basename $k)
    cp $k $(conda info --base)/envs/geneshot/bin/$j
done
rm -rf fastatools

# need github R packages, for now install with R script (fix later)
#####R code
#devtools::install_github(c("adw96/breakaway", "bryandmartin/corncob"), upgrade="never")
#####
echo 'devtools::install_github(c("adw96/breakaway", "bryandmartin/corncob"), upgrade="never")' \
  > devtools.R
Rscript devtools.R
rm devtools.R

通过在终端中输入以下命令来执行脚本:

nextflow run tutorial.nf
3.1.2 基本概念
  • Processes and channels: 进程独立执行并且彼此隔离,即它们不共享公共(可写)的状态。进程通过异步First in first out(FIFO)的方式来通信,该方式在Nextflow中称为channels。
##下载蛋白质序列。
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz
gunzip GCF_000005845.2_ASM584v2_protein.faa.gz

##下载蛋白质序列
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
gunzip GCF_000005845.2_ASM584v2_genomic.fna.gz

##构建数据库
makeblastdb -in E.coli_protein.faa  -dbtype prot -out E.coli_protein.db -parse_seqids
rename GCF_000005845.2_ASM584v2_ E.coli_ GCF_000005845.2_ASM584v2_*

##修改/home1/jialh/brain/tools/nextflow_tutorial/02blastdbcmd.nf的输入,并开始运行。
nextflow run 02blastdbcmd.nf

image-20220215153437933

3.1.3 Nextflow scripting(Groovy语言的拓展)
  • 语言基础:Variables,Lists,Maps, Multiple assignment, Conditional Execution,Strings,String interpolation,Multi-line strings.
  • Implicit variables(内隐变量): Script implicit variables, Configuration implicit variables, Process implicit variables.
  • Closures(闭包): 当某个函数被当成对象返回时,夹带了外部变量,就形成了一个闭包(参考:https://www.cnblogs.com/yycc/p/10904416.html)。
  • Regular expressions(正则表达式):文本处理的瑞士军刀。
  • Files and I/O(文件输入与输出)file方法,可以由路径字符串返回一个系统文件对象。file的可用选项包括:glob, type, hidden, maxDepth, followLinks, checkIfExists
3.1.4 Process(进程)

在Nextflow中,流程是执行用户脚本的基本处理原语。 Nextflow中的进程使用关键字process定义,然后是进程名,最终是花括号中定义的process body。进程体中包含表示命令行,或者更广泛的执行脚本。基本的进程如下所示:

process sayHello {

    """
    echo 'Hello world!' > file
    """

}

进程相关关键词包括:Script, Stub, Inputs, Outputs, When, Directives.

3.1.5 Channels(通道)

理解同步、异步、阻塞和非阻塞: https://blog.youkuaiyun.com/GJQI12/article/details/105118571

Nextflow是基于Dataflow的编程模型,即process通过channels来交流。

A channel有两个主要的性质:

  • Sending a message is an asynchronous operation(异步操作:调用者不会立刻得到结果) which completes immediately, without having to wait for the receiving process.
  • Receiving data is a blocking operation(阻塞操作:调用结果返回之前,当前线程会被挂起) which stops the receiving process until the message has arrived.
  1. Channel types(通道类型):queue channels(队列通道:将一个流程输出通道连接到多个流程的情况) and value channels (值通道:绑定单个值)。
  2. Channel factory(通道工厂):通道可以由流程输出(s)声明隐式创建,或者显式地使用下面的通道工厂方法(create, empty, from, fromPath, fromFilePairs, fromSRA, of, value, watchPath)。【说白了,通道就是定义了各种输入列表、文件内容的方式
  3. Binding values(绑定值):发送消息等价于将值绑定到交流通道中的对象,相关函数包括:bind, operator <<。【类似编程中常见的赋值操作
  4. Observing events(观测事件)subscribe方法允许您在每次源通道发出新值时执行用户定义的函数。
// define a channel emitting three values
source = Channel.from ( 'alpha', 'beta', 'delta' )

// subscribe a function to the channel printing the emitted values
source.subscribe {  println "Got: $it"  }

结果为:

Got: alpha
Got: beta
Got: delta
3.1.6 Operators

Nextflow operators允许你连接channels, 或者根据用户提供的规则对通道发出的值进行转换。操作符可以分为7种类型:

  1. Filtering operators: 选择符合给定规则的条目。包括filter, unique, distinct, first, randomSample, take, last, until。
  2. Transforming operators: 将通道发出的条目转化为新的值,包括map, flatMap, reduce,groupBy,groupTuple,buffer,collate,collect,flatten,toList,toSortedList,transpose。
  3. Splitting operators:用于将通道发出的项分割为可由下游操作符或流程处理的块。包括splitCsv, splitFasta, splitFastq, splitText。
  4. Combining operators:join(表连接), merge, mix(合并为单个channel), phase, cross(类似表格全连接), collectFile(收集channel发出的内容,保存到文件), combine, concat, spread。
  5. Forking operators:branch(将一个源通道发出的条目推向一个或多个通道),choice(构建分支),multiMap,into,tap,separate。
  6. Maths operators:count,countBy,min(最小值), max(最大值), sum(求和),toInteger。
  7. Other operators:dump, set, ifEmpty, print, println, view, close。
3.1.7 Executors

在Nextflow框架体系结构中,**执行器(executor)**是决定管道进程运行的系统并监督其执行的组件。

**执行器(executor)**提供了管道进程和底层执行系统之间的抽象。这允许您独立于实际的处理平台编写管道函数逻辑。

换句话说,只需修改Nextflow配置文件中的执行器定义,就可以编写管道脚本,并让它在计算机、集群资源管理器或云上运行。

常见执行器包括:Local, SGE(Sun Grid Engine), LSF(Platform Load Sharing Facility), SLURM,PBS/Torque,PBS Pro,Moab,NQSII,HTCondor,Ignite,AWS Batch, Azure Batch,Google Life Sciences,GA4GH TES,OAR。

3.1.8 Configuration
  1. Configuration file:参数的优先级(命令行参数--something value>-params-file>config文件指定-c my_config>当前目录的nextflow.config>流程工程目录下的nextflow.config>config文件$HOME/.nextflow/config>流程脚本自身定义的值main.nf
  2. Config scopes(配置文件范围):有很多相关的命令。
  3. Config profiles: configuration file可以包含一个或多个profiles的定义。profile文件可以在流程执行时通过-profile命令行选项来激活。
  4. Environment variables:环境变量控制Nextflow运行时及其使用的Java虚拟机的配置。
3.1.9 运行geneshot需要注意的问题

安装Nextflow后,使用geneshot需要解决如下问题:

  • What computational resources (“executor”) am I going to use to process the data?
  • Where is my input data located?
  • Where do I want to store the results?
  • Where can I keep temporary data generated during the course of the workflow? (note: the -work-dir used during execution can be safely deleted once the workflow has completely finished).
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值