3. 基于GeneShot的分析
参考:
- Minot, S.S., Barry, K.C., Kasman, C. et al. geneshot: gene-level metagenomics identifies genome islands associated with immunotherapy response. Genome Biol 22, 135 (2021). https://doi.org/10.1186/s13059-021-02355-6
- Github代码:https://github.com/Golob-Minot/geneshot
3.1 geneshot文档研究
参考:
- 生信自动化流程搭建 01 | Nextflow的介绍与安装: https://cloud.tencent.com/developer/article/1771426
- 生信自动化流程搭建 02 | 脚本: https://cloud.tencent.com/developer/article/1771429
- 生信自动化流程搭建 03 | 输入 input: https://cloud.tencent.com/developer/article/1771432
- 生信自动化流程搭建 04 | 输出 output: https://cloud.tencent.com/developer/article/1771434
- 生信自动化流程搭建 05 | 通道 Channels: https://cloud.tencent.com/developer/article/1771437
- 生信自动化流程搭建 06 | 指令: https://cloud.tencent.com/developer/article/1771439
- 生信自动化流程搭建 07 | 配置文件: https://cloud.tencent.com/developer/article/1771441
- Nextflow官方文档:https://www.nextflow.io/docs/latest/index.html
3.1.1 Get started
基于conda的geneshot安装:https://rbutleriii.github.io/microbial/installing-geneshot.html
conda activate geneshot
conda install -c bioconda cutadapt bwa megahit prodigal metaphlan2 diamond biopython mmseqs2==7.4e23d zarr pytables pyarrow
##单独安装samtools
conda install -c conda-forge -c bioconda samtools bzip2
##保留已经安装的包,继续安装上述包。
conda install --freeze-installed -c bioconda cutadapt bwa megahit prodigal metaphlan2 diamond biopython mmseqs2==7.4e23d zarr pytables pyarrow
##分别安装trim-galore,bwa,megahit,prodigal,metaphlan2,
conda install -c bioconda trim-galore
conda install -c bioconda bwa
conda install -c bioconda megahit
conda install -c bioconda prodigal
conda install -c bioconda metaphlan2
conda install -c bioconda diamond ##这一步要安装boost-cpp包,非常慢。
conda install -c bioconda samtools==1.14 ##安装samtools.
上述文档提供了conda安装策略:
#!/usr/bin/env bash
exec 1>> command.log 2>&1
set -ex
# build conda environment (r-breakaway built for 3.6.3)
conda create -n geneshot -y -vv python=3 nextflow r-tidyverse r-devtools \
r-vroom bioconductor-phyloseq cutadapt bwa megahit prodigal metaphlan2 diamond \
biopython mmseqs2==7.4e23d zarr pytables pyarrow
# --max-seqs option dropped in mmseqs2 >7
# pyarrow downgrades diamond quite a bit
# activate
source $(conda info --base)/etc/profile.d/conda.sh
conda activate geneshot || { echo "geneshot Conda environment not activated"; exit; }
# install non-conda components (required older python-dateutil for famli)
pip install barcodecop
pip install git+https://github.com/FredHutch/FAMLI.git@v1.5
pip install git+https://github.com/FredHutch/find-cags.git@v0.13.0
git clone https://github.com/Golob-Minot/geneshot.git -b v0.8.6
# need some fastatools to run, put them in the conda env bin folder
git clone https://github.com/Golob-Minot/fastatools.git
chmod a+x fastatools/*.py
for k in fastatools/*.py
do
j=$(basename $k)
cp $k $(conda info --base)/envs/geneshot/bin/$j
done
rm -rf fastatools
# need github R packages, for now install with R script (fix later)
#####R code
#devtools::install_github(c("adw96/breakaway", "bryandmartin/corncob"), upgrade="never")
#####
echo 'devtools::install_github(c("adw96/breakaway", "bryandmartin/corncob"), upgrade="never")' \
> devtools.R
Rscript devtools.R
rm devtools.R
通过在终端中输入以下命令来执行脚本:
nextflow run tutorial.nf
3.1.2 基本概念
- Processes and channels: 进程独立执行并且彼此隔离,即它们不共享公共(可写)的状态。进程通过异步First in first out(FIFO)的方式来通信,该方式在Nextflow中称为channels。
##下载蛋白质序列。
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz
gunzip GCF_000005845.2_ASM584v2_protein.faa.gz
##下载蛋白质序列
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
gunzip GCF_000005845.2_ASM584v2_genomic.fna.gz
##构建数据库
makeblastdb -in E.coli_protein.faa -dbtype prot -out E.coli_protein.db -parse_seqids
rename GCF_000005845.2_ASM584v2_ E.coli_ GCF_000005845.2_ASM584v2_*
##修改/home1/jialh/brain/tools/nextflow_tutorial/02blastdbcmd.nf的输入,并开始运行。
nextflow run 02blastdbcmd.nf
3.1.3 Nextflow scripting(Groovy语言的拓展)
- 语言基础:Variables,Lists,Maps, Multiple assignment, Conditional Execution,Strings,String interpolation,Multi-line strings.
- Implicit variables(内隐变量): Script implicit variables, Configuration implicit variables, Process implicit variables.
- Closures(闭包): 当某个函数被当成对象返回时,夹带了外部变量,就形成了一个闭包(参考:https://www.cnblogs.com/yycc/p/10904416.html)。
- Regular expressions(正则表达式):文本处理的瑞士军刀。
- Files and I/O(文件输入与输出):
file
方法,可以由路径字符串返回一个系统文件对象。file
的可用选项包括:glob
,type
,hidden
,maxDepth
,followLinks
,checkIfExists
。
3.1.4 Process(进程)
在Nextflow中,流程是执行用户脚本的基本处理原语。 Nextflow中的进程使用关键字process
定义,然后是进程名,最终是花括号中定义的process body。进程体中包含表示命令行,或者更广泛的执行脚本。基本的进程如下所示:
process sayHello {
"""
echo 'Hello world!' > file
"""
}
进程相关关键词包括:Script, Stub, Inputs, Outputs, When, Directives.
3.1.5 Channels(通道)
理解同步、异步、阻塞和非阻塞: https://blog.youkuaiyun.com/GJQI12/article/details/105118571
Nextflow是基于Dataflow的编程模型,即process通过channels来交流。
A channel有两个主要的性质:
- Sending a message is an asynchronous operation(异步操作:调用者不会立刻得到结果) which completes immediately, without having to wait for the receiving process.
- Receiving data is a blocking operation(阻塞操作:调用结果返回之前,当前线程会被挂起) which stops the receiving process until the message has arrived.
- Channel types(通道类型):queue channels(队列通道:将一个流程输出通道连接到多个流程的情况) and value channels (值通道:绑定单个值)。
- Channel factory(通道工厂):通道可以由流程输出(s)声明隐式创建,或者显式地使用下面的通道工厂方法(create, empty, from, fromPath, fromFilePairs, fromSRA, of, value, watchPath)。【说白了,通道就是定义了各种输入列表、文件内容的方式】
- Binding values(绑定值):发送消息等价于将值绑定到交流通道中的对象,相关函数包括:bind, operator <<。【类似编程中常见的赋值操作】
- Observing events(观测事件):
subscribe
方法允许您在每次源通道发出新值时执行用户定义的函数。
// define a channel emitting three values
source = Channel.from ( 'alpha', 'beta', 'delta' )
// subscribe a function to the channel printing the emitted values
source.subscribe { println "Got: $it" }
结果为:
Got: alpha
Got: beta
Got: delta
3.1.6 Operators
Nextflow operators允许你连接channels, 或者根据用户提供的规则对通道发出的值进行转换。操作符可以分为7种类型:
- Filtering operators: 选择符合给定规则的条目。包括filter, unique, distinct, first, randomSample, take, last, until。
- Transforming operators: 将通道发出的条目转化为新的值,包括map, flatMap, reduce,groupBy,groupTuple,buffer,collate,collect,flatten,toList,toSortedList,transpose。
- Splitting operators:用于将通道发出的项分割为可由下游操作符或流程处理的块。包括splitCsv, splitFasta, splitFastq, splitText。
- Combining operators:join(表连接), merge, mix(合并为单个channel), phase, cross(类似表格全连接), collectFile(收集channel发出的内容,保存到文件), combine, concat, spread。
- Forking operators:branch(将一个源通道发出的条目推向一个或多个通道),choice(构建分支),multiMap,into,tap,separate。
- Maths operators:count,countBy,min(最小值), max(最大值), sum(求和),toInteger。
- Other operators:dump, set, ifEmpty, print, println, view, close。
3.1.7 Executors
在Nextflow框架体系结构中,**执行器(executor)**是决定管道进程运行的系统并监督其执行的组件。
**执行器(executor)**提供了管道进程和底层执行系统之间的抽象。这允许您独立于实际的处理平台编写管道函数逻辑。
换句话说,只需修改Nextflow配置文件中的执行器定义,就可以编写管道脚本,并让它在计算机、集群资源管理器或云上运行。
常见执行器包括:Local, SGE(Sun Grid Engine), LSF(Platform Load Sharing Facility), SLURM,PBS/Torque,PBS Pro,Moab,NQSII,HTCondor,Ignite,AWS Batch, Azure Batch,Google Life Sciences,GA4GH TES,OAR。
3.1.8 Configuration
- Configuration file:参数的优先级(命令行参数
--something value
>-params-file
>config文件指定-c my_config
>当前目录的nextflow.config
>流程工程目录下的nextflow.config
>config文件$HOME/.nextflow/config
>流程脚本自身定义的值main.nf
。 - Config scopes(配置文件范围):有很多相关的命令。
- Config profiles: configuration file可以包含一个或多个profiles的定义。profile文件可以在流程执行时通过
-profile
命令行选项来激活。 - Environment variables:环境变量控制Nextflow运行时及其使用的Java虚拟机的配置。
3.1.9 运行geneshot需要注意的问题
安装Nextflow后,使用geneshot
需要解决如下问题:
- What computational resources (“executor”) am I going to use to process the data?
- Where is my input data located?
- Where do I want to store the results?
- Where can I keep temporary data generated during the course of the workflow? (note: the
-work-dir
used during execution can be safely deleted once the workflow has completely finished).