【NextPolish】【1】【2】

Original article · CC 4.0 BY-SA license

First published 2024-07-23 10:51:28 · Last modified 2024-07-24 14:57:12

Tags: #experience-sharing

NextPolish is used to fix base errors (SNV/Indel) in genomes generated from noisy long reads. It can be used with short-read data only, long-read data only, or a combination of both. It contains two core modules and uses a stepwise approach to correct erroneous bases in the reference genome.
To correct/assemble the raw third-generation sequencing (TGS) long reads with approximately 10-15% sequencing errors, please use NextDenovo.
To further improve the consensus accuracy of genomes assembled using HiFi long-reads, please use NextPolish2.

NextPolish1

Paper: https://academic.oup.com/bioinformatics/article/36/7/2253/5645175

Motivation

Although long-read sequencing technologies can produce genomes with long contiguity, they suffer from high error rates. Thus, we developed NextPolish, a tool that efficiently corrects sequence errors in genomes assembled with long reads. This new tool consists of two interlinked modules that are designed to score and count K-mers from high quality short reads, and to polish genome assemblies containing large numbers of base errors.

Results

When evaluated for the speed and efficiency using human and a plant (Arabidopsis thaliana) genomes, NextPolish outperformed Pilon by correcting sequence errors faster, and with a higher correction accuracy.

Availability and implementation

NextPolish is implemented in C and Python. The source code is available from https://github.com/Nextomics/NextPolish.

https://www.jianshu.com/p/8d040fda7261

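NextPolish can use second-generation short reads, third-generation long reads, or a combination of both to correct the base errors (SNV/Indel) introduced when assembling third-generation long reads.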

Step 1: Create a file that records the locations of the second-generation (short-read) sequencing files
realpath ERR2173372_1.fastq ERR2173372_2.fastq  > sgs.fofn
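Before moving on, it can help to confirm that every path written to the file-of-filenames actually exists, since a mistyped FASTQ name would otherwise only surface once the pipeline is running. The following is a minimal sketch; `check_fofn` is a hypothetical helper name introduced here for illustration and is not part of NextPolish.

```shell
# check_fofn FILE  (hypothetical helper, not a NextPolish command)
# Returns 0 if every non-empty line in FILE names an existing regular file;
# otherwise prints each missing path to stderr and returns 1.
check_fofn() (
    status=0
    # "|| [ -n "$path" ]" also handles a last line without a trailing newline
    while IFS= read -r path || [ -n "$path" ]; do
        [ -n "$path" ] || continue          # skip blank lines
        if [ ! -f "$path" ]; then
            echo "missing: $path" >&2
            status=1
        fi
    done < "$1"
    exit "$status"                           # exits the subshell only
)
```

For example, `check_fofn sgs.fofn && echo "all input files found"` prints the message only when both FASTQ paths resolve.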
Step 2: Configure the run.cfg file
# Copy the configuration file from the NextPolish directory
cp ~/opt/biosoft/NextPolish/doc/run.cfg run2.cfg
Edit the configuration file:
[General]
job_type = local
job_prefix = nextPolish
task = default
rewrite = 1212
rerun = 3
parallel_jobs = 2
multithread_jobs = 10
genome = ./nextgraph.assembly.contig.fasta
genome_size = auto
workdir = ./01_rundir
polish_options = -p {multithread_jobs}

[sgs_option]
sgs_fofn = ./sgs.fofn
sgs_options = -max_depth 100
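The parameters that need to be changed are described below; for the remaining parameters, see the official parameter documentation: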

job_type: the job type. local means running on a single node. Because NextPolish submits jobs through DRMAA, it also supports SGE, PBS, and SLURM.

task: the task type. Use 12, 1212, 121212, or 12121212 to set the number of polishing rounds; two rounds of iteration are usually enough.

parallel_jobs and multithread_jobs: the number of jobs submitted at the same time and the number of threads per job; here 2 x 10 = 20 threads in total.
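The thread arithmetic can be checked directly against the configuration file before launching a run. The sketch below is only an illustration: `cfg_get` and `total_threads` are hypothetical helper names (not part of NextPolish), and the parsing assumes simple `key = value` lines without trailing comments.

```shell
# cfg_get FILE KEY  (hypothetical helper)
# Prints the value of the first "KEY = value" line in an INI-style file.
cfg_get() {
    awk -F'=' -v key="$2" '
        { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }      # trim the key
        k == key {
            v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v)      # trim the value
            print v; exit
        }
    ' "$1"
}

# total_threads FILE  (hypothetical helper)
# Prints parallel_jobs * multithread_jobs for the given config file.
total_threads() (
    n_jobs=$(cfg_get "$1" parallel_jobs)
    n_threads=$(cfg_get "$1" multithread_jobs)
    echo $(( n_jobs * n_threads ))
)
```

With the configuration shown earlier, `total_threads run2.cfg` prints 20, which can then be compared against the number of cores available on the node.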

genome: the path to the assembled genome.

workdir: the directory where output files are written.

sgs_options: sets the parameters for polishing with second-generation (short-read) data, including -use_duplicate_reads, -unpaired, -max_depth, -bwa, -minimap2(