Performing a DELTA-BLAST search
Created: June 23, 2008; Updated: January 7, 2021.
DELTA-BLAST searches a protein sequence database using a position-specific scoring matrix (PSSM) constructed from conserved domains that match the query. It first searches the NCBI Conserved Domain Database (CDD) to construct the PSSM.
Download the cdd_delta database
Obtain this database from ftp://ftp.ncbi.nlm.nih.gov/blast/db using the update_blastdb.pl tool (provided as part of the BLAST+ package). Note that the cdd_delta database must be downloaded and installed in either the standard BLAST database directory (see Configuring BLAST) or the current working directory.
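The download step can be sketched as follows. This is a minimal sketch, not the manual's own procedure: it assumes update_blastdb.pl is on the PATH, that the BLASTDB environment variable (if set) names a single database directory, and that the current directory is the fallback. The helper names blastdb_dir and fetch_cdd_delta are hypothetical.

```shell
# Pick the install directory: $BLASTDB if set (assumed to be a single
# directory), otherwise the current working directory.
blastdb_dir() {
    printf '%s\n' "${BLASTDB:-$PWD}"
}

# Download and unpack the cdd_delta database into that directory.
# Needs network access; nothing is downloaded until this function is called.
fetch_cdd_delta() {
    local dir
    dir="$(blastdb_dir)"
    mkdir -p "$dir" && (cd "$dir" && update_blastdb.pl --decompress cdd_delta)
}
```

Calling fetch_cdd_delta once before the first deltablast run should be enough; rerunning it later refreshes the local copy.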
Execute the deltablast search
deltablast -query query.fsa -db pataa
Indexed megaBLAST search
The indexed megaBLAST search requires both a BLAST database and an index of the words found in that database. The index may be produced with makembindex. The example below demonstrates how to produce the index and then perform an indexed megaBLAST search. It assumes that the nt.00 BLAST database has been placed in the current directory before makembindex is run, and that QUERY is a file containing a nucleotide query. Results will appear in OUTPUT. See tables C2 and C11 for information on command-line options.
makembindex -input nt.00 -iformat blastdb -old_style_index false
blastn -db ./nt.00 -query QUERY -use_index true -out OUTPUT
BLAST databases may contain filtering (or masking) information for the database sequences. makembindex can access this information and exclude the masked regions of the database from the index. This takes two steps. First, discover the masking “Algorithm ID” from the BLAST database using blastdbcmd; in this example, the ID is 30. Second, build an index that excludes the masked regions. Once the index has been built, it can be used as shown above. In this example, the ref_contig BLAST database had been placed in the current directory before makembindex was run.
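The two steps can be sketched as below. This is an illustration under stated assumptions rather than the manual's own listing: it assumes that makembindex accepts -db_mask with the algorithm ID and that blastdbcmd -info lists the available masking algorithms. The wrapper name index_with_mask and the DRY_RUN switch are hypothetical conveniences.

```shell
# Step 1 (run by hand): list database details, including masking algorithm IDs.
#   blastdbcmd -db ref_contig -info

# Step 2: build an index that leaves out regions masked by the given algorithm.
# With DRY_RUN=1 the command is only printed; otherwise it is executed.
index_with_mask() {
    local db="$1" algo_id="$2"
    set -- makembindex -input "$db" -iformat blastdb \
        -old_style_index false -db_mask "$algo_id"
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}
```

For the example in the text, the call would be index_with_mask ref_contig 30.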
BLAST+ remote service
The BLAST+ applications can perform a search on the NCBI servers if invoked with the “-remote” flag. All other command-line options are the same as for a stand-alone search.
An example BLAST+ remote search using the blastn application proceeds as follows. First, blastn searches the query against the nt database and produces a standard BLAST report. The query file (nt.u00001) contains the sequence for accession u00001 in FASTA format. Second, the UNIX grep utility is used to find the RID for the search; the RID can also be found simply by looking near the top of the BLAST report. Third, the RID is used with blast_formatter to print the results as a tabular report. Finally, the results are formatted as XML. Any RID shown in such an example is for illustration only and is no longer valid.
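The four steps above can be sketched as a small script. It is a sketch only: the output file names (test.out, test.tab, test.xml) are assumptions, the report is assumed to contain a line of the form "RID: <id>", and awk stands in for the grep step so that the ID can be captured in a variable. The function names are hypothetical.

```shell
# Pull the RID out of a text BLAST report.
# Assumes the report has a line whose first field is "RID:".
extract_rid() {
    awk '$1 == "RID:" { print $2; exit }' "$1"
}

# Run the remote search, then reformat the stored results by RID.
# Needs network access; nothing runs until this function is called.
remote_search_and_reformat() {
    local rid
    blastn -db nt -query nt.u00001 -out test.out -remote || return 1
    rid="$(extract_rid test.out)"
    [ -n "$rid" ] || return 1
    blast_formatter -rid "$rid" -out test.tab -outfmt 7   # tabular, with comment lines
    blast_formatter -rid "$rid" -out test.xml -outfmt 5   # XML
}
```

Because blast_formatter fetches the stored results from NCBI, the search itself is not repeated for each output format.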
Query a BLAST database with an accession, but exclude that accession from the results
Faster sequence lookups by accession
Created: June 23, 2008; Updated: January 7, 2021.
Starting with BLASTDB version 5, blastdbcmd has two additional parameters (-taxids and -taxidlist) to efficiently retrieve sequences by taxid.
Note: when human entries are retrieved from nr by taxid, -target_only ensures that only the accessions of the human entries are reported. Without it, blastdbcmd reports every accession attached to any sequence that has at least one human entry. This matters because nr is a non-redundant database, in which identical sequences from different organisms are merged into a single entry.
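A retrieval of this kind might look like the sketch below. It is an assumption-laden illustration, not the manual's own example: it assumes a version 5 nr database is installed, uses taxid 9606 for human, and prints accessions with the %a output format. The wrapper name accessions_for_taxid and the DRY_RUN switch are hypothetical.

```shell
# List accessions for one taxid, restricted to the entries that carry it.
# With DRY_RUN=1 the command is only printed; otherwise it is executed.
accessions_for_taxid() {
    local db="$1" taxid="$2"
    set -- blastdbcmd -db "$db" -taxids "$taxid" -target_only -outfmt %a
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}
```

For the human case described in the note, the call would be accessions_for_taxid nr 9606.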
Threading By Query
Tom Madden
Created: June 23, 2008; Updated: June 28, 2021.
Starting in the 2.12.0 release, BLAST supports a different threading model that can be more efficient in certain situations. This section describes how and when to use that option.
Traditionally, BLAST multi-threads by having each thread work on a subset of the database. Some parts of this process are single-threaded (such as setup and formatting), but for large databases such as nt or nr this overhead is not significant. For smaller databases, performance can be poor. We have implemented a multi-threading option that can be advantageous for batch processing. It works well if you have a large number of queries and a small database, or if you limit your search by taxid or by gilist. This option takes a batch of query sequences and assigns it to a thread, which performs the entire search and the formatting of results within that thread. All threads work independently, eliminating bottlenecks. The order of the query sequences is preserved in the results. We call this “threading by query”.
You may enable threading by query with the -mt_mode option by providing a value of 1, as shown below. The default value for the -mt_mode option is 0. You still specify the number of threads with the “-num_threads” option.
blastp -db swissprot -query BIGFASTA.fsa -out test.out -num_threads 32 -mt_mode 1
As noted above, the threading by query option works well with relatively small databases if you have a lot of queries to process. For BLASTP, your input FASTA should contain at least 10,000 residues per thread, or 320,000 residues for 32 threads. For BLASTX, the FASTA should contain at least 30,000 bases per thread. For both BLASTP and BLASTX, we find threading by query works best if the database has fewer than 2 billion residues (e.g., swissprot). For BLASTN, your FASTA file should contain at least 2.5 million bases per thread and the database should contain fewer than 14 billion residues. Threading by query also works well if you limit the search, such as by -taxids or -gilist, regardless of the size of the database.
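The per-thread thresholds above lend themselves to a quick check before launching a batch. The sketch below is one possible way to do it and covers only the query-size criterion; it does not look at database size or at taxid/gilist limits. The function names are hypothetical.

```shell
# Count sequence letters in a FASTA file (lines starting with ">" are skipped).
fasta_residues() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Minimum total query size suggested in the text for a program and thread count.
min_query_size() {
    local per_thread
    case "$1" in
        blastp) per_thread=10000 ;;
        blastx) per_thread=30000 ;;
        blastn) per_thread=2500000 ;;
        *) echo "unknown program: $1" >&2; return 1 ;;
    esac
    echo $(( per_thread * $2 ))
}

# Suggest a value for -mt_mode: 1 if the query file is large enough, else 0.
# Arguments: program, FASTA file, number of threads.
choose_mt_mode() {
    local size need
    size="$(fasta_residues "$2")"
    need="$(min_query_size "$1" "$3")" || return 1
    if [ "$size" -ge "$need" ]; then echo 1; else echo 0; fi
}
```

A call such as choose_mt_mode blastp BIGFASTA.fsa 32 prints the value that could then be passed to -mt_mode.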
Here, we show some examples where this strategy works well. Below, the orange line shows results for threading by query (mt_mode=1), where each thread works independently on a batch of sequences. The blue line shows the default method, which is appropriate for large databases or few queries.
MegaBLAST search of 1,966 sequences (899,809,898 bp) from the Sander lucioperca betanodavirus, whole genome shotgun sequencing project (https://www.ncbi.nlm.nih.gov/nuccore/CABPSU000000000.1) against the refseq viral representative genomes database. For 32 threads, the search can be completed more than four times faster (84 vs. 368 seconds) when threading by query.
BLASTP search of 20,410 proteins from the human gut metagenome WGS project (https://www.ncbi.nlm.nih.gov/Traces/wgs/AJWY01) against swissprot. For 32 threads, the search is 2.8 times faster (3.4 minutes vs. 9.5 minutes) when threading by query. The task blastp-fast was used.
BLASTP search of 9,200 proteins from the refseq_select protein database against refseq_protein, limited to human entries with the -taxids option. For 32 threads, the search is 5.7 times faster (14 minutes vs. 80 minutes) when threading by query. The task blastp-fast was used.