blat analysis

半个馒头

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/bangemantou/article/details/7726255

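BLAT is a fast alignment tool developed by W. James Kent, used mainly to map genes and ESTs onto genomes. It addresses shortcomings of BLAST such as slow speed and output that is hard to post-process. BLAT is fast and its output is concise, which makes it especially suitable for aligning short sequences against large genomes. It accepts FASTA input and needs no separate database-building step. Its output is in psl format, which carries detailed alignment information. When using it, make sure the "-t" and "-q" types are compatible and that the parameters are set correctly for each kind of alignment.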

usage:

   blat database query [-ooc=11.ooc] output.psl

where:

   database and query are each either a .fa , .nib or .2bit file,

   or a list of these files, one file name per line.

   -ooc=11.ooc tells the program to load over-occurring 11-mers from

               an external file.  This will increase the speed

               by a factor of 40 in many cases, but is not required

   output.psl is where to put the output.

   Subranges of nib and .2bit files may be specified using the syntax:

      /path/file.nib:seqid:start-end

   or

      /path/file.2bit:seqid:start-end

   or

      /path/file.nib:start-end

   With the second form, a sequence id of file:start-end will be used.
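The subrange syntax above can be taken apart mechanically. The sketch below uses a hypothetical helper, `parse_subrange` (not part of blat), that splits such a spec into path, optional sequence id, start and end. It assumes that the path itself contains no colon.

```python
from typing import NamedTuple, Optional


class Subrange(NamedTuple):
    path: str
    seqid: Optional[str]  # None for the form without a sequence id
    start: int
    end: int


def parse_subrange(spec: str) -> Subrange:
    """Split 'file:seqid:start-end' or 'file:start-end' into its parts.

    Hypothetical helper for illustration; assumes no ':' inside the path.
    """
    parts = spec.split(":")
    if len(parts) == 3:
        path, seqid, coords = parts
    elif len(parts) == 2:
        path, coords = parts
        seqid = None
    else:
        raise ValueError(f"not a subrange spec: {spec!r}")
    start_text, end_text = coords.split("-")
    return Subrange(path, seqid, int(start_text), int(end_text))


print(parse_subrange("/path/file.2bit:seqid:100-200"))
print(parse_subrange("/path/file.nib:100-200"))
```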

options:

   -t=type     Database type.  Type is one of:

   (database)    dna - DNA sequence

                 prot - protein sequence

                 dnax - DNA sequence translated in six frames to protein

               The default is dna

   -q=type     Query type.  Type is one of:

   (query)       dna - DNA sequence

                 rna - RNA sequence

                 prot - protein sequence

                 dnax - DNA sequence translated in six frames to protein

                 rnax - DNA sequence translated in three frames to protein

               The default is dna

   -prot       Synonymous with -t=prot -q=prot

   -ooc=N.ooc  Use overused tile file N.ooc.  N should correspond to

               the tileSize

   -tileSize=N sets the size of match that triggers an alignment.

               Usually between 8 and 12

               Default is 11 for DNA and 5 for protein.

   -stepSize=N spacing between tiles. Default is tileSize.

   -oneOff=N   If set to 1 this allows one mismatch in a tile and still

               triggers an alignment.  Default is 0.

   -minMatch=N sets the number of tile matches.  Usually set from 2 to 4

               Default is 2 for nucleotide, 1 for protein.

   -minScore=N sets minimum score.  This is the matches minus the

               mismatches minus some sort of gap penalty.  Default is 30

   -minIdentity=N Sets minimum sequence identity (in percent).  Default is

               90 for nucleotide searches, 25 for protein or translated

               protein searches.

   -maxGap=N   sets the size of maximum gap between tiles in a clump.  Usually

               set from 0 to 3.  Default is 2. Only relevant for minMatch > 1.

   -noHead     suppress .psl header (so it's just a tab-separated file)

   -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.

   -repMatch=N sets the number of repetitions of a tile allowed before

               it is marked as overused.  Typically this is 256 for tileSize

               12, 1024 for tile size 11, 4096 for tile size 10.

               Default is 1024.  Typically only comes into play with makeOoc.

               Also affected by stepSize. When stepSize is halved repMatch is

               doubled to compensate.

   -mask=type  Mask out repeats.  Alignments won't be started in masked region

               but may extend through it in nucleotide searches.  Masked areas

               are ignored entirely in protein or translated searches. Types are

                 lower - mask out lower cased sequence

                 upper - mask out upper cased sequence

                 out   - mask according to database.out RepeatMasker .out file

                 file.out - mask database according to RepeatMasker file.out

   -qMask=type Mask out repeats in query sequence.  Similar to -mask above but for query rather than target sequence.

   -repeats=type Type is same as mask types above.  Repeat bases will not be

               masked in any way, but matches in repeat areas will be reported

               separately from matches in other areas in the psl output.

   -minRepDivergence=NN - minimum percent divergence of repeats to allow

               them to be unmasked.  Default is 15.  Only relevant for

               masking using RepeatMasker .out files.

   -dots=N     Output dot every N sequences to show program's progress

   -trimT      Trim leading poly-T

   -noTrimA    Don't trim trailing poly-A

   -trimHardA  Remove poly-A tail from qSize as well as alignments in

               psl output

   -fastMap    Run for fast DNA/DNA remapping - not allowing introns,

               requiring high %ID

   -out=type   Controls output file format.  Type is one of:

                   psl - Default.  Tab separated format, no sequence

                   pslx - Tab separated format with sequence

                   axt - blastz-associated axt format

                   maf - multiz-associated maf format

                   sim4 - similar to sim4 format

                   wublast - similar to wublast format

                   blast - similar to NCBI blast format

                   blast8 - NCBI blast tabular format

                   blast9 - NCBI blast tabular format with comments

   -fine       For high quality mRNAs look harder for small initial and

               terminal exons.  Not recommended for ESTs

   -maxIntron=N  Sets maximum intron size. Default is 750000

   -extendThroughN - Allows extension of alignment through large blocks of N's
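The -repMatch guidance in the help text (256 for tileSize 12, 1024 for 11, 4096 for 10, doubled when stepSize is halved) follows a simple pattern. The sketch below reproduces only that pattern. `suggested_rep_match` is a hypothetical helper, and the formula is an extrapolation from the three values quoted above, not blat's actual code.

```python
from typing import Optional


def suggested_rep_match(tile_size: int = 11, step_size: Optional[int] = None) -> int:
    """Estimate a -repMatch value from the pattern quoted in the help text.

    Assumption: a factor of 4 per base of tile size, anchored at
    1024 for tileSize 11, scaled by tileSize / stepSize.
    """
    if step_size is None:
        step_size = tile_size  # help text: stepSize defaults to tileSize
    base = 1024 * 4.0 ** (11 - tile_size)
    return round(base * tile_size / step_size)


for tile in (10, 11, 12):
    print(tile, suggested_rep_match(tile))
print("tileSize 12, stepSize 6:", suggested_rep_match(12, 6))
```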

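BLAT, short for the BLAST-Like Alignment Tool, was developed by W. James Kent in 2002. At the time, as the Human Genome Project progressed, there was an urgent need to map large numbers of genes and ESTs onto large genomes quickly. BLAST had several drawbacks for this kind of alignment: it was relatively slow, its results were hard to post-process, and it could not represent a gene placement that spans introns. BLAT was created in response to this need.

BLAT's main strengths are its speed and its collinear output, which is simple and easy to read. For aligning relatively short sequences (such as cDNAs) against a large genome, BLAT is the natural first choice. It joins related, collinear alignments into larger ones, from which exons and introns are easy to identify. As a result, BLAT is widely used for gene homology analysis between closely related species and for EST analysis.

As the figure in the original post shows, BLAST reports every alignment as a separate hit, whereas BLAT joins alignments that are collinear and reports them as a single hit.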

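BLAT reads its input sequences in FASTA format (the usage text above also lists .nib and .2bit files). Running it is very simple: no database needs to be built beforehand, and the alignment can be run directly. The basic BLAT command is:

blat database query [-options] output

When the program runs normally, it prints summary statistics for the database to the screen once it has read all subject sequences in the database: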

Loaded 1493629 letters in 486 sequences   ### 486 sequences containing 1493629 letters in total

Searched 1493629 bases in 486 sequences   ### same numbers, because the file was aligned against itself

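The default output is a tabular text file in psl format.

A psl file contains detailed alignment coordinates, and the meaning of each column is listed at the top of the file. Columns 1-8 are overall alignment statistics, including the number of exactly matching bases, the number of mismatches, and the number and total length of gaps in the query and in the subject. Columns 9-17 give the alignment location: the strand, the names and lengths of the query and subject, and the alignment start and end positions. Columns 18-21 describe the gap-free blocks: the number of blocks, the length of each block, and each block's position in the query and in the subject.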
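To make the 21-column layout concrete, here is a minimal Python sketch that splits one psl data line into named fields. The field names follow the standard psl header (matches, misMatches, and so on) rewritten in snake_case. `PslRecord` and `parse_psl_line` are names invented for this illustration, and the example line is made up rather than real blat output. The parser expects data lines only, such as those written with -noHead.

```python
from typing import List, NamedTuple


class PslRecord(NamedTuple):
    # Columns 1-8: overall alignment statistics
    matches: int
    mis_matches: int
    rep_matches: int
    n_count: int
    q_num_insert: int
    q_base_insert: int
    t_num_insert: int
    t_base_insert: int
    # Columns 9-17: strand, names, sizes, start and end positions
    strand: str
    q_name: str
    q_size: int
    q_start: int
    q_end: int
    t_name: str
    t_size: int
    t_start: int
    t_end: int
    # Columns 18-21: gap-free blocks
    block_count: int
    block_sizes: List[int]
    q_starts: List[int]
    t_starts: List[int]


def _int_list(text: str) -> List[int]:
    """Turn '30,30,' (note the trailing comma) into [30, 30]."""
    return [int(item) for item in text.split(",") if item]


def parse_psl_line(line: str) -> PslRecord:
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 21:
        raise ValueError(f"expected at least 21 columns, found {len(cols)}")
    stats = [int(c) for c in cols[0:8]]
    q_size, q_start, q_end = (int(c) for c in cols[10:13])
    t_size, t_start, t_end = (int(c) for c in cols[14:17])
    return PslRecord(
        *stats, cols[8], cols[9], q_size, q_start, q_end,
        cols[13], t_size, t_start, t_end,
        int(cols[17]), _int_list(cols[18]), _int_list(cols[19]), _int_list(cols[20]),
    )


# Made-up example line: two 30-base blocks separated by a 470-base target gap.
example = "59\t1\t0\t0\t1\t2\t1\t470\t+\tquery1\t100\t0\t62\tchr1\t5000\t1000\t1530\t2\t30,30,\t0,32,\t1000,1500,"
record = parse_psl_line(example)
print(record.q_name, record.strand, record.block_sizes, record.t_starts)
```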

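Keep the following points in mind when reading psl output:

1. BLAT allows very large gaps (intron regions) in the subject, so the regions that a single hit covers on the query and on the subject can differ greatly in length. This is unlike BLAST.

2. When aligning genes to a genome, the number of blocks is not the number of exons. BLAT defines a block as an alignment without insertions or deletions, so any inserted or deleted base ends the block, and a single exon may well consist of many blocks. The number of exons and introns therefore has to be inferred from gaps that are large enough.

3. Base positions in psl output are counted from 0, not from 1 (start coordinates are zero-based and end coordinates are exclusive).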
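Points 2 and 3 can be sketched in a few lines of Python. The helper below merges adjacent blocks into putative exons whenever the gap between them on the subject is smaller than a threshold, and a second helper converts a zero-based, end-exclusive interval to one-based inclusive coordinates. Both function names and the 30-base default threshold are assumptions made for this illustration. A sensible threshold depends on the organism and the data, and the sketch assumes a nucleotide psl whose tStarts are on the forward strand of the subject.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # zero-based start, exclusive end


def blocks_to_exons(block_sizes: List[int], t_starts: List[int],
                    min_intron: int = 30) -> List[Interval]:
    """Merge blocks separated by fewer than min_intron subject bases.

    min_intron is a hypothetical cut-off, not a blat parameter.
    """
    exons: List[Interval] = []
    for size, start in zip(block_sizes, t_starts):
        end = start + size
        if exons and start - exons[-1][1] < min_intron:
            # Small gap: an indel inside one exon, so extend the current exon.
            exons[-1] = (exons[-1][0], end)
        else:
            # First block, or a gap large enough to count as an intron.
            exons.append((start, end))
    return exons


def to_one_based(interval: Interval) -> Interval:
    """Convert zero-based half-open coordinates to one-based inclusive."""
    start, end = interval
    return (start + 1, end)


# Three blocks: a 1-base indel after the first, then a 449-base gap.
exons = blocks_to_exons([30, 20, 40], [1000, 1031, 1500])
print(exons)
print([to_one_based(e) for e in exons])
```

Here three blocks collapse into two putative exons, which shows why the block count alone overstates the exon count.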

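When running different kinds of alignments, keep in mind that the "-t" and "-q" types must be compatible: both must be protein-level or both must be nucleotide-level. For example, if the database and the query are both protein sequences and both are declared as "prot", the alignment runs normally. If the database is DNA and the query is protein, then "-q=prot" must be accompanied by "-t=dnax". The examples below use the DNA and protein sequences of the same gene.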

Command 1:

blat cdna.seq pro.seq -q=prot out.psl

The program exits with an error:

d and q must both be either protein or dna

Command 2:

blat cdna.seq pro.seq -t=dnax -q=prot -noHead out.psl

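This runs correctly.

Note how the output of a protein alignment differs from that of a nucleotide alignment: the strand column shows two "+" signs, meaning that both the query and the subject are aligned in the forward direction.

Command 3, a protein-level alignment of nucleotide sequences: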

blat cdna.seq cdna.seq -t=dnax -q=dnax -noHead out.psl
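The type rule behind the three commands can be checked before blat is launched. The Python sketch below builds the argument list and rejects incompatible "-t"/"-q" pairs. `blat_command` is a hypothetical wrapper, and the grouping of types into protein-level (prot, dnax, rnax) and nucleotide-level (dna, rna) is inferred from the help text and the error message shown above, so treat it as an assumption rather than blat's exact logic.

```python
from typing import List

T_TYPES = {"dna", "prot", "dnax"}                  # -t values from the help text
Q_TYPES = {"dna", "rna", "prot", "dnax", "rnax"}   # -q values from the help text
PROTEIN_LEVEL = {"prot", "dnax", "rnax"}           # assumed grouping


def blat_command(database: str, query: str, output: str,
                 t_type: str = "dna", q_type: str = "dna",
                 no_head: bool = False) -> List[str]:
    """Return a blat argument list, or raise if the types cannot be combined."""
    if t_type not in T_TYPES:
        raise ValueError(f"unknown -t type: {t_type}")
    if q_type not in Q_TYPES:
        raise ValueError(f"unknown -q type: {q_type}")
    if (t_type in PROTEIN_LEVEL) != (q_type in PROTEIN_LEVEL):
        raise ValueError("-t and -q must both be protein-level or both nucleotide-level")
    cmd = ["blat", database, query, f"-t={t_type}", f"-q={q_type}"]
    if no_head:
        cmd.append("-noHead")
    cmd.append(output)
    return cmd


# Command 2 from the text: DNA database translated in six frames, protein query.
print(" ".join(blat_command("cdna.seq", "pro.seq", "out.psl",
                            t_type="dnax", q_type="prot", no_head=True)))

# Command 1 from the text: rejected here before blat would report its own error.
try:
    blat_command("cdna.seq", "pro.seq", "out.psl", q_type="prot")
except ValueError as err:
    print("rejected:", err)
```

The returned list can be passed to `subprocess.run` on a machine where blat is installed.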