二、submit assembly to NCBI
1、prepare data
首先要具有fasta格式数据(NO .gz),这是处理的基础,具体格式如下:
>Scaffold633
TCATTTCTCCACTCTCGATGAACAAATCTGGAGGGATTTTTTTTCATTCC
ACTCAATAGGTTGTCTATAAAGGTGTGATTCGTGGAACTTCTTCACACAG
CAGCTAGTCTATATAATACAGAAGATCG
>Scaffold553
AAAAAATTTTTTTTTTAAACTATCATCTCATGGATCAGCAGCAATTCTGA
GTGTAACGTCTTCATTAAATGCGTATATAAATTTGCATAAAGATATGCGA
CCAATATTGAGCCTGGAAATATATGCGCAGAGTGCAAAATTGTGTTTTTT
GATCGGTTAATTAAAGG
>Scaffold641
GTTTCCCAGTAGGTCTCTCCCGCTACGGCGTCCGCACGAACGCGATCTGC
CCTCGTGCCCGCACCGCCATGACGGCAGAAGCCTTCGGCGAGAACAACAC
CGGCGTCGTCGGCCTCGATCCGCTTGCACCCGAGCGCGTCGCGACCCTGG
TCAGCTACCTCGCATCCCCCGATTCCGACGAGATCAACGGACAGGTCTTC
GTCGTCTACGGCAAGATGGTGGCGTTGATGGAAGCACCCAAGGTCGAGAA
CCGTTTCGACGCAGCCGGATCCGCGTTCACCGTCGAAGAACTCGGTGGCC
AGCTCTCGTCTTACTTCTCCGGCCGTGGGCCGTACGAGACCTACTGGGAA
AC
2、处理数据
分为几步:
(1)生成.greater, short.list和ZERO_BASE_COUNT文件
perl ../ scaf_filter_2k.pl Ascaris_suum.scaf.fa
scaf_filter_2k.pl代码
#!/usr/bin/perl
use strict;
use warnings;
my $file=shift;
#my $cutoff=shift;
my $outfile="short.list";
my $outfile2="$file.greater";
my $outfile3="ZERO_BASE_COUNT";
open IN,"< $file" or die $!;
open OUT,"> $outfile" or die $!;
open OUT1,"> $outfile2" or die $!;
open OUT2, "> $outfile3" or die $!;
$/='>';<IN>;$/="\n";
while(<IN>){
chomp;
my $id=$1 if(/^(\S+)/);
$/='>';
my $seq=<IN>;
chomp($seq);
$/="\n";
$seq=~s/\s//g;
my $len=length($seq);
if ($len < 200){
print OUT "$id\t$len\n";
next;
}
else{
my $a=$seq=~tr/aA/aA/;
my $t=$seq=~tr/tT/tT/;
my $c=$seq=~tr/cC/cC/;
my $g=$seq=~tr/gG/gG/;
if($a==0 || $c==0 || $g==0 || $t==0){
print OUT2 "$id\t$len\n";
next;
}
print OUT1 ">$id\n$seq\n";
}
}
close IN;
close OUT;
close OUT1;
close OUT2;
(2)生成.Nchange文件。
perl ../ info_N_plus.pl Ascaris_suum.scaf.fa.greater > Ascaris_suum.scaf.fa.greater.Nchange
info_N_plus.pl代码:
#!/usr/bin/perl -w
use strict;
#use Getopt::Long;
sub usage{
print STDERR <<USAGE;
############################################
Version 1.1 by Wing-L 2011.07.15
usage: $0 <sequence.fa> <len> >STDOUT
############################################
USAGE
exit;
}
&usage if(@ARGV <1);
my ($fa,$len)=@ARGV;
$len||=10;
open IN,"<$fa" or die("$!\n");
$/='>';<IN>;$/="\n";
while(my $line=<IN>){
my @block;
$line=~/^\S+/;
my $tag={1};
$/='>';
my $seq=<IN>;
chomp $seq;
$/="\n";
$seq=~s/\s//g;
my $chr_length=length $seq;
while($seq=~/[^N]N{1,9}[^N]/g){
substr ($seq,$-[0]+1,$len)=~s/\S/N/g;
}
while($seq=~/N([^N]{1,49})N/g){
my $tlen=length $1;
substr ($seq,$-[0]+1,$tlen)=~s/\S/N/g;
}
if($seq=~/^N?[^N]{0,49}N+/){
print STDERR "$tag\t1\t{1}[0]\n";
substr($seq,0,{1}[0]-$-[0])='';
}
if($seq=~/N+[^N]{0,49}N{0,}$/){
print STDERR "$tag\t$-[0]\t$chr_length\n";
substr($seq,$-[0],$chr_length-$-[0])='';
}
print ">$tag\n$seq\n";
}
close IN;
#open IN,"<" or die("$!\n");
#while(my $line=<IN>){}
#foreach my $e (){}
#(split /\s+/,$line)[0]
#open OUT,">" or die("$!\n");
############### sub ###############
(3)生成分割文件
perl ../get_scaftig.pl Ascaris_suum.scaf.fa.greater.Nchange > Ascaris_suum.scaf.fa.greater.Nchange.split
get_scaftig.pl代码:
#!/usr/bin/perl -w # #Author: Ruan Jue <ruanjue@genomics.org.cn> # use warnings; use strict; my $min_length = 0; my $name = ''; my $seq = ''; while(<>){ if(/^>(\S+)/){ &print_scafftig($name, $seq) if($seq); $name = $1; $seq = ''; } else { chomp; $seq .=
{1}
; }}&print_scafftig($name, $seq) if($seq); 1; sub print_scafftig { my $name = shift; my $seq = shift; my $temp = $seq; my $id = 1; my $flag = 0; my $pos = 1; while($seq=~/([ATGCatgc]+)/g){ my $s = $1; if($flag==1){ if($temp=~/([ATGCatgc]+[Nn]+)/g){ my $g = $1; $pos+=length($g); } } else{$flag=1;} next if(length($s) < $min_length); my $length=length($s); my $end=$pos+$length-1;# print ">$name\_$id start=$pos length=".length($s)."\n"; print ">$name\_$id\t$pos\t$end\t$length\n"; while($s=~/(.{1,60})/g){ print "$1\n"; } $id++;}}(4)生成.agp文件
perl ../AGP.pl Ascaris_suum.scaf.fa.greater.Nchange.split > Ascaris_suum.scaf.fa.agp
AGP.pl 代码:
#!/usr/bin/perl use strict; my $fasta=shift; my %Scaf; open IN,$fasta or die "$!"; while(<IN>){ if (/^>(\S+)(_\d+)\s+(\d+)\s+(\d+)\s+(\d+)/){ my $scaf=$1; my $contig=$1.$2; my $start=$3; my $end=$4; my $len=$5; push @{$Scaf{$scaf}},[$contig,$start,$end,$len]; } } close IN; foreach my $id ( sort keys %Scaf ){ @{$Scaf{$id}} = sort {$a->[1]<=>$b->[1]} @{$Scaf{$id}}; } foreach my $id ( sort { $Scaf{$b}[-1][2] <=> $Scaf{$a}[-1][2] } keys %Scaf ){ my $p=$Scaf{$id}; my $idx=1; for(my $i=0;$i<@$p;$i++){ if ($i>0){ print join("\t",$id,$p->[$i-1][2]+1,$p->[$i][1]-1,$idx,'N',($p->[$i][1]-1)-($p->[$i-1][2]+1)+1,'fragment','yes',"\t")."\n"; $idx++; } print join("\t",$id,$p->[$i][1],$p->[$i][2],$idx,'W',$p->[$i][0],1,$p->[$i][3],'+')."\n"; $idx++; } }
至此为止,第一步准备数据的过程已经完成。下一步就是向3811提交任务。
这个过程需要的东西:
l Ascaris_suum.scaf.fa.greater.Nchange.gz
l Ascaris_suum.agp.gz
和NCBI沟通的文件,主要是确定的Project ID的准确性。
未完待续.........