Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complementof any DNA string has the same GC-content.
DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.
In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
就是算一个fasta 文件里每条序列的GC值,输出id 和最大值
难点在于,id行可以有>做表示,但是序列不止一行结束。我不知道怎么用代码表述,直到下一个>前,都是一条序列这个意思。折腾了一会。
还有就是技巧,原来list.index()可以直接找到你要找的字符的位置。之前有看到过endwith() ,所以猜肯定有starswith了,对于fasta文件还是很好用。还有就是连接两个字符串,用+也可以,但是当成list 用join的话就出问题了,不知道为什么。
上代码:
def Compute_GC (seq):
GC = 0
total =0
for i in range(0, len(seq)):
if seq[i] in ['A','T','G','C']:
total = total +1
if seq[i] in ['G', 'C']:
GC = GC+1
return float(GC)/total*100
def getMaxIndex(my_list):
max = my_list[0]
for i in my_list:
if i > max:
max = i
return my_list.index(max)
file=open('rosalind_gc.txt','r')
id=list()
GC=list()
long_seq=''
nowlen=1
for seq in file.readlines():
#print(seq)
if(seq.startswith('>')):
id.append(seq.lstrip('>'))
else:
seq=seq.rstrip('\n')
long_seq=long_seq+seq
#print(long_seq)
if(len(id)>nowlen):
nowlen=len(id)
GC.append(Compute_GC(long_seq))
long_seq=''
GC.append(Compute_GC(long_seq))
#print(long_seq)
no=getMaxIndex(GC)
print(id[no])
print("%.6f"%GC[no])