Computing GC Content

最新推荐文章于 2024-01-21 11:57:07 发布

原创最新推荐文章于 2024-01-21 11:57:07 发布 · 1.6k 阅读

0 ·

CC 4.0 BY-SA版权

rosalind road 专栏收录该内容

6 篇文章

订阅专栏

本文介绍了一种计算FASTA格式DNA序列中GC含量的方法，并提供了实现这一计算的Python代码。文章解决了如何处理多行序列及如何计算最高GC含量序列的问题。

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complementof any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

就是算一个fasta 文件里每条序列的GC值，输出id 和最大值

难点在于，id行可以有>做表示，但是序列不止一行结束。我不知道怎么用代码表述，直到下一个>前，都是一条序列这个意思。折腾了一会。

还有就是技巧，原来list.index()可以直接找到你要找的字符的位置。之前有看到过endwith() ，所以猜肯定有starswith了，对于fasta文件还是很好用。还有就是连接两个字符串，用+也可以，但是当成list 用join的话就出问题了，不知道为什么。

上代码：

def Compute_GC (seq):
	GC = 0
	total =0
	for i in range(0, len(seq)):
		if seq[i] in ['A','T','G','C']:
			total = total +1
		if seq[i] in ['G', 'C']:
			GC = GC+1
	return float(GC)/total*100

def getMaxIndex(my_list):
    max = my_list[0]
    for i in my_list:
        if i > max:
            max = i
    return my_list.index(max)

file=open('rosalind_gc.txt','r')
id=list()
GC=list()
long_seq=''
nowlen=1
for seq in file.readlines():
		#print(seq)
		if(seq.startswith('>')):
			id.append(seq.lstrip('>'))
		else:
			seq=seq.rstrip('\n')
			long_seq=long_seq+seq
			#print(long_seq)
		if(len(id)>nowlen):
			nowlen=len(id)
			GC.append(Compute_GC(long_seq))
			long_seq=''
GC.append(Compute_GC(long_seq))
#print(long_seq)
no=getMaxIndex(GC)
print(id[no])
print("%.6f"%GC[no])