Computing GC Content

本文介绍了一种计算FASTA格式DNA序列中GC含量的方法,并提供了实现这一计算的Python代码。文章解决了如何处理多行序列及如何计算最高GC含量序列的问题。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complementof any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.


就是算一个fasta 文件里每条序列的GC值,输出id 和最大值

难点在于,id行可以有>做表示,但是序列不止一行结束。我不知道怎么用代码表述,直到下一个>前,都是一条序列这个意思。折腾了一会。

还有就是技巧,原来list.index()可以直接找到你要找的字符的位置。之前有看到过endwith() ,所以猜肯定有starswith了,对于fasta文件还是很好用。还有就是连接两个字符串,用+也可以,但是当成list 用join的话就出问题了,不知道为什么。

上代码:

def Compute_GC (seq):
	GC = 0
	total =0
	for i in range(0, len(seq)):
		if seq[i] in ['A','T','G','C']:
			total = total +1
		if seq[i] in ['G', 'C']:
			GC = GC+1
	return float(GC)/total*100

def getMaxIndex(my_list):
    max = my_list[0]
    for i in my_list:
        if i > max:
            max = i
    return my_list.index(max)

file=open('rosalind_gc.txt','r')
id=list()
GC=list()
long_seq=''
nowlen=1
for seq in file.readlines():
		#print(seq)
		if(seq.startswith('>')):
			id.append(seq.lstrip('>'))
		else:
			seq=seq.rstrip('\n')
			long_seq=long_seq+seq
			#print(long_seq)
		if(len(id)>nowlen):
			nowlen=len(id)
			GC.append(Compute_GC(long_seq))
			long_seq=''
GC.append(Compute_GC(long_seq))
#print(long_seq)
no=getMaxIndex(GC)
print(id[no])
print("%.6f"%GC[no])





评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值