Rosalind Exercise 18

jkl_bio

Article tags: python

Article link: https://blog.youkuaiyun.com/weixin_44619692/article/details/130984463

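This post discusses how to find open reading frames (ORFs) on both strands of a DNA double helix and translate them into protein sequences. Given a DNA sequence of at most 1 kbp, the program finds every possible ORF and returns the distinct protein strings translated from them. The process involves reading the DNA sequence in FASTA format, identifying start and stop codons, and translating each ORF into a protein sequence.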
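The FASTA-reading step can be sketched without extra dependencies. This is a minimal sketch, assuming the input holds one record of interest; `read_fasta_sequence` is a hypothetical helper name introduced here for illustration and is not part of the original script.

```python
def read_fasta_sequence(text):
    """Return the sequence of the first FASTA record in `text` as one uppercase string."""
    sequence_parts = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seen_header:
                break  # a second record starts here; only the first is returned
            seen_header = True
            continue  # header lines carry the record name, not sequence data
        sequence_parts.append(line.upper())
    # Sequence data may be wrapped over several lines, so join the pieces.
    return "".join(sequence_parts)


# Hypothetical two-line wrapped record, used only to show the joining behavior.
sample = """>Rosalind_99
AGCCATGTAGCTAACTCAGGTTACATGG
GGATGACCCCGCGACTTGGATTAGAGTC
"""
print(read_fasta_sequence(sample))
```

The returned string could then be wrapped in `Seq(...)` in place of a hard-coded sequence.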

# Problem
# Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.
# An open reading frame (ORF) is one which starts at a start codon and ends at a stop codon, without any other stop codons in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.

# Given: A DNA string s of length at most 1 kbp in FASTA format.
# Return: Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.

# Sample Dataset
# >Rosalind_99
# AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
# Sample Output
# MLLGSFRLIPKETLIQVAGSSPCNLS
# M
# MGMTPRLGLESLLE
# MTPRLGLESLLE

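# Given a DNA string s of at most 1 kbp (FASTA format), find the protein strings translated from every open reading frame (ORF) of s. An ORF starts at the start codon (ATG), ends at a stop codon (TAA, TAG, or TGA), and contains no other stop codon in between.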

from Bio.Seq import Seq

# Sample DNA sequence (hard-coded here rather than read from a FASTA file)
dna_seq = Seq("AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG")

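# Translate a DNA sequence into a protein string, stopping at the first stop codon
def translate(dna_seq):
    protein_seq = ""
    for i in range(0, len(dna_seq) - 2, 3):
        codon = dna_seq[i:i + 3]
        # Convert to str so the result is a plain string rather than a Seq
        aa = str(codon.translate(table=1))
        if aa == "*":
            break
        protein_seq += aa
    return protein_seq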

# Initialize the result set
orf_set = set()

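# Search for ORFs on the forward strand and on its reverse complement
for seq in [dna_seq, dna_seq.reverse_complement()]:
    for i in range(len(seq) - 2):
        if seq[i:i + 3] == "ATG":
            # Step codon by codon from the start codon to the first in-frame stop codon
            for j in range(i + 3, len(seq), 3):
                # Compare as str so set membership does not depend on how Seq is hashed
                if str(seq[j:j + 3]) in {"TAA", "TAG", "TGA"}:
                    orf = seq[i:j + 3]
                    protein = translate(orf)
                    if protein:
                        orf_set.add(protein)
                    # Later stop codons would give the same protein, so stop here
                    break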

# Print every distinct protein in the result set, longest first
for orf in sorted(orf_set, key=len, reverse=True):
    print(orf)
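For cross-checking the script above without Biopython, the same search can be sketched in plain Python. This is a minimal sketch under stated assumptions: the names `CODON_TABLE`, `reverse_complement`, and `candidate_proteins` are hypothetical and introduced here, the codon table is the standard genetic code written out by hand, and the input is assumed to contain only the uppercase letters A, C, G, and T.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def candidate_proteins(dna):
    """Return the set of distinct proteins translated from ORFs on both strands."""
    proteins = set()
    for strand in (dna, reverse_complement(dna)):
        for start in range(len(strand) - 2):
            if strand[start:start + 3] != "ATG":
                continue
            residues = []
            # Translate in frame from the start codon until a stop codon appears.
            for pos in range(start, len(strand) - 2, 3):
                aa = CODON_TABLE[strand[pos:pos + 3]]
                if aa == "*":
                    proteins.add("".join(residues))
                    break
                residues.append(aa)
            # If the loop ends without a stop codon, the frame is not open-ended
            # by a stop, so nothing is recorded for this start position.
    return proteins


sample = (
    "AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGG"
    "AATAAGCCTGAATGATCCGAGTAGCATCTCAG"
)
for protein in sorted(candidate_proteins(sample), key=len, reverse=True):
    print(protein)
```

Translating while scanning avoids a second pass over each ORF, and collecting results in a set keeps only distinct proteins, as the problem requires.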