Basic algorithms for bioinformatics sequence operations
Best tried out in Jupyter; Python version 3.9
1. Implementing an algorithm that finds the overlap between two sequences and returns its length and bases
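# min_length: minimum number of overlapping bases (default 3)
def getOverlapIndexAndSequence(a, b, min_length=3):
    """ Find the longest suffix of 'a' matching a prefix of 'b'
    that is at least 'min_length' characters long.
    Return (overlap length, overlap sequence); the overlap starts
    at index len(a) - length in 'a'. If no such overlap exists,
    return 0. """
    # Position in a from which to search
    start = 0
    while True:
        # Look for the first min_length bases of b (its prefix) in a
        start = a.find(b[:min_length], start)
        # No further occurrence means no overlap, so return 0
        if start == -1:
            return 0
        # If the suffix of a that begins at start is also a prefix of b, it is the overlap.
        # The scan runs left to right, so the first hit is the longest overlap.
        if b.startswith(a[start:]):
            return len(a)-start, a[start:]
        # Otherwise shift one base to the right and search again
        start += 1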
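To see why the loop has to shift right and search again, it helps to record which positions in a get examined. The sketch below uses a hypothetical helper, trace_overlap, which is not part of the original code: it repeats the same search and also returns the list of candidate start indexes. The input sequences are made up for illustration.

```python
def trace_overlap(a, b, min_length=3):
    """Hypothetical helper: same search as getOverlapIndexAndSequence,
    but also records every candidate start position examined in 'a'."""
    tried = []  # candidate start indexes, in the order they were checked
    start = 0
    while True:
        # Next occurrence of b's first min_length bases in a
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0, tried
        tried.append(start)
        # Accept only if the whole suffix of a from start is a prefix of b
        if b.startswith(a[start:]):
            return (len(a) - start, a[start:]), tried
        start += 1

result, tried = trace_overlap('ACGTACG', 'ACGTT')
print(result)  # (3, 'ACG')
print(tried)   # [0, 4]
```

The first hit at index 0 is rejected because 'ACGTACG' as a whole is not a prefix of 'ACGTT'; only the second hit at index 4 gives a suffix ('ACG') that b really starts with.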
2. Testing the algorithm
getOverlapIndexAndSequence('TTACGT', 'CGTGTGC')
# (3, 'CGT'): the overlap length and the overlapping bases (the overlap starts at index 6 - 3 = 3 in the first sequence)
getOverlapIndexAndSequence('TTACGT', 'GTGTGC')
# 0: 'GTG' does not occur in 'TTACGT', so there is no overlap of at least 3 bases
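Beyond these two calls, one way to gain more confidence is to compare the function against a slower but more obvious version. The following is only a sketch: brute_force_overlap is a hypothetical reference function added for illustration, the extra test pairs are made up, and it assumes min_length is at least 1 and that b has at least min_length bases. The original function is repeated so that the cell runs on its own.

```python
def getOverlapIndexAndSequence(a, b, min_length=3):
    # Same logic as the function in section 1, repeated so this cell is self-contained
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start, a[start:]
        start += 1

def brute_force_overlap(a, b, min_length=3):
    """Hypothetical reference: try every overlap length, longest first."""
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        # Compare the last k bases of a with the first k bases of b
        if a[-k:] == b[:k]:
            return k, b[:k]
    return 0

cases = [
    ('TTACGT', 'CGTGTGC'),  # overlap 'CGT'
    ('TTACGT', 'GTGTGC'),   # no overlap of 3 or more bases
    ('ACGTACG', 'ACGTT'),   # first occurrence of 'ACG' is not the overlap
    ('AAAA', 'AAAA'),       # the whole sequence overlaps
]
for a, b in cases:
    assert getOverlapIndexAndSequence(a, b) == brute_force_overlap(a, b)
```

If every assertion passes silently, the two versions agree on these inputs. That is a sanity check on a handful of cases, not a proof of correctness.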