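The longest common subsequence (LCS) is a widely used way to measure how similar two symbol sequences are.
In plagiarism detection, for example, the share of two articles that is covered by their common part can help decide whether one was copied from the other.
Let's look at the solutions: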
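Brute-force enumeration
Although this is the brute-force approach, the code is quite compact, and at first glance nothing looks wrong with it: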
def LCS(X, Y):
    if not X or not Y:
        return ""
    x, m, y, n = X[0], X[1:], Y[0], Y[1:]
    if x == y:
        return x + LCS(m, n)
    else:
        # key=len makes max() return the longer of the two candidate strings
        return max(LCS(X, n), LCS(m, Y), key=len)

# assert LCS('ABCBDAB', 'BDCABA') == 'BCBA'
If only the length of the subsequence is needed:
def LCS_length(X, Y):
    if not X or not Y:
        return 0
    x, m, y, n = X[0], X[1:], Y[0], Y[1:]
    if x == y:
        return 1 + LCS_length(m, n)
    else:
        # the candidates are plain integers here, so max() needs no key
        return max(LCS_length(X, n), LCS_length(m, Y))
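What is the time complexity?
Let X have length m and Y have length n. In the worst case the first characters never match, so every call spawns two recursive calls, which gives the recurrence: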
$$T(m,n) = T(m-1,n) + T(m,n-1)$$
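To get a feel for this, consider how fast the diagonal terms T(n, n) grow: they increase exponentially with n.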
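The growth of the diagonal can be tabulated directly from the recurrence. The sketch below is only an illustration: the base case T(0, n) = T(m, 0) = 1 (an empty input costs one unit of work) is an assumption, since the text does not state one, and the function name `T` is chosen here to mirror the formula. Under that assumption T(m, n) equals the binomial coefficient C(m+n, m), which the loop checks with `math.comb` (Python 3.8+).

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def T(m, n):
    # Assumed base case: an empty sequence costs one unit of work.
    if m == 0 or n == 0:
        return 1
    # Worst case of the brute-force recursion: two sub-calls per call.
    return T(m - 1, n) + T(m, n - 1)


for n in (1, 2, 5, 10, 15, 20):
    # With the assumed base case, T(n, n) is the central binomial coefficient.
    assert T(n, n) == comb(2 * n, n)
    print(n, T(n, n))
# n = 10 gives 184756 and n = 20 gives 137846528820
```

The cache is used only to evaluate the recurrence quickly; the brute-force functions themselves have no such cache, which is exactly why their running time follows these numbers.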
Dynamic programming
Allocate an array of size O(mn), where Table[i][j] holds the LCS length of the prefixes X[:i] and Y[:j]:
def LCS_dp(X, Y):
    Table = [[0 for j in range(len(Y) + 1)] for i in range(len(X) + 1)]
    # row 0 and column 0 are initialized to 0 already
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            if x == y:
                Table[i + 1][j + 1] = Table[i][j] + 1
            else:
                Table[i + 1][j + 1] = max(Table[i + 1][j], Table[i][j + 1])
    # read the subsequence out of the matrix, walking back from the bottom-right corner
    result = ""
    x, y = len(X), len(Y)
    while x != 0 and y != 0:
        if Table[x][y] == Table[x - 1][y]:
            x -= 1
        elif Table[x][y] == Table[x][y - 1]:
            y -= 1
        else:
            # the value came from the diagonal, so this character is part of the LCS
            assert X[x - 1] == Y[y - 1]
            result = X[x - 1] + result
            x -= 1
            y -= 1
    return result
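When only the length is needed, as in the plagiarism example from the introduction, the full table is not required, because each row depends only on the row before it. The following is a minimal sketch under that observation, not part of the original solution: the names `lcs_length_rows` and `similarity` are hypothetical, and the score 2 * LCS / (len(a) + len(b)) is just one possible way to express the "share of the common part".

```python
def lcs_length_rows(X, Y):
    # Put the shorter sequence along the row so memory is O(min(m, n)).
    if len(Y) > len(X):
        X, Y = Y, X
    prev = [0] * (len(Y) + 1)  # plays the role of Table[i]
    for x in X:
        curr = [0]  # plays the role of Table[i + 1]; column 0 is always 0
        for j, y in enumerate(Y):
            if x == y:
                curr.append(prev[j] + 1)
            else:
                curr.append(max(curr[j], prev[j + 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    # Hypothetical score in [0, 1]: 1.0 for identical inputs, 0.0 for disjoint ones.
    if not a and not b:
        return 1.0
    return 2 * lcs_length_rows(a, b) / (len(a) + len(b))


print(lcs_length_rows('ABCBDAB', 'BDCABA'))  # 4
print(similarity('abc', 'abc'))              # 1.0
```

The running time is still O(mn); only the memory drops. The trade-off is that the subsequence itself can no longer be read back, since the traceback needs the whole table.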