关于本博客的说明: 本次博客主要分享近似熵(Approximate Entropy, AE)的理论相关知识及其代码实现.
一、理论基础
**近似熵(ApEn)**是一种用于量化时间序列波动的规律性和不可预测性的非线性动力学参数,它用一个非负数来表示一个时间序列的复杂性,反映了时间序列中新信息发生的可能性,越复杂的时间序列对应的近似熵越大[1].
算法表述如下:
-
设存在一个以等时间间隔采样获得的NNN维的时间序列u(1),u(2),...,u(N)u(1),u(2),...,u(N)u(1),u(2),...,u(N).
-
定义算法相关参数mmm, rrr,其中,mmm为整数,表示比较向量的长度,rrr为实数,表示“相似度”的度量值.
-
重构mmm维向量X(1),X(2),...,X(N−m+1)X(1),X(2),...,X(N-m+1)X(1),X(2),...,X(N−m+1),其中X(i)=[u(i),u(i+1),...,u(i+m−1)]X(i)=[u(i),u(i+1),...,u(i+m-1)]X(i)=[u(i),u(i+1),...,u(i+m−1)].
-
对于1≤i≤N−m+11 \leq i \leq N-m+11≤i≤N−m+1,统计满足以下条件的向量个数Cim(r)=(number of X(j) such that d[X(i),X(j)]≤r)/(N−m+1)C_i^m(r)=(number\ of\ X(j)\ such\ that\ d[X(i),X(j)]\le r )/(N-m+1)Cim(r)=(number of X(j) such that d[X(i),X(j)]≤r)/(N−m+1)
其中,d[X,X∗]d[X,X^*]d[X,X∗]定义为d[X,X∗]=maxa∣u(a)−u∗(a)∣d[X,{X^*}] = \mathop {\max }\limits_a |u(a) - {u^*}(a)|d[X,X∗]=amax∣u(a)−u∗(a)∣
u(a)u(a)u(a)为向量XXX的元素,ddd表示向量X(i)X(i)X(i)与X(j)X(j)X(j)的距离,由对应元素的最大差值决定,jjj的取值范围为[1,N−m+1][1,N-m+1][1,N−m+1],包括j=ij=ij=i. -
定义Φm(r)=(N−m+1)−1∑i=1N−m+1log(Cim(r))\Phi^m(r)=(N-m+1)^{-1}\sum_{i=1}^{N-m+1}log(C_i^m(r))Φm(r)=(N−m+1)−1∑i=1N−m+1log(Cim(r)).
-
则近似熵(ApEn)定义为ApEn=Φm(r)−Φm+1(r)ApEn=\Phi^m(r)-\Phi^{m+1}(r)ApEn=Φm(r)−Φm+1(r)
logloglog表示自然对数,mmm和rrr由第2步定义.
参数选择:通常选择参数m=2m=2m=2或m=3m=3m=3;rrr的选择在很大程度上取决于实际应用场景,通常选择r=0.2∗stdr=0.2*stdr=0.2∗std,其中stdstdstd表示原时间序列的标准差.
根据相关文献[1],在实际应用中选择d[X(i),X(j)]<rd[X(i),X(j)]<rd[X(i),X(j)]<r或d[X(i),X(j)]≤rd[X(i),X(j)]\leq rd[X(i),X(j)]≤r都是可行的.
如果一个时间序列的规律性比较强,则其近似熵值(ApEn)比较小,对应地,一个比较复杂的时间序列则对应一个较大的熵值.
二、示例
(备注:本示例翻译自Wikipedia关于Approximate entropy的示例[2])
- 假设时间序列SNS_NSN是从心率数据中经过等间隔采样得到的,并且是一个以3为周期的周期时间序列,维度N=51N=51N=51
SN={85,80,89,85,80,89,...}S_N=\left \{85,80,89,85,80,89,... \right \}SN={85,80,89,85,80,89,...} - 选择参数:m=2m=2m=2,r=3r=3r=3
- 重构mmm维向量:X(1)=[u(1) u(2)]=[85 80]X(1)=[u(1)\ u(2)]=[85\ 80]X(1)=[u(1) u(2)]=[85 80] X(2)=[u(2) u(3)]=[80 89]X(2)=[u(2)\ u(3)]=[80\ 89]X(2)=[u(2) u(3)]=[80 89] X(3)=[u(3) u(4)]=[89 85]X(3)=[u(3)\ u(4)]=[89\ 85]X(3)=[u(3) u(4)]=[89 85] X(4)=[u(4) u(5)]=[85 80]X(4)=[u(4)\ u(5)]=[85\ 80]X(4)=[u(4) u(5)]=[85 80] ............
- 计算向量之间的距离:
d[X(1),X(1)]=maxa∣u(a)−u∗(a)∣=0<r=3d[X(1),X(1)]= \mathop {\max }\limits_a|u(a)-u^*(a)|=0<r=3d[X(1),X(1)]=amax∣u(a)−u∗(a)∣=0<r=3 d[X(1),X(2)]=maxa∣u(a)−u∗(a)∣=max∣u(1)−u(2)∣,∣u(2)−u(3)∣)=∣u(2)−u(3)∣=9>r=3d[X(1),X(2)]= \mathop {\max }\limits_a|u(a)-u^*(a)|= \mathop {\max } |u(1)-u(2)|,|u(2)-u(3)|)=|u(2)-u(3)|=9>r=3d[X(1),X(2)]=amax∣u(a)−u∗(a)∣=max∣u(1)−u(2)∣,∣u(2)−u(3)∣)=∣u(2)−u(3)∣=9>r=3
类似地可以计算d[X(1),X(3)]=∣u(2)−u(4)∣=5>rd[X(1),X(3)]=|u(2)-u(4)|=5>rd[X(1),X(3)]=∣u(2)−u(4)∣=5>r,d[X(1),X(4)]=∣u(1)−u(4)∣=∣u(2)−u(5)∣=0<rd[X(1),X(4)]=|u(1)-u(4)|=|u(2)-u(5)|=0<rd[X(1),X(4)]=∣u(1)−u(4)∣=∣u(2)−u(5)∣=0<r
因此,满足条件d[X(1),X(j)]≤rd[X(1),X(j)]\leq rd[X(1),X(j)]≤r的X(j)X(j)X(j)包括X(1),X(4),...,X(49)X(1),X(4),...,X(49)X(1),X(4),...,X(49),共17个向量.
此外,因为1≤i≤N−m+11 \leq i \leq N-m+11≤i≤N−m+1,所以满足条件d[X(3),X(j)]≤rd[X(3),X(j)]\leq rd[X(3),X(j)]≤r的X(j)X(j)X(j)包括X(3),X(6),...,X(48)X(3),X(6),...,X(48)X(3),X(6),...,X(48),共16个向量.
因此有:C12(3)=1750C_1^2(3)=\frac{17}{50}C12(3)=5017、C22(3)=1750C_2^2(3)=\frac{17}{50}C22(3)=5017、C32(3)=1650C_3^2(3)=\frac{16}{50}C32(3)=5016、C42(3)=1750C_4^2(3)=\frac{17}{50}C42(3)=5017、............
5. 计算Φ2(3)=(50)−1∑i=150log(Ci2(3))≈−1.0982095403531886\Phi^2(3)=(50)^{-1}\sum_{i=1}^{50}log(C_i^2(3))\approx -1.0982095403531886Φ2(3)=(50)−1i=1∑50log(Ci2(3))≈−1.0982095403531886.
令m=m+1=3m=m+1=3m=m+1=3,重复步骤3~4,获得C13(3)=1749C_1^3(3)=\frac{17}{49}C13(3)=4917、C23(3)=1649C_2^3(3)=\frac{16}{49}C23(3)=4916、C33(3)=1649C_3^3(3)=\frac{16}{49}C33(3)=4916、C43(3)=1749C_4^3(3)=\frac{17}{49}C43(3)=4917、............,因此,有Φ3(3)=(49)−1∑i=149log(Ci3(3))≈−1.0981985438120834\Phi^3(3)=(49)^{-1}\sum_{i=1}^{49}log(C_i^3(3))\approx -1.0981985438120834Φ3(3)=(49)−1i=1∑49log(Ci3(3))≈−1.0981985438120834
6. 计算近似熵(ApEn):
ApEn=Φ2(3)−Φ3(3)≈1.0996541105257052e−05ApEn=\Phi^2(3)-\Phi^3(3)\approx 1.0996541105257052e-05ApEn=Φ2(3)−Φ3(3)≈1.0996541105257052e−05
该值非常小,因此它意味着原时间序列是规则、可预测的,这与观察结果一致.
三、代码实现
Python
(备注:示例程序来自Wikipedia关于Approximate entropy的python实现[2])
import numpy as np
def ApEn(U, m, r):
def _maxdist(x_i, x_j):
return max([abs(ua - va) for ua, va in zip(x_i, x_j)])
def _phi(m):
x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
return (N - m + 1.0)**(-1) * sum(np.log(C))
N = len(U)
return abs(_phi(m+1) - _phi(m))
# Usage example
U = np.array([85, 80, 89] * 17)
print ApEn(U, 2, 3)
1.0996541105257052e-05
MATLAB
(备注:示例程序来自matlab社区File Exchange板块,由Kijoon Lee分享[3],在此表示感谢!)
function apen = ApEn( dim, r, data, tau )
%ApEn
% dim : embedded dimension
% r : tolerance (typically 0.2 * std)
% data : time-series data
% tau : delay time for downsampling
% Changes in version 1
% Ver 0 had a minor error in the final step of calculating ApEn
% because it took logarithm after summation of phi's.
% In Ver 1, I restored the definition according to original paper's
% definition, to be consistent with most of the work in the
% literature. Note that this definition won't work for Sample
% Entropy which doesn't count self-matching case, because the count
% can be zero and logarithm can fail.
%
% A new parameter tau is added in the input argument list, so the users
% can apply ApEn on downsampled data by skipping by tau.
%---------------------------------------------------------------------
% coded by Kijoon Lee, kjlee@ntu.edu.sg
% Ver 0 : Aug 4th, 2011
% Ver 1 : Mar 21st, 2012
%---------------------------------------------------------------------
if nargin < 4, tau = 1; end
if tau > 1, data = downsample(data, tau); end
N = length(data);
result = zeros(1,2);
for j = 1:2
m = dim+j-1;
phi = zeros(1,N-m+1);
dataMat = zeros(m,N-m+1);
% setting up data matrix
for i = 1:m
dataMat(i,:) = data(i:N-m+i);
end
% counting similar patterns using distance calculation
for i = 1:N-m+1
tempMat = abs(dataMat - repmat(dataMat(:,i),1,N-m+1));
boolMat = any( (tempMat > r),1);
phi(i) = sum(~boolMat)/(N-m+1);
end
% summing over the counts
result(j) = sum(log(phi))/(N-m+1);
end
apen = result(1)-result(2);
end
以示例数据作为输入,计算得近似熵aepn=−1.099654110681136e−05aepn=-1.099654110681136e-05aepn=−1.099654110681136e−05,此结果与Python计算结果一致.
近似熵的优势:[2]
a. 对数据长度的依赖性较小. ApEn可以用于小数据样本(n<50n<50n<50),并可实现实时计算.
b. 抗噪声能力较强. 如果数据含有噪声,则可以将ApEn与噪声水平进行比较,以确定原始数据中真实信息的表达程度.
主要参考资料:
[1]. Pincus, S. M. (1991). “Approximate entropy as a measure of system complexity”. Proceedings of the National Academy of Sciences. 88 (6): 2297–2301. doi:10.1073/pnas.88.6.2297. PMC 51218 Freely accessible. PMID 11607165.
[2]. https://en.wikipedia.org/wiki/Approximate_entropy#cite_note-Pincus21991-23
[3]. https://cn.mathworks.com/matlabcentral/fileexchange/32427-fast-approximate-entropy?s_tid=srchtitle.
后注:由于这是笔者第一次在Markdown编辑器内进行文档编辑,初次采用LaTeX格式进行公式编辑,所以文章中难免出现失误、疏漏之处,还请各位不吝赐教!