1. The softmax function
The softmax function converts a vector of output values into a probability distribution:
$$f(x)=\frac{e^{x}}{\sum e^{x}}$$
Hence $f(x)\in[0,1]$ and $\sum f(x)=1$.
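As a quick illustration (not from the original text), here is a minimal NumPy sketch of softmax; subtracting the maximum before exponentiating is a standard numerical-stability trick added here:

import numpy as np

def softmax(z):
    # shift by the max before exponentiating: a standard stability trick (our addition)
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # each entry lies in [0, 1] and the entries sum to 1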
2. The softmax multiclass model
For a multiclass task there are $m$ samples, each with $n$ features and one label $y$, where $y$ ranges over $(1,2,3,\dots,C)$, i.e. $C$ classes in total. Let $\mathbf{w}_{c}\in R^{1\times n}$ be the weight vector of the $c$-th class. The probability that the $i$-th sample $\mathbf{x}^{i}\in R^{n\times 1}$ has label $y^{i}=c$ is:
$$p(y^{i}=c\mid\mathbf{x}^{i})=\frac{e^{\mathbf{w}_{c}\mathbf{x}^{i}}}{\sum_{c'=1}^{C}e^{\mathbf{w}_{c'}\mathbf{x}^{i}}}$$
Writing the true label of sample $\mathbf{x}^{i}$ as a one-hot vector $\mathbf{y}^{i}$, the model's prediction is the probability vector $\widehat{\mathbf{y}}^{i}$:
$$\widehat{\mathbf{y}}^{i}=\frac{e^{\mathbf{W}\mathbf{x}^{i}}}{E^{T}e^{\mathbf{W}\mathbf{x}^{i}}}$$
where $\mathbf{W}\in R^{c\times n}$ stacks the $\mathbf{w}_{c}$ as rows, $\mathbf{x}^{i}\in R^{n\times 1}$, $E\in R^{c\times 1}$ is the all-ones vector (so $E^{T}e^{\mathbf{W}\mathbf{x}^{i}}$ is the scalar sum of the exponentiated scores), and $\widehat{\mathbf{y}}^{i}\in R^{c\times 1}$.
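To make the notation concrete, here is a small NumPy sketch (our illustration; W and x hold made-up values) checking that the per-class form with rows $\mathbf{w}_{c}$ agrees with the vectorized form that uses the all-ones vector $E$:

import numpy as np

C, n = 3, 4                      # classes, features
rng = np.random.default_rng(0)
W = rng.normal(size=(C, n))      # row c is w_c in R^{1 x n}
x = rng.normal(size=(n, 1))      # one sample, x^i in R^{n x 1}

# per-class form: p(y=c|x) = exp(w_c x) / sum_c' exp(w_c' x)
scores = np.array([np.exp(W[c] @ x).item() for c in range(C)])
p_per_class = scores / scores.sum()

# vectorized form: y_hat = exp(Wx) / (E^T exp(Wx)), E the all-ones vector
E = np.ones((C, 1))
y_hat = np.exp(W @ x) / (E.T @ np.exp(W @ x))

print(np.allclose(p_per_class, y_hat.ravel()))  # True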
3. The loss function
The softmax model uses the cross-entropy as its loss function:
$$J(\mathbf{W})=-\frac{1}{m}\sum_{i=1}^{m}(\mathbf{y}^{i})^{T}\log\widehat{\mathbf{y}}^{i}=-\frac{1}{m}\sum_{i=1}^{m}(\mathbf{y}^{i})^{T}\log\frac{e^{\mathbf{W}\mathbf{x}^{i}}}{E^{T}e^{\mathbf{W}\mathbf{x}^{i}}}$$
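A minimal NumPy sketch of this loss (our illustration; the data is random, and the max-shift inside the softmax is a stability trick we add):

import numpy as np

def cross_entropy(W, X, Y):
    # J(W) = -(1/m) sum_i (y^i)^T log y_hat^i
    # X: (m, n) samples as rows; Y: (m, C) one-hot labels
    S = X @ W.T
    S = S - S.max(axis=1, keepdims=True)   # stability shift (our addition)
    Y_hat = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
    return -(Y * np.log(Y_hat)).sum(axis=1).mean()

rng = np.random.default_rng(0)
m, n, C = 5, 4, 3
X = rng.normal(size=(m, n))
Y = np.eye(C)[rng.integers(0, C, size=m)]  # random one-hot labels
W = rng.normal(size=(C, n))
print(cross_entropy(W, X, Y))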
4. Parameter learning
The parameters are found by gradient-descent iteration. Let:
$$L=-(\mathbf{y}^{i})^{T}\log\frac{e^{\mathbf{W}\mathbf{x}^{i}}}{E^{T}e^{\mathbf{W}\mathbf{x}^{i}}}=-(\mathbf{y}^{i})^{T}\left[\mathbf{W}\mathbf{x}^{i}-E\log E^{T}e^{\mathbf{W}\mathbf{x}^{i}}\right]$$
where $\mathbf{W}\in R^{c\times n}$, $\mathbf{x}^{i}\in R^{n\times 1}$, $E\in R^{c\times 1}$, $\mathbf{y}^{i}\in R^{c\times 1}$, and $L\in R$.
Since $\mathbf{y}^{i}$ is one-hot, $(\mathbf{y}^{i})^{T}E=1$, and therefore:
$$L=-(\mathbf{y}^{i})^{T}\mathbf{W}\mathbf{x}^{i}+\log E^{T}e^{\mathbf{W}\mathbf{x}^{i}}$$
We want $\frac{\partial L}{\partial\mathbf{W}}$, a derivative of a scalar with respect to a matrix. Using matrix differentials (the product rule and the differential of an elementwise function):
$$dL=(-\mathbf{y}^{i})^{T}d\mathbf{W}\,\mathbf{x}^{i}+\frac{E^{T}\left(e^{\mathbf{W}\mathbf{x}^{i}}\odot d\mathbf{W}\,\mathbf{x}^{i}\right)}{E^{T}e^{\mathbf{W}\mathbf{x}^{i}}}$$
Moreover, for column vectors, $E^{T}(\mathbf{U}\odot\mathbf{V})=\sum_{k}U_{k}V_{k}=\mathbf{U}^{T}\mathbf{V}$, so:
$$E^{T}\left(e^{\mathbf{W}\mathbf{x}^{i}}\odot d\mathbf{W}\,\mathbf{x}^{i}\right)=(e^{\mathbf{W}\mathbf{x}^{i}})^{T}d\mathbf{W}\,\mathbf{x}^{i}$$
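A quick numerical check of this identity (our illustration, with arbitrary vectors):

import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 1))
V = rng.normal(size=(4, 1))
E = np.ones((4, 1))

# E^T (U ⊙ V) equals U^T V when E is the all-ones vector
print(np.allclose(E.T @ (U * V), U.T @ V))  # True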
Putting this together (the last two steps write the scalar as its own trace and then use the cyclic property of the trace):
$$\begin{aligned} dL&=(-\mathbf{y}^{i})^{T}d\mathbf{W}\,\mathbf{x}^{i}+\frac{E^{T}(e^{\mathbf{W}\mathbf{x}^{i}}\odot d\mathbf{W}\,\mathbf{x}^{i})}{E^{T}e^{\mathbf{W}\mathbf{x}^{i}}} \\ &=(-\mathbf{y}^{i})^{T}d\mathbf{W}\,\mathbf{x}^{i}+\frac{(e^{\mathbf{W}\mathbf{x}^{i}})^{T}d\mathbf{W}\,\mathbf{x}^{i}}{E^{T}e^{\mathbf{W}\mathbf{x}^{i}}} \\ &=\left[(-\mathbf{y}^{i})^{T}+\frac{(e^{\mathbf{W}\mathbf{x}^{i}})^{T}}{E^{T}e^{\mathbf{W}\mathbf{x}^{i}}}\right]d\mathbf{W}\,\mathbf{x}^{i} \\ &=\left[(\widehat{\mathbf{y}}^{i})^{T}-(\mathbf{y}^{i})^{T}\right]d\mathbf{W}\,\mathbf{x}^{i} \\ &=\mathrm{tr}\left[(\widehat{\mathbf{y}}^{i}-\mathbf{y}^{i})^{T}d\mathbf{W}\,\mathbf{x}^{i}\right] \\ &=\mathrm{tr}\left[\mathbf{x}^{i}(\widehat{\mathbf{y}}^{i}-\mathbf{y}^{i})^{T}d\mathbf{W}\right] \end{aligned}$$
Finally, reading the gradient off the differential via $dL=\mathrm{tr}\left[\left(\frac{\partial L}{\partial\mathbf{W}}\right)^{T}d\mathbf{W}\right]$:

$$\frac{\partial L}{\partial\mathbf{W}}=\left[\mathbf{x}^{i}(\widehat{\mathbf{y}}^{i}-\mathbf{y}^{i})^{T}\right]^{T}=(\widehat{\mathbf{y}}^{i}-\mathbf{y}^{i})(\mathbf{x}^{i})^{T}$$

$$\frac{\partial J}{\partial\mathbf{W}}=\frac{1}{m}\sum_{i=1}^{m}(\widehat{\mathbf{y}}^{i}-\mathbf{y}^{i})(\mathbf{x}^{i})^{T}$$
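The closed-form gradient can be verified against central finite differences (our illustration with made-up data):

import numpy as np

rng = np.random.default_rng(1)
C, n = 3, 4
W = rng.normal(size=(C, n))
x = rng.normal(size=(n, 1))
y = np.zeros((C, 1)); y[1] = 1.0   # one-hot true label (class 1 chosen arbitrarily)

def loss(W):
    # L = -(y^i)^T log(exp(Wx) / E^T exp(Wx))
    s = np.exp(W @ x)
    return -(y.T @ np.log(s / s.sum())).item()

s = np.exp(W @ x)
analytic = (s / s.sum() - y) @ x.T   # (y_hat - y)(x^i)^T

numeric = np.zeros_like(W)
eps = 1e-6
for a in range(C):
    for b in range(n):
        Wp = W.copy(); Wp[a, b] += eps
        Wm = W.copy(); Wm[a, b] -= eps
        numeric[a, b] = (loss(Wp) - loss(Wm)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-6))  # True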
The iteration at step $k+1$ is:
$$\mathbf{W}^{k+1}=\mathbf{W}^{k}-\frac{\lambda}{m}\sum_{i=1}^{m}(\widehat{\mathbf{y}}^{i}-\mathbf{y}^{i})(\mathbf{x}^{i})^{T}$$
where:

$$\widehat{\mathbf{y}}^{i}=\frac{e^{\mathbf{W}\mathbf{x}^{i}}}{E^{T}e^{\mathbf{W}\mathbf{x}^{i}}}$$
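The update rule maps directly onto a short training loop. Below is a minimal sketch (our illustration; the synthetic data, learning rate lam, and iteration count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
m, n, C = 150, 4, 3
# synthetic data: each class clustered around its own random mean
means = 2.0 * rng.normal(size=(C, n))
labels = rng.integers(0, C, size=m)
X = means[labels] + rng.normal(size=(m, n))   # samples as rows
Y = np.eye(C)[labels]                         # (m, C) one-hot labels

W = np.zeros((C, n))
lam = 0.5                                     # learning rate (arbitrary choice)
for k in range(200):
    S = X @ W.T                               # row i holds (W x^i)^T
    S -= S.max(axis=1, keepdims=True)         # stability shift (our addition)
    Y_hat = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
    grad = (Y_hat - Y).T @ X / m              # (1/m) sum_i (y_hat^i - y^i)(x^i)^T
    W -= lam * grad                           # W^{k+1} = W^k - lam * dJ/dW

print((Y_hat.argmax(axis=1) == labels).mean())  # training accuracy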
5. Multiclass classification with sklearn
(1) The dataset
The iris dataset is a classic dataset. It contains 150 records in 3 classes, 50 per class, and each record has 4 features: sepal length, sepal width, petal length, and petal width. From these 4 features we can predict which of the 3 species an iris belongs to.
(2) Preparing the data
Import the dataset and split it into training and test sets:
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x, y = iris.data, iris.target
# default split: 75% train / 25% test
x_train, x_test, y_train, y_test = train_test_split(x, y)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
Output:
(112, 4) (38, 4) (112,) (38,)
(3) Standardizing the data

from sklearn.preprocessing import StandardScaler

std = StandardScaler()
x_train = std.fit_transform(x_train)
# reuse the statistics fitted on the training set; refitting on the
# test set would leak test-set information into the preprocessing
x_test = std.transform(x_test)
(4) Training, prediction, and evaluation

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# recent sklearn uses the multinomial (softmax) formulation for multiclass by default
model = LogisticRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))  # accuracy_score takes (y_true, y_pred)
Output:
0.9736842105263158