Logistic Regression with CVX, Tested on Iris Classification

CVX is a MATLAB toolbox for convex optimization. This article implements logistic regression with CVX as an introductory exercise. For installing CVX, see the CVX installation guide; it is also strongly recommended to get the complete source code from GitHub.

The Iris dataset is a classic machine-learning dataset for iris flower classification. It contains 150 samples across three classes, Iris Setosa, Iris Versicolour, and Iris Virginica, with 50 samples per class. Each sample has four features: sepal length, sepal width, petal length, and petal width, all positive real numbers measured in centimeters. A sample of the raw data is shown in Figure 1-1.

Figure 1-1: Sample of the raw Iris data
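For reference (this snippet is quoted from the standard UCI file, not from the original post), bezdekIris.data is a plain comma-separated text file whose rows hold the four numeric features followed by the class name:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa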

Data Preprocessing

The raw Iris data contains commas and class-name strings, so it cannot be used directly and must be preprocessed first. The script below reads the file into a feature matrix of size (number of samples) × (number of features) and maps the three iris classes to the numbers 1, 2, and 3, stored in a vector whose length equals the number of samples, which simplifies later processing.

% raw data in bezdekIris.data

% read the raw data, separated by ','
% (textread is not recommended in recent MATLAB releases; textscan or
% readtable are modern alternatives, but textread still works here)
[sepal_length, sepal_width, petal_length, petal_width, class] = textread('bezdekIris.data', '%f%f%f%f%s', 'delimiter', ',');

% put features into matrix
attributes = [sepal_length sepal_width petal_length petal_width];

% replace class characters with numbers
labels = zeros(150, 1);
labels(strcmp(class, 'Iris-setosa')) = 1;
labels(strcmp(class, 'Iris-versicolor')) = 2;
labels(strcmp(class, 'Iris-virginica')) = 3;

clear sepal_length sepal_width petal_length petal_width class

All 150 samples are then randomly split into a training set of 100 and a test set of 50. The random seed is fixed so that the split is identical on every run and the results are fully reproducible. The script is as follows.

% test set number
num_test = size(labels, 1)/3;

% random permutation of the sample indices
% (rand('seed', 0) is legacy seeding syntax; the modern equivalent is
% rng(0), but it selects a different generator and changes the split)
rand('seed', 0);
R = randperm(size(labels, 1));

% test set
attributes_test = attributes(R(1:num_test), :);
labels_test = labels(R(1:num_test), :);

% train set
R(1:num_test) = [];
attributes_train = attributes(R, :);
labels_train = labels(R, :);

clear num_test R

The three classes of the training set are visualized in Figure 1-2, with sepal length and sepal width as the horizontal and vertical axes; the script is shown below. These two features alone do not separate the samples well, since Iris Versicolor and Iris Virginica are completely mixed together.

Iris_Setosa = (labels_train == 1);
Iris_Versicolor = (labels_train == 2);
Iris_Virginica = (labels_train == 3);

scatter(attributes_train(Iris_Setosa, 1), attributes_train(Iris_Setosa, 2), 'red');
hold on
scatter(attributes_train(Iris_Versicolor, 1),attributes_train(Iris_Versicolor, 2), 'g', '+');
hold on
scatter(attributes_train(Iris_Virginica, 1),attributes_train(Iris_Virginica, 2), 'b', '*');
legend('Iris Setosa', 'Iris Versicolor', 'Iris Virginica', 'location', 'northeastoutside')
xlabel('sepal length')
ylabel('sepal width')

clear Iris_Setosa Iris_Versicolor Iris_Virginica

Figure 1-2: Training set visualization (sepal length vs. sepal width)

PCA Dimensionality Reduction

Next, PCA is applied to the training set to select the two most representative dimensions; the script is below. The training set is first standardized (the per-feature mean is subtracted and the result divided by the per-feature standard deviation), then an eigendecomposition is performed via pca and the leading dimensions are kept until the cumulative explained variance exceeds 95%; the first two components already reach 96.331%. The test set is standardized with the training-set mean and standard deviation and then multiplied by the coefficient matrix learned on the training set to obtain its features. The reduced 2-D features of the training and test sets are visualized in Figure 1-3 and are clearly more separable than those in Figure 1-2.

% pre-process
% mu: mean of each feature, sigma: std of each feature
[Z, mu, sigma] = zscore(attributes_train);

% PCA
[coeff, score, latent] = pca(Z);

% select top features according to threshold
latent = 100 * latent / sum(latent);
A = length(latent);
percent_threshold = 95;
percents = 0;
for n=1:A
    percents = percents + latent(n);
    if percents > percent_threshold
        break;
    end
end
coeff = coeff(:, 1:n);
score_train = score(:, 1:n);

% same for test set
num_test = size(labels_test, 1);

% (X - mu)/sigma
score_test = (attributes_test - repmat(mu, num_test, 1)) ./ repmat(sigma, num_test, 1);

% PCA features
score_test = score_test * coeff;

clear A coeff latent mu n num_test percent_threshold percents score sigma Z

% visualization
Iris_Setosa = (labels_train == 1);
Iris_Versicolor = (labels_train == 2);
Iris_Virginica = (labels_train == 3);

subplot(1, 2, 1)
scatter(score_train(Iris_Setosa, 1), score_train(Iris_Setosa, 2), 'red');
hold on
scatter(score_train(Iris_Versicolor, 1),score_train(Iris_Versicolor, 2), 'g', '+');
hold on
scatter(score_train(Iris_Virginica, 1),score_train(Iris_Virginica, 2), 'b', '*');
legend('Iris Setosa', 'Iris Versicolor', 'Iris Virginica', 'location', 'northeastoutside')
xlabel('train set')

Iris_Setosa = (labels_test == 1);
Iris_Versicolor = (labels_test == 2);
Iris_Virginica = (labels_test == 3);

subplot(1, 2, 2)
scatter(score_test(Iris_Setosa, 1), score_test(Iris_Setosa, 2), 'red');
hold on
scatter(score_test(Iris_Versicolor, 1),score_test(Iris_Versicolor, 2), 'g', '+');
hold on
scatter(score_test(Iris_Virginica, 1),score_test(Iris_Virginica, 2), 'b', '*');
legend('Iris Setosa', 'Iris Versicolor', 'Iris Virginica', 'location', 'northeastoutside')
xlabel('test set')

clear Iris_Setosa Iris_Versicolor Iris_Virginica

Figure 1-3: Training (left) and test (right) sets after PCA dimensionality reduction

Optimization Objective

Figure 1-3 shows that the three iris classes can be separated by linear decision boundaries, so this article classifies them with logistic regression.
Let the features be $(x_1, x_2)$. The figure also shows that Iris Setosa is much easier to separate from the other two classes; a simple threshold on the first feature, $x_1 = -1$, already distinguishes it. For simplicity, this article therefore only considers the binary problem of detecting Iris Virginica: Iris Setosa and Iris Versicolor are assigned the label $y = 0$, and Iris Virginica the label $y = 1$. In addition, the sigmoid function is $g(x) = \frac{1}{1 + e^{-x}}$, plotted in Figure 1-4; its range is $(0, 1)$ and its value at $x = 0$ is $0.5$.

Figure 1-4: The sigmoid function

Let the parameters be $w$ and $b$. For this problem we ultimately want a line $w_1 x_1 + w_2 x_2 + b = 0$ that separates Iris Virginica: the hypothesis $h(x) = g(w^T x + b)$ should be as close to 1 as possible on positive samples and as close to 0 as possible on negative samples. Writing the prediction as $\hat{h}$, the likelihood of a sample is $\hat{h}^y (1-\hat{h})^{1-y}$, which we want to be as large as possible. Taking the logarithm does not change monotonicity and gives $y \log(\hat{h}) + (1-y)\log(1-\hat{h})$; negating yields the loss function $L(\hat{h}, y) = -y\log(\hat{h}) - (1-y)\log(1-\hat{h})$. Differentiating the loss twice with respect to $z = w^T x + b$ gives $\hat{h}(1-\hat{h}) > 0$ for $\hat{h} \in (0, 1)$, which proves it is convex in $z$; and since $z$ is affine in $(w, b)$, the loss is convex in the parameters as well.
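The convexity claim can be checked with a short derivative computation (spelled out here for completeness), using $\hat{h} = g(z)$ and the identity $g'(z) = g(z)(1-g(z))$:

$\frac{dL}{dz} = g(z) - y, \qquad \frac{d^2L}{dz^2} = g(z)\big(1 - g(z)\big) > 0 \quad \text{for all } z,$

so $L$ is indeed convex as a function of $z$.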

Since the samples are independent and identically distributed, we ultimately want to minimize the total loss over all training samples. With $m$ samples, the optimization objective becomes

$\mathrm{minimize}\;\; -\frac{1}{m}\sum_{i=1}^{m} y^{(i)}\log\big(\hat{h}^{(i)}\big) + \big(1 - y^{(i)}\big)\log\big(1 - \hat{h}^{(i)}\big)$
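CVX only accepts objectives built according to its disciplined convex programming (DCP) rules, and $\log(g(z))$ with a CVX expression inside the sigmoid is rejected, so it is convenient to first rewrite the per-sample loss. Substituting $\hat{h} = g(z)$ with $z = w^T x + b$ (a short derivation added here for completeness):

$-y\log\hat{h} - (1-y)\log(1-\hat{h}) = y\log(1+e^{-z}) + (1-y)\big(z + \log(1+e^{-z})\big) = \log(1+e^{-z}) + (1-y)\,z$

Here $\log(1+e^{-z}) = \log(e^0 + e^{-z})$ is exactly CVX's log_sum_exp atom applied to $(0, -z)$, which is convex, and $(1-y)z$ is affine, so the summed objective is DCP-compliant.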

Based on the reformulated objective, we can write the following CVX program, which yields the decision boundary shown in Figure 1-5.

% train set size
m = size(labels_train, 1);

% positive samples (Iris Virginica)
y = (labels_train == 3);

cvx_begin
    variables w1 w2 b
    % affine score z = w1*x1 + w2*x2 + b for each training sample
    z = w1*score_train(:, 1) + w2*score_train(:, 2) + b;
    % per-sample loss log(1 + exp(-z)) + (1 - y)*z; the log term is
    % written with the log_sum_exp atom so that the objective obeys
    % CVX's DCP rules (log(sigmoid(z)) of a CVX expression is rejected)
    minimize(1/m * (sum(log_sum_exp([zeros(1, m); -z'])) + (1 - y)' * z))
cvx_end

% visualization
Iris_Setosa = (labels_train == 1);
Iris_Versicolor = (labels_train == 2);
Iris_Virginica = (labels_train == 3);

subplot(1, 2, 1)
scatter(score_train(Iris_Setosa, 1), score_train(Iris_Setosa, 2), 'red');
hold on
scatter(score_train(Iris_Versicolor, 1),score_train(Iris_Versicolor, 2), 'g', '+');
hold on
scatter(score_train(Iris_Virginica, 1),score_train(Iris_Virginica, 2), 'b', '*');
hold on
% decision boundary w1*x1 + w2*x2 + b = 0, solved for x2
x1 = linspace(0.8, 1.5, 100);
x2 = (w1*x1 + b)/(-w2);
plot(x1, x2)
legend('Iris Setosa', 'Iris Versicolor', 'Iris Virginica', 'location', 'northeastoutside')
xlabel('train set')

Iris_Setosa = (labels_test == 1);
Iris_Versicolor = (labels_test == 2);
Iris_Virginica = (labels_test == 3);

subplot(1, 2, 2)
scatter(score_test(Iris_Setosa, 1), score_test(Iris_Setosa, 2), 'red');
hold on
scatter(score_test(Iris_Versicolor, 1),score_test(Iris_Versicolor, 2), 'g', '+');
hold on
scatter(score_test(Iris_Virginica, 1),score_test(Iris_Virginica, 2), 'b', '*');
hold on
% decision boundary w1*x1 + w2*x2 + b = 0, solved for x2
x1 = linspace(0.8, 1.5, 100);
x2 = (w1*x1 + b)/(-w2);
plot(x1, x2)
legend('Iris Setosa', 'Iris Versicolor', 'Iris Virginica', 'location', 'northeastoutside')
xlabel('test set')

clear m y Iris_Setosa Iris_Versicolor Iris_Virginica

The sigmoid function, no longer needed inside the CVX objective after the reformulation but still useful for evaluating the trained model, is:

function y = sigmoid(x)
% element-wise logistic function with range (0, 1)
y = 1 ./ (1 + exp(-x));
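The original post stops at the plots; as a quick sanity check, the following minimal sketch (an addition, not part of the original code, and assuming w1, w2, b still hold the values returned by CVX) uses sigmoid to measure classification accuracy on the test set by thresholding the predicted probability at 0.5:

% predicted probability of Iris Virginica for each test sample
p_test = sigmoid(w1*score_test(:, 1) + w2*score_test(:, 2) + b);

% threshold at 0.5 and compare with the ground-truth labels (class 3)
pred = (p_test > 0.5);
truth = (labels_test == 3);
fprintf('test accuracy: %.2f%%\n', 100 * mean(pred == truth));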

Figure 1-5: Converged decision boundary on the training (left) and test (right) sets

References
http://archive.ics.uci.edu/ml/index.php
http://archive.ics.uci.edu/ml/datasets/Iris
