Predicting Employee Turnover with Logistic Regression

License: CC 4.0 BY-SA

Logistic Regression Basics

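Logistic Regression borrows the core idea of Linear Regression: it fits a linear combination of the features to approximate the true underlying relationship. As with Linear Regression, if the model performs poorly, the cause may be badly tuned hyperparameters or inadequately processed training features (for example, you can construct additional features, such as non-linear transformations of the existing linear ones).

The output of Linear Regression ranges over (-∞, +∞), whereas what we want in the end is the probability of a class, which lies in [0, 1]. So we need a suitable mapping that projects the linear output into the range [0, 1].

This is where the concept of odds and the sigmoid function come in.

Odds are introduced to map a probability onto the range [0, +∞): odds = p / (1 - p).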

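Taking the logarithm of the odds (the logit) then maps (0, +∞) onto (-∞, +∞), which matches the range of the linear output. The sigmoid function is the inverse of the logit: it maps the linear output in (-∞, +∞) back to a probability in (0, 1), which is why we introduce it.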
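To make the chain of mappings concrete, here is a minimal sketch in plain Python. The function names `odds`, `logit`, and `sigmoid` are chosen for illustration and do not come from the original post; the sketch simply shows that `sigmoid` undoes `logit`.

```python
import math

def odds(p):
    """Map a probability in [0, 1) to odds in [0, +inf)."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds: map a probability in (0, 1) to (-inf, +inf)."""
    return math.log(odds(p))

def sigmoid(z):
    """Inverse of logit: map a real number z back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Round trip: probability -> odds -> log-odds -> probability
for p in (0.1, 0.5, 0.9):
    z = logit(p)
    print(p, odds(p), z, sigmoid(z))
```

In Logistic Regression the linear part plays the role of `z`, so `sigmoid` of the linear output is the predicted probability. Note that this simple `sigmoid` can overflow for very large negative `z`; it is only meant to illustrate the idea.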

Predicting Employee Turnover with Logistic Regression

Calling a library directly

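import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from patsy import dmatrices                    # turns categorical variables into dummy variables

# `data` is assumed to be a pandas DataFrame that already holds the employee records
y, X = dmatrices('left ~ satisfaction + last_evaluation + number_project'
                 ' + C(sales) + C(salary)', data, return_type='dataframe')
# C(sales) means: encode sales as dummy variables
model = LogisticRegression()
model.fit(X, np.ravel(y))                      # sklearn expects a 1-D target array
pd.DataFrame(list(zip(X.columns, model.coef_.ravel())))  # show the coefficients
pred = model.predict(X)                        # make predictions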
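As a rough illustration of what dummy encoding does to a categorical column such as salary, the sketch below hand-rolls a treatment-style encoding in plain Python. The helper `dummy_encode`, the `T.` column prefix, and the sample values are made up for this example; this mimics the idea behind `C(...)` and is not patsy's actual implementation.

```python
def dummy_encode(values):
    """Treatment-style dummy coding (illustrative helper, not a patsy function).

    Returns the reference level and one 0/1 column per remaining level.
    """
    levels = sorted(set(values))
    reference, rest = levels[0], levels[1:]
    columns = {
        "T." + level: [1 if v == level else 0 for v in values]
        for level in rest
    }
    return reference, columns

salary = ["low", "medium", "high", "low"]   # made-up sample values
reference, columns = dummy_encode(salary)
print("reference level:", reference)
for name, col in columns.items():
    print(name, col)
```

One level is dropped as the reference because, together with an intercept column, a full set of indicator columns would be perfectly collinear.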

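Logistic Regression has no closed-form solution; in principle it is solved by an iterative method such as gradient descent.

A plain gradient descent implementation for LR looks like this: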
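For reference, the loss and the derivative that gradient descent needs can be written out as follows. The notation is an assumption made for this sketch: n is the number of instances, x_i is the feature vector of instance i, y_i is its 0/1 label, p_i is the predicted probability, and α is the learning rate.

```latex
\begin{align}
p_i &= \sigma(x_i^\top \beta) = \frac{1}{1 + e^{-x_i^\top \beta}} \\
L(\beta) &= -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big] \\
\frac{\partial L}{\partial \beta} &= \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)\, x_i \\
\beta &\leftarrow \beta - \alpha \, \frac{\partial L}{\partial \beta}
\end{align}
```

The third line follows from the identity σ'(z) = σ(z)(1 - σ(z)), which makes the per-instance derivative collapse to (p_i - y_i) x_i.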

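# We need to know the derivative in advance; the computer will not differentiate for us
# All we do is recompute this derivative and use it to update beta, over and over
# We choose the step size ourselves, and stop after a fixed number of iterations or once the error is small enough
import numpy as np

X = np.asarray(X)                       # plain NumPy arrays, so that X[i, :] selects a row
y = np.asarray(y).ravel()
np.random.seed(1)
alpha = 1                               # learning rate; this value matters a lot
beta = np.random.randn(X.shape[1])      # randomly initialize the parameters
for T in range(500):                    # number of iterations
    prob = np.array(1. / (1 + np.exp(-np.matmul(X, beta)))).ravel()
    prob_y = list(zip(prob, y))         # only used for the loss below; gradient descent does not need it
    loss = -sum([np.log(p) if label == 1 else np.log(1 - p) for p, label in prob_y]) / len(y)
    # the loss is computed only for monitoring; gradient descent does not need it
    error_rate = 0
    for i in range(len(y)):
        if ((prob[i] > 0.5 and y[i] == 0) or (prob[i] <= 0.5 and y[i] == 1)):
            error_rate += 1
    error_rate /= len(y)
    if T % 5 == 0:
        print('T=' + str(T) + ' loss = ' + str(loss) + ' error = ' + str(error_rate))  # print loss and error as training progresses
    deriv = np.zeros(X.shape[1])
    for i in range(len(y)):                       # compute the derivative for every instance
        deriv += np.asarray(X[i, :]).ravel() * (prob[i] - y[i])  # sum the derivatives of all instances
    deriv /= len(y)                               # take the average
    beta -= alpha * deriv                         # update beta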

The essential part of the whole update process is:

    for i in range(len(y)):                       # compute the derivative for every instance
        deriv += np.asarray(X[i, :]).ravel() * (prob[i] - y[i])  # sum the derivatives of all instances
    deriv /= len(y)                               # take the average
    beta -= alpha * deriv                         # update beta

prob[i] is computed here because it appears as a term in the derivative.
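The per-instance loop can also be written as a single matrix product. The sketch below is one possible way to do that, run on synthetic data that is made up purely for illustration (it is not the employee dataset); the helper names `gradient_loop` and `gradient_vectorized` are likewise hypothetical. It checks that both forms give the same gradient and then runs the same update rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_loop(X, y, beta):
    """Average of x_i * (p_i - y_i), accumulated one instance at a time."""
    prob = sigmoid(X @ beta)
    deriv = np.zeros(X.shape[1])
    for i in range(len(y)):
        deriv += X[i, :] * (prob[i] - y[i])
    return deriv / len(y)

def gradient_vectorized(X, y, beta):
    """The same quantity as one matrix product: X^T (p - y) / n."""
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

# Synthetic data, made up for illustration only
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
true_beta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(n) < sigmoid(X @ true_beta)).astype(float)

beta = np.zeros(X.shape[1])
same = np.allclose(gradient_loop(X, y, beta), gradient_vectorized(X, y, beta))
print("loop and vectorized gradients agree:", same)

alpha = 1.0                      # learning rate, as in the loop version
losses = []
for T in range(500):
    p = sigmoid(X @ beta)
    losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
    beta -= alpha * gradient_vectorized(X, y, beta)

print("fitted beta:", beta)
print("first loss:", losses[0], "last loss:", losses[-1])
```

The vectorized form avoids the Python-level loop over instances, which usually matters once the dataset has more than a few thousand rows.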