Machine Learning From Scratch: Logistic Regression

Introduction

After having discussed linear regression in the first part of this series, it’s time to take a look at another building block of more advanced machine learning algorithms: logistic regression. Logistic regression, despite its name, is most widely used for binary classification. In binary classification, you are trying to predict whether an observation belongs either to class 0 or class 1. For instance, one could try to predict whether visitors of a website are going to click on an ad or not.

Preparation

In order to start building logistic regression, we first need to generate some dummy data. To simulate two different classes, we will create two input features, X1 and X2, along with our response variable Y. We’ll draw X1 from Gaussian distributions with different means for the two classes, which makes them clearly separable, while X2 is drawn from the same standard Gaussian for all observations.

import numpy as np
import pandas as pd

#draw the first five X1 values around 0 (class 0) and the last five around 4 (class 1)
X1_1 = pd.Series(np.random.randn(5))
X1_2 = pd.Series(np.random.randn(5) + 4)
X1 = pd.concat([X1_1, X1_2]).reset_index(drop=True)
X2 = pd.Series(np.random.randn(10))
Y = pd.Series([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

data = pd.concat([X1, X2, Y], axis=1)
data.columns = ["X1", "X2", "Y"]
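
Because the features are drawn at random, the exact numbers in the outputs below will differ from run to run. If you want reproducible results, one option (not used in the original post) is to seed NumPy’s random number generator before generating the data:

#optional: fix the random seed so the generated data is reproducible
#(the seed value 42 is an arbitrary choice)
np.random.seed(42)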

Since visualizations are always nice, let’s take a look at our data:
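
The scatter plot from the original post isn’t reproduced here, but plotting X1 against X2 and coloring the points by class is enough to see the separation. A minimal sketch using matplotlib (which isn’t used elsewhere in this post):

import matplotlib.pyplot as plt

#scatter plot of the two features, colored by class membership
plt.scatter(data.X1, data.X2, c=data.Y)
plt.xlabel("X1")
plt.ylabel("X2")
plt.show()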

As we can see, our two classes, Y=0 and Y=1, are clearly separable, so logistic regression is a natural fit. A plain linear regression, by contrast, would be a poor choice here: its outputs are not bounded between zero and one, which makes it very difficult to interpret them as probabilities of Y=0 or Y=1.

Therefore, instead of just sticking to a linear regression with two input features, Y = beta0 + beta1*X1 + beta2*X2, we are going to transform our linear regression equation using the sigmoid function:

sigmoid(x) = 1 / (1 + e^(-x))

The sigmoid function maps any real value to a value between zero and one. This way, we’ll be able to interpret our results as the probability of a specific training instance belonging to class Y=1 or Y=0. The only thing we need to do is define a cutoff probability that separates our classes. For instance, we could, depending on our project’s requirements, set Y=0 if P≤0.5 and Y=1 if P>0.5.

All that’s left to do now is to replace the x in the sigmoid formula above with our regression equation:

P(Y=1) = 1 / (1 + e^(-(beta0 + beta1*X1 + beta2*X2)))

Let’s define a function for that:

def sigmoid(beta0, beta1, beta2, X1, X2):
    """
    Input:
    The beta parameters beta0, beta1, and beta2, as well as the feature
    values X1 and X2 of a single observation.

    Output:
    A value between 0 and 1 that can be interpreted as the probability
    of that observation belonging to class Y=1.
    """
    #linear combination of the inputs, squashed through the sigmoid
    prepared = beta0 + beta1*X1 + beta2*X2
    prediction = 1/(1 + np.exp(-prepared))
    return prediction
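
As a quick sanity check (not part of the original post), we can feed the function a couple of extreme inputs and confirm that large negative values are squashed towards 0 and large positive values towards 1; here beta1 is set to 1 and beta2 to 0 so only the first feature matters:

#quick sanity check: the sigmoid squashes large negative inputs towards 0
#and large positive inputs towards 1 (beta1=1, X2 is ignored via beta2=0)
print(sigmoid(0., 1., 0., -10., 0.))   #close to 0
print(sigmoid(0., 1., 0., 10., 0.))    #close to 1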

Gradient Descent

We are going to use stochastic gradient descent to find our optimal parameters. In stochastic gradient descent, as opposed to batch gradient descent, we are only going to use a single observation to update our parameters. Apart from that, the process is basically the same:

  1. Initialize the coefficients with zero or small random values
  2. Evaluate the cost of these parameters by plugging them into a cost function
  3. Calculate the derivative of the cost function
  4. Update the parameters scaled by a learning rate/step size

To get a better understanding of this rough outline of gradient descent, let’s look at the Python code.

The first step of gradient descent consists of initializing the parameters with zero or small random values. In our case, we have to initialize beta0, beta1, and beta2:

#let's assign 0 to our betas to initialize them
beta0 = 0.
beta1 = 0.
beta2 = 0.

Now that we have initialized our betas, we can actually use the sigmoid function we defined earlier and understand what it does. By inputting our first training observation, we get the following result:

#selecting our first training observation
X1, X2, Y = list(data.iloc[0,:])

#making our first prediction
first_prediction = sigmoid(beta0, beta1, beta2, X1, X2)
print("Our first prediction is {}".format(first_prediction))
#Output: Our first prediction is 0.5

What does our output mean? The sigmoid function returns a probability. Since all our betas are zero, the linear combination beta0 + beta1*X1 + beta2*X2 is zero as well, and sigmoid(0) = 1/(1 + e^0) = 0.5. In other words, with these initial parameters the first training observation is considered equally likely to belong to class 1 or class 0. To get better predictions, we’re going to use stochastic gradient descent, which means we have to update our parameters:

def update_coefficients(beta0, beta1, beta2, X1, X2, Y, prediction, alpha=0.3):
    """
    This function takes a single training instance, the current betas,
    the previously calculated prediction, and the learning rate alpha,
    and returns the updated coefficients.
    """
    #the common factor (Y-prediction)*prediction*(1-prediction) combines the
    #prediction error with the derivative of the sigmoid; each coefficient's
    #update is additionally scaled by its input (1.0 for the intercept)
    bias_of_intercept = 1.0
    beta0 = beta0 + alpha*(Y-prediction)*prediction*(1-prediction)*bias_of_intercept
    beta1 = beta1 + alpha*(Y-prediction)*prediction*(1-prediction)*X1
    beta2 = beta2 + alpha*(Y-prediction)*prediction*(1-prediction)*X2

    return beta0, beta1, beta2
  
beta0, beta1, beta2 = update_coefficients(beta0, beta1, beta2, X1, X2, Y, first_prediction)
print("Our updated parameters are:\nbeta0: {}\nbeta1: {}\nbeta2: {}".format(beta0, beta1, beta2))
#Output:
#Our updated parameters are:
#beta0: -0.0375
#beta1: -0.020980870585683948
#beta2: -0.03909196554528134
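
As a sanity check (this calculation is not in the original post), the beta0 update can be reproduced by hand: with alpha = 0.3, Y = 0, and prediction = 0.5, the step is 0.3*(0 - 0.5)*0.5*(1 - 0.5)*1.0 = -0.0375, which is exactly the value printed above:

#reproducing the beta0 update by hand for the first observation
alpha, y_first, p = 0.3, 0, 0.5
print(alpha*(y_first - p)*p*(1 - p)*1.0)
#Output: -0.0375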

Making Predictions

Functions make our life easier; however, we would still have to repeat this process manually for each of our observations. That doesn’t sound like much fun, does it?

Since we’ve defined a few handy functions, we can simply put them all together and loop through our training observations. Note that this works fine with our small dataset; on much larger datasets, however, this pure-Python loop would very likely become a bottleneck.

def putting_it_together(epochs=1, learning_rate=0.3, cutoff=0.5):
    """
    This function performs stochastic gradient descent over the whole
    dataset and returns the final coefficients.
    """
    beta0 = 0.
    beta1 = 0.
    beta2 = 0.
    alpha = learning_rate

    for i in range(1, epochs+1):

        correct_predictions = 0
        print('\nThis is epoch number {}:\n'.format(i), '-'*60)

        for n in range(len(data.X1)):

            #select the n-th training observation
            X1 = data.iloc[n, 0]
            X2 = data.iloc[n, 1]
            Y = data.iloc[n, 2]

            #predict, then update the coefficients using this single observation
            prediction = sigmoid(beta0, beta1, beta2, X1, X2)
            beta0, beta1, beta2 = update_coefficients(beta0, beta1, beta2, X1, X2, Y, prediction, alpha)

            #turn the probability into a class label using the cutoff
            if prediction < cutoff:
                class_ = 0
            else:
                class_ = 1

            if class_ == Y:
                correct_predictions += 1

            print('(Instance Number {}) Beta0: {:.3f}, Beta1: {:.3f}, Beta2: {:.3f}, prediction: {}, class: {}, actual class: {}'.format(n+1, beta0, beta1, beta2, prediction, class_, Y))

        accuracy = (correct_predictions/len(data.X1))*100
        print('\nAccuracy of this epoch: {}%\n'.format(accuracy), '-'*60)
    return beta0, beta1, beta2

Let’s walk through this: the parameters we have to define for our function are the number of epochs, the learning rate, and the cutoff probability. An epoch is one full pass of stochastic gradient descent over all of our training observations. The learning rate determines how much we scale our step size in gradient descent: the larger it is, the bigger the steps, but also the greater the risk of overshooting the minimum. Lastly, the cutoff probability lets us turn the outputs of the sigmoid function into class membership predictions.

First, we initialize the parameters to zero like we did earlier. Inside the function, there are two for-loops. The outer one iterates over the number of epochs we defined; within it, an inner for-loop iterates over each of our training observations.

When running the inner for-loop, we select each training observation’s X1, X2, and Y values one by one to perform our computations. First, we plug our parameters into the sigmoid function and get a probability as a result. Then, we update our coefficients and use the cutoff probability to decide whether that specific training observation is predicted to be class 1 or class 0. At the same time, we count all correct predictions by comparing our predictions with the actual Y values for each training observation. The accuracy is then calculated by dividing the number of correct predictions by the total number of predictions. Expressed a little more formally, we get:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative

Now comes the fun part: running our function. In this example, I’ve set the cutoff probability to 0.51. In Python, we do the following:

putting_it_together(cutoff = 0.51)

#Output:
#This is epoch number 1:
# ------------------------------------------------------------
#(Instance Number 1) Beta0: -0.037, Beta1: -0.021, Beta2: -0.039, prediction: 0.5, class: 0, actual class: 0
#(Instance Number 2) Beta0: -0.074, Beta1: -0.010, Beta2: -0.035, prediction: 0.49318256187851595, class: 0, actual class: 0
#(Instance Number 3) Beta0: -0.110, Beta1: 0.030, Beta2: -0.061, prediction: 0.4777504730569901, class: 0, actual class: 0
#(Instance Number 4) Beta0: -0.144, Beta1: 0.069, Beta2: -0.078, prediction: 0.45645333162008134, class: 0, actual class: 0
#(Instance Number 5) Beta0: -0.181, Beta1: 0.104, Beta2: -0.008, prediction: 0.48480217726825375, class: 0, actual class: 0
#(Instance Number 6) Beta0: -0.148, Beta1: 0.239, Beta2: 0.003, prediction: 0.5623015703558849, class: 1, actual class: 1
#(Instance Number 7) Beta0: -0.130, Beta1: 0.317, Beta2: 0.018, prediction: 0.709114917067449, class: 1, actual class: 1
#(Instance Number 8) Beta0: -0.121, Beta1: 0.361, Beta2: 0.006, prediction: 0.8095023471228494, class: 1, actual class: 1
#(Instance Number 9) Beta0: -0.099, Beta1: 0.410, Beta2: 0.063, prediction: 0.661298222653969, class: 1, actual class: 1
#(Instance Number 10) Beta0: -0.094, Beta1: 0.433, Beta2: 0.070, prediction: 0.8608443171411484, class: 1, actual class: 1

#Accuracy of this epoch: 100.0%
# ------------------------------------------------------------

Thanks to our previously defined function, we get a very clean and informative output. For each training observation, we see how our betas are getting updated as well as the class prediction. Most importantly though, we can compare our predicted class membership with the actual class membership. Within just one epoch, we’ve achieved 100% accuracy!
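
To actually use the model on new observations, you can keep the betas returned by the training loop and feed them back into the sigmoid function. Here is a minimal sketch (not part of the original post; the new feature values x1_new and x2_new are made up for illustration, and calling putting_it_together again will re-print the training log):

#train once more and keep the returned coefficients
beta0, beta1, beta2 = putting_it_together(cutoff=0.51)

#classify a new, made-up observation with the learned betas
x1_new, x2_new = 3.5, 0.2
probability = sigmoid(beta0, beta1, beta2, x1_new, x2_new)
predicted_class = 0 if probability < 0.51 else 1
print("P(Y=1) = {:.3f} -> predicted class: {}".format(probability, predicted_class))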

As always, if you have any feedback or found mistakes, please don’t hesitate to reach out to me.

The complete notebook can be found on my GitHub: https://github.com/lksfr/MachineLearningFromScratch
