Neural Network from Scratch in Cangjie: Part 6 - 仓颉从头开始的神经网络：第六部分-优快云博客

本文链接：https://blog.youkuaiyun.com/sergeyyurkov1/article/details/144570613

Today, we will try to implement backpropagation. We will discuss three main concepts: derivatives and partial derivatives, gradients, and the chain rule.

Backpropagation is a process of adjusting weights and biases to minimize loss in a neural network, which in turn increases its prediction accuracy. There are many ways to do backpropagation, from random and semi-random to intelligent and semi-intelligent. Randomly adjusting weights and biases may work on simple data, but with more complex patterns, this method doesn't work, as there are infinite combinations that would take infinite time to compute.

The first thing that we can do is to find which weights and biases affect the loss the most and how, and adjust those accordingly. We can do that with differentiation, or the metric of how y is affected by x.

With the linear function y = 2x, the impact, or slope, is 2 - y is double the x. With a non-linear function y = 2x^2 - a parabola, this depends on which two points (slopes) we choose to analyze.

For the first 2 points, it is 2 ((y[1]-y[0]) / (x[1]-x[0])); for the third and fourth points it is 10, etc. We can draw a tangent line through these two points.

How do we quantify the impact of any x on y at a given point in a non-linear function that is more granual than that? Because the longer the distance between two points, the larger the margin of error is; we need to use two infinitely close points, but because we cannot calculate infinity, we use a very small number 0.0001. We measure the slope of the tangent line at x, the "instantaneous rate of change", which is called the derivative.

Example of two points at a large distance and the resulting error margin

Example of tangent line at points infinitely close to each other

The tangent line tells us about the impact of x on the y at a particular point. For example, the approximate derivative for f(x) where x = 1 is 4.0001999999987845. This approach is called numerical differentiation where we come up with a number to solve a problem. There is a better method called analytical differentiation where we use a proven set of rules that let us skip a number of steps and simplify deriving inputs from outputs in our network.

Let's examine a single neuron with the ReLU activation function to illustrate those rules. Because it contains multiple inputs x1, x2, x3...xn, weights as well as the bias, all of which affect the output y, we calculate what is called the partial derivative. And because a network consists of a chain of neurons and activation functions where one neuron's output becomes another's input, we need to calculate partial derivatives for every element in the chain. For that, we apply the so-called chain rule.

let x = [1.0, -2.0, 3.0]
let w = [-3.0, -1.0, 2.0]
let b = 1.0

let xw0 = x[0] * w[0]
let xw1 = x[1] * w[1]
let xw2 = x[2] * w[2]
println("Dot product of inputs and weights: ${xw0} ${xw1} ${xw2}")

let z = xw0 + xw1 + xw2 + b
println("Dot product of inputs and weights + bias: ${z}")

let y = max([z, 0.0]).getOrThrow()
println("ReLU output: ${y}")

>>>
Dot product of inputs and weights: -3.000000 2.000000 6.000000
Dot product of inputs and weights + bias: 6.000000
ReLU output: 6.000000

Imagine that a derivative from the next layer is 1. We call it `dvalue`. The rule is that the derivative of the max() function (ReLU) is 1 for values that are > 0 and 0 for others.

let dvalue = 1.0

func f() {
    if (z > 0.0) {
        return 1.0
    } else {
        return 0.0
    }
}
let dreluDz = dvalue * f()
println(dreluDz)

>>> 1.0

Remember that we are moving backwards and finding (deriving) x by the value y.

Because ReLU output is 6 (and the rule says that everything above 0 is 1), the derivative of ReLU function is 1 * 1 (dvalue from the previous layer) = 1.

Next, we deal with partial derivatives, as the neuron's output is the results of the dot product of inputs and weights + bias.

The rule says that the derivative of a sum is always 1. As the result, the sum of dot product and the bias is 1. Multiplying it with the derivative from ReLU (1) gives us 1.

let dsumDxw0 = 1.0
let dreluDxw0 = dreluDz * dsumDxw0
println(dreluDxw0)

>>> 1.0

dsumDxw0 above is "the partial derivative of the sum with respect to the x (input), weighted, for the 0th pair of inputs and weights". We repeat that for dsumDxw1, dsumDxw2, and the bias.

Because bias does not have a preceding operation, its derivative is 1, which we then multiply by the derivative of ReLU: 1.0 * 1.0 = 1.0.

The next step is the derivatives of multiplication of the dot product. The derivative for a product is whatever the input is being multiplied by. We multiply the value of the first weight by the previous derivative, which is 1: -3.0 * 1.0 = -3.0, which gives us the derivative of the dot product respective to x[0]. We repeat that for every input: -1.0 (w[1]) * 1.0 = -1.0; 3.0 (w[2]) * 1.0 = 3.0.

The derivatives of the dot product respective to weights is a similar operation: 1.0 (x[0]) * 1.0 = 1.0 (w[0]); -2.0 (x[1]) * 1.0 = -2.0 (w[1]); 3.0 (x[2]) * 1.0 = 3.0 (w[2]).

The current values for weights are [-3.0, -1.0, 2.0] and the bias is 1.0.

In order to decrease the loss in the full network or in this case the output of the ReLU function, we can apply the derived gradients to these values. The value of -0.001 is called learning rate. It specifies the amount by which the parameters are altered in the direction opposite to the gradient of the loss function. Because we want to decrease the output, the value is negative.

w[0] = -3.0 (w[0]) + -0.001 * 1.0 (dw[0]) = -3.001

w[1] = -1.0 (w[1]) + -0.001 * -2.0 (dw[1]) = -1.0 + 0.002 = -0.998

w[2] = 2.0 (w[2]) + -0.001 * 3.0 (dw[2]) = 2.0 + -0.003 = 1.997

b = 1.0 (b) + -0.001 * 1.0 (db) = 0.999

If we do a forward pass with these new values, we will get 5.985 - a decrease of 0.015.

Now to scale it all up to a single layer:

let dvalues = Matrix([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]])
let weights = Matrix([[0.2, 0.8, -0.5, 1.0], [0.5, -0.91, 0.26, -0.5], [-0.26, -0.27, 0.17, 0.87]]).transpose()

let dinputs = dvalues.times(weights.transpose())

println(dinputs.getArray())

>>> [[0.440000, -0.380000, -0.070000, 1.370000], [0.880000, -0.760000, -0.140000, 2.740000], [1.320000, -1.140000, -0.210000, 4.110000]]

What we did manually is basically the dot product, so the above operation where we find derivative values for x can be done like this.

let dvalues = Matrix([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]])
let inputs = Matrix([1.0, 2.0, 3.0, 2.5], [2.0, 5.0, -1.0, 2.0], [-1.5, 2.7, 3.3, -0.8])

let dweights = inputs.transpose().times(dvalues)

println(dweights.getArray())

>>> [[0.500000, 0.500000, 0.500000], [20.100000, 20.100000, 20.100000], [10.900000, 10.900000, 10.900000], [4.100000, 4.100000, 4.100000]]

The same is done for the weights...

let dvalues = Matrix([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]])
let biases = Matrix([[2.0, 3.0, 0.5]])

let dbiases = Matrix(Array<Array<Float64>>(1, {_ => Array<Float64>(3, {_ => 0.0})}))
for (i in dvalues.getArray()) {
    dbiases.plusEquals(Matrix(i))
}

println(dbiases.getArray())

>>> [[6.000000, 6.000000, 6.000000]]

...and biases. Because the derivative of a sum is 1, we can ignore the biases and simply do a column-wise sum of the incoming derivatives, which gets us our bias derivatives.

Because we can't do column-wise sum in `matrix4cj`, there is a workaround of splitting the columns into individual matrices and summing them up with the `plusEquals` operation. For that, we create an intermediate array filled with zeroes: 0 + 1 + 2 + 3 = 6.

The last step is the derivative of ReLU.

let z = [[1.0, 2.0, -3.0, -4.0], [2.0, -7.0, -1.0, 3.0], [-1.0, 2.0, 5.0, -1.0]]
let dvalues = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]

let drelu = Array<Array<Float64>>(z.size, {_ => Array<Float64>(z[0].size, {_ => 0.0})})

for (i in 0..drelu.size) {
    for (j in 0..drelu[i].size) {
        if (z[i][j] > 0.0) {
            drelu[i][j] = 1.0
        }
    }
}

println(drelu)

for (i in 0..drelu.size) {
    for (j in 0..drelu[i].size) {
        drelu[i][j] *= dvalues[i][j]
    }
}

print(drelu)

>>>

[[1.000000, 1.000000, 0.000000, 0.000000], [1.000000, 0.000000, 0.000000, 1.000000], [0.000000, 1.000000, 1.000000, 0.000000]]

[[1.000000, 2.000000, 0.000000, 0.000000], [5.000000, 0.000000, 0.000000, 8.000000], [0.000000, 10.000000, 11.000000, 0.000000]]

*z is example layer outputs.

We also create an array filled with zeroes that has the same shape as layer outputs. Then, by looping through the nested structure, we find every value that is > 0, and mark it with 1.0 in the `drelu` array. We then multiply the `drelu` array by the array with derivative values from the previous layer.

let z = [[1.0, 2.0, -3.0, -4.0], [2.0, -7.0, -1.0, 3.0], [-1.0, 2.0, 5.0, -1.0]]
let dvalues = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]

let drelu = dvalues.clone()

for (i in 0..drelu.size) {
    for (j in 0..drelu[i].size) {
        if (z[i][j] <= 0.0) {
            drelu[i][j] = 0.0
        }
    }
}

println(drelu)

>>> [[1.000000, 2.000000, 0.000000, 0.000000], [5.000000, 0.000000, 0.000000, 8.000000], [0.000000, 10.000000, 11.000000, 0.000000]]

We can greatly simplify these operations by doing the inverse - marking values that are less than or equal 0 - with 0. This gives us the same result as before.

Now let's do test forward and backward passes.

let dvalues = Matrix([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]])
let inputs = Matrix([[1.0, 2.0, 3.0, 2.5], [2.0, 5.0, -1.0, 2.0], [-1.5, 2.7, 3.3, -0.8]])
let weights = Matrix([[0.2, 0.8, -0.5, 1.0], [0.5, -0.91, 0.26, -0.5], [-0.26, -0.27, 0.17, 0.87]]).transpose()

// let biases = Matrix([[2.0, 3.0, 0.5]])
let biases = Matrix(Array<Array<Float64>>(inputs.getArray().size, {_ => [2.0, 3.0, 0.5]}))
println(biases.getArray())

// Forward pass
// ===========================
let layer_outputs = inputs.times(weights).plus(biases)
println(layer_outputs.getArray())

// ReLU
func maximum(input: Array<Float64>): Array<Float64> {
    func clip(i: Float64): Float64 {
        if (i > 0.0) {
            return i
        } else {
            return 0.0
        }
    }
    return input |> map {i => clip(i)} |> collectArray
}
let output = ArrayList<Array<Float64>>([])
for (array in layer_outputs.getArray()) {
    output.append(maximum(array))
}
let relu_outputs = output.toArray()
println(relu_outputs)

// Backward pass
// ===========================
// ReLU derivatives
let drelu = relu_outputs.clone()
for (i in 0..drelu.size) {
    for (j in 0..drelu[i].size) {
        if (layer_outputs.getArray()[i][j] <= 0.0) {
            drelu[i][j] = 0.0
        }
    }
}

// Layer output derivatives
let dinputs = Matrix(drelu).times(weights.transpose())

// Layer weight derivatives
let dweights = inputs.transpose().times(Matrix(drelu))

// Layer bias derivatives 
let dbiases = Matrix(Array<Array<Float64>>(1, {_ => Array<Float64>(drelu[0].size, {_ => 0.0})}))
for (i in drelu) {
    dbiases.plusEquals(Matrix(i))
}
println(dbiases.getArray())
let dbiasesPadded = Matrix(Array<Array<Float64>>(biases.getArray().size, {_ => dbiases.getArray()[0]}))
println(dbiasesPadded.getArray())

// Adjusting weights and biases to reduce ReLU output
weights.plusEquals(dweights.times(-0.001))
biases.plusEquals(dbiasesPadded.times(-0.001))

println(weights.getArray())
println(biases.getArray()[0])

Padded biases:

[[2.000000, 3.000000, 0.500000], [2.000000, 3.000000, 0.500000], [2.000000, 3.000000, 0.500000]]

Layer outputs:

[[4.800000, 1.210000, 2.385000], [8.900000, -1.810000, 0.200000], [1.410000, 1.051000, 0.026000]]

ReLU outputs:

[[4.800000, 1.210000, 2.385000], [8.900000, 0.000000, 0.200000], [1.410000, 1.051000, 0.026000]]

Layer bias derivatives:

[[15.110000, 2.261000, 2.611000]]

Layer bias derivatives padded:

[[15.110000, 2.261000, 2.611000], [15.110000, 2.261000, 2.611000], [15.110000, 2.261000, 2.611000]]

New weights and biases (padded):

[[0.179515, 0.500367, -0.262746], [0.742093, -0.915258, -0.275840], [-0.510153, 0.252902, 0.162959], [0.971328, -0.502184, 0.863658]]

[[1.984890, 2.997739, 0.497389], [1.984890, 2.997739, 0.497389], [1.984890, 2.997739, 0.497389]]

If we do a second forward pass with these weights and biases, we get different results.

// Second forward pass
let layer_outputs2 = inputs.times(weights).plus(biases)
println(layer_outputs2.getArray())

let output2 = ArrayList<Array<Float64>>([])
for (array in layer_outputs2.getArray()) {
    output2.append(maximum(array))
}
let relu_outputs2 = output2.toArray()
println(relu_outputs2)

ReLU outputs after the second forward pass:

[[4.546452, 1.170835, 2.330986], [8.507194, 0.000000, 0.157053], [1.258701, 1.012316, 0.000000]] <- [[4.800000, 1.210000, 2.385000], [8.900000, 0.000000, 0.200000], [1.410000, 1.051000, 0.026000]]

We successfully reduced the output.

Now, let's modify Dense Layer and Activation ReLU classes to implement the above changes, such as the backward method and other minor fixes.

class Activation_ReLU {
    var output: Array<Array<Float64>>
    var dinputs: Array<Array<Float64>> // new
    var inputs: Array<Array<Float64>> // new

    public init() {
        this.output = []
        this.dinputs = [] // new
        this.inputs = [] // new
    }

    public func forward(inputs: Array<Array<Float64>>) {
        this.inputs = inputs // new

        let output = ArrayList<Array<Float64>>([])

        for (array in inputs) {
            output.append(maximum(array))
        }

        this.output = output.toArray()
    }

    // new
    public func backward(dvalues: Array<Array<Float64>>) {
        this.dinputs = dvalues.clone()

        for (i in 0..this.dinputs.size) {
            for (j in 0..this.dinputs[i].size) {
                if (this.inputs[i][j] <= 0.0) {
                    this.dinputs[i][j] = 0.0
                }
            }
        }
    }

    private func maximum(input: Array<Float64>): Array<Float64> {
        func clip(i: Float64): Float64 {
            if (i > 0.0) {
                return i
            } else {
                return 0.0
            }
        }

        let output = input |> map {i => clip(i)} |> collectArray

        return output
    }
}

The backward method in the Activation ReLU class is where we calculate and store ReLU derivatives called `dinputs`. We also store the inputs during the forward pass, so that we can access them during the backward pass. Otherwise, the class remains the same.

Next, we modify the Dense Layer class. We again store the inputs to be able to calculate `dweights` in the backward method. We instantiate `dweights`, `dbiases`, and `dinputs` as matrices of size 1x1, as we cannot create empty matrices. Inputs are stored as an array of arrays.

class Layer_Dense {
    var weights: Matrix
    var biases: Matrix
    var output: Array<Array<Float64>>

    var inputs: Array<Array<Float64>> // new

    var dweights: Matrix // new
    var dbiases: Matrix // new
    var dinputs: Matrix // new

    public init(nInputs: Int64, nNeurons: Int64, batchSize: Int64) {
        this.weights = Matrix(
            Array<Array<Float64>>(nNeurons, {_ => Array<Float64>(nInputs, {_ => random.nextFloat64() * 0.01})})).
            transpose()

        this.biases = Matrix(Array<Array<Float64>>(batchSize, {_ => Array<Float64>(nNeurons, {_ => 0.0})}))

        this.output = []

        this.inputs = [] // new

        this.dweights = Matrix(1, 1) // new
        this.dbiases = Matrix(1, 1) // new
        this.dinputs = Matrix(1, 1) // new
    }

    public func forward(inputs: Array<Array<Float64>>) {
        this.inputs = inputs // new

        this.output = Matrix(inputs).times(this.weights).plus(this.biases).getArray()
    }

    // new
    public func backward(dvalues: Array<Array<Float64>>) {
        this.dinputs = Matrix(dvalues).times(this.weights.transpose())

        this.dweights = Matrix(this.inputs).transpose().times(Matrix(dvalues))

        this.dbiases = Matrix(Array<Array<Float64>>(1, {_ => Array<Float64>(dvalues[0].size, {_ => 0.0})})) // fix for the Matrix dimensions must agree exception
        for (i in dvalues) {
            this.dbiases.plusEquals(Matrix(i))
        }
        this.dbiases = Matrix(Array<Array<Float64>>(this.biases.getArray().size, {_ => this.dbiases.getArray()[0]})) // padding the biases

    }
}

The last thing we need to deal with Softmax and CategoricalCrossentropy classes. Instead of implementing the backward method in these classes, we can combine them into one superclass. This will simplify calculations and make backward passes through the 2 classes much faster, as we cut out intermediate steps.

class Activation_Softmax_Loss_CategoricalCrossentropy {
    let activation: Activation_Softmax
    let loss: Loss_CategoricalCrossentropy
    var output: Array<Array<Float64>>
    var dinputs: Array<Array<Float64>>

    public init() {
        this.activation = Activation_Softmax()
        this.loss = Loss_CategoricalCrossentropy()
        this.output = []
        this.dinputs = []
    }

    public func forward(inputs: Array<Array<Float64>>, yTrue: Array<Int64>) {
        this.activation.forward(inputs)

        this.output = this.activation.output

        return this.loss.calculate(this.output, yTrue)
    }

    public func backward(dvalues: Array<Array<Float64>>, yTrue: Array<Int64>) {
        let samples = dvalues.size

        this.dinputs = dvalues.clone()

        for ((i, j) in 0..samples |> zip(yTrue)) {
            this.dinputs[i][j] -= 1.0
        }

        for (i in 0..this.dinputs.size) {
            for (j in 0..this.dinputs[i].size) {
                this.dinputs[i][j] /= Float64(samples)
            }
        }
    }
}

Here, we use composition to instantiate the two classes inside one `Activation_Softmax_Loss_CategoricalCrossentropy` superclass. The forward method now combines two previously separate operations - Softmax's forward pass and loss/accuracy calculation - into one step.

The backward method calculates and normalizes the combined gradient of both activation and loss functions in one step. We do that by zipping together the outputs from the forward pass - the predictions - and the predicted values y. We copy the outputs array and call it `dinputs` so that we can safely modify it without affecting the original outputs. Remember that, outputs is a nested array of arrays - each array is a batch of values X. Because y values are indexes in inner predictions arrays, we can access the values corresponding to the prediction and calculate its derivative by subtracting 1 from the original value. Later, we loop through the dinputs array once again to normalize the values by dividing each value by the number of samples in the original array.

The final thing that we need to implement is an optimizer. The optimizer's function is to adjust weights and biases and control the learning process, such as adjusting the learning rate (the direction and magnitude of weights and biases change) and `decay` that lets us decrease the learning rate as time progresses so as to only allow small adjustments to weights and biases toward the end. Learning rate that is too high or is too low affects the learning process. One needs to experiment with different values for his or her network configuration.

There exist many types of optimizers. The more advanced an optimizer is and the more parameters it can adjust, the more accuracy one can expect from the network.

Today, we will implement one of the basic ones, called Stochastic Gradient Descent (SGD). Its purpose is to update weights and biases with new derived weights and biases and decrease the learning rate with the value of `decay` before each iteration, or forward-backward pass (also called "epoch").

class Optimizer_SGD {
    var learningRate: Float64
    var decay: Float64
    var currentLearningRate: Float64
    var iterations: Int64

    public init(learningRate!: Float64 = 1.0, decay!: Float64 = 0.0) {
        this.learningRate = learningRate
        this.currentLearningRate = learningRate
        this.decay = decay
        this.iterations = 0
    }

    public func preUpdateParameters() {
        if (this.decay > 0.0) {
            this.currentLearningRate = this.learningRate * (1.0 / (1.0 + this.decay * Float64(this.iterations)))
        }
    }

    public func postUpdateParameters() {
        this.iterations += 1
    }

    public func updateParameters(layer: Layer_Dense) {
        layer.weights.plusEquals(layer.dweights.times(-this.learningRate))
        layer.dbiases.plusEquals(layer.dbiases.times(-this.learningRate))
    }
}

Finally, the full network code up to this point:

import matrix4cj.*
import std.collection.*
import std.random.*
import csv4cj.*
import std.os.posix.*
import std.fs.*
import std.convert.*
import std.math.*
import std.reflect.*

let random = Random(0) // seed = 0

let EPOCHS = 30000

main() {
    let X: Array<Array<Float64>>
    let y: Array<Int64>

    (X, y) = getData()

    let dense1 = Layer_Dense(2, 64, X.size)
    let activation1 = Activation_ReLU()

    let dense2 = Layer_Dense(64, 3, X.size)

    let lossActivation = Activation_Softmax_Loss_CategoricalCrossentropy()

    // let optimizer = Optimizer_SGD(learningRate: 1.0)
    let optimizer = Optimizer_SGD(learningRate: 0.85, decay: 1e-3)

    for (epoch in 0..EPOCHS) {
        dense1.forward(X)
        activation1.forward(dense1.output)

        dense2.forward(activation1.output)

        let (loss, acc) = lossActivation.forward(dense2.output, y)

        if (epoch % 100 == 0) {
            println()
            println("epoch: ${epoch}, loss: ${loss}, acc: ${acc}, lr: ${optimizer.currentLearningRate}")
        }

        lossActivation.backward(lossActivation.output, y)
        dense2.backward(lossActivation.dinputs)
        activation1.backward(dense2.dinputs.getArray())
        dense1.backward(activation1.dinputs)

        optimizer.preUpdateParameters()
        optimizer.updateParameters(dense1)
        optimizer.updateParameters(dense2)
        optimizer.postUpdateParameters()
    }
}

class Optimizer_SGD {
    var learningRate: Float64
    var decay: Float64

    var currentLearningRate: Float64
    var iterations: Int64

    public init(learningRate!: Float64 = 1.0, decay!: Float64 = 0.0) {
        this.learningRate = learningRate
        this.currentLearningRate = learningRate
        this.decay = decay
        this.iterations = 0
    }

    public func preUpdateParameters() {
        if (this.decay > 0.0) {
            this.currentLearningRate = this.learningRate * (1.0 / (1.0 + this.decay * Float64(this.iterations)))
        }
    }

    public func postUpdateParameters() {
        this.iterations += 1
    }

    public func updateParameters(layer: Layer_Dense) {
        layer.weights.plusEquals(layer.dweights.times(-this.learningRate))
        layer.dbiases.plusEquals(layer.dbiases.times(-this.learningRate))
    }
}

class Activation_Softmax_Loss_CategoricalCrossentropy {
    let activation: Activation_Softmax
    let loss: Loss_CategoricalCrossentropy
    var output: Array<Array<Float64>>
    var dinputs: Array<Array<Float64>>

    public init() {
        this.activation = Activation_Softmax()
        this.loss = Loss_CategoricalCrossentropy()
        this.output = []
        this.dinputs = []
    }

    public func forward(inputs: Array<Array<Float64>>, yTrue: Array<Int64>) {
        this.activation.forward(inputs)

        this.output = this.activation.output

        return this.loss.calculate(this.output, yTrue)
    }

    public func backward(dvalues: Array<Array<Float64>>, yTrue: Array<Int64>) {
        let samples = dvalues.size

        this.dinputs = dvalues.clone()

        for ((i, j) in 0..samples |> zip(yTrue)) {
            this.dinputs[i][j] -= 1.0
        }

        for (i in 0..this.dinputs.size) {
            for (j in 0..this.dinputs[i].size) {
                this.dinputs[i][j] /= Float64(samples)
            }
        }
    }
}

open class Loses {
    public func calculate(output: Array<Array<Float64>>, y: Array<Int64>): (Float64, Float64) {
        // mean loss
        let sampleLoses = this.forward(output, y)
        let sampleLosesSum = sampleLoses |> reduce(this.sum)
        let dataLoss = sampleLosesSum.getOrThrow() / Float64(sampleLoses.size)

        // argmax
        let result = this.argmax(output)
        // accuracy
        let acc = accuracy(result, y)

        return (dataLoss, acc)
    }

    public open func forward(yPred: Array<Array<Float64>>, yTrue: Array<Int64>): Array<Float64> {
        return []
    }

    private func sum(x: Float64, y: Float64): Float64 {
        return x + y
    }

    private func accuracy(argmax: Array<Int64>, y: Array<Int64>): Float64 {
        let values = ArrayList<Float64>([])

        for ((i, j) in argmax |> zip(y)) {
            if (i == j) {
                values.append(1.0)
            } else {
                values.append(0.0)
            }
        }

        let valuesSum = values |> reduce(this.sum)

        let accuracy = valuesSum.getOrThrow() / Float64(values.size)

        return accuracy
    }

    private func argmax(yPred: Array<Array<Float64>>) {
        let indexes = ArrayList<Int64>([])

        for (y in yPred) {
            var index = 0
            var value = y[0]

            for ((e, v) in enumerate(y)) {
                if (v > value) {
                    index = e
                    value = v
                }
            }

            indexes.append(index)
        }

        return indexes.toArray()
    }
}

class Loss_CategoricalCrossentropy <: Loses {
    public override func forward(yPred: Array<Array<Float64>>, yTrue: Array<Int64>): Array<Float64> {
        var confidencesList = ArrayList<Float64>([])
        for ((targIdx, distribution) in yTrue |> zip(yPred)) {
            confidencesList.append(distribution[targIdx])
        }

        let negativeLogLikelyhoods = confidencesList |> map {i => -log(clamp(i, 1e-7, 1.0 - 1e-7))} |> collectArray

        return negativeLogLikelyhoods
    }
}

class Activation_Softmax {
    var output: Array<Array<Float64>>

    public init() {
        this.output = []
    }

    public func forward(inputs: Array<Array<Float64>>) {
        let output = ArrayList<Array<Float64>>([])

        for (input in inputs) {
            let maxValue = max(input)
            let subtractedInput = input |> map {i: Float64 => i - maxValue.getOrThrow()} |> collectArray

            let exponentiatedInput = subtractedInput |> map {i => exp(i)} |> collectArray

            let normBase = exponentiatedInput |> reduce(sum)

            // 标准化
            let probabilities = exponentiatedInput |> map {i => i / normBase.getOrThrow()} |> collectArray

            output.append(probabilities)
        }

        this.output = output.toArray()
    }

    private func sum(x: Float64, y: Float64): Float64 {
        return x + y
    }
}

class Activation_ReLU {
    var output: Array<Array<Float64>>
    var dinputs: Array<Array<Float64>> // new
    var inputs: Array<Array<Float64>> // new

    public init() {
        this.output = []
        this.dinputs = [] // new
        this.inputs = [] // new
    }

    public func forward(inputs: Array<Array<Float64>>) {
        this.inputs = inputs // new

        let output = ArrayList<Array<Float64>>([])

        for (array in inputs) {
            output.append(maximum(array))
        }

        this.output = output.toArray()
    }

    // new
    public func backward(dvalues: Array<Array<Float64>>) {
        this.dinputs = dvalues.clone()

        for (i in 0..this.dinputs.size) {
            for (j in 0..this.dinputs[i].size) {
                if (this.inputs[i][j] <= 0.0) {
                    this.dinputs[i][j] = 0.0
                }
            }
        }
    }

    private func maximum(input: Array<Float64>): Array<Float64> {
        func clip(i: Float64): Float64 {
            if (i > 0.0) {
                return i
            } else {
                return 0.0
            }
        }

        let output = input |> map {i => clip(i)} |> collectArray

        return output
    }
}

class Layer_Dense {
    var weights: Matrix
    var biases: Matrix
    var output: Array<Array<Float64>>
    var inputs: Array<Array<Float64>> // new
    var dweights: Matrix // new
    var dbiases: Matrix // new
    var dinputs: Matrix // new

    public init(nInputs: Int64, nNeurons: Int64, batchSize: Int64) {
        this.weights = Matrix(
            Array<Array<Float64>>(nNeurons, {_ => Array<Float64>(nInputs, {_ => random.nextFloat64() * 0.01})})).
            transpose() // <-

        this.biases = Matrix(Array<Array<Float64>>(batchSize, {_ => Array<Float64>(nNeurons, {_ => 0.0})}))
        this.output = []

        this.inputs = [] // new

        this.dweights = Matrix(1, 1) // new
        this.dbiases = Matrix(1, 1) // new
        this.dinputs = Matrix(1, 1) // new
    }

    public func forward(inputs: Array<Array<Float64>>) {
        this.inputs = inputs // new

        this.output = Matrix(inputs).times(this.weights).plus(this.biases).getArray()
    }

    // new
    public func backward(dvalues: Array<Array<Float64>>) {
        this.dinputs = Matrix(dvalues).times(this.weights.transpose())

        this.dweights = Matrix(this.inputs).transpose().times(Matrix(dvalues))

        this.dbiases = Matrix(Array<Array<Float64>>(1, {_ => Array<Float64>(dvalues[0].size, {_ => 0.0})})) // fix for the Matrix dimensions must agree exception
        for (i in dvalues) {
            this.dbiases.plusEquals(Matrix(i))
        }
        this.dbiases = Matrix(Array<Array<Float64>>(this.biases.getArray().size, {_ => this.dbiases.getArray()[0]})) // padding the biases

    }
}

func getData() {
    let yIdx: Int64 = 2
    let X = ArrayList<Array<Float64>>([])
    let y = ArrayList<Int64>([])

    let path: String = getcwd()
    let fileStream = File("${path}/test.csv", OpenOption.Open(true, false))

    //打开文件流
    if (fileStream.canRead()) {
        //创建字符读取的解析流
        let stream = UTF8ReaderStream(fileStream)
        let reader = CSVReader(stream)

        //创建格式化的解析参数
        let format: CSVParseFormat = CSVParseFormat.DEFAULT

        //创建解析器
        let csvParser = CSVParser(reader, format)
        for (csvRecord in csvParser) {
            let values = csvRecord.getValues()
            X.append(Array<Float64>(values[..yIdx].size, {j => Float64.parse(values[..yIdx][j].toString())}))
            y.append(Int64.parse(values[yIdx].toString()))
        }

        fileStream.close()
    }

    return (X.toArray(), y.toArray())
}

We can begin training our network.

With the initial learning rate set to 0.85 and decay set to 0.001, after 30,000 epochs, the best result that I saw was the accuracy of 44%. It is an improvement from the randomly initialized 33%, but it is still very low. While the loss is continuously decreasing, it decreases very slowly and also spikes randomly.

The next thing we can do is implement better and smarter optimizers and look into the so-called L1 and L2 regularization. After examining individual weights and biases, it appears that we have exploding gradient - very big values for derived weights and biases. This is a solution for the next tutorial, as I need time to study the book and experiment with different parameters. I will also try a simpler dataset to do some comparison tests.

下次见，谢谢阅读！