In cp13_Parallelizing NN Training w TF_printoptions(precision)_squeeze_shuffle_batch_repeat_image处理_map (https://blog.youkuaiyun.com/Linli522362242/article/details/112386820) and cp13_2_PNN Training_tfrecord files_image process_mnist_gradient_iris_exponent_Adagrad_Adam_tanh_Relu (https://blog.youkuaiyun.com/Linli522362242/article/details/113311720), we covered how to define and manipulate tensors and worked with the tf.data API to build input pipelines. We further built and trained a multilayer perceptron to classify the Iris dataset using the TensorFlow Keras API (tf.keras).
In this chapter, we will use different aspects of TensorFlow's API to implement NNs. In particular, we will again use the Keras API, which provides multiple layers of abstraction to make the implementation of standard architectures very convenient. TensorFlow also allows us to implement custom NN layers, which is very useful in research-oriented projects that require more customization.
To illustrate the different ways of model building using the Keras API, we will also consider the classic exclusive or (XOR) problem. First, we will build multilayer perceptrons using the Sequential class. Then, we will consider other methods, such as subclassing tf.keras.Model for defining custom layers. Finally, we will cover tf.estimator, a high-level TensorFlow API that encapsulates the machine learning steps from raw input to prediction.
The topics that we will cover are as follows:
- Understanding and working with TensorFlow graphs and migration to TensorFlow v2
- Function decoration for graph compilation
- Working with TensorFlow variables
- Solving the classic XOR problem and understanding model capacity
- Building complex NN models using Keras' Model class and the Keras functional API
- Computing gradients using automatic differentiation and tf.GradientTape
- Working with TensorFlow Estimators
The key features of TensorFlow
TensorFlow provides us with a scalable, multiplatform programming interface for implementing and running machine learning algorithms. The TensorFlow API has been relatively stable and mature since its 1.0 release in 2017, but it just experienced a major redesign with its recent 2.0 release in 2019, which we are using in this book.
Since its initial release in 2015, TensorFlow has become the most widely adopted deep learning library. However, one of its main friction points was that it was built around static computation graphs. Static computation graphs have certain advantages, such as better graph optimizations behind the scenes and support for a wider range of hardware devices; however, static computation graphs require separate graph declaration and graph evaluation steps, which make it cumbersome for users to develop and work with NNs interactively.
Taking all the user feedback to heart, the TensorFlow team decided to make dynamic computation graphs the default in TensorFlow 2.0, which makes the development and training of NNs much more convenient. In the next section, we will cover some of the important changes from TensorFlow v1.x to v2. Dynamic computation graphs allow for interleaving the graph declaration and graph evaluation steps such that TensorFlow 2.0 feels much more natural for Python and NumPy users compared to previous versions of TensorFlow. However, note that TensorFlow 2.0 still allows users to use the "old" TensorFlow v1.x API via the tf.compat submodule. This helps users to transition their code bases more smoothly to the new TensorFlow v2 API.
A key feature of TensorFlow, which was also noted in Chapter 13, Parallelizing Neural Network Training with TensorFlow https://blog.youkuaiyun.com/Linli522362242/article/details/113311720, is its ability to work with single or multiple graphical processing units (GPUs). This allows users to train deep learning models very efficiently on large datasets and large-scale systems.
While TensorFlow is an open source library and can be freely used by everyone, its development is funded and supported by Google. This involves a large team of software engineers who expand and improve the library continuously. Since TensorFlow is an open source library, it also has strong support from other developers outside of Google, who avidly contribute and provide user feedback.
This has made the TensorFlow library more useful to both academic researchers and developers. A further consequence of these factors is that TensorFlow has extensive documentation and tutorials to help new users.
Last, but not least, TensorFlow supports mobile deployment, which also makes it a very suitable tool for production.
TensorFlow's computation graphs: migrating to TensorFlow v2
TensorFlow performs its computations based on a directed acyclic graph (DAG). In TensorFlow v1.x, such graphs could be explicitly defined in the low-level API, although this was not trivial for large and complex models. In this section, we will see how these graphs can be defined for a simple arithmetic computation. Then, we will see how to migrate a graph to TensorFlow v2, the eager execution and dynamic graph paradigm, as well as the function decoration for faster computations.
Understanding computation graphs
TensorFlow relies on building a computation graph at its core, and it uses this computation graph to derive relationships between tensors from the input all the way to the output. Let's say that we have rank 0 (scalar) tensors a, b, and c and we want to evaluate 𝑧 = 2 × (𝑎 − 𝑏) + 𝑐. This evaluation can be represented as a computation graph, as shown in the following figure:
Note: The rank of a tensor is not the same as the rank of a matrix. The rank of a tensor is the number of indices required to uniquely select each element of the tensor. Rank is also known as "order", "degree", or "ndims."
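As a quick illustration (a minimal snippet added here for clarity), tf.rank returns the rank of a tensor as a 0-D tensor:

# A quick illustration of tensor rank using tf.rank
tf.print(tf.rank(tf.constant(1)))          # 0 -> scalar
tf.print(tf.rank(tf.constant([1, 2, 3])))  # 1 -> vector
tf.print(tf.rank(tf.constant([[1, 2]])))   # 2 -> matrix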
As you can see, the computation graph is simply a network of nodes. Each node represents an operation, which applies a function to its input tensor or tensors and returns zero or more tensors as the output. TensorFlow builds this computation graph and uses it to compute the gradients accordingly. In the next subsections, we will see some examples of creating a graph for this computation using TensorFlow v1.x and v2 styles.
Creating a graph in TensorFlow v1.x
In the earlier version of the TensorFlow (v1.x) low-level API, this graph had to be explicitly declared. The individual steps for building, compiling, and evaluating such a computation graph in TensorFlow v1.x are as follows:
1. Instantiate a new, empty computation graph
2. Add nodes (tensors and operations) to the computation graph
3. Evaluate (execute) the graph:
   a. Start a new session
   b. Initialize the variables in the graph
   c. Run the computation graph in this session
Before we take a look at the dynamic approach in TensorFlow v2, let's look at a simple example that illustrates how to create a graph in TensorFlow v1.x for evaluating 𝑧 = 2 × (𝑎 − 𝑏) + 𝑐, as shown in the previous figure. The variables a, b, and c are scalars (single numbers), and we define these as TensorFlow constants. A graph can then be created by calling tf.Graph(). Variables, as well as computations, represent the nodes of the graph, which we will define as follows:
In this code, we first defined graph g via g=tf.Graph(). Then, we added nodes to the graph, g, using with g.as_default(). However, note that if we do not explicitly create a graph, there is always a default graph to which variables and computations will be added automatically.
In TensorFlow v1.x, a session is an environment in which the operations and tensors of a graph can be executed. The Session class was removed from TensorFlow v2; however, for the time being, it is still available via the tf.compat submodule to allow compatibility with TensorFlow v1.x. A session object can be created by calling tf.compat.v1.Session(), which can receive an existing graph (here, g) as an argument, as in Session(graph=g).
After launching a graph in a TensorFlow session, we can execute its nodes, that is, evaluate its tensors or execute its operators. Evaluating each individual tensor involves calling its eval() method inside the current session. When evaluating a specific tensor in the graph, TensorFlow has to execute all the preceding nodes in the graph until it reaches the given node of interest. In case there are one or more placeholder variables, we also need to provide values for those through the session's run method, as we will see later in the chapter.
# TF v1.x style
g = tf.Graph()
with g.as_default():
    a = tf.constant(1, name='a')
    b = tf.constant(2, name='b')
    c = tf.constant(3, name='c')
    z = 2*(a - b) + c  # build the computation graph

After defining the static graph in the previous code snippet, we can execute the graph in a TensorFlow session and evaluate the tensor, z, as follows:

with tf.compat.v1.Session(graph=g) as sess:
    print('Result: z =', sess.run(z))  # Result: z = 1
    print('Result: z =', z.eval())     # Result: z = 1
Migrating a graph to TensorFlow v2
Next, let's look at how this code can be migrated to TensorFlow v2. TensorFlow v2 uses dynamic (as opposed to static) graphs by default (this is also called eager execution in TensorFlow), which allows us to evaluate an operation on the fly. Therefore, we do not have to explicitly create a graph and a session, which makes the development workflow much more convenient:
# TF v2 style
a = tf.constant(1, name='a')
b = tf.constant(2, name='b')
c = tf.constant(3, name='c')
z = 2*(a - b) + c
tf.print('Result: z = ', z)  # Result: z =  1
Loading input data into a model: TensorFlow v1.x style
Another important improvement from TensorFlow v1.x to v2 is regarding how data can be loaded into our models. In TensorFlow v2, we can directly feed data in the form of Python variables or NumPy arrays. However, when using the TensorFlow v1.x low-level API, we had to create placeholder variables for providing input data to a model. For the preceding simple computation graph example, 𝑧 = 2 × (𝑎 − 𝑏) + 𝑐 , let's assume that a, b, and c are the input tensors of rank 0. We can then define three placeholders, which we will then use to "feed" data to the model via a so-called feed_dict dictionary, as follows:
# TF v1.x style
g = tf.Graph()
with g.as_default():
    a = tf.compat.v1.placeholder(shape=None, dtype=tf.int32, name='tf_a')
    b = tf.compat.v1.placeholder(shape=None, dtype=tf.int32, name='tf_b')
    c = tf.compat.v1.placeholder(shape=None, dtype=tf.int32, name='tf_c')
    z = 2*(a - b) + c

with tf.compat.v1.Session(graph=g) as sess:
    feed_dict = {a: 1, b: 2, c: 3}
    print('Result: z =', sess.run(z, feed_dict=feed_dict))  # Result: z = 1
Loading input data into a model: TensorFlow v2 style
In TensorFlow v2, all this can simply be done by defining a regular Python function with a, b, and c as its input arguments, for example:
# TF v2 style
def compute_z(a, b, c):
    r1 = tf.subtract(a, b)
    r2 = tf.multiply(2, r1)
    z = tf.add(r2, c)
    return z

Now, to carry out the computation, we can simply call this function with Tensor objects as function arguments. Note that TensorFlow functions such as add, subtract, and multiply also allow us to provide inputs of higher ranks in the form of a TensorFlow Tensor object, a NumPy array, or possibly other Python objects, such as lists and tuples. In the following code example, we provide scalar inputs (rank 0), as well as rank 1 and rank 2 inputs, as lists:

tf.print('Scalar Inputs:', compute_z(1, 2, 3))              # 1
tf.print('Rank 1 Inputs:', compute_z([1], [2], [3]))        # [1]
tf.print('Rank 2 Inputs:', compute_z([[1]], [[2]], [[3]]))  # [[1]]
In this section, you saw how migrating to TensorFlow v2 makes the programming style simple and efficient by avoiding explicit graph and session creation steps. Now that we have seen how TensorFlow v1.x compares to TensorFlow v2, we will focus only on TensorFlow v2. Next, we will take a deeper look into decorating Python functions into a graph that allows for faster computation.
Improving computational performance with function decorators
As you saw in the previous section, we can easily write a normal Python function and utilize TensorFlow operations. However, computations via the eager execution (dynamic graph) mode are not as efficient as the static graph execution in TensorFlow v1.x. Thus, TensorFlow v2 provides a tool called AutoGraph that can automatically transform Python code into TensorFlow's graph code for faster execution. In addition, TensorFlow provides a simple mechanism for compiling a normal Python function to a static TensorFlow graph in order to make the computations more efficient.
To see how this works in practice, let's work with our previous compute_z function and annotate it for graph compilation using the @tf.function decorator:
# Using the @tf.function decorator to annotate a normal Python function
# so that TensorFlow will compile it into a graph for faster execution
@tf.function
def compute_z(a, b, c):
    r1 = tf.subtract(a, b)
    r2 = tf.multiply(2, r1)
    z = tf.add(r2, c)
    return z

tf.print('Scalar Input:', compute_z(1, 2, 3))
tf.print('Rank 1 Input:', compute_z([1], [2], [3]))
tf.print('Rank 2 Input:', compute_z([[1]], [[2]], [[3]]))
Note that we can use and call this function the same way as before, but now TensorFlow will construct a static graph based on the input arguments. Python supports dynamic typing and polymorphism, so we can define a function such as def f(a, b): return a+b and then call it using integer, float, list, or string inputs (recall that a+b is a valid operation for lists and strings). While TensorFlow graphs require static types and shapes, tf.function supports such a dynamic typing capability. For example, let's call this function with the following inputs:
This will produce the same outputs as before. Here, TensorFlow uses a tracing mechanism to construct a graph based on the input arguments. For this tracing mechanism, TensorFlow generates a tuple of keys based on the input signatures given for calling the function. The generated keys are as follows:
- For tf.Tensor arguments, the key is based on their shapes and dtypes.
- For Python types, such as lists (including nested lists), their id() is used to generate cache keys.
- For Python primitive values (integers, floats, booleans, and strings), the cache keys are based on the input values. (A short sketch demonstrating this tracing behavior follows the list.)
Upon calling such a decorated function, TensorFlow will check whether a graph with the corresponding key has already been generated. If such a graph does not exist, TensorFlow will generate a new graph and store the new key. On the other hand, if we want to limit the way a function can be called, we can specify its input signature via a tuple of tf.TensorSpec objects when defining the function. For example, let's redefine the previous function, compute_z, and specify that only rank 1 (shape=[None]) tensors of type tf.int32 are allowed:
@tf.function(input_signature=(tf.TensorSpec(shape=[None], dtype=tf.int32),
                              tf.TensorSpec(shape=[None], dtype=tf.int32),
                              tf.TensorSpec(shape=[None], dtype=tf.int32),))
def compute_z(a, b, c):
    r1 = tf.subtract(a, b)
    r2 = tf.multiply(2, r1)
    z = tf.add(r2, c)
    return z
Now, we can call this function using rank 1 tensors (or lists that can be converted to rank 1 tensors):
tf.print('Rank 1 Inputs:', compute_z([1], [2], [3]))        # [1]
tf.print('Rank 1 Inputs:', compute_z([1,2], [2,4], [3,6]))  # [1 2]
However, calling this function using tensors with ranks other than 1 will result in an error since the rank will not match the specified input signature, as follows:
tf.print('Rank 0 Inputs:', compute_z(1, 2, 3))
### will result in an error, since 1, 2, 3 are rank 0 inputs
...
tf.print('Rank 2 Inputs:', compute_z([[1], [2]],
                                     [[2], [4]],
                                     [[3], [6]]))
### will result in an error, since these are rank 2 inputs
In this section, we learned how to annotate a normal Python function (with @tf.function) so that TensorFlow will compile it into a graph for faster execution. Next, we will look at TensorFlow variables: how to create them and how to use them.
TensorFlow Variable objects for storing and updating model parameters
We covered Tensor objects in Chapter 13, Parallelizing Neural Network Training with TensorFlow https://blog.youkuaiyun.com/Linli522362242/article/details/112386820. In the context of TensorFlow, a Variable is a special Tensor object that allows us to store and update the parameters of our models during training. A Variable can be created by just calling the tf.Variable class on user-specified initial values. In the following code, we will generate Variable objects of type float32, int32, bool, and string:
a = tf.Variable( initial_value=3.14, name='var_a')
b = tf.Variable( initial_value=[1,2,3], name='var_b')
c = tf.Variable( initial_value=[True, False], dtype=tf.bool)
d = tf.Variable( initial_value=['abc'], dtype=tf.string)
print(a)
print(b)
print(c)
print(d)
Notice that we always have to provide the initial values when creating a Variable. Variables have an attribute called trainable, which, by default, is set to True. Higher-level APIs such as Keras will use this attribute to manage the trainable variables and non-trainable ones. You can define a non-trainable Variable as follows:
a.trainable  # True

trainable: if True (the default), the variable is added to the graph collection GraphKeys.TRAINABLE_VARIABLES. This collection is the default list of variables optimized by the Optimizer classes (a different variable collection can be specified for an optimizer); in other words, it is the list of variables to be trained. Setting trainable=False prevents the variable from being collected into GraphKeys.TRAINABLE_VARIABLES, so we will not try to update its value during training.

# To compute multiple gradients (e.g., tape.gradient(x1, trainable) and
# tape.gradient(x2, non_trainable)) over the same computation,
# create a persistent gradient tape
with tf.GradientTape(persistent=True) as tape:
    trainable = tf.Variable(1., name='var_1')
    non_trainable = tf.Variable(2., trainable=False)
    x1 = trainable * 2.
    x2 = trainable * 3.

tape.gradient(x1, trainable)  # 2.0
tape.watched_variables()      # only 'var_1' is watched automatically
                              # (if trainable were created without name='var_1',
                              #  the default name 'Variable:0' would appear here)
assert tape.gradient(x2, non_trainable) is None  # non-trainable variables are not watched

w = tf.Variable([1, 2, 3], trainable=False)
print(w.trainable)  # False
The values of a Variable can be efficiently modified by running some operations such as .assign(), .assign_add() and related methods. Let's take a look at some examples:
print(w.assign([3, 1, 4], read_value=True))  # returns the updated Variable
w.assign_add([2, -1, 2], read_value=False)   # updates in place, returns nothing
print(w.value())  # [5 0 6]
When the read_value argument is set to True (which is also the default), these operations will automatically return the new values after updating the current values of the Variable. Setting the read_value to False will suppress the automatic return of the updated value (but the Variable will still be updated in place). Calling w.value() will return the values in a tensor format. Note that we cannot change the shape or type of the Variable during assignment.
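For example (a small sketch added here to demonstrate that constraint), assigning a value with a different shape raises a ValueError:

w = tf.Variable([1, 2, 3])
try:
    w.assign([1, 2])  # shape (2,) does not match the Variable's shape (3,)
except ValueError as e:
    print('Assignment failed:', e)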
You will recall that for NN models, initializing model parameters with random weights is necessary to break the symmetry during backpropagation—otherwise, a multilayer NN would be no more useful than a single-layer NN like logistic regression. When creating a TensorFlow Variable, we can also use a random initialization scheme. TensorFlow can generate random numbers based on a variety of distributions via tf.random (see https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/random). In the following example, we will take a look at some standard initialization methods that are also available in Keras (see https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/initializers).
So, let's look at how we can create a Variable with Glorot initialization, which is a classic random initialization scheme that was proposed by Xavier Glorot and Yoshua Bengio. For this, we create an operator called init as an object of class GlorotNormal. Then, we call this operator and provide the desired shape of the output tensor:
tf.random.set_seed(1)
init = tf.keras.initializers.GlorotNormal()
tf.print( init( shape=(3,) ) )
Now, we can use this operator to initialize a Variable of shape 2 × 3:
v = tf.Variable( init(shape=(2,3)) )
tf.print(v)
######################################
The Vanishing/Exploding Gradients Problems
https://blog.youkuaiyun.com/Linli522362242/article/details/106935910
As we discussed in Chapter 10, the backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient along the way. Once the algorithm has computed the gradient of the cost function with regard to each parameter in the network, it uses these gradients to update each parameter with a Gradient Descent step.
Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers’ connection weights virtually unchanged, and training never converges to a good solution. We call this the vanishing gradients problem. In some cases, the opposite can happen: the gradients can grow bigger and bigger until layers get insanely large weight updates and the algorithm diverges. This is the exploding gradients problem, which surfaces in recurrent neural networks (see Chapter 15). More generally, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.
This unfortunate behavior was empirically observed long ago, and it was one of the reasons deep neural networks were mostly abandoned in the early 2000s. It wasn't clear what caused the gradients to be so unstable when training a DNN, but some light was shed in a 2010 paper by Xavier Glorot and Yoshua Bengio. The authors found a few suspects, including the combination of the popular logistic sigmoid activation function and the weight initialization technique that was most popular at the time (i.e., a normal distribution with a mean of 0 and a standard deviation of 1). In short, they showed that with this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing after each layer until the activation function saturates at the top layers. This saturation is actually made worse by the fact that the logistic function has a mean of 0.5, not 0 (the hyperbolic tangent function has a mean of 0 and behaves slightly better than the logistic function in deep networks).
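To make this concrete, here is a minimal NumPy simulation (an illustration added here, not from the original paper) that propagates a batch of inputs through several layers initialized from a normal distribution with mean 0 and standard deviation 1, using the logistic sigmoid; a large and growing fraction of the activations ends up saturated near 0 or 1:

import numpy as np

rng = np.random.RandomState(1)
x = rng.uniform(-1, 1, size=(100, 50))  # a batch of 100 examples with 50 features
for layer in range(5):
    W = rng.normal(0.0, 1.0, size=(50, 50))  # the once-popular N(0, 1) initialization
    x = 1.0/(1.0 + np.exp(-x @ W))           # logistic sigmoid activation
    saturated = np.mean((x < 0.01) | (x > 0.99))
    print('layer %d: %.0f%% of activations saturated' % (layer + 1, 100*saturated))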
Looking at the logistic activation function (see Figure 11-1), you can see that when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. Thus, when backpropagation (https://blog.youkuaiyun.com/Linli522362242/article/details/111940633) kicks in, it has virtually no gradient to propagate back through the network; and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is really nothing left for the lower layers.
import numpy as np
import matplotlib.pyplot as plt

def logit(z):
    return 1/(1 + np.exp(-z))

z = np.linspace(-5, 5, 200)

plt.plot([-5, 5], [0, 0], 'k-')      # x-axis
plt.plot([-5, 5], [1, 1], 'k--')     # horizontal line y=1
plt.plot([0, 0], [-0.2, 1.2], 'k-')  # y-axis
plt.plot(z, logit(z), 'b-', linewidth=2)

props = dict(facecolor='black', shrink=0.1)
plt.annotate('Saturating', xytext=(3.5, 0.7),
             xy=(5, 1), arrowprops=props,
             fontsize=14, ha='center')
plt.annotate('Saturating', xytext=(-3.5, 0.3),
             xy=(-5, 0), arrowprops=props,
             fontsize=14, ha='center')
plt.annotate('Linear', xytext=(2, 0.2),
             xy=(0, 0.5), arrowprops=props,
             fontsize=14, ha='center')
plt.grid(True)
plt.title('Sigmoid activation function', fontsize=14)
plt.axis([-5, 5, -0.2, 1.2])
plt.show()
Figure 11-1. Logistic activation function saturation
Xavier (or Glorot) initialization
In the early development of deep learning, it was observed that random uniform or random normal weight initialization could often result in a poor performance of the model during training. In 2010, Glorot and Bengio (11_Training Deep Neural Networks_VarianceScaling_leaky relu_PReLU_SELU _Batch Normalization_Reusing, https://blog.youkuaiyun.com/Linli522362242/article/details/106935910) investigated the effect of initialization and proposed a novel, more robust initialization scheme to facilitate the training of deep networks. The general idea behind Xavier initialization is to roughly balance the variance of the gradients across different layers. Otherwise, some layers may get too much attention during training while the other layers lag behind.
According to the research paper by Glorot and Bengio, if we want to initialize the weights from a uniform distribution, we should choose the interval of this uniform distribution as follows (see also https://blog.youkuaiyun.com/Linli522362242/article/details/106935910):

$W \sim \mathrm{Uniform}\left(-\sqrt{\dfrac{6}{n_{in}+n_{out}}},\ \sqrt{\dfrac{6}{n_{in}+n_{out}}}\right)$

Here, $n_{in}$ is the number of input neurons that are multiplied by the weights, and $n_{out}$ is the number of output neurons that feed into the next layer. For initializing the weights from a Gaussian (normal) distribution, it is recommended that you choose the standard deviation of this Gaussian to be $\sigma = \sqrt{\dfrac{2}{n_{in}+n_{out}}}$.

TensorFlow supports Xavier initialization in both uniform and normal distributions of weights.
For more information about Glorot and Bengio's initialization scheme, including the mathematical derivation and proof, read their original paper (Understanding the difficulty of deep feedforward neural networks, Xavier Glorot and Yoshua Bengio, 2010), which is freely available at http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf.
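As a quick sanity check (a sketch with arbitrarily chosen fan_in and fan_out values), we can verify that Keras' GlorotUniform initializer keeps its samples within the theoretical limit $\sqrt{6/(n_{in}+n_{out})}$:

import numpy as np
import tensorflow as tf

fan_in, fan_out = 4, 16
limit = np.sqrt(6.0/(fan_in + fan_out))  # ~0.5477
init = tf.keras.initializers.GlorotUniform()
w = init(shape=(fan_in, fan_out)).numpy()
print('limit:', limit, ' min:', w.min(), ' max:', w.max())  # all samples lie in [-limit, limit]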
######################################
Now, to put this into the context of a more practical use case, let's see how we can define a Variable inside the base tf.Module class. We will define two variables: a trainable one and a non-trainable one:
class MyModule(tf.Module):
    def __init__(self):
        init = tf.keras.initializers.GlorotNormal()
        self.w1 = tf.Variable(init(shape=(2, 3)), trainable=True)
        self.w2 = tf.Variable(init(shape=(1, 2)), trainable=False)

m = MyModule()
print('All module variables: ', [v.shape for v in m.variables])
print('Trainable variable: ', [v.shape for v in m.trainable_variables])
As you can see in this code example, subclassing the tf.Module class gives us direct access to all variables defined in a given object (here, an instance of our custom MyModule class) via the .variables attribute.
Finally, let's look at using variables inside a function decorated with tf.function. When we define a TensorFlow Variable inside a normal function (not decorated), we might expect that a new Variable will be created and initialized each time the function is called. However, tf.function will try to reuse the Variable based on tracing and graph creation. Therefore, TensorFlow does not allow the creation of a Variable inside a decorated function and, as a result, the following code will raise an error:
@tf.function
def f(x):
    w = tf.Variable([1, 2, 3])

f([1])
...
### will raise an error: variables cannot be created inside a tf.function-decorated
### function (except on the first trace)
One way to avoid this problem is to define the Variable outside of the decorated function and use it inside the function:
w = tf.Variable(tf.random.uniform((3, 3)))

@tf.function
def compute_z(x):
    return tf.matmul(w, x)

x = tf.constant([[1], [2], [3]], dtype=tf.float32)
# <tf.Tensor: shape=(3, 1), dtype=float32, numpy=
# array([[1.],
#        [2.],
#        [3.]], dtype=float32)>
tf.print(compute_z(x))
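Another common pattern (a sketch of the workaround described in the TensorFlow documentation, using a hypothetical Linear module) is to create the Variable only on the first call, since tf.function permits variable creation during the first trace:

class Linear(tf.Module):
    def __init__(self):
        super().__init__()
        self.w = None  # created lazily on the first call

    @tf.function
    def __call__(self, x):
        if self.w is None:
            # creating the Variable only on the first trace is allowed
            self.w = tf.Variable(tf.random.uniform((3, 3)))
        return tf.matmul(self.w, x)

linear = Linear()
tf.print(linear(tf.constant([[1.], [2.], [3.]])))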
Computing gradients via automatic differentiation and GradientTape
As you already know, optimizing NNs requires computing the gradients of the cost with respect to the NN weights. This is required for optimization algorithms such as stochastic gradient descent (SGD). In addition, gradients have other applications, such as diagnosing the network to find out why an NN model is making a particular prediction for a test example. Therefore, in this section, we will cover how to compute gradients of a computation with respect to some variables.
Computing the gradients of the loss with respect to trainable variables
TensorFlow supports automatic differentiation, which can be thought of as an implementation of the chain rule for computing gradients of nested functions. When we define a series of operations that results in some output or even intermediate tensors, TensorFlow provides a context for calculating gradients of these computed tensors with respect to its dependent nodes in the computation graph. In order to compute these gradients, we have to "record" the computations via tf.GradientTape.
Let's work with a simple example where we will compute 𝑧 = 𝑤𝑥 + 𝑏 and define the loss as the squared loss between the target and prediction, $Loss = (y - z)^2$. In the more general case, where we may have multiple predictions and targets, we compute the loss as the sum of the squared errors, $Loss = \sum_i \left(y^{(i)} - z^{(i)}\right)^2$. In order to implement this computation in TensorFlow, we will define the model parameters, w and b, as variables, and the input, x and y, as tensors. We will place the computation of z and the loss within the tf.GradientTape context:
import tensorflow as tf

w = tf.Variable(1.0)
b = tf.Variable(0.5)
print(w.trainable, b.trainable)  # True True

x = tf.convert_to_tensor([1.4])  # <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.4], dtype=float32)>
y = tf.convert_to_tensor([2.1])

with tf.GradientTape() as tape:
    z = tf.add(tf.multiply(w, x), b)
    loss = tf.reduce_sum(tf.square(y - z))

dloss_dw = tape.gradient(loss, w)
tf.print('dL/dw : ', dloss_dw)  # -0.56
When computing the value z, we could think of the required operations, which we recorded to the "gradient tape," as a forward pass in an NN. We used tape.gradient to compute $\partial Loss/\partial w$. Since this is a very simple example, we can obtain the derivative symbolically, $\partial Loss/\partial w = 2x(wx + b - y)$, to verify that the computed gradients match the results we obtained in the previous code example:
# verifying the computed gradient
tf.print(2*x*(w*x + b - y))  # [-0.56]
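Plugging in the numbers from this example, $\partial Loss/\partial w = 2x(wx + b - y) = 2 \times 1.4 \times (1.0 \times 1.4 + 0.5 - 2.1) = -0.56$, which matches the value returned by tape.gradient above.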
##########################################
Understanding automatic differentiation https://en.wikipedia.org/wiki/Automatic_differentiation
Automatic differentiation represents a set of computational techniques for computing derivatives or gradients of arbitrary arithmetic operations. During this process, gradients of a computation (expressed as a series of operations) are obtained by accumulating the gradients through repeated applications of the chain rule. Fundamental to AD (automatic differentiation) is the decomposition of differentials provided by the chain rule. For the simple composition

$y = f(g(h(x))) = f(g(w_1)) = f(w_2) = w_3$, with $w_0 = x$, $w_1 = h(w_0)$, $w_2 = g(w_1)$, and $w_3 = f(w_2) = y$,

the chain rule gives

$\dfrac{dy}{dx} = \dfrac{dy}{dw_2}\,\dfrac{dw_2}{dw_1}\,\dfrac{dw_1}{dx}$

Usually, two distinct modes of AD are presented: forward accumulation (or forward mode) and reverse accumulation (or reverse mode). Forward accumulation specifies that one traverses the chain rule from inside to outside (that is, first compute $\frac{dw_1}{dx}$, then $\frac{dw_2}{dw_1}$, and at last $\frac{dy}{dw_2}$), while reverse accumulation has the traversal from outside to inside (first compute $\frac{dy}{dw_2}$, then $\frac{dw_2}{dw_1}$, and at last $\frac{dw_1}{dx}$). More succinctly:

- Forward accumulation computes the recursive relation $\dfrac{\partial w_i}{\partial x} = \dfrac{\partial w_i}{\partial w_{i-1}}\,\dfrac{\partial w_{i-1}}{\partial x}$, with $w_3 = y$. In forward accumulation AD, one first fixes the independent variable x with respect to which differentiation is performed and computes the derivative of each sub-expression recursively. In a pen-and-paper calculation, this involves repeatedly substituting the derivative of the inner functions in the chain rule: $\dfrac{\partial y}{\partial x} = \dfrac{\partial y}{\partial w_2}\dfrac{\partial w_2}{\partial x} = \dfrac{\partial y}{\partial w_2}\left(\dfrac{\partial w_2}{\partial w_1}\dfrac{\partial w_1}{\partial x}\right) = \cdots$
- Reverse accumulation computes the recursive relation $\dfrac{\partial y}{\partial w_i} = \dfrac{\partial y}{\partial w_{i+1}}\,\dfrac{\partial w_{i+1}}{\partial w_i}$, with $w_0 = x$. In reverse accumulation AD, the dependent variable y to be differentiated is fixed and the derivative is computed with respect to each sub-expression recursively. In a pen-and-paper calculation, the derivative of the outer functions is repeatedly substituted in the chain rule: $\dfrac{\partial y}{\partial x} = \dfrac{\partial y}{\partial w_1}\dfrac{\partial w_1}{\partial x} = \left(\dfrac{\partial y}{\partial w_2}\dfrac{\partial w_2}{\partial w_1}\right)\dfrac{\partial w_1}{\partial x} = \cdots$

The derivative can thus be computed in two different ways: forward accumulation, which starts with $\frac{dw_1}{dx}$, and reverse accumulation, which starts with $\frac{dy}{dw_2}$. Note that TensorFlow uses the latter, reverse accumulation.
cp12_实现a M ArtificialNN_gzip_mnist_struct_savez_compressed_fetch_openml_back propagation_weight更_L2 : https://blog.youkuaiyun.com/Linli522362242/article/details/111940633
##########################################
Computing gradients with respect to nontrainable tensors
tf.GradientTape automatically supports the gradients for trainable variables. However, for non-trainable variables and other Tensor objects, we need to add an additional modification to the GradientTape called tape.watch() to monitor those as well. For example, if we are interested in computing $\partial Loss/\partial x$, the code will be as follows:
with tf.GradientTape() as tape:
    tape.watch(x)  ########
    z = tf.add(tf.multiply(w, x), b)
    loss = tf.square(y - z)

dloss_dx = tape.gradient(loss, x)
tf.print('dL/dx:', dloss_dx)  # [-0.4]
####################
Adversarial examples
Computing gradients of the loss with respect to the input example is used for generating adversarial examples (or adversarial attacks). In computer vision, adversarial examples are examples that are generated by adding some small, imperceptible noise (or perturbations) to the input example, which results in a deep NN misclassifying them. Covering adversarial examples is beyond the scope of this book, but if you are interested, you can find the original paper by Christian Szegedy et al., titled Intriguing properties of neural networks, at https://arxiv.org/pdf/1312.6199.pdf.
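As an illustration only (a sketch of the fast gradient sign method, FGSM, one well-known attack from follow-up work by Goodfellow et al.; the model, loss_fn, and epsilon here are hypothetical placeholders), such a perturbation can be built directly from the gradient of the loss with respect to the input:

# Hypothetical model and loss_fn; epsilon controls the perturbation size
def fgsm_perturb(model, x, y, loss_fn, epsilon=0.1):
    with tf.GradientTape() as tape:
        tape.watch(x)                 # x is an input tensor, not a trainable Variable
        loss = loss_fn(y, model(x))
    grad = tape.gradient(loss, x)
    return x + epsilon*tf.sign(grad)  # step in the direction that increases the loss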
####################
Keeping resources for multiple gradient computations
When we monitor the computations in the context of tf.GradientTape, by default, the tape will keep the resources only for a single gradient computation. For instance, after calling tape.gradient() once, the resources will be released and the tape will be cleared. Hence, if we want to compute more than one gradient, for example, both $\partial Loss/\partial w$ and $\partial Loss/\partial b$, we need to make the tape persistent:
with tf.GradientTape(persistent=True) as tape:
    z = tf.add(tf.multiply(w, x), b)
    loss = tf.reduce_sum(tf.square(y - z))

dloss_dw = tape.gradient(loss, w)
dloss_db = tape.gradient(loss, b)
tf.print('dL/dw:', dloss_dw)  # -0.56
tf.print('dL/db:', dloss_db)  # -0.4
However, keep in mind that this is only needed when we want to compute more than one gradient, as recording and keeping the gradient tape is less memory-efficient compared to releasing the memory after a single gradient computation. This is also why the default setting is persistent=False.
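In line with this (a small usage note based on the tf.GradientTape documentation), it is recommended to drop the reference to a persistent tape once you are done with it, so that the held resources can be garbage-collected:

del tape  # release the resources held by the persistent tape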
Finally, if we are computing gradients of a loss term with respect to the parameters of a model, we can define an optimizer and apply the gradients to optimize the model parameters using the tf.keras API, as follows:
optimizer = tf.keras.optimizers.SGD()  # default learning_rate=0.01
# w = tf.Variable(1.0)
# b = tf.Variable(0.5)
optimizer.apply_gradients(zip([dloss_dw, dloss_db],
                              [w, b]))
tf.print('Updated w:', w)     # 1.0056
tf.print('Updated bias:', b)  # 0.504
You will recall that the initial weight and bias unit were w = 1.0 and b = 0.5, and applying the gradients of the loss with respect to the model parameters changed the model parameters to w = 1.0056 and b = 0.504.
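We can verify this by hand: with SGD's default learning rate of 0.01, $w \leftarrow w - \eta\,\partial Loss/\partial w = 1.0 - 0.01 \times (-0.56) = 1.0056$ and $b \leftarrow b - \eta\,\partial Loss/\partial b = 0.5 - 0.01 \times (-0.4) = 0.504$, matching the printed values.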
Simplifying implementations of common architectures via the Keras API
You have already seen some examples of building a feedforward NN model (for instance, a multilayer perceptron) and defining a sequence of layers using Keras' Sequential class. Before we look at different approaches for configuring those layers, let's briefly recap the basic steps by building a model with two densely (fully) connected layers:
model = tf.keras.Sequential()
model.add( tf.keras.layers.Dense(units=16, activation='relu') ) # fully connected (FC) layer or linear layer
model.add( tf.keras.layers.Dense(units=32, activation='relu') )
# late variable creation
model.build( input_shape=(None, 4) )
model.summary()
We specified the input shape with model.build(), instantiating the variables after defining the model for that particular shape. The number of parameters of each layer is displayed: 4 × 16 + 16 = 80 for the first layer, and 16 × 32 + 32 = 544 for the second layer. Once variables (or model parameters) are created, we can access both trainable and non-trainable variables as follows:
# printing the variables of the model
for v in model.variables:
    print('{:20s}'.format(v.name), v.trainable, v.shape)
In this case, each layer has a weight matrix called kernel as well as a bias vector.
Next, let's configure these layers, for example, by applying different activation functions, variable initializers, or regularization methods to the parameters. A comprehensive and complete list of available options for these categories can be found in the official documentation:
- Choosing activation functions via tf.keras.activations: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/activations
- Initializing the layer parameters via tf.keras.initializers: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/initializers
- Applying regularization to the layer parameters (to prevent overfitting) via tf.keras.regularizers: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/regularizers
In the following code example, we will configure the first layer by specifying initializers for the kernel and bias variables. Then, we will configure the second layer by specifying an L1 regularizer for the kernel (weight matrix):
Configuring layers
- Keras Initializers tf.keras.initializers: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/initializers
- Keras Regularizers tf.keras.regularizers: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/regularizers
- Keras Activations tf.keras.activations: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/activations
model = tf.keras.Sequential()
model.add(
    tf.keras.layers.Dense(
        units=16,
        activation=tf.keras.activations.relu,
        kernel_initializer=tf.keras.initializers.GlorotNormal(),
        bias_initializer=tf.keras.initializers.Constant(2.0)  # initializer that generates tensors with constant values
    )
)
model.add(
    tf.keras.layers.Dense(
        units=32,
        activation=tf.keras.activations.sigmoid,
        kernel_regularizer=tf.keras.regularizers.l1(0.01)  # L1 penalty on the kernel (0.01 is the default factor)
    )
)
model.build(input_shape=(None, 4))
model.summary()
Furthermore, in addition to configuring the individual layers, we can also configure the model when we compile it. We can specify the type of optimizer and the loss function for training, as well as which metrics to use for reporting the performance on the training, validation, and test datasets. Again, a comprehensive list of all available options can be found in the official documentation:
- Optimizers via tf.keras.optimizers: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/optimizers
- Loss functions via tf.keras.losses: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/losses
- Performance metrics via tf.keras.metrics: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/metrics
#######################################
Choosing a loss function
Regarding the choices for optimization algorithms, SGD and Adam are the most widely used methods. The choice of loss function depends on the task; for example, you might use mean square error loss for a regression problem.
The family of cross-entropy loss functions supplies the possible choices for classification tasks, which are extensively discussed in cp15_Classifying Images with Deep Convolutional NN_Loss_Cross Entropy_ax.text_mnist_ CelebA_Colab_ck https://blog.youkuaiyun.com/Linli522362242/article/details/108414534.
Furthermore, you can use the techniques you have learned from previous chapters (for example, techniques for model evaluation from cp6_Model Eval_Confusion_Hyperpara Tuning_pipeline_variance_bias_ validation_learning curve_strength https://blog.youkuaiyun.com/Linli522362242/article/details/109560084) combined with the appropriate metrics for the problem. For example, precision and recall, accuracy, area under the curve (AUC), and false negative and false positive scores are appropriate metrics for evaluating classification models.
#######################################
In this example, we will compile the model using the SGD optimizer, cross-entropy loss for binary classification, and a specific list of metrics, including accuracy, precision, and recall:
Compiling a model
- Keras Optimizers tf.keras.optimizers: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/optimizers
- Keras Loss Functions tf.keras.losses: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/losses
- Keras Metrics tf.keras.metrics: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/metrics
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.Accuracy(),
             tf.keras.metrics.Precision(),
             tf.keras.metrics.Recall(),]
)
When we train this model by calling model.fit(...), the history of the loss and the specified metrics for evaluating training and validation performance (if a validation dataset is used) will be returned, which can be used to diagnose the learning behavior.
Next, we will look at a more practical example: solving the classic XOR classification problem using the Keras API. First, we will use the tf.keras.Sequential() class to build the model. Along the way, you will also learn about the capacity of a model for handling nonlinear decision boundaries. Then, we will cover other ways of building a model that will give us more flexibility and control over the layers of the network.
Solving an XOR classification problem
The XOR classification problem is a classic problem for analyzing the capacity of a model with regard to capturing the nonlinear decision boundary between two classes. We generate a toy dataset of 200 training examples with two features ($x_1$, $x_2$) drawn from a uniform distribution between [−1, 1). Then, we assign the ground truth label for training example i according to the following rule (reconstructed from the data-generation code below):

$y^{(i)} = \begin{cases} 0 & \text{if } x_1^{(i)} \times x_2^{(i)} \le 0 \\ 1 & \text{otherwise} \end{cases}$
We will use half of the data (100 training examples) for training and the remaining half for validation. The code for generating the data and splitting it into the training and validation datasets is as follows:
import matplotlib.pyplot as plt
import numpy as np

tf.random.set_seed(1)
np.random.seed(1)

x = np.random.uniform(low=-1, high=1, size=(200, 2))  # shape: 200x2
y = np.ones(len(x))
y[x[:, 0]*x[:, 1] <= 0] = 0

x_train = x[:100, :]
y_train = y[:100]
x_valid = x[100:, :]
y_valid = y[100:]

fig = plt.figure(figsize=(6, 6))
plt.plot(x[y==0, 0],
         x[y==0, 1], 'o', alpha=0.75, markersize=10)
plt.plot(x[y==1, 0],
         x[y==1, 1], '<', alpha=0.75, markersize=10)
plt.xlabel(r'$x_1$', size=15)
plt.ylabel(r'$x_2$', size=15)
plt.show()
The code results in the following scatterplot of the training and validation examples, shown with different markers based on their class label:
In the previous subsection, we covered the essential tools that we need to implement a classifier in TensorFlow. We now need to decide what architecture we should choose for this task and dataset. As a general rule of thumb, the more layers we have, and the more neurons we have in each layer, the larger the capacity of the model will be. Here, the model capacity can be thought of as a measure of how readily the model can approximate complex functions. While having more parameters means the network can fit more complex functions, larger models are usually harder to train (and prone to overfitting). In practice, it is always a good idea to start with a simple model as a baseline, for example, a single-layer NN like logistic regression:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(units=1,
                                input_shape=(2,),  # since x has shape 200 x 2
                                activation='sigmoid'))
model.summary()
The total size of the parameters for this simple logistic regression model is 3: a weight matrix (or kernel) of size 2 × 1 and a bias vector of size 1. After defining the model, we will compile the model and train it for 200 epochs using a batch size of 2:
Equation 4-19. Softmax score for class k: $s_k(\mathbf{x}) = \mathbf{x}^{\mathsf{T}}\boldsymbol{\theta}^{(k)}$, where $\boldsymbol{\theta}^{(k)}$ is a weight vector (updated via gradient descent: $\boldsymbol{\Theta} \leftarrow \boldsymbol{\Theta} - \eta\,\nabla_{\boldsymbol{\Theta}} J$).

Equation 4-20. Softmax function (hint: normalization): $\hat{p}_k = \dfrac{\exp\left(s_k(\mathbf{x})\right)}{\sum_{j=1}^{K}\exp\left(s_j(\mathbf{x})\right)}$

Equation 4-22. Cross-entropy cost function (https://blog.youkuaiyun.com/Linli522362242/article/details/104124771): $J(\boldsymbol{\Theta}) = -\dfrac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{p}_k^{(i)}\right)$

Here, $y_k^{(i)}$ is equal to 1 if the target class for the ith instance is k; otherwise, it is equal to 0. Notice that when there are just two classes (K = 2), this cost function is equivalent to the logistic regression's cost function (log loss; see Equation 4-17, https://blog.youkuaiyun.com/Linli522362242/article/details/96480059).

Binary cross-entropy is the loss function for a binary classification task (with a single output unit), https://blog.youkuaiyun.com/Linli522362242/article/details/108414534
model.compile(optimizer=tf.keras.optimizers.SGD(),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy()])  # BinaryAccuracy uses a default threshold of 0.5

hist = model.fit(x_train, y_train,
                 validation_data=(x_valid, y_valid),
                 epochs=200, batch_size=2, verbose=0)
Notice that model.fit() returns a history of training epochs, which is useful for visual inspection after training. In the following code, we will plot the learning curves, including the training and validation loss, as well as their accuracies.
We will also use the MLxtend library to visualize the validation data and the decision boundary.
MLxtend can be installed via conda or pip as follows: Install_tf_ notebook_ Spyder_tfgraphviz_pydot_Pandas_scikit-learn_ipython_pillow_NLTK_flask_mlxtend https://blog.youkuaiyun.com/Linli522362242/article/details/108037567
conda install mlxtend -c conda-forge
pip install mlxtend
The following code will plot the training performance along with the decision regions:
from mlxtend.plotting import plot_decision_regions

history = hist.history
# hist.history.keys() : dict_keys(['loss', 'binary_accuracy', 'val_loss', 'val_binary_accuracy'])

fig = plt.figure(figsize=(16, 4))
ax = fig.add_subplot(1, 3, 1)
plt.plot(history['loss'], lw=4)
plt.plot(history['val_loss'], lw=4)
plt.legend(['Train loss', 'Validation loss'], fontsize=15)
ax.set_xlabel('Epochs', size=15)

ax = fig.add_subplot(1, 3, 2)
plt.plot(history['binary_accuracy'], lw=4)
plt.plot(history['val_binary_accuracy'], lw=4)
plt.legend(['Train Acc.', 'Validation Acc.'], fontsize=15)
ax.set_xlabel('Epochs', size=15)

ax = fig.add_subplot(1, 3, 3)
plot_decision_regions(X=x_valid, y=y_valid.astype(np.int32),  # convert float labels to integers
                      clf=model)
ax.set_xlabel(r'$x_1$', size=15)
ax.xaxis.set_label_coords(1, -0.025)
ax.set_ylabel(r'$x_2$', size=15)
ax.yaxis.set_label_coords(-0.025, 1)
plt.show()
This results in the following figure, with three separate panels for the losses, accuracies, and the scatterplot of the validation examples, along with the decision boundary:
As you can see, a simple model with no hidden layer can only derive a linear decision boundary, which is unable to solve the XOR problem. As a consequence, we can observe that the loss terms for both the training and the validation datasets are very high, and the classification accuracy is very low.
In order to derive a nonlinear decision boundary, we can add one or more hidden layers connected via nonlinear activation functions. The universal approximation theorem states that a feedforward NN with a single hidden layer and a relatively large number of hidden units can approximate arbitrary continuous functions relatively well. Thus, one approach for tackling the XOR problem more satisfactorily is to add a hidden layer and compare different numbers of hidden units (that is, the layer's output size, or the number of weight columns) until we observe satisfactory results on the validation dataset. Adding more hidden units would correspond to increasing the width of a layer.
Alternatively, we can also add more hidden layers, which will make the model deeper. The advantage of making a network deeper rather than wider is that fewer parameters are required to achieve a comparable model capacity. However, a downside of deep (versus wide) models is that deep models are prone to vanishing and exploding gradients, which make them harder to train.
As an exercise, try adding one, two, three, and four hidden layers, each with four hidden units. In the following example, we will take a look at the results of a feedforward NN with three hidden layers:
(see Figure 11-1) https://blog.youkuaiyun.com/Linli522362242/article/details/106935910

GlorotUniform draws samples from a uniform distribution within [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)) (fan_in is the number of input units in the weight tensor and fan_out is the number of output units).

init = tf.keras.initializers.GlorotUniform()
tf.print(init(shape=(2,)))

Consider that the weight coefficients are decimals (smaller than 1), so multiplying the sample values by the weight coefficients yields smaller results; the ReLU activation function (popular mostly because it does not saturate for positive values, and because it is fast to compute) then sets values less than 0 to 0. Repeating this step several times before the result is passed to the sigmoid function greatly reduces the possibility of saturation. Looking at the logistic activation function (see Figure 11-1), you can see that when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. Thus, when backpropagation (https://blog.youkuaiyun.com/Linli522362242/article/details/111940633) kicks in, it has virtually no gradient to propagate back through the network; and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is really nothing left for the lower layers.
tf.random.set_seed(1)
model = tf.keras.Sequential()  # default bias_initializer='zeros'
model.add(tf.keras.layers.Dense(units=4, input_shape=(2,), activation='relu'))  # default kernel_initializer='glorot_uniform'
model.add(tf.keras.layers.Dense(units=4, activation='relu'))
model.add(tf.keras.layers.Dense(units=4, activation='relu'))
model.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
model.summary()

model.compile(optimizer=tf.keras.optimizers.SGD(),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy()])

hist = model.fit(x_train, y_train,
                 validation_data=(x_valid, y_valid),
                 epochs=200, batch_size=2, verbose=0)
We can repeat the previous code for visualization, which produces the following:
from mlxtend.plotting import plot_decision_regions

history = hist.history
# hist.history.keys() : dict_keys(['loss', 'binary_accuracy', 'val_loss', 'val_binary_accuracy'])

fig = plt.figure(figsize=(16, 4))
ax = fig.add_subplot(1, 3, 1)
plt.plot(history['loss'], lw=4)
plt.plot(history['val_loss'], lw=4)
plt.legend(['Train loss', 'Validation loss'], fontsize=15)
ax.set_xlabel('Epochs', size=15)

ax = fig.add_subplot(1, 3, 2)
plt.plot(history['binary_accuracy'], lw=4)
plt.plot(history['val_binary_accuracy'], lw=4)
plt.legend(['Train Acc.', 'Validation Acc.'], fontsize=15)
ax.set_xlabel('Epochs', size=15)

ax = fig.add_subplot(1, 3, 3)
plot_decision_regions(X=x_valid, y=y_valid.astype(np.int32),  # convert float labels to integers
                      clf=model)
ax.set_xlabel(r'$x_1$', size=15)
ax.xaxis.set_label_coords(1, -0.025)
ax.set_ylabel(r'$x_2$', size=15)
ax.yaxis.set_label_coords(-0.025, 1)
plt.show()
Now, we can see that the model is able to derive a nonlinear decision boundary for this data, and the model reaches 100 percent accuracy on the training dataset. The validation dataset's accuracy is 95 percent, which indicates that the model is slightly overfitting (the more neurons we have in each layer, the larger the capacity of the model, and the more prone it is to overfitting).
Making model building more flexible with Keras' functional API
In the previous example, we used the Keras Sequential class to create a fully connected NN with multiple layers. This is a very common and convenient way of building models. However, it unfortunately doesn't allow us to create more complex models that have multiple input, output, or intermediate branches. That's where Keras' so-called functional API comes in handy.
To illustrate how the functional API can be used, we will implement the same architecture that we built using the object-oriented (Sequential) approach in the previous section; however, this time, we will use the functional approach. In this approach, we first specify the input. Then, the hidden layers are constructed, with their outputs named h1, h2, and h3. For this problem, we use the output of each layer as the input to the subsequent layer (note that if you are building more complex models that have multiple branches, this may not be the case, but it can still be done via the functional API). Finally, we specify the output as the final dense layer that receives h3 as input. The code for this is as follows:
tf.random.set_seed(1)
## input layer:
inputs = tf.keras.Input( shape=(2,) )
## hidden layers
h1 = tf.keras.layers.Dense( units=4, activation='relu' )(inputs)
h2 = tf.keras.layers.Dense( units=4, activation='relu' )(h1)
h3 = tf.keras.layers.Dense( units=4, activation='relu' )(h2)
## output:
outputs = tf.keras.layers.Dense( units=1, activation='sigmoid' )(h3)
## construct a model
model = tf.keras.Model( inputs=inputs, outputs=outputs )
model.summary()
Compiling and training this model is similar to what we did previously:
## compile:
model.compile(optimizer=tf.keras.optimizers.SGD(),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy()])

## train:
hist = model.fit(x_train, y_train,
                 validation_data=(x_valid, y_valid),
                 epochs=200, batch_size=2, verbose=0)

## plotting
from mlxtend.plotting import plot_decision_regions

history = hist.history
# hist.history.keys() : dict_keys(['loss', 'binary_accuracy', 'val_loss', 'val_binary_accuracy'])

fig = plt.figure(figsize=(16, 4))
ax = fig.add_subplot(1, 3, 1)
plt.plot(history['loss'], lw=4)
plt.plot(history['val_loss'], lw=4)
plt.legend(['Train loss', 'Validation loss'], fontsize=15)
ax.set_xlabel('Epochs', size=15)

ax = fig.add_subplot(1, 3, 2)
plt.plot(history['binary_accuracy'], lw=4)
plt.plot(history['val_binary_accuracy'], lw=4)
plt.legend(['Train Acc.', 'Validation Acc.'], fontsize=15)
ax.set_xlabel('Epochs', size=15)

ax = fig.add_subplot(1, 3, 3)
plot_decision_regions(X=x_valid, y=y_valid.astype(np.int32),  # convert float labels to integers
                      clf=model)
ax.set_xlabel(r'$x_1$', size=15)
ax.xaxis.set_label_coords(1, -0.025)
ax.set_ylabel(r'$x_2$', size=15)
ax.yaxis.set_label_coords(-0.025, 1)
plt.show()
Implementing models based on Keras' Model class
An alternative way to build complex models is by subclassing tf.keras.Model. In this approach, we create a new class derived from tf.keras.Model and define the function, __init__(), as a constructor. The call() method is used to specify the forward pass. In the constructor function, __init__(), we define the layers as attributes of the class so that they can be accessed via the self reference attribute. Then, in the call() method, we specify how these layers are to be used in the forward pass of the NN. The code for defining a new class that implements the previous model is as follows:
Sub-classing tf.keras.Model:
- define __init__()
- define call()
class MyModel(tf.keras.Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.hidden_1 = tf.keras.layers.Dense(units=4, activation='relu')
        self.hidden_2 = tf.keras.layers.Dense(units=4, activation='relu')
        self.hidden_3 = tf.keras.layers.Dense(units=4, activation='relu')
        self.output_layer = tf.keras.layers.Dense(units=1, activation='sigmoid')

    def call(self, inputs):
        h = self.hidden_1(inputs)  # functional API style: layers are callable
        h = self.hidden_2(h)
        h = self.hidden_3(h)
        return self.output_layer(h)
Notice that we used the same output name, h, for all hidden layers. This makes the code more readable and easier to follow.
A model class derived from tf.keras.Model through subclassing inherits general model attributes, such as build(), compile(), and fit(). Therefore, once we define an instance of this new class, we can compile and train it like any other model built by Keras:
tf.random.set_seed(1)

## testing:
model = MyModel()
model.build(input_shape=(None, 2))
model.summary()

The parameter counts per layer are:
- 2*4 + 4 = 12 <== input_shape=(None, 2)
- 4*4 + 4 = 20
- 4*4 + 4 = 20
- 4*1 + 1 = 5

## compile:
model.compile(optimizer=tf.keras.optimizers.SGD(),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy()])

## train:
hist = model.fit(x_train, y_train,
                 validation_data=(x_valid, y_valid),
                 epochs=200, batch_size=2, verbose=0)

## plotting
history = hist.history
## hist.history.keys() : dict_keys(['loss', 'binary_accuracy', 'val_loss', 'val_binary_accuracy'])

fig = plt.figure(figsize=(16, 4))
ax = fig.add_subplot(131)
plt.plot(history['loss'], lw=4)
plt.plot(history['val_loss'], lw=4)
plt.legend(['Train loss', 'Validation loss'], fontsize=15)
ax.set_xlabel('Epochs', size=15)

ax = fig.add_subplot(132)
plt.plot(history['binary_accuracy'], lw=4)
plt.plot(history['val_binary_accuracy'], lw=4)
plt.legend(['Train Acc.', 'Validation Acc.'], fontsize=15)
ax.set_xlabel('Epochs', size=15)

ax = fig.add_subplot(133)
plot_decision_regions(X=x_valid, y=y_valid.astype(np.int32),
                      clf=model)
ax.set_xlabel(r'$x_1$', size=15)
ax.xaxis.set_label_coords(1, -0.025)   # move right by 1, move up by -0.025
ax.set_ylabel(r'$x_2$', size=15)
ax.yaxis.set_label_coords(-0.025, 1)   # move right by -0.025, move up by 1
plt.show()
Writing custom Keras layers
https://blog.youkuaiyun.com/Linli522362242/article/details/113783554