19_3_Training Keras models with TensorFlow Cloud

Latest recommended article published on 2025-07-08 11:10:00

License: CC 4.0 BY-SA

Tags: #tensorflow #artificial-intelligence #python

Included in the column pythonMachineLearningInAction

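This article walks through setting up a project on Google Cloud Platform, authenticating the notebook, creating a storage bucket, linking a billing account, enabling the required APIs, and creating a service account, and then training and deploying a TensorFlow model. The steps cover project configuration, importing modules, preparing data, remote training, and monitoring progress with TensorBoard. Finally, it shows how to reconnect a Colab instance to retrieve the training results.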

19_Training and Deploying TensorFlowModels at Scale_walk目录_TensorFlow Serving_requests_REST_gRPC_Docker_Google API Client Library_gpu :https://blog.youkuaiyun.com/Linli522362242/article/details/119323411

19_2_Training & Deploying TensorFlowModels_%%writefile UsageError_colab_文件名含有空格_No dashboard_gcp

https://blog.youkuaiyun.com/Linli522362242/article/details/119626524

1. Create a Project:

https://blog.youkuaiyun.com/Linli522362242/article/details/119626524

Name the project mnist 10272021, then click SELECT PROJECT. The project details are:

Project ID: mnist-10272021

Project number: 97885218772

2. Authenticating the notebook to use your Google Cloud Project

This code authenticates the notebook, checking your valid Google Cloud credentials and identity. It is inside the if not tfc.remote() block to ensure that it is only run in the notebook, and will not be run when the notebook code is sent to Google Cloud.

Note: For Kaggle Notebooks click on "Add-ons"->"Google Cloud SDK" before running the cell below.

# Using tfc.remote() to ensure this code only runs in the notebook
# (tensorflow_cloud must be installed first, see step 6)
import os
import sys
import tensorflow_cloud as tfc

# Set GCP_PROJECT_ID to your own Google Cloud project ID (from step 1).
GCP_PROJECT_ID = "mnist-10272021"

if not tfc.remote():
    # Authentication for Colab Notebooks
    if "google.colab" in sys.modules:
        print('google.colab')
        from google.colab import auth

        auth.authenticate_user()
        os.environ['GOOGLE_CLOUD_PROJECT'] = GCP_PROJECT_ID

    # Authentication for Kaggle Notebooks
    if "kaggle_secrets" in sys.modules:
        from kaggle_secrets import UserSecretsClient

        UserSecretsClient().set_gcloud_credentials(project=GCP_PROJECT_ID)
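
The branch selection above relies on checking `sys.modules` for a marker module. As a minimal sketch of just that check, the hypothetical helper `detect_notebook_env` below (not part of tensorflow-cloud) takes the module table as a parameter so it can be tried outside Colab or Kaggle:

```python
import sys


def detect_notebook_env(modules=None):
    """Return 'colab', 'kaggle', or 'other' depending on which marker module is loaded."""
    if modules is None:
        modules = sys.modules
    if "google.colab" in modules:
        return "colab"
    if "kaggle_secrets" in modules:
        return "kaggle"
    return "other"


# In a plain Python session neither marker module is normally loaded.
print(detect_notebook_env())
```

Passing a dictionary such as `{"google.colab": None}` simulates a Colab session without importing anything.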

3. Create Bucket

We will use this storage bucket for temporary assets as well as to save the model checkpoints. Make a note of the name of the bucket for future reference. Note bucket names are unique globally.

In the Cloud Console, open the Cloud Storage Browser and click CREATE BUCKET.

Name your bucket: mnist_10272021_bucket

GCS_BUCKET = "mnist_10272021_bucket"
# gs://mnist_10272021_bucket
GCS_BUCKET_PATH = f"gs://{GCS_BUCKET}"
# Create the bucket in your project
!gsutil mb -p $GCP_PROJECT_ID $GCS_BUCKET_PATH
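
Because bucket names share one global namespace and must follow naming rules, it can help to sanity-check a candidate name before calling `gsutil mb`. The sketch below covers only a subset of the rules as I understand them (3 to 63 characters; lowercase letters, digits, dashes, and underscores; starting and ending with a letter or digit; no `goog` prefix and no `google` substring). It ignores dotted names, and the helper `looks_like_valid_bucket_name` is hypothetical, so treat it as an assumption rather than an authoritative validator:

```python
import re

# Simplified pattern: 3-63 chars, starts and ends with a lowercase letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Rough local check; only the server can confirm the name is actually available."""
    if not _BUCKET_RE.fullmatch(name):
        return False
    if name.startswith("goog") or "google" in name:
        return False
    return True


print(looks_like_valid_bucket_name("mnist_10272021_bucket"))
```

A name that passes this check can still be rejected if another project already owns it.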

3. Link your billing account to your project

The next step is to set up the billing account for this project. Google Cloud creates a project for you by default, called “My First Project”. Use your Project ID (from step 1) when running the following commands. The first command shows your Billing Account_ID; make a note of it for the next step.

!gcloud beta billing accounts list

Use your Billing Account_ID from above and run the following to link your billing account with your project.

Note: if you use an existing project, you may not see an Account_ID. This means you do not have the proper permissions to run the following commands; contact your admin or create a new project.

BILLING_ACCOUNT_ID = '01F938-DE847D-A19F05'
# GCP_PROJECT_ID = "mnist-10272021"
!gcloud beta billing projects link $GCP_PROJECT_ID --billing-account $BILLING_ACCOUNT_ID

Alternatively, look up the billing account ID in the Cloud Console. Here it is 01F938-DE847D-A19F05.

4. Enable Required APIs for tensorflow-cloud in your project

For tensorflow_cloud we use two specific APIs: the AI Platform Training Jobs API and the Cloud Build API. Note that this is a one-time setup for this project; you do not need to rerun this command for every notebook.

# GCP_PROJECT_ID = "mnist-10272021"
!gcloud services --project $GCP_PROJECT_ID enable ml.googleapis.com cloudbuild.googleapis.com

Alternatively, in the Cloud Console, click ENABLE on the page of each of the two APIs.

5. Create a service account

This step is required to use hyperparameter (HP) tuning on Google Cloud with CloudTuner. To create a service account and give it project editor access, run the following commands and make a note of your service account name.

# GCP_PROJECT_ID = "mnist-10272021"
# Service account name must be between 6 and 30 characters (inclusive), 
# must begin with a lowercase letter, and consist of lowercase alphanumeric 
# characters that can be separated by hyphens.
SERVICE_ACCOUNT_NAME ='mnist-10272021-sa'

SERVICE_ACCOUNT_EMAIL = f'{SERVICE_ACCOUNT_NAME}@{GCP_PROJECT_ID}.iam.gserviceaccount.com'

!gcloud iam --project $GCP_PROJECT_ID service-accounts create $SERVICE_ACCOUNT_NAME
!gcloud projects add-iam-policy-binding $GCP_PROJECT_ID \
    --member serviceAccount:$SERVICE_ACCOUNT_EMAIL \
    --role=roles/editor
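
The naming rule quoted in the comments above can be checked locally before calling `gcloud`. This is a minimal sketch under that stated rule only (6 to 30 characters, starting with a lowercase letter, lowercase alphanumerics separated by hyphens); the helper `service_account_email` is hypothetical, and the regular expression is my own reading of the rule:

```python
import re

# 1 leading letter + 4..28 middle characters + 1 trailing alphanumeric = 6..30 characters.
_SA_NAME_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def service_account_email(name, project_id):
    """Validate the account name, then build the email in the format used above."""
    if not _SA_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid service account name: {name!r}")
    return f"{name}@{project_id}.iam.gserviceaccount.com"


print(service_account_email("mnist-10272021-sa", "mnist-10272021"))
```

Failing early with a `ValueError` avoids a round trip to the API for a name that would be rejected anyway.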

The default AI Platform service account is identified by an email address with the format service-PROJECT_NUMBER@cloud-ml.google.com.iam.gserviceaccount.com. Using your Project number from step one, we construct the service account email and grant the default AI Platform service account admin role (roles/iam.serviceAccountAdmin) on your new service account.

# GCP_PROJECT_ID = "mnist-10272021"
PROJECT_NUMBER = "97885218772"
DEFAULT_AI_PLATFORM_SERVICE_ACCOUNT = f'service-{PROJECT_NUMBER}@cloud-ml.google.com.iam.gserviceaccount.com'

!gcloud iam --project $GCP_PROJECT_ID service-accounts add-iam-policy-binding \
--role=roles/iam.serviceAccountAdmin \
--member=serviceAccount:$DEFAULT_AI_PLATFORM_SERVICE_ACCOUNT \
$SERVICE_ACCOUNT_EMAIL

Alternatively, in the navigation menu, go to IAM & admin → Service accounts and click CREATE SERVICE ACCOUNT (see https://blog.youkuaiyun.com/Linli522362242/article/details/119626524).

You are now ready to run tensorflow-cloud. Note that these steps only need to be run once. Once your project is set up, you can reuse the same project and bucket configuration for future runs. For any new notebook, you will need to repeat step 2 to add your Google Cloud auth credentials.

Make a note of the following values as they are needed to run tensorflow-cloud.

print(f"Your GCP_PROJECT_ID is:       {GCP_PROJECT_ID}")
print(f"Your SERVICE_ACCOUNT_NAME is: {SERVICE_ACCOUNT_NAME}")
print(f"Your BUCKET_NAME is:          {GCS_BUCKET}")

GCP_PROJECT_ID: mnist-10272021

GCS_BUCKET: mnist_10272021_bucket

JOB_NAME: mnist

The JOB_NAME is optional, and you can set it to any string. If you are running multiple training experiments as part of a larger project, for example, you may want to give each of them a unique JOB_NAME.

6. Import required modules

This guide requires TensorFlow Cloud, which you can install via:

!pip install tensorflow_cloud

import os
import sys
import tensorflow as tf
import tensorflow_cloud as tfc

7. Project Configurations

# Set Google Cloud Specific parameters

# set GCP_PROJECT_ID to your own Google Cloud project ID.
GCP_PROJECT_ID = "mnist-10272021"

# set GCS_BUCKET to your own Google Cloud Storage (GCS) bucket.
GCS_BUCKET = "mnist_10272021_bucket"

# DO NOT CHANGE: Currently only the 'us-central1' region is supported.
REGION = "us-central1"

# OPTIONAL: You can change the job name to any string.
JOB_NAME = "mnist"

# Set the locations where training logs and checkpoints will be stored
# gs://mnist_10272021_bucket/mnist
GCS_BASE_PATH = f"gs://{GCS_BUCKET}/{JOB_NAME}"
TENSORBOARD_LOGS_DIR = os.path.join(GCS_BASE_PATH, "logs")        # gs://mnist_10272021_bucket/mnist/logs
MODEL_CHECKPOINT_DIR = os.path.join(GCS_BASE_PATH, "checkpoints") # gs://mnist_10272021_bucket/mnist/checkpoints
SAVED_MODEL_DIR = os.path.join(GCS_BASE_PATH, "saved_model")      # gs://mnist_10272021_bucket/mnist/saved_model
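
`os.path.join` uses the local operating system's path separator, which matches the `/` that `gs://` URIs need on Linux-based runtimes such as Colab but not on Windows. As a sketch of a platform-independent alternative, `posixpath.join` always joins with `/`. The helper `job_paths` is hypothetical and only mirrors the three directories defined above:

```python
import posixpath


def job_paths(bucket, job_name):
    """Build the GCS locations for logs, checkpoints, and the saved model."""
    base = f"gs://{bucket}/{job_name}"
    return {
        "logs": posixpath.join(base, "logs"),
        "checkpoints": posixpath.join(base, "checkpoints"),
        "saved_model": posixpath.join(base, "saved_model"),
    }


for name, path in job_paths("mnist_10272021_bucket", "mnist").items():
    print(name, path)
```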

8. Authenticating the notebook to use your Google Cloud Project

Note: For Kaggle Notebooks click on "Add-ons"->"Google Cloud SDK" before running the cell below.

# Using tfc.remote() to ensure this code only runs in notebook

if not tfc.remote():
    # Authentication for Colab Notebooks
    if "google.colab" in sys.modules:
        print('google.colab')
        from google.colab import auth

        auth.authenticate_user()
        os.environ['GOOGLE_CLOUD_PROJECT'] = GCP_PROJECT_ID

    # Authentication for Kaggle Notebooks
    if "kaggle_secrets" in sys.modules:
        from kaggle_secrets import UserSecretsClient

        UserSecretsClient().set_gcloud_credentials(project=GCP_PROJECT_ID)

9. Model and data setup

From here we are following the basic procedure for setting up a simple Keras model to run classification on the MNIST dataset.

9.1 Load and split data

Read the raw data and split it into training and test sets.

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

9.2 Create a model and prepare for training

Create a simple model and set up a few callbacks for it.

from tensorflow.keras import layers
from tensorflow import keras

model = keras.Sequential(
    [
        keras.Input(shape=(28, 28)),
        # Use a Rescaling layer to make sure input values are in the [0, 1] range.
        layers.experimental.preprocessing.Rescaling(1.0 / 255),
        # The original images have shape (28, 28), so we reshape them to (28, 28, 1)
        layers.Reshape(target_shape=(28, 28, 1)),
        # Follow-up with a classic small convnet
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10),
    ]
)

model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

Quick validation training

We'll train the model for one epoch on a small subset of the data, just to make sure everything is set up correctly. We wrap that training command in `if not tfc.remote()` so that it only runs here, in the runtime environment in which you are reading this, and not when the code is sent to Google Cloud.

if not tfc.remote():
    # Run the training for 1 epoch and a small subset of the data to validate setup
    model.fit( x=x_train[:100], y=y_train[:100], validation_split=0.2, epochs=1 )
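
With `validation_split=0.2`, Keras holds out the last 20% of the samples passed to `fit` (taken before any shuffling), so the quick check above trains on 80 images and validates on 20. A minimal sketch of that arithmetic follows; `split_counts` is a hypothetical helper, not a Keras function:

```python
import math


def split_counts(num_samples, validation_split):
    """Return (training count, validation count) for a given validation fraction."""
    split_at = int(math.floor(num_samples * (1.0 - validation_split)))
    return split_at, num_samples - split_at


print(split_counts(100, 0.2))    # the 100-image quick check
print(split_counts(60000, 0.2))  # the full MNIST training set
```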

9.3 Prepare for remote training

The code below will only run when the notebook code is sent to Google Cloud, not inside the runtime in which you are reading this.

First, we set up callbacks which will:

Create logs for TensorBoard.
Create checkpoints and save them to the checkpoints directory specified above.
Stop model training if loss is not improving sufficiently.

Then we call model.fit and model.save, which (when this code is running on Google Cloud) run the full training (up to 100 epochs, subject to early stopping) and then save the trained model to the GCS bucket and directory defined above.

if tfc.remote():
    # Configure Tensorboard logs
    callbacks = [
                  # gs://mnist_10272021_bucket/mnist/logs
                  tf.keras.callbacks.TensorBoard( log_dir = TENSORBOARD_LOGS_DIR ),
                  # gs://mnist_10272021_bucket/mnist/checkpoints
                  tf.keras.callbacks.ModelCheckpoint( MODEL_CHECKPOINT_DIR, save_best_only=True ),
                  # patience: Number of epochs with no improvement after--which training will be stopped.
                  tf.keras.callbacks.EarlyStopping( monitor="val_loss", min_delta=0.001, patience=3 ), 
                ]
    model.fit( x=x_train, y=y_train, 
               epochs=100, validation_split=0.2, 
               callbacks=callbacks,
               batch_size=100
             )
    # Let's save the model in GCS after the training is complete.
    model.save( SAVED_MODEL_DIR )#gs://mnist_10272021_bucket/mnist/saved_model
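
To make the `EarlyStopping` settings above concrete, here is a simplified sketch of the stopping rule for a monitored loss: an epoch counts as an improvement only if the loss drops by more than `min_delta` below the best value so far, and training stops after `patience` consecutive epochs without improvement. The function `early_stopping_epoch` is hypothetical, the loss values are made up, and Keras options such as `baseline` or `restore_best_weights` are left out:

```python
def early_stopping_epoch(val_losses, min_delta=0.001, patience=3):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement large enough to count: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: large gains first, then changes smaller than min_delta.
losses = [0.50, 0.40, 0.35, 0.3495, 0.3499, 0.3493]
print(early_stopping_epoch(losses))
```

With these values the last three epochs never beat 0.35 by more than 0.001, so the rule triggers on the sixth epoch.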

Start the remote training

TensorFlow Cloud takes all the code from its local execution environment (this notebook), wraps it up, and sends it to Google Cloud for execution. (That's why the `if` and `if not tfc.remote` wrappers are important.)
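
The guard pattern can be illustrated without tensorflow-cloud. In this sketch, `running_remotely` is a stand-in for `tfc.remote()`, and the environment variable name `DEMO_RUNNING_REMOTELY` is purely hypothetical; it only shows how one script takes different branches depending on where it runs:

```python
import os

# Hypothetical flag used only for this demo; it is not the mechanism tensorflow-cloud uses.
DEMO_FLAG = "DEMO_RUNNING_REMOTELY"


def running_remotely():
    """Stand-in for tfc.remote(): True when the demo flag is set."""
    return bool(os.environ.get(DEMO_FLAG))


def steps_executed():
    """List which guarded steps of the same script would run in the current environment."""
    steps = []
    if not running_remotely():
        steps.append("authenticate notebook")
        steps.append("quick 1-epoch check")
    if running_remotely():
        steps.append("full training")
        steps.append("save model")
    return steps


print(steps_executed())
```

Setting the flag in the environment before calling `steps_executed()` switches the result to the remote-only steps.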

This step will prepare your code from this notebook for remote execution and then start a remote training job on Google Cloud Platform to train the model.

First we add the `tensorflow-cloud` Python package to a `requirements.txt` file, which will be sent along with the code in this notebook. You can add other packages here as needed.
Then a GPU and a CPU image are specified. You only need to specify one or the other; the GPU is used in the code that follows.
Finally, the heart of TensorFlow Cloud: the call to `tfc.run`. When this is executed inside this notebook, all the code from this notebook, and the rest of the files in this directory, will be packaged and sent to Google Cloud for execution. The parameters of the `run` method specify the details of the execution environment and the distribution strategy (if any) to be used.

# If you are using a custom image you can install modules via a requirements.txt file.
with open("requirements.txt", "w") as f:
    f.write("tensorflow-cloud\n")

# Optional: some recommended base images.
# If you provide none, the system will choose one for you.
TF_GPU_IMAGE = "gcr.io/deeplearning-platform-release/tf2-gpu.2-5"
TF_CPU_IMAGE = "gcr.io/deeplearning-platform-release/tf2-cpu.2-5"

# Submit a single-node training job using a GPU.
tfc.run(
    distribution_strategy="auto",
    requirements_txt="requirements.txt",
    # Passing the bucket as image_build_bucket builds the Docker image
    # with Google Cloud Build instead of your local Docker instance.
    docker_config=tfc.DockerConfig(
        parent_image=TF_GPU_IMAGE,
        image_build_bucket=GCS_BUCKET,  # "mnist_10272021_bucket"
    ),
    chief_config=tfc.COMMON_MACHINE_CONFIGS["K80_1X"],
    job_labels={"job": JOB_NAME},
)
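
Since `JOB_NAME` can be any string but is also reused as a label value, a local check of `job_labels` before submitting can catch a rejected job early. The constraints below (lowercase letters, digits, underscores, and dashes; at most 63 characters; keys starting with a letter) are the Google Cloud label rules as I understand them, so treat them as an assumption. `invalid_labels` is a hypothetical helper:

```python
import re

_KEY_RE = re.compile(r"[a-z][a-z0-9_-]{0,62}")
_VALUE_RE = re.compile(r"[a-z0-9_-]{0,63}")


def invalid_labels(labels):
    """Return the keys whose key or value breaks the assumed label rules."""
    bad = []
    for key, value in labels.items():
        if not _KEY_RE.fullmatch(key) or not _VALUE_RE.fullmatch(value):
            bad.append(key)
    return bad


print(invalid_labels({"job": "mnist"}))
print(invalid_labels({"job": "MNIST run 1"}))
```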

Once the job is submitted, you can go to the next step to monitor the job's progress via TensorBoard.

TENSORBOARD_LOGS_DIR

Training Results

Reconnect your Colab instance

Most remote training jobs are long running. If you are using Colab, it may time out before the training results are available.

In that case, **rerun the following sections in order** to reconnect and configure your Colab instance to access the training results.

6. Import required modules
7. Project Configurations
8. Authenticating the notebook to use your Google Cloud Project

**DO NOT** rerun the rest of the code.

Load Tensorboard

While the training is in progress you can use Tensorboard to view the results. Note the results will show only after your training has started. This may take a few minutes.

# Load the TensorBoard notebook extension (reload it if it is already loaded)
%reload_ext tensorboard
%tensorboard --logdir $TENSORBOARD_LOGS_DIR

Load your trained model

Once training is complete, you can retrieve your model from the GCS Bucket you specified above.

trained_model = tf.keras.models.load_model( SAVED_MODEL_DIR )
trained_model.summary()

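import numpy as np

# Predict on the first three test images; the model outputs one logit per class
X_new = x_test[:3]
Y_pred = trained_model.predict(X_new)

# The index of the largest logit is the predicted digit
np.argmax( Y_pred, axis=-1 )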
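
Because the final layer is `Dense(10)` with no activation and the loss is configured with `from_logits=True`, `predict` returns raw logits rather than probabilities. If probabilities are wanted, a softmax can be applied afterwards. The sketch below uses only the standard library and made-up logit values for a three-class case, so it illustrates the transformation and is not output from the trained model:

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.1]  # made-up values
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

The predicted class is the same whether argmax is taken over the logits or over the probabilities, since softmax preserves ordering.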

Disabling projects linked to your billing account
