Building a Production-Grade Deep Learning Pipeline Step by Step: TFX in Practice

Originally published 2025-12-19 14:08:44

License: CC 4.0 BY-SA

Tags: #DeepLearning #ArtificialIntelligence #MachineLearning #MLOps #TensorFlow #ProgrammingStories #AIGC

Tensorflow Extended (TFX) in action: build a production ready deep learning pipeline

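In this tutorial, we will explore TensorFlow Extended (TFX). Developed by Google, TFX is an end-to-end platform for deploying production machine learning pipelines. We will see how to build a pipeline from scratch and explore the built-in components that cover the whole machine learning lifecycle, from research and development to training and deployment.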

First, let's start with some basic concepts and terminology to make sure we are on the same page.

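I highly recommend the "ML Pipelines on Google Cloud" course by the Google Cloud team or the "Advanced Deployment Scenarios with TensorFlow" course by DeepLearning.ai if you want to build these skills through a comprehensive curriculum.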

TFX glossary

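A component is the basic building block of a pipeline and performs all the work. Components can be used as-is or overridden with custom code.
The metadata store is the single source of truth for all components. It mainly holds three things:
- Artifacts and their properties: for example trained models, data, and evaluation metrics.
- Execution records of components and pipelines.
- Workflow metadata (component order, inputs, outputs, and so on).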
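
To make the three kinds of records more concrete, here is a minimal in-memory sketch. It is only an illustration of the idea: the class and method names (ToyMetadataStore, put_artifact, record_execution, lineage) are hypothetical and are not the ML Metadata API that TFX actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """Something a component produced, e.g. a dataset, a model, or metrics."""
    artifact_id: int
    type_name: str
    uri: str
    properties: dict = field(default_factory=dict)


@dataclass
class Execution:
    """One run of one component, linking input artifacts to output artifacts."""
    component: str
    input_ids: list
    output_ids: list


class ToyMetadataStore:
    """Hypothetical in-memory stand-in for a metadata store."""

    def __init__(self):
        self.artifacts = {}
        self.executions = []

    def put_artifact(self, type_name, uri, **properties):
        artifact = Artifact(len(self.artifacts) + 1, type_name, uri, properties)
        self.artifacts[artifact.artifact_id] = artifact
        return artifact

    def record_execution(self, component, inputs, outputs):
        self.executions.append(
            Execution(component,
                      [a.artifact_id for a in inputs],
                      [a.artifact_id for a in outputs]))

    def lineage(self, artifact):
        """Return the components that contributed to `artifact`, upstream first."""
        contributors = []
        for execution in self.executions:
            if artifact.artifact_id in execution.output_ids:
                for input_id in execution.input_ids:
                    contributors.extend(self.lineage(self.artifacts[input_id]))
                contributors.append(execution.component)
        return contributors


store = ToyMetadataStore()
examples = store.put_artifact("Examples", "/tmp/examples", splits=["train", "eval"])
stats = store.put_artifact("ExampleStatistics", "/tmp/stats")
store.record_execution("ExampleGen", inputs=[], outputs=[examples])
store.record_execution("StatisticsGen", inputs=[examples], outputs=[stats])
print(store.lineage(stats))
```

Because every execution records which artifacts it consumed and produced, the store can answer questions such as "which components led to this artifact?", which is what makes it useful as a single source of truth.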
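A TFX pipeline is a portable implementation of a machine learning workflow, made up of component instances and input parameters.
An orchestrator is the system that executes a TFX pipeline. Orchestrators are essentially platforms for authoring, scheduling, and monitoring workflows. They usually represent a pipeline as a directed acyclic graph (DAG) and make sure that every job (or worker) runs at the right time with the right input.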

Popular orchestrators that work with TFX include Apache Airflow, Apache Beam, and Kubeflow Pipelines.

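For the different stages of the machine learning lifecycle, TFX provides a set of components with standard functionality. These components can be overridden: we can extend their behavior or replace them entirely with new ones. In most cases, though, the built-in components cover most needs.

Let's take a quick tour of all the components, from data loading to deployment. Note that we won't dive deep into the code, because it involves many new libraries and packages that most people are not familiar with.

The point is to give an overview of TFX and its modules and to help you understand why we need such an end-to-end solution.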

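Data ingestion

The first step in the machine learning development process is data loading. The ExampleGen component ingests data into a TFX pipeline by converting different types of data into TFRecord files of tf.Example records (both of which TFX supports). Sample code follows: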

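from tfx.components import ImportExampleGen
from tfx.proto import example_gen_pb2

# data_root is assumed to be the directory that holds the train/ and test/
# subfolders with the input files.
input_config = example_gen_pb2.Input(splits=[
    example_gen_pb2.Input.Split(name='train', pattern='train/*'),
    example_gen_pb2.Input.Split(name='eval', pattern='test/*'),
])

# Ingest the files under data_root, one split per pattern.
example_gen = ImportExampleGen(
    input_base=data_root, input_config=input_config)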

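ImportExampleGen is a special type of ExampleGen that takes a data path and a configuration describing how to handle the data. In this example, the data is split into a training set and an evaluation set: files matching train/* go to the train split and files matching test/* go to the eval split.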
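
The pattern-based split can be pictured with a few lines of plain Python. This is a minimal sketch of the idea only, not how ExampleGen is implemented; the assign_splits helper and the file names are made up for illustration.

```python
from fnmatch import fnmatch


def assign_splits(relative_paths, split_patterns):
    """Group file paths by the first split whose glob pattern they match."""
    splits = {name: [] for name in split_patterns}
    for path in relative_paths:
        for name, pattern in split_patterns.items():
            if fnmatch(path, pattern):
                splits[name].append(path)
                break  # A file belongs to at most one split.
    return splits


# Hypothetical file listing, relative to the input base directory.
files = [
    "train/part-0.tfrecord",
    "train/part-1.tfrecord",
    "test/part-0.tfrecord",
    "README.md",
]
splits = assign_splits(files, {"train": "train/*", "eval": "test/*"})
print(splits)
```

Files that match no pattern, such as README.md here, simply end up in no split.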

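Data validation

The next step is to explore the data, visualize it, and check it for possible errors and anomalies.

The StatisticsGen component generates a set of useful statistics that describe the data distribution. As you can see, it takes the output of ExampleGen as its input.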

from tfx.components import StatisticsGen

statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
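
To give a feel for what "statistics that describe the data distribution" means, here is a minimal pure-Python sketch that computes a few per-feature summaries. The describe function and the sample records are hypothetical; the real component computes far richer statistics and does so at scale.

```python
from statistics import mean


def describe(examples):
    """Compute simple per-feature summary statistics over dict examples."""
    feature_names = sorted({name for example in examples for name in example})
    summary = {}
    for name in feature_names:
        # Treat absent keys and None values as missing.
        values = [ex[name] for ex in examples if ex.get(name) is not None]
        numeric = [v for v in values if isinstance(v, (int, float))]
        summary[name] = {
            "count": len(values),
            "missing": len(examples) - len(values),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
            "mean": mean(numeric) if numeric else None,
        }
    return summary


# Hypothetical records; the third one has a missing width.
examples = [
    {"width": 32, "label": 3},
    {"width": 64, "label": 7},
    {"width": None, "label": 1},
]
stats = describe(examples)
print(stats["width"])
```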

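TensorFlow Data Validation (TFDV) is a library that ships with TFX and, among other things, helps visualize the statistics produced by StatisticsGen. StatisticsGen uses it internally, but it can also be used as a standalone tool.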

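import tensorflow_data_validation as tfdv

# stats is assumed to hold statistics that were loaded beforehand, for
# example the output of StatisticsGen.
tfdv.visualize_statistics(stats)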

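The same library is used by SchemaGen, which generates an initial schema for the data. The schema can of course be adjusted based on domain knowledge, but it is a decent starting point.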

from tfx.components import SchemaGen

schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=True)
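
As a rough illustration of what "inferring an initial schema" means, the sketch below derives each feature's type and whether it is always present from a handful of records. It is a simplified stand-in under my own assumptions (the infer_schema helper and the dictionary layout are hypothetical), not the schema format TFX produces.

```python
def infer_schema(examples):
    """Infer each feature's type and whether it appears in every example."""
    observed = {}
    for example in examples:
        for name, value in example.items():
            entry = observed.setdefault(name, {"types": set(), "seen": 0})
            entry["types"].add(type(value).__name__)
            entry["seen"] += 1
    return {
        name: {
            # A feature with more than one observed type is flagged as mixed.
            "type": next(iter(entry["types"])) if len(entry["types"]) == 1 else "mixed",
            "required": entry["seen"] == len(examples),
        }
        for name, entry in observed.items()
    }


# Hypothetical training records.
train = [
    {"image": b"\x89PNG", "label": 3},
    {"image": b"\x89PNG", "label": 7, "source": "camera"},
]
schema = infer_schema(train)
print(schema)
```

An inferred schema like this only reflects the data it has seen, which is why it is worth reviewing by hand before relying on it.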

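We can now use the generated schema and statistics to perform a form of data validation that catches outliers, anomalies, and errors in the dataset.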

from tfx.components import ExampleValidator

example_validator = ExampleValidator(
      statistics=statistics_gen.outputs['statistics'],
      schema=schema_gen.outputs['schema'])
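
Conceptually, validating data against a schema comes down to checks like the ones below. This is a minimal sketch with a hand-written schema; the find_anomalies function, the schema layout, and the sample records are all hypothetical and much simpler than what ExampleValidator does.

```python
def find_anomalies(examples, schema):
    """Return one human-readable message per schema violation."""
    anomalies = []
    for index, example in enumerate(examples):
        for name, spec in schema.items():
            if name not in example:
                if spec["required"]:
                    anomalies.append(
                        f"example {index}: missing required feature '{name}'")
                continue
            value = example[name]
            if not isinstance(value, spec["type"]):
                anomalies.append(
                    f"example {index}: '{name}' has type "
                    f"{type(value).__name__}, expected {spec['type'].__name__}")
            elif "domain" in spec and value not in spec["domain"]:
                anomalies.append(
                    f"example {index}: '{name}' value {value!r} "
                    "is outside the expected domain")
        for name in example:
            if name not in schema:
                anomalies.append(f"example {index}: unexpected feature '{name}'")
    return anomalies


# Hand-written schema: labels are expected to be integers from 0 to 9.
schema = {
    "image": {"type": bytes, "required": True},
    "label": {"type": int, "required": True, "domain": range(10)},
}
eval_examples = [
    {"image": b"\x89PNG", "label": 3},
    {"image": b"\x89PNG", "label": 42},
    {"label": "cat", "caption": "a cat"},
]
anomalies = find_anomalies(eval_examples, schema)
for message in anomalies:
    print(message)
```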

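Feature engineering

One of the most important steps in any machine learning pipeline is feature engineering. Basically, we need to preprocess the data so that it can be passed to the model. TFX provides the Transform component and the tensorflow_transform library to help with this task. The transform step can be run like this: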

from tfx.components import Transform

transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=module_file)

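But that is not all.

We need to define the preprocessing logic somehow. That is the role of the module_file argument. The most common approach is to keep all transformations in a separate file. Essentially, we need to implement a preprocessing_fn function, which is the entry point for TFX.

Here is an example borrowed from the official TFX examples: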

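import tensorflow as tf

# Assumed values; the full example defines these at module level.
_IMAGE_KEY = 'image'
_LABEL_KEY = 'label'


def _transformed_name(key):
  """Adds a suffix that marks a feature as transformed."""
  return key + '_xf'


def preprocessing_fn(inputs):
  """tf.transform's callback function for preprocessing inputs."""
  outputs = {}

  # Decode every serialized PNG into a uint8 tensor with 3 channels.
  image_features = tf.map_fn(
      lambda x: tf.io.decode_png(x[0], channels=3),
      inputs[_IMAGE_KEY],
      dtype=tf.uint8)
  # Resize to the input size MobileNet expects and scale the pixel values.
  image_features = tf.cast(image_features, tf.float32)
  image_features = tf.image.resize(image_features, [224, 224])
  image_features = tf.keras.applications.mobilenet.preprocess_input(
      image_features)

  outputs[_transformed_name(_IMAGE_KEY)] = image_features
  outputs[_transformed_name(_LABEL_KEY)] = inputs[_LABEL_KEY]

  return outputs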

As you can see, this is plain TensorFlow and Keras code.
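
The last preprocessing call, mobilenet.preprocess_input, rescales pixel values from the range [0, 255] to [-1, 1]. The pure-Python sketch below shows that arithmetic on a tiny nested list; scale_pixel and scale_image are hypothetical helper names used only for illustration.

```python
def scale_pixel(value):
    """Map an 8-bit channel value in [0, 255] to the range [-1.0, 1.0]."""
    return value / 127.5 - 1.0


def scale_image(image):
    """Apply scale_pixel to a nested list shaped [height][width][channels]."""
    return [[[scale_pixel(c) for c in pixel] for pixel in row] for row in image]


# A 1x1 image with three channels.
image = [[[0, 255, 51]]]
scaled = scale_image(image)
print(scaled)
```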

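Model training

Training the model is a key part of the process and, contrary to what many people believe, it is not a one-off operation.

Models need to be retrained regularly to stay relevant and to keep their results as accurate as possible.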

from tfx.components import Trainer
from tfx.components.trainer.executor import GenericExecutor
from tfx.dsl.components.base import executor_spec
from tfx.proto import trainer_pb2

# module_file and labels_path are assumed to be defined earlier.
trainer = Trainer(
    module_file=module_file,
    # The generic executor calls the run_fn defined in module_file.
    custom_executor_spec=executor_spec.ExecutorClassSpec(GenericExecutor),
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    train_args=trainer_pb2.TrainArgs(num_steps=160),
    eval_args=trainer_pb2.EvalArgs(num_steps=4),
    custom_config={'labels_path': labels_path})

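As before, the training logic lives in a separate module file. This time we have to implement the run_fn function, which typically defines the model and the training loop. Again borrowed from the official examples and stripped of some non-essential parts, here is a sample: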

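import absl.logging
import tensorflow as tf
import tensorflow_transform as tft
from tfx.components.trainer.fn_args_utils import FnArgs

# The helpers (_input_fn, _build_keras_model), the underscore-prefixed
# constants, steps_per_epoch and tensorboard_callback are defined in the
# full example and omitted here.


def run_fn(fn_args: FnArgs):
  """Trains the model; the Trainer component calls this entry point."""
  tf_transform_output = tft.TFTransformOutput(fn_args.transform_output)

  # Build batched datasets from the transformed examples.
  train_dataset = _input_fn(
      fn_args.train_files,
      tf_transform_output,
      is_train=True,
      batch_size=_TRAIN_BATCH_SIZE)
  eval_dataset = _input_fn(
      fn_args.eval_files,
      tf_transform_output,
      is_train=False,
      batch_size=_EVAL_BATCH_SIZE)

  model, base_model = _build_keras_model()

  model.compile(
      loss='sparse_categorical_crossentropy',
      # learning_rate replaces the deprecated lr argument.
      optimizer=tf.keras.optimizers.RMSprop(
          learning_rate=_FINETUNE_LEARNING_RATE),
      metrics=['sparse_categorical_accuracy'])
  model.summary(print_fn=absl.logging.info)

  model.fit(
      train_dataset,
      epochs=_CLASSIFIER_EPOCHS,
      steps_per_epoch=steps_per_epoch,
      validation_data=eval_dataset,
      validation_steps=fn_args.eval_steps,
      callbacks=[tensorboard_callback])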

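Note that _build_keras_model builds a standard tf.keras.Sequential model (and returns it together with its base model), while _input_fn returns a batched dataset of training examples and labels.

See the official Git repository for the complete code. Also note that, with the right callback, we can use TensorBoard to visualize the training progress.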

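Model validation

Next up is model validation. Once the model is trained, we have to evaluate it and analyze its performance before pushing it to production. TensorFlow Model Analysis (TFMA) is the library for this purpose. Note that the actual model evaluation already happens during training.

This step records the evaluation metrics for future runs and compares them with those of previous models.

This way, we can be confident that the current model is the best one we have so far.

We won't go into the details of TFMA, but here is some code for future reference: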

import tensorflow_model_analysis as tfma

eval_config = tfma.EvalConfig(
    model_specs=[
        tfma.ModelSpec(label_key='label_xf', model_type='tf_lite')
    ],
    slicing_specs=[tfma.SlicingSpec()],
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(
                class_name='SparseCategoricalAccuracy',
                threshold=tfma.MetricThreshold(
                    # Absolute bar: accuracy below 0.55 fails validation.
                    value_threshold=tfma.GenericValueThreshold(
                        lower_bound={'value': 0.55}),
                    # The change threshold is ignored if no baseline model
                    # is resolved from MLMD (first run).
                    change_threshold=tfma.GenericChangeThreshold(
                        direction=tfma.MetricDirection.HIGHER_IS_BETTER,
                        absolute={'value': -1e-3})))
        ])
    ])
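
The two thresholds in this configuration can be read as a simple decision rule. The function below is a plain-Python sketch of that rule under my own assumptions: passes_thresholds is a hypothetical name, and the inclusive handling of values that sit exactly on a threshold is a guess rather than TFMA's documented behavior.

```python
def passes_thresholds(candidate, baseline=None, lower_bound=0.55,
                      min_change=-1e-3):
    """Decide whether a candidate accuracy clears both thresholds.

    Higher accuracy is better. Treating values exactly on a threshold as
    passing is an assumption of this sketch.
    """
    # Value threshold: the metric itself must not fall below the bound.
    if candidate < lower_bound:
        return False
    # Change threshold: skipped when there is no baseline (first run).
    if baseline is None:
        return True
    # The candidate may be at most |min_change| worse than the baseline.
    return (candidate - baseline) >= min_change


print(passes_thresholds(0.60))                 # first run, above the bar
print(passes_thresholds(0.60, baseline=0.70))  # much worse than baseline
```

With an absolute change threshold of -1e-3, a new model is accepted as long as its accuracy is not more than 0.001 below that of the baseline.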

The key part is defining the Evaluator component in the pipeline:

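from tfx.components import Evaluator

# model_resolver is assumed to be a resolver node, defined in the full
# example, that supplies the latest blessed model as the baseline.
evaluator = Evaluator(
    examples=transform.outputs['transformed_examples'],
    model=trainer.outputs['model'],
    baseline_model=model_resolver.outputs['model'],
    eval_config=eval_config)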

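Pushing the model

Once model validation succeeds, it is time to push the model to production. That is the job of the Pusher component, which handles all the deployment details depending on the environment.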

from tfx.components import Pusher
from tfx.proto import pusher_pb2

# The model is pushed only if the Evaluator blessed it.
pusher = Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=pusher_pb2.PushDestination(
        filesystem=pusher_pb2.PushDestination.Filesystem(
            base_directory=serving_model_dir)))
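
For a filesystem destination, the gist of a push is "copy the model into the serving directory, but only if it was blessed". The sketch below shows that idea with the standard library only. The push_model function and the versioned sub-directory layout are assumptions made for illustration, not a description of the Pusher internals.

```python
import shutil
import tempfile
import time
from pathlib import Path


def push_model(model_dir, base_directory, blessed, version=None):
    """Copy model_dir to base_directory/<version> if the model is blessed.

    Returns the destination path, or None if nothing was pushed.
    """
    if not blessed:
        return None
    # Assumption: use a timestamp as the version unless one is given.
    version = str(version if version is not None else int(time.time()))
    destination = Path(base_directory) / version
    shutil.copytree(model_dir, destination)
    return destination


with tempfile.TemporaryDirectory() as workspace:
    # Stand-in for a trained model directory.
    model_dir = Path(workspace) / "trained_model"
    model_dir.mkdir()
    (model_dir / "saved_model.pb").write_bytes(b"placeholder")
    serving_dir = Path(workspace) / "serving_model"

    rejected = push_model(model_dir, serving_dir, blessed=False)
    pushed = push_model(model_dir, serving_dir, blessed=True, version=1)
    pushed_files = sorted(p.name for p in pushed.iterdir())
    print(rejected, pushed.name, pushed_files)
```

Keeping every pushed model in its own sub-directory means a serving system can pick up the newest version without overwriting older ones.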

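Building the TFX pipeline

So far we have defined many components that contain everything we need. But how do we tie them together? A TFX pipeline is defined with the Pipeline class, which takes a list of components among other arguments.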

from tfx.orchestration import metadata
from tfx.orchestration import pipeline

components = [
    example_gen, statistics_gen, schema_gen, example_validator,
    transform,
    trainer, model_resolver, evaluator, pusher
]

pipeline = pipeline.Pipeline(
    pipeline_name=pipeline_name,
    pipeline_root=pipeline_root,
    components=components,
    enable_cache=True)

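Component instances produce artifacts as outputs and usually depend on artifacts produced by upstream component instances as inputs. The execution order of the components is determined by a directed acyclic graph (DAG) built from these artifact dependencies. Below is a typical TFX pipeline:

Source: Google Cloud Platform documentation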
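
To see how an execution order falls out of artifact dependencies, here is a minimal sketch that uses graphlib from the Python standard library (3.9 or later). The COMPONENTS dictionary is a hypothetical, simplified description of the pipeline in this article, and execution_order is an illustrative helper, not part of TFX.

```python
from graphlib import TopologicalSorter

# Hypothetical description: each component lists the artifacts it reads
# and the artifacts it writes.
COMPONENTS = {
    "ExampleGen": {"inputs": [], "outputs": ["examples"]},
    "StatisticsGen": {"inputs": ["examples"], "outputs": ["statistics"]},
    "SchemaGen": {"inputs": ["statistics"], "outputs": ["schema"]},
    "ExampleValidator": {"inputs": ["statistics", "schema"],
                         "outputs": ["anomalies"]},
    "Transform": {"inputs": ["examples", "schema"],
                  "outputs": ["transformed_examples", "transform_graph"]},
    "Trainer": {"inputs": ["transformed_examples", "transform_graph", "schema"],
                "outputs": ["model"]},
    "Evaluator": {"inputs": ["transformed_examples", "model"],
                  "outputs": ["blessing"]},
    "Pusher": {"inputs": ["model", "blessing"], "outputs": ["pushed_model"]},
}


def execution_order(components):
    """Return component names in an order that respects artifact dependencies."""
    # Map every artifact to the component that produces it.
    producer = {
        artifact: name
        for name, spec in components.items()
        for artifact in spec["outputs"]
    }
    # For each component, collect the upstream components it depends on.
    graph = {
        name: {producer[artifact] for artifact in spec["inputs"]}
        for name, spec in components.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(COMPONENTS)
print(order)
```

Several valid orders can exist (ExampleValidator and Transform do not depend on each other, for instance), which is exactly the freedom an orchestrator can use to run independent components in parallel.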

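Running the TFX pipeline

Finally, we reach the part where we run the pipeline. As mentioned, a pipeline is executed by an orchestrator, which handles all the job scheduling and networking. Here we use Apache Beam (through BeamDagRunner), but the same principle applies to Kubeflow or Airflow.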

from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

if __name__ == '__main__':
    BeamDagRunner().run(pipeline)

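It is also worth mentioning that similar commands can be run from the command line with the TFX CLI.

In practice, an orchestrator such as Apache Beam will run on cloud resources in the vast majority of use cases.
This means that Beam will spin up cloud instances/workers and stream data through them, depending on the environment and the pipeline.

Typical runners under Apache Beam include Spark, Flink, and Google Dataflow. Frameworks such as Kubeflow, on the other hand, rely on Kubernetes. An important part of an MLOps engineer's job is therefore to find the best environment for their needs.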

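Conclusion

End-to-end machine learning systems have gained a lot of attention over the past few years. With many startups and frameworks emerging, MLOps is becoming more and more important, and TFX is a prime example. Admittedly, building such a pipeline is not easy and requires digging into the intricacies of TFX. Still, I think it is one of the best tools we currently have. So the next time you want to deploy a machine learning model, it may be worth a try.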

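Once again, I recommend the "ML Pipelines on Google Cloud" course by the Google Cloud team or the "Advanced Deployment Scenarios with TensorFlow" course by DeepLearning.ai.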