TensorFlow in practice: combining feature_column with pre-made estimators

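I have recently been learning TensorFlow v2. I had used v1 before and found it laborious to write; v2 is much easier to pick up. The combination I see recommended most often online is feature_column + estimator, which lets us quickly assemble established models for experiments and can also be used in production. The demo code follows; if anything is wrong, corrections are welcome and much appreciated.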

import numpy as np
import pandas as pd
from sklearn.utils import shuffle
from matplotlib import pyplot as plt

Preparing the dataset

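The data comes from the Criteo dataset: 600,000 randomly selected rows, split train:test = 8:2. Deep models are demanding about data quality, so a good deal of preprocessing is needed before training, including filling missing values, filtering low-frequency features, and some simple transformations.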

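# Some commonly used helper functions
# Log-transform a continuous feature; zero is mapped to log(epson) to avoid log(0)
def log_fn(x, epson=0.000001):
    assert(x>=0.0)
    if x == 0.0:
        return np.log(epson)
    else:
        return np.log(x)

# Mean and standard deviation of a column, ignoring missing values
def column_mean_and_std_fn(df, col):
    not_nan = df[col].notna()
    tmp_df = df[col][not_nan]
    return  tmp_df.mean(), tmp_df.std()

# Fill missing feature values
def column_fillna(df, col, value):
    tmp_df = df[col].fillna(value)
    return tmp_df

# input_fn: this function supplies the data to the model's train method.
# estimator.train accepts training data in two forms: a Dataset, or tensors.
# With tensors, every feature transformation has to be implemented by hand, which is tedious.
def input_fn(features, labels, training=True, batch_size=256, num_epochs=5):
    ds = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    # Shuffle only in training mode; batching and repeating apply in both modes.
    if training:
        ds = ds.shuffle(10000)
    ds = ds.batch(batch_size).repeat(num_epochs)
    return ds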
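
Because input_fn applies batch before repeat, each pass over the data yields ceil(n / batch_size) batches, and the whole pipeline yields that many batches times num_epochs. The sketch below is plain Python arithmetic, not TensorFlow code, and the helper name batches_per_run is made up for illustration. With the 480,000-row training split used later in this post, batch_size=256 and num_epochs=1, it gives the 1875 steps that appear in the training log.

```python
import math

def batches_per_run(num_rows, batch_size=256, num_epochs=5):
    """Number of batches produced by ds.batch(batch_size).repeat(num_epochs).

    Hypothetical helper for illustration only; it mirrors the order of
    operations in input_fn (batch first, then repeat).
    """
    batches_per_epoch = math.ceil(num_rows / batch_size)  # the last batch may be partial
    return batches_per_epoch * num_epochs

# 80% of 600,000 rows, one epoch, as in the training call later in the post.
print(batches_per_run(480000, batch_size=256, num_epochs=1))  # 1875
# Relying on the default num_epochs=5 repeats the data five times.
print(batches_per_run(480000, batch_size=256))  # 9375
```

Any call that relies on the default num_epochs=5, including a call with training=False, iterates over the data five times.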

Reading the data and preprocessing

df = pd.read_csv("./criteo_sampled_data.csv")

df = df[0:600000]
df = shuffle(df)

columns = df.columns
numeric_columns = columns[1:14]
categorical_columns = columns[14:]

numeric_columns = ['I1', 'I3', 'I4', 'I5', 'I6', 'I7', 'I8', 'I9', 'I10', 'I11', 'I12', 'I13']
categorical_columns = ['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8','C9', 'C10', 'C11', 
                       'C12', 'C13', 'C14', 'C15', 'C16', 'C17', 'C18','C19', 'C20', 'C21',
                       'C22', 'C23', 'C24', 'C25', 'C26', 'I2']

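I strongly recommend keeping the input data in a DataFrame; it pairs perfectly with feature_column.
It is best to group the features into two classes: continuous (numeric) and discrete (categorical). 'I2' is an integer column, but because the Criteo features are anonymized, I placed it in the categorical group; 'I2' also has no NULL values.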

# Fill missing values
# for numeric
for col in numeric_columns:
    mean, std = column_mean_and_std_fn(df, col)
    a = column_fillna(df, col, mean)
    #print(mean)
    df[col] = a
    df[col] = df[col].apply(lambda x:log_fn(x))

# for cate
for col in categorical_columns:
    value = ("%s_nan"%(col))
    df[col] = df[col].fillna(value)

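Note: missing values in continuous features are filled with the column mean. Do not fill them with 0 or another arbitrary number, because a missing feature and a feature whose value is 0 are two entirely different things. The mean is the safer default; the fill value can also be chosen through offline experiments.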
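
To see why the fill value matters in this pipeline, recall that the filled column is passed through log_fn afterwards. The sketch below uses a made-up toy column and only the standard library; it copies the logic of log_fn with math.log in place of np.log. A missing value filled with 0 ends up at log(1e-6), far outside the range of the observed values, whereas a mean fill stays inside it.

```python
import math

def log_fn(x, epson=0.000001):
    # Same logic as the log_fn defined earlier, using math.log instead of np.log.
    assert x >= 0.0
    return math.log(epson) if x == 0.0 else math.log(x)

# Toy column with one missing entry (values invented for illustration).
column = [2.0, 8.0, None, 32.0]
observed = [v for v in column if v is not None]
mean = sum(observed) / len(observed)  # 14.0

mean_filled = [log_fn(mean if v is None else v) for v in column]
zero_filled = [log_fn(0.0 if v is None else v) for v in column]

print(round(mean_filled[2], 2))  # 2.64, inside the range of the observed logs
print(round(zero_filled[2], 2))  # -13.82, an extreme outlier
```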

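Next we need to define how the features are used. A DNN can take continuous features as input directly, without any conversion. For categorical features there are two options:
1. one-hot encoding
2. embedding encoding
One-hot encoding makes the input layer of the network very large, so essentially all NN models today use embeddings. Either way, we need a feature vocabulary to act as the index. Even with embeddings, the feature is first mapped to its position in the vocabulary (the position a one-hot vector would mark), and that position is then used to look up the embedding vector in a table.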
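
The relation between one-hot encoding and an embedding lookup can be sketched in a few lines of plain Python. The vocabulary and the 2-dimensional table below are invented for illustration; the point is only that multiplying a one-hot vector by the embedding table selects the same row as indexing the table directly.

```python
vocabulary = ["R", "G", "B"]   # feature vocabulary (toy values)
embedding_table = [            # one row per vocabulary entry, dimension 2
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

def one_hot(value):
    """One-hot vector whose length equals the vocabulary size."""
    return [1.0 if v == value else 0.0 for v in vocabulary]

def embed_by_matmul(value):
    """Multiply one_hot(value) by the embedding table."""
    vec = one_hot(value)
    dim = len(embedding_table[0])
    return [sum(vec[i] * embedding_table[i][j] for i in range(len(vocabulary)))
            for j in range(dim)]

def embed_by_lookup(value):
    """Direct row lookup through the vocabulary index."""
    return embedding_table[vocabulary.index(value)]

print(one_hot("G"))                                  # [0.0, 1.0, 0.0]
print(embed_by_lookup("G"))                          # [0.3, 0.4]
print(embed_by_matmul("G") == embed_by_lookup("G"))  # True
```

The lookup avoids materializing the large, mostly zero one-hot vector, which is the practical reason embeddings scale to big vocabularies.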

# Build the vocabulary for each categorical feature

I2 = df['I2']  # no NaN values
I2_info = df.groupby(['I2']).size().sort_values(ascending=False)
select_features = I2_info[I2_info>=150]
I2_feature_list = select_features.index.tolist()

categorical_columns_dict = {}
omit_columns = ['I2', 'C3', 'C9', 'C12', 'C16', 'C21']
for col in categorical_columns[0:]:
    if col in omit_columns:
        continue
    tmp = df.groupby([col]).size().sort_values(ascending=False).cumsum() / 600000
    feature_list = tmp[tmp<=0.95]
    categorical_columns_dict[col] = feature_list.index.tolist()
    
categorical_columns_dict['I2'] = I2_feature_list
# print(categorical_columns_dict)

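Two points need explaining here:
1. Low-frequency feature values must be filtered out. How to do this depends on the distribution of the values and takes some manual judgment. After looking at the statistics, I found that many features have very long-tailed distributions, so it is enough to build the vocabulary from the values that cover 95% of the rows.
2. Because of practical constraints, some features are ignored, namely those in the omit_columns list. Their values are spread thinly across the samples; including them would make the vocabulary, and therefore the model, very large, exhaust memory, and slow down training, so I dropped them. These are quite possibly user-ID-like features and would probably be useful.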
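
The 95% rule from point 1 can be sketched without pandas. The function name coverage_vocabulary and the toy data are assumptions made for illustration; the logic follows the groupby/cumsum code above: sort values by frequency, accumulate their share of the rows, and keep values while the cumulative share stays at or below the threshold.

```python
from collections import Counter

def coverage_vocabulary(values, coverage=0.95):
    """Most frequent values whose cumulative share of rows is <= coverage."""
    counts = Counter(values)
    total = len(values)
    vocabulary, running = [], 0
    for value, count in counts.most_common():  # descending frequency
        running += count
        if running / total > coverage:
            break
        vocabulary.append(value)
    return vocabulary

# Toy column: cumulative shares are 0.60, 0.90, 0.96, 1.00.
column = ["a"] * 60 + ["b"] * 30 + ["c"] * 6 + ["d"] * 4
print(coverage_vocabulary(column))  # ['a', 'b']
```

One consequence of the `<=` comparison, shared by the pandas version: if the single most frequent value already covers more than the threshold, the vocabulary comes out empty.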

Building the feature columns

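feature_column is a high-level API that TensorFlow 2 strongly recommends. We only declare which transformations to apply to each feature; we do not have to implement the transformation logic ourselves. For example, to embed a column we call tf.feature_column.embedding_column, pass in a categorical column that carries the column name and its vocabulary, together with the embedding dimension, and get back a DenseColumn object that can be used directly as input to an NN model.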

# feature columns
import tensorflow as tf
from tensorflow.keras import layers,models
from sklearn.utils import shuffle

# Embedding dimension
embedding_dim = 16

# Categorical feature columns: same as above, minus the omitted columns
categorical_columns_new = [col for col in categorical_columns if col not in omit_columns]
categorical_columns_new.append('I2')

tf_feature_columns = []

# for numeric
for col in numeric_columns:
    numeric_col = tf.feature_column.numeric_column(col)
    tf_feature_columns.append(numeric_col)

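Continuous features are fed to the model directly; they can of course also be bucketized first. tf.feature_column.numeric_column takes a column name (a string) and declares that the feature stored under the key 'col' in the training data should be treated as a continuous value and fed to the model as is. It returns a NumericColumn object.
Note: tf.estimator accepts only a handful of column types directly as model input. NumericColumn is one of them; others are mentioned below.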
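
Bucketizing, mentioned above as an alternative, replaces a continuous value by the index of the interval it falls into. The sketch below shows the idea with the standard library's bisect module; the boundaries are made up, and the function illustrates the concept rather than the code TensorFlow runs (in feature_column terms the corresponding wrapper is tf.feature_column.bucketized_column).

```python
import bisect

def bucketize(value, boundaries):
    """Return the bucket index of value for sorted boundaries.

    With boundaries [b0, b1, b2] the buckets are
    (-inf, b0), [b0, b1), [b1, b2), [b2, +inf).
    """
    return bisect.bisect_right(boundaries, value)

boundaries = [0.0, 2.0, 5.0]  # hypothetical cut points on a log-transformed feature
samples = [-1.0, 0.0, 1.5, 5.0, 9.0]
print([bucketize(x, boundaries) for x in samples])  # [0, 1, 1, 3, 3]
```

The resulting index is a categorical value, so it would then be one-hot encoded or embedded like any other categorical feature.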

# Embed the categorical features before feeding them to the model
for col in categorical_columns_new:
    cate_col = tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            col
            , categorical_columns_dict[col]
        )
        , embedding_dim)
    tf_feature_columns.append(cate_col)

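A categorical feature may be numeric, such as an age of 18, 19, or 25, or a string, such as gender (man, woman). Either way, categorical features must be converted before they can be fed to the model. There are generally two ways to use them: one-hot or embedding.

  1. If a categorical feature has few distinct values, it can be one-hot encoded, using tf.feature_column.indicator_column.
  2. If a categorical feature has many distinct values, embedding is the practical choice, using tf.feature_column.embedding_column.

Both methods build on tf.feature_column.categorical_column_with_vocabulary_list. This function takes the column name 'col', declaring that the column is to be treated as a categorical feature, plus a second argument: the list of values the column may take. (When the vocabulary is stored in a file, the sibling function categorical_column_with_vocabulary_file is used instead.) TensorFlow needs this vocabulary internally when it builds the one-hot or embedding encoding. For example: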

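colors = tf.feature_column.categorical_column_with_vocabulary_list(key='colors', vocabulary_list=('X', 'R', 'G', 'B', 'Y'))
This statement says that the colors column is a categorical feature with five possible values: X, R, G, B, and Y. Note that colors is a CategoricalColumn object and cannot be used as model input directly; it has to be wrapped with indicator_column or embedding_column first. Taking embedding_column as the example and continuing with colors:
colors_emb=tf.feature_column.embedding_column(colors, dimension=16)
colors_emb is the embedded version of the colors feature, with dimension 16. The call returns a DenseColumn, which can be used directly as estimator input.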
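
What the vocabulary does can be sketched in plain Python. The helpers below are illustrative stand-ins, not TensorFlow functions: a value is mapped to its position in the vocabulary, and a value outside the vocabulary gets the id -1, which is assumed here to match the documented default (default_value=-1) of categorical_column_with_vocabulary_list. In this sketch an out-of-vocabulary id simply becomes an all-zero indicator vector.

```python
vocabulary_list = ("X", "R", "G", "B", "Y")
index_of = {value: i for i, value in enumerate(vocabulary_list)}

def to_id(value, default_value=-1):
    """Position of value in the vocabulary, or default_value if it is absent."""
    return index_of.get(value, default_value)

def to_indicator(value):
    """One-hot (indicator) vector; all zeros for an out-of-vocabulary value."""
    idx = to_id(value)
    return [1 if i == idx else 0 for i in range(len(vocabulary_list))]

print(to_id("G"))              # 2
print(to_id("purple"))         # -1
print(to_indicator("G"))       # [0, 0, 1, 0, 0]
print(to_indicator("purple"))  # [0, 0, 0, 0, 0]
```

The out-of-vocabulary path is the one taken by the low-frequency values that were filtered out of the vocabulary earlier.
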
# Crossed features
import copy

dnn_tf_feature_columns = copy.deepcopy(tf_feature_columns)

linear_tf_feature_columns = copy.deepcopy(tf_feature_columns)

crossed_features  =  tf.feature_column.crossed_column(['C2','C23'], 6*188)
linear_tf_feature_columns.append(crossed_features)
crossed_features = tf.feature_column.crossed_column(['C18', 'C23'], 6*864)
linear_tf_feature_columns.append(crossed_features)

print(len(dnn_tf_feature_columns))
print(len(linear_tf_feature_columns))
34
36

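tf.feature_column also provides a way to cross features, as in the code above. tf.feature_column.crossed_column takes two arguments: a list with the names of the columns to cross, and the number of distinct values the crossed feature may take. Because the values of the new crossed feature cannot be known in advance, the cross is hashed, and this second argument is the number of hash buckets. The function returns a CrossedColumn, which can be used directly as input to the linear part of the model. Since I am using the wide_and_deep model, low-order crosses only need to go to the linear part, so I did not embed them.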
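
The hashing idea can be sketched as follows. This is an illustration only: TensorFlow uses its own fingerprint hash, so the bucket ids below will not match what crossed_column produces. md5 from hashlib stands in as a deterministic hash (Python's built-in hash() of strings changes between runs), and the function name and sample values are made up.

```python
import hashlib

def crossed_bucket(values, hash_bucket_size):
    """Map a combination of feature values to a bucket in [0, hash_bucket_size)."""
    key = "_X_".join(str(v) for v in values)  # join the values into one key
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

bucket_size = 6 * 188  # same size as the C2 x C23 cross above
b1 = crossed_bucket(["c2_value_a", "c23_value_b"], bucket_size)
b2 = crossed_bucket(["c2_value_a", "c23_value_b"], bucket_size)
print(b1 == b2)               # True: the same pair always lands in the same bucket
print(0 <= b1 < bucket_size)  # True
```

Different pairs can collide in one bucket; a larger hash_bucket_size lowers the collision rate at the cost of more weights in the linear part.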

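That completes the definition of the feature transformations: we have declared what transformation each feature undergoes and in what form it enters the model. Next we define the model structure.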

Model

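TensorFlow 2 provides the estimator module, which ships many established models ready-made; we do not need to implement them ourselves and only have to call one constructor. Here we use the well-known wide_and_deep model, which in estimator is called DNNLinearCombinedClassifier.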
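
The idea behind wide and deep at prediction time is that the linear (wide) part and the DNN (deep) part each produce a logit, the two are added, and a sigmoid turns the sum into a probability. The sketch below shows only that final combination step with made-up numbers; the function names are hypothetical and none of this is estimator code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def wide_and_deep_probability(wide_logit, deep_logit):
    """Combine the two parts by summing their logits, then apply a sigmoid."""
    return sigmoid(wide_logit + deep_logit)

# Invented logits for a single example.
p = wide_and_deep_probability(wide_logit=-0.4, deep_logit=-0.7)
print(round(p, 4))  # 0.2497
```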

from tensorflow.estimator import DNNLinearCombinedClassifier,RunConfig

# run time config
run_config = RunConfig(save_summary_steps=100
                       , save_checkpoints_steps=100000
                       , keep_checkpoint_max=2)

# define model
estimator_wd_v2 = DNNLinearCombinedClassifier(linear_feature_columns=linear_tf_feature_columns
                                             , dnn_feature_columns=dnn_tf_feature_columns
                                             , n_classes=2
                                             , dnn_hidden_units=[256,128, 32]
                                             , config=run_config
                                              ,dnn_optimizer='Adam')

print(type(estimator_wd_v2))
WARNING:tensorflow:Using temporary folder as model directory: C:\Users\admin\AppData\Local\Temp\tmp6b9waqp_
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\admin\\AppData\\Local\\Temp\\tmp6b9waqp_', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 100000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 2, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
<class 'tensorflow_estimator.python.estimator.canned.dnn_linear_combined.DNNLinearCombinedClassifierV2'>

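As shown, we instantiate the DNNLinearCombinedClassifier class and get back an instance, where:
1. linear_feature_columns: the features used by the linear part
2. dnn_feature_columns: the features used by the deep part
3. dnn_hidden_units: the layer sizes of the deep part
These are the most basic parameters. There are many others, such as the optimizer (which has a large effect), dropout, and batch normalization; see the official documentation for details.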
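
The model above passes dnn_optimizer='Adam' rather than relying on the default AdaGrad for the deep part. One reason the choice matters is how AdaGrad scales its steps: it divides the base learning rate by the square root of the running sum of squared gradients, so the effective step can only shrink. The sketch below follows that textbook update for a single parameter with a constant gradient; the learning rate and gradient values are arbitrary, and details such as TensorFlow's initial accumulator value are left out.

```python
import math

def adagrad_effective_rates(gradients, learning_rate=0.05, eps=1e-7):
    """Effective step size learning_rate / (sqrt(sum of g^2) + eps) after each step."""
    accumulator = 0.0
    rates = []
    for g in gradients:
        accumulator += g * g  # the history of squared gradients never decays
        rates.append(learning_rate / (math.sqrt(accumulator) + eps))
    return rates

rates = adagrad_effective_rates([1.0] * 10000)
print(round(rates[0], 4))     # 0.05
print(round(rates[99], 4))    # 0.005
print(round(rates[9999], 4))  # 0.0005
```

Adam replaces the ever-growing sum with an exponentially decaying average, so its effective step does not shrink toward zero in the same way.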

Next, we call the instance's train method to train the model:

split_point = int(600000*0.8)
df_train = df[0:split_point]
df_test = df[split_point:]

# train
train_labels = df_train['label'].to_numpy()
train_features = df_train.drop(['label'], axis=1)

# test
test_labels = df_test['label'].to_numpy()
test_features = df_test.drop(['label'], axis=1)

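# Note: when calling train, we have to hand the training data to the model.
# The input_fn argument takes a function, and that function is where the data is supplied.
# This is where the input_fn defined at the start comes in: it returns a tf.data.Dataset, which the estimator accepts as input.
# A rather unusual design!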
# https://tensorflow.google.cn/versions/r2.0/api_docs/python/tf/estimator/DNNLinearCombinedClassifier#train
estimator_wd_v2.train(input_fn=lambda: input_fn(train_features, train_labels, training=True, num_epochs=1), steps=None)
WARNING:tensorflow:From C:\Users\admin\anaconda3\envs\tf2\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From C:\Users\admin\anaconda3\envs\tf2\lib\site-packages\tensorflow\python\training\training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Layer dnn is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

WARNING:tensorflow:Layer linear/linear_model is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

WARNING:tensorflow:From C:\Users\admin\anaconda3\envs\tf2\lib\site-packages\tensorflow\python\feature_column\feature_column_v2.py:540: Layer.add_variable (from tensorflow.python.keras.engine.base_layer_v1) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.add_weight` method instead.
WARNING:tensorflow:From C:\Users\admin\anaconda3\envs\tf2\lib\site-packages\tensorflow\python\keras\optimizer_v2\ftrl.py:144: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into C:\Users\admin\AppData\Local\Temp\tmp6b9waqp_\model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 0.58141315, step = 0
INFO:tensorflow:global_step/sec: 10.2417
INFO:tensorflow:loss = 0.5014093, step = 100 (9.733 sec)
INFO:tensorflow:global_step/sec: 41.5206
INFO:tensorflow:loss = 0.4694633, step = 200 (2.393 sec)
INFO:tensorflow:global_step/sec: 43.769
INFO:tensorflow:loss = 0.49514118, step = 300 (2.285 sec)
INFO:tensorflow:global_step/sec: 44.6175
INFO:tensorflow:loss = 0.48476872, step = 400 (2.241 sec)
INFO:tensorflow:global_step/sec: 44.7999
INFO:tensorflow:loss = 0.5150919, step = 500 (2.232 sec)
INFO:tensorflow:global_step/sec: 42.872
INFO:tensorflow:loss = 0.53324294, step = 600 (2.333 sec)
INFO:tensorflow:global_step/sec: 44.8966
INFO:tensorflow:loss = 0.493083, step = 700 (2.243 sec)
INFO:tensorflow:global_step/sec: 44.0126
INFO:tensorflow:loss = 0.5207912, step = 800 (2.256 sec)
INFO:tensorflow:global_step/sec: 43.5266
INFO:tensorflow:loss = 0.4375774, step = 900 (2.297 sec)
INFO:tensorflow:global_step/sec: 44.1225
INFO:tensorflow:loss = 0.48617166, step = 1000 (2.266 sec)
INFO:tensorflow:global_step/sec: 43.3785
INFO:tensorflow:loss = 0.4590522, step = 1100 (2.305 sec)
INFO:tensorflow:global_step/sec: 43.3431
INFO:tensorflow:loss = 0.43658033, step = 1200 (2.307 sec)
INFO:tensorflow:global_step/sec: 43.2788
INFO:tensorflow:loss = 0.44661155, step = 1300 (2.326 sec)
INFO:tensorflow:global_step/sec: 43.5881
INFO:tensorflow:loss = 0.4669834, step = 1400 (2.300 sec)
INFO:tensorflow:global_step/sec: 39.5629
INFO:tensorflow:loss = 0.47193682, step = 1500 (2.522 sec)
INFO:tensorflow:global_step/sec: 40.9582
INFO:tensorflow:loss = 0.4268831, step = 1600 (2.426 sec)
INFO:tensorflow:global_step/sec: 41.8295
INFO:tensorflow:loss = 0.44979426, step = 1700 (2.393 sec)
INFO:tensorflow:global_step/sec: 41.221
INFO:tensorflow:loss = 0.43585172, step = 1800 (2.424 sec)
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 1875...
INFO:tensorflow:Saving checkpoints for 1875 into C:\Users\admin\AppData\Local\Temp\tmp6b9waqp_\model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 1875...
INFO:tensorflow:Loss for final step: 0.46456712.





<tensorflow_estimator.python.estimator.canned.dnn_linear_combined.DNNLinearCombinedClassifierV2 at 0x1b256a81748>

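During training the loss is logged every 100 steps. It falls from about 0.58 at step 0 to about 0.46 at the final step. If the number of epochs is raised to 40, the loss on the training set can be driven down to around 0.06, which shows that the model has ample capacity to fit the data and that the feature preparation steps work as intended.
Notes:

  1. About num_epochs: the estimator's train method has no epoch parameter, so epochs can only be set inside input_fn. The number of times the training set is repeated is the number of epochs.
  2. The deep part of DNNLinearCombinedClassifier uses AdaGrad by default, and this optimizer has a weakness. When I trained with the default, the loss dropped to 0.48 and then stayed essentially flat, and more epochs did not help. Only after switching to Adam did the loss drop noticeably. The reason is that AdaGrad's learning rate accumulates the entire gradient history, so the effective learning rate keeps shrinking as training proceeds and optimization gets slower and slower.

Next, we can look at how the model performs on the full train set and on the test set; this only requires calling the evaluate function of the trained instance.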
train_result_wd_v2 = estimator_wd_v2.evaluate(input_fn=lambda: input_fn(train_features, train_labels, training=False) )
print(train_result_wd_v2)
eval_result_wd_v2 = estimator_wd_v2.evaluate(input_fn=lambda: input_fn(test_features, test_labels, training=False) )
print(eval_result_wd_v2)
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Layer dnn is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

WARNING:tensorflow:Layer linear/linear_model is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-05-25T00:06:00Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\admin\AppData\Local\Temp\tmp6b9waqp_\model.ckpt-1875
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 270.25759s
INFO:tensorflow:Finished evaluation at 2020-05-25-00:10:31
INFO:tensorflow:Saving dict for global step 1875: accuracy = 0.7863625, accuracy_baseline = 0.7429625, auc = 0.7944995, auc_precision_recall = 0.58948576, average_loss = 0.45612475, global_step = 1875, label/mean = 0.2570375, loss = 0.45612475, precision = 0.64911103, prediction/mean = 0.25902134, recall = 0.36751285
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1875: C:\Users\admin\AppData\Local\Temp\tmp6b9waqp_\model.ckpt-1875
{'accuracy': 0.7863625, 'accuracy_baseline': 0.7429625, 'auc': 0.7944995, 'auc_precision_recall': 0.58948576, 'average_loss': 0.45612475, 'label/mean': 0.2570375, 'loss': 0.45612475, 'precision': 0.64911103, 'prediction/mean': 0.25902134, 'recall': 0.36751285, 'global_step': 1875}
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Layer dnn is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

WARNING:tensorflow:Layer linear/linear_model is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-05-25T00:11:13Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\admin\AppData\Local\Temp\tmp6b9waqp_\model.ckpt-1875
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 58.74664s
INFO:tensorflow:Finished evaluation at 2020-05-25-00:12:12
INFO:tensorflow:Saving dict for global step 1875: accuracy = 0.78076667, accuracy_baseline = 0.7463, auc = 0.7726389, auc_precision_recall = 0.5565996, average_loss = 0.4708935, global_step = 1875, label/mean = 0.2537, loss = 0.47086808, precision = 0.61942714, prediction/mean = 0.25820503, recall = 0.352319
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1875: C:\Users\admin\AppData\Local\Temp\tmp6b9waqp_\model.ckpt-1875
{'accuracy': 0.78076667, 'accuracy_baseline': 0.7463, 'auc': 0.7726389, 'auc_precision_recall': 0.5565996, 'average_loss': 0.4708935, 'label/mean': 0.2537, 'loss': 0.47086808, 'precision': 0.61942714, 'prediction/mean': 0.25820503, 'recall': 0.352319, 'global_step': 1875}

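On the full train set the trained model reaches loss 0.45612475 and AUC 0.7944995; on the test set, loss 0.47086808 and AUC 0.7726389. Possible reasons for the gap between the two:

  1. The training data is too small, so even with a random split the train and test distributions differ.
  2. The model has overfit the training data.
    The second cause can be checked by adding regularization such as dropout; that is not demonstrated here.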
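
For reference, dropout randomly zeroes a fraction of a layer's activations during training and rescales the rest so that the expected value is unchanged. The sketch below is a plain-Python illustration of that (inverted) dropout rule with a fixed seed and made-up activations. With the estimator itself one would instead set the dnn_dropout argument of DNNLinearCombinedClassifier, which is not run here.

```python
import random

def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale survivors by 1/(1-rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]

rng = random.Random(0)       # fixed seed so the run is repeatable
activations = [1.0] * 10000  # toy layer output
dropped = inverted_dropout(activations, rate=0.2, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(round(zero_fraction, 2))  # close to 0.2
print(round(mean_after, 2))     # close to 1.0
```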

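That is an example of using feature_column and estimator together. It is simple and fairly easy to pick up. If none of the pre-made models in estimator is what you need, feature_column can also be combined with other APIs to build your own network structure.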
