Part II: Feature Data Preprocessing


1. Processing features with tf.feature_column

1.1 Application scenario

tf.feature_column is designed for structured data, such as data stored in CSV files. Feature columns (FC) act as the bridge between structured data and the model (feature columns as a bridge to map from columns in a CSV to features used to train the model): a feature_column takes a pandas DataFrame column upstream and feeds a DenseFeatures layer downstream.
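A minimal sketch of that bridge (toy data; the column name "age" is just an example): a DataFrame column is wrapped in a feature column and turned into a dense model input by the DenseFeatures layer.

import pandas as pd
import tensorflow as tf

df = pd.DataFrame({"age": [22.0, 38.0, 26.0]})
age_fc = tf.feature_column.numeric_column("age")     # one feature column per DataFrame column
dense = tf.keras.layers.DenseFeatures([age_fc])      # feature columns -> dense model input
print(dense({"age": df["age"].to_numpy()}))          # tf.Tensor of shape (3, 1)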

Summing weighted categorical data:

In some cases you need to handle categorical data in which every occurrence of a category carries an associated weight. In feature columns this is handled by tf.feature_column.weighted_categorical_column. When paired with indicator_column, the effect is to sum the weights for each category.
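A minimal, hedged sketch of this pairing (the feature names 'colors' and 'weights' are made up for illustration):

import tensorflow as tf

features = {
    'colors': [['R', 'G'], ['B', 'B']],
    'weights': [[1.0, 2.0], [0.5, 0.5]],
}
colors = tf.feature_column.categorical_column_with_vocabulary_list(
    'colors', vocabulary_list=['R', 'G', 'B'])
weighted = tf.feature_column.weighted_categorical_column(colors, 'weights')
# indicator_column sums the weights per category: row 0 -> [1.0, 2.0, 0.0], row 1 -> [0.0, 0.0, 1.0]
dense = tf.keras.layers.DenseFeatures([tf.feature_column.indicator_column(weighted)])
print(dense(features))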

The Keras preprocessing equivalent uses CategoryEncoding with weighted inputs in "count" mode:

import numpy as np
import tensorflow as tf

layer = tf.keras.layers.CategoryEncoding(
          num_tokens=4, output_mode="count")
count_weights = np.array([[.1, .2], [.1, .1], [.2, .3], [.4, .2]])
layer([[0, 1], [0, 0], [1, 2], [3, 1]], count_weights=count_weights)





1.2 Usage summary

  1. tf.feature_column provides different processing functions for different data types (numeric, categorical). Text data needs its own text embedding and is outside the scope of FC.
  2. Each FC handles a single column. A DataFrame usually contains many columns, so the FCs are typically collected in a list (feature_columns).
  3. The first layer of the model is tf.keras.layers.DenseFeatures(feature_columns) (see the sketch below).
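
A minimal sketch of this pattern (toy columns and values; the full Titanic example in 1.3 follows the same structure):

import pandas as pd
import tensorflow as tf

df = pd.DataFrame({"age": [22.0, 38.0], "fare": [7.25, 71.28], "label": [0, 1]})
feature_columns = [tf.feature_column.numeric_column(c) for c in ["age", "fare"]]  # one FC per column

model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(feature_columns),   # first layer of the model
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit({c: df[c].to_numpy() for c in ["age", "fare"]}, df["label"].to_numpy(), epochs=1, verbose=0)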

1.3 Practical example

Download the data from the Kaggle Titanic competition page.

from datetime import datetime
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models


# Print a timestamped log message
def printLog(info):
    now_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print("\n" + "==========" * 8 + "%s" % now_time)
    print(info + '...\n\n')


printLog("step1: prepare dataset...")

train_raw = pd.read_csv('train.csv')
test_raw = pd.read_csv('test.csv')

data = pd.concat([train_raw, test_raw])


# Handle missing values
def prepare_df_data(df_raw):
    df = df_raw.copy()
    df.columns = [x.lower() for x in df.columns]
    df = df.rename(columns={'survived': 'label'})
    df = df.drop(['passengerid', 'name'], axis=1)  # drop columns that are not useful
    for col, dtype in dict(df.dtypes).items():
        # check whether the column contains any missing values
        if df[col].hasnans:
            # add an indicator column that flags the missing entries
            df[col + '_nan'] = pd.isna(df[col]).astype('int32')
            # fill the missing values: mean for numeric columns, '' for string columns
            if dtype.kind not in ('O', 'U', 'S'):  # np.object/np.str aliases are removed in recent NumPy
                df[col].fillna(df[col].mean(), inplace=True)
            else:
                df[col].fillna('', inplace=True)
    return df


pre_data = prepare_df_data(data)
train = pre_data.iloc[0:len(train_raw), :]
test = pre_data.iloc[len(train_raw):, :]


# Load the data with tf.data
def df_to_dataset(df, shuffle=True, batch_size=32):
    df_data = df.copy()
    if 'label' not in df_data.columns:  # prediction set: no label column
        ds = tf.data.Dataset.from_tensor_slices(df_data.to_dict(orient='list'))
    else:
        labels = df_data.pop('label').values
        ds = tf.data.Dataset.from_tensor_slices((df_data.to_dict(orient='list'), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(df_data))
    ds = ds.batch(batch_size)
    return ds


ds_train = df_to_dataset(train)
ds_test = df_to_dataset(test)

# Define the feature columns with tf.feature_column
printLog("step2: make feature columns...")

feature_columns = []
feature_inputs = {}  # inputs for the functional API model
# numeric columns (exclude label_nan, which would leak the label indicator)
for col in ['age', 'fare', 'parch', 'sibsp'] + [c for c in pre_data.columns if c.endswith('_nan') and c != 'label_nan']:
    feature_columns.append(tf.feature_column.numeric_column(col))

    feature_inputs[col] = layers.Input(shape=(1,), name=col, dtype=tf.float32)

# bucketized column
age = tf.feature_column.numeric_column('age')
age_buckets = tf.feature_column.bucketized_column(age,
                                                  boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)
feature_inputs['age'] = layers.Input((1,), name='age', dtype=tf.float32)  # age is a float feature (input already created in the loop above)

# indicator_columns

indicator_column = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
        'pclass', vocabulary_list=pre_data['pclass'].unique()))

feature_columns.append(indicator_column)

feature_inputs['pclass'] = layers.Input((1,), name='pclass', dtype=tf.int64)

# Categorical columns. Note: every Categorical Column must ultimately be converted to a
# Dense Column (e.g. via indicator_column) before it can be passed to the model!
# indicator_columns
indicator_column_names = ['sex', 'embarked']
for col_name in indicator_column_names:
    indicator_column = tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            col_name, vocabulary_list=pre_data[col_name].unique()))

    feature_columns.append(indicator_column)

    feature_inputs[col_name] = layers.Input((1,), name=col_name, dtype=tf.string)

# hashed feature column
ticket = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_hash_bucket('ticket', 3))

feature_columns.append(ticket)
feature_inputs['ticket'] = layers.Input((1,), name='ticket', dtype=tf.string)

# embedding columns
cabin = tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_hash_bucket('cabin', 32), 2)

feature_columns.append(cabin)
feature_inputs['cabin'] = layers.Input((1,), name='cabin', dtype=tf.string)

# crossed column
pclass_cate = tf.feature_column.categorical_column_with_vocabulary_list(
    key='pclass', vocabulary_list=[1, 2, 3])

crossed_feature = tf.feature_column.indicator_column(
    tf.feature_column.crossed_column([age_buckets, pclass_cate], hash_bucket_size=15))

feature_columns.append(crossed_feature)

# Define the model
printLog("step3: define model...")
# model 1: functional API
features = layers.DenseFeatures(feature_columns)(feature_inputs)
# print(features.shape, features)
x = layers.Dense(64, activation="relu")(features)
x = layers.Dropout(rate=0.2)(x)
x = layers.Dense(64, activation="relu")(x)
output = layers.Dense(1, activation="sigmoid")(x)

model_1 = tf.keras.Model(feature_inputs, output)

# model 2: Sequential API
model = tf.keras.Sequential([
    layers.DenseFeatures(feature_columns),
    layers.Dense(128, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Train the model
printLog("step4: train model...")

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(ds_train,
                    validation_data=ds_test,
                    epochs=30)

# Evaluate the model
printLog("step5: eval model...")


def plot_metric(history, metric):
    train_metrics = history.history[metric]
    val_metrics = history.history['val_' + metric]
    epochs = range(1, len(train_metrics) + 1)
    plt.plot(epochs, train_metrics, 'bo--')
    plt.plot(epochs, val_metrics, 'ro-')
    plt.title('Training and validation ' + metric)
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend(["train_" + metric, 'val_' + metric])
    plt.show()


plot_metric(history, "accuracy")


2. Processing features with Keras preprocessing layers

The biggest benefit of doing preprocessing with preprocessing layers is that the resulting model carries its own preprocessing. This helps build an end-to-end model and minimizes the burden on callers: users of the model can feed raw strings to it directly.
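A minimal sketch of the idea (toy vocabulary): because the StringLookup layer is part of the model, callers can pass raw strings directly.

import tensorflow as tf

lookup = tf.keras.layers.StringLookup(vocabulary=["cat", "dog", "bird"])
inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
x = lookup(inputs)                                           # raw string -> integer index
x = tf.keras.layers.Embedding(lookup.vocabulary_size(), 4)(x)
x = tf.keras.layers.Flatten()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

print(model(tf.constant([["dog"], ["fish"]])))               # "fish" falls into the OOV bucket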

See also the official guide "Working with preprocessing layers".

2.1 Overview of the preprocessing layers

Layer: purpose
TextVectorization: text vectorization. Call adapt() on the data first so the layer can learn from it.
Normalization: normalizes numeric features. Call adapt() on the data first.
Discretization: bins numeric features into categories. Call adapt() on the data first.
CategoryEncoding: one-hot, multi-hot or TF-IDF encoding of categories that have already been converted to indices; usually combined with StringLookup / IntegerLookup. Call adapt() on the data first.
Hashing: hashes a feature.
StringLookup: converts string categories to integer indices. Call adapt() on the data first.
IntegerLookup: converts numeric categories to integer indices. Call adapt() on the data first.
CategoryCrossing: crosses several columns to create new features.
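
Since the Discretization layer listed above has no example elsewhere in this post, here is a minimal sketch (toy data, recent TF versions) of its adapt-then-apply workflow:

import tensorflow as tf

data = tf.constant([[0.1], [0.4], [0.8], [1.5], [2.3], [3.0]])
discretizer = tf.keras.layers.Discretization(num_bins=3)
discretizer.adapt(data)                                   # learn the bin boundaries from the data
print(discretizer(tf.constant([[0.2], [1.0], [2.9]])))    # integer bucket indices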

Note: embedding string data with a vocabulary.
For larger vocabularies, an embedding is usually needed to get good performance. Below is an example that embeds a string feature with feature columns:

# This snippet follows the TF feature-column migration guide, where tf1 is
# tensorflow.compat.v1 and call_feature_columns is a small helper that applies
# a DenseFeatures layer built from the given columns to the given inputs.
vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(
    'col',
    vocabulary_list=['small', 'medium', 'large'],
    num_oov_buckets=0)
embedding_col = tf1.feature_column.embedding_column(vocab_col, 4)
call_feature_columns(embedding_col, {'col': ['small', 'medium', 'large']})

With Keras preprocessing layers, the same thing is achieved by combining a tf.keras.layers.StringLookup layer with a tf.keras.layers.Embedding layer. The default output of StringLookup is integer indices that can be fed directly into the embedding.

Note: the Embedding layer contains trainable parameters. While the StringLookup layer can be applied to data inside or outside a model, the Embedding must always be part of a trainable Keras model in order to be trained correctly.

string_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=['small', 'medium', 'large'], num_oov_indices=0)
embedding = tf.keras.layers.Embedding(3, 4)
embedding(string_lookup_layer(['small', 'medium', 'large']))

2.2 Usage tutorial

The snippets below assume the experimental preprocessing module has been imported (from tensorflow.keras.layers.experimental import preprocessing) together with the usual numpy / tensorflow imports, and use small toy data.

2.2.1 Numeric features

# Create a Normalization layer and set its internal state using the training data
# (here `data` is any numeric array or tf.data.Dataset of training values)
normalizer = preprocessing.Normalization()
normalizer.adapt(data)

2.2.2 One-hot / multi-hot encoding

Keras only accepts sequence inputs of equal length. When the dataset contains sequences of unequal length, pad_sequences() can be used to pad them into new sequences of the same length.

For both CategoryEncoding and StringLookup, the feature sequences must all have the same length (see the reference docs for details).
For sequences of unequal length, use tf.keras.preprocessing.sequence.pad_sequences(), as sketched below.
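
A minimal sketch of padding ragged sequences (toy data):

import tensorflow as tf

sequences = [[1, 2, 3], [4, 5], [6]]
padded = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding="post")  # pad with zeros at the end
print(padded)
# [[1 2 3]
#  [4 5 0]
#  [6 0 0]]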

Categorical features

data = tf.constant([["a"], ["b"], ["c"], ["b"], ["c"], ["a"]])

indexer = preprocessing.StringLookup()   # Use StringLookup to build an index of the feature values
indexer.adapt(data)
# encoder = tf.keras.layers.Embedding(input_dim=len(indexer.get_vocabulary()), output_dim=16)  # alternative: map indices to embeddings
encoder = preprocessing.CategoryEncoding(output_mode="binary")  # Use CategoryEncoding to encode the integer indices to a one-hot vector
encoder.adapt(indexer(data))

Discrete numeric (integer categorical) features

# Define some toy data
data = tf.constant([[10], [20], [20], [10], [30], [0]])

# Use IntegerLookup to build an index of the feature values
indexer = preprocessing.IntegerLookup()
indexer.adapt(data)

# Use CategoryEncoding to encode the integer indices to a one-hot vector
encoder = preprocessing.CategoryEncoding(output_mode="binary")
encoder.adapt(indexer(data))

# Convert new test data (which includes unknown feature values)
test_data = tf.constant([10, 10, 20, 50, 60, 0])
encoded_data = encoder(indexer(test_data))
print(encoded_data)

2.2.3 Feature hashing

data = np.random.randint(0, 100000, size=(10000, 1))

# Use the Hashing layer to hash the values into the range [0, 64)
hasher = preprocessing.Hashing(num_bins=64, salt=1337)

# Use the CategoryEncoding layer to one-hot encode the hashed values
encoder = preprocessing.CategoryEncoding(max_tokens=64, output_mode="binary")
encoded_data = encoder(hasher(data))
print(encoded_data.shape)

2.2.4 Text vectorization

  1. In TF 1.x this is the function tf.nn.embedding_lookup.
  2. In TF 2.x it is the layer layers.Embedding(input_dim, output_dim, embeddings_initializer='uniform', weights=[weight]), where input_dim is the size of the vocabulary. weights: pass pre-trained word vectors here if you already have them, e.g. vectors from Google's Word2Vec (see the sketch after this list).
  3. The Embedding layer can only be used as the first layer of a model; it turns positive integers (indices) into dense vectors of fixed size. The tf.keras.layers.TextVectorization, tf.keras.layers.StringLookup and tf.keras.layers.IntegerLookup preprocessing layers can produce the input for an Embedding layer.
  4. tf.keras.layers.TextVectorization converts texts of different lengths into arrays of the same length.
  5. tf.keras.preprocessing.sequence.pad_sequences(sequences, padding="post"), where sequences is a list of sequences (each sequence is a list of integers). See the "Masking and padding" guide for its use in text classification.
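
A hedged sketch of item 2 above, initializing an Embedding layer with existing word vectors (a random matrix stands in for real Word2Vec vectors; embeddings_initializer is used here as an alternative to the weights argument):

import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 100, 8
pretrained = np.random.rand(vocab_size, embed_dim)            # stand-in for pre-trained word vectors
embedding = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embed_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained),
    trainable=False)                                           # freeze the vectors if they should stay fixed
print(embedding(np.array([[1, 2, 3]])).shape)                  # (1, 3, 8)

The original TextVectorization + Embedding + LSTM example follows:
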
# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "int" output_mode
text_vectorizer = preprocessing.TextVectorization(output_mode="int")
# Index the vocabulary via `adapt()`
text_vectorizer.adapt(data) 

# You can retrieve the vocabulary we indexed via get_vocabulary()
vocab = text_vectorizer.get_vocabulary()
# input_dim=text_vectorizer.vocabulary_size()
print("Vocabulary:", vocab)

# Create an Embedding + LSTM model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)        # 得到 data 中字典下标组成的数组
x = layers.Embedding(input_dim=len(vocab), output_dim=64)(x)
outputs = layers.LSTM(1)(x)
model = keras.Model(inputs, outputs)

# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)

2.2.5 N-grams

# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "binary" output_mode (multi-hot)
# and ngrams=2 (index all bigrams)
text_vectorizer = preprocessing.TextVectorization(output_mode="binary", ngrams=2)
# Index the bigrams via `adapt()`
text_vectorizer.adapt(data)

print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
    "\n",
)

# Create a Dense model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)

print("Model output:", test_output)

2.2.6 Encoding text with TF-IDF

# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "tf-idf" output_mode
# (multi-hot with TF-IDF weighting) and ngrams=2 (index all bigrams)
text_vectorizer = preprocessing.TextVectorization(output_mode="tf-idf", ngrams=2)
# Index the bigrams and learn the TF-IDF weights via `adapt()`
text_vectorizer.adapt(data)

print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
    "\n",
)

# Create a Dense model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)
print("Model output:", test_output)

2.3 Practical example

import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.model_selection import train_test_split
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

import pathlib

dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'

tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')
dataframe = pd.read_csv(csv_file)

# Create the target variable
dataframe['label'] = np.where(dataframe['AdoptionSpeed'] == 4, 0, 1)  # 0 means the pet was not adopted

dataframe = dataframe.drop(columns=['AdoptionSpeed', 'Description'])  # drop columns that are not needed

# Split into train / validation / test sets
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)


# Build an input pipeline with tf.data
def df_to_dataset(df, shuffle=True, batch_size=32):
    df_data = df.copy()
    if 'label' not in df_data.columns:  # prediction set: no label column
        ds = tf.data.Dataset.from_tensor_slices(dict(df_data))
    else:
        labels = df_data.pop('label').values
        ds = tf.data.Dataset.from_tensor_slices((dict(df_data), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(df_data))
    ds = ds.batch(batch_size)
    return ds


# Use the preprocessing layers
# Numeric columns
def get_normalization_layer(name, dataset):
    normalizer = preprocessing.Normalization(axis=None)
    feature_ds = dataset.map(lambda x, y: x[name])  # the dataset yields (features, label); keep only this feature
    normalizer.adapt(feature_ds)
    return normalizer


# Categorical columns: map values from a vocabulary to integer indices, then encode the feature
def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
    if dtype == 'string':  # string type
        index = preprocessing.StringLookup(max_tokens=max_tokens)
    else:  # discrete integer type
        index = preprocessing.IntegerLookup(max_tokens=max_tokens)

    feature_ds = dataset.map(lambda x, y: x[name])

    index.adapt(feature_ds)  # Learn the set of possible values and assign them a fixed integer index.

    # Note: these column names come from a different project; none of them exist in the
    # petfinder data, so for this dataset the one-hot branch below is always taken.
    if name in ['province', 'current_brand', 'currentchannel', 'currentdevicemanufacturer',
            'currentdevicetype']:
        """
        With tf.keras.Input(shape=(1,)), the Embedding input is a 2D tensor of shape (batch_size, input_length)
        and its output has shape (batch_size, input_length, output_dim), so it must be flattened before being
        concatenated with other features.
        With tf.keras.Input(shape=()), the Embedding output has shape (batch_size, output_dim).
        """
        encoder = tf.keras.layers.Embedding(input_dim=len(index.get_vocabulary()), output_dim=16)  # map indices to embeddings

        return lambda feature: tf.keras.layers.Flatten()(encoder(index(feature)))

    else:
        encoder = tf.keras.layers.CategoryEncoding(num_tokens=index.vocabulary_size())  # one-hot encoding
        return lambda feature: encoder(index(feature))


batch_size = 256
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

# Take one batch to inspect the format of the data it returns
[(train_features, label_batch)] = train_ds.take(1)
# print('Every feature:', list(train_features.keys()))
# print('A batch of ages:', train_features['Age'])
# print('A batch of targets:', label_batch)

all_inputs = []  # input layers
encoded_features = []  # encoded feature tensors

# Numeric features.
for header in ['PhotoAmt', 'Fee']:
    numeric_col = tf.keras.Input(shape=(1,), name=header)
    normalization_layer = get_normalization_layer(header, train_ds)
    encoded_numeric_col = normalization_layer(numeric_col)
    all_inputs.append(numeric_col)
    encoded_features.append(encoded_numeric_col)

# Categorical features encoded as integers.
age_col = tf.keras.Input(shape=(1,), name='Age', dtype='int64')
encoding_layer = get_category_encoding_layer('Age', train_ds, dtype='int64',
                                             max_tokens=5)
encoded_age_col = encoding_layer(age_col)
all_inputs.append(age_col)
encoded_features.append(encoded_age_col)

# Categorical features encoded as string.
categorical_cols = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',
                    'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Breed1']
for header in categorical_cols:
    categorical_col = tf.keras.Input(shape=(1,), name=header, dtype='string')
    encoding_layer = get_category_encoding_layer(header, train_ds, dtype='string',
                                                 max_tokens=5)
    encoded_categorical_col = encoding_layer(categorical_col)
    all_inputs.append(categorical_col)
    encoded_features.append(encoded_categorical_col)

# Build the end-to-end model
all_features = tf.keras.layers.concatenate(encoded_features)
x = tf.keras.layers.Dense(32, activation="relu")(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(all_inputs, output)
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])

# Train the model
model.fit(train_ds, epochs=10, validation_data=val_ds)
# Evaluate the model
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)
