Introduction
The first class back at school after the National Day holiday was lwd's advanced Python programming course, which is hard not to be happy about (if I had to pick, this is probably the best way to start). That said, I was still really sleepy and came this close to being late. Apologies to Teacher Liu for eating breakfast in class orz, but amazingly I didn't fall asleep, happy ending~
The title on the slides: Structured Data Classification
My own take: use a fully connected neural network with a sigmoid output to diagnose heart disease, built on three steps: data preprocessing, model construction, and training/testing. That's about it.
The heart dataset
First off, it's a CSV file, basically a table, with 303 samples in total. Each sample has 13 features: 6 numerical and 7 categorical. This distinction matters for the preprocessing later: numerical features get standardized, and categorical features get one-hot encoded.
A detailed description of the 13 features, plus an overview of the dataset, follows below.
Since plenty of people don't enjoy reading English, here is my rough translation of the features:
age: age in years
sex: sex (1 = male, 0 = female)
cp: chest pain type experienced (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)
trestbps: resting blood pressure (in mm Hg on admission)
chol: serum cholesterol measurement, in mg/dl
fbs: fasting blood sugar (> 120 mg/dl: 1 = true; 0 = false)
restecg: resting electrocardiogram results (0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy by Estes' criteria)
thalach: maximum heart rate achieved
exang: exercise-induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest ("ST" refers to a segment of the ECG trace; this is fairly specialized and I don't fully understand it either)
slope: slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)
ca: number of major vessels colored by fluoroscopy (0-4)
thal: a blood disorder called thalassemia (normal; fixed = fixed defect; reversible = reversible defect)
target: heart disease (0 = no, 1 = yes)
The rightmost column gives each feature's type: numerical or categorical.
Opened up, the CSV file looks roughly like this 👇
Data preprocessing and model construction
First, the problem statement: y = f(x), where the output is a 0/1 verdict on whether the patient has heart disease.
That makes this a binary classification problem, so a fully connected network with a sigmoid output will do. Following the teacher's approach, the loss can simply be "binary_crossentropy", for the optimizer we'll try adam first (because it's fast), and for the metric, keeping things simple, "accuracy".
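Spelled out with keyword arguments, those three choices amount to exactly the compile call that appears in the full code further down:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])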
The next step is to read the dataset, preprocess it, and split off a training set and a validation set.
The split the teacher gave is 8:2, which seems perfectly reasonable, so I'm not changing it.
The data is read with pandas. I'd forgotten everything I once learned about it, but reading boils down to a single read_csv call.
The part worth paying attention to is the DataFrame's sample method, which, true to its name, does random sampling; the key parameter is frac, the fraction of rows to draw. Its companion here is the drop method, which deletes rows: dropping the sampled rows from the original dataframe leaves the training set, while the sample itself becomes the validation set (see the two lines just below). With the split done, we move on to converting the data formats.
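Concretely, the split is just these two lines (taken from the full code further down; random_state pins the shuffle so the split is reproducible):
val_dataframe = dataframe.sample(frac=0.2, random_state=1337)  # randomly sample 20% of the rows for validation
train_dataframe = dataframe.drop(val_dataframe.index)  # the remaining 80% becomes the training set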
There are three kinds of features to handle: numerical features, integer categorical features, and a string categorical feature.
There are 6 numerical features, handled with TensorFlow's built-in Normalization layer (think of it as a layer dedicated to standardizing numerical features, essentially via the usual mean/variance form).
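As a standalone toy example (a minimal sketch with made-up numbers, assuming the TF 2.3 experimental preprocessing API used in the full code): adapt() learns the statistics, and calling the layer applies (x - mean) / std.
import numpy as np
from tensorflow.keras.layers.experimental.preprocessing import Normalization

norm = Normalization()
norm.adapt(np.array([[120.0], [130.0], [145.0]]))  # learn mean and variance from the data
print(norm(np.array([[145.0]])))  # roughly (145 - 131.67) / 10.27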
There are 7 categorical features, but one of them is expressed as a string, so we'll pull it out and handle it separately in a moment; that leaves 6 here, handled with TensorFlow's built-in CategoryEncoding layer, used in much the same way (best read alongside the code; scroll down).
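Same pattern for the integer categorical features (again a minimal sketch under the same TF 2.3 API, with made-up indices): adapt() discovers the set of category indices, and the call produces a one-hot style vector.
import numpy as np
from tensorflow.keras.layers.experimental.preprocessing import CategoryEncoding

enc = CategoryEncoding(output_mode="binary")
enc.adapt(np.array([[0], [1], [2]]))  # learn how many categories exist
print(enc(np.array([[2]])))  # a one-hot style row, e.g. [0., 0., 1.]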
Last is the string feature, which goes through the StringLookup layer. The two methods doing the work are the Dataset's map (to pull out just that one column) and the layer's adapt (to learn the vocabulary); the concrete code is attached below as well.
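In isolation it looks roughly like this (minimal sketch, vocabulary made up from the thal values; the resulting indices then go through CategoryEncoding exactly as above):
import numpy as np
from tensorflow.keras.layers.experimental.preprocessing import StringLookup

lookup = StringLookup()
lookup.adapt(np.array([["normal"], ["fixed"], ["reversible"]]))  # build the string -> index vocabulary
print(lookup(np.array([["fixed"]])))  # an integer index, ready for one-hot encoding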
Once everything is processed, it all gets concatenated. The way the layer structure is built feels a lot like nested function calls: y = f(x) = f5(f4(f3(f2(f1(x))))). I found a picture online that captures this quite well.
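That nesting is exactly how Keras' functional API reads: every layer is a function applied to the previous layer's output. A tiny illustration (shapes made up):
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(4,))  # x
h = layers.Dense(8, activation="relu")(inputs)  # f1(x)
outputs = layers.Dense(1, activation="sigmoid")(h)  # f2(f1(x))
model = keras.Model(inputs, outputs)  # y = f(x) = f2(f1(x))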
Putting all the processed layers together:
all_inputs = [
sex,
cp,
fbs,
restecg,
exang,
ca,
thal,
age,
trestbps,
chol,
thalach,
oldpeak,
slope,
]
……
all_features = layers.concatenate(
[
sex_encoded,
cp_encoded,
fbs_encoded,
restecg_encoded,
exang_encoded,
slope_encoded,
ca_encoded,
thal_encoded,
age_encoded,
trestbps_encoded,
chol_encoded,
thalach_encoded,
oldpeak_encoded,
]
)
With concatenation done, we build the model exactly as analyzed earlier:
x = layers.Dense(32, activation="relu")(all_features)
x = layers.Dropout(0.5)(x)
output = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(all_inputs, output)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
Here's an intuitive look at the whole model-building process:
Training the model and using it for prediction
Training is just the fit function. Batching was already done back in the data-processing step; in my (admittedly incomplete) testing this works slightly better than passing a batch size to fit, by about 0.03 in accuracy.
model.fit(train_ds, epochs=50, validation_data=val_ds, verbose=2)
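For reference, the batching mentioned above is these two lines from the preprocessing step (batch size 32; they appear in the full code as well):
train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)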
For prediction, the sample's input data is as follows 👇
sample = {
"age": 60,
"sex": 1,
"cp": 1,
"trestbps": 145,
"chol": 233,
"fbs": 1,
"restecg": 2,
"thalach": 150,
"exang": 0,
"oldpeak": 2.3,
"slope": 3,
"ca": 0,
"thal": "fixed",
}
Then run model.predict() to get the model's prediction and print() the result, and that's it.
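Concretely (again taken from the full code below): wrap each scalar in a list so there's a batch dimension, convert everything to tensors, then predict.
input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = model.predict(input_dict)
print("Predicted probability of heart disease: %.1f%%" % (100 * predictions[0][0]))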
Full Python code for everything above
"""
Title: Structured data classification from scratch
Author: [fchollet](https://twitter.com/fchollet)
Date created: 2020/06/09
Last modified: 2020/06/09
Description: Binary classification of structured data including numerical and categorical features.
"""
"""
## Introduction
This example demonstrates how to do structured data classification, starting from a raw
CSV file. Our data includes both numerical and categorical features. We will use Keras
preprocessing layers to normalize the numerical features and vectorize the categorical
ones.
Note that this example should be run with TensorFlow 2.3 or higher, or `tf-nightly`.
### The dataset
[Our dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) is provided by the
Cleveland Clinic Foundation for Heart Disease.
It's a CSV file with 303 rows. Each row contains information about a patient (a
**sample**), and each column describes an attribute of the patient (a **feature**). We
use the features to predict whether a patient has a heart disease (**binary
classification**).
Here's the description of each feature:
Column| Description| Feature Type
------------|--------------------|----------------------
Age | Age in years | Numerical
Sex | (1 = male; 0 = female) | Categorical
CP | Chest pain type (0, 1, 2, 3, 4) | Categorical
Trestbpd | Resting blood pressure (in mm Hg on admission) | Numerical
Chol | Serum cholesterol in mg/dl | Numerical
FBS | fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | Categorical
RestECG | Resting electrocardiogram results (0, 1, 2) | Categorical
Thalach | Maximum heart rate achieved | Numerical
Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical
Oldpeak | ST depression induced by exercise relative to rest | Numerical
Slope | Slope of the peak exercise ST segment | Numerical
CA | Number of major vessels (0-3) colored by fluoroscopy | Both numerical & categorical
Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect | Categorical
Target | Diagnosis of heart disease (1 = true; 0 = false) | Target
"""
"""
## Setup
"""
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental.preprocessing import Normalization
from tensorflow.keras.layers.experimental.preprocessing import CategoryEncoding
from tensorflow.keras.layers.experimental.preprocessing import StringLookup
"""
## Preparing the data
Let's download the data and load it into a Pandas dataframe:
"""
file_url = "heart.csv"
dataframe = pd.read_csv(file_url)
"""
The dataset includes 303 samples with 14 columns per sample (13 features, plus the target
label):
"""
dataframe.shape
"""
Here's a preview of a few samples:
"""
dataframe.head()
"""
The last column, "target", indicates whether the patient has a heart disease (1) or not
(0).
Let's split the data into a training and validation set:
"""
val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)
print(
"Using %d samples for training and %d for validation"
% (len(train_dataframe), len(val_dataframe))
)
"""
Let's generate `tf.data.Dataset` objects for each dataframe:
"""
def dataframe_to_dataset(dataframe):
dataframe = dataframe.copy()
labels = dataframe.pop("target")
ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
ds = ds.shuffle(buffer_size=len(dataframe))
return ds
train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)
"""
Each `Dataset` yields a tuple `(input, target)` where `input` is a dictionary of features
and `target` is the value `0` or `1`:
"""
for x, y in train_ds.take(1):
print("Input:", x)
print("Target:", y)
"""
Let's batch the datasets:
"""
train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)
"""
## Feature preprocessing with Keras layers
The following features are categorical features encoded as integers:
- `sex`
- `cp`
- `fbs`
- `restecg`
- `exang`
- `ca`
We will encode these features using **one-hot encoding** using the `CategoryEncoding()`
layer.
We also have a categorical feature encoded as a string: `thal`. We will first create an
index of all possible features using the `StringLookup()` layer, then we will one-hot
encode the output indices using a `CategoryEncoding()` layer.
Finally, the following feature are continuous numerical features:
- `age`
- `trestbps`
- `chol`
- `thalach`
- `oldpeak`
- `slope`
For each of these features, we will use a `Normalization()` layer to make sure the mean
of each feature is 0 and its standard deviation is 1.
Below, we define 3 utility functions to do the operations:
- `encode_numerical_feature` to apply featurewise normalization to numerical features.
- `encode_string_categorical_feature` to first turn string inputs into integer indices,
then one-hot encode these integer indices.
- `encode_integer_categorical_feature` to one-hot encode integer categorical features.
"""
def encode_numerical_feature(feature, name, dataset):
# Create a Normalization layer for our feature
normalizer = Normalization()
# Prepare a Dataset that only yields our feature
feature_ds = dataset.map(lambda x, y: x[name])
feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))
# Learn the statistics of the data
normalizer.adapt(feature_ds)
# Normalize the input feature
encoded_feature = normalizer(feature)
return encoded_feature
def encode_string_categorical_feature(feature, name, dataset):
# Create a StringLookup layer which will turn strings into integer indices
index = StringLookup()
# Prepare a Dataset that only yields our feature
feature_ds = dataset.map(lambda x, y: x[name])
feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))
# Learn the set of possible string values and assign them a fixed integer index
index.adapt(feature_ds)
# Turn the string input into integer indices
encoded_feature = index(feature)
# Create a CategoryEncoding for our integer indices
encoder = CategoryEncoding(output_mode="binary")
# Prepare a dataset of indices
feature_ds = feature_ds.map(index)
# Learn the space of possible indices
encoder.adapt(feature_ds)
# Apply one-hot encoding to our indices
encoded_feature = encoder(encoded_feature)
return encoded_feature
def encode_integer_categorical_feature(feature, name, dataset):
# Create a CategoryEncoding for our integer indices
encoder = CategoryEncoding(output_mode="binary")
# Prepare a Dataset that only yields our feature
feature_ds = dataset.map(lambda x, y: x[name])
feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))
# Learn the space of possible indices
encoder.adapt(feature_ds)
# Apply one-hot encoding to our indices
encoded_feature = encoder(feature)
return encoded_feature
"""
## Build a model
With this done, we can create our end-to-end model:
"""
# Categorical features encoded as integers
sex = keras.Input(shape=(1,), name="sex", dtype="int64")
cp = keras.Input(shape=(1,), name="cp", dtype="int64")
fbs = keras.Input(shape=(1,), name="fbs", dtype="int64")
restecg = keras.Input(shape=(1,), name="restecg", dtype="int64")
exang = keras.Input(shape=(1,), name="exang", dtype="int64")
ca = keras.Input(shape=(1,), name="ca", dtype="int64")
# Categorical feature encoded as string
thal = keras.Input(shape=(1,), name="thal", dtype="string")
# Numerical features
age = keras.Input(shape=(1,), name="age")
trestbps = keras.Input(shape=(1,), name="trestbps")
chol = keras.Input(shape=(1,), name="chol")
thalach = keras.Input(shape=(1,), name="thalach")
oldpeak = keras.Input(shape=(1,), name="oldpeak")
slope = keras.Input(shape=(1,), name="slope")
all_inputs = [
sex,
cp,
fbs,
restecg,
exang,
ca,
thal,
age,
trestbps,
chol,
thalach,
oldpeak,
slope,
]
# Integer categorical features
sex_encoded = encode_integer_categorical_feature(sex, "sex", train_ds)
cp_encoded = encode_integer_categorical_feature(cp, "cp", train_ds)
fbs_encoded = encode_integer_categorical_feature(fbs, "fbs", train_ds)
restecg_encoded = encode_integer_categorical_feature(restecg, "restecg", train_ds)
exang_encoded = encode_integer_categorical_feature(exang, "exang", train_ds)
ca_encoded = encode_integer_categorical_feature(ca, "ca", train_ds)
# String categorical features
thal_encoded = encode_string_categorical_feature(thal, "thal", train_ds)
# Numerical features
age_encoded = encode_numerical_feature(age, "age", train_ds)
trestbps_encoded = encode_numerical_feature(trestbps, "trestbps", train_ds)
chol_encoded = encode_numerical_feature(chol, "chol", train_ds)
thalach_encoded = encode_numerical_feature(thalach, "thalach", train_ds)
oldpeak_encoded = encode_numerical_feature(oldpeak, "oldpeak", train_ds)
slope_encoded = encode_numerical_feature(slope, "slope", train_ds)
all_features = layers.concatenate(
[
sex_encoded,
cp_encoded,
fbs_encoded,
restecg_encoded,
exang_encoded,
slope_encoded,
ca_encoded,
thal_encoded,
age_encoded,
trestbps_encoded,
chol_encoded,
thalach_encoded,
oldpeak_encoded,
]
)
x = layers.Dense(32, activation="relu")(all_features)
x = layers.Dropout(0.5)(x)
output = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(all_inputs, output)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
"""
Let's visualize our connectivity graph:
"""
# `rankdir='LR'` is to make the graph horizontal.
# keras.utils.plot_model(model, show_shapes=True, rankdir="LR")
"""
## Train the model
"""
model.fit(train_ds, epochs=50, validation_data=val_ds, verbose=2)
"""
We quickly get to 80% validation accuracy.
"""
"""
## Inference on new data
To get a prediction for a new sample, you can simply call `model.predict()`. There are
just two things you need to do:
1. wrap scalars into a list so as to have a batch dimension (models only process batches
of data, not single samples)
2. Call `convert_to_tensor` on each feature
"""
sample = {
"age": 60,
"sex": 1,
"cp": 1,
"trestbps": 145,
"chol": 233,
"fbs": 1,
"restecg": 2,
"thalach": 150,
"exang": 0,
"oldpeak": 2.3,
"slope": 3,
"ca": 0,
"thal": "fixed",
}
input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = model.predict(input_dict)
print(
"This particular patient had a %.1f percent probability "
"of having a heart disease, as evaluated by our model." % (100 * predictions[0][0],)
)
Notes and troubleshooting
You need TensorFlow 2.3.0 or above, otherwise the import step turns red, mainly because CategoryEncoding can't be found. Unfortunately, my PyCharm would only offer version 2.1.0, so I had no choice but to install it via Anaconda (no screenshots, so I can only describe it).
In short, first check in PyCharm's settings: if the right version is available there, installing it directly is easiest. If not, open Anaconda Prompt and run pip install tensorflow==2.3.0
That command downloads the new version and removes the old one along the way. Then restart PyCharm (I restarted mine; possibly it isn't even necessary) and you're all set.
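Once it's installed, you can confirm the version from Python before running anything:
import tensorflow as tf
print(tf.__version__)  # should print 2.3.0 or higher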
Also note that the CSV file needs to sit in the same directory as the .py file; otherwise, just change the path accordingly.
Screenshot of the run results 👇
See you all next week~ *waves*