数据清洗中,一般是先检测异常点,删除之后把全部缺失值一起插值。异常点检测方法有很多,其中效果较好的是均值标准差,四分位(箱图)和DBSCAN聚类。在Python中,都有现成的工具可以使用,下边把Python代码整理在一起方便使用。
数据使用的大望村水厂的逐日供水数据(下图展示部分数据),只需要把需要识别异常点的数据整理成第一列索引(label=time),第二列值(column=Value),就可以使用下列代码。
均值标准差
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_excel(r'D:/originalData/大望村.xlsx', engine='openpyxl', index_col='time')
# 从中取出100个数,先做一个直方图看看分布状态
# df = df[df['Value'] > 0].head(100)
df = df[df['Value'] > 0]
sns.distplot(df['Value'])
plt.show()
dataMean = df['Value'].mean()
dataStd = df['Value'].std()
lowerL = (dataMean - 3 * dataStd) if (dataMean - 3 * dataStd) > 0 else 0
upperL = dataMean + 3 * dataStd
abDot = df[(df['Value'] > upperL) | (df['Value'] < lowerL)]
四分位(箱图)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_excel(r'D:/originalData/大望村.xlsx', engine='openpyxl', index_col='time')
df = df[df['Value'] > 0]
plt.figure(figsize=(7, 6))
outlier = df.boxplot(return_type='dict')
y = outlier['fliers'][0].get_ydata()
asDotList = pd.DataFrame()
for i in y:
asDotList = pd.concat((asDotList, df[df['Value'] == i]))
DBSCAN
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
df = pd.read_excel(r'D:/originalData/大望村.xlsx', engine='openpyxl', index_col