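If the data shows clear periodicity, for example month or day of month (year is the exception), or angles,
it should be converted into a cyclic representation. (This does not always work, though.)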
- Use a function such as between_time
- Use sin/cos functions to turn the variable into a wave-shaped function
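A minimal sketch of the sin/cos approach, assuming the period of each feature is known in advance (12 for months). The helper name `encode_cyclic` is hypothetical and does not appear in the original code. Each value is mapped to a point on the unit circle, so December and January end up as neighbors:

```python
import numpy as np

def encode_cyclic(values, period):
    """Map values with a known period onto the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    # One sine column and one cosine column per input value
    return np.column_stack((np.sin(angle), np.cos(angle)))

months = np.arange(1, 13)
encoded = encode_cyclic(months, period=12)

# Distance between December and January versus January and July
dec_jan = np.linalg.norm(encoded[11] - encoded[0])
jan_jul = np.linalg.norm(encoded[0] - encoded[6])
print(dec_jan, jan_jul)
```

In this encoding adjacent months are close together and months half a year apart are the farthest apart, which a raw 1 to 12 integer cannot express.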
Result
Dropping the date variable: RMSE 65.75
Adding month and day variables: RMSE drops to 49.38
Converting month and day into periodic variables with convert_period: RMSE actually rises to 54.02
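One possible explanation for the increase, not verified on this dataset: a sine on its own is not one-to-one over a full cycle, so different months collapse onto the same value. The sketch below applies the same formula as `convert_period` to the months 1 to 12 and checks two such collisions:

```python
import numpy as np

def convert_period(time):
    # Same formula as the convert_period function in this post
    return np.sin((time - min(time)) / (max(time) - min(time) + 1) * 2 * np.pi)

months = np.arange(1, 13)
encoded = convert_period(months)

# February and June receive (numerically) the same value
feb_equals_jun = np.isclose(encoded[1], encoded[5])
# So do January and July
jan_equals_jul = np.isclose(encoded[0], encoded[6])
print(feb_equals_jun, jan_equals_jul)
```

Pairing the sine with the matching cosine column removes this ambiguity, because the (sin, cos) pair is unique for every position in the cycle.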
The code is as follows:
import numpy as np
import pandas as pd

def convert_period(time):
    # scale to [0, 2*pi) using the observed range, then take the sine
    return np.sin((time - min(time)) / (max(time) - min(time) + 1) * 2 * np.pi)

def dataIO(datafile):
    # comma delimited is the default
    df = pd.read_csv(datafile, header=0)
    # put the original column names in a python list
    original_headers = list(df.columns.values)
    # the 'day' column holds strings of the form year/month/day
    parts = df['day'].str.split('/')
    year = np.array([int(i[0]) for i in parts])  # parsed but not used as a feature
    month = np.array([int(i[1]) for i in parts])
    day = np.array([int(i[2]) for i in parts])
    # if you want periodic features
    month = convert_period(month)
    day = convert_period(day)
    # keep only the numeric columns
    df = df.select_dtypes(include='number')
    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    data = df.to_numpy()
    # prepend month and day as the first two feature columns
    date_features = np.vstack((month, day)).T
    data = np.hstack((date_features, data))
    # data is a numpy array of numeric values for input into scikit-learn
    return original_headers, data
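
As an alternative to splitting the string by hand, the date column could be parsed with `pd.to_datetime`. This sketch assumes the `day` column holds strings in year/month/day order, as the indexing in `dataIO` suggests. The sample rows and the column name `dom` are made up for illustration:

```python
import pandas as pd

# Made-up sample rows; the real input file is not shown in the post
df = pd.DataFrame({"day": ["2015/01/05", "2015/12/31"], "value": [1.0, 2.0]})

# Parse the strings once, then read month and day of month from the result
dates = pd.to_datetime(df["day"], format="%Y/%m/%d")
df["month"] = dates.dt.month
df["dom"] = dates.dt.day  # day of month

print(df[["month", "dom"]])
```

Parsing this way also fails loudly on malformed dates instead of silently producing wrong integers.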