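[Score-Boosting Guide] 2020 Huawei Cloud Big Data Challenge warm-up round: how to gain 10 points quickly and easily? A brief walkthrough of the baseline and some optimization ideas, part 1
Have you ever felt the despair of being stuck at 35.6483 for ages?
If your answer is yes, then please read on!!
A note up front: Hi everyone! I'm JerryX, an undergraduate and data newbie with half a year of practice. Data mining experts, please share your advice!! Likes, comments, and corrections are all very welcome!!
Below, we'll study the baseline and, along the way, see how to escape the misery of 35.6483.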
import moxing as mox
mox.file.shift('os', 'mox')
import os
import re
import json
import pandas as pd
from pandas import to_datetime
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
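try:
    import joblib  # standalone package; needed on scikit-learn >= 0.23
except ImportError:
    from sklearn.externals import joblib  # older scikit-learn, as in the original baseline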
from collections import OrderedDict
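First comes the usual routine: importing the libraries we need.
# Get the competition dataset: replace "obs-mybucket-bj4/myfolder" with your own OBS bucket name and folder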
import moxing as mox
mox.file.copy_parallel('s3://obs-bdc2020-bj4/traffic_flow_dataset', 's3://obs-mybucket-bj4/traffic_flow_dataset')
print('Copy procedure is completed !')
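Next, we fetch the competition dataset traffic_flow_dataset from Huawei Cloud OBS and copy it to our own path (in the code above, a folder in our own OBS bucket).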
OBS_DATA_PATH = "s3://obs-mybucket-bj4/traffic_flow_dataset"
LOCAL_DATA_PATH = './dataset/train'
OBS_MODEL_DIR = "s3://obs-mybucket-bj4/modelfiles/model"
OBS_MODEL_PATH = OBS_MODEL_DIR + "/modelfile.m"
OBS_CONFIG_PATH = OBS_MODEL_DIR + "/config.json"
LOCAL_MODEL_PATH = './modelfile.m'
LOCAL_CONFIG_PATH = './config.json'
Next, we define some path constants, including the paths used later to load the dataset and to save the model.
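Since the bucket name shows up in several of these constants, it is easy to miss one when swapping in your own bucket. Here is a minimal sketch that derives the OBS paths from a single bucket name. Note that `build_paths` and its `model_subdir` parameter are hypothetical helpers of mine, not part of the baseline; the folder and file names are just the ones used in the constants above.

```python
import posixpath


def build_paths(bucket, model_subdir="modelfiles/model"):
    """Derive the baseline's OBS paths from one bucket name (hypothetical helper)."""
    root = "s3://" + bucket
    model_dir = posixpath.join(root, model_subdir)
    return {
        "OBS_DATA_PATH": posixpath.join(root, "traffic_flow_dataset"),
        "OBS_MODEL_DIR": model_dir,
        "OBS_MODEL_PATH": posixpath.join(model_dir, "modelfile.m"),
        "OBS_CONFIG_PATH": posixpath.join(model_dir, "config.json"),
    }


# Change the bucket name in exactly one place.
paths = build_paths("obs-mybucket-bj4")
print(paths["OBS_MODEL_PATH"])
```

With this, switching to your own bucket only means changing the argument passed to `build_paths`; the local paths (`./dataset/train`, `./modelfile.m`, `./config.json`) stay as they are.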
# read data of one day and one direction
def read_file(path, filename):
    calfile = os.path.join(path, filename)
    original = pd.read_csv(calfile, header