Reproducing a Kaggle gold-medal strategy: Entity Embedding, the third-place solution of rossmann-store-sales

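This post is about handling categorical features in Kaggle competitions. It first reviews how other models such as CatBoost and LightGBM treat categorical features and the problems involved, then focuses on Entity Embedding, a method built on the deep-learning Embedding layer. It also walks through the data preprocessing steps and the modeling process, and finally presents the modeling results and analyzes the differences in performance.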


For details about the dataset and the competition, see my previous post: "Kaggle competition dataset: rossmann-store-sales" (thorn_r's blog on CSDN).

Part 0: How other models handle categorical features

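This competition has many categorical features that are hard to handle. The store ID is the most typical one: the dataset contains the sales history of 1115 stores, and which store a record belongs to is very important information for modeling. With 1115 categories, however, conventional one-hot encoding or dummy variables are a poor fit: they make the feature matrix extremely sparse, and 1115 dummy columns invite the curse of dimensionality, which badly hurts model quality. Building a separate model for each of the 1115 stores does not work well either: each store has too little data (only 983 rows), no prior information can be shared between similar stores, and many features can no longer be used. High-cardinality categorical features like this are therefore one of the pain points of the competition, and many earlier machine learning models already have their own ways of dealing with them.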

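1. CatBoost: CatBoost introduced Ordered Target Statistics, an algorithm that turns a categorical feature into a continuous one. Concretely, for each category of the feature it computes the frequency of the target (classification) or the mean of the target (regression) among the samples in that category and uses that value as a continuous feature. This is plain Target Statistics (not yet "Ordered"). The problem is that using the targets of all samples to build a feature is itself a form of data leakage: it causes prediction shift and can make the model overfit. CatBoost therefore first shuffles the samples randomly and then, for each sample, computes the target statistic only from the samples that precede it in the shuffled order. This mitigates the prediction shift to some extent (the underlying reason seems to be that it reduces the error of each gradient computation; I have not read the relevant material on this part). This is Ordered Target Statistics.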

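2. LightGBM: LightGBM chooses splits on histogram bins of each feature. For a categorical feature, each category is treated as a bin and the bins are ordered by a per-category statistic (LightGBM's documentation describes sorting by sum_gradient / sum_hessian, which plays the role of the target mean or frequency here), so the feature can then be split like a binned continuous variable. This is close to the CatBoost idea above and has the same Target Statistics leakage problem, which is why LightGBM adds several regularization constraints for categorical splits (for example the cat_smooth and cat_l2 parameters).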

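As you can see, GBDT-style models handle categorical features by mapping each category to the mean or frequency of the target, and then apply various techniques to mitigate the resulting data leakage.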
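
The ordered target statistic described for CatBoost above can be sketched in a few lines. This is a simplified illustration, not CatBoost's actual implementation: the helper name `ordered_target_statistic`, the `prior_weight` parameter and the choice of the global target mean as the prior are all assumptions made for this example.

```python
import numpy as np

def ordered_target_statistic(categories, targets, prior_weight=1.0, seed=0):
    """Encode each sample using only the targets of the samples that precede it
    in a random permutation (hypothetical helper, not CatBoost's API)."""
    categories = np.asarray(categories)
    targets = np.asarray(targets, dtype=float)
    prior = targets.mean()                      # assumed prior: the global target mean
    order = np.random.default_rng(seed).permutation(len(targets))
    sums, counts = {}, {}                       # running per-category statistics
    encoded = np.empty(len(targets))
    for idx in order:
        cat = categories[idx]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # the sample's own target is not part of its encoding, which limits the leakage
        encoded[idx] = (s + prior_weight * prior) / (c + prior_weight)
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded

print(ordered_target_statistic(["a", "a", "b", "a"], [1.0, 3.0, 5.0, 7.0]))
```

The first sample of every category in the shuffled order has no history, so it simply receives the prior; later samples of the same category drift towards that category's running mean.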

The strategy reproduced in this post instead handles categorical features with the Embedding layer from deep learning.

Part 1: Handling categorical features with Entity Embedding

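In NLP, Word Embedding is commonly used to preprocess text. If every word were one-hot encoded, the number of features would become enormous and run into the curse of dimensionality, so a preprocessing method that reduces the dimensionality is clearly needed. That method is Word Embedding. With one-hot encoding, every category is a 0-1 vector whose length equals the number of categories. With Word Embedding, the vector length is a fixed value chosen in advance, independent of the number of categories, and the vector is dense rather than 0-1. Each category still gets its own distinct vector, but the dimensionality stays under control. NLP also has to take the context of each word into account, hence word2vec-style methods such as CBOW and Skip-gram. For simply turning a high-cardinality categorical feature into continuous values, an Embedding layer alone is enough.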
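
To make the contrast with one-hot encoding concrete, here is a small NumPy sketch. The embedding matrix is filled with random numbers as a stand-in for learned weights, and the sizes 1115 and 50 are the ones used for the Store feature later in this post; everything else is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories, embedding_dim = 1115, 50
# random stand-in for the weights an Embedding layer would learn
embedding_matrix = rng.normal(size=(n_categories, embedding_dim))

store_ids = np.array([0, 41, 1114])          # zero-based category indices

# Embedding lookup: pick the matching rows
dense_vectors = embedding_matrix[store_ids]

# Equivalent but wasteful formulation: one-hot vectors times the same matrix
one_hot = np.eye(n_categories)[store_ids]
assert np.allclose(one_hot @ embedding_matrix, dense_vectors)

print(one_hot.shape, dense_vectors.shape)    # (3, 1115) (3, 50)
```

An embedding layer is thus a one-hot encoding followed by a linear layer, implemented as a table lookup so that the sparse 1115-wide vector is never materialized.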

The team that used this strategy in the Kaggle competition also published a paper on arXiv; take a look if you are interested.

http://arxiv.org/abs/1604.06737

This post reproduces that third-place strategy.

GitHub link: https://github.com/entron/entity-embedding-rossmann (the competition code is in the kaggle branch).

Part 2: Data preprocessing

The data preprocessing on GitHub consists of five steps:

python3 extract.py
python3 extract_weather.py
python3 extract_google_trend.py
python3 extract_fb_features.py
python3 prepare_features.py

There is not much to say about the first one, extract.py; nowadays pandas does the job directly.

# extract.py
import pandas as pd
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
df_store = pd.read_csv("store.csv")
df_store_states = pd.read_csv("store_states.csv")
df_store = df_store.merge(df_store_states,on="Store",how="left")

The second and third scripts, weather and Google Trends, deal with external datasets that are not part of the official competition data; a simple mapping is enough:

# extract_weather.py
import os
import numpy as np
weather_path = "weather/"

d = {}
d['BadenWuerttemberg'] = 'BW'
d['Bayern'] = 'BY'
d['Berlin'] = 'BE'
d['Brandenburg'] = 'BB'  # do not exist in store_state
d['Bremen'] = 'HB'  # we use Niedersachsen instead of Bremen
d['Hamburg'] = 'HH'
d['Hessen'] = 'HE'
d['MecklenburgVorpommern'] = 'MV'  # do not exist in store_state
d['Niedersachsen'] = 'HB,NI'  # we use Niedersachsen instead of Bremen
d['NordrheinWestfalen'] = 'NW'
d['RheinlandPfalz'] = 'RP'
d['Saarland'] = 'SL'
d['Sachsen'] = 'SN'
d['SachsenAnhalt'] = 'ST'
d['SchleswigHolstein'] = 'SH'
d['Thueringen'] = 'TH'

def event2int(event):
    event_list = ['', 'Fog-Rain', 'Fog-Snow', 'Fog-Thunderstorm',
                  'Rain-Snow-Hail-Thunderstorm', 'Rain-Snow', 'Rain-Snow-Hail',
                  'Fog-Rain-Hail', 'Fog', 'Fog-Rain-Hail-Thunderstorm', 'Fog-Snow-Hail',
                  'Rain-Hail', 'Rain-Hail-Thunderstorm', 'Fog-Rain-Snow', 'Rain-Thunderstorm',
                  'Fog-Rain-Snow-Hail', 'Rain', 'Thunderstorm', 'Snow-Hail',
                  'Rain-Snow-Thunderstorm', 'Snow', 'Fog-Rain-Thunderstorm']
    return event_list.index(event)

df_weather = pd.DataFrame(columns=["State","Date","max_temp","mean_temp","min_temp",
                                  "max_humi","mean_humi","min_humi","max_wind","mean_wind","CloudCover","Events"])

for i in os.listdir(weather_path):
    df_weather_tmp = pd.read_csv(weather_path+i,delimiter=";")
    df_weather_tmp["State"] = d[i.replace(".csv","")]
    df_weather_tmp = df_weather_tmp.loc[:,['State','Date', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
                                          'Max_Humidity','Mean_Humidity', 'Min_Humidity','Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h',"CloudCover","Events"]]
    df_weather_tmp.columns = df_weather.columns
    df_weather = pd.concat([df_weather,df_weather_tmp],axis=0,ignore_index=True)

df_weather["CloudCover"] = df_weather["CloudCover"].replace(np.nan,0)
df_weather["Events"] = df_weather["Events"].replace(np.nan,"")
df_weather["Events"] = [event2int(i) for i in df_weather["Events"]]

# extract_google_trend.py
from datetime import datetime
trend_path = ("googletrend/")
df_google_trend = pd.DataFrame(columns=["State","year","week_of_year","trend"])
for i in os.listdir(trend_path):
    state = i.replace(".csv","").split("_")[-1]
    if state == 'NI':
        state = 'HB,NI'
    df_google_tmp = pd.read_csv(trend_path+i)
    df_google_tmp.rename(columns={df_google_tmp.columns[-1]:"trend"},inplace=True)
    df_google_tmp["Date"] = [datetime.strptime(df_google_tmp["Woche"][i].split(" - ")[-1],'%Y-%m-%d') for i in df_google_tmp.index]
    df_google_tmp["year"] = df_google_tmp["Date"].dt.year
    df_google_tmp["week_of_year"] = df_google_tmp["Date"].dt.isocalendar()["week"]
    df_google_tmp["State"] = state
    df_google_tmp = df_google_tmp.loc[:,["State","year","week_of_year","trend"]]
    df_google_trend = pd.concat([df_google_trend,df_google_tmp],axis=0)

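The fourth script, extract_fb_features.py, builds a sliding window and, within that window, computes for each day the distance to the nearest event (State Holiday, School Holiday, etc.) or the number of events.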

Let's look at the original code first:

# extract_fb_features.py

class smart_timestamp_accessor(object):
    # Find the index that corresponds to a timestamp

    def __init__(self, timestamps):
        self.timestamps = timestamps
        self._last_index = 0

    def compute_index_of_timestamp(self, timestamp):
        # Clamp timestamps that fall outside the range to the first / last index
        if timestamp < self.timestamps[0]:
            return 0
        if timestamp > self.timestamps[len(self.timestamps) - 1]:
            return (len(self.timestamps) - 1)
        
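        # The branches below update the object's _last_index attribute: the end point of the current window becomes the starting point for the next call, which is what makes the window "slide"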
        if timestamp == self.timestamps[self._last_index]:
            return self._last_index

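        # There is a bug here: some stores have no data from 2014-07-01 to 2014-12-31. Take a 7-day forward window as an example:
        # June 24 enters the timestamp > self.timestamps[self._last_index] branch, which pulls 2015-01-01 into the 7-day window.
        # June 25, however, enters the timestamp < self.timestamps[self._last_index] branch, which cuts the window back to June 30, i.e. shorter than 7 days.
        # June 26 goes through the forward (>) branch again, June 27 through the backward (<) branch, and so on.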
        if timestamp > self.timestamps[self._last_index]:
            while timestamp > self.timestamps[self._last_index]:
                self._last_index += 1
            return self._last_index
        
        
        if timestamp < self.timestamps[self._last_index]:
            print(timestamp,self.timestamps[self._last_index])
            while timestamp < self.timestamps[self._last_index]:
                self._last_index -= 1
            return self._last_index

    def get_start_end_timestamp(self):
        timestamp_start = self.timestamps[0]
        nr_timestamps = len(self.timestamps)
        timestamp_end = self.timestamps[nr_timestamps - 1]
        return (timestamp_start, timestamp_end)




def generate_forward_backward_information_accumulated(data, store_id, window_size=14, only_zero=False):
    dat_store = data[data["Store"] == store_id]
    columns = ["Promo", "StateHoliday", "SchoolHoliday"]
    # columns = ["StateHoliday"]

    column_dat = np.array(dat_store[columns], dtype="str")
    (nr_obs, nr_cols) = column_dat.shape
    generated_features = []
    generated_features.append([timestamp for timestamp in dat_store["Date"][::-1]])  # the raw data is sorted by date in descending order, so reverse it into ascending order here (same below)
    generated_features.append([store_id for i in range(nr_obs)])

    timestamps = np.array(dat_store["Date"], dtype="datetime64[D]")
    column_dat = column_dat[::-1]
    timestamps = timestamps[::-1]

    generated_feature_names = []
    generated_feature_names.append("Date")
    generated_feature_names.append("Store")

    for i_col, column in enumerate(columns):
        unique_elements = set(el for el in np.array(dat_store[column], dtype="str"))
        if only_zero:
            unique_elements = set("0")

        for unique_element in unique_elements:
            first_forward_looking = []
            last_backward_looking = []
            count_forward_looking = []
            count_backward_looking = []
            forward_looking_timestamps = smart_timestamp_accessor(timestamps)
            backward_looking_timestamps = smart_timestamp_accessor(timestamps)
            for i_obs in range(nr_obs):
                timestamp = timestamps[i_obs]
                timestamp_forward = timestamp + np.timedelta64(window_size, "D")
                timestamp_backward = timestamp + np.timedelta64(-window_size, "D")
                index_forward = forward_looking_timestamps.compute_index_of_timestamp(timestamp_forward)
                index_backward = backward_looking_timestamps.compute_index_of_timestamp(timestamp_backward)

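                # At the boundary, where there is no window left to look into, the count (StateHoliday count) is recorded as 0; otherwise it is the total number of events in the window (the entries that are != unique_element). Same below.
                # first/last: within the window (say 14 days), how many days until the next (or since the last) event such as a promo; minimum 1, maximum 15. Note the +1 in the code.
                # The StateHoliday count is the number of events in the next / previous 14 days.
                # The current day's own event is never included.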
                if i_obs == nr_obs - 1:
                    first_forward_looking.append(window_size + 1)
                    count_forward_looking.append(0)
                else:
                    forward_looking_data = column_dat[(i_obs + 1):(index_forward + 1), i_col]
                    forward_looking_data = forward_looking_data != unique_element
                    nr_occurences = np.sum(forward_looking_data)
                    if nr_occurences == 0:
                        first_forward_looking.append(window_size + 1)
                    else:
                        first_forward_looking.append(np.argmax(forward_looking_data) + 1)
                    count_forward_looking.append(nr_occurences)

                if i_obs == 0:
                    last_backward_looking.append(window_size + 1)
                    count_backward_looking.append(0)
                else:
                    backward_looking_data = column_dat[index_backward:i_obs, i_col]
                    backward_looking_data = backward_looking_data != unique_element
                    nr_occurences = np.sum(backward_looking_data)
                    if nr_occurences == 0:
                        last_backward_looking.append(window_size + 1)
                    else:
                        last_backward_looking.append(np.argmax(backward_looking_data[::-1]) + 1)
                    count_backward_looking.append(np.sum(backward_looking_data))

            generated_features.append(first_forward_looking)
            generated_features.append(last_backward_looking)
            if column == "StateHoliday":
                generated_features.append(count_forward_looking)
                generated_features.append(count_backward_looking)

            generated_feature_names.append(column + "_first_forward_looking")
            generated_feature_names.append(column + "_last_backward_looking")
            if column == "StateHoliday":
                generated_feature_names.append(column + "_count_forward_looking")
                generated_feature_names.append(column + "_count_backward_looking")

    return (np.array(generated_features).T, generated_feature_names)

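The original code defines a smart_timestamp_accessor class to implement the sliding window; its compute_index_of_timestamp method finds the index of the time point n days before or after a given day. The original code has a small bug, though. In the raw data, some stores have no records from July 1 to December 31, 2014. smart_timestamp_accessor keeps its position in self._last_index and implements the "sliding window" by incrementing and decrementing that variable. Take a 7-day window as an example: June 24 enters the timestamp > self.timestamps[self._last_index] branch, which pulls January 1, 2015 into the 7-day window, and January 1, 2015 becomes the index stored in self._last_index. June 25 therefore enters the timestamp < self.timestamps[self._last_index] branch, which cuts the window back to June 30, i.e. shorter than 7 days. June 26 goes through the forward (>) branch again, June 27 through the backward (<) branch, and so on. To handle this, we need to fill in the missing stretch.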

# Handle gaps in the date range: first find the stores that have missing days
lost_stores = []
for store in df_train["Store"].unique():
    tmp = df_train[df_train.Store == store]
    tmp = tmp.sort_values("Date")
    tmp_dates = pd.to_datetime(tmp["Date"]).reset_index()["Date"]
    if sum([tmp_dates[i]-tmp_dates[i-1]>pd.Timedelta("1 days") for i in range(1,len(tmp_dates))]) > 0:
        lost_stores.append(store)


df_train_interp = df_train.copy()
#df_test_interp = df_test.copy()
filler_rows = []  # collect the filler rows first; DataFrame.append was removed in pandas 2.0
for store in lost_stores:
    tmp = df_train[df_train.Store == store]
    tmp = tmp.sort_values("Date")
    tmp_dates = pd.to_datetime(tmp["Date"]).reset_index()["Date"]
    for i in range(1,len(tmp_dates)):
        if tmp_dates[i]-tmp_dates[i-1]>pd.Timedelta("1 days"):
            min_date = tmp_dates[i-1]+pd.Timedelta("1 days")
            max_date = tmp_dates[i]
            # add one event-free row for every missing day in the gap
            while min_date<max_date:
                filler_rows.append({"Store":store,"Date":str(min_date)[0:10],"Promo":0,"StateHoliday":0,"SchoolHoliday":0})
                min_date = min_date+pd.Timedelta("1 days")

    print(store,"complete")
df_train_interp = pd.concat([df_train_interp,pd.DataFrame(filler_rows)],axis=0,ignore_index=True)
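
As an alternative to the explicit loop, the same gap filling can be sketched with `pd.date_range` and `reindex`. This is only an illustration on a toy frame: `fill_date_gaps` is a hypothetical helper, and it assumes each store has at most one row per date.

```python
import pandas as pd

def fill_date_gaps(store_df, store_id):
    """Insert one filler row per missing calendar day for a single store
    (hypothetical helper; filler rows get Promo/StateHoliday/SchoolHoliday = 0)."""
    indexed = (store_df.assign(Date=pd.to_datetime(store_df["Date"]))
                       .set_index("Date").sort_index())
    full_range = pd.date_range(indexed.index.min(), indexed.index.max(), freq="D")
    filled = indexed.reindex(full_range)        # missing days become all-NaN rows
    filled["Store"] = store_id
    event_cols = ["Promo", "StateHoliday", "SchoolHoliday"]
    filled[event_cols] = filled[event_cols].fillna(0)
    return filled.rename_axis("Date").reset_index()

toy = pd.DataFrame({
    "Store": [1, 1, 1],
    "Date": ["2014-06-29", "2014-06-30", "2014-07-03"],   # two days are missing
    "Promo": [1, 0, 1],
    "StateHoliday": ["0", "0", "a"],
    "SchoolHoliday": [0, 0, 1],
})
filled = fill_date_gaps(toy, store_id=1)
print(filled[["Date", "Promo", "StateHoliday"]])
```

The result has five rows, with 2014-07-01 and 2014-07-02 added as event-free days. Unlike the loop above, `Date` comes back as a datetime column, so it would need to be formatted back to a string before merging on the original `Date` column.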

For the "when did the nearest event happen" feature, there is a simpler way to think about it:

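Assume first that no event occurs inside the sliding window and fill every value with the window size + 1. Then, for each event, update the results by radiating outward from the event: to the left for the forward-looking feature, to the right for the backward-looking one.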

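def get_window_looking(array,window_size,forward=True):
    """
    array is a 0-1 vector that marks whether an event happened on each day
    """
    array = np.asarray(array)  # index by position even if a pandas Series is passed in
    res = np.ones_like(array)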
    res *= window_size+1
    ones = np.where(array==1)[0]
    for i in ones:
        if forward:
            for j in range(1,window_size+1):
                if i-j>=0:
                    res[i-j] = min(j,res[i-j])
        else:
            for j in range(1,window_size+1):
                if i+j<array.shape[0]:
                    res[i+j] = min(j,res[i+j])
    return res
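
To check the "radiating" update on a toy vector, here is a compact restatement of the same logic. `nearest_event_distance` is a name made up for this illustration; it is meant to behave like get_window_looking above.

```python
import numpy as np

def nearest_event_distance(events, window_size, forward=True):
    """Days until the next event (forward) or since the last one (backward),
    capped at window_size + 1 when no event falls inside the window."""
    events = np.asarray(events)
    res = np.full(len(events), window_size + 1)
    for i in np.where(events == 1)[0]:
        for j in range(1, window_size + 1):
            # forward-looking: the event at position i is j days ahead of day i - j
            target = i - j if forward else i + j
            if 0 <= target < len(events):
                res[target] = min(j, res[target])
    return res

events = [0, 0, 0, 0, 1, 0, 0]
print(nearest_event_distance(events, 3))                 # [4 3 2 1 4 4 4]
print(nearest_event_distance(events, 3, forward=False))  # [4 4 4 4 4 1 2]
```

The day of the event itself keeps the "no event" value (4 here), because the current day is not counted as part of either window.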

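The "number of events inside the sliding window" feature can be computed directly with a convolution whose kernel is all ones and whose stride is 1.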

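import torch

def get_window_count(array,window_size,forward=True):
    """
    array is a 0-1 vector that marks whether an event happened on each day
    """
    if forward:
        array = array[::-1]
    #array = np.zeros(len(test_state_tmp))
    array = np.concatenate([np.zeros(window_size),array])
    # Slide an all-ones kernel with stride 1: each output is the number of events in the window_size days before the current position; the final output (the window ending on the last day) belongs to no day, so it is dropped
    res = torch.conv1d(torch.from_numpy(array.reshape(1,-1)).float(),torch.ones(1,1,window_size))[0][0:-1].numpy()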
    if forward:
        res = res[::-1]
    return res
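
The same sliding-window sum does not strictly need torch. Below is a sketch of an equivalent using `np.convolve`; the function name `window_count` is hypothetical, and the padding and the dropped last element mirror get_window_count above.

```python
import numpy as np

def window_count(events, window_size, forward=True):
    """Number of events in the window_size days after (forward) or before
    (backward) each day, excluding the day itself."""
    events = np.asarray(events, dtype=float)
    if forward:
        events = events[::-1]
    padded = np.concatenate([np.zeros(window_size), events])
    # a 'valid' convolution with an all-ones kernel is a sliding-window sum
    counts = np.convolve(padded, np.ones(window_size), mode="valid")[:-1]
    if forward:
        counts = counts[::-1]
    return counts.astype(int)

events = [1, 0, 0, 1, 0, 0]
print(window_count(events, 2, forward=False))  # [0 1 1 0 1 1]
print(window_count(events, 2))                 # [0 1 1 0 0 0]
```

Reversing the input turns the forward-looking count into a backward-looking one, so a single code path covers both directions.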

Finally, these two methods are used to build the forward and backward features:

need_cols = ["Store","Date","Promo_first_forward_looking","Promo_last_backward_looking"
     ,"StateHoliday_first_forward_looking","StateHoliday_last_backward_looking"
     ,"StateHoliday_count_forward_looking","StateHoliday_count_backward_looking"
     ,"SchoolHoliday_first_forward_looking","SchoolHoliday_last_backward_looking"]
df_fb = pd.DataFrame(columns=need_cols)
for store in df_train["Store"].unique():
    tmp = [df_train_interp[df_train_interp["Store"]==store]["Store"].T,df_train_interp[df_train_interp["Store"]==store].sort_values("Date")["Date"].T]
    for col in ["Promo","StateHoliday","SchoolHoliday"]:
        arr = df_train_interp[df_train_interp["Store"]==store].sort_values("Date")[col].astype("str")
        arr = (arr!="0").astype(int)
        tmp.append(get_window_looking(arr,7))
        tmp.append(get_window_looking(arr,7,forward=False))
        if col == "StateHoliday":
            tmp.append(get_window_count(arr,7))
            tmp.append(get_window_count(arr,7,forward=False))
    tmp = pd.DataFrame(np.array(tmp).T,columns=need_cols)
    df_fb = pd.concat([df_fb,tmp],axis=0)
    
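df_fb = df_fb.reset_index()
df_fb.drop("index",axis = 1,inplace=True)
# np.array(tmp) mixes date strings with numbers, so the columns end up with a generic object dtype; cast the numeric ones back to integers
num_cols = [c for c in need_cols if c != "Date"]
df_fb[num_cols] = df_fb[num_cols].astype(float).astype(int)
df_fb.to_pickle("df_fb.pickle")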

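Finally, prepare_features.py converts the remaining categorical features into integers (tokenizes them) so that they can be fed to Embedding layers later, and maps the previously extracted data onto the training set.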

# prepare_features.py
import math
from datetime import datetime
from isoweek import Week
def abc2int(char):
    d = {'0': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4}
    return d[char]


def state2int(state):
    d = {'HB,NI': 0, 'HH': 1, 'TH': 2, 'RP': 3, 'ST': 4, 'BW': 5,
         'SN': 6, 'BE': 7, 'HE': 8, 'SH': 9, 'BY': 10, 'NW': 11}
    return d[state]


def PromoInterval2int(promointerval):
    char = promointerval[0]
    d = {'0': 0, 'J': 1, 'F': 2, 'M': 3}
    return d[char]

# Number of months the competitor has been open (capped at 24)
def hasCompetitionmonths(date, CompetitionOpenSinceYear, CompetitionOpenSinceMonth):
    CompetitionOpenSinceYear = int(CompetitionOpenSinceYear)
    CompetitionOpenSinceMonth = int(CompetitionOpenSinceMonth)
    if CompetitionOpenSinceYear == 0:
        return 0
    dt_competition_open = datetime(year=CompetitionOpenSinceYear,
                                   month=CompetitionOpenSinceMonth,
                                   day=15)
    months_since_competition = (date - dt_competition_open).days // 30
    if months_since_competition < 0:
        return 0
    return min(months_since_competition, 24)

# Number of weeks since Promo2 started (capped at 25)
def hasPromo2weeks(date, Promo2SinceYear, Promo2SinceWeek):
    Promo2SinceYear = int(Promo2SinceYear)
    Promo2SinceWeek = int(Promo2SinceWeek)
    if Promo2SinceYear == 0:
        return 0
    start_promo2 = Week(Promo2SinceYear, Promo2SinceWeek).monday()
    weeks_since_promo2 = (date.date() - start_promo2).days // 7
    if weeks_since_promo2 < 0:
        return 0
    return min(weeks_since_promo2, 25)
# Number of months since the latest Promo2 round started
def latest_promo2_months(date, promointerval, Promo2SinceYear, Promo2SinceWeek):
    Promo2SinceYear = int(Promo2SinceYear)
    Promo2SinceWeek = int(Promo2SinceWeek)
    if not hasPromo2weeks(date, Promo2SinceYear, Promo2SinceWeek):
        return 0
    promo2int = PromoInterval2int(promointerval)
    if promo2int == 0:
        return 0

    if date.month < promo2int:
        latest_promo2_start_year = date.year - 1
        latest_promo2_start_month = promo2int + 12 - 3  # the last quarter of the previous year
    else:
        latest_promo2_start_year = date.year
        latest_promo2_start_month = ((date.month - promo2int) // 3) * 3 + promo2int

    latest_promo2_start_day = datetime(year=latest_promo2_start_year,
                                       month=latest_promo2_start_month,
                                       day=1)
    weeks_since_latest_promo2 = (date - latest_promo2_start_day).days // 30
    return weeks_since_latest_promo2

## stores
df_train_features = df_train.merge(df_store,on="Store",how="left")
tmp = [datetime.strptime(df_train.loc[i,'Date'], '%Y-%m-%d') for i in range(df_train.shape[0])]
df_train_features["year"] = pd.Series([i.year for i in tmp])
df_train_features["month"] = pd.Series([i.month for i in tmp])
df_train_features["day"] = pd.Series([i.day for i in tmp])
df_train_features["week_of_year"] = pd.Series([i.isocalendar()[1] for i in tmp])

## weather
df_train_features = df_train_features.merge(df_weather,on=["State","Date"],how="left")
for col in ['max_temp', 'mean_temp', 'min_temp']:
    df_train_features[col] = (df_train_features[col]-10)/30
for col in ['max_humi','mean_humi', 'min_humi']:
    df_train_features[col] = (df_train_features[col]-50)/50
df_train_features['max_wind'] = df_train_features['max_wind']/50
df_train_features['mean_wind'] = df_train_features['mean_wind']/30

## trends 
df_train_features = df_train_features.merge(df_google_trend,on=["State","year","week_of_year"],how="left")

df_google_trend_DE = df_google_trend[df_google_trend["State"]=='DE'].drop("State",axis=1)
df_google_trend_DE = df_google_trend_DE.rename(columns={"trend":"trend_DE"})
df_train_features = df_train_features.merge(df_google_trend_DE,on=["year","week_of_year"],how="left")

df_train_features["StateHoliday"] = pd.Series([abc2int(i) for i in df_train_features["StateHoliday"].astype("str")])
df_train_features["CompetitionDistance"] = df_train_features["CompetitionDistance"].fillna(0)
df_train_features["CompetitionOpenSinceMonth"] = df_train_features["CompetitionOpenSinceMonth"].fillna(0).astype("int")
df_train_features["CompetitionOpenSinceYear"] = df_train_features["CompetitionOpenSinceYear"].fillna(0).astype("int")

df_train_features["hasCompetitionmonths"] = pd.Series([hasCompetitionmonths(
    datetime.strptime(df_train_features.loc[i,"Date"], '%Y-%m-%d'),
    df_train_features.loc[i,"CompetitionOpenSinceYear"],
    df_train_features.loc[i,"CompetitionOpenSinceMonth"]
    ) for i in range(df_train_features.shape[0])])

df_train_features["Promo2SinceWeek"] = df_train_features["Promo2SinceWeek"].fillna(0).astype("int")
df_train_features["Promo2SinceYear"] = df_train_features["Promo2SinceYear"].fillna(0).astype("int")
df_train_features["PromoInterval"] = df_train_features["PromoInterval"].fillna(0)

df_train_features["has_promo2_for_weeks"] = pd.Series([hasPromo2weeks(
    datetime.strptime(df_train_features.loc[i,"Date"], '%Y-%m-%d'),
    df_train_features.loc[i,"Promo2SinceYear"],
    df_train_features.loc[i,"Promo2SinceWeek"]
    ) for i in range(df_train_features.shape[0])])

df_train_features["latest_promo2_for_months"] = pd.Series([latest_promo2_months(
    datetime.strptime(df_train_features.loc[i,"Date"], '%Y-%m-%d'),
    df_train_features.loc[i,"PromoInterval"],
    df_train_features.loc[i,"Promo2SinceYear"],
    df_train_features.loc[i,"Promo2SinceWeek"]
    ) for i in range(df_train_features.shape[0])])

df_train_features = df_train_features.merge(df_fb,on=["Store","Date"],how="left")

df_train_features["CompetitionDistance"] = pd.Series([math.log(i+1)/10 for i in df_train_features["CompetitionDistance"]])

for col in ["StoreType","Assortment"]:
    df_train_features[col] = pd.Series([abc2int(i) for i in df_train_features[col]])
df_train_features["PromoInterval"] = pd.Series([PromoInterval2int(i) for i in df_train_features["PromoInterval"].astype("str")])
df_train_features["State"] = pd.Series([state2int(i) for i in df_train_features["State"].astype("str")])


df_train_features_final = df_train_features.loc[:,["Open","Store","DayOfWeek",
                                                  "Promo","year","month","day","StateHoliday",
                                                  "SchoolHoliday","hasCompetitionmonths","has_promo2_for_weeks","latest_promo2_for_months",
                                                  "CompetitionDistance","StoreType","Assortment","PromoInterval",
                                                  "CompetitionOpenSinceYear","Promo2SinceYear","State","week_of_year",
                                                  'max_temp', 'mean_temp', 'min_temp', 'max_humi',
                                                   'mean_humi', 'min_humi', 'max_wind', 'mean_wind', 'CloudCover',
                                                   'Events', 'trend', 'trend_DE','Promo_first_forward_looking', 'Promo_last_backward_looking',
                                                   'StateHoliday_first_forward_looking',
                                                   'StateHoliday_last_backward_looking',
                                                   'StateHoliday_count_forward_looking',
                                                   'StateHoliday_count_backward_looking',
                                                   'SchoolHoliday_first_forward_looking',
                                                   'SchoolHoliday_last_backward_looking',"Sales"]]

df_train_features_final.to_pickle("train_final_info.pickle")


## stores
df_test_features = df_test.merge(df_store,on="Store",how="left")

#df_test_features = df_test.merge(df_store_states,on="Store",how = "left")
tmp = [datetime.strptime(df_test.loc[i,'Date'], '%Y-%m-%d') for i in range(df_test.shape[0])]
df_test_features["year"] = pd.Series([i.year for i in tmp])
df_test_features["month"] = pd.Series([i.month for i in tmp])
df_test_features["day"] = pd.Series([i.day for i in tmp])
df_test_features["week_of_year"] = pd.Series([i.isocalendar()[1] for i in tmp])

## weather
df_test_features = df_test_features.merge(df_weather,on=["State","Date"],how="left")
for col in ['max_temp', 'mean_temp', 'min_temp']:
    df_test_features[col] = (df_test_features[col]-10)/30
for col in ['max_humi','mean_humi', 'min_humi']:
    df_test_features[col] = (df_test_features[col]-50)/50
df_test_features['max_wind'] = df_test_features['max_wind']/50
df_test_features['mean_wind'] = df_test_features['mean_wind']/30

## trends 
df_test_features = df_test_features.merge(df_google_trend,on=["State","year","week_of_year"],how="left")

df_google_trend_DE = df_google_trend[df_google_trend["State"]=='DE'].drop("State",axis=1)
df_google_trend_DE = df_google_trend_DE.rename(columns={"trend":"trend_DE"})
df_test_features = df_test_features.merge(df_google_trend_DE,on=["year","week_of_year"],how="left")

df_test_features["StateHoliday"] = pd.Series([abc2int(i) for i in df_test_features["StateHoliday"].astype("str")])
df_test_features["CompetitionDistance"] = df_test_features["CompetitionDistance"].fillna(0)
df_test_features["CompetitionOpenSinceMonth"] = df_test_features["CompetitionOpenSinceMonth"].fillna(0).astype("int")
df_test_features["CompetitionOpenSinceYear"] = df_test_features["CompetitionOpenSinceYear"].fillna(0).astype("int")

df_test_features["hasCompetitionmonths"] = pd.Series([hasCompetitionmonths(
    datetime.strptime(df_test_features.loc[i,"Date"], '%Y-%m-%d'),
    df_test_features.loc[i,"CompetitionOpenSinceYear"],
    df_test_features.loc[i,"CompetitionOpenSinceMonth"]
    ) for i in range(df_test_features.shape[0])])

df_test_features["Promo2SinceWeek"] = df_test_features["Promo2SinceWeek"].fillna(0).astype("int")
df_test_features["Promo2SinceYear"] = df_test_features["Promo2SinceYear"].fillna(0).astype("int")
df_test_features["PromoInterval"] = df_test_features["PromoInterval"].fillna(0)

df_test_features["has_promo2_for_weeks"] = pd.Series([hasPromo2weeks(
    datetime.strptime(df_test_features.loc[i,"Date"], '%Y-%m-%d'),
    df_test_features.loc[i,"Promo2SinceYear"],
    df_test_features.loc[i,"Promo2SinceWeek"]
    ) for i in range(df_test_features.shape[0])])

df_test_features["latest_promo2_for_months"] = pd.Series([latest_promo2_months(
    datetime.strptime(df_test_features.loc[i,"Date"], '%Y-%m-%d'),
    df_test_features.loc[i,"PromoInterval"],
    df_test_features.loc[i,"Promo2SinceYear"],
    df_test_features.loc[i,"Promo2SinceWeek"]
    ) for i in range(df_test_features.shape[0])])

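# df_fb_test is not built in the code shown above; it is assumed to be produced from df_test with the same forward/backward steps used for df_fb
df_test_features = df_test_features.merge(df_fb_test,on=["Store","Date"],how="left")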
df_test_features["CompetitionDistance"] = pd.Series([math.log(i+1)/10 for i in df_test_features["CompetitionDistance"]])

for col in ["StoreType","Assortment"]:
    df_test_features[col] = pd.Series([abc2int(i) for i in df_test_features[col]])
df_test_features["PromoInterval"] = pd.Series([PromoInterval2int(i) for i in df_test_features["PromoInterval"].astype("str")])
df_test_features["State"] = pd.Series([state2int(i) for i in df_test_features["State"].astype("str")])

df_test_features.to_pickle("df_test_features.pickle")

df_test_features_final = df_test_features.loc[:,["Open","Store","DayOfWeek",
                                                  "Promo","year","month","day","StateHoliday",
                                                  "SchoolHoliday","hasCompetitionmonths","has_promo2_for_weeks","latest_promo2_for_months",
                                                  "CompetitionDistance","StoreType","Assortment","PromoInterval",
                                                  "CompetitionOpenSinceYear","Promo2SinceYear","State","week_of_year",
                                                  'max_temp', 'mean_temp', 'min_temp', 'max_humi',
                                                   'mean_humi', 'min_humi', 'max_wind', 'mean_wind', 'CloudCover',
                                                   'Events', 'trend', 'trend_DE','Promo_first_forward_looking', 'Promo_last_backward_looking',
                                                   'StateHoliday_first_forward_looking',
                                                   'StateHoliday_last_backward_looking',
                                                   'StateHoliday_count_forward_looking',
                                                   'StateHoliday_count_backward_looking',
                                                   'SchoolHoliday_first_forward_looking',
                                                   'SchoolHoliday_last_backward_looking']]

df_test_features_final.to_pickle("test_final_info.pickle")

Finally, for the Embedding layers, the tokenized categorical features need to be decremented by one so that their indices start from zero.

def split_features(df):
    for col in ["Store","DayOfWeek","month","day","week_of_year","Promo_first_forward_looking","Promo_last_backward_looking"
                ,"StateHoliday_first_forward_looking","StateHoliday_last_backward_looking","SchoolHoliday_first_forward_looking",
               "SchoolHoliday_last_backward_looking"]:
        df[col] = df[col]-1
    
    df["year"] = df["year"]-2013
    
    # Vectorized instead of a row-by-row loop: years before 2000 (including the 0 used for missing values) map to 1, later years to year-1998
    df["CompetitionOpenSinceYear"] = np.where(df["CompetitionOpenSinceYear"]<2000,1,df["CompetitionOpenSinceYear"]-1998)
    df["Promo2SinceYear"] = np.maximum(df["Promo2SinceYear"]-2008,0)
    return df

df_train_features_final_splited = split_features(df_train_features_final)
df_test_features_final_splited = split_features(df_test_features_final)
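
Since an Embedding layer expects every index to lie in [0, input_dim), it can be worth checking the shifted columns before training. The helper below is a hypothetical sketch, not part of the original code; the `input_dims` mapping is supplied by hand and the toy frame is made up.

```python
import pandas as pd

def check_embedding_ranges(df, input_dims):
    """Return the columns whose values fall outside [0, input_dim)
    (hypothetical helper for illustration)."""
    bad = {}
    for col, input_dim in input_dims.items():
        lo, hi = int(df[col].min()), int(df[col].max())
        if lo < 0 or hi >= input_dim:
            bad[col] = (lo, hi)
    return bad

# DayOfWeek = 7 simulates a column that was not shifted to start from zero
toy = pd.DataFrame({"Store": [0, 1114], "DayOfWeek": [0, 7]})
print(check_embedding_ranges(toy, {"Store": 1115, "DayOfWeek": 7}))
# {'DayOfWeek': (0, 7)}
```

An empty dictionary means every listed column fits the input_dim that will be passed to its Embedding layer.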

Part 3: Modeling

In keras 2.x the original Merge layer has been split into separate layers; Concatenate is the one to use here.

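Here we define an EntitiyEmbedding class with one helper method for categorical inputs (Embedding) and one for continuous inputs (Dense), which makes the model easier to assemble. Note that related weather features (for example the three temperature columns) have been grouped into a single input.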

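# Imports for keras 2.x (use tensorflow.keras.* instead if you rely on the Keras bundled with TensorFlow)
from keras.models import Model
from keras.layers import Input,Embedding,Reshape,Dense,Concatenate,Dropout,Activation
from keras.callbacks import ModelCheckpoint

class EntitiyEmbedding:
    # input_dim is the "vocabulary" size, i.e. the number of categories; output_dim is the size of the "word vector", i.e. the size of this feature after embedding; input_length is the length of the "sentence": each record holds exactly one value here, so it is set to 1 everywhere
    # (input_length is not the batch size; batch_size is set in the fit function)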
    def __init__(self):
        # inputs and outputs have to be collected separately
        self.inputs = []
        self.sub_models = []
        self.features = []
        self.checkpoint = ModelCheckpoint("EntityEmbedding_Rossmann.h5")
        
    def add_embedding_layer(self,name,input_shape,output_shape):
        self.features.append(name)
        #model_tmp = Sequential()
        input_model = Input(shape=(1,),name=name+"_input")
        output = Embedding(input_shape,output_shape,input_length=1,name=name+"_output")(input_model)
        output = Reshape((output_shape,))(output)
        self.inputs.append(input_model)
        self.sub_models.append(output)
        
    def add_dense_layer(self,name,unit,dim):
        self.features.append(name)
        # model_tmp = Sequential()
        input_model = Input(shape=(dim,),name=name+"_input")
        output = Dense(unit,input_dim=dim,name=name+"_output")(input_model)
        self.inputs.append(input_model)
        self.sub_models.append(output)
        
    def concat(self):
        self.model = Concatenate()(self.sub_models)
        self.model = Dropout(0.02)(self.model)
        self.model = Dense(1000, kernel_initializer='uniform')(self.model)
        self.model = Activation('relu')(self.model)
        self.model = Dense(500, kernel_initializer='uniform')(self.model)
        self.model = Activation('relu')(self.model)
        self.model = Dense(1)(self.model)
        self.model = Activation('sigmoid')(self.model)
        self.model = Model(self.inputs,self.model)
        self.model.compile(loss='mean_absolute_error', optimizer='adam')
        
    
    def preprocess_y(self,y):
        self.max_log_y = np.max(np.log(y))
        y = np.log(y) / self.max_log_y
        
        return y
    
    def after_process(self,y):
        return np.exp(y*self.max_log_y)
    
    def summary(self):
        return self.model.summary()
    
    def fit(self,X_train,X_valid,Y_train,Y_valid,n_epoch,batch_size):
        #X = self.preprocess_fit(X)
        Y_train = self.preprocess_y(Y_train)
        if len(Y_valid)>0:
            Y_valid = np.log(Y_valid) / self.max_log_y
            self.model.fit(X_train,Y_train,epochs = n_epoch,batch_size = batch_size,validation_data = (X_valid,Y_valid),callbacks = [self.checkpoint]) 
        else:
            self.model.fit(X_train,Y_train,epochs = n_epoch,batch_size = batch_size,callbacks = [self.checkpoint]) 
    def predict(self,X):
        
        return self.after_process(self.model.predict(X))
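
The network ends in a sigmoid, so its output lies strictly between 0 and 1. That is why `preprocess_y` takes the log of the sales and divides by the largest log value, and why `after_process` inverts the same steps. A minimal sketch of the round trip, using made-up sales figures (the numbers are illustrative only, not taken from the dataset):

```python
import numpy as np

# Hypothetical sales values; they must be > 0 for the log to be defined,
# which is why rows with Sales == 0 are filtered out before training.
sales = np.array([1500.0, 5263.0, 8314.0, 41551.0])

max_log_y = np.max(np.log(sales))       # the statistic preprocess_y stores
scaled = np.log(sales) / max_log_y      # training target, at most 1

restored = np.exp(scaled * max_log_y)   # what after_process computes

print(scaled.max())                     # the largest sale maps to 1.0
print(np.allclose(restored, sales))     # the transform is invertible
```

Because `max_log_y` is computed from the training targets only, the same stored value is reused for validation data and for predictions.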

model = EntitiyEmbedding()

model.add_embedding_layer("Store",1115,50)
model.add_embedding_layer("day_of_week",7,6)
model.add_dense_layer("promo",1,1)
model.add_embedding_layer("year",3,2)
model.add_embedding_layer("month",12,6)
model.add_embedding_layer("day",31,10)
model.add_embedding_layer("state_holiday",4,3)
model.add_dense_layer("school_holiday",1,1)
model.add_embedding_layer("hasCompetitionmonths",25,2)
model.add_embedding_layer("has_promo2_for_weeks",26,1)
model.add_embedding_layer("latest_promo2_for_months",4,1)
model.add_dense_layer("CompetitionDistance",1,1)
model.add_embedding_layer("StoreType",5,2)
model.add_embedding_layer("Assortment",4,3)
model.add_embedding_layer("PromoInterval",4,3)
model.add_embedding_layer("CompetitionOpenSinceYear",18,4)
model.add_embedding_layer("Promo2SinceYear",8,4)
model.add_embedding_layer("State",12,6)
model.add_embedding_layer("week_of_year",53,2)
model.add_dense_layer("temp",3,3)
model.add_dense_layer("humi",3,3)
model.add_dense_layer("wind",2,2)
model.add_dense_layer("cloud",1,1)
model.add_embedding_layer("Event_weather",22,4)
model.add_dense_layer("trend",1,1)
model.add_dense_layer("trend_DE",1,1)
model.add_embedding_layer("Promo_first_forward_looking",8,1)
model.add_embedding_layer("Promo_last_backward_looking",8,1)
model.add_embedding_layer("StateHoliday_first_forward_looking",8,1)
model.add_embedding_layer("StateHoliday_last_backward_looking",8,1)
model.add_embedding_layer("StateHoliday_count_forward_looking",3,1)
model.add_embedding_layer("StateHoliday_count_backward_looking",3,1)
model.add_embedding_layer("SchoolHoliday_first_forward_looking",8,1)
model.add_embedding_layer("SchoolHoliday_last_backward_looking",8,1)

model.concat()
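
As a reminder of what each embedding layer does: it is a trainable lookup table, and a lookup by integer index gives the same result as multiplying a one-hot vector by the weight matrix, without ever building the one-hot vector. A minimal NumPy sketch, with a random matrix standing in for trained weights (an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories, emb_dim = 7, 6                   # the sizes used for day_of_week
W = rng.normal(size=(n_categories, emb_dim))   # stand-in for trained weights

idx = np.array([0, 3, 6])                      # integer-encoded categories
lookup = W[idx]                                # row lookup, as an embedding does

one_hot = np.eye(n_categories)[idx]            # explicit one-hot encoding
via_matmul = one_hot @ W                       # same rows, via matrix product

print(lookup.shape)
print(np.allclose(lookup, via_matmul))
```

One consequence is that every encoded value must lie in the range 0 to `n_categories - 1`, so the first argument passed to `add_embedding_layer` has to be at least the largest code plus one.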

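"""GPU setup"""
import os
import tensorflow as tf
# Make only the first GPU visible.
# Note: this only selects the device. Growing GPU memory on demand would
# additionally need tf.config.experimental.set_memory_growth, called before
# the GPU is first used.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"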

def preprocess(X,feature_lists):
    de_sc = StandardScaler() #Google trend de
    state_sc = StandardScaler() #Google trend SC

    for i in X.columns:
        if i in {'max_temp', 'mean_temp','min_temp', 'max_humi', 'mean_humi', 'min_humi', 'max_wind','mean_wind', 'CloudCover','trend', 'trend_DE'}:
            X[i] = X[i].astype(float)
        else:
            X[i] = X[i].astype(int)
    
    X["trend_DE"] = de_sc.fit_transform(X["trend_DE"].values.reshape(-1, 1))
    X["trend"] = state_sc.fit_transform(X["trend"].values.reshape(-1, 1))
    return X,de_sc,state_sc

def concat(X,feature_list):
    X_list = []
    for i in feature_list:
        tmp = np.array(X[i].values)
        X_list.append(tmp)
    return X_list

feature_list = ["Store","DayOfWeek",
              "Promo","year","month","day","StateHoliday",
              "SchoolHoliday","hasCompetitionmonths","has_promo2_for_weeks","latest_promo2_for_months",
              "CompetitionDistance","StoreType","Assortment","PromoInterval",
              "CompetitionOpenSinceYear","Promo2SinceYear","State","week_of_year",
              ['max_temp', 'mean_temp', 'min_temp'], ['max_humi','mean_humi', 'min_humi'],
                ['max_wind', 'mean_wind'], 'CloudCover',
               'Events', 'trend', 'trend_DE','Promo_first_forward_looking', 'Promo_last_backward_looking',
               'StateHoliday_first_forward_looking',
               'StateHoliday_last_backward_looking',
               'StateHoliday_count_forward_looking',
               'StateHoliday_count_backward_looking',
               'SchoolHoliday_first_forward_looking',
               'SchoolHoliday_last_backward_looking']

X = df_train_features_final_splited[df_train_features_final_splited["Sales"]>0].drop(["Sales","Open"],axis=1)
Y = df_train_features_final_splited[df_train_features_final_splited["Sales"]>0]["Sales"]

X, de_sc,state_sc = preprocess(X,feature_list)

X_list = concat(X,feature_list)

model.fit(X_list,[],Y,[],n_epoch=50,batch_size=128)
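
The call above passes empty lists for the validation arguments, so no validation loss is reported. Since the model takes a list of arrays, a validation split has to slice every array at the same row. A minimal sketch, where `split_inputs` is a hypothetical helper (not part of the original code) and the toy arrays stand in for `X_list` and `Y`:

```python
import numpy as np

def split_inputs(X_list, y, valid_fraction=0.1):
    """Split every input array and the target at the same row position."""
    y = np.asarray(y)
    cut = len(y) - int(len(y) * valid_fraction)
    X_train = [x[:cut] for x in X_list]
    X_valid = [x[cut:] for x in X_list]
    return X_train, X_valid, y[:cut], y[cut:]

# Toy stand-ins: one single-column feature and one two-column feature.
X_toy = [np.arange(10), np.arange(20).reshape(10, 2)]
y_toy = np.arange(1.0, 11.0)

X_tr, X_va, y_tr, y_va = split_inputs(X_toy, y_toy, valid_fraction=0.2)
print(len(y_tr), len(y_va))
```

The four results can be passed to `model.fit` in the order the method expects (`X_train, X_valid, Y_train, Y_valid`). Whether the held-out rows are the most recent dates depends on how the training frame is sorted, which this sketch does not check.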

After training for a while, the model can be used to produce predictions for the test set.

X = df_test_features_final_splited.drop(["Open"],axis=1)
for i in X.columns:
    if i in {'max_temp', 'mean_temp','min_temp', 'max_humi', 'mean_humi', 'min_humi', 'max_wind','mean_wind', 'CloudCover','trend', 'trend_DE'}:
        X[i] = X[i].astype(float)
    else:
        X[i] = X[i].astype(int)

X["trend_DE"] = de_sc.transform(X["trend_DE"].values.reshape(-1, 1))
X["trend"] = state_sc.transform(X["trend"].values.reshape(-1, 1))

X_list = concat(X,feature_list)

res = model.predict(X_list)
res = res.reshape(1,-1)[0]

zero_idx = [i for i in df_test.index if df_test.loc[i,"Open"].astype(int)==0]
res[zero_idx] = 0

submission = pd.read_csv("sample_submission.csv")
submission["Sales"] = res
submission.to_csv("entity_res.csv",index=None)

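The result is shown in the figure: the score is decent, but not as good as the original third-place result. People who reproduced this on Kaggle attribute the gap to Keras replacing Merge with Concatenate. Personally, I think that besides package version changes, the team may also have used some tricks during the competition, or applied some regularization to certain results, that we have not accounted for.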
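
To compare runs locally before submitting, the competition metric (root mean square percentage error, RMSPE) can be computed on a held-out set. A minimal sketch, assuming the true and predicted sales are available as arrays; the `rmspe` helper and the sample numbers are illustrative, not part of the original code:

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean square percentage error, skipping rows with zero true sales."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true > 0                     # zero-sales rows are not scored
    pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return np.sqrt(np.mean(pct_err ** 2))

# Made-up values: two scored rows, each off by 10%, plus one zero-sales row.
score = rmspe([100, 200, 0], [110, 180, 50])
print(score)   # approximately 0.1
```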

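As an aside, some people use Entity Embedding as a preprocessing step, feeding the embedded categorical features into other models such as LightGBM. I also wrote code to extract the embedding weights: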

weights = {}
for i in model.features:
    if i not in continous_features:
        w = model.model.get_layer(i + '_output').get_weights()[0]
        df_tmp = pd.DataFrame(w,columns=[i + "_" + str(j) for j in range(w.shape[1])])
        df_tmp[i] = [j for j in range(w.shape[0])]
        weights[i] = df_tmp

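The result is a dictionary whose keys are the categorical feature names and whose values are tables giving the dense vector that the embedding assigns to each category value.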
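
To use one of these tables in another model, it can be joined back onto the feature table by the integer category code. A minimal sketch with made-up numbers; `emb` mimics the shape of one value in the `weights` dictionary, and `features` is a hypothetical feature table:

```python
import pandas as pd

# Hypothetical 2-dimensional embedding table for a feature named "StoreType".
emb = pd.DataFrame({
    "StoreType_0": [0.1, -0.3, 0.7],
    "StoreType_1": [0.5, 0.2, -0.4],
    "StoreType": [0, 1, 2],
})

# Hypothetical feature table holding the integer-encoded column.
features = pd.DataFrame({"StoreType": [2, 0, 0, 1], "Promo": [1, 0, 1, 0]})

# A left join keeps every row in its original order; the raw code column
# is dropped once the dense columns have replaced it.
dense = features.merge(emb, on="StoreType", how="left").drop(columns="StoreType")
print(dense)
```

Note that the names registered in the model (for example `day_of_week`) differ from some column names in the data (for example `DayOfWeek`), so the key column may need to be renamed before the join.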

Someone on Kaggle has tried training with these vectors as features, with reasonably good results. Here is the link:

Rossmann sales (EmbLayer + XG + LGB) | Kaggle