New User Prediction Challenge

Article link: https://blog.youkuaiyun.com/SinkAboutIt/article/details/149334597

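1. Competition Background

The iFLYTEK Open Platform provides AI capabilities and solutions tailored to different industries and scenarios. It empowers developers' products and applications and helps them solve practical problems with AI, so that products can listen and speak, see and recognize, understand and reason.

Predicting new users is a key step in analyzing usage scenarios and forecasting user growth, and it informs later iterations of products and applications. Competition page: 2025 iFLYTEK AI Developer Competition.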

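2. Competition Task

The competition provides a large volume of application data from the iFLYTEK Open Platform as training samples. Participants build a model on these samples to predict whether users are new.

2.1 Task and Objective

Core task: build a classification model from user behavior logs.
Prediction target: predict whether a user (did) is a new user (is_new_did = 1) or an existing user (is_new_did = 0).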

2.2 Data Fields

Identity fields: did, device_brand, os_type
Behavior IDs: mid, eid
Time and location: common_ts, common_country, common_province, common_city
Environment: appver, channel, ntt, operator
The "treasure" field: udmap (a JSON string containing botId and pluginId)

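Exploratory analysis of the data surfaced two key findings:

93% of the users in the test set also appear in the training set
88% of the users in the training set have is_new_did = 0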
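
The two ratios above can be computed with a few lines of pandas. The sketch below is illustrative only: it uses a small made-up DataFrame in place of train.csv and testA_data.csv, and it assumes both ratios are measured over distinct did values rather than over event rows.

```python
import pandas as pd

# Toy stand-ins for train.csv / testA_data.csv (all values are made up).
train = pd.DataFrame({
    "did": ["a", "a", "b", "c", "c", "d"],
    "is_new_did": [0, 0, 1, 0, 0, 0],
})
test = pd.DataFrame({"did": ["a", "b", "b", "e"]})

train_dids = set(train["did"])
test_dids = set(test["did"])

# Share of distinct test users that also appear in the training set.
overlap_ratio = len(test_dids & train_dids) / len(test_dids)

# One label per user, then the share of existing users (is_new_did == 0).
user_labels = train.groupby("did")["is_new_did"].first()
old_user_ratio = (user_labels == 0).mean()

print(f"test users seen in train: {overlap_ratio:.2%}")
print(f"train users with is_new_did = 0: {old_user_ratio:.2%}")
```

On the real data the same two expressions would be run on the loaded CSV files.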

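2.3 Evaluation Metric: F1-Score

The metric is F1-Score. This tells us to watch out for class imbalance and to tune the model's classification threshold rather than rely on the default of 0.5.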
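
To see why the threshold matters under class imbalance, consider the small sketch below. The labels and probabilities are invented for illustration. Predicting "existing user" for everyone scores well on accuracy but gets an F1 of zero, and the F1 of the probabilistic predictions changes noticeably as the threshold moves.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels (20% positives) and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_proba = np.array([0.05, 0.10, 0.10, 0.20, 0.20, 0.30, 0.35, 0.45, 0.40, 0.80])

# A trivial "everyone is an existing user" predictor.
all_zero = np.zeros_like(y_true)
acc_all_zero = accuracy_score(y_true, all_zero)
f1_all_zero = f1_score(y_true, all_zero, zero_division=0)

# F1 of the thresholded probabilities at a few candidate thresholds.
f1_by_threshold = {
    thr: f1_score(y_true, (y_proba >= thr).astype(int), zero_division=0)
    for thr in (0.3, 0.4, 0.5)
}

print(f"all-zero accuracy={acc_all_zero:.2f}, F1={f1_all_zero:.2f}")
for thr, f1 in f1_by_threshold.items():
    print(f"threshold={thr:.1f} -> F1={f1:.3f}")
```

In this toy case the best of the three thresholds is 0.4, not 0.5, which is the kind of gain a threshold search on validation data is meant to capture.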

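Key Points and Difficulties

Solving this problem means dealing with the following points:

Event-level behavior data -> user-level prediction
High-dimensional sparse features (device / region / behavior IDs)
Imbalanced classes (new users are the minority)
Behavior aggregation: how to turn event-level rows into user-level features
Time-sensitive features: user behavior patterns change over time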

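Thought Process

Key decisions:

Use a tree model rather than a neural network (training speed, easier feature handling)
Build simple time features first rather than complex feature engineering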

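3. Core Logic of the Baseline

Preprocessing: timestamps are converted and features such as day and hour are extracted; categorical features are encoded with LabelEncoder. These steps are sound.
Modeling granularity: [Key flaw] the baseline trains directly on the event-level DataFrame (train_df). Every row of the training set is therefore a single behavior event, not a user.
Model training: LightGBM with 5-fold cross-validation on several million rows of event data.
Prediction: a probability is predicted for every event in the test set, and the predicted probabilities are then averaged to produce the final test-set prediction.
Data insight: the code correctly finds that as many as 93% of did values overlap between the training and test sets.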

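File path	Type	Purpose
train.csv (provided by the organizers)	Input data	Labeled training set
testA_data.csv (provided by the organizers)	Input data	Unlabeled test set
Baseline.ipynb	Baseline code	End-to-end processing script
submit.csv (generated at run time)	Output	Prediction file for submission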

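The Baseline's "Fatal Flaws"

The baseline is a classic attractive trap: it looks like a shortcut but turns out to be a detour.

Flaw 1: Modeling granularity does not match the task
- The task asks for a label per user (did), but the model learns to predict a label per event (eid). It sees only the "trees" (individual actions), not the "forest" (the user as a whole). When an existing user occasionally performs an action typical of new users, the model is easily misled on that single row.
- Evidence: the final feature importances rank mid and appver highest. This is likely because certain module IDs or app versions correlate strongly with new/existing status, and the model is merely latching onto that surface-level association.
Flaw 2: No real feature engineering
- Because modeling happens at the event level, the baseline builds no aggregate features that describe a user's overall behavior. Total active days, activity frequency, preferred hours, number of distinct plugins used, and similar information are all lost. The model's view of each did is fragmented.
Flaw 3: Very poor computational efficiency
- Running 5-fold cross-validation on several million rows is expensive. The notebook output shows that training took 9 minutes 25 seconds, which is hard to accept in a competition that depends on fast iteration over features and models. With user-level modeling, the training set shrinks to a few hundred thousand rows and training becomes roughly an order of magnitude faster.
Flaw 4: Leaving free points on the table
- The code finds that did values overlap heavily between the training and test sets (data leakage) but never uses this. For any did that already appears in the training set, the label is known, and that "answer key" could be copied straight into the submission.

Conclusion: the baseline's modeling paradigm is fundamentally flawed, and it needs to be rebuilt from the ground up.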

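4. Core Idea: Build a "Profile" for Every User

The goal is to convert the many event rows that belong to one did in the raw data into a single feature vector that describes that user's behavior pattern as completely as possible.

4.1 Step 1: did-Level Feature Aggregation

This is the heart of the paradigm shift. We use groupby('did').agg(...) to build the user profile.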

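Statistical features: compute count, nunique, mean, std, max, min, ptp (range), and so on. For example:
- ptp of common_ts -> length of the user's lifecycle.
- nunique of eid -> diversity of the user's behavior.
- mean/std of hour -> the user's active hours and how regular they are.
Derived features: build more meaningful ratio features on top of the base aggregates.
- eid_count / mid_nunique -> average number of events per module.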
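
As a minimal sketch of this step, the snippet below aggregates a tiny made-up event table into one row per did and then derives a ratio of event count to distinct modules. The output column names (eid_count, mid_nunique, ts_ptp, events_per_mid) are chosen here for illustration and are not taken from the full script.

```python
import pandas as pd

# Toy event log: several rows per user (values are made up).
events = pd.DataFrame({
    "did": ["u1", "u1", "u1", "u2", "u2"],
    "mid": ["m1", "m1", "m2", "m3", "m3"],
    "eid": ["e1", "e2", "e2", "e9", "e9"],
    "common_ts": [1_000, 4_000, 9_000, 2_000, 2_500],  # milliseconds
    "hour": [8, 9, 21, 14, 14],
})

# One row per did, using named aggregation for readable column names.
profile = events.groupby("did").agg(
    eid_count=("eid", "count"),
    eid_nunique=("eid", "nunique"),
    mid_nunique=("mid", "nunique"),
    ts_min=("common_ts", "min"),
    ts_max=("common_ts", "max"),
    hour_mean=("hour", "mean"),
    hour_std=("hour", "std"),
)

# Derived features built on top of the base aggregates.
profile["ts_ptp"] = profile["ts_max"] - profile["ts_min"]  # lifecycle length
profile["events_per_mid"] = profile["eid_count"] / profile["mid_nunique"]

print(profile)
```

Note that std is NaN for users with a single event, so a fill step is still needed before training.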

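4.2 Step 2: Exploit the Data Leakage

This is the "nuclear option" of the competition and has to be used.

Logic: when building the submission file, first check whether each did in the test set exists in the training set.
- If it does, look up that did's is_new_did label in the training set and use it as the final prediction.
- If it does not, fall back to the prediction of the trained model.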
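
The lookup-then-fallback logic can be sketched as follows. The data is made up, and model_pred is a hypothetical stand-in for the thresholded output of a trained model on the unseen users.

```python
import pandas as pd

# Toy data: "a" and "b" are known from training, "c" is unseen.
train = pd.DataFrame({"did": ["a", "a", "b"], "is_new_did": [0, 0, 1]})
test = pd.DataFrame({"did": ["a", "b", "c", "c"]})

# Hypothetical model output for unseen users only (did -> predicted label).
model_pred = pd.Series({"c": 1})

# Known label per did from the training set.
known = train.groupby("did")["is_new_did"].first()

from_train = test["did"].map(known)       # NaN where the did is unseen
from_model = test["did"].map(model_pred)  # NaN where the did is known

# Prefer the training label; use the model only where no label exists.
test["is_new_did"] = from_train.fillna(from_model).astype(int)
print(test)
```

This assumes a did keeps the same label in the test period as in the training data, which is the premise of the trick.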

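With these two steps we already have a solution far stronger than the baseline: it is more efficient, has richer features, and takes advantage of how the competition data is structured.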

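4.3 Weapon 1: TF-IDF, mining the "keywords" of user behavior

Problem addressed: simple statistics such as nunique cannot capture a user's behavioral preferences.
Application: treat each did's sequence of eid values, mid values, and so on as a "document", then apply TF-IDF to extract the behavior "keywords" that are most characteristic of each user.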

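4.4 Weapon 2: Hyperopt, an intelligent hyperparameter tuner

Problem addressed: hand-picked hyperparameters are almost never optimal.
Application: use Hyperopt, a Bayesian optimization library, to define a parameter search space and let the program find a near-optimal combination within a few dozen iterations. This saves time and effort and typically gives a clear improvement.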

5. Full Code

# Import all required libraries
import pandas as pd
import numpy as np
import json
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from hyperopt import fmin, tpe, hp, Trials
from hyperopt.pyll import scope
import warnings
import time
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

# Global settings
warnings.filterwarnings('ignore')

# 1. Data loading and initial feature engineering
print("Loading data and performing initial feature engineering...")
train_df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./testA_data.csv')
submit_df_original_order = test_df[['did']].copy()

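# Merge the two datasets so they can be processed together
# (ignore_index=True gives the combined frame a unique row index)
full_df = pd.concat([train_df.drop('is_new_did', axis=1), test_df], axis=0, ignore_index=True)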

# Time feature engineering
full_df['ts'] = pd.to_datetime(full_df['common_ts'], unit='ms')
full_df['day'] = full_df['ts'].dt.day
full_df['dayofweek'] = full_df['ts'].dt.dayofweek
full_df['hour'] = full_df['ts'].dt.hour
full_df.drop(['ts'], axis=1, inplace=True)

# Parse the udmap JSON field
def parse_udmap_json(udmap_str):
    try:
        data = json.loads(udmap_str)
        bot_id = str(data.get('botId', 'UNKNOWN_BOT'))
        plugin_id = str(data.get('pluginId', 'UNKNOWN_PLUGIN'))
    except (json.JSONDecodeError, TypeError, AttributeError):
        # AttributeError covers valid JSON that is not an object (e.g. a bare number or list)
        bot_id = 'INVALID_JSON_BOT'
        plugin_id = 'INVALID_JSON_PLUGIN'
    return bot_id, plugin_id

udmap_parsed = full_df['udmap'].apply(lambda x: pd.Series(parse_udmap_json(x)))
full_df['botId'] = udmap_parsed[0]
full_df['pluginId'] = udmap_parsed[1]

# Encode categorical features
cat_features = [
    'device_brand', 'ntt', 'operator', 'common_country',
    'common_province', 'common_city', 'appver', 'channel',
    'os_type', 'udmap', 'botId', 'pluginId'
]
for feature in tqdm(cat_features, desc="Encoding categorical features"):
    le = LabelEncoder()
    full_df[feature] = le.fit_transform(full_df[feature].astype(str))

# Split back into the processed training and test sets
train_df_processed = full_df.iloc[:len(train_df)].copy()
train_df_processed['is_new_did'] = train_df['is_new_did'].values
test_df_processed = full_df.iloc[len(train_df):].copy()
did_is_new_map_train = train_df_processed.groupby('did')['is_new_did'].first()

# 2. Base feature aggregation (did level)
print("\nAggregating base features by did...")
agg_funcs_base = {
    'common_ts': ['count', 'min', 'max', 'mean', 'std', np.ptp], # np.ptp computes the range (max - min)
    'mid': ['nunique'],
    'eid': ['nunique'],
    'hour': ['min', 'max', 'mean', 'std', 'nunique'],
    'day': ['min', 'max', 'mean', 'std', 'nunique'],
    'dayofweek': ['min', 'max', 'mean', 'std', 'nunique'],
}

for col in tqdm(cat_features, desc="Aggregating categorical features"):
    agg_funcs_base[col] = ['nunique']

train_did_base_agg = train_df_processed.groupby('did').agg(agg_funcs_base)
train_did_base_agg.columns = ['_'.join(col).strip() for col in train_did_base_agg.columns.values]
test_did_base_agg = test_df_processed.groupby('did').agg(agg_funcs_base)
test_did_base_agg.columns = ['_'.join(col).strip() for col in test_did_base_agg.columns.values]


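# 3. TF-IDF feature engineering
print("\nCreating TF-IDF features...")

def create_tfidf_features(df, group_col='did', text_cols=None, max_features_per_col=50, vectorizers=None):
    """Build per-did TF-IDF features; pass fitted `vectorizers` to reuse a vocabulary."""
    if text_cols is None:
        text_cols = ['eid', 'mid']
    fit = vectorizers is None
    if fit:
        vectorizers = {}

    # Work on a string copy so the caller's DataFrame is not modified
    text_df = df[[group_col] + list(text_cols)].copy()
    for col in text_cols:
        text_df[col] = text_df[col].astype(str)

    agg_text_df = text_df.groupby(group_col)[text_cols].agg(lambda x: ' '.join(x)).reset_index()

    all_tfidf_features = []
    for col in tqdm(text_cols, desc="Generating TF-IDF"):
        if fit:
            vectorizers[col] = TfidfVectorizer(
                max_features=max_features_per_col,
                token_pattern=r'(?u)\b\w+\b'
            )
            tfidf_matrix = vectorizers[col].fit_transform(agg_text_df[col])
        else:
            # Reuse the training vocabulary so column i means the same token in train and test
            tfidf_matrix = vectorizers[col].transform(agg_text_df[col])
        tfidf_cols = [f'tfidf_{col}_{i}' for i in range(tfidf_matrix.shape[1])]
        tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_cols)
        all_tfidf_features.append(tfidf_df)

    final_tfidf_df = pd.concat(all_tfidf_features, axis=1)
    final_tfidf_df[group_col] = agg_text_df[group_col]
    return final_tfidf_df, vectorizers

tfidf_cols_to_use = ['eid', 'mid', 'appver', 'common_city', 'channel']
# Fit on the training set only, then apply the same vocabulary to the test set
train_tfidf, tfidf_vectorizers = create_tfidf_features(train_df_processed, 'did', tfidf_cols_to_use, max_features_per_col=50)
test_tfidf, _ = create_tfidf_features(test_df_processed, 'did', tfidf_cols_to_use, max_features_per_col=50, vectorizers=tfidf_vectorizers)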

# Merge all features
train_did_df = pd.merge(train_did_base_agg.reset_index(), train_tfidf, on='did', how='left')
test_did_df = pd.merge(test_did_base_agg.reset_index(), test_tfidf, on='did', how='left')
train_did_df['is_new_did'] = train_did_df['did'].map(did_is_new_map_train)

# Align feature columns and fill missing values
model_features = [col for col in train_did_df.columns if col not in ['did', 'is_new_did']]
for col in model_features:
    if col not in test_did_df.columns:
        test_did_df[col] = 0
test_did_df = test_did_df[train_did_df.columns.drop('is_new_did')]

for df in [train_did_df, test_did_df]:
    df[model_features] = df[model_features].fillna(0)
print(f"\nTotal features created: {len(model_features)}")


# 4. Exploit the data leakage
print("\nHandling overlapping dids...")
submit_df_original_order['is_new_did'] = submit_df_original_order['did'].map(did_is_new_map_train).fillna(-1)
unseen_dids_in_test_df = submit_df_original_order[submit_df_original_order['is_new_did'] == -1]
unseen_dids_in_test = unseen_dids_in_test_df['did'].unique()
test_did_df_to_predict = test_did_df[test_did_df['did'].isin(unseen_dids_in_test)].set_index('did').loc[unseen_dids_in_test].reset_index()

X_train_did = train_did_df[model_features]
y_train_did = train_did_df['is_new_did']
X_test_did_to_predict = test_did_df_to_predict[model_features]


# 5. Bayesian optimization (Hyperopt)
print("\nStarting Hyperparameter Optimization with Hyperopt...")
space = {
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),
    'max_depth': scope.int(hp.quniform('max_depth', 5, 15, 1)),
    'num_leaves': scope.int(hp.quniform('num_leaves', 20, 100, 1)),
    'feature_fraction': hp.uniform('feature_fraction', 0.6, 1.0),
    'bagging_fraction': hp.uniform('bagging_fraction', 0.6, 1.0),
    'bagging_freq': scope.int(hp.quniform('bagging_freq', 1, 10, 1)),
    'min_child_samples': scope.int(hp.quniform('min_child_samples', 5, 50, 1)),
    'reg_alpha': hp.loguniform('reg_alpha', np.log(0.01), np.log(10)),
    'reg_lambda': hp.loguniform('reg_lambda', np.log(0.01), np.log(10)),
}

def find_optimal_threshold(y_true, y_pred_proba):
    best_threshold = 0.5; best_f1 = 0
    for threshold in np.arange(0.1, 0.7, 0.01):
        f1 = f1_score(y_true, (y_pred_proba >= threshold).astype(int))
        if f1 > best_f1:
            best_f1 = f1; best_threshold = threshold
    return best_threshold, best_f1

def objective(params):
    params.update({k: int(v) for k, v in params.items() if k in ['max_depth', 'num_leaves', 'bagging_freq', 'min_child_samples']})
    params.update({'objective': 'binary', 'metric': 'binary_logloss', 'n_jobs': -1, 'verbose': -1, 'seed': 42, 'boosting_type': 'gbdt'})
    
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    f1_scores = []
    for train_idx, val_idx in kf.split(X_train_did, y_train_did):
        model = lgb.LGBMClassifier(**params, n_estimators=1000)
        model.fit(X_train_did.iloc[train_idx], y_train_did.iloc[train_idx],
                  eval_set=[(X_train_did.iloc[val_idx], y_train_did.iloc[val_idx])],
                  callbacks=[lgb.early_stopping(50, verbose=False)])
        _, best_f1 = find_optimal_threshold(y_train_did.iloc[val_idx], model.predict_proba(X_train_did.iloc[val_idx])[:, 1])
        f1_scores.append(best_f1)
    return -np.mean(f1_scores)

trials = Trials()
best_params = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials, rstate=np.random.default_rng(42))
print("\nBest parameters found:", best_params)

# 6. Train the final model with the best parameters
print("\nTraining final model...")
final_params = {k: v for k, v in best_params.items()}
final_params.update({k: int(v) for k, v in final_params.items() if k in ['max_depth', 'num_leaves', 'bagging_freq', 'min_child_samples']})
final_params.update({'objective': 'binary', 'metric': 'binary_logloss', 'n_jobs': -1, 'seed': 42, 'boosting_type': 'gbdt'})

n_folds = 5
kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
test_preds = np.zeros(len(X_test_did_to_predict))
fold_thresholds = []; models = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X_train_did, y_train_did)):
    print(f"\n======= Fold {fold + 1}/{n_folds} =======")
    model = lgb.LGBMClassifier(**final_params, n_estimators=2000)
    model.fit(X_train_did.iloc[train_idx], y_train_did.iloc[train_idx],
              eval_set=[(X_train_did.iloc[val_idx], y_train_did.iloc[val_idx])],
              # LightGBM has no built-in 'f1' metric, so early stopping tracks binary_logloss from final_params
              callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)])
    models.append(model)
    val_proba = model.predict_proba(X_train_did.iloc[val_idx])[:, 1]
    best_thr, best_f1 = find_optimal_threshold(y_train_did.iloc[val_idx], val_proba)
    fold_thresholds.append(best_thr)
    print(f"Fold {fold + 1} Best F1: {best_f1:.5f} at Threshold: {best_thr:.2f}")
    test_preds += model.predict_proba(X_test_did_to_predict)[:, 1] / n_folds

# 7. Generate the submission file
avg_threshold = np.mean(fold_thresholds)
print(f"\nAverage Optimal Threshold: {avg_threshold:.4f}")
unseen_labels = (test_preds >= avg_threshold).astype(int)
unseen_map = pd.Series(unseen_labels, index=unseen_dids_in_test)
submit_df_original_order.loc[submit_df_original_order['did'].isin(unseen_dids_in_test), 'is_new_did'] = \
    submit_df_original_order.loc[submit_df_original_order['did'].isin(unseen_dids_in_test), 'did'].map(unseen_map)
submit_df_original_order['is_new_did'] = submit_df_original_order['is_new_did'].replace(-1, 0).astype(int)

submission_filename = 'submit_final.csv'
submit_df_original_order[['is_new_did']].to_csv(submission_filename, index=False)
print(f"\nSubmission file saved: {submission_filename}")

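This optimization journey is far from over. Several directions are still worth exploring:

More advanced feature engineering:
- Target Encoding: apply target encoding to high-cardinality categorical features, which brings label information in directly.
- RFM model: build Recency (time since the last action), Frequency, and Monetary (value) features, a staple of user value analysis.
Model ensembling:
- Combine the predictions of several strong models such as LightGBM, XGBoost, and CatBoost by weighted averaging or stacking, which usually yields a small but stable gain.
Exploring deep learning:
- For behavior sequences, try recurrent networks such as GRU or LSTM to model the sequence directly, which may uncover more complex temporal dependencies.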
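
As one possible starting point for the target-encoding idea, the sketch below computes a smoothed, out-of-fold encoding so that a row's own label never feeds into its encoded value. The function name oof_target_encode, the smoothing constant, and the toy channel column are assumptions made for this example and are not part of the script above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=42):
    """Out-of-fold mean target encoding with additive smoothing (illustrative)."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        prior = fit_part[target].mean()  # global rate in the fitting folds
        stats = fit_part.groupby(col)[target].agg(["sum", "count"])
        # Shrink small categories toward the prior.
        smoothed = (stats["sum"] + prior * smoothing) / (stats["count"] + smoothing)
        values = df.iloc[enc_idx][col].map(smoothed).fillna(prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

# Toy user-level table (values are made up).
users = pd.DataFrame({
    "channel": ["a"] * 6 + ["b"] * 4,
    "is_new_did": [1, 1, 0, 1, 1, 0, 0, 0, 0, 1],
})
users["channel_te"] = oof_target_encode(users, "channel", "is_new_did", n_splits=2)
print(users)
```

For the test set, the encoding would be fitted once on the full training table and then mapped onto the test rows.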