Datawhale Summer Camp 2024: Molecular Property Prediction Competition Study Notes

深海小咸鱼

Last modified: 2024-07-09 19:16:46


Tags: study notes

First published: 2024-07-07 19:41:27

Article link: https://blog.youkuaiyun.com/Yixuanxia/article/details/140220771


I. Problem Background

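This summer camp is a hands-on learning activity built around the iFLYTEK Open Platform "Molecular Property Prediction Challenge". It is well suited to anyone who wants to get started with machine learning and practice it. By following the handbook, the baseline can be completed in little more than ten minutes.

Handbook: Zero-to-One Introduction to AI (Machine Learning) Competitions - Feishu Docs (feishu.cn)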

Competition Overview

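Artificial intelligence (AI) has shown great potential in chemistry and drug discovery. Accurately predicting molecular properties helps screen efficiently for drug candidates with strong performance. Take PROTACs as an example: a PROTAC is a molecule made of three parts, a target-protein ligand, a linker, and an E3 ligase ligand. It brings the target protein and the E3 ligase together into a ternary complex so that the target protein is degraded. This competition focuses on using AI algorithms to predict the degradation efficacy of PROTACs, with the aim of encouraging creative approaches and a closer integration of AI with chemical biology.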

Competition Task

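Participants start from the provided demo dataset. They may enlarge it through data augmentation or by collecting additional data, and they split the data themselves. The goal is to predict the degradation capability of PROTACs using deep learning, reinforcement learning, or any better-suited AI method. If DC50 > 100 nM and Dmax < 80%, the degradation capability is considered poor (Label=0 in the demo dataset). If DC50 <= 100 nM or Dmax >= 80%, it is considered good (Label=1 in the demo dataset).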
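The labeling rule above can be written down in a few lines. This is a minimal sketch of the rule as stated in the task description. The function name `degradation_label` and the example values are made up for illustration, and the sketch assumes both measurements are present (how missing values are labeled is not stated in the source).

```python
def degradation_label(dc50_nm, dmax_pct):
    """Return 1 (good degrader) or 0 (poor degrader) following the stated rule."""
    # Good if DC50 <= 100 nM OR Dmax >= 80 %.
    # Poor only when both criteria fail (DC50 > 100 nM AND Dmax < 80 %).
    if dc50_nm <= 100 or dmax_pct >= 80:
        return 1
    return 0


# Hypothetical (DC50 in nM, Dmax in %) pairs, not taken from the dataset.
examples = [(50.0, 60.0), (500.0, 90.0), (500.0, 60.0), (100.0, 10.0)]
labels = [degradation_label(dc50, dmax) for dc50, dmax in examples]
print(labels)  # [1, 1, 0, 1]
```

Because the two conditions are joined by "or", a molecule needs to pass only one of the two thresholds to be labeled 1.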

Competition Data

A sample of the competition data is shown below:

[Figure: sample rows of the competition data]

II. Modeling Process

Approach

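Our task is to build a model from the training samples that predicts the property of the molecules in the test set. This is a binary classification task: given features such as analysis-related information and structural information, predict each molecule's property label. Concretely, participants carry out feature engineering, model selection, and training on the given dataset, then use the trained model to predict the molecules in the test set and generate the corresponding prediction file.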

Reference Code

# 1. Import the required libraries
# pandas: data loading and manipulation
import pandas as pd
# numpy: numerical computing and multi-dimensional arrays
import numpy as np
# LGBMClassifier: LightGBM's scikit-learn style classifier
from lightgbm import LGBMClassifier


# 2. Read the training and test sets
# Read the training data from 'traindata-new.xlsx' with read_excel()
train = pd.read_excel('./data/data280993/traindata-new.xlsx')
# Read the test data from 'testdata-new.xlsx' with read_excel()
test = pd.read_excel('./data/data280993/testdata-new.xlsx')

# 3. Feature engineering
# 3.1 The test data has no 'DC50 (nM)' or 'Dmax (%)' columns, so drop them from train
train = train.drop(['DC50 (nM)', 'Dmax (%)'], axis=1)

# 3.2 Replace every object-typed (non-numeric) column with a boolean
#     missing-value indicator: True where the value is null, False otherwise
for col in train.columns[2:]:
    if train[col].dtype == object or test[col].dtype == object:
        train[col] = train[col].isnull()
        test[col] = test[col].isnull()

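# 4. Train a LightGBM gradient-boosted decision tree model and predict
# The slices assume train columns are [uuid, Label, features...]
# and test columns are [uuid, features...]
model = LGBMClassifier(verbosity=-1)
model.fit(train.iloc[:, 2:].values, train['Label'])
pred = model.predict(test.iloc[:, 1:].values)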

# 5. Save the submission file locally
pd.DataFrame(
    {
        'uuid': test['uuid'],
        'Label': pred
    }
).to_csv('submit.csv', index=False)

Evaluation Metric

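The competition is scored with f1_score: the higher the score, the better. The F1 score is a statistical measure of the quality of a binary (or multi-label binary) classifier. It takes both precision and recall into account and is their harmonic mean. Suppose there are 100 samples, 1 positive and 99 negative. A model that always outputs 0 reaches 99% accuracy, yet it never finds the positive sample, so its recall and F1 score are both 0. Accuracy is clearly the wrong yardstick in that situation.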
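The 100-sample example can be checked directly. The sketch below computes precision, recall, and F1 from the confusion counts using only the standard library. The helper name `precision_recall_f1` is made up for illustration, and returning 0 when a ratio is undefined (no predicted positives) is a convention assumed here.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: an undefined ratio (division by zero) counts as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# 1 positive and 99 negatives; the model always predicts 0.
y_true = [1] + [0] * 99
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, f1)  # 0.99 0.0
```

In practice scikit-learn's `f1_score`, which the advanced code in this post imports, computes the same quantity.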

Optimization Suggestions

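1. Extract more features: in data-mining competitions, features are usually what decides the outcome. Think about what information could improve prediction accuracy, then turn it into features for the model. For this task, features can be built from domain knowledge. Besides the Smiles column, other fields carry valuable information. InChI, for example, is made up of a series of layers that describe the molecular structure in detail, such as the version prefix, molecular formula, connection table, hydrogen-atom counts, stereochemistry, isotopic information, tautomer information, and charge.
2. Try different models: models differ considerably, and so do their predictions. A competition is a process of continual experimentation and trial and error. Experimenting to find the best model also deepens your own understanding of the models.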
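As one way to act on the first suggestion, the sketch below splits an InChI string into its layers and derives a few simple features from them. It is an illustration under stated assumptions, not the author's method: the function names and feature names are hypothetical, the string is assumed to be a standard InChI (each layer prefix appears once), and formulas with multiple components or leading multipliers are not handled.

```python
import re


def split_inchi_layers(inchi):
    """Split a standard InChI string into a dict of layers."""
    body = inchi.split("=", 1)[1]           # drop the "InChI=" prefix
    parts = body.split("/")
    layers = {"version": parts[0]}           # e.g. "1S"
    if len(parts) > 1:
        layers["formula"] = parts[1]         # e.g. "C2H6O"
    for part in parts[2:]:
        # Remaining layers start with a one-letter prefix such as "c" or "h".
        layers[part[0]] = part[1:]
    return layers


def inchi_features(inchi):
    """Derive a few hypothetical numeric features from the InChI layers."""
    layers = split_inchi_layers(inchi)
    atom_counts = {}
    # An element symbol followed by an optional count, e.g. "C2", "O".
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", layers.get("formula", "")):
        atom_counts[element] = atom_counts.get(element, 0) + (int(count) if count else 1)
    return {
        "n_carbon": atom_counts.get("C", 0),
        "n_nitrogen": atom_counts.get("N", 0),
        "n_oxygen": atom_counts.get("O", 0),
        "has_stereo": int(any(key in layers for key in ("b", "t", "m", "s"))),
        "has_charge": int("q" in layers or "p" in layers),
    }


ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
layers = split_inchi_layers(ethanol)
features = inchi_features(ethanol)
print(layers)
print(features)
```

Applied to a column of InChI strings, the resulting dictionaries can be expanded into extra numeric columns alongside the existing features.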

III. Advanced Version (Feature Fusion)

Install Dependencies

!pip install lightgbm==2.3.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install xgboost==2.0.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install catboost
!pip install rdkit
!pip install openpyxl

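(Some parameters used when calling lightgbm and xgboost have changed in newer releases, so installing newer versions is not recommended for this code. For example, the verbose_eval and early_stopping_rounds arguments passed to lgb.train below were removed in LightGBM 4.0 in favor of callbacks.)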

Import Dependencies

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold, KFold, GroupKFold
from sklearn.metrics import f1_score
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.feature_extraction.text import TfidfVectorizer
import tqdm, sys, os, gc, re, argparse, warnings
warnings.filterwarnings('ignore')

Data Preprocessing

train = pd.read_excel('./data/data280993/traindata-new.xlsx')
test = pd.read_excel('./data/data280993/testdata-new.xlsx')

# The test data has no 'DC50 (nM)' or 'Dmax (%)' columns, so drop them from train
train = train.drop(['DC50 (nM)', 'Dmax (%)'], axis=1)

# Collect the names of columns that have fewer than 10 non-null values in the test set
drop_cols = []
for f in test.columns:
    if test[f].notnull().sum() < 10:
        drop_cols.append(f)

# Drop those columns from both sets so that mostly-missing columns are not used for modeling
train = train.drop(drop_cols, axis=1)
test = test.drop(drop_cols, axis=1)

# Concatenate the cleaned train and test sets into one DataFrame so feature engineering is applied uniformly
data = pd.concat([train, test], axis=0, ignore_index=True)
cols = data.columns[2:]

Feature Engineering

# Parse each SMILES string into an RDKit molecule and write it back out as a canonical isomeric SMILES string
data['smiles_list'] = data['Smiles'].apply(lambda x:[Chem.MolToSmiles(mol, isomericSmiles=True) for mol in [Chem.MolFromSmiles(x)]])
data['smiles_list'] = data['smiles_list'].map(lambda x: ' '.join(x))

# Compute TF-IDF features with TfidfVectorizer
tfidf = TfidfVectorizer(max_df = 0.9, min_df = 1, sublinear_tf = True)
res = tfidf.fit_transform(data['smiles_list'])

# Convert the sparse result into a DataFrame
tfidf_df = pd.DataFrame(res.toarray())
tfidf_df.columns = [f'smiles_tfidf_{i}' for i in range(tfidf_df.shape[1])]

# Append the TF-IDF columns to data
data = pd.concat([data, tfidf_df], axis=1)

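# Integer (label) encoding: map each distinct non-missing value to 0, 1, 2, ...
# in order of first appearance; missing values stay NaN
def label_encode(series):
    unique = list(series.dropna().unique())
    return series.map(dict(zip(unique, range(len(unique)))))

for col in cols:
    if data[col].dtype == 'object':
        data[col] = label_encode(data[col])

# Rows with a Label are training rows; rows without one are test rows
train = data[data.Label.notnull()].reset_index(drop=True)
test = data[data.Label.isnull()].reset_index(drop=True)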

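# Feature selection: use every column except the ID, the label, and the raw text column
features = [f for f in train.columns if f not in ['uuid','Label','smiles_list']]

# Build the training and test feature matrices
x_train = train[features]
x_test = test[features]

# Training labels
y_train = train['Label'].astype(int)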
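It helps to see which "words" the TF-IDF step above actually extracts from a SMILES string. With its default settings, scikit-learn's TfidfVectorizer lowercases the text and keeps tokens matching the pattern `(?u)\b\w\w+\b`, that is, runs of two or more word characters. The sketch below reproduces that tokenization with the standard library only. The helper names are made up for illustration, and the character n-gram function is an alternative shown for comparison, not something the original code uses.

```python
import re

# Default TfidfVectorizer behavior: lowercase=True, token_pattern=r"(?u)\b\w\w+\b".
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_tokens(text):
    """Tokens the default word analyzer would produce for `text`."""
    return DEFAULT_TOKEN_PATTERN.findall(text.lower())


def char_ngrams(text, n=2):
    """All overlapping character n-grams of `text` (case preserved)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


smiles = "CC(=O)Nc1ccccc1"  # acetanilide
tokens = default_tokens(smiles)
bigrams = char_ngrams("CC(=O)N", 2)
print(tokens)   # ['cc', 'nc1ccccc1']
print(bigrams)  # ['CC', 'C(', '(=', '=O', 'O)', ')N']
```

Brackets, bond symbols, and single-character fragments are discarded by the default analyzer, and aromatic lowercase atoms become indistinguishable from aliphatic ones after lowercasing. If that loses too much, one option worth trying is `TfidfVectorizer(analyzer='char', ngram_range=(1, 3), lowercase=False)`, which works on character n-grams instead.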

Model Fusion

import lightgbm as lgb
import xgboost as xgb

def cv_model(clf, train_x, train_y, test_x, clf_name, seed = 2023):
    folds = 5
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    oof = np.zeros(train_x.shape[0])
    test_predict = np.zeros(test_x.shape[0])
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        # Use positional indexing (.iloc) for both features and labels
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y.iloc[train_index], train_x.iloc[valid_index], train_y.iloc[valid_index]
        
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'min_child_weight': 6,
                'num_leaves': 2 ** 6,
                'lambda_l2': 10,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.35,
                'seed': 2024,
                'nthread' : 16,
                'verbose' : -1,
            }
            model = clf.train(params, train_matrix, 2000, valid_sets=[train_matrix, valid_matrix],
                              categorical_feature=[], verbose_eval=1000, early_stopping_rounds=100)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        
        if clf_name == "xgb":
            xgb_params = {
              'booster': 'gbtree', 
              'objective': 'binary:logistic',
            #   'num_class':2,
              'max_depth': 5,
              'lambda': 10,
              'subsample': 0.7,
              'colsample_bytree': 0.7,
              'colsample_bylevel': 0.7,
              'eta': 0.35,
              'tree_method': 'hist',
              'seed': 520,
              'nthread': 16
              }
            train_matrix = clf.DMatrix(trn_x , label=trn_y)
            valid_matrix = clf.DMatrix(val_x , label=val_y)
            test_matrix = clf.DMatrix(test_x)
            
            watchlist = [(train_matrix, 'train'),(valid_matrix, 'eval')] 
            
            model = clf.train(params=xgb_params, dtrain=train_matrix, num_boost_round=2000,  evals=watchlist, verbose_eval=1000, early_stopping_rounds=100)
            val_pred  = model.predict(valid_matrix)
            test_pred = model.predict(test_matrix)
            
        if clf_name == "cat":
            params = {'learning_rate': 0.35, 'depth': 5, 'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 100, 'random_seed': 11, 'allow_writing_files': False}
            
            model = clf(iterations=2000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      metric_period=1000,
                      use_best_model=True, 
                      cat_features=[],
                      verbose=1)
            # Note: CatBoostClassifier.predict_proba returns a 2-D array; take the second column as the positive-class probability
            val_pred  = model.predict_proba(val_x)[:, 1]
            test_pred = model.predict_proba(test_x)[:, 1]

        
        oof[valid_index] = val_pred
        test_predict += test_pred / kf.n_splits
        
        F1_score = f1_score(val_y, np.where(val_pred>0.5, 1, 0))
        cv_scores.append(F1_score)
        print(cv_scores)
        
    return oof, test_predict

# Following the demo, call cv_model once per model, as in the baseline practice section
# LightGBM
lgb_oof, lgb_test = cv_model(lgb, x_train, y_train, x_test, 'lgb')
# XGBoost
xgb_oof, xgb_test = cv_model(xgb, x_train, y_train, x_test, 'xgb')
# CatBoost
cat_oof, cat_test = cv_model(CatBoostClassifier, x_train, y_train, x_test, 'cat')
# Fuse the three models by averaging their predicted probabilities
final_test = (lgb_test + xgb_test + cat_test) / 3
 
pd.DataFrame(
    {
        'uuid': test['uuid'],
        'Label': np.where(final_test>0.5, 1, 0)
    }
).to_csv('submit_2.csv', index=None)
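cv_model also returns out-of-fold (OOF) predictions, which the code above does not use. They allow a local estimate of how the averaged ensemble scores before submitting. The sketch below is one possible way to do that, using only the standard library. The helper names and the toy probabilities are made up for illustration, and in practice `lgb_oof`, `xgb_oof`, `cat_oof`, and `y_train` from the code above would take their place.

```python
def blend(*prob_lists):
    """Average several lists of predicted probabilities element-wise."""
    return [sum(ps) / len(ps) for ps in zip(*prob_lists)]


def f1_at_threshold(y_true, probs, threshold):
    """F1 of the positive class when predicting 1 for probs > threshold."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy out-of-fold probabilities (made-up numbers, for illustration only).
y_true = [1, 0, 1, 1, 0, 0]
toy_lgb_oof = [0.9, 0.4, 0.7, 0.3, 0.2, 0.7]
toy_xgb_oof = [0.8, 0.3, 0.5, 0.5, 0.1, 0.6]
toy_cat_oof = [0.7, 0.2, 0.6, 0.4, 0.3, 0.5]

blend_oof = blend(toy_lgb_oof, toy_xgb_oof, toy_cat_oof)
scores = {t: f1_at_threshold(y_true, blend_oof, t) for t in (0.35, 0.5, 0.65)}
best_threshold = max(scores, key=scores.get)
print(scores)
print(best_threshold)
```

A threshold chosen this way is tuned on the training data, so on a small dataset it may not carry over to the test set.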

Comparison of the Three Boosting Algorithms

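XGBoost, LightGBM, and CatBoost are all classic state-of-the-art (SOTA) boosting algorithms. All three belong to the gradient-boosted decision tree family, and each has its own strengths in accuracy and speed.
They differ mainly in two respects. 1. Tree construction: XGBoost grows trees level-wise by default, LightGBM grows them leaf-wise, and CatBoost uses symmetric trees, in which every node at a given depth shares the same split, so each tree is a complete binary tree. 2. Categorical features: XGBoost has traditionally required categorical features to be converted to numbers by hand before they are fed to the model (recent versions add optional support through the enable_categorical setting). LightGBM handles them automatically once the categorical columns are specified. CatBoost, which is known for its categorical-feature handling, encodes them efficiently using methods such as target statistics.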
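To make "target statistics" more concrete, the sketch below encodes a categorical column by the smoothed mean of the labels seen in earlier rows only, so a row's own label never leaks into its encoding. This is a simplified illustration of the idea, not CatBoost's actual implementation (which, among other things, works over random permutations of the data). The function name, the `prior` and `prior_weight` parameters, and the example values are assumptions made for this sketch.

```python
def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row's category using only the targets of preceding rows."""
    sums, counts, encoded = {}, {}, []
    for category, target in zip(categories, targets):
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        # Smoothed mean of earlier targets; equals `prior` for an unseen category.
        encoded.append((seen_sum + prior_weight * prior) / (seen_count + prior_weight))
        # Only now add the current row's target to the running statistics.
        sums[category] = seen_sum + target
        counts[category] = seen_count + 1
    return encoded


# Hypothetical E3 ligase column and labels, not taken from the dataset.
e3_ligase = ["VHL", "CRBN", "VHL", "VHL", "CRBN"]
label = [1, 0, 1, 0, 1]
encoded = ordered_target_stats(e3_ligase, label)
print(encoded)
```

The first occurrence of each category receives the prior, and later occurrences move toward the mean label observed for that category so far.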

#AISummerCamp #Datawhale #MachineLearning