山东大学机器学习大作业 Recsys 2024 实验代码 (丐版)
前言
啥也不会,仅供参考,代码只是能跑出来,准确率我也没测,还有给的各种指标我也没测。应该在那个
train_small
数据集上测
设计的模型借鉴了一点Bert4Rec的思路
模型设计很简陋,本来是想用余弦相似度计算得分的,但是莫的时间
test数据集感觉要跑好长时间,直接开摆,也不知道最后要不要提交结果上去
有反馈说需要评估结果截图的,这里放一个
生成的txt文件下载链接
基本思路
- 直接根据用户历史记录来学习用户特征
- 用户历史记录统一规范化为100条,不足的补0,多的裁减
- 把用户历史记录的
title
转换成向量(用那bert模型),统一长度为29,多的裁减,少的补0 - 在每个title对应的向量开头加一个
user_id
,这样最后的每个历史记录的embedding就是30了
- 因为前边已经规范化为100条了,所以输出维度也是100,代表每一个标题的得分,优劣参考历史记录对应的滚动百分比
hint
- 有一个问题就是这个模型设计上就有很大问题(),就是对于一个新的文章,在输入列表的不同位置,得分差别很大,几乎没啥逻辑性。所以这个就是能跑而已哈哈哈。 (有时间就改了)
- 聚类或K近邻感觉都比这个靠谱()
- 我这里用的gpu, 应该你是GPU/CPU都不会出啥问题。内存不足就减少一下那个
dataloader
的batch_size
- 性能分析应该就没啥问题
- 等跑完了会上传模型文件(trained_model)以及预测结果文件(pred_re.txt)到网址
2024 6.4 添加了几个指标的评测 直接使用train_small 训练集进行训练与评测,未生成txt
文件
模型训练训练部分
初始化操作
- 导入用到的库
- 加载模型
- 定义文本到向量的函数
from transformers import AutoModelForMaskedLM, AutoTokenizer,AutoModel
from torch.utils.data import DataLoader,Dataset
from torch import nn
from torch.nn import TransformerEncoderLayer, TransformerEncoder
from transformers import pipeline
import torch
import numpy as np
import torch.nn as nn
from tqdm import tqdm
model_path = './model'
transformer_model = AutoModel.from_pretrained(model_path)
transformer_tokenizer = AutoTokenizer.from_pretrained(model_path)
MAX_VECTOR_LEN=29
def text_to_tensor(texts)->torch.Tensor:#传入文本列表
re=torch.Tensor([])
for text in texts:
tmp=transformer_tokenizer(text,max_length=MAX_VECTOR_LEN,padding='max_length',truncation=True,return_tensors='pt')['input_ids']
re=torch.cat((re,tmp),dim=0)
return re
2024-06-04 11:59:51.045301: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-04 11:59:51.047871: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-04 11:59:51.077469: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-04 11:59:51.731127: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
加载训练数据
训练数据主要结构(使用到的)
- train_behaviiors
- 用户的一些特征
- 展示的文章id列表
- train_history
- 用户的历史行为记录
- 包含了用户以前浏览的文章信息
- 用户的历史行为记录
- train_article
- 文章信息(这里仅用到了id,title和subtitle)
import pandas as pd
train_behaviors=pd.read_parquet('./data/train/behaviors.parquet')
train_history=pd.read_parquet('./data/train/history.parquet')
train_article=pd.read_parquet('./data/articles.parquet')
print('train_behaviors')
print(train_behaviors.info())
print('train_history')
print(train_history.info())
print('train_article')
print(train_article.info())
train_behaviors
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232887 entries, 0 to 232886
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 impression_id 232887 non-null uint32
1 article_id 70421 non-null float64
2 impression_time 232887 non-null datetime64[us]
3 read_time 232887 non-null float32
4 scroll_percentage 69098 non-null float32
5 device_type 232887 non-null int8
6 article_ids_inview 232887 non-null object
7 article_ids_clicked 232887 non-null object
8 user_id 232887 non-null uint32
9 is_sso_user 232887 non-null bool
10 gender 16219 non-null float64
11 postcode 4673 non-null float64
12 age 6341 non-null float64
13 is_subscriber 232887 non-null bool
14 session_id 232887 non-null uint32
15 next_read_time 226669 non-null float32
16 next_scroll_percentage 206617 non-null float32
dtypes: bool(2), datetime64[us](1), float32(4), float64(4), int8(1), object(2), uint32(3)
memory usage: 19.3+ MB
None
train_history
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15143 entries, 0 to 15142
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 15143 non-null uint32
1 impression_time_fixed 15143 non-null object
2 scroll_percentage_fixed 15143 non-null object
3 article_id_fixed 15143 non-null object
4 read_time_fixed 15143 non-null object
dtypes: object(4), uint32(1)
memory usage: 532.5+ KB
None
train_article
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20738 entries, 0 to 20737
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 article_id 20738 non-null int32
1 title 20738 non-null object
2 subtitle 20738 non-null object
3 last_modified_time 20738 non-null datetime64[us]
4 premium 20738 non-null bool
5 body 20738 non-null object
6 published_time 20738 non-null datetime64[us]
7 image_ids 18860 non-null object
8 article_type 20738 non-null object
9 url 20738 non-null object
10 ner_clusters 20738 non-null object
11 entity_groups 20738 non-null object
12 topics 20738 non-null object
13 category 20738 non-null int16
14 subcategory 20738 non-null object
15 category_str 20738 non-null object
16 total_inviews 9968 non-null float64
17 total_pageviews 9856 non-null float64
18 total_read_time 9856 non-null float32
19 sentiment_score 20738 non-null float32
20 sentiment_label 20738 non-null object
dtypes: bool(1), datetime64[us](2), float32(2), float64(2), int16(1), int32(1), object(12)
memory usage: 2.8+ MB
None
def find_train_article_title(article_ids,train_article=train_article):
value_to_index_dict = {value: index for index, value in enumerate(article_ids)}
bool_index=train_article['article_id'].isin(article_ids)
selected_index=train_article[bool_index].index
re_index=[]
for i in range(len(selected_index)):
re_index.append(value_to_index_dict[train_article['article_id'][selected_index[i]]])
return train_article[train_article['article_id'].isin(article_ids)]['title'],re_index
def find_train_article_sub_title(article_ids,train_article=train_article):
return train_article[train_article['article_id'].isin(article_ids)]['subtitle']
训练数据集的定义
self.data
的结构是[ [ [article_id,… … ] ] …]- get_item()返回对应的掩码的序列和对应的原序列,用于训练以及计算损失
class HistoryDataSet(Dataset):
def __init__(self,data,max_seq_len) -> None:
super().__init__()
self.data=data
self.max_seq_len=max_seq_len
def __len__(self):
return self.data[0].shape[0]
def cope_seq(self,seq:torch.Tensor,idx:int):
user_id=self.data[2][idx]
seq_re=torch.ones((seq.shape[0],1),dtype=torch.float64)*user_id
seq_re=torch.cat((seq_re,seq),dim=1)
return seq_re
def __getitem__(self,idx):
#获取用户的信息
article_ids=self.data[0][idx]
texts,selected_indices=find_train_article_title(article_ids,train_article=train_article)
seq=text_to_tensor(texts)
#规范化,对历史记录的长度进行规范
score=torch.Tensor(self.data[1][idx][selected_indices])
padding_nums=self.max_seq_len-len(seq)
if(padding_nums<=0):
seq=seq[:self.max_seq_len]
score=score[:self.max_seq_len]
else:
seq=torch.cat((seq,torch.zeros((padding_nums,len(seq[0])))))
score=torch.cat((score,torch.zeros((padding_nums,))))
score=torch.nan_to_num(score,50)
re_seq=self.cope_seq(seq,idx)
re_seq=torch.tensor(re_seq,dtype=torch.float64)
score=torch.tensor(score,dtype=torch.float64)
return re_seq,score
构造训练集
train_data=[train_history['article_id_fixed'],train_history['scroll_percentage_fixed'],train_history['user_id']]
train_data_set=HistoryDataSet(train_data,max_seq_len=100)
seq,score=train_data_set.__getitem__(0)
train_data_set.__len__()
/tmp/ipykernel_46213/1734182082.py:29: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
re_seq=torch.tensor(re_seq,dtype=torch.float64)
/tmp/ipykernel_46213/1734182082.py:30: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
score=torch.tensor(score,dtype=torch.float64)
15143
定义Bert4Rec模型
- 通过历史记录中的
滚动百分比
来衡量用户对文章兴趣程度 - 对于无效值默认设置为50
模型代码
- 初始化模型的参数
- embed_dim=30:嵌入维度,即每个输入标题的向量表示长度
- 这里实际上对于每个文章的
title
的维度是29,在头部添加了一个user_id
用于区分不同的用户特征
- 这里实际上对于每个文章的
- num_heads=10:Transformer编码器中自注意力机制的头数
- num_layers=6:Transformer编码器的层数,加深模型的非线性表达能力。
- dropout=0.2:用于防止过拟合的丢弃率
- out_dim=100:输出层的维度,对应的输出100个分别对应用户可能对文章的兴趣度(输入的就是100个记录,在后续的预测中也会对展示的商品处理)
- embed_dim=30:嵌入维度,即每个输入标题的向量表示长度
模型结构
- 模型输入
(user_nums, user_history_nums, title_embedding_len=30)
- Transformer编码: 序列通过多层Transformer编码器,用于捕获序列内部的长距离依赖关系
- 平均池化: 编码后的序列通过序列长度维度进行平均池化,得到每个批次样本的固定长度表示,视为用户历史行为的综合表示
- 线性变换: 这一步通过一个全连接层将用户历史行为的综合表示转换为输出维度out_dim的向量,算是转换到一个潜在的打分空间
import torch
import torch.nn as nn
class SimplifiedBert4Rec(nn.Module):
def __init__(self, embed_dim=30, num_heads=5, num_layers=3, dropout=0.1, out_dim=100):
super(SimplifiedBert4Rec, self).__init__()
self.transformer_encoder = nn.TransformerEncoder(
nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, dim_feedforward=embed_dim*4, dropout=dropout),
num_layers=num_layers
)
self.linear = nn.Linear(embed_dim, out_dim) # 输出层用于预测得分
def forward(self, seqs):
"""
seqs: 用户历史记录的嵌入表示,形状为(user_nums, user_history_nums, title_embedding_len=30)
"""
seqs = torch.permute(seqs, (1, 0, 2)) # 调整形状以便输入(seq_len, batch, embed_dim)
seqs = self.transformer_encoder(seqs.float())
pooled_output = seqs.mean(dim=0) # 平均池化获取序列的表示
seqs=self.linear(pooled_output) #预测得分(这里就是滚动百分比)
return seqs/100#返回对于每一个历史记录的预测滚动百分比
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model实例化
from torch.nn.utils.rnn import pad_sequence
model = SimplifiedBert4Rec()
model=model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()
num_epochs=8
data_loader=DataLoader(train_data_set,batch_size=1)
for epoch in range(num_epochs):
id=0
sum_loss=[]
for user_hist, labels in tqdm(data_loader):
user_hist, labels = user_hist.to(device), labels.to(device) # 数据转移到GPU
optimizer.zero_grad()
preds = model(user_hist)
labels=labels[0]/100
loss = criterion(preds.float(), labels.float())
loss.backward()
optimizer.step()
sum_loss.append(loss.item())
print(f"Epoch {epoch+1}/{num_epochs}, Loss: {np.array(loss.detach().cpu()).mean()}")
0%| | 0/15143 [00:00<?, ?it/s]
/tmp/ipykernel_46213/1734182082.py:29: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
re_seq=torch.tensor(re_seq,dtype=torch.float64)
/tmp/ipykernel_46213/1734182082.py:30: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
score=torch.tensor(score,dtype=torch.float64)
100%|██████████| 15143/15143 [03:53<00:00, 64.94it/s]
Epoch 1/8, Loss: 0.013482279144227505
100%|██████████| 15143/15143 [03:53<00:00, 64.99it/s]
Epoch 2/8, Loss: 0.014844291843473911
100%|██████████| 15143/15143 [03:53<00:00, 64.89it/s]
Epoch 3/8, Loss: 0.014158891513943672
100%|██████████| 15143/15143 [03:52<00:00, 65.01it/s]
Epoch 4/8, Loss: 0.014420343562960625
100%|██████████| 15143/15143 [03:53<00:00, 64.77it/s]
Epoch 5/8, Loss: 0.012479839846491814
100%|██████████| 15143/15143 [03:53<00:00, 64.99it/s]
Epoch 6/8, Loss: 0.014308840036392212
100%|██████████| 15143/15143 [03:53<00:00, 64.93it/s]
Epoch 7/8, Loss: 0.012692894786596298
100%|██████████| 15143/15143 [03:53<00:00, 64.81it/s]
Epoch 8/8, Loss: 0.013837754726409912
torch.save(model,'./trianed_model_ts.pickle')
预测部分
处理预测数据,根据test_behaviors.parquet
预测
- 利用test中的behavior提供的数据
- 对artivle_inview的文章列表计算其得分并根据得分给出用户对文章列表中文章感兴趣程度的一个列表,对列表进行排序即可得到结果
model=torch.load('./trianed_model_ts.pickle')
print(model)
impression_articles_in_view_ids=train_behaviors['article_ids_inview']
impression_ids=train_behaviors['impression_id']
user_ids=train_behaviors['user_id']
clicked_titles=train_behaviors['article_ids_clicked']
model.eval()
def get_history_by_id(user_id):
return train_history[train_history['user_id']==user_id]
def gen_pre_data(user_id,article_ids):
history_ids=get_history_by_id(user_id)['article_id_fixed']
history_ids=history_ids.to_list()
pred_titles=text_to_tensor(find_train_article_title(article_ids,train_article)[0])
num=pred_titles.shape[0]
tmp_num=num
now_pos=0
pred_data_splited=[]
while(tmp_num-100>0):
pred_data_splited.append(pred_titles[now_pos:now_pos+100])
tmp_num-=100
now_pos+=100
pred_data_splited.append(pred_titles[now_pos:])
if(pred_data_splited[-1].shape[0]<100):
pred_data_splited[-1]=torch.cat((pred_data_splited[-1],torch.zeros((100-pred_data_splited[-1].shape[0],29),dtype=torch.float64)),dim=0)
for i in range(len(pred_data_splited)):
pred_data_splited[i]=torch.cat((torch.ones(100,1)*user_id,pred_data_splited[i]),dim=1)
pred_data_splited=torch.stack(pred_data_splited)
return pred_data_splited,num
SimplifiedBert4Rec(
(transformer_encoder): TransformerEncoder(
(layers): ModuleList(
(0-2): 3 x TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=30, out_features=30, bias=True)
)
(linear1): Linear(in_features=30, out_features=120, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=120, out_features=30, bias=True)
(norm1): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
)
)
)
(linear): Linear(in_features=30, out_features=100, bias=True)
)
读取测试数据并处理预测 并 计算正确率
total_num=0
correct_num=0
correct_list=[]
pred_re=[]
real_re=[]
pred_order=[]
for im_id,us_id,art_ids,clicked_item in tqdm(zip(impression_ids,user_ids,impression_articles_in_view_ids,clicked_titles)):
if(total_num>=10000):
break
total_num+=1
pred_input,num=gen_pre_data(us_id,art_ids)
total_re=[]
for input in pred_input:
re=model(input.reshape(1,input.shape[0],input.shape[1]).to(device)).tolist()
total_re=total_re+re[0]
total_re=total_re[:num]
total_re=np.array(total_re)
sorted_pairs = sorted(enumerate(total_re), key=lambda x: x[1],reverse=True)
order_list = [index + 1 for index, value in sorted_pairs]
art_ids=art_ids.tolist()
pred_re.append(order_list)
real_re.append(art_ids.index(clicked_item[0]))
pred_order.append(order_list[art_ids.index(clicked_item[0])])
if(art_ids.index(clicked_item[0])==order_list.index(1)):
correct_num+=1
correct_list.append(True)
else:
correct_list.append(False)
print("pred_correct_rate")
print(correct_num/total_num)
10000it [00:35, 285.51it/s]
pred_correct_rate
0.1194
正确率在0.12左右
MRR 指标测试
mrr=(1/np.array(pred_order)).mean()
print('MRR : '+str(mrr))
MRR : 0.31861658386008895
NDCG2 指标测试
def find_indices(main_list, target_list):
return [i for i, element in enumerate(main_list) if element in target_list]
main_list = [1, 2, 3, 4, 5, 2, 7]
target_list = [2, 8, 9]
indices = find_indices(main_list, target_list)
print(indices)
def ndcg_k(k,pred_res,real_res):
in_num=0
r=[i for i in range(1,k+1)]
for pred_re,re in zip(pred_res,real_res):
indices=find_indices(pred_re,r)
if(re in indices):
in_num+=1
return in_num
print('NDCG2'+' : '+str(ndcg_k(3,pred_re,real_re)/len(real_re)))
[1, 5]
NDCG2 : 0.3576