4.2 基于LR的点击率预测模型训练
-
本小节主要根据广告点击样本数据集(raw_sample)、广告基本特征数据集(ad_feature)、用户基本信息数据集(user_profile)构建出了一个完整的样本数据集,并按日期划分为了训练集(前七天)和测试集(最后一天),利用逻辑回归进行训练。
训练模型时,通过对类别特征数据进行处理,一定程度达到提高了模型的效果
'''从HDFS中加载样本数据信息'''
_raw_sample_df1 = spark.read.csv("hdfs://localhost:8020/csv/raw_sample.csv", header=True)
# _raw_sample_df1.show() # 展示数据,默认前20条
# 更改表结构,转换为对应的数据类型
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, LongType, StringType
# 更改df表结构:更改列类型和列名称
_raw_sample_df2 = _raw_sample_df1.\
withColumn("user", _raw_sample_df1.user.cast(IntegerType())).withColumnRenamed("user", "userId").\
withColumn("time_stamp", _raw_sample_df1.time_stamp.cast(LongType())).withColumnRenamed("time_stamp", "timestamp").\
withColumn("adgroup_id", _raw_sample_df1.adgroup_id.cast(IntegerType())).withColumnRenamed("adgroup_id", "adgroupId").\
withColumn("pid", _raw_sample_df1.pid.cast(StringType())).\
withColumn("nonclk", _raw_sample_df1.nonclk.cast(IntegerType())).\
withColumn("clk", _raw_sample_df1.clk.cast(IntegerType()))
_raw_sample_df2.printSchema()
_raw_sample_df2.show()
# 样本数据pid特征处理
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
stringindexer = StringIndexer(inputCol='pid', outputCol='pid_feature')
encoder = OneHotEncoder(dropLast=False, inputCol='pid_feature', outputCol='pid_value')
pipeline = Pipeline(stages=[stringindexer, encoder])
pipeline_fit = pipeline.fit(_raw_sample_df2)
raw_sample_df = pipeline_fit.transform(_raw_sample_df2)
raw_sample_df.show()
'''pid和特征的对应关系
430548_1007:0
430549_1007:1
'''
显示结果:
root
|-- userId: integer (nullable = true)
|-- timestamp: long (nullable = true)
|-- adgroupId: integer (nullable = true)
|-- pid: string (nullable = true)
|-- nonclk: integer (nullable = true)
|-- clk: integer (nullable = true)
+------+----------+---------+-----------+------+---+
|userId| timestamp|adgroupId| pid|nonclk|clk|
+------+----------+---------+-----------+------+---+
|581738|1494137644| 1|430548_1007| 1| 0|
|449818|1494638778| 3|430548_1007| 1| 0|
|914836|1494650879| 4|430548_1007| 1| 0|
|914836|1494651029| 5|430548_1007| 1| 0|
|399907|1494302958| 8|430548_1007| 1| 0|
|628137|1494524935| 9|430548_1007| 1| 0|
|298139|1494462593| 9|430539_1007| 1| 0|
|775475|1494561036| 9|430548_1007| 1| 0|
|555266|1494307136| 11|430539_1007| 1| 0|
|117840|1494036743| 11|430548_1007| 1| 0|
|739815|1494115387| 11|430539_1007| 1| 0|
|623911|1494625301| 11|430548_1007| 1| 0|
|623911|1494451608| 11|430548_1007| 1| 0|
|421590|1494034144| 11|430548_1007| 1| 0|
|976358|1494156949| 13|430548_1007| 1| 0|
|286630|1494218579| 13|430539_1007| 1| 0|
|286630|1494289247| 13|430539_1007| 1| 0|
|771431|1494153867| 13|430548_1007| 1| 0|
|707120|1494220810| 13|430548_1007| 1| 0|
|530454|1494293746| 13|430548_1007| 1| 0|
+------+----------+---------+-----------+------+---+
only showing top 20 rows
+------+----------+---------+-----------+------+---+-----------+-------------+
|userId| timestamp|adgroupId| pid|nonclk|clk|pid_feature| pid_value|
+------+----------+---------+-----------+------+---+-----------+-------------+
|581738|1494137644| 1|430548_1007| 1| 0| 0.0|(2,[0],[1.0])|
|449818|1494638778| 3|430548_1007| 1| 0| 0.0|(2,[0],[1.0])|
|914836|1494650879| 4|430548_1007| 1| 0| 0.0|(2,[0],[1.0])|
|914836|1494651029| 5|430548_1007| 1| 0| 0.0|(2,[0],[1.0])|
|399907|1494302958| 8|430548_1007| 1| 0| 0.0|(2,[0],[1.0])|
|628137|1494524935| 9|430548_1007| 1| 0| 0.0|(2,[0],[1.0])|
|298139|1494462593| 9|430539_1007| 1| 0| 1.0|(2,[1],[1.0])|
|775475|1494561036| 9|430548_1007| 1| 0| 0.0|(2,[0],[1.0])|
|555266|1494307136| 11|430539_1007| 1| 0| 1.0|(2,[1],[1.0])|
|117840|1494036743| 11|430548_1007| 1| 0| 0.0|(2,[0],[1.0])|
|739815|1494115387| 11|430539_1007| 1| 0| 1.0|(2,[1],[1.0])|
|623911|1494625301| 11|430548_1007| 1| 0| 0.0|(2,[0],[1.0])|
|623911|1494451608| 11|430548_1007| 1| 0| 0.0|(2,[0],[1.0])|
|421590|1494034144| 11|430548_1007| 1| 0| 0.0|(2,[0],[1.0])|
|976358|1494156949| 13|430548_1007| 1| 0| 0.0|(2,[0],[1.0])|
|286630|1494218579| 13|430539_1007| 1| 0| 1.0|(2,[1],[1.0])|
|286630|1494289247| 13|430539_1007| 1| 0| 1.0|(2,[1],[1.0])|
|771431|1494153867| 13|430548_1007| 1| 0| 0.0|(2,[0],[1.0])|
|707120|1494220810| 13|430548_1007| 1| 0| 0.0|(2,[0],[1.0])|
|530454|1494293746| 13|430548_1007| 1| 0| 0.0|(2,