PySpark Classification: NaiveBayes

Most recent recommended article published on 2024-06-05 17:24:37

Gadaite

Category: ML Basics. Tags: classification, machine learning, spark

Article link: https://blog.youkuaiyun.com/weixin_46408961/article/details/123415621


NaiveBayes: Naive Bayes classification

class pyspark.ml.classification.NaiveBayes(featuresCol='features', labelCol='label', predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', smoothing=1.0, modelType='multinomial', thresholds=None, weightCol=None)

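Naive Bayes classifier. It supports both multinomial and Bernoulli NB. Multinomial NB can handle finitely supported discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification. By making every vector binary (0/1) data, it can also be used as Bernoulli NB. The input feature values must be non-negative.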

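featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name.')

modelType = Param(parent='undefined', name='modelType', doc='The model type, a string (case-sensitive). Supported options: multinomial (default) and bernoulli.')

predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name.')

probabilityCol = Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.')

rawPredictionCol = Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.')

smoothing = Param(parent='undefined', name='smoothing', doc='The smoothing parameter, should be >= 0, default is 1.0')

thresholds = Param(parent='undefined', name='thresholds', doc="Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.")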
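To make the role of `thresholds` concrete, the following is a minimal pure-Python sketch of the rule given in the Spark documentation: the predicted class is the one with the largest ratio p/t. The helper `predict_with_thresholds` is a hypothetical name used for illustration and is not part of PySpark; treating a zero threshold as an infinite ratio is a simplifying assumption of this sketch.

```python
def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest p / t ratio.

    Hypothetical helper for illustration only; not a PySpark function.
    """
    if len(probabilities) != len(thresholds):
        raise ValueError("thresholds must have one entry per class")
    if any(t < 0 for t in thresholds) or sum(t == 0 for t in thresholds) > 1:
        raise ValueError("thresholds must be > 0, with at most one equal to 0")

    def ratio(i):
        p, t = probabilities[i], thresholds[i]
        # Assumption of this sketch: a zero threshold always wins.
        return float("inf") if t == 0 else p / t

    return max(range(len(probabilities)), key=ratio)


probs = [0.4444, 0.5556]

# Equal thresholds leave the decision with the most probable class.
default_choice = predict_with_thresholds(probs, [0.5, 0.5])

# A lower threshold for class 0 makes it easier to predict:
# 0.4444 / 0.3 is larger than 0.5556 / 0.7.
shifted_choice = predict_with_thresholds(probs, [0.3, 0.7])

print(default_choice, shifted_choice)
```

In PySpark itself the same effect would be requested by passing a list such as `thresholds=[0.3, 0.7]` to the `NaiveBayes` constructor.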

model.pi: the logarithm of the class prior probabilities

model.theta: the logarithm of the class-conditional probabilities

01. Create the dataset and inspect its structure

from pyspark.sql import Row, SparkSession
from pyspark.ml.linalg import Vectors

# Start a local session; the spark.driver.host value is specific to the author's machine.
spark = SparkSession.builder.config("spark.driver.host", "192.168.1.10") \
    .config("spark.ui.showConsoleProgress", "false").appName("NaiveBayes") \
    .master("local[*]").getOrCreate()

# Three weighted samples, each with a label and a two-dimensional feature vector.
df = spark.createDataFrame([
    Row(label=0.0, weight=0.1, features=Vectors.dense([0.0, 0.0])),
    Row(label=0.0, weight=0.5, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, weight=1.0, features=Vectors.dense([1.0, 0.0]))])
df.show()
df.printSchema()

Output:

+---------+-----+------+
| features|label|weight|
+---------+-----+------+
|[0.0,0.0]|  0.0|   0.1|
|[0.0,1.0]|  0.0|   0.5|
|[1.0,0.0]|  1.0|   1.0|
+---------+-----+------+

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)
 |-- weight: double (nullable = true)

02. Fit the Naive Bayes classifier, transform the data, and inspect the results

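from pyspark.ml.classification import NaiveBayes

# Multinomial NB with additive smoothing of 1.0; rows are weighted by the "weight" column.
nb = NaiveBayes(smoothing=1.0, modelType="multinomial", weightCol="weight")
model = nb.fit(df)
# transform() appends the rawPrediction, probability, and prediction columns.
predictions = model.transform(df)
predictions.show()
print(predictions.head(3))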

Output:

+---------+-----+------+--------------------+--------------------+----------+
| features|label|weight|       rawPrediction|         probability|prediction|
+---------+-----+------+--------------------+--------------------+----------+
|[0.0,0.0]|  0.0|   0.1|[-0.8109302162163...|[0.44444444444444...|       1.0|
|[0.0,1.0]|  0.0|   0.5|[-1.3217558399823...|[0.59016393442622...|       0.0|
|[1.0,0.0]|  1.0|   1.0|[-1.7272209480904...|[0.32432432432432...|       1.0|
+---------+-----+------+--------------------+--------------------+----------+

[Row(features=DenseVector([0.0, 0.0]), label=0.0, weight=0.1, rawPrediction=DenseVector([-0.8109, -0.5878]), probability=DenseVector([0.4444, 0.5556]), prediction=1.0),
 Row(features=DenseVector([0.0, 1.0]), label=0.0, weight=0.5, rawPrediction=DenseVector([-1.3218, -1.6864]), probability=DenseVector([0.5902, 0.4098]), prediction=0.0),
 Row(features=DenseVector([1.0, 0.0]), label=1.0, weight=1.0, rawPrediction=DenseVector([-1.7272, -0.9933]), probability=DenseVector([0.3243, 0.6757]), prediction=1.0)]
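The rawPrediction and probability columns above can be reproduced by hand, which helps show what the multinomial model computes. The sketch below is plain Python rather than PySpark: it hardcodes the `model.pi` and `model.theta` values printed in section 03, and the helper names `raw_prediction` and `probability` are hypothetical. It assumes that the raw score of class c is pi[c] plus the dot product of theta[c] with the feature vector, and that the probabilities are the softmax of the raw scores.

```python
import math

# Values copied from the model.pi and model.theta output in section 03.
pi = [-0.8109302162163285, -0.587786664902119]
theta = [[-0.91629073, -0.51082562],
         [-0.40546511, -1.09861229]]


def raw_prediction(x):
    """Log prior plus the feature-weighted log conditional probabilities."""
    return [pi[c] + sum(theta[c][j] * x[j] for j in range(len(x)))
            for c in range(len(pi))]


def probability(raw):
    """Numerically stable softmax over the raw scores."""
    m = max(raw)
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0]):
    raw = raw_prediction(x)
    prob = probability(raw)
    print(x, [round(r, 4) for r in raw], [round(p, 4) for p in prob])
```

For the all-zero feature vector the raw scores reduce to the log priors, which is why the first row's rawPrediction equals `model.pi` and why that sample is assigned to class 1 even though its label is 0.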

03. Inspect the logarithms of the prior and conditional probabilities

print(model.pi)
print(model.theta)

Output:

[-0.8109302162163285,-0.587786664902119]
DenseMatrix([[-0.91629073, -0.51082562],
             [-0.40546511, -1.09861229]])
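These values can be checked against the three training rows. The following is a minimal plain-Python sketch, assuming the usual multinomial estimates with additive smoothing: each class prior comes from the summed sample weights, and each conditional probability comes from the weighted feature sums within a class. The variable names are illustrative and do not come from PySpark.

```python
import math

# (label, weight, features) for the three rows created in section 01.
rows = [
    (0.0, 0.1, [0.0, 0.0]),
    (0.0, 0.5, [0.0, 1.0]),
    (1.0, 1.0, [1.0, 0.0]),
]
smoothing = 1.0
num_features = 2
labels = sorted({label for label, _, _ in rows})

# Accumulate the total weight and the weighted feature sums for each class.
class_weight = {c: 0.0 for c in labels}
feature_sum = {c: [0.0] * num_features for c in labels}
for label, weight, features in rows:
    class_weight[label] += weight
    for j, value in enumerate(features):
        feature_sum[label][j] += weight * value

total_weight = sum(class_weight.values())

# Smoothed log priors, one per class.
pi = [math.log((class_weight[c] + smoothing)
               / (total_weight + len(labels) * smoothing))
      for c in labels]

# Smoothed log conditional probabilities, one row per class.
theta = [[math.log((feature_sum[c][j] + smoothing)
                   / (sum(feature_sum[c]) + num_features * smoothing))
          for j in range(num_features)]
         for c in labels]

print(pi)
print(theta)
```

Under these assumptions the prior of class 0 is (0.6 + 1) / (1.6 + 2), whose logarithm is about -0.8109, in agreement with the first entry of `model.pi`.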