PySpark Classification: NaiveBayes

NaiveBayes: Naive Bayes classification

class pyspark.ml.classification.NaiveBayes(featuresCol='features', labelCol='label', predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', smoothing=1.0, modelType='multinomial', thresholds=None, weightCol=None)

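Naive Bayes classifier. It supports both multinomial and Bernoulli NB. Multinomial NB can handle finitely supported discrete data; for example, by converting documents into TF-IDF vectors, it can be used for document classification. If every vector is made binary (0/1) data, it can also be used as Bernoulli NB. Input feature values must be non-negative.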
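As a rough illustration of the "binary (0/1) data" requirement for Bernoulli NB, the sketch below turns a term-count vector into presence indicators in plain Python. The helper name `binarize` and the sample vector are hypothetical and not part of PySpark; the non-negativity check simply mirrors the constraint stated above.

```python
def binarize(counts):
    """Map non-negative term counts to 0/1 presence indicators."""
    if any(v < 0 for v in counts):
        # Naive Bayes expects non-negative feature values.
        raise ValueError("feature values must be non-negative")
    return [1.0 if v > 0 else 0.0 for v in counts]


# Hypothetical term counts for one document.
term_counts = [3.0, 0.0, 1.0, 0.0]
binary_features = binarize(term_counts)
print(binary_features)
```

Vectors prepared this way would be the kind of input to pair with modelType='bernoulli'.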

featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name.')

modelType = Param(parent='undefined', name='modelType', doc='The model type, a string (case-sensitive). Supported options: multinomial (default) and bernoulli.')

predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name.')

probabilityCol = Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.')

rawPredictionCol = Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.')

smoothing = Param(parent='undefined', name='smoothing', doc='The smoothing parameter, should be >= 0, default is 1.0')

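thresholds = Param(parent='undefined', name='thresholds', doc="Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.")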
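The effect of thresholds can be sketched in plain Python: each class probability p is divided by that class's threshold t, and the class with the largest p/t wins. The function name `predict_with_thresholds` is hypothetical, and the handling of a zero threshold (infinite score unless p is also 0) is an assumption about the intended behavior rather than a copy of Spark's code.

```python
def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest p / t."""
    best_index, best_score = 0, float("-inf")
    for i, (p, t) in enumerate(zip(probabilities, thresholds)):
        if t == 0:
            # Assumption: a zero threshold always wins unless p is also 0.
            if p == 0:
                continue
            score = float("inf")
        else:
            score = p / t
        if score > best_score:
            best_index, best_score = i, score
    return best_index


probs = [0.4444, 0.5556]
# Equal thresholds leave the plain argmax unchanged.
print(predict_with_thresholds(probs, [0.5, 0.5]))
# A lower threshold for class 0 makes it easier to predict.
print(predict_with_thresholds(probs, [0.3, 0.7]))
```

With equal thresholds the second class is chosen; lowering the threshold for class 0 to 0.3 flips the prediction, because 0.4444/0.3 exceeds 0.5556/0.7.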

model.pi: log of the class prior probabilities

model.theta: log of the class-conditional probabilities

01. Create the dataset and inspect its structure

# Start a local Spark session; the spark.driver.host value is specific to the author's machine.
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.host","192.168.1.10")\
    .config("spark.ui.showConsoleProgress","false").appName("NaiveBayes")\
    .master("local[*]").getOrCreate()
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
# Three weighted training rows, each with two non-negative features.
df = spark.createDataFrame([
    Row(label=0.0, weight=0.1, features=Vectors.dense([0.0, 0.0])),
    Row(label=0.0, weight=0.5, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, weight=1.0, features=Vectors.dense([1.0, 0.0]))])
df.show()
df.printSchema()

Output:

+---------+-----+------+
| features|label|weight|
+---------+-----+------+
|[0.0,0.0]|  0.0|   0.1|
|[0.0,1.0]|  0.0|   0.5|
|[1.0,0.0]|  1.0|   1.0|
+---------+-----+------+

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)
 |-- weight: double (nullable = true)

02. Fit the Naive Bayes classifier, transform the data, and view the results

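from pyspark.ml.classification import NaiveBayes
# Multinomial NB with smoothing 1.0; weightCol makes each row count by its weight.
nb = NaiveBayes(smoothing=1.0, modelType="multinomial", weightCol="weight")
model = nb.fit(df)
# transform() appends the rawPrediction, probability, and prediction columns.
model.transform(df).show()
print(model.transform(df).head(3))

Output: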

+---------+-----+------+--------------------+--------------------+----------+
| features|label|weight|       rawPrediction|         probability|prediction|
+---------+-----+------+--------------------+--------------------+----------+
|[0.0,0.0]|  0.0|   0.1|[-0.8109302162163...|[0.44444444444444...|       1.0|
|[0.0,1.0]|  0.0|   0.5|[-1.3217558399823...|[0.59016393442622...|       0.0|
|[1.0,0.0]|  1.0|   1.0|[-1.7272209480904...|[0.32432432432432...|       1.0|
+---------+-----+------+--------------------+--------------------+----------+

[Row(features=DenseVector([0.0, 0.0]), label=0.0, weight=0.1, rawPrediction=DenseVector([-0.8109, -0.5878]), probability=DenseVector([0.4444, 0.5556]), prediction=1.0),
 Row(features=DenseVector([0.0, 1.0]), label=0.0, weight=0.5, rawPrediction=DenseVector([-1.3218, -1.6864]), probability=DenseVector([0.5902, 0.4098]), prediction=0.0),
 Row(features=DenseVector([1.0, 0.0]), label=1.0, weight=1.0, rawPrediction=DenseVector([-1.7272, -0.9933]), probability=DenseVector([0.3243, 0.6757]), prediction=1.0)]
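The rawPrediction and probability columns can be reproduced by hand, which helps when reading the table above. The sketch below assumes the usual multinomial rule: the raw score of a class is its log prior plus the dot product of its log conditional probabilities with the feature vector, and the probabilities are the normalized exponentials of those scores. The pi and theta values are copied from what model.pi and model.theta print for this model; the function names are hypothetical.

```python
import math

# Values printed by model.pi and model.theta for this model.
pi = [-0.8109302162163285, -0.587786664902119]
theta = [[-0.91629073, -0.51082562],
         [-0.40546511, -1.09861229]]


def raw_prediction(x):
    """Log prior plus theta . x for each class."""
    return [pi[c] + sum(t * v for t, v in zip(theta[c], x))
            for c in range(len(pi))]


def probability(raw):
    """Normalize exp(raw) so the class scores sum to 1."""
    m = max(raw)  # subtract the max for numerical stability
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0]):
    raw = raw_prediction(x)
    print(x, [round(r, 4) for r in raw],
          [round(p, 4) for p in probability(raw)])
```

Under these assumptions the printed values agree with the rawPrediction and probability columns shown above to four decimal places.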

03. View the logs of the prior and conditional probabilities

print(model.pi)     # log of the class priors
print(model.theta)  # log of the class-conditional probabilities

Output:

[-0.8109302162163285,-0.587786664902119]
DenseMatrix([[-0.91629073, -0.51082562],
             [-0.40546511, -1.09861229]])
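To see where these numbers come from, the following plain-Python sketch recomputes them from the three training rows. It assumes weighted, additively smoothed estimates: each row contributes its weight to its class total and weight times feature value to the per-class feature sums, and smoothing is added to every count. This is a hand calculation that happens to match the output above, not a copy of Spark's implementation; all variable names are hypothetical.

```python
import math

# (label, weight, features) copied from the dataset in step 01.
rows = [(0, 0.1, [0.0, 0.0]),
        (0, 0.5, [0.0, 1.0]),
        (1, 1.0, [1.0, 0.0])]
smoothing = 1.0
num_classes = 2
num_features = 2

class_weight = [0.0] * num_classes
feature_sum = [[0.0] * num_features for _ in range(num_classes)]
for label, weight, features in rows:
    class_weight[label] += weight
    for j, value in enumerate(features):
        feature_sum[label][j] += weight * value

total_weight = sum(class_weight)

# Smoothed log prior for each class.
pi = [math.log((class_weight[c] + smoothing)
               / (total_weight + num_classes * smoothing))
      for c in range(num_classes)]

# Smoothed log conditional probability of each feature given the class.
theta = [[math.log((feature_sum[c][j] + smoothing)
                   / (sum(feature_sum[c]) + num_features * smoothing))
          for j in range(num_features)]
         for c in range(num_classes)]

print(pi)
print(theta)
```

For class 0, for example, the weights sum to 0.6, so the prior is log((0.6 + 1) / (1.6 + 2)) = log(0.4444), which is the first entry of model.pi.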