Spark Implementation of the IsolationForest Algorithm

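This article presents an anomaly detection approach based on the Isolation Forest algorithm and shows how to run it on Apache Spark with the IForest estimator from the third-party spark-iforest library, which plugs into the Spark ML Pipeline API (it is not part of Spark MLlib itself). A model is trained on the Wisconsin Breast Cancer dataset and evaluated on a held-out test split, where it reaches an AUC of 0.931.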
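
For readers unfamiliar with how an isolation forest turns tree depths into a score, the sketch below follows the standard formulation from the original Isolation Forest paper: s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean path length of a point across the trees and c(n) normalizes by the expected path length for a subsample of size n. It is an illustration only and is not taken from the spark-iforest source; the object and method names (`IsolationScoreSketch`, `avgPathLength`, `anomalyScore`) are hypothetical.

```scala
object IsolationScoreSketch {
  // Euler-Mascheroni constant, used to approximate the harmonic number H(i) ~ ln(i) + gamma.
  private val EulerGamma = 0.5772156649015329

  /** Normalizing term c(n): expected path length of an unsuccessful
    * binary-search-tree lookup over n points. */
  def avgPathLength(n: Int): Double =
    if (n <= 1) 0.0
    else if (n == 2) 1.0
    else 2.0 * (math.log(n - 1.0) + EulerGamma) - 2.0 * (n - 1.0) / n

  /** Anomaly score s = 2^(-E[h(x)] / c(n)).
    * Short mean path lengths push the score toward 1 (likely anomaly);
    * a mean path length equal to c(n) gives exactly 0.5. */
  def anomalyScore(meanPathLength: Double, subsampleSize: Int): Double =
    math.pow(2.0, -meanPathLength / avgPathLength(subsampleSize))
}
```

With a subsample size of 256 (the value passed to `setMaxSamples` in the script), `IsolationScoreSketch.avgPathLength(256)` is roughly 10.24, so a point isolated after about 4 splits on average scores well above 0.5, while one needing about 10 splits scores close to 0.5.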


/*
Notice:
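The spark-iforest source must first be built into a jar with mvn and made available to Spark;
only then can import org.apache.spark.ml.iforest.IForest be used.
Scala source code: https://github.com/titicaca/spark-iforest

Official documentation for the Python library sklearn.ensemble.IsolationForest: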
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
*/

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.iforest.IForest
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator


// Wisconsin Breast Cancer Dataset
val dataset = (spark.read.option("inferSchema", "true")
              .csv("/anomaly-detection/breastw.csv"))



// Index label values: 2 -> 0, 4 -> 1
val indexer = (new StringIndexer()
               .setInputCol("_c10")
               .setOutputCol("label"))

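// Caution: `dataset` has no "label" column at this point, so the filter below
// keeps every column, including the raw class column _c10. List the feature
// columns explicitly if the class value should not be used as a feature.
val assembler = (new VectorAssembler()
                    .setInputCols(dataset.columns.filter(!_.contains("label")))
                    .setOutputCol("features"))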

val iForest = (new IForest()
                   .setNumTrees(100)
                   .setMaxSamples(256)
                   .setContamination(0.35)
                   .setBootstrap(false)
                   .setMaxDepth(100)
                   .setSeed(123456L))

val pipeline = new Pipeline().setStages(Array(indexer, assembler, iForest))


// Split the dataset into a training (80%) and a test (20%) DataFrame
val Array(trainDF, testDF) = dataset.randomSplit(Array(0.8, 0.2), seed = 123456L)

val model = pipeline.fit(trainDF)
val predictions = model.transform(testDF)


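// Evaluate the model on the test set using area under the ROC curve (AUC).
// AUC is a ranking metric rather than accuracy; here it is computed from the
// hard 0/1 "prediction" column.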
val evaluator = (new BinaryClassificationEvaluator()
   .setLabelCol("label")
   .setRawPredictionCol("prediction")
   .setMetricName("areaUnderROC"))

val auc = evaluator.evaluate(predictions)
println(s"The model's auc: $auc")

/*

scala> val auc = evaluator.evaluate(predictions)
auc: Double = 0.9311653116531164

scala> println(s"The model's auc: $auc")
The model's auc: 0.9311653116531164

*/
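
Because the evaluator above is given the hard 0/1 `prediction` column, the ROC curve has a single interior point, fixed by the contamination setting. The sketch below is a plain-Scala, rank-based AUC (the probability that a random anomaly outscores a random normal point, with ties counted as one half) that makes this visible on toy data. It is not Spark's implementation, and `AucSketch`, `auc`, and the sample values are hypothetical.

```scala
object AucSketch {
  /** Rank-based AUC: fraction of (positive, negative) pairs in which the
    * positive has the higher score; tied pairs contribute 0.5. */
  def auc(scores: Seq[Double], labels: Seq[Int]): Double = {
    require(scores.length == labels.length, "scores and labels must align")
    val paired = scores.zip(labels)
    val pos = paired.collect { case (s, 1) => s }
    val neg = paired.collect { case (s, 0) => s }
    require(pos.nonEmpty && neg.nonEmpty, "both classes must be present")
    val wins = for (p <- pos; n <- neg) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    wins.sum / (pos.length.toLong * neg.length)
  }
}

// Toy data (made up for illustration): 1 = anomaly, 0 = normal.
val toyLabels     = Seq(1, 1, 0, 0, 0)
val hardPredicted = Seq(1.0, 0.0, 0.0, 0.0, 1.0) // thresholded 0/1 output
val continuous    = Seq(0.9, 0.6, 0.4, 0.3, 0.7) // hypothetical anomaly scores

val aucHard  = AucSketch.auc(hardPredicted, toyLabels)
val aucScore = AucSketch.auc(continuous, toyLabels)
```

With hard predictions the value reduces to (TPR + TNR) / 2, that is, balanced accuracy at one threshold. If the fitted model also exposes a continuous anomaly score column (the name depends on the library version, so treat it as an assumption), passing that column to the evaluator would measure ranking quality across all thresholds instead.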

Reposted from: https://my.oschina.net/kyo4321/blog/2994487
