Association rule mining with Spark:
1 First convert all of the data into categorical variables (e.g. by bucketing numeric fields; see the sketch after this list)
2 Use FPGrowth from Spark MLlib to mine the association rules
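A minimal sketch of step 1, not taken from the original code: bucketing a hypothetical numeric column (here called loginCount; the name and cut points are made up) into categorical items. Prefixing each item with the column name keeps items from different columns distinct, which matters because FPGrowth treats items as opaque strings and rejects transactions that contain duplicate items.

// Sketch: turn a numeric value into a categorical item string.
// "loginCount" and the cut points are hypothetical.
def loginCountBucket(n: Int): String = n match {
  case x if x < 10  => "loginCount=low"
  case x if x < 100 => "loginCount=mid"
  case _            => "loginCount=high"
}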
Issues encountered:
1 The data has to be converted into sparse format, which some call basket format (see the sketch after this list)
2 Depending on the Spark version, the following error is thrown:
java.lang.IllegalArgumentException: Can not set
final scala.collection.mutable.ListBuffer field org.apache.spark.mllib.fpm.FPTree$Summary.nodes to scala.collection.mutable.ArrayBuffer
Serialization trace:
nodes (org.apache.spark.mllib.fpm.FPTree$Summary)
Adding conf.set("spark.serializer", "org.apache.spark.serializer.JavaSerializer") fixes it: the error comes from Kryo serialization of the FPTree$Summary class, and switching to the Java serializer avoids it.
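For reference, "basket" input is simply one array of distinct item strings per transaction, i.e. an RDD[Array[String]]. A minimal sketch with invented items (assumes an existing SparkContext sc):

import org.apache.spark.rdd.RDD

// Basket format: one Array[String] of distinct items per transaction.
val sample: RDD[Array[String]] = sc.parallelize(Seq(
  Array("os=android", "region=cn", "loginCount=high"),
  Array("os=ios",     "region=us", "loginCount=low")
))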
The full code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

// Java serialization works around the Kryo FPTree$Summary error above
val conf = new SparkConf()
conf.setAppName("ares_login_FPGrowth")
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// dataPath points at the parquet input and is defined elsewhere
val data = sqlContext.read.parquet(dataPath)
println("total num: " + data.count())

// Basket format: one Array[String] of items per row, built from
// three categorical string columns
val transactions: RDD[Array[String]] = data.rdd.map(row => {
  row.getString(1) + " " + row.getString(2) + " " + row.getString(3)
}).map(_.trim.split("\\s+"))

val minSupp = 0.6
val minConf = 0.8
val partitions = 10

val fpGrowth = new FPGrowth().setMinSupport(minSupp).setNumPartitions(partitions)
val model = fpGrowth.run(transactions)

println("frequent itemsets: ")
model.freqItemsets.collect().foreach(itemset => {
  println(itemset.items.mkString("[", ",", "]") + " {" + itemset.freq + "}")
})

println("rules: ")
// Collect once and reuse, instead of regenerating the rules twice
val rules = model.generateAssociationRules(minConf).collect()
rules.foreach(rule => {
  println(rule.antecedent.mkString("[", ",", "]") + "=>" +
    rule.consequent.mkString("[", ",", "]") + " {" + rule.confidence + "}")
})
println("num of rules: " + rules.length)

sc.stop()
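Two practical notes, not part of the original run. minSupport = 0.6 means an itemset must appear in at least 60% of all transactions, which is quite strict for real data, so lower it if nothing comes back. And when many rules are returned, sorting the collected array locally by confidence makes inspection easier; a sketch:

// Sketch: keep only the 20 strongest rules, sorted locally by
// confidence after collect().
val topRules = model.generateAssociationRules(minConf).collect()
  .sortBy(-_.confidence)
  .take(20)
topRules.foreach(rule =>
  println(rule.antecedent.mkString("[", ",", "]") + "=>" +
    rule.consequent.mkString("[", ",", "]") + " {" + rule.confidence + "}"))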