Lesson 72: Spark UDF and UDAF Demystified (Study Notes)
In this lesson:
1 Spark UDF in practice
2 Spark UDAF in practice
UDAF = User Defined Aggregate Function
Let's go straight to writing a UDF and a UDAF:
package SparkSQLByScala
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkContext, SparkConf}
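/**
 * A Spark SQL program written in Scala that demonstrates UDFs and UDAFs with a word-count example
 * @author DT大数据梦工厂
 * Sina Weibo: http://weibo.com/ilovepains/
 * Created by hp on 2016/3/31.
 * A hands-on case study of how to use UDFs and UDAFs in Spark SQL:
 * UDF: User Defined Function. Its input is a single data record; in terms of implementation it is just an ordinary Scala function.
 * UDAF: User Defined Aggregate Function. It operates on a set of records and lets you run custom logic on top of an aggregation.
 *
 * Under the hood, a UDF, for example, is wrapped by Catalyst in Spark SQL into an Expression, which eventually
 * computes over the input Row through its eval method (this Row has nothing to do with the Row in a DataFrame).
 */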
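object SparkSQLUDFUDAF {
  def main(args: Array[String]): Unit = {
    /**
     * Step 1: Create the SparkConf configuration object and set the runtime configuration of the Spark program.
     * For example, setMaster sets the URL of the Master of the Spark cluster that the program connects to. If it
     * is set to local, the Spark program runs locally, which is particularly suitable for beginners whose machines
     * are very limited (for example, only 1 GB of memory).
     */
    val conf = new SparkConf() // Create the SparkConf object
    conf.setAppName("SparkSQLUDFUDAF") // Set the application name, which is shown in the monitoring UI while the program runs
    // conf.setMaster("spark://Master:7077") // With this setting, the program runs on the Spark cluster
    conf.setMaster("local[4]")
    /**
     * Step 2: Create the SparkContext object.
     * SparkContext is the sole entry point to all functionality of a Spark program; whether you use Scala, Java,
     * Python, or R, there must be a SparkContext.
     * Core role of SparkContext: initialize the core components that a Spark application needs at runtime,
     * including DAGScheduler, TaskScheduler, and SchedulerBackend.
     * It is also responsible for registering the Spark program with the Master, among other things.
     * SparkContext is the single most important object in the whole Spark application.
     */
    val sc = new SparkContext(conf) // Create the SparkContext; the SparkConf instance passed in customizes Spark's runtime parameters and configuration
    val sqlContext = new SQLContext(sc) // Build the SQL context
    // Simulate the data used in practice
    val bigData = Array("Spark", "Spark", "Hadoop", "Spark", "Hadoop", "Spark", "Spark", "Hadoop", "Spark", "Hadoop")
    /**
     * Create a DataFrame from the data provided
     */
    val bigDataRDD = sc.parallelize(bigData)
    val bigDataRDDRow = bigDataRDD.map(item => Row(item))
    val structType = StructType(Array(StructField("word", StringType, true)))
    val bigDataDF = sqlContext.createDataFrame(bigDataRDDRow, structType)
    bigDataDF.registerTempTable("bigDataTable") // Register as a temporary table
    /**
     * Register the UDF through SQLContext. With Scala 2.10.x, a UDF can take at most 22 input parameters.
     */
    sqlContext.udf.register("computeLength", (input: String) => input.length)
    // Use the UDF directly in a SQL statement, just like one of SQL's built-in functions
    sqlContext.sql("select word, computeLength(word) as length from bigDataTable").show()
    // Register the UDAF defined below under the name wordCount
    sqlContext.udf.register("wordCount", new MyUDAF)
    sqlContext.sql("select word, wordCount(word) as count, computeLength(word) as length" +
      " from bigDataTable group by word").show()
    // Loop forever so that the application does not exit
    while (true) ()
  }
}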
/**
 * Implement the UDAF by following the template
 */
class MyUDAF extends UserDefinedAggregateFunction {
  /**
   * Specifies the type of the input data
   * @return
   */
  override def inputSchema: StructType = StructType(Array(StructField("input", StringType, true)))
  /**
   * The type of the intermediate result held in the buffer while the aggregation is in progress
   * @return
   */
  override def bufferSchema: StructType = StructType(Array(StructField("count", IntegerType, true)))
  /**
   * Specifies the type of the result that the UDAF returns after the computation
   * @return
   */
  override def dataType: DataType = IntegerType
  // The same input always produces the same result
  override def deterministic: Boolean = true
  /**
   * The initial result for each group before aggregation starts
   * @param buffer
   */
  override def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = 0 }
  /**
   * During aggregation, defines how the group's aggregate is updated each time a new value arrives.
   * This is the local aggregation, equivalent to the Combiner in the Hadoop MapReduce model.
   * @param buffer
   * @param input
   */
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getAs[Int](0) + 1
  }
  /**
   * After the local reduce on each distributed node has finished, a global-level merge is needed
   * @param buffer1
   * @param buffer2
   */
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Int](0) + buffer2.getAs[Int](0)
  }
  /**
   * Returns the final result of the UDAF
   * @param buffer
   * @return
   */
  override def evaluate(buffer: Row): Any = buffer.getAs[Int](0)
}
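To make the order of the initialize, update, merge, and evaluate calls in MyUDAF easier to follow, here is a minimal plain-Scala sketch that imitates that lifecycle without Spark. It is only an illustration under stated assumptions: UdafLifecycleSketch, CountBuffer, aggregatePartition, and wordCount are hypothetical names that do not exist in Spark, and splitting the data into chunks of three is an arbitrary stand-in for however Spark actually partitions the data.

```scala
// Plain-Scala imitation of the UDAF lifecycle; nothing here is Spark API.
object UdafLifecycleSketch {
  // Hypothetical stand-in for MutableAggregationBuffer with a single Int slot.
  final class CountBuffer(var count: Int)

  // Mirrors MyUDAF.initialize: every group starts at zero.
  def initialize(): CountBuffer = new CountBuffer(0)

  // Mirrors MyUDAF.update: each incoming row adds one to the group's buffer.
  def update(buffer: CountBuffer, input: String): Unit = buffer.count += 1

  // Mirrors MyUDAF.merge: fold a partial buffer into another buffer.
  def merge(buffer1: CountBuffer, buffer2: CountBuffer): Unit =
    buffer1.count += buffer2.count

  // Mirrors MyUDAF.evaluate: read the final value out of the buffer.
  def evaluate(buffer: CountBuffer): Int = buffer.count

  // Local (combiner-like) aggregation: one buffer per distinct word in a partition.
  def aggregatePartition(words: Seq[String]): Map[String, CountBuffer] = {
    val buffers = scala.collection.mutable.Map.empty[String, CountBuffer]
    words.foreach { w => update(buffers.getOrElseUpdate(w, initialize()), w) }
    buffers.toMap
  }

  // Global step: merge the partial buffers of all partitions, then evaluate.
  def wordCount(partitions: Seq[Seq[String]]): Map[String, Int] = {
    val merged = scala.collection.mutable.Map.empty[String, CountBuffer]
    for (partial <- partitions.map(aggregatePartition); (word, buf) <- partial)
      merge(merged.getOrElseUpdate(word, initialize()), buf)
    merged.map { case (w, b) => w -> evaluate(b) }.toMap
  }

  def main(args: Array[String]): Unit = {
    val bigData = Seq("Spark", "Spark", "Hadoop", "Spark", "Hadoop",
      "Spark", "Spark", "Hadoop", "Spark", "Hadoop")
    // Assumption: chunks of three play the role of partitions.
    val partitions = bigData.grouped(3).toSeq
    wordCount(partitions).foreach { case (w, c) =>
      println(s"$w count=$c length=${w.length}")
    }
  }
}
```

For the sample data this gives a count of 6 for Spark and 4 for Hadoop regardless of how the data is chunked, because merge simply adds the partial counts together. That independence from the partitioning is the property that lets an update step run locally before a global merge.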
Running SparkSQLUDFUDAF in Eclipse produces the following log output:
16/04/13 23:54:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
16/04/13 23:54:38 INFO Executor: Finished task 61.0 in stage 5.0 (TID 70). 1609 bytes result sent to driver
16/04/13 23:54:38 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 12 ms
16/04/13 23:54:38 INFO Executor: Finished task 59.0 in stage 5.0 (TID 68). 1609 bytes result sent to driver
16/04/13 23:54:38 INFO Executor: Finished task 47.0 in stage 5.0 (TID 56). 1609 bytes result sent to driver
16/04/13 23:54:38 INFO TaskSetManager: Starting task 62.0 in stage 5.0 (TID 71, localhost, partition 63,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:38 INFO TaskSetManager: Starting task 63.0 in stage 5.0 (TID 72, localhost, partition 64,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:38 INFO Executor: Running task 62.0 in stage 5.0 (TID 71)
16/04/13 23:54:38 INFO TaskSetManager: Starting task 64.0 in stage 5.0 (TID 73, localhost, partition 65,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:38 INFO TaskSetManager: Finished task 61.0 in stage 5.0 (TID 70) in 184 ms on localhost (59/199)
16/04/13 23:54:38 INFO TaskSetManager: Finished task 59.0 in stage 5.0 (TID 68) in 250 ms on localhost (60/199)
16/04/13 23:54:38 INFO Executor: Running task 63.0 in stage 5.0 (TID 72)
16/04/13 23:54:39 INFO Executor: Running task 64.0 in stage 5.0 (TID 73)
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 6 ms
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 46 ms
16/04/13 23:54:39 INFO Executor: Finished task 63.0 in stage 5.0 (TID 72). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 45 ms
16/04/13 23:54:39 INFO TaskSetManager: Starting task 65.0 in stage 5.0 (TID 74, localhost, partition 66,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO Executor: Finished task 64.0 in stage 5.0 (TID 73). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO TaskSetManager: Finished task 63.0 in stage 5.0 (TID 72) in 107 ms on localhost (61/199)
16/04/13 23:54:39 INFO Executor: Running task 65.0 in stage 5.0 (TID 74)
16/04/13 23:54:39 INFO TaskSetManager: Starting task 66.0 in stage 5.0 (TID 75, localhost, partition 67,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 64.0 in stage 5.0 (TID 73) in 121 ms on localhost (62/199)
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/04/13 23:54:39 INFO Executor: Running task 66.0 in stage 5.0 (TID 75)
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/04/13 23:54:39 INFO TaskSetManager: Starting task 67.0 in stage 5.0 (TID 76, localhost, partition 68,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO Executor: Finished task 66.0 in stage 5.0 (TID 75). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO Executor: Running task 67.0 in stage 5.0 (TID 76)
16/04/13 23:54:39 INFO TaskSetManager: Starting task 68.0 in stage 5.0 (TID 77, localhost, partition 69,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 66.0 in stage 5.0 (TID 75) in 43 ms on localhost (63/199)
16/04/13 23:54:39 INFO Executor: Running task 68.0 in stage 5.0 (TID 77)
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/04/13 23:54:39 INFO Executor: Finished task 67.0 in stage 5.0 (TID 76). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO TaskSetManager: Starting task 69.0 in stage 5.0 (TID 78, localhost, partition 70,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO Executor: Running task 69.0 in stage 5.0 (TID 78)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 60.0 in stage 5.0 (TID 69) in 426 ms on localhost (64/199)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 47.0 in stage 5.0 (TID 56) in 881 ms on localhost (65/199)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 67.0 in stage 5.0 (TID 76) in 67 ms on localhost (66/199)
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/04/13 23:54:39 INFO Executor: Finished task 69.0 in stage 5.0 (TID 78). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO TaskSetManager: Starting task 70.0 in stage 5.0 (TID 79, localhost, partition 71,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 69.0 in stage 5.0 (TID 78) in 36 ms on localhost (67/199)
16/04/13 23:54:39 INFO Executor: Running task 70.0 in stage 5.0 (TID 79)
16/04/13 23:54:39 INFO Executor: Finished task 62.0 in stage 5.0 (TID 71). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO TaskSetManager: Starting task 71.0 in stage 5.0 (TID 80, localhost, partition 72,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 62.0 in stage 5.0 (TID 71) in 501 ms on localhost (68/199)
16/04/13 23:54:39 INFO Executor: Running task 71.0 in stage 5.0 (TID 80)
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/04/13 23:54:39 INFO Executor: Finished task 65.0 in stage 5.0 (TID 74). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO TaskSetManager: Starting task 72.0 in stage 5.0 (TID 81, localhost, partition 73,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO Executor: Finished task 71.0 in stage 5.0 (TID 80). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO Executor: Running task 72.0 in stage 5.0 (TID 81)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 65.0 in stage 5.0 (TID 74) in 315 ms on localhost (69/199)
16/04/13 23:54:39 INFO TaskSetManager: Starting task 73.0 in stage 5.0 (TID 82, localhost, partition 74,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 71.0 in stage 5.0 (TID 80) in 93 ms on localhost (70/199)
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
16/04/13 23:54:39 INFO Executor: Running task 73.0 in stage 5.0 (TID 82)
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/04/13 23:54:39 INFO Executor: Finished task 73.0 in stage 5.0 (TID 82). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO TaskSetManager: Starting task 74.0 in stage 5.0 (TID 83, localhost, partition 75,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO Executor: Finished task 72.0 in stage 5.0 (TID 81). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO TaskSetManager: Starting task 75.0 in stage 5.0 (TID 84, localhost, partition 76,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 73.0 in stage 5.0 (TID 82) in 69 ms on localhost (71/199)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 72.0 in stage 5.0 (TID 81) in 80 ms on localhost (72/199)
16/04/13 23:54:39 INFO Executor: Running task 74.0 in stage 5.0 (TID 83)
16/04/13 23:54:39 INFO Executor: Running task 75.0 in stage 5.0 (TID 84)
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 26 ms
16/04/13 23:54:39 INFO Executor: Finished task 75.0 in stage 5.0 (TID 84). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 28 ms
16/04/13 23:54:39 INFO TaskSetManager: Starting task 76.0 in stage 5.0 (TID 85, localhost, partition 77,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO Executor: Finished task 68.0 in stage 5.0 (TID 77). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO Executor: Running task 76.0 in stage 5.0 (TID 85)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 75.0 in stage 5.0 (TID 84) in 78 ms on localhost (73/199)
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/04/13 23:54:39 INFO Executor: Finished task 74.0 in stage 5.0 (TID 83). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO Executor: Finished task 70.0 in stage 5.0 (TID 79). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO TaskSetManager: Starting task 77.0 in stage 5.0 (TID 86, localhost, partition 78,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 68.0 in stage 5.0 (TID 77) in 430 ms on localhost (74/199)
16/04/13 23:54:39 INFO Executor: Running task 77.0 in stage 5.0 (TID 86)
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/04/13 23:54:39 INFO Executor: Finished task 76.0 in stage 5.0 (TID 85). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO Executor: Finished task 77.0 in stage 5.0 (TID 86). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO TaskSetManager: Starting task 78.0 in stage 5.0 (TID 87, localhost, partition 79,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO Executor: Running task 78.0 in stage 5.0 (TID 87)
16/04/13 23:54:39 INFO TaskSetManager: Starting task 79.0 in stage 5.0 (TID 88, localhost, partition 80,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO TaskSetManager: Starting task 80.0 in stage 5.0 (TID 89, localhost, partition 81,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/04/13 23:54:39 INFO Executor: Finished task 78.0 in stage 5.0 (TID 87). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO TaskSetManager: Starting task 81.0 in stage 5.0 (TID 90, localhost, partition 82,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO Executor: Running task 81.0 in stage 5.0 (TID 90)
16/04/13 23:54:39 INFO TaskSetManager: Starting task 82.0 in stage 5.0 (TID 91, localhost, partition 83,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 70.0 in stage 5.0 (TID 79) in 413 ms on localhost (75/199)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 74.0 in stage 5.0 (TID 83) in 165 ms on localhost (76/199)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 76.0 in stage 5.0 (TID 85) in 102 ms on localhost (77/199)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 77.0 in stage 5.0 (TID 86) in 75 ms on localhost (78/199)
16/04/13 23:54:39 INFO TaskSetManager: Finished task 78.0 in stage 5.0 (TID 87) in 76 ms on localhost (79/199)
16/04/13 23:54:39 INFO Executor: Running task 79.0 in stage 5.0 (TID 88)
16/04/13 23:54:39 INFO Executor: Running task 80.0 in stage 5.0 (TID 89)
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/04/13 23:54:39 INFO Executor: Finished task 79.0 in stage 5.0 (TID 88). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/04/13 23:54:39 INFO Executor: Finished task 80.0 in stage 5.0 (TID 89). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/04/13 23:54:39 INFO Executor: Running task 82.0 in stage 5.0 (TID 91)
16/04/13 23:54:39 INFO Executor: Finished task 81.0 in stage 5.0 (TID 90). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
16/04/13 23:54:39 INFO TaskSetManager: Starting task 83.0 in stage 5.0 (TID 92, localhost, partition 84,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO TaskSetManager: Starting task 84.0 in stage 5.0 (TID 93, localhost, partition 85,NODE_LOCAL, 1999 bytes)
16/04/13 23:54:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
16/04/13 23:54:39 INFO Executor: Running task 83.0 in stage 5.0 (TID 92)
16/04/13 23:54:39 INFO Executor: Running task 84.0 in stage 5.0 (TID 93)
16/04/13 23:54:39 INFO Executor: Finished task 82.0 in stage 5.0 (TID 91). 1609 bytes result sent to driver
16/04/13 23:54:39 INFO TaskSetManager: Starting task 85.0 in stage 5.0 (TID 94, localhost, partition 86,NODE_LOCAL, 1999 bytes)