Spark SQL: UDF and UDAF
1. UDF and UDAF
UDF: UserDefinedFunction
UDAF: UserDefinedAggregateFunction
Both UDFs and UDAFs are user-defined functions for Spark SQL.
2. UDF
sparkSession.udf.register("addName",(name:String)=>{
"addName:"+name
})
sparkSession.sql("select addName(name) as name from user").show()
For a typical UDF you only need to register a name and a function, then pass the column into it in SQL. A UDF is one-in, one-out: it takes a single value per row and returns a single value.
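For the sql call above to resolve the user view, some setup must run first. A minimal sketch; the local master, app name, and JSON path are assumptions for illustration:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("SparkSQL-UDF")
  .getOrCreate()

// Assumed input: a JSON file with name and age fields, one object per line
sparkSession.read.json("data/user.json").createOrReplaceTempView("user")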
3. UDAF
UDAFs come in two flavors: weakly typed and strongly typed.
Weakly typed
The weakly typed version extends UserDefinedAggregateFunction. Because SQL itself is weakly typed, this version was traditionally the one used from SQL.
sparkSession.udf.register("ageAvg", new MyUdaf)
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, LongType, StructField, StructType}

// Weakly typed UDAF; deprecated since Spark 3.0 in favor of Aggregator
class MyUdaf extends UserDefinedAggregateFunction {
  // Input schema: one Long column named age
  override def inputSchema: StructType = {
    StructType(Array(StructField("age", LongType)))
  }
  // Buffer schema: running total and count
  override def bufferSchema: StructType = {
    StructType(Array(StructField("total", LongType), StructField("count", LongType)))
  }
  // Result type
  override def dataType: DataType = LongType
  // Deterministic: the same input always produces the same output
  override def deterministic: Boolean = true
  // Initialize the buffer values
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer.update(0, 0L)
    buffer.update(1, 0L)
  }
  // Update the buffer with one input row
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getLong(0) + input.getLong(0)
    buffer(1) = buffer.getLong(1) + 1L
  }
  // Merge two buffers (e.g. from different partitions); the result is written into buffer1
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1.update(0, buffer1.getLong(0) + buffer2.getLong(0))
    buffer1.update(1, buffer1.getLong(1) + buffer2.getLong(1))
  }
  // Compute the final result from the buffer
  override def evaluate(buffer: Row): Any = buffer.getLong(0) / buffer.getLong(1)
}
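With the registration shown at the top of this section, the weakly typed UDAF can then be invoked in SQL like a built-in aggregate; a minimal usage sketch, assuming the user view has a Long age column:

sparkSession.sql("select ageAvg(age) as avgAge from user").show()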
However, since Spark 3.0 the weakly typed API is deprecated; the strongly typed Aggregator is recommended instead, as the Spark source itself states:
@Stable
@deprecated("Aggregator[IN, BUF, OUT] should now be registered as a UDF" +
" via the functions.udaf(agg) method.", "3.0.0")
abstract class UserDefinedAggregateFunction extends Serializable {
Strongly typed
From Spark 3.0 onward, the strongly typed version extends Aggregator:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Buff(var total: Long, var count: Long)

// Spark 3.0 strongly typed UDAF
class MyUdafStrong extends Aggregator[Long, Buff, Long] {
  // Initial (zero) value of the buffer
  override def zero: Buff = Buff(0L, 0L)
  // Fold one input value into the buffer
  override def reduce(b: Buff, a: Long): Buff = {
    b.total = b.total + a
    b.count = b.count + 1L
    b
  }
  // Merge two buffers
  override def merge(b1: Buff, b2: Buff): Buff = {
    b1.total = b1.total + b2.total
    b1.count = b1.count + b2.count
    b1
  }
  // Compute the final result from the buffer
  override def finish(reduction: Buff): Long = reduction.total / reduction.count
  // Encoder for the buffer; Encoders.product handles custom case classes
  override def bufferEncoder: Encoder[Buff] = Encoders.product
  // Encoder for the output; Encoders.scalaLong for Scala's built-in Long
  override def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
To call it from SQL it must be registered as a UDF; since the aggregator is strongly typed while SQL is weakly typed, functions.udaf performs the conversion:
sparkSession.udf.register("avgStrong",functions.udaf(new MyUtafStrong))
sparkSession.sql("select avgStrong(age) as age from user").show()
Before Spark 3.0 you also extended Aggregator, but it had to be used through a Dataset. A Dataset is strongly typed, so the UDAF's input is strongly typed as well: the input type must be the row's case class (here User) rather than a single column:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class User(var age: Long, var name: String)
case class Buff(var total: Long, var count: Long)

// Pre-3.0 strongly typed UDAF: must be used through a Dataset
class MyUdafStrong extends Aggregator[User, Buff, Long] {
  // Initial (zero) value of the buffer
  override def zero: Buff = Buff(0L, 0L)
  // Fold one input row into the buffer
  override def reduce(b: Buff, a: User): Buff = {
    b.total = b.total + a.age
    b.count = b.count + 1L
    b
  }
  // Merge two buffers
  override def merge(b1: Buff, b2: Buff): Buff = {
    b1.total = b1.total + b2.total
    b1.count = b1.count + b2.count
    b1
  }
  // Compute the final result from the buffer
  override def finish(reduction: Buff): Long = reduction.total / reduction.count
  // Encoder for the buffer; Encoders.product handles custom case classes
  override def bufferEncoder: Encoder[Buff] = Encoders.product
  // Encoder for the output; Encoders.scalaLong for Scala's built-in Long
  override def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
It is used through the Dataset DSL rather than SQL.
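For the snippet below to compile, ds must be a Dataset[User]. A minimal construction sketch, reusing the sparkSession from earlier; the JSON path is an assumption for illustration:

import org.apache.spark.sql.{Dataset, TypedColumn}
import sparkSession.implicits._

// Assumed input: a JSON file whose rows match the User case class (age, name)
val ds: Dataset[User] = sparkSession.read.json("data/user.json").as[User]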
// Pre-Spark 3.0: the strongly typed UDAF is used via the DSL, so a Dataset is required;
// toColumn converts the Aggregator into a TypedColumn usable in select
val udafColumn: TypedColumn[User, Long] = new MyUdafStrong().toColumn
ds.select(udafColumn).show()