sparkSQL中UDF的使用

本文介绍如何在 Spark 中使用 UDF 功能实现自定义方法,包括多参数支持、字段合并、统计数量等实用技巧。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

Spark中使用sql时一些功能需要自定义方法实现,这时候就可以使用UDF功能来实现

多参数支持

UDF不支持参数*的方式输入多个参数,例如String*,不过可以使用array来解决这个问题。

定义udf方法,此处功能是将多个字段合并为一个字段

<code class="hljs python has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-function" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">def</span> <span class="hljs-title" style="box-sizing: border-box;">allInOne</span><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(seq: Seq[Any], sep: String)</span>:</span> String = seq.mkString(sep)</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li></ul>

在sql中使用

<code class="hljs scala has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;">sqlContext.udf.register(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"allInOne"</span>, allInOne _)

<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//将col1,col2,col3三个字段合并,使用','分割</span>
<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">val</span> sql =
    <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"""
      |select allInOne(array(col1,col2,col3),",") as col
      |from tableName
    """</span>.stripMargin
sqlContext.sql(sql).show()</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li></ul>

在DataFrame中使用

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;">import org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.sql</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.functions</span>.{udf,array,lit}
val myFunc = udf(allInOne _)
val cols = array(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"col1"</span>,<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"col2"</span>,<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"col3"</span>)
val sep = lit(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">","</span>)
df<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.select</span>(myFunc(cols,sep)<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.alias</span>(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"col"</span>))<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.show</span>()</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li></ul>

一些简单的例子

1.个数统计

表结构如下,统计出每个人的爱好个数

name hobbies
alice jogging,Coding,cooking
lina travel,dance
<code class="hljs coffeescript has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># 将某个字段中逗号分隔的数量统计出来</span>
sqlContext.udf.register<span class="hljs-function" style="box-sizing: border-box;"><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"hobby_num"</span>, (s: String) => s.split(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">','</span>).size)</span>
<span class="hljs-title" style="box-sizing: border-box;">sqlContext</span>.<span class="hljs-title" style="box-sizing: border-box;">sql</span><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"select *,hobby_num(hobbies) as hobby_num from table"</span>)</span></span></code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li></ul>

结果

name hobbies hobby_num
alice read book,coding,cooking 3
lina travel,dance 2

2.空值填补

表结构如下

A B
null 123456
234234 234234
<code class="hljs coffeescript has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># 填补第一个字段的空值</span>
sqlContext.udf.register<span class="hljs-function" style="box-sizing: border-box;"><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"combine"</span>, (s1: String,s2: String)=> {<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span>(s1 == <span class="hljs-literal" style="color: rgb(0, 102, 102); box-sizing: border-box;">null</span>) s2 <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">else</span> s1})</span>
<span class="hljs-title" style="box-sizing: border-box;">sqlContext</span>.<span class="hljs-title" style="box-sizing: border-box;">sql</span><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"select combine(A,B) as A from table"</span>)</span></span></code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li></ul>

结果

A
123456
234234

3. 类型转化

类型转化,将 String 转化为 Int

sqlContext.udf.register("str2Int", (s: String) => s.toInt)

或者直接使用cast

sqlContext.sql("select cast(a AS Int) from table")

4. 综合运用

原始数据,ID(用户名),loginIP(帐号登录的ip地址)

ID loginIP
alice ip1
lina ip2
sven ip3
alice ip1
sven ip2
alice ip4

计算每个用户在哪些ip登录过,并统计数量

ID ip_list loginIP_num
alice ip1,ip4 2
lina ip2 1
sven ip2,ip3 2
<code class="hljs coffeescript has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-regexp" style="color: rgb(0, 136, 0); box-sizing: border-box;">//</span>统计数量
sqlContext.udf.register<span class="hljs-function" style="box-sizing: border-box;"><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"list_size"</span>, (s: String) => s.split(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">','</span>).size)</span>
<span class="hljs-title" style="box-sizing: border-box;">val</span> <span class="hljs-title" style="box-sizing: border-box;">sql</span> =
    """<span class="hljs-title" style="box-sizing: border-box;">select</span> <span class="hljs-title" style="box-sizing: border-box;">ID</span>,<span class="hljs-title" style="box-sizing: border-box;">ip_list</span>,<span class="hljs-title" style="box-sizing: border-box;">list_size</span><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(ip_list)</span> <span class="hljs-title" style="box-sizing: border-box;">as</span> <span class="hljs-title" style="box-sizing: border-box;">loginIP_num</span>
      |<span class="hljs-title" style="box-sizing: border-box;">from</span> <span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(select ID,concat_ws(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">','</span>,collect_set(loginIP)) as ip_list from table)</span>
    """.<span class="hljs-title" style="box-sizing: border-box;">stripMargin</span>
<span class="hljs-title" style="box-sizing: border-box;">sqlContext</span>.<span class="hljs-title" style="box-sizing: border-box;">sql</span><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(sql)</span></span></code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li></ul>

参考:

[Spark UDF with varargs]( 
http://stackoverflow.com/questions/33151866/spark-udf-with-varargs)

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值