在Spark中使用sql时一些功能需要自定义方法实现,这时候就可以使用UDF功能来实现
多参数支持
UDF不支持参数*
的方式输入多个参数,例如String*
,不过可以使用array来解决这个问题。
定义udf方法,此处功能是将多个字段合并为一个字段
<code class="hljs python has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-function" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">def</span> <span class="hljs-title" style="box-sizing: border-box;">allInOne</span><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(seq: Seq[Any], sep: String)</span>:</span> String = seq.mkString(sep)</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li></ul>
在sql中使用
<code class="hljs scala has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;">sqlContext.udf.register(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"allInOne"</span>, allInOne _) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//将col1,col2,col3三个字段合并,使用','分割</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">val</span> sql = <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">""" |select allInOne(array(col1,col2,col3),",") as col |from tableName """</span>.stripMargin sqlContext.sql(sql).show()</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li></ul>
在DataFrame中使用
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;">import org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.sql</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.functions</span>.{udf,array,lit} val myFunc = udf(allInOne _) val cols = array(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"col1"</span>,<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"col2"</span>,<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"col3"</span>) val sep = lit(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">","</span>) df<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.select</span>(myFunc(cols,sep)<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.alias</span>(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"col"</span>))<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.show</span>()</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li></ul>
一些简单的例子
1.个数统计
表结构如下,统计出每个人的爱好个数
name | hobbies |
---|---|
alice | jogging,Coding,cooking |
lina | travel,dance |
<code class="hljs coffeescript has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># 将某个字段中逗号分隔的数量统计出来</span> sqlContext.udf.register<span class="hljs-function" style="box-sizing: border-box;"><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"hobby_num"</span>, (s: String) => s.split(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">','</span>).size)</span> <span class="hljs-title" style="box-sizing: border-box;">sqlContext</span>.<span class="hljs-title" style="box-sizing: border-box;">sql</span><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"select *,hobby_num(hobbies) as hobby_num from table"</span>)</span></span></code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li></ul>
结果
name | hobbies | hobby_num |
---|---|---|
alice | read book,coding,cooking | 3 |
lina | travel,dance | 2 |
2.空值填补
表结构如下
A | B |
---|---|
null | 123456 |
234234 | 234234 |
<code class="hljs coffeescript has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># 填补第一个字段的空值</span> sqlContext.udf.register<span class="hljs-function" style="box-sizing: border-box;"><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"combine"</span>, (s1: String,s2: String)=> {<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span>(s1 == <span class="hljs-literal" style="color: rgb(0, 102, 102); box-sizing: border-box;">null</span>) s2 <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">else</span> s1})</span> <span class="hljs-title" style="box-sizing: border-box;">sqlContext</span>.<span class="hljs-title" style="box-sizing: border-box;">sql</span><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"select combine(A,B) as A from table"</span>)</span></span></code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li></ul>
结果
A |
---|
123456 |
234234 |
3. 类型转化
类型转化,将 String
转化为 Int
sqlContext.udf.register("str2Int", (s: String) => s.toInt)
或者直接使用cast
sqlContext.sql("select cast(a AS Int) from table")
4. 综合运用
原始数据,ID(用户名),loginIP(帐号登录的ip地址)
ID | loginIP |
---|---|
alice | ip1 |
lina | ip2 |
sven | ip3 |
alice | ip1 |
sven | ip2 |
alice | ip4 |
计算每个用户在哪些ip登录过,并统计数量
ID | ip_list | loginIP_num |
---|---|---|
alice | ip1,ip4 | 2 |
lina | ip2 | 1 |
sven | ip2,ip3 | 2 |
<code class="hljs coffeescript has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-regexp" style="color: rgb(0, 136, 0); box-sizing: border-box;">//</span>统计数量 sqlContext.udf.register<span class="hljs-function" style="box-sizing: border-box;"><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"list_size"</span>, (s: String) => s.split(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">','</span>).size)</span> <span class="hljs-title" style="box-sizing: border-box;">val</span> <span class="hljs-title" style="box-sizing: border-box;">sql</span> = """<span class="hljs-title" style="box-sizing: border-box;">select</span> <span class="hljs-title" style="box-sizing: border-box;">ID</span>,<span class="hljs-title" style="box-sizing: border-box;">ip_list</span>,<span class="hljs-title" style="box-sizing: border-box;">list_size</span><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(ip_list)</span> <span class="hljs-title" style="box-sizing: border-box;">as</span> <span class="hljs-title" style="box-sizing: border-box;">loginIP_num</span> |<span class="hljs-title" style="box-sizing: border-box;">from</span> <span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(select ID,concat_ws(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">','</span>,collect_set(loginIP)) as ip_list from table)</span> """.<span class="hljs-title" style="box-sizing: border-box;">stripMargin</span> <span class="hljs-title" style="box-sizing: border-box;">sqlContext</span>.<span class="hljs-title" style="box-sizing: border-box;">sql</span><span class="hljs-params" style="color: rgb(102, 0, 102); box-sizing: border-box;">(sql)</span></span></code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li></ul>
参考:
[Spark UDF with varargs](
http://stackoverflow.com/questions/33151866/spark-udf-with-varargs)