IV. Spark SQL Functions: A Source-Level Walkthrough
1. Spark SQL Built-in Functions: Internals and Hands-on Practice
Spark SQL's DataFrame API ships with a large number of built-in functions. Most of them support code generation (CG, CodeGeneration), so they are heavily optimized at both compile time and execution time.
Question: Is using Spark SQL to operate on Hive the same thing as Hive on Spark?
=> No. When Spark SQL operates on Hive, Hive is only used as the source of the data warehouse, and the compute engine is Spark SQL itself. Hive on Spark is an initiative within the Hive project whose core idea is to swap Hive's execution engine for Spark: Hive's default compute engine has been MapReduce, and because of its poor performance the Hive community set out to replace it.
Using Spark SQL on data stored in Hive is called Spark on Hive, whereas Hive on Spark keeps Hive at the core and merely replaces the MapReduce compute engine with Spark.
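As a minimal sketch of the Spark on Hive usage (assuming Hive's hive-site.xml is on the classpath and a Hive table named src already exists — both are assumptions for illustration), Spark SQL only reads Hive's metadata and data, while all computation is done by Spark itself:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SparkOnHiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkOnHive").setMaster("local"))
    // HiveContext treats Hive purely as a catalog/storage layer; the execution engine is Spark SQL.
    val hiveContext = new HiveContext(sc)
    // "src" is a placeholder table name; replace it with a real Hive table.
    hiveContext.sql("SELECT COUNT(*) FROM src").show()
    sc.stop()
  }
}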
The DataFrame API docs on the Spark website:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
Experimental
A distributed collection of data organized into named columns.
A DataFrame is equivalent to a relational table in Spark SQL. The following example creates a DataFrame by pointing Spark SQL to a Parquet data set.
val people = sqlContext.read.parquet("...")             // in Scala
DataFrame people = sqlContext.read().parquet("...")     // in Java

Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in: DataFrame (this class), Column, and functions.

To select a column from the data frame, use apply method in Scala and col in Java.

val ageCol = people("age")              // in Scala
Column ageCol = people.col("age")       // in Java

Note that the Column type can also be manipulated through its various functions.

// The following creates a new column that increases everybody's age by 10.
people("age") + 10          // in Scala
people.col("age").plus(10); // in Java

A more concrete example in Scala:

// To create DataFrame using SQLContext
val people = sqlContext.read.parquet("...")
val department = sqlContext.read.parquet("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), "gender")
  .agg(avg(people("salary")), max(people("age")))

and in Java (column references fixed to use col(), which the original doc snippet omitted):

// To create DataFrame using SQLContext
DataFrame people = sqlContext.read().parquet("...");
DataFrame department = sqlContext.read().parquet("...");

people.filter(people.col("age").gt(30))
  .join(department, people.col("deptId").equalTo(department.col("id")))
  .groupBy(department.col("name"), "gender")
  .agg(avg(people.col("salary")), max(people.col("age")));
The join, groupBy, and agg used above are all Spark SQL built-in functions.
Spark 1.5.x and later versions added a large number of built-in functions; by a rough count there are more than one hundred of them.
Below is a hands-on example of an aggregation:
package com.dt.spark

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions._

/**
 * Use Spark SQL built-in functions to analyze data. Unlike ordinary Spark SQL transformation APIs,
 * a DataFrame built-in function returns a Column object, and a DataFrame is by nature
 * "A distributed collection of data organized into named columns." This lays a solid foundation for
 * complex analysis and is extremely convenient: we can call built-in functions at any point while
 * operating on a DataFrame, which greatly reduces unnecessary work when building complex business
 * logic (essentially a direct mapping of the actual business model) and lets us focus on the data
 * analysis itself -- a real productivity gain for engineers.
 *
 * Spark 1.5.x introduced a large number of built-in functions, for example agg:
 *   def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame = {
 *     groupBy().agg(aggExpr, aggExprs : _*)
 *   }
 * as well as max, mean, min, sum, avg, explode, size, sort_array, day, to_date, abs, acos, asin, atan.
 *
 * Overall the built-in functions fall into several broad categories:
 * 1. aggregate functions, e.g. countDistinct, sumDistinct;
 * 2. collection functions, e.g. sort_array, explode;
 * 3. date/time functions, e.g. hour, quarter, next_day;
 * 4. math functions, e.g. asin, atan, sqrt, tan, round;
 * 5. window functions, e.g. rowNumber;
 * 6. string functions, e.g. concat, format_number, regexp_extract;
 * 7. other functions, e.g. isNaN, sha, randn, callUDF.
 */
object SparkSQLAgg {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "G:/datarguru spark/tool/hadoop-2.6.0")
    val conf = new SparkConf()
    conf.setAppName("SparkSQLInnerFunctions")
    //conf.setMaster("spark://master:7077")
    conf.setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc) // build the SQL context

    // To use Spark SQL built-in functions, the implicit conversions of SQLContext must be imported
    import sqlContext.implicits._

    // Simulated e-commerce access data (real data is far more complex), turned into an RDD below
    val userData = Array(
      "2016-3-27,001,http://spark.apache.org/,1000",
      "2016-3-27,001,http://hadoop.apache.org/,1001",
      "2016-3-27,002,http://flink.apache.org/,1002",
      "2016-3-28,003,http://kafka.apache.org/,1020",
      "2016-3-28,004,http://spark.apache.org/,1010",
      "2016-3-28,002,http://hive.apache.org/,1200",
      "2016-3-28,001,http://parquet.apache.org/,1500",
      "2016-3-28,001,http://spark.apache.org/,1800"
    )
    val userDataRDD = sc.parallelize(userData) // create the distributed collection

    // Preprocess the data into a DataFrame as required by the business logic. To turn an RDD into a
    // DataFrame, first map its elements to Row, and also supply the metadata describing the columns.
    val userDataRDDRow = userDataRDD.map(row => {
      val splited = row.split(",")
      Row(splited(0), splited(1).toInt, splited(2), splited(3).toInt)
    })
    val structType = StructType(Array(
      StructField("time", StringType, true),
      StructField("id", IntegerType, true),
      StructField("url", StringType, true),
      StructField("amount", IntegerType, true)
    ))
    val userDataDF = sqlContext.createDataFrame(userDataRDDRow, structType)

    // Step 5: operate on the DataFrame with Spark SQL built-in functions. Note: built-in functions
    // produce Column objects and are automatically code-generated.
    userDataDF.groupBy("time").agg('time, countDistinct('id))
      .map(row => Row(row(1), row(2))).collect().foreach(println)

    userDataDF.groupBy("time").agg('time, sum('amount))
      .map(row => Row(row(1), row(2))).collect().foreach(println)
  }
}
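As a follow-up, the same aggregation can be written without the Scala Symbol syntax ('time, 'id), using Column-based functions and explicit aliases; a minimal sketch that assumes the userDataDF and imports from the listing above:

// countDistinct and sum come from org.apache.spark.sql.functions (already imported above);
// the $"..." column syntax requires import sqlContext.implicits._ (also already imported above).
userDataDF.groupBy($"time")
  .agg(countDistinct($"id").as("distinct_users"), sum($"amount").as("amount_total"))
  .show()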
2. Spark SQL Window Functions: Internals and Hands-on Practice
Window functions fall into three groups: ranking functions, analytic functions, and aggregate functions.
For a fairly complete introduction to window functions, see:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-windows.html
The most important window function is row_number. row_number ranks rows within a group: the data is first partitioned, and rows are then ordered inside each partition.
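Besides the SQL form used below, the same partition-then-rank logic can be expressed through the DataFrame window API. A minimal sketch, assuming a DataFrame named scoresDF with columns name and score (an assumption for illustration); note the function is exposed as rowNumber in Spark 1.5/1.6 and as row_number from Spark 2.0 on:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number} // rowNumber in Spark 1.5/1.6

// Rank rows inside each name partition by descending score, then keep the top 4 per group.
val byScoreDesc = Window.partitionBy("name").orderBy(col("score").desc)
val top4 = scoresDF
  .withColumn("rank", row_number().over(byScoreDesc))
  .where(col("rank") <= 4)
top4.show()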
Below we rewrite the TopNGroup.scala program using Spark SQL and run it:
package com.dt.spark

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object SparkSQLWindowFunctionOps {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    conf.setMaster("spark://master:7077")
    conf.setAppName("SparkSQLWindowFunctionOps")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    hiveContext.sql("DROP TABLE IF EXISTS scores")
    hiveContext.sql("CREATE TABLE IF NOT EXISTS scores(name STRING, score INT) "
      + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\\n'")

    // Load the data to be processed into the Hive table
    hiveContext.sql("LOAD DATA LOCAL INPATH 'G://datarguru spark/tool/topNGroup.txt' INTO TABLE scores")
    //hiveContext.sql("LOAD DATA LOCAL INPATH '/opt/spark-1.4.0-bin-hadoop2.6/dataSource' INTO TABLE scores")

    /**
     * Use a subquery to extract the target data, applying the window function row_number
     * inside it for grouped ranking:
     * PARTITION BY: the key by which the window function partitions the data;
     * ORDER BY: how rows are ordered within each partition.
     */
    val result = hiveContext.sql("SELECT name, score FROM ("
      + "SELECT name, score, row_number() OVER (PARTITION BY name ORDER BY score DESC) rank FROM scores) sub_scores "
      + "WHERE rank <= 4")
    result.show() // print the results on the Driver's console

    // Save the results into the Hive data warehouse
    hiveContext.sql("DROP TABLE IF EXISTS sortedResultScores")
    result.saveAsTable("sortedResultScores")
  }
}
Errors encountered with the original (unfixed) HiveQL:

ERROR metadata.Hive: NoSuchObjectException(message:default.scores table not found)
Exception in thread "main" org.apache.spark.sql.AnalysisException: missing BY at '' '' near '<EOF>'; line 1 pos 96

The NoSuchObjectException is typically just logged by DROP TABLE IF EXISTS when the scores table does not exist yet and can be ignored. The AnalysisException comes from the DDL: FIELDS TERMINATED ' ' was missing the BY keyword (and the query needed a space between sub_scores and WHERE, otherwise the two tokens are concatenated). The listing above already contains the corrected statements.
Reference:
http://blog.youkuaiyun.com/slq1023/article/details/51138709
3. Spark SQL UDFs and UDAFs: Internals and Hands-on Practice
UDAF = User-Defined Aggregate Function
The following case study walks through the concrete use of UDFs and UDAFs in Spark SQL:
* UDF: User Defined Function — a user-defined function whose input is a single data record; from an implementation standpoint it is just an ordinary Scala function;
* UDAF: User Defined Aggregation Function — a user-defined aggregate function that operates on a collection of rows and lets you add custom logic on top of the aggregation;
* Under the hood, a UDF, for example, is wrapped by Spark SQL's Catalyst into an Expression, and the input Row is ultimately evaluated through its eval method (the Row here is unrelated to the Row in DataFrame).
1) Hands-on: writing a UDF and a UDAF:
package com.dt.spark

import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object SparkSQLUDFUDAF {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "G:/datarguru spark/tool/hadoop-2.6.0")
    val conf = new SparkConf()
    conf.setAppName("SparkSQLUDFUDAF")
    conf.setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Simulated data
    val bigData = Array("Spark", "Spark", "Hadoop", "Spark", "Hadoop", "Spark", "Spark", "Hadoop", "Spark", "Hadoop")

    // Create a DataFrame from the provided data
    val bigDataRDD = sc.parallelize(bigData)
    val bigDataRow = bigDataRDD.map(item => Row(item))
    val structType = StructType(Array(StructField("word", StringType, true)))
    val bigDataDF = sqlContext.createDataFrame(bigDataRow, structType)
    bigDataDF.registerTempTable("bigDataTable") // register as a temporary table

    // Register the UDF through SQLContext; in Scala 2.10.x a UDF can take at most 22 input parameters
    sqlContext.udf.register("computeLength", (input: String) => input.length)

    // Use the UDF directly in SQL, just like any built-in SQL function
    sqlContext.sql("select word, computeLength(word) as length from bigDataTable").show()

    sqlContext.udf.register("wordCount", new MyUDAF)
    sqlContext.sql("select word, wordCount(word) as count, computeLength(word) " +
      "as length from bigDataTable group by word").show()

    while (true) {} // keep the application alive so its web UI can still be inspected
  }
}

class MyUDAF extends UserDefinedAggregateFunction { // Ctrl+I to generate the overridden methods
  /**
   * Specifies the type of the input data
   */
  override def inputSchema: StructType = StructType(Array(StructField("input", StringType, true)))

  /**
   * The type of the intermediate result handled during aggregation
   */
  override def bufferSchema: StructType = StructType(Array(StructField("count", IntegerType, true)))

  /**
   * The result type returned by the UDAF
   */
  override def dataType: DataType = IntegerType

  override def deterministic: Boolean = true

  /**
   * The initial value of each group's buffer before aggregation starts
   */
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0
  }

  /**
   * How a new input value is folded into the per-group buffer.
   * This is the local aggregation, comparable to the Combiner in Hadoop MapReduce
   * (the Row here is unrelated to the DataFrame Row).
   */
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getAs[Int](0) + 1
  }

  /**
   * After the local (per-node) aggregation, perform the global merge of the buffers
   */
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Int](0) + buffer2.getAs[Int](0)
  }

  /**
   * Return the UDAF's final result
   */
  override def evaluate(buffer: Row): Any = buffer.getAs[Int](0)
}
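A registered UDF is not limited to SQL strings. As a small usage sketch (assuming the bigDataDF and registrations above), it can also be invoked from the DataFrame API through callUDF, which looks the function up by its registered name:

import org.apache.spark.sql.functions.{callUDF, col}

// "computeLength" is the name passed to sqlContext.udf.register above.
bigDataDF
  .select(col("word"), callUDF("computeLength", col("word")).as("length"))
  .show()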
2) The source of UDFRegistration:
/**
 * Functions for registering user-defined functions. Use [[SQLContext.udf]] to access this.
 *
 * @since 1.3.0
 */
class UDFRegistration private[sql] (sqlContext: SQLContext) extends Logging {

  private val functionRegistry = sqlContext.functionRegistry

  protected[sql] def registerPython(name: String, udf: UserDefinedPythonFunction): Unit = {
    log.debug(
      s"""
        | Registering new PythonUDF:
        | name: $name
        | command: ${udf.command.toSeq}
        | envVars: ${udf.envVars}
        | pythonIncludes: ${udf.pythonIncludes}
        | pythonExec: ${udf.pythonExec}
        | dataType: ${udf.dataType}
      """.stripMargin)

    functionRegistry.registerFunction(name, udf.builder)
  }

  /**
   * Register a user-defined aggregate function (UDAF).
   *
   * @param name the name of the UDAF.
   * @param udaf the UDAF needs to be registered.
   * @return the registered UDAF.
   */
  def register(
      name: String,
      udaf: UserDefinedAggregateFunction): UserDefinedAggregateFunction = {
    def builder(children: Seq[Expression]) = ScalaUDAF(children, udaf)
    functionRegistry.registerFunction(name, builder)
    udaf
  }

  // scalastyle:off

  /* register 0-22 were generated by this script

    (0 to 22).map { x =>
      val types = (1 to x).foldRight("RT")((i, s) => {s"A$i, $s"})
      val typeTags = (1 to x).map(i => s"A${i}: TypeTag").foldLeft("RT: TypeTag")(_ + ", " + _)
      val inputTypes = (1 to x).foldRight("Nil")((i, s) => {s"ScalaReflection.schemaFor[A$i].dataType :: $s"})
      println(s"""
        /**
         * Register a Scala closure of ${x} arguments as user-defined function (UDF).
         * @tparam RT return type of UDF.
         * @since 1.3.0
         */
        def register[$typeTags](name: String, func: Function$x[$types]): UserDefinedFunction = {
          val dataType = ScalaReflection.schemaFor[RT].dataType
          val inputTypes = Try($inputTypes).getOrElse(Nil)
          def builder(e: Seq[Expression]) = ScalaUDF(func, dataType, e, inputTypes)
          functionRegistry.registerFunction(name, builder)
          UserDefinedFunction(func, dataType, inputTypes)
        }""")
    }

    (1 to 22).foreach { i =>
      val extTypeArgs = (1 to i).map(_ => "_").mkString(", ")
      val anyTypeArgs = (1 to i).map(_ => "Any").mkString(", ")
      val anyCast = s".asInstanceOf[UDF$i[$anyTypeArgs, Any]]"
      val anyParams = (1 to i).map(_ => "_: Any").mkString(", ")
      println(s"""
         |/**
         | * Register a user-defined function with ${i} arguments.
         | * @since 1.3.0
         | */
         |def register(name: String, f: UDF$i[$extTypeArgs, _], returnType: DataType) = {
         |  functionRegistry.registerFunction(
         |    name,
         |    (e: Seq[Expression]) => ScalaUDF(f$anyCast.call($anyParams), returnType, e))
         |}""".stripMargin)
    }
    */

  /**
   * Register a Scala closure of 0 arguments as user-defined function (UDF).
   * @tparam RT return type of UDF.
   * @since 1.3.0
   */
  def register[RT: TypeTag](name: String, func: Function0[RT]): UserDefinedFunction = {
    val dataType = ScalaReflection.schemaFor[RT].dataType
    val inputTypes = Try(Nil).getOrElse(Nil)
    def builder(e: Seq[Expression]) = ScalaUDF(func, dataType, e, inputTypes)
    functionRegistry.registerFunction(name, builder)
    UserDefinedFunction(func, dataType, inputTypes)
  }
The source of FunctionRegistry is as follows:
object FunctionRegistry {

  type FunctionBuilder = Seq[Expression] => Expression

  val expressions: Map[String, (ExpressionInfo, FunctionBuilder)] = Map(
    // misc non-aggregate functions
    expression[Abs]("abs"),
    expression[CreateArray]("array"),
    expression[Coalesce]("coalesce"),
    expression[Explode]("explode"),
    expression[Greatest]("greatest"),
    expression[If]("if"),
    expression[IsNaN]("isnan"),
    expression[IsNull]("isnull"),
    expression[IsNotNull]("isnotnull"),
    expression[Least]("least"),
    expression[Coalesce]("nvl"),
    expression[Rand]("rand"),
    expression[Randn]("randn"),
    expression[CreateStruct]("struct"),
    expression[CreateNamedStruct]("named_struct"),
    expression[Sqrt]("sqrt"),
    expression[NaNvl]("nanvl"),

    // math functions
    expression[Acos]("acos"),
    expression[Asin]("asin"),
    expression[Atan]("atan"),
    expression[Atan2]("atan2"),
    expression[Bin]("bin"),
    expression[Cbrt]("cbrt"),
    expression[Ceil]("ceil"),
    expression[Ceil]("ceiling"),
    expression[Cos]("cos"),
    expression[Cosh]("cosh"),
    expression[Conv]("conv"),
    expression[EulerNumber]("e"),
    expression[Exp]("exp"),
    expression[Expm1]("expm1"),
    expression[Floor]("floor"),
    expression[Factorial]("factorial"),
    expression[Hypot]("hypot"),
    expression[Hex]("hex"),
    expression[Logarithm]("log"),
    expression[Log]("ln"),
    expression[Log10]("log10"),
    expression[Log1p]("log1p"),
    expression[Log2]("log2"),
    expression[UnaryMinus]("negative"),
    expression[Pi]("pi"),
    expression[Pow]("pow"),
    expression[Pow]("power"),
    expression[Pmod]("pmod"),
    expression[UnaryPositive]("positive"),
    expression[Rint]("rint"),
    expression[Round]("round"),
    expression[ShiftLeft]("shiftleft"),
    expression[ShiftRight]("shiftright"),
    expression[ShiftRightUnsigned]("shiftrightunsigned"),
    expression[Signum]("sign"),
    expression[Signum]("signum"),
    expression[Sin]("sin"),
    expression[Sinh]("sinh"),
    expression[Tan]("tan"),
    expression[Tanh]("tanh"),
    expression[ToDegrees]("degrees"),
    expression[ToRadians]("radians"),

    // aggregate functions
    expression[HyperLogLogPlusPlus]("approx_count_distinct"),
    expression[Average]("avg"),
    expression[Corr]("corr"),
    expression[Count]("count"),
    expression[First]("first"),
    expression[First]("first_value"),
    expression[Last]("last"),
    expression[Last]("last_value"),
    expression[Max]("max"),
    expression[Average]("mean"),
    expression[Min]("min"),
    expression[StddevSamp]("stddev"),
    expression[StddevPop]("stddev_pop"),
    expression[StddevSamp]("stddev_samp"),
    expression[Sum]("sum"),
    expression[VarianceSamp]("variance"),
    expression[VariancePop]("var_pop"),
    expression[VarianceSamp]("var_samp"),
    expression[Skewness]("skewness"),
    expression[Kurtosis]("kurtosis"),

    // string functions
    expression[Ascii]("ascii"),
    expression[Base64]("base64"),
    expression[Concat]("concat"),
    expression[ConcatWs]("concat_ws"),
    expression[Encode]("encode"),
    expression[Decode]("decode"),
    expression[FindInSet]("find_in_set"),
    expression[FormatNumber]("format_number"),
    expression[GetJsonObject]("get_json_object"),
    expression[InitCap]("initcap"),
    expression[JsonTuple]("json_tuple"),
    expression[Lower]("lcase"),
    expression[Lower]("lower"),
    expression[Length]("length"),
    expression[Levenshtein]("levenshtein"),
    expression[RegExpExtract]("regexp_extract"),
    expression[RegExpReplace]("regexp_replace"),
    expression[StringInstr]("instr"),
    expression[StringLocate]("locate"),
    expression[StringLPad]("lpad"),
    expression[StringTrimLeft]("ltrim"),
    expression[FormatString]("format_string"),
    expression[FormatString]("printf"),
    expression[StringRPad]("rpad"),
    expression[StringRepeat]("repeat"),
    expression[StringReverse]("reverse"),
    expression[StringTrimRight]("rtrim"),
    expression[SoundEx]("soundex"),
    expression[StringSpace]("space"),
    expression[StringSplit]("split"),
    expression[Substring]("substr"),
    expression[Substring]("substring"),
    expression[SubstringIndex]("substring_index"),
    expression[StringTranslate]("translate"),
    expression[StringTrim]("trim"),
    expression[UnBase64]("unbase64"),
    expression[Upper]("ucase"),
    expression[Unhex]("unhex"),
    expression[Upper]("upper"),
    ...
As you can see, Spark SQL's built-in functions are registered in the same way as UDFs.
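Since each entry in that map binds a SQL-level name to an Expression builder, all of these names can be used directly in SQL strings and in selectExpr. A minimal sketch, reusing the userDataDF from section 1 (an assumption) with sqlContext in scope:

userDataDF.registerTempTable("logs")
// upper, log10, substring and round are all entries in the FunctionRegistry listing above.
sqlContext.sql("SELECT upper(url) AS url_upper, log10(amount) AS lg FROM logs").show()
userDataDF.selectExpr("substring(url, 1, 14) AS url_prefix", "round(amount / 100.0, 1) AS hundreds").show()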
4. Spark SQL Thrift Server in Practice
The Thrift JDBC/ODBC server implemented here corresponds to the HiveServer2 in Hive 1.2.1. You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1.
Start the JDBC/ODBC server:
ps -aux | grep hive
hive --service metastore &      # start the Hive metastore first
[1] 28268
./sbin/start-thriftserver.sh

# Now you can use beeline to test the Thrift JDBC/ODBC server:
./bin/beeline

# Connect to the JDBC/ODBC server in beeline with:
beeline> !connect jdbc:hive2://master:10000
# user: root, empty password; after that, Hive commands can be issued as usual
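Once the connection succeeds, beeline behaves like a regular SQL shell against the Thrift server. A short hedged sketch of statements typed at the beeline prompt (the scores and sortedResultScores tables come from the window-function example in section 2 and only exist if that program has been run):

show tables;
select name, score from scores limit 10;
select count(*) from sortedResultScores;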
Accessing the Thrift Server from Java over JDBC
package com.dt.sparksql;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/**
 * Demonstrates accessing the Thrift Server from Java over JDBC, and through it Spark SQL and Hive.
 * This is the most common pattern in enterprise development.
 * @author dt_spark
 */
public class SparkSQLJDBC2ThriftServer {

    public static void main(String[] args) throws SQLException {
        String sqlText = "select name from people where age = ?";
        Connection conn = null;
        ResultSet resultSet = null;
        try {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            conn = DriverManager.getConnection("jdbc:hive2://<master>:<10001>/<default>?"
                    + "hive.server2.transport.mode=http;hive.server2.thrift.http.path=<cliserver>",
                    "root", "");
            PreparedStatement preparedStatement = conn.prepareStatement(sqlText);
            preparedStatement.setInt(1, 30);
            resultSet = preparedStatement.executeQuery();
            while (resultSet.next()) {
                System.out.println(resultSet.getString(1)); // the underlying data would typically be stored as Parquet
            }
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } finally {
            resultSet.close();
            conn.close();
        }
    }
}
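Note that the angle-bracket placeholders (<master>, <10001>, <default>, <cliserver>) must be replaced with real values before running, and that the URL above assumes the Thrift server was started in HTTP transport mode on port 10001. With the default binary transport started by ./sbin/start-thriftserver.sh, as in the previous step, the connection string reduces to jdbc:hive2://master:10000/default — the same URL used from beeline.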
This article has taken a close look at key Spark SQL features — built-in functions, window functions, user-defined functions (UDFs and UDAFs), and the Thrift Server — with concrete examples showing how to use them to make data processing more efficient.