
spark
adream307
[spark]Spark UDT with Codegen UDF
This post shows how to define a custom data type, Point, implement an Add operation for it, and implement that Add operation in codegen. build.sbt:
name := "PointUdt"
version := "0.1"
scalaVersion := "2.12.11"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.0-pr...
Original · 2020-05-06 15:18:38 · 540 views · 0 comments
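The excerpt cuts off inside build.sbt. As a rough sketch of what the Point UDT itself might look like (the storage layout and all names here are my assumptions; Spark's UserDefinedType API is private[spark], which is why such code is usually placed under an org.apache.spark.* package):

```scala
package org.apache.spark.sql.pointudt {
  import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
  import org.apache.spark.sql.types._

  // Point is mapped to a two-element double array in Spark's internal format.
  @SQLUserDefinedType(udt = classOf[PointUDT])
  case class Point(x: Double, y: Double)

  class PointUDT extends UserDefinedType[Point] {
    override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
    override def serialize(p: Point): GenericArrayData =
      new GenericArrayData(Array(p.x, p.y))
    override def deserialize(datum: Any): Point = datum match {
      case a: ArrayData => Point(a.getDouble(0), a.getDouble(1))
    }
    override def userClass: Class[Point] = classOf[Point]
  }
}
```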
[spark] Rewriting a Plan with injectOptimizerRule
The custom UDF is registered as follows:
spark.udf.register("inc", (x: Long) => x + 1)
The test statements:
val df = spark.sql("select sum(inc(vals)) from data")
df.explain(true)
df.show()
The LogicalPlan printed for the test statements above:
== Optimized Logical Plan =...
Original · 2020-01-22 14:05:12 · 457 views · 0 comments
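injectOptimizerRule hooks a custom Rule[LogicalPlan] into the optimizer through SparkSessionExtensions. A minimal spark-shell-style sketch; the rule here is a hypothetical no-op that only logs the plan it sees, where the post's actual rule rewrites it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A rule receives the current plan and returns a (possibly rewritten) plan.
case class MyRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"optimizer saw:\n$plan")
    plan
  }
}

val spark = SparkSession.builder()
  .master("local")
  .withExtensions(_.injectOptimizerRule(MyRule))
  .getOrCreate()
```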
[spark]RewriteDistinctAggregates
If an Aggregate contains both distinct and non-distinct aggregate operations, the optimizer can rewrite it into Aggregates that contain no Distinct at all. Assume the schema:
create table animal(gkey varchar(128), cat varchar(128), dog varchar(128...
Original · 2020-01-21 14:37:06 · 340 views · 0 comments
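A query shape that triggers this rewrite, using the animal table from the excerpt (the exact aggregates are my guess at the post's example; an active SparkSession named spark is assumed):

```scala
// COUNT(DISTINCT cat) and COUNT(DISTINCT dog) form different distinct groups,
// so RewriteDistinctAggregates expands the input and computes the query as a
// chain of Aggregate nodes, none of which evaluates a DISTINCT directly.
val df = spark.sql(
  """SELECT gkey, COUNT(DISTINCT cat), COUNT(DISTINCT dog), COUNT(*)
    |FROM animal GROUP BY gkey""".stripMargin)
df.explain(true)  // the optimized plan shows an Expand feeding the Aggregates
```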
[spark]Rewrite SparkSQL Plan
OptPlanTest.scala:
import org.apache.spark.sql.SparkSession
import org.apache.log4j.Logger
import org.apache.log4j.Level
package org.apache.spark.sql.optplan { import org.apache.spark.rdd.RDD imp...
Original · 2020-01-20 15:53:48 · 371 views · 0 comments
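The package block under org.apache.spark.sql is the key trick: executing a hand-rewritten LogicalPlan means calling Dataset.ofRows, which is private[sql]. A minimal sketch of such a helper (the object name is hypothetical):

```scala
package org.apache.spark.sql.optplan {
  import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

  object PlanRunner {
    // Dataset.ofRows is reachable here only because this package
    // sits under org.apache.spark.sql.
    def run(spark: SparkSession, plan: LogicalPlan): DataFrame =
      Dataset.ofRows(spark, plan)
  }
}
```

A plan rewritten by hand, e.g. df.queryExecution.optimizedPlan transformed with transform { ... }, can then be executed through PlanRunner.run.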
[spark] Custom functions without the UDF interface
Modeled on Spark's built-in functions, this post implements a custom function that does not go through the UDF interface. MyAdd.scala:
import org.apache.spark.sql.SparkSession
import org.apache.log4j.Logger
import org.apache.log4j.Level
package org.apache.spark.sql.myfunctions { import org.ap...
Original · 2020-01-10 16:49:51 · 248 views · 0 comments
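A built-in-style function is a Catalyst expression rather than a black-box UDF: it supplies both an interpreted eval path and a codegen path, so the optimizer can see through it. A minimal sketch of what MyAdd might look like (signatures follow Spark's BinaryExpression; the exact shape of the post's class is an assumption):

```scala
package org.apache.spark.sql.myfunctions {
  import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression}
  import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
  import org.apache.spark.sql.types.{DataType, LongType}

  // my_add(a, b) = a + b over LongType, with eval and codegen paths.
  case class MyAdd(left: Expression, right: Expression) extends BinaryExpression {
    override def dataType: DataType = LongType
    override protected def nullSafeEval(a: Any, b: Any): Any =
      a.asInstanceOf[Long] + b.asInstanceOf[Long]
    override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
      defineCodeGen(ctx, ev, (a, b) => s"$a + $b")
  }
}
```

To call it from SQL the expression still has to be registered, e.g. through spark.sessionState.functionRegistry.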
[spark] Running cluster mode on a single machine
There is one server, configured as follows (lscpu output):
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per c...
Original · 2020-01-09 16:17:31 · 409 views · 0 comments
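The post presumably starts a standalone master and workers by hand; for quick experiments there is also Spark's local-cluster master string, which launches separate worker JVMs on the local machine (mainly used by Spark's own tests, shown here as an assumed alternative):

```scala
import org.apache.spark.sql.SparkSession

// local-cluster[workers, coresPerWorker, memoryPerWorkerMB]:
// 4 worker JVMs with 12 cores and 8 GB each on the 48-core box above.
val spark = SparkSession.builder()
  .master("local-cluster[4, 12, 8192]")
  .appName("single-machine cluster")
  .getOrCreate()
```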
[Spark] Filtering a CSV file by calling filter on RDD[InternalRow]
import org.apache.spark.sql.SparkSession
object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("Spark sql ...
Original · 2019-12-30 10:24:44 · 1141 views · 1 comment
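The RDD[InternalRow] behind a DataFrame is reachable through queryExecution.toRdd. A sketch of the filtering step (file name, column position, and predicate are assumptions):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

// toRdd yields Spark's internal row format: string columns arrive as
// UTF8String, and nulls must be checked explicitly.
val df = spark.read.option("header", "true").csv("people.csv")
val kept = df.queryExecution.toRdd.filter { row: InternalRow =>
  !row.isNullAt(0) && row.getUTF8String(0) == UTF8String.fromString("spark")
}
println(kept.count())
```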
[Scala] Overriding a method on a specific object instance
Scala lets you override a particular method when instantiating an object; the override affects only that instance and has no effect on other instances. Test code:
object OverrideTest {
  class A {
    def print(): Unit = {
      println("in A.print")
    }
  }
  def main(args: Array[String]): Unit = {
    v...
Original · 2019-12-28 15:27:40 · 408 views · 0 comments
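The excerpt stops inside main. A completed version of the same test; the "per-instance" override is really an anonymous subclass created at instantiation:

```scala
object OverrideTest {
  class A {
    def print(): Unit = println("in A.print")
  }

  def main(args: Array[String]): Unit = {
    // Anonymous subclass: only `a` carries the overridden method.
    val a = new A {
      override def print(): Unit = println("in overridden print")
    }
    a.print()        // in overridden print
    (new A).print()  // in A.print -- other instances are unaffected
  }
}
```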
[Spark] Implementing SparkSQL's Filter by calling the RDD directly
Filtering the data with SQL:
import org.apache.spark.sql.SparkSession
object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Spark sql whole stage ...
Original · 2019-12-27 18:59:58 · 1580 views · 0 comments
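For contrast, the same filter expressed once in SQL and once directly against the DataFrame's Row RDD (column name, type, and threshold are assumptions for illustration; an active SparkSession named spark is assumed):

```scala
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data.csv")
df.createOrReplaceTempView("data")

// Same predicate, two routes: the SQL planner vs. a hand-written RDD filter.
val viaSql = spark.sql("SELECT * FROM data WHERE vals > 10")
val viaRdd = df.rdd.filter(row => row.getAs[Int]("vals") > 10)
println(viaSql.count() == viaRdd.count())
```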
[Spark] The compute function of a custom RDD
MyRDDTest.scala:
package org.apache.spark.myrdd {
  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import scala.reflect.ClassTag
  import org.apache.spark.rdd._
  private[myrdd] cl...
Original · 2019-12-27 14:43:16 · 943 views · 0 comments
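A custom RDD's work happens in compute(), which produces the iterator for one partition. A minimal one-parent sketch in the spirit of the excerpt (class name and transformation are hypothetical; firstParent is protected[spark], hence the org.apache.spark.myrdd package):

```scala
package org.apache.spark.myrdd {
  import org.apache.spark.{Partition, TaskContext}
  import org.apache.spark.rdd.RDD

  // Wraps a parent RDD and doubles every element inside compute().
  private[myrdd] class DoubledRDD(prev: RDD[Int]) extends RDD[Int](prev) {
    override def getPartitions: Array[Partition] = firstParent[Int].partitions
    override def compute(split: Partition, context: TaskContext): Iterator[Int] =
      firstParent[Int].iterator(split, context).map(_ * 2)
  }
}
```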
[spark] Merging RDDs
Merging two Spark RDDs into one:
scala> val rdd1 = sc.parallelize(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd1.collect
res0: Arra...
Original · 2019-12-26 16:25:52 · 3950 views · 0 comments
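The excerpt cuts off before the merge itself; presumably it uses union, which concatenates two RDDs without deduplicating and without a shuffle:

```scala
val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(11 to 20)

// union (alias ++) keeps the partitions of both inputs side by side.
val merged = rdd1.union(rdd2)
merged.collect()  // Array(1, 2, ..., 20)
```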
[Spark] Custom RDDs
Scala source:
//MyRDDTest.scala
package org.apache.spark.myrdd {
  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import scala.reflect.ClassTag
  import org.apache.spark.rdd._
  private...
Original · 2019-12-27 10:20:47 · 466 views · 0 comments
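This excerpt overlaps with the compute-function post above; a from-scratch variant that supplies its own partitions instead of wrapping a parent (all names hypothetical):

```scala
package org.apache.spark.myrdd {
  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // A source RDD with n partitions; partition i yields the single element i.
  private[myrdd] class IndexPartition(override val index: Int) extends Partition

  private[myrdd] class IndexRDD(sc: SparkContext, n: Int)
      extends RDD[Int](sc, Nil) {
    override def getPartitions: Array[Partition] =
      Array.tabulate[Partition](n)(i => new IndexPartition(i))
    override def compute(split: Partition, context: TaskContext): Iterator[Int] =
      Iterator(split.asInstanceOf[IndexPartition].index)
  }
}
```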