1. Scala version
1.1: Create a DataFrame from an RDD of JavaBean objects
val sparkConf = new SparkConf().setMaster("local").setAppName("adaa")
val sc = new SparkContext(sparkConf)
var sqlcontext = new SQLContext(sc)

var rdd: RDD[Emp] = sc.parallelize(Array(
  "1 zhangsan 3000 20",
  "2 lisi 4000 10",
  "3 wangwu 8000 20",
  "4 zhaoliu 1000 20"
)).flatMap(line => {
  var arr = line.split("\\s+")
  var empno: Int = arr(0).toInt
  var ename: String = arr(1)
  var sal: Int = arr(2).toInt
  var deptno: Int = arr(3).toInt
  Array(Emp(empno, ename, sal, deptno))
})

var dataframe = sqlcontext.createDataFrame(rdd, Class.forName("a.Emp"))
dataframe.registerTempTable("emp")
var newdataframe = sqlcontext.sql("select empno from emp")
newdataframe.show()

Notes:
1. var rdd: RDD[Emp] holds Emp objects, so when creating the DataFrame from this RDD the Emp class must be loaded as well. That is what sqlcontext.createDataFrame(rdd, Class.forName("a.Emp")) does: Class.forName("a.Emp") uses the class loader to load the class of the elements stored in the RDD.
2. The flatMap operator takes a function that turns every RDD element into an array; flatMap then walks through each of these small arrays and flattens them into one collection. (Each line here produces exactly one Emp, so a plain map would work just as well.)
3. registerTempTable registers the DataFrame as a table, so the data can be queried with SQL through sqlcontext.sql(...).
4. Where does the column name empno in the SQL statement come from? It is derived from createDataFrame(rdd, Class.forName("a.Emp")): take a getter name of the class a.Emp (for example def getEmpno = empno), strip the get prefix and lower-case the first letter. Keep in mind that this is the default way column names are derived (a small sketch follows the Emp class below).
The class used in 1.1:

package a

// "1 zhangsan 3000 20"
class Emp(empno: Int, ename: String, sal: Int, deptno: Int) {
  def this() {
    this(0, null, 0, 0)
  }
  def getEmpno = empno
  def getEname = ename
  def getSal = sal
  def getDeptno = deptno
}

object Emp {
  def apply() = {
    new Emp()
  }
  def apply(empno: Int, ename: String, sal: Int, deptno: Int) = {
    new Emp(empno, ename, sal, deptno)
  }
}
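To make the column-naming rule from note 4 concrete, here is a minimal sketch. The getter getAnnualSal is hypothetical (it is not part of the Emp class above); if it were added, the derived column would be called annualSal, i.e. the getter name with get stripped and the first letter lower-cased.

// Hypothetical getter, added to Emp only for illustration:
//   def getAnnualSal = sal * 12
// After createDataFrame(rdd, Class.forName("a.Emp")) the DataFrame would then
// also expose a column named "annualSal".

// Inspect the columns that were derived from the getters of a.Emp
dataframe.printSchema()

// The derived names can be used directly in SQL
sqlcontext.sql("select empno, ename, sal, deptno from emp").show()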
1.2: Create a DataFrame from an RDD of Row
val sparkConf = new SparkConf().setMaster("local").setAppName("adaa")
val sc = new SparkContext(sparkConf)
var sqlcontext = new SQLContext(sc)

var stringRDD: RDD[String] = sc.parallelize(Array(
  "1 zhangsan 3000 20",
  "2 lisi 4000 10",
  "3 wangwu 8000 20",
  "4 zhaoliu 1000 20"
))

var rowRDD = stringRDD.flatMap(line => {
  var arr = line.split("\\s+")
  Array(Row(arr(0), arr(1), arr(2), arr(3)))
})

// In createDataFrame the second parameter is a StructType: it spells out, for
// every value in the Row (the first parameter), the column name and data type.
// A StructField is simply one field of that structure.
var structType = StructType(Array(
  StructField("empno", StringType, true),
  StructField("ename", StringType, true),
  StructField("sal", StringType, true),
  StructField("deptno", StringType, true)
))

var dataframe = sqlcontext.createDataFrame(rowRDD, structType)
dataframe.registerTempTable("emp")
sqlcontext.sql("select * from emp").show
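In the code above every column is declared as StringType because the Row keeps the raw string tokens. As a minimal sketch (the typed* names are illustrative, not from the original), the numeric tokens can be converted before the Row is built and the schema can then use IntegerType:

// map is enough here because each line yields exactly one Row
var typedRowRDD = stringRDD.map(line => {
  var arr = line.split("\\s+")
  Row(arr(0).toInt, arr(1), arr(2).toInt, arr(3).toInt)
})

var typedStructType = StructType(Array(
  StructField("empno", IntegerType, true),
  StructField("ename", StringType, true),
  StructField("sal", IntegerType, true),
  StructField("deptno", IntegerType, true)
))

var typedDataframe = sqlcontext.createDataFrame(typedRowRDD, typedStructType)
typedDataframe.printSchema()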
1.3: Create a DataFrame by loading data from a database table
val sparkConf = new SparkConf().setMaster("local").setAppName("adaa")
val sc = new SparkContext(sparkConf)
var sqlcontext = new SQLContext(sc)

var options = scala.collection.mutable.Map(
  "url" -> "jdbc:mysql:///test",
  "driver" -> "com.mysql.jdbc.Driver",
  "user" -> "root",
  "password" -> "123456",
  "dbtable" -> "emp"
)

// load the first table
var empDF = sqlcontext.read.format("jdbc").options(options).load

// switch dbtable and load the second table
options("dbtable") = "dept"
var deptDF = sqlcontext.read.format("jdbc").options(options).load

empDF.show()
deptDF.show()
1.4: Append DataFrame data to a MySQL database
val conf = new SparkConf().setAppName("MySQL-Demo").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// create an RDD via parallelize
val userRDD = sc.parallelize(Array(
  Row("zhangsan", 10),
  Row("lisi", 12),
  Row("wangwu", 19)
))

// specify the schema of every field directly with a StructType
val schema = StructType(List(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)
))

val userDataFrame = sqlContext.createDataFrame(userRDD, schema)

// create a Properties object holding the database connection properties
val prop = new Properties()
prop.put("user", "root")
prop.put("password", "123456")
prop.put("driver", "com.mysql.jdbc.Driver")

// append the data to the database table
userDataFrame.write.mode("append").jdbc("jdbc:mysql:///test", "user", prop)
// save the data straight to a table (the table does not need to exist beforehand)
// userDataFrame.write.jdbc("jdbc:mysql:///test", "user", prop)

// stop the SparkContext
sc.stop()
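mode("append") adds rows to an existing table. As a minimal sketch, the other common write modes and a read-back check look like this (these calls would go before sc.stop(); the connection details are the same assumptions as above):

// overwrite the table instead of appending
// userDataFrame.write.mode("overwrite").jdbc("jdbc:mysql:///test", "user", prop)

// "error" (the default) fails if the table already exists,
// "ignore" silently does nothing in that case
// userDataFrame.write.mode("ignore").jdbc("jdbc:mysql:///test", "user", prop)

// read the table back to verify what was written
val written = sqlContext.read.jdbc("jdbc:mysql:///test", "user", prop)
written.show()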
1.5: Various DataFrame queries
Create two DataFrames to query against (query examples follow the code below):
var sparkconf = new SparkConf().setMaster("local").setAppName("name2")
var sc = new SparkContext(sparkconf)

// an RDD loading array data with the EMP structure
var emprows: RDD[Row] = sc.parallelize(Array(
  Row(7301, "enamne1", "job1", 5000, 10),
  Row(7302, "enamne2", "job2", 5000, 10),
  Row(7303, "enamne3", "job3", 5000, 20),
  Row(7304, "enamne4", "job4", 5000, 20),
  Row(7305, "enamne5", "job5", 5000, 20),
  Row(7306, "enamne6", "job6", 5000, 30),
  Row(7307, "enamne7", "job7", 5000, 30)
))

// an RDD loading array data with the DEPT structure
var deptrows: RDD[Row] = sc.parallelize(Array(
  Row(10, "dname1", "loc1"),
  Row(20, "dname2", "loc2"),
  Row(30, "dname3", "loc3")
))

var sqlcontext = new SQLContext(sc)

var empSchema = StructType(Array(
  StructField("empno", IntegerType, true),
  StructField("ename", StringType, true),
  StructField("job", StringType, true),
  StructField("sal", IntegerType, true),
  StructField("deptno", IntegerType, true)
))

var deptSchema = StructType(Array(
  StructField("deptno", IntegerType, true),
  StructField("dname", StringType, true),
  StructField("loc", StringType, true)
))

var empDataframe = sqlcontext.createDataFrame(emprows, empSchema)
var deptDataframe = sqlcontext.createDataFrame(deptrows, deptSchema)
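The original stops after building the two DataFrames, so here is a minimal sketch of the kinds of queries the section title refers to, using the empDataframe and deptDataframe created above (the column names come from empSchema and deptSchema):

// show all rows
empDataframe.show()

// select specific columns
empDataframe.select("empno", "ename", "deptno").show()

// filter rows with a condition expression
empDataframe.filter("deptno = 20").show()

// join emp and dept through SQL
empDataframe.registerTempTable("emp")
deptDataframe.registerTempTable("dept")
sqlcontext.sql(
  "select e.empno, e.ename, d.dname from emp e join dept d on e.deptno = d.deptno"
).show()

// group and aggregate
empDataframe.groupBy("deptno").count().show()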
2. Java version
2.1: Create a DataFrame from an RDD of JavaBean objects
SparkConf sparkConf = new SparkConf();
sparkConf.setMaster("local");
sparkConf.setAppName("sadsfv");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
List<Emp> emps = new ArrayList<>();
emps.add(new Emp(1, "zhangsan", 1000, 10));
emps.add(new Emp(2, "lisi", 3000, 10));
emps.add(new Emp(3, "wangwu", 1000, 20));
JavaRDD<Emp> empRDD = javaSparkContext.parallelize(emps);
SQLContext sqlContext = new SQLContext(javaSparkContext);
DataFrame dataFrame = sqlContext.createDataFrame(empRDD, Emp.class);
dataFrame.registerTempTable("emp");
sqlContext.sql("select * from emp").show();
2.2: Create a DataFrame from an RDD of Row
SparkConf sparkConf = new SparkConf();
sparkConf.setMaster("local");
sparkConf.setAppName("sadsfv");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
List<Row> rows = new ArrayList<>();

// do not instantiate Row directly; use RowFactory (Row's factory class) to create rows
Row row1 = RowFactory.create("1", "zhangsan", "3000", "20");
Row row2 = RowFactory.create("2", "lisi", "3400", "10");
Row row3 = RowFactory.create("3", "wangwu", "5700", "20");
rows.add(row1);
rows.add(row2);
rows.add(row3);
JavaRDD<Row> rowRDD = javaSparkContext.parallelize(rows);
SQLContext sqlContext = new SQLContext(javaSparkContext);
StructType structType = new StructType(new StructField[]{
    DataTypes.createStructField("empno", DataTypes.StringType, true),
    DataTypes.createStructField("ename", DataTypes.StringType, true),
    DataTypes.createStructField("sal", DataTypes.StringType, true),
    DataTypes.createStructField("deptno", DataTypes.StringType, true)
});
DataFrame dataFrame = sqlContext.createDataFrame(rowRDD, structType);
dataFrame.registerTempTable("emp");
sqlContext.sql("select * from emp").show();
2.3: Create a DataFrame by loading data from a database table
SparkConf conf = new SparkConf().setAppName("JDBCDataSource").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

Map<String, String> options = new HashMap<String, String>();
options.put("url", "jdbc:mysql:///test");
options.put("driver", "com.mysql.jdbc.Driver");
options.put("user", "root");
options.put("password", "123456");

// read the first table
options.put("dbtable", "emp");
DataFrame empDF = sqlContext.read().format("jdbc").options(options).load();

// read the second table
options.put("dbtable", "dept");
DataFrame deptDF = sqlContext.read().format("jdbc").options(options).load();

empDF.show();
deptDF.show();
2.4: Append DataFrame data to a MySQL database
SparkConf conf = new SparkConf().setAppName("JDBCDataSource").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

List<Emp> emps = new ArrayList<>();
// empno is left null here, presumably so the database can generate the key itself
emps.add(new Emp(null, "ename1", 1000, 10));
emps.add(new Emp(null, "ename2", 2000, 10));
emps.add(new Emp(null, "ename3", 3000, 20));
JavaRDD<Emp> empRDD = sc.parallelize(emps);
DataFrame empDataFrame = sqlContext.createDataFrame(empRDD, Emp.class);

Properties options = new Properties();
options.put("driver", "com.mysql.jdbc.Driver");
options.put("user", "root");
options.put("password", "123456");

// append the data to the existing emp table
empDataFrame.write().mode("append").jdbc("jdbc:mysql:///test", "emp", options);
// write straight to a table that does not need to exist beforehand
// empDataFrame.write().jdbc("jdbc:mysql:///test", "emp", options);

sc.close();
2.5: Various DataFrame queries
Code that builds the DataFrame:
SparkConf conf = new SparkConf().setAppName("JDBCDataSource").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

List<Emp> emps = new ArrayList<>();
emps.add(new Emp(1, "ename1", 1000, 10));
emps.add(new Emp(2, "ename2", 2000, 10));
emps.add(new Emp(3, "ename3", 3000, 20));
JavaRDD<Emp> empRDD = sc.parallelize(emps);
DataFrame empDataFrame = sqlContext.createDataFrame(empRDD, Emp.class);
2.5.1: Query all rows of the table
empDataFrame.show();
2.5.2: Query specific columns
empDataFrame.select(
    new Column("empno"),
    new Column("ename"),
    new Column("deptno")
).show();
2.5.3: Filter a DataFrame
empDataFrame.filter("empno > 2").show();

The result is:

+------+-----+------+----+
|deptno|empno| ename| sal|
+------+-----+------+----+
|    20|    3|ename3|3000|
+------+-----+------+----+