Simple Examples of Using Spark SQL

This post uses concrete examples to show how Spark SQL creates DataFrames from JSON files, Parquet files, and ordinary RDDs, and walks through the DataFrame read and save methods, including how to set the data source format and the save mode. It also shows how to query a data source directly with SQL, how to read and write Parquet files, how schema merging works, and how to manage data through the Hive metastore. Finally, it gives an example of creating a DataFrame from a JDBC connection to MySQL.

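All of the snippets below assume a Spark 1.x shell or application where sc, sqlContext, the implicit conversions, and the Person case class are already in scope. A minimal sketch of that setup (the app name and master URL are placeholders):

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.{SQLContext, DataFrame}

 val conf = new SparkConf().setAppName("SparkSQLExamples").setMaster("spark://master:7077")
 val sc = new SparkContext(conf)
 val sqlContext = new SQLContext(sc)
 import sqlContext.implicits._   // enables rdd.toDF and $"col" syntax

 // used by the RDD-to-DataFrame examples; must be defined at top level, not inside a method
 case class Person(name: String, age: Int)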

  • Create a DataFrame from a JSON file
 val df: DataFrame = sqlContext.read.json("hdfs://master:9000/user/spark/data/people.json")
 df.registerTempTable("person")
 val teenagers: DataFrame = sqlContext.sql("select name,age from person")
 teenagers.map(x => "name:" + x(0) + " " + "age:" + x(1)).collect().foreach(println)
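The same query can also be written with the DataFrame API instead of a temp table plus SQL. A minimal sketch against the df above (the age > 13 threshold is only an illustration):

 df.filter(df("age") > 13).select("name", "age").show()   // prints the matching rows as a table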
  • Create a DataFrame from a Parquet file
 val df: DataFrame = sqlContext.read.parquet("hdfs://master:9000/user/spark/data/namesAndAges.parquet")
 df.registerTempTable("person")
 val teenagers: DataFrame = sqlContext.sql("select name,age from person")
 teenagers.map(x => "name:" + x(0) + " " + "age:" + x(1)).collect().foreach(println)
  • Create a DataFrame from an ordinary RDD, approach 1 (schema inferred from a case class)
 // relies on the Person case class and import sqlContext.implicits._ from the setup above
 val people = sc.textFile("hdfs://master:9000/user/spark/data/people.txt")
        .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
 people.registerTempTable("people")
 val teenagers = sqlContext.sql("select name,age from people")
 teenagers.map(x => "name:" + x(0) + " " + "age:" + x(1)).collect().foreach(println)
  • Create a DataFrame from an ordinary RDD, approach 2 (schema specified programmatically)
 val people = sc.textFile("hdfs://master:9000/user/spark/data/people.txt")
 val schemaString = "name age"
 import org.apache.spark.sql.Row
 import org.apache.spark.sql.types.{StructType, StructField, StringType}
 val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
 val rowRDD = people.map(_.split(",")).map(x => Row(x(0), x(1).trim))
 val df: DataFrame = sqlContext.createDataFrame(rowRDD, schema)
 df.registerTempTable("people")
 val teenagers = sqlContext.sql("select name,age from people")
 teenagers.map(x => "name:" + x(0) + " " + "age:" + x(1)).collect().foreach(println)
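Declaring every field as StringType keeps the example short, but the schema can mix types when the columns differ. A minimal sketch, assuming the age column is numeric (the IntegerType field is an assumption, not part of the original example):

 import org.apache.spark.sql.types.IntegerType
 val typedSchema = StructType(Seq(
   StructField("name", StringType, nullable = true),
   StructField("age", IntegerType, nullable = true)))   // age parsed as Int below
 val typedRows = people.map(_.split(",")).map(x => Row(x(0), x(1).trim.toInt))
 val typedDF = sqlContext.createDataFrame(typedRows, typedSchema)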
  • Test the DataFrame read and save methods (note that load defaults to Parquet files)
 val df = sqlContext.read.load("hdfs://master:9000/user/spark/data/namesAndAges.parquet")
 df.select("name").write.save("hdfs://master:9000/user/spark/data/name.parquet")
  • Test the DataFrame read and save methods (the data source format and save mode can be set explicitly)
 import org.apache.spark.sql.SaveMode
 val df = sqlContext.read.format("json").load("hdfs://master:9000/user/spark/data/people.json")
 df.select("age").write.format("parquet").mode(SaveMode.Append).save("hdfs://master:9000/user/spark/data/ages.parquet")
  • Query a data source directly with SQL
 val df = sqlContext.sql("SELECT * FROM parquet.`hdfs://master:9000/user/spark/data/ages.parquet`")
 df.map(x => "age:" + x(0)).collect().foreach(println)
  • Reading and writing Parquet files
 val people = sc.textFile("hdfs://master:9000/user/spark/data/people.txt")
        .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
 people.write.mode(SaveMode.Overwrite).parquet("hdfs://master:9000/user/spark/data/people.parquet")
 val parquetFile = sqlContext.read.parquet("hdfs://master:9000/user/spark/data/people.parquet")
 parquetFile.registerTempTable("parquetFile")
 val teenagers = sqlContext.sql("SELECT name FROM parquetFile")
 teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
  • Schema Merging
 val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
 df1.write.mode(SaveMode.Overwrite).parquet("hdfs://master:9000/user/spark/data/test_table/key=1")
 val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
 df2.write.mode(SaveMode.Overwrite).parquet("hdfs://master:9000/user/spark/data/test_table/key=2")
 val df3 = sqlContext.read.option("mergeSchema", "true").parquet("hdfs://master:9000/user/spark/data/test_table")
 df3.printSchema()
 df3.show()
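With mergeSchema enabled, the result is the union of the two partition schemas plus the partition column parsed from the directory names, so printSchema() should report something like:

 root
  |-- single: integer (nullable = true)
  |-- double: integer (nullable = true)
  |-- triple: integer (nullable = true)
  |-- key: integer (nullable = true)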
  • Hive metastore
 import org.apache.spark.sql.hive.HiveContext
 val sqlContext = new HiveContext(sc)
 sqlContext.setConf("spark.sql.shuffle.partitions", "5")
 sqlContext.sql("use my_hive")
 sqlContext.sql("create table if not exists sogouInfo (time STRING,id STRING,webAddr STRING,downFlow INT,upFlow INT,url STRING) " +
          "row format delimited fields terminated by '\t'")
 sqlContext.sql("LOAD DATA LOCAL INPATH '/root/testData/SogouQ1.txt' OVERWRITE INTO TABLE sogouInfo")
 sqlContext.sql("select count(distinct id) as c from sogouInfo group by time order by c desc limit 10")
       .collect().foreach(println)
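Note that HiveContext requires a Spark build that includes Hive support; to reach an existing metastore, place hive-site.xml in Spark's conf directory, otherwise Spark creates a local metastore_db in the current working directory.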
  • Create a DataFrame over JDBC, e.g. from MySQL
 val sqlContext = new SQLContext(sc)
 val jdbcDF = sqlContext.read.format("jdbc").options(Map(
             "driver" -> "com.mysql.jdbc.Driver",
             "url" -> "jdbc:mysql://192.168.0.65:3306/test?user=root&password=root",
             "dbtable" -> "trade_total_info_copy")).load()
 jdbcDF.registerTempTable("trade_total_info_copy")
 sqlContext.sql("select * from trade_total_info_copy").collect().foreach(println)
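For this to work, the MySQL JDBC driver jar must be visible to both the driver and the executors, or the load fails with a ClassNotFoundException; for example, launch the shell with spark-shell --jars mysql-connector-java-5.1.38-bin.jar --driver-class-path mysql-connector-java-5.1.38-bin.jar (the jar version is an assumption, use whatever matches your MySQL server).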