Thanks to the author of the original series of articles; this post is just a record of my own learning process. My respects to the author!
https://blog.youkuaiyun.com/lovehuangjiaju/article/details/48661847
1. Create the file people.json. Note that Spark's JSON data source expects one complete, self-contained JSON object per line (JSON Lines format), not a single multi-line JSON document:
{"name":"Michael", "age":27}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
2. Upload it to HDFS, target path: /data/people.json
hdfs dfs -put ./people.json /data
3. Check in HDFS that the file arrived intact:
[root@hd-02 ~]# hdfs dfs -cat /data/people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
[root@hd-02 ~]#
4. Start the Spark shell and run the following code:
bin/spark-shell
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val df = sqlContext.read.json("hdfs://hd-01:9000/data/people.json")
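A side note beyond the original post: this walkthrough uses the Spark 1.x API. On Spark 2.x and later, SQLContext is superseded by SparkSession, which spark-shell already exposes as spark (and registerTempTable below becomes createOrReplaceTempView), so the same read would look roughly like this:

scala> val df = spark.read.json("hdfs://hd-01:9000/data/people.json")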
5. Test queries against the DataFrame
scala> df.show
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
scala> df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
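Spark infers age as long (reported as bigint in SQL) because JSON numbers carry no width information. If a narrower type is wanted, one option (a sketch relying on the standard Column.cast behavior) is:

scala> val dfInt = df.withColumn("age", df("age").cast("int"))  // replace the column with an int version
scala> dfInt.printSchema()                                      // age: integer (nullable = true)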
scala> df.select("name").show
+-------+
| name|
+-------+
|Michael|
| Andy|
| Justin|
+-------+
scala> df.filter( df("age") > 21 ).show
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
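Columns also support arithmetic expressions and aggregations, not just comparisons. A quick sketch in the same session, following the pattern from the Spark programming guide:

scala> df.select(df("name"), df("age") + 1).show  // adds 1 to every age; Michael's null stays null
scala> df.groupBy("age").count().show             // row count per distinct age value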
scala> df.registerTempTable("people")
scala> val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
teenagers: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
scala> teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
Name: Justin
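Indexing rows positionally with t(0) breaks silently if the column order ever changes. A minimal alternative sketch, reading the field by name via Row.getAs (available since Spark 1.4), produces the same result:

scala> teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)
Name: Justin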