Spark hello world (loading a local file and an HDFS file)

License: CC 4.0 BY-SA

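This article shows how to use Spark to load files from the local file system and from HDFS, walks through the steps for reading a file from each location, and shows how to save the result back to the local file system or to HDFS.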

This section covers:
Using spark-shell to load a local file and an HDFS file

Files that Spark processes may live on the local file system or on a distributed file system.

Loading a local file

Create a test file

[root@sandbox home]# cd /home/guest/

// Create a directory under the guest directory
[root@sandbox guest]# mkdir erhuan
// Create a test file in the new directory
[root@sandbox guest]# cd erhuan/
[root@sandbox erhuan]# vi hellospark
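
The test file can also be created without an interactive editor. The following is a minimal sketch, assuming a POSIX shell. `DEMO_DIR` is a hypothetical variable that does not appear in the original walkthrough; it defaults to a scratch directory, so set it to /home/guest/erhuan to match the paths used in the rest of this article.

```shell
# Hypothetical variable: the article itself works in /home/guest/erhuan.
DEMO_DIR="${DEMO_DIR:-${TMPDIR:-/tmp}/erhuan}"

# Create the directory (no error if it already exists).
mkdir -p "$DEMO_DIR"

# Write the single line of test data that the article later reads back.
printf '%s\n' 'this is a hello word txt' > "$DEMO_DIR/hellospark"

# Show the file size in bytes.
wc -c < "$DEMO_DIR/hellospark"
```

The line is 24 characters plus a trailing newline, so the file is 25 bytes, which is the size reported for part-00000 in the output listing further down.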

Start spark-shell

[root@sandbox erhuan]# spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
17/04/12 14:45:41 INFO SecurityManager: Changing view acls to: root
17/04/12 14:45:41 INFO SecurityManager: Changing modify acls to: root
17/04/12 14:45:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
17/04/12 14:45:41 INFO HttpServer: Starting HTTP Server
17/04/12 14:45:41 INFO Utils: Successfully started service 'HTTP class server' on port 47623.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.1
      /_/

// remaining output omitted

Load the local file

Spark context available as sc.
// Load the file with sc.textFile(); the file:// scheme selects the local file system
scala> val textFile = sc.textFile("file:///home/guest/erhuan/hellospark")

// output omitted
textFile: org.apache.spark.rdd.RDD[String] = file:///home/guest/erhuan/hellospark MappedRDD[1] at textFile at <console>:12
// Run an action
scala> textFile.first()
// output omitted
17/04/12 14:53:27 WARN DomainSocketFactory: The short-circuit local reads feature cannot be 
17/04/12 14:53:27 INFO DAGScheduler: Job 0 finished: first at <console>:15, took 0.306226 s
res0: String = this is a hello word txt
// Spark evaluates lazily: textFile() only records how to read the file, and nothing is actually read until an action such as first() is invoked

Save the result to the local file system
scala> textFile.saveAsTextFile("file:///home/guest/erhuan/wordres")
17/04/12 14:59:31 INFO DefaultExecutionContext: Starting job: saveAsTextFile at <console>:15
17/04/12 14:59:31 INFO DAGScheduler: Got job 6 (saveAsTextFile at <console>:15) with 2 output partitions (allowLocal=false)
// remaining output omitted

Exit spark-shell and inspect the contents of the "/home/guest/erhuan/wordres" directory

// Exit spark-shell
scala> exit
[root@sandbox erhuan]# cd wordres/
[root@sandbox wordres]# ll
total 4
-rw-r--r-- 1 root root 25 2017-04-12 14:59 part-00000
-rw-r--r-- 1 root root  0 2017-04-12 14:59 part-00001
-rw-r--r-- 1 root root  0 2017-04-12 14:59 _SUCCESS
[root@sandbox wordres]# more part-00000
this is a hello word txt
// Spark has now loaded a local file and written it back to the local file system
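
The listing above shows that saveAsTextFile() produces a directory rather than a single file: one part-NNNNN file per output partition (the log reports 2 output partitions, hence part-00000 and part-00001) plus an empty _SUCCESS marker. A minimal sketch of reading such a directory back as one stream, assuming a POSIX shell; `read_spark_output` is a hypothetical helper name, not part of Spark or Hadoop:

```shell
# Hypothetical helper: print a saveAsTextFile() output directory as one stream.
read_spark_output() {
    out_dir="$1"
    # _SUCCESS is the empty marker file left behind when the job completes.
    if [ ! -f "$out_dir/_SUCCESS" ]; then
        echo "no _SUCCESS marker in $out_dir" >&2
        return 1
    fi
    # The glob expands in sorted order, so partitions are printed in sequence.
    cat "$out_dir"/part-*
}
```

For the directory written above, it would be called as `read_spark_output /home/guest/erhuan/wordres`. Checking the marker first avoids reading a directory left half-written by a failed job.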

Loading an HDFS file

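// First get the file onto HDFS. To avoid permission problems, move it to /tmp first (note that mv removes it from /home/guest/erhuan)
[root@sandbox tmp]# mv /home/guest/erhuan/hellospark /tmp
// As the hdfs user, create a directory on HDFS and upload the local file into it
[hdfs@sandbox tmp]$ hadoop fs -mkdir -p /user/erhuan
[hdfs@sandbox tmp]$ hadoop fs -put /tmp/hellospark /user/erhuan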
[root@sandbox erhuan]# spark-shell
// startup output omitted

Load the HDFS file

Spark context available as sc.
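// Load the file with sc.textFile(); a path without a scheme is resolved against the default file system, which is HDFS on this sandbox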
scala> val textFile = sc.textFile("/user/erhuan/hellospark")
17/04/12 15:33:29 INFO MemoryStore: ensureFreeSpace(277063) called with curMem=684755,
// remaining output omitted

// Run an action to check that the load succeeded
scala> textFile.first()
// output omitted
17/04/12 15:33:32 INFO DAGScheduler: Job 0 finished: first at <console>:15, took 0.543566 s
res3: String = this is a hello word txt

// Write the result back to HDFS

scala> textFile.saveAsTextFile("/user/erhuan/res")
17/04/12 15:36:34 INFO DefaultExecutionContext: Starting job: saveAsTextFile at <console>:15
17/04/12 15:36:34 INFO DAGScheduler: Got job 1 (saveAsTextFile at <console>:15) with 2 output partitions (allowLocal=false)
// remaining output omitted

// Exit spark-shell
// Check the result
[hdfs@sandbox tmp]$ hadoop fs -ls /user/erhuan/res
Found 3 items
-rw-r--r--   1 hdfs hdfs          0 2017-04-12 15:36 /user/erhuan/res/_SUCCESS
-rw-r--r--   1 hdfs hdfs         25 2017-04-12 15:36 /user/erhuan/res/part-00000
-rw-r--r--   1 hdfs hdfs          0 2017-04-12 15:36 /user/erhuan/res/part-00001
[hdfs@sandbox tmp]$ hadoop fs -cat /user/erhuan/res/part-00000
this is a hello word txt
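
The HDFS output directory has the same layout as the local one, so it can be pulled back into a single local file. A minimal sketch, assuming the `hadoop` command is on the PATH and the current user can read the directory; `fetch_hdfs_output` is a hypothetical helper name, while `hadoop fs -test -e` and `hadoop fs -getmerge` are standard Hadoop file system shell commands:

```shell
# Hypothetical helper: merge an HDFS saveAsTextFile() directory into one local file.
fetch_hdfs_output() {
    hdfs_dir="$1"
    local_file="$2"
    # Bail out early on machines without a Hadoop client.
    if ! command -v hadoop >/dev/null 2>&1; then
        echo "hadoop not found on PATH" >&2
        return 127
    fi
    # -test -e exits with status 0 only if the path exists.
    if ! hadoop fs -test -e "$hdfs_dir/_SUCCESS"; then
        echo "no _SUCCESS marker in $hdfs_dir" >&2
        return 1
    fi
    # -getmerge concatenates the files in the directory into one local file.
    hadoop fs -getmerge "$hdfs_dir" "$local_file"
}
```

For the job above, it would be called as `fetch_hdfs_output /user/erhuan/res /tmp/res.txt`, where /tmp/res.txt is an arbitrary local destination chosen for illustration.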