1. Reading a local txt file
When reading a file from the local filesystem, prefix the path with file://. Without a scheme, Spark resolves the path against the cluster's default filesystem (HDFS here), as the failed first attempt below shows:
[WBQ@westgisB068 ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/04/17 11:09:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://westgisB068:4040
Spark context available as 'sc' (master = local[*], app id = local-1681700998461).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/
Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_271)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> val df = sc.textFile("/home/WBQ/1.txt")
df: org.apache.spark.rdd.RDD[String] = /home/WBQ/1.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> df.first()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://westgisB064:8020/home/WBQ/1.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:304)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:244)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:332)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:208)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:292)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:292)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:288)
at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1449)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
at org.apache.spark.rdd.RDD.take(RDD.scala:1443)
at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1484)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
at org.apache.spark.rdd.RDD.first(RDD.scala:1484)
... 47 elided
Caused by: java.io.IOException: Input path does not exist: hdfs://westgisB064:8020/home/WBQ/1.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:278)
... 67 more
scala> val df = sc.textFile("file:///home/WBQ/1.txt")
df: org.apache.spark.rdd.RDD[String] = file:///home/WBQ/1.txt MapPartitionsRDD[3] at textFile at <console>:24
scala> df.first()
res2: String = 1 2 3 gfv 4 ff1 2 3 gfv 4 ff
scala>
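The fallback behavior above can be sketched in plain Scala: when a path carries no scheme, Hadoop resolves it against the configured default filesystem (fs.defaultFS, HDFS on this cluster), while an explicit file:// scheme overrides that. The helper below is an illustrative simplification, not Hadoop's actual resolution code, and the namenode address is made up:

```scala
import java.net.URI

// If the path has no scheme, prepend the default filesystem URI
// (what fs.defaultFS does on a real cluster); otherwise keep it as-is.
def resolveScheme(path: String, defaultFs: String): String = {
  val uri = new URI(path)
  if (uri.getScheme == null) s"$defaultFs$path"
  else path
}

// No scheme: resolved against the (hypothetical) HDFS namenode.
println(resolveScheme("/home/WBQ/1.txt", "hdfs://namenode:8020"))
// Explicit file:// scheme: read from the local filesystem.
println(resolveScheme("file:///home/WBQ/1.txt", "hdfs://namenode:8020"))
```

This is exactly why the first textFile call above ended up looking for hdfs://westgisB064:8020/home/WBQ/1.txt.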
2. Reading a local csv file
2.1 Data transfer
Transfer the data to the current node:
[WBQ@westgisB064 soft]$ scp -r traffic/ WBQ@10.103.105.68:/home/WBQ/soft/
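Before launching spark-shell it is worth confirming the transfer landed, which avoids the missing-path error discussed in section 3. A generic sketch, run on the receiving node; the path matches the scp command above:

```shell
# Return success only if the directory exists and contains at least one entry.
check_dir() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Path from the scp command above; run this on the receiving node (westgisB068).
if check_dir /home/WBQ/soft/data/traffic; then
  echo "data present"
else
  echo "data missing" >&2
fi
```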
2.2 Open spark-shell and read the data
[WBQ@westgisB068 ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/04/17 11:23:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://westgisB068:4040
Spark context available as 'sc' (master = local[*], app id = local-1681701827307).
Spark session available as 'spark'.
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> val df = spark.read.format("csv").option("header",value="true").option("encoding","utf-8").load("file:///home/WBQ/soft/data/traffic/part-*")
df: org.apache.spark.sql.DataFrame = [卡号: string, 交易时间: string ... 2 more fields]
scala> df.show(5)
+-------+--------------------+--------+--------+
| 卡号| 交易时间|线路站点|交易类型|
+-------+--------------------+--------+--------+
|3697647|2018-10-01T18:47:...| 大新|地铁出站|
|3697647|2018-10-01T18:35:...|宝安中心|地铁入站|
|3697647|2018-10-01T13:49:...| 大新|地铁入站|
|3697647|2018-10-01T14:03:...|宝安中心|地铁出站|
|5344820|2018-10-17T09:34:...| 罗湖|地铁入站|
+-------+--------------------+--------+--------+
only showing top 5 rows
scala>
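The part-* pattern in the load() call is a glob: it matches the part files that Spark (or an upstream job) writes out, while skipping markers such as _SUCCESS. The same glob semantics can be checked locally with Java NIO; the file names below are typical examples, not taken from the actual dataset:

```scala
import java.nio.file.{FileSystems, Paths}

// Glob matcher equivalent to the part-* pattern used in load() above.
val partFiles = FileSystems.getDefault.getPathMatcher("glob:part-*")

println(partFiles.matches(Paths.get("part-00000")))  // true: a data file
println(partFiles.matches(Paths.get("_SUCCESS")))    // false: a job marker
```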
3. Why reading the local file failed
Error message: the input path does not exist. Note the path in the message: Spark looked on HDFS (hdfs://westgisB064:8020/...) because the path carried no file:// scheme.
Check: the file indeed does not exist there.
Fix: transfer the data from another node to the local node, then read it with a file:// path.
scala> df.first()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://westgisB064:8020/home/WBQ/1.txt
... (stack trace elided; it is identical to the one shown in section 1)
scala>
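To fail fast with a readable message instead of the long Hadoop stack trace above, the local path can be checked before handing it to Spark. A minimal sketch, assuming the file should live on the driver's local filesystem; localPathOrError is a hypothetical helper, not a Spark API:

```scala
import java.io.File

// Verify the path exists locally, then return it with an explicit file://
// scheme so Spark will not resolve it against the default HDFS filesystem.
def localPathOrError(path: String): String = {
  require(new File(path).exists(), s"Input path does not exist locally: $path")
  s"file://$path"
}

// In spark-shell one would then write:
// val rdd = sc.textFile(localPathOrError("/home/WBQ/1.txt"))
```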