Loading a CSV data source
SparkContext operations
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("csvDemo")
val sc = SparkContext.getOrCreate(conf)
val lines = sc.textFile("in/users.csv")
/* Approach 1: load the CSV and drop the header with mapPartitionsWithIndex
val line2 = lines.mapPartitionsWithIndex((index, value) => {
  // the header row is the first element of partition 0
  if (index == 0) value.drop(1) else value
})
val line3 = line2.map(x => x.split(","))
for (x <- line3) {
  println(x.toList)
}*/
/* Approach 2: drop the header by filtering out the line that starts with "user_id"
val line1 = lines.filter(x => !x.startsWith("user_id")).map(x => x.split(","))
line1.collect().foreach(x => println(x.toList))*/
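A third common variant, shown here only as a minimal sketch (the header and rows names are illustrative, not from the original post), reads the header row once with first() and then filters it out:

val header = lines.first()                                   // the "user_id,..." header line
val rows = lines.filter(line => line != header).map(_.split(","))
rows.take(5).foreach(x => println(x.toList))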
SparkSession operations
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType

val sprk = SparkSession.builder().appName("sparksession").master("local[*]").getOrCreate()
val df = sprk.read.format("csv").option("header", true)
  .load("in/users.csv")
df.printSchema()
df.show(10)
// df.select("user_id","birthyear").show(10)
// df.select("user_id","birthyear").printSchema()
// cast the birthyear column in the schema from String to Double
val df2 = df.select("user_id","birthyear")
val df3 = df2.withColumn("birthyear", df2("birthyear").cast(DoubleType))
df3.printSchema()
df3.filter(x => !x.isNullAt(1) && x.getDouble(1) > 1995).show(10)
// withColumnRenamed returns a new DataFrame, so keep the result in a new val
val df4 = df2.withColumnRenamed("birthyear", "birthYear")
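As an alternative to casting columns after the fact, the CSV reader can also infer column types while loading. A minimal sketch reusing the same file (the inferred types depend on the actual data):

val dfInferred = sprk.read.format("csv")
  .option("header", true)
  .option("inferSchema", true) // let Spark sample the file and guess each column's type
  .load("in/users.csv")
dfInferred.printSchema()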
Loading a JSON data source
SparkContext operations
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("json")
val sc = SparkContext.getOrCreate(conf)
val lines: RDD[String] = sc.textFile("in/user.json")
import scala.util.parsing.json.JSON
// JSON.parseFull returns Option[Any]: Some(Map(...)) for a well-formed object, None otherwise
val rdd = lines.map(x => JSON.parseFull(x))
rdd.collect().foreach(println)
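Because each element is an Option wrapping a Map, individual fields have to be pulled out with pattern matching. A small sketch (the user_id field name is only an assumption about what user.json contains):

val parsedFields = rdd.collect {
  // keep lines that parsed into a JSON object and look up one field (hypothetical name)
  case Some(fields: Map[String, Any]) => fields.get("user_id")
}
parsedFields.foreach(println)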

SparkSession operations
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[*]").appName("jsonsession").getOrCreate()
// the header option only applies to CSV, so it is dropped here; the json reader ignores it
val frame = spark.read.format("json").load("in/user.json")
frame.printSchema()
frame.show()
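The json reader expects one JSON object per line by default; if user.json were instead a single pretty-printed document spread over several lines, the multiLine option (available since Spark 2.2) would be needed. A hedged sketch:

val multiLineFrame = spark.read.format("json")
  .option("multiLine", true) // treat the whole file as one JSON document
  .load("in/user.json")
multiLineFrame.show()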

Running Scala code from a jar with spark-submit
Configuration file:
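The original post does not reproduce the file contents; the sketch below only illustrates the two keys the Scala code reads (path and savepath), with placeholder values:

# test.properties (placeholder paths; adjust to your environment)
path=/root/data/words.txt
savepath=/root/data/wordcount_result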

Scala code:
import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("json")
val sc = SparkContext.getOrCreate(conf)
val properties = new Properties()
// path where test.properties is stored on the Linux host
properties.load(new FileInputStream("/root/test.properties"))
val path: String = properties.getProperty("path")
val savepath: String = properties.getProperty("savepath")
val rdd: RDD[String] = sc.textFile(path)
val result: RDD[(String, Int)] = rdd.flatMap(x => x.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
result.foreach(println)
result.saveAsTextFile(savepath)
Packaging:
Project Structure -> Artifacts -> "+" -> JAR -> From modules with dependencies -> choose the Main Class and the module (project) -> Apply -> OK
Build -> Build Artifacts -> Build
Removing the signature files from the jar
1. Before uploading to Linux, manually delete the signature files from the jar: the .DSA and .SF files under the META-INF folder.
2. Or delete them on Linux:
zip -d /opt/kb09File/sparkdemo1.jar 'META-INF/*.DSA' 'META-INF/*.SF'
Submitting the jar and running the Scala job
spark-submit --class nj.kb11.HelloWorld --master local[*] ./sparkdemo.jar


This article covers processing CSV and JSON data with Apache Spark: loading, cleaning, and transforming data through both SparkContext and SparkSession, adjusting DataFrame schemas, and running filter queries.