Method 1: read the zip through binaryFiles and ZipInputStream
Reference: https://blog.youkuaiyun.com/GCR8949/article/details/80155064
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream

val spark = SparkSession.builder().master("local[*]").appName("readZip").getOrCreate()

// One (path, PortableDataStream) pair per zip file; the stream is opened inside the task
val binaryRDD = spark.sparkContext.binaryFiles("XXX.zip")

val dataRDD = binaryRDD.flatMap { case (name: String, content: PortableDataStream) =>
  val zis = new ZipInputStream(content.open())
  // Walk the archive entry by entry; the shared ZipInputStream is positioned
  // at the current entry, so the inner reader consumes exactly that entry
  Stream.continually(zis.getNextEntry)
    .takeWhile(_ != null)
    .flatMap { _ =>
      val br = new BufferedReader(new InputStreamReader(zis))
      Stream.continually(br.readLine()).takeWhile(_ != null)
    }
}

dataRDD.take(10).foreach(println)

// If every line is a JSON object, the RDD can be fed straight to the JSON reader
spark.read.json(dataRDD).show(100)
Method 2: use spark.sparkContext.newAPIHadoopRDD
Read the archive via spark.sparkContext.newAPIHadoopRDD together with an InputFormat that knows how to decode zip entries; see the sketch after the references.
References:
https://www.thinbug.com/q/28569788
https://blog.youkuaiyun.com/zpf_940810653842/article/details/104815533
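The references are not reproduced here, so what follows is only a minimal sketch of the idea. It assumes the third-party ZipFileInputFormat from the hadoop-zip project (com.cotdp.hadoop) is on the classpath; that format is commonly described as emitting one (Text, BytesWritable) pair per entry inside the archive. The class, its key/value types, and the entry-to-lines decoding are assumptions, not part of the original notes.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.spark.sql.SparkSession
import com.cotdp.hadoop.ZipFileInputFormat  // assumed third-party InputFormat, not bundled with Spark/Hadoop

val spark = SparkSession.builder().master("local[*]").appName("readZipHadoop").getOrCreate()

// newAPIHadoopRDD takes its input path from the Hadoop configuration
val conf = new Configuration()
conf.set("mapreduce.input.fileinputformat.inputdir", "XXX.zip")

// One (entry name, entry bytes) pair per file inside the archive
val zipRDD = spark.sparkContext.newAPIHadoopRDD(
  conf,
  classOf[ZipFileInputFormat],
  classOf[Text],
  classOf[BytesWritable])

// Decode each entry's bytes into text lines
val lines = zipRDD.flatMap { case (_, bytes) =>
  new String(bytes.copyBytes(), "UTF-8").split("\n")
}
lines.take(10).foreach(println)

Compared with Method 1, this route lets Hadoop's InputFormat machinery handle file listing and task splitting, at the cost of pulling in an extra dependency.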