Prediction (1): Data Collection

All the data are in JSON format in S3 buckets.

We can verify and view the JSON data with this online tool:
http://www.jsoneditoronline.org/

I did the implementation in Zeppelin, which is a really useful tool.

Some of the important code is as follows:
val date_pattern = "2015/08/{17,18,19,20}" //week1
//val date_pattern = "2015/08/{03,04,05,06,07,08,09}" //week2
//val date_pattern = "2015/{07/27,07/28,07/29,07/30,07/31,08/01,08/02}"
//val date_pattern = "2015/07/29"

val clicks = sqlContext.jsonFile(s"s3n://mybucket/click/${date_pattern}/*/*")

This code expands the date pattern and loads all of the matching files into a DataFrame.
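
As a quick sanity check after loading (an illustrative follow-up, not part of the original notebook), we can count the records and peek at a few rows:

// count how many click records matched the pattern and look at a sample
println(clicks.count())
clicks.show(5)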

clicks.registerTempTable("clicks")
clicks.printSchema

This registers the DataFrame as a temporary table and prints the schema that Spark inferred from the JSON data.
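
A quick way to confirm the registration worked (illustrative; tableNames lists the temporary tables known to this SQLContext):

// list the temporary tables registered in this SQLContext
sqlContext.tableNames().foreach(println)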

val jobs = sc.textFile("s3n://mybucket/jobs/publishers/xxx.xml.gz")
import sqlContext.implicits._
val jobsDF = jobs.toDF()

This loads the gzip-compressed text files (Hadoop decompresses .gz files transparently) and converts the resulting RDD into a DataFrame.
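
If we want a named column instead of the default one, a minimal sketch is to map each line into a case class before calling toDF (JobLine and its line field are illustrative names, not from the original data):

// wrap each raw line in a case class so the DataFrame gets a named column
case class JobLine(line: String)
val jobsNamedDF = jobs.map(JobLine(_)).toDF()
jobsNamedDF.printSchema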

%sql
select SUBSTR(timestamp,0,10), job_id, count(*) from applications group by SUBSTR(timestamp,0,10), job_id

%sql gives us the ability to write SQL queries and display the results below as a chart.
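
For example, a similar paragraph against the clicks table registered above (an illustrative query, not from the original post) would be:

%sql
select SUBSTR(timestamp,0,10) as click_date, count(*) as clicks from clicks group by SUBSTR(timestamp,0,10) order by click_date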

val clickDF = sqlContext.sql("select SUBSTR(timestamp,0,10) as click_date, job_id, count(*) as count from clicks where SUBSTR(timestamp,0,10)='2015-08-20' group by SUBSTR(timestamp,0,10), job_id")

import org.apache.spark.sql.functions._

val clickFormattedDF = clickDF.orderBy(asc("click_date"),desc("count"))

These commands run the query and sort the resulting DataFrame for us.
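
To inspect the sorted result directly in the notebook (an illustrative check, not part of the original code):

// print the top rows of the sorted DataFrame
clickFormattedDF.show(10)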

val appFile = "s3n://mybucket/date_2015_08_20"
clickFormattedDF.printSchema
sc.parallelize(clickFormattedDF.collect, 1).saveAsTextFile(appFile)

This collects the results to the driver and writes them back to S3 as a single text file.
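
If the result were too large to collect on the driver, a sketch of an alternative is to write the DataFrame's underlying RDD directly (the "_alt" suffix is just a hypothetical path to avoid overwriting the original output; note that coalesce does not strictly guarantee the sort order across partitions):

// write the rows out without collecting them to the driver first
clickFormattedDF.rdd.coalesce(1).saveAsTextFile(appFile + "_alt")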

Here is the URL to check the Hadoop cluster:
http://localhost:9026/cluster

And once the Spark context is started, we can visit this URL to check the status of the Spark jobs:
http://localhost:4040/

References:
http://www.jsoneditoronline.org/
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
https://gist.github.com/bigsnarfdude/d9c0ceba1aa8c1cfa4e5
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.sql.DataFrame