Prediction (1): Data Collection

All the data are in JSON format in S3 buckets.

We can verify and view the JSON data with this online tool:
http://www.jsoneditoronline.org/

I did the implementation in Zeppelin, which is a really useful tool.

Some of the important code is as follows:
val date_pattern = "2015/08/{17,18,19,20}" //week1
//val date_pattern = "2015/08/{03,04,05,06,07,08,09}" //week2
//val date_pattern = "2015/{07/27,07/28,07/29,07/30,07/31,08/01,08/02}"
//val date_pattern = "2015/07/29"

val clicks = sqlContext.jsonFile(s"s3n://mybucket/click/${date_pattern}/*/*")

This code expands the date pattern and loads all of the matching files into a DataFrame.
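
As a quick sanity check after loading (an illustrative follow-up, not part of the original notebook), we can count the records and peek at a few rows:

// count how many click records matched the pattern and look at a sample
println(clicks.count())
clicks.show(5)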

clicks.registerTempTable("clicks")
clicks.printSchema

This registers the DataFrame as a temporary table and prints the schema that Spark inferred from the JSON data.
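
A quick way to confirm the registration worked (illustrative; tableNames lists the temporary tables known to this SQLContext):

// list the temporary tables registered in this SQLContext
sqlContext.tableNames().foreach(println)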

val jobs = sc.textFile("s3n://mybucket/jobs/publishers/xxx.xml.gz")
import sqlContext.implicits._
val jobsDF = jobs.toDF()

This loads the gzip-compressed text files (Hadoop decompresses .gz files transparently) and converts the resulting RDD into a DataFrame.
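
If we want a named column instead of the default one, a minimal sketch is to map each line into a case class before calling toDF (JobLine and its line field are illustrative names, not from the original data):

// wrap each raw line in a case class so the DataFrame gets a named column
case class JobLine(line: String)
val jobsNamedDF = jobs.map(JobLine(_)).toDF()
jobsNamedDF.printSchema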

%sql
select SUBSTR(timestamp,0,10), job_id, count(*) from applications group by SUBSTR(timestamp,0,10), job_id

%sql gives us the ability to write SQL queries and display the results below as a chart.
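
For example, a similar paragraph against the clicks table registered above (an illustrative query, not from the original post) would be:

%sql
select SUBSTR(timestamp,0,10) as click_date, count(*) as clicks from clicks group by SUBSTR(timestamp,0,10) order by click_date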

val clickDF = sqlContext.sql("select SUBSTR(timestamp,0,10) as click_date, job_id, count(*) as count from clicks where SUBSTR(timestamp,0,10)='2015-08-20' group by SUBSTR(timestamp,0,10), job_id")

import org.apache.spark.sql.functions._

val clickFormattedDF = clickDF.orderBy(asc("click_date"),desc("count"))

These commands run the query and sort the resulting DataFrame for us.
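
To inspect the sorted result directly in the notebook (an illustrative check, not part of the original code):

// print the top rows of the sorted DataFrame
clickFormattedDF.show(10)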

val appFile = "s3n://mybucket/date_2015_08_20"
clickFormattedDF.printSchema
sc.parallelize(clickFormattedDF.collect, 1).saveAsTextFile(appFile)

This collects the results to the driver and writes them back to S3 as a single text file.
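
If the result were too large to collect on the driver, a sketch of an alternative is to write the DataFrame's underlying RDD directly (the "_alt" suffix is just a hypothetical path to avoid overwriting the original output; note that coalesce does not strictly guarantee the sort order across partitions):

// write the rows out without collecting them to the driver first
clickFormattedDF.rdd.coalesce(1).saveAsTextFile(appFile + "_alt")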

Here is the URL to check the Hadoop cluster:
http://localhost:9026/cluster

And once the Spark context is started, we can visit this URL to check the status of the Spark jobs:
http://localhost:4040/

References:
http://www.jsoneditoronline.org/
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
https://gist.github.com/bigsnarfdude/d9c0ceba1aa8c1cfa4e5
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.sql.DataFrame