I recently ran into a problem: a Spark Streaming job uses the updateStateByKey operator to keep per-day state, and that state has to be cleared at midnight. I came up with three approaches:
1. Restart the program at midnight; after the restart, the state Spark held in memory is gone. The restart script:
#!/bin/bash
# Count the running SparkSubmit processes for this job
Num=`ps aux | grep SparkSubmit | grep xxxxxx | wc -l`
if [ $Num -eq 1 ]; then
    # Kill the running driver
    PID=`ps aux | grep SparkSubmit | grep xxxxxxxx | awk '{print $2}'`
    /usr/bin/kill -9 $PID
    sleep 5
    # Resubmit the job once the old process is gone
    Num=`ps aux | grep SparkSubmit | grep xxxxxxxxxx | wc -l`
    if [ $Num -eq 0 ]; then
        nohup /zywa/runthreat/test/job-spark-thread.sh > /zywa/runthreat/test/job-spark-thread.log &
    fi
fi
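To actually fire this restart at midnight, one simple option is a cron entry; a minimal sketch, assuming the script above is saved at a path of your choosing (the path and log file below are placeholders):

    0 0 * * * /path/to/restart-spark-job.sh >> /path/to/restart.log 2>&1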
2. Set up a scheduled task inside Spark itself and periodically clear the in-memory state. This raises two problems: how to schedule a task inside Spark, and how to clear specific data from memory. It is fairly hard to do and I have not implemented it yet; a rough sketch of one possible direction follows.
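For reference, one direction that stays entirely inside updateStateByKey, without an external timer, is to tag each key's state with the day it belongs to and let the update function drop the key once the day changes (returning None removes a key's state). This is an untested sketch, not something I have implemented; DayCount and updateFunc are hypothetical names:

import java.time.LocalDate

// Hypothetical state type: the running count tagged with the day it belongs to
case class DayCount(day: String, count: Int)

// Update function for updateStateByKey; returning None removes the key's state
def updateFunc(newValues: Seq[Int], state: Option[DayCount]): Option[DayCount] = {
  val today   = LocalDate.now.toString
  val carried = state.filter(_.day == today).map(_.count).getOrElse(0) // discard counts from past days
  if (newValues.isEmpty && !state.exists(_.day == today))
    None                                                               // stale key from a previous day: drop it
  else
    Some(DayCount(today, carried + newValues.sum))
}

// dStream.updateStateByKey(updateFunc _)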
3. Keep the state in external storage and use a scheduled task to clear that external data. I keep the state in Redis and clear the Redis data periodically, using the Quartz framework as the scheduler.
Quartz code:
// Register several scheduled jobs
public QuartzClient() {
    try {
        QuartzManager.addJob("job1", "xxxx", "0 0 0 * * ?");     // fires at midnight
        QuartzManager.addJob("job2", "xxxx", "0 0 0 * * ?");     // fires at midnight
        QuartzManager.addJob("job3", "xxxx", "*/30 * * * * ?");  // fires every 30 seconds
    } catch (Exception ex) {
        ex.printStackTrace();  // at least log scheduling failures instead of swallowing them
    }
}
/**
 * Add a scheduled job using the default job group, trigger name, and trigger group.
 *
 * @param jobName  job name
 * @param jobClass fully qualified class name of the job
 * @param time     cron expression; see the Quartz documentation
 */
public static void addJob(String jobName, String jobClass, String time) {
    try {
        Scheduler sched = gSchedulerFactory.getScheduler();
        // Job name, job group, job class
        JobDetail jobDetail = new JobDetail(jobName, JOB_GROUP_NAME, Class.forName(jobClass));
        // Trigger name, trigger group
        CronTrigger trigger = new CronTrigger(jobName, TRIGGER_GROUP_NAME);
        trigger.setCronExpression(time);  // set the firing schedule
        sched.scheduleJob(jobDetail, trigger);
        // Start the scheduler if it is not already running
        if (!sched.isShutdown()) {
            sched.start();
        }
    } catch (Exception e) {
        e.printStackTrace();
        throw new RuntimeException(e);
    }
}
public class ClearCountOperator implements Job {

    @Override
    public void execute(JobExecutionContext jobExecutionContext) throws JobExecutionException {
        clearCountData();
    }

    // Clear the Redis wordCount_appKey data
    private void clearCountData() {
        String key = Resources.redisPro.getProperty("count_status");
        Map<String, String> resultMap = BaseRedisConnect.hgetAll(key);
        Set<String> keys = resultMap.keySet();
        System.out.println("set size : " + keys.size());
        for (String k : keys) {
            BaseRedisConnect.hdel(key, k);
        }
        StartAction2.logger().error("------------------------redis WordCount_appKey data cleared------------------------");
    }
}

Spark Streaming code:
dStream.foreachRDD(rdd => {
  // set up the timed clearing job
  quoart()
})
dStream.foreachRDD { rdd =>
  // Fetch the cached counts from Redis and broadcast them (getMap / setMap are sketched further below)
  val countMap = getMap
  if (appKeyBroadCast != null) {
    appKeyBroadCast.unpersist()
  }
  appKeyBroadCast = sc.broadcast(countMap)

  val analyzeRdd = rdd.map(line => {
    // parse the line and derive the key
    (countService.analyzMess(line, countType), 1)
  }).filter(_._1 != "").filter(_._1 != null).reduceByKey(_ + _)

  // Merge the new batch with the historical counts
  val statusRDD = updateStateByBroadcastValue(analyzeRdd, appKeyBroadCast)

  // Write the merged counts back to Redis
  val statusMap = statusRDD.collect() //.toMap.asInstanceOf[mutable.Map[String,String]]
  var statusMap2: scala.collection.mutable.Map[String, String] = mutable.Map[String, String]()
  statusMap.foreach(x => statusMap2.put(x._1.toString, x._2.toString))
  import scala.collection.JavaConversions._
  setMap(statusMap2)
}
// Add the historical counts read from Redis (via the broadcast) to the new batch
def updateStateByBroadcastValue(ds: RDD[(String, Int)],
                                broadcast: Broadcast[mutable.HashMap[String, Int]]): RDD[(String, Int)] = {
  ds.mapPartitions { iter =>
    for (it <- iter) yield (it._1, it._2 + broadcast.value.getOrElse(it._1, 0))
  }
}
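The getMap and setMap helpers used above are not shown here; roughly, they read and write the count hash in Redis. A simplified sketch of what they could look like on top of Jedis hash operations follows. The key name and direct connection are assumptions for illustration; the real code goes through the BaseRedisConnect connection pool and reads the key from a properties file:

import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._
import scala.collection.mutable

val countKey = "wordCount_appKey"            // assumed hash key

// Load the whole Redis hash into the map that gets broadcast
def getMap: mutable.HashMap[String, Int] = {
  val jedis = new Jedis("localhost", 6379)   // assumed address; the real code borrows from a pool
  try {
    val result = mutable.HashMap[String, Int]()
    jedis.hgetAll(countKey).asScala.foreach { case (k, v) => result.put(k, v.toInt) }
    result
  } finally jedis.close()
}

// Write the merged counts back as hash fields
def setMap(status: mutable.Map[String, String]): Unit = {
  val jedis = new Jedis("localhost", 6379)
  try {
    status.foreach { case (k, v) => jedis.hset(countKey, k, v) }
  } finally jedis.close()
}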
Bugs encountered:
1. Redis connections were not closed after use, which eventually exhausted the available connections. The corrected delete helper:
/**
 * Delete a hash field, making sure the delete actually runs,
 * and return the Jedis connection to the pool afterwards.
 * @param key the Redis hash key
 * @param k   the hash field to delete
 */
public static void hdel(String key, String k) {
    Jedis jedis = null;
    try {
        jedis = getJedis();
        // Keep retrying until a connection is obtained from the pool
        while (true) {
            if (null != jedis) {
                break;
            } else {
                jedis = getJedis();
            }
        }
        jedis.hdel(key, k);
        returnResource(jedis);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (jedis != null) {
            jedis.close();
        }
    }
}
2. Serialization problems caused by the Spark broadcast variable; a generic note on the usual cause follows.
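The usual trigger for this kind of error is a closure that captures something non-serializable (the Jedis client, the enclosing class, or the SparkContext) rather than the broadcast value itself. As a hedged illustration only, and not necessarily the fix applied here, the pattern below broadcasts plain data and creates connections on the executors:

import redis.clients.jedis.Jedis

// Broadcast only plain, serializable data (here the count map), never a live connection or client
val countBroadcast = sc.broadcast(getMap)

analyzeRdd.foreachPartition { iter =>
  // The Jedis client is created inside the closure, on the executor, so it never gets serialized
  val jedis = new Jedis("localhost", 6379)  // assumed address
  try {
    iter.foreach { case (k, v) =>
      val merged = v + countBroadcast.value.getOrElse(k, 0)  // merge with the broadcast history
      jedis.hset("wordCount_appKey", k, merged.toString)
    }
  } finally jedis.close()
}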
Since the state is kept in Redis, checkpointing is no longer really necessary, so I removed it. The checkpoint mechanism in this version of Spark has significant drawbacks; see this post for details:
http://blog.youkuaiyun.com/u010454030/article/details/54985740