From three years of connected-vehicle (IoV) operating data, compute for each vehicle the start time, end time, number of operating days, and distance difference involved in reaching 20,000 km.
1. Group the records by license plate (VIN).
2. Within each group sort by time, work out when and at what mileage the vehicle reaches 20,000 km, and take the first record after sorting that crosses that mark.
3. Output the intermediate results.
4. Write the corresponding Spark code with the MapReduce version as the reference. A minimal sketch of the whole pipeline follows; the individual Scala concepts are then analyzed below:
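A minimal end-to-end sketch of the computation in spark-shell style. This is only a sketch, assuming each input line looks like the sample data near the end of this note (collection time, VIN, odometer, ...), that every odometer field parses as a number, and that the odometer grows with time; the more defensive versions are developed step by step below.

import java.text.SimpleDateFormat

val fmt = "yyyy-MM-dd HH:mm:ss"
val rdd = sc.textFile("/root/temp/zhaoyong.txt")
val result = rdd.map(_.split(","))
  .map(x => (x(1), (x(0), x(2).toDouble)))                            // (vin, (time, odometer))
  .groupByKey
  .map { case (vin, records) =>
    val sorted = records.toArray.sortBy(_._1)                         // timestamps in this format sort chronologically as strings
    val start  = sorted.head                                          // earliest record
    val end    = sorted.find(_._2 >= 20000.0).getOrElse(sorted.last)  // first record at or past 20000 km
    val ms     = new SimpleDateFormat(fmt).parse(end._1).getTime -
                 new SimpleDateFormat(fmt).parse(start._1).getTime
    (vin, start._1, end._1, ms / 1000 / 3600 / 24, end._2 - start._2) // start, end, days, km difference
  }
result.collect.foreach(println)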
scala> val s="12,8888,999,pppp,lllll"
s: String = 12,8888,999,pppp,lllll
scala> val l = s.split(",")
l: Array[String] = Array(12, 8888, 999, pppp, lllll)
scala> l(1)
res0: String = 8888
scala> l(0)
res1: String = 12
scala> val hh=l(3)+l(1)
hh: String = pppp8888
Scala practice:
val rdd = sc.textFile("/root/temp/zhaoyong.txt")
val rdd111 = rdd.map(_.split(",")).map(x=> (x(1),x(0))).groupByKey
scala> val rdd221 = rdd111.map(line =>{
| val top = line._2.toList
| val first = top.sorted.reverse.take(1)
| val first1 = first(0)
| val last = top.sorted.take(1)
| val last1 = last(0)
| (line._1,first1+";"+last1)
| })
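The same two values can also be taken directly with max and min; a small sketch against the same rdd111 (the name rdd222 is just for illustration):

val rdd222 = rdd111.map(line => {
  val values = line._2.toList
  (line._1, values.max + ";" + values.min)   // largest and smallest value per key, same result as above
})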
Extension 1
val rdd441 = rdd.map(_.split(",")).map(x=> (x(1),(x(0),x(2)))).groupByKey
scala> val rdd442 = rdd441.map(line => {
     |   // line._2 is an Iterable[(String, String)]: "value sortBy is not a member of Iterable[(String, String)]",
     |   // so convert it to an Array first. You can sort by _._2 (the second field) or by _._1 (the time);
     |   // both are Strings. Without Ordering[String].reverse the sort is ascending.
     |   val top = line._2.toArray.sortBy(_._2)(Ordering[String].reverse)
     |   val top1 = top.take(1)
     |   val value = top1.toArray   // unnecessary: top1 is already an Array
     |   val v = value(0)
     |   (line._1, v)
     |   })
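A standalone illustration of the point made in the comment above: sortBy is defined on Array/List but not on the Iterable that groupByKey produces, and Ordering[String].reverse flips the order. A sketch only; the sample values are made up:

val recs: Iterable[(String, String)] = List(("2018-01-01", "2.0"), ("2019-01-01", "21721.0"))
// recs.sortBy(_._1)                                           // does not compile: sortBy is not a member of Iterable
val asc  = recs.toArray.sortBy(_._1)                           // ascending by the first field (time)
val desc = recs.toArray.sortBy(_._1)(Ordering[String].reverse) // descending; desc(0) is the latest record
// the same works with _._2 to sort by the second field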
val rdd000 = sc.textFile("/root/temp/zhaoyong.txt")
val rdd001 = rdd000.map(_.split(",")).map(x=> (x(1),(x(0),x(2)))).groupByKey
val rdd002=rdd001.map(line => {
| val top = line._2.toArray.sortBy(_._1)(Ordering[String].reverse)
| (line._1,top)
| })
Array[(String, Array[(String, String)])] = Array((LLV1CPA14F0031952,Array((2019-01-01 23:59:39.322,21721.0), (2018-01-01 23:59:39.322,2.0))), (LLV1CPA1XF0032426,Array((2019-01-01 23:59:49.649,28056.0), (2017-01-01 23:59:49.649,3.0))), (LLV1CRB17H0000272,Array((2019-01-01 23:59:55.871,22487.0), (2018-01-01 23:59:55.871,1.0))))
In the code above, (x(0), x(2)) is itself a pair; groupByKey then wraps those pairs in another collection, so after toArray each line._1 ends up paired with a whole Array of (time, odometer) tuples, and that pair is one element of the outer Array. That is the two-level structure shown above.
To pick out only the record with the largest value from that inner array, however:
val rdd002=rdd001.map(line => {
| val top = line._2.toArray.sortBy(_._1)(Ordering[String].reverse)
| val top1 = top.take(1)
| val v = top1(0)
| (line._1,v)
| })
Array[(String, (String, String))] = Array((LLV1CPA14F0031952,(2019-01-01 23:59:39.322,21721.0)), (LLV1CPA1XF0032426,(2019-01-01 23:59:49.649,28056.0)), (LLV1CRB17H0000272,(2019-01-01 23:59:55.871,22487.0)))
The structure has clearly changed: the Array wrapper around each value is gone (one fewer pair of parentheses), and the resulting v is now a single (String, String) tuple rather than an Array.
By contrast, if we stop at take(1), the outer Array structure is unchanged and it simply holds one element; the concrete record inside still has to be pulled out by index, e.g. top1(0).
val rdd002=rdd001.map(line => {
| val top = line._2.toArray.sortBy(_._1)(Ordering[String].reverse)
| val top1 = top.take(1)
| (line._1,top1)
| })
Array[(String, Array[(String, String)])] = Array((LLV1CPA14F0031952,Array((2019-01-01 23:59:39.322,21721.0))), (LLV1CPA1XF0032426,Array((2019-01-01 23:59:49.649,28056.0))), (LLV1CRB17H0000272,Array((2019-01-01 23:59:55.871,22487.0))))
Two steps above we saw Array[(String, (String, String))] = Array((LLV1CPA14F0031952,(2019-01-01 23:59:39.322,21721.0)), (LLV1CPA1XF0032426,(2019-01-01 23:59:49.649,28056.0)), (LLV1CRB17H0000272,(2019-01-01 23:59:55.871,22487.0))), where each value is a (String, String) tuple. Calling v.toArray fails with "value toArray is not a member of (String, String)", but v.toString.split(",") does turn it into an array. The code and its result:
val rdd002 =rdd001.map(line => {
| val top = line._2.toArray.sortBy(_._1)(Ordering[String].reverse)
| val top1 = top.take(1)
| val v = top1(0)
| val arr = v.toString.split(",")
| (line._1,arr(0))
| })
rdd002: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[23] at map at <console>:28
scala> rdd002.collect
res21: Array[(String, String)] = Array((LLV1CPA14F0031952,(2019-01-01 23:59:39.322), (LLV1CPA1XF0032426,(2019-01-01 23:59:49.649), (LLV1CRB17H0000272,(2019-01-01 23:59:55.871))
Writing (line._1, arr(0).toString) gives exactly the same result as (line._1, arr(0)), since arr(0) is already a String:
Array((LLV1CPA14F0031952,(2019-01-01 23:59:39.322),
(LLV1CPA1XF0032426,(2019-01-01 23:59:49.649),
(LLV1CRB17H0000272,(2019-01-01 23:59:55.871)
)
The parentheses in this output are unbalanced and cannot be paired up; they can be filtered out later in Excel. rdd002.saveAsTextFile("/data/zhaoyong/temp/rdd002") writes exactly what is shown above (what you see is what you get).
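Since each v here is a (String, String) tuple, a simpler way that leaves no stray parenthesis is to read its fields with ._1/._2 instead of going through toString.split. A sketch against the same rdd001 (the name rdd003 is just for illustration):

val rdd003 = rdd001.map(line => {
  val top = line._2.toArray.sortBy(_._1)(Ordering[String].reverse)
  val v = top(0)            // v is the (time, odometer) tuple of the latest record
  (line._1, v._1)           // take the time field directly; no "(" is left over
})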
Extension 2
val rdd = sc.textFile("/root/temp/zhaoyong.txt")
val rdd551 = rdd.map(_.split(",")).map(x=> (x(1),(x(0),x(2)))).groupByKey
val rdd552 = rdd551.map(line => {
  val top = line._2.toArray
  var b = 0.0                                // largest odometer value seen so far
  var sss = Array("a", "b")                  // placeholder for the winning (time, odometer) record
  for (s <- top) {
    val arr = s.toString.split(",")          // s is a tuple, so arr(0) starts with "(" and arr(1) ends with ")"
    val va = arr(1)
    val va1 = va.replace(")", "").toDouble   // strip the trailing ")" before converting
    if (va1 > b) {                           // keep the record with the largest odometer reading
      b = va1
      sss = arr
    }
  }
  val result = sss(0) + " ," + sss(1)
  (line._1, result)
})
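For reference, the record with the largest odometer value can also be picked with maxBy instead of a hand-written loop. A sketch against the same rdd551, assuming every odometer field parses as a number (the name rdd553 is just for illustration):

val rdd553 = rdd551.map(line => {
  val (time, odo) = line._2.maxBy(_._2.toDouble)   // record with the largest odometer reading
  (line._1, time + " ," + odo)                     // same shape of result, without the stray parentheses
})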
Extension 3
package com.me.exe

import org.apache.spark.{SparkConf, SparkContext}

object fenxi3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("fenxi").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("/root/temp/zhaoyong.txt")
    // (vin, "time,odometer") grouped by vin
    val rdd111 = rdd.map(_.split(",")).map(x => (x(1), x(0) + "," + x(2))).groupByKey
    val rdd552 = rdd111.map(line => {
      val top = line._2.toArray
      var b = 0.0                       // odometer value of the record currently taken as the end point
      var kaishi = "2100"               // earliest timestamp seen so far ("2100" sorts after any real date)
      var start = Array("", "")         // record with the earliest timestamp
      var sss = Array("", "")           // record whose odometer exceeds 20000
      for (s <- top) {
        val arr = s.toString.split(",")
        val va = arr(1)
        if (arr(0) < kaishi) {          // keep the earliest record as the start
          kaishi = arr(0)
          start = arr
        }
        val va1 = va.replace(")", "").toDouble
        if (va1 > 20000) {              // keep a record that has passed 20000 km as the end
          b = va1
          sss = arr
        }
      }
      val result = start(0) + "," + start(1) + ";" + sss(0) + " ," + sss(1)
      (line._1, result)
    })
    rdd552.saveAsTextFile("/root/temp/fenxi9")
    sc.stop()
  }
}
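The loop above keeps the earliest record as the start and some record past 20000 km as the end. The intended selection (earliest record, plus the first record in time order past 20000 km) can also be sketched against the grouped rdd111 inside fenxi3 with a sort and a find, assuming every odometer field is numeric (the name rddAlt is just for illustration):

val rddAlt = rdd111.map(line => {
  val recs  = line._2.toArray.sortBy(_.split(",")(0))          // sort the "time,odometer" strings by time
  val start = recs.head                                        // earliest record
  val end   = recs.find(_.split(",")(1).toDouble > 20000.0)    // first record past 20000 km
                  .getOrElse(recs.last)                        // fall back to the last record if none
  (line._1, start + ";" + end)
})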
Extension 4 (add time parsing and the day difference)
package com.me.exe

import org.apache.spark.{SparkConf, SparkContext}
import java.text.SimpleDateFormat
import java.util.Date

object fenxi4 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("fenxi").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("/root/temp/zhaoyong.txt")
    val rdd111 = rdd.map(_.split(",")).map(x => (x(1), x(0) + "," + x(2))).groupByKey
    val rdd552 = rdd111.map(line => {
      val top = line._2.toArray
      var b = 0.0
      var kaishi = "2100"               // earliest timestamp seen so far
      var start = Array("", "")         // earliest record
      var sss = Array("", "")           // record past 20000 km
      for (s <- top) {
        val arr = s.toString.split(",")
        val va = arr(1)
        if (arr(0) < kaishi) {
          kaishi = arr(0)
          start = arr
        }
        val va1 = va.replace(")", "").toDouble
        if (va1 > 20000) {
          b = va1
          sss = arr
        }
      }
      var result = start(0) + ";" + start(1) + "\t" + sss(0) + ";" + sss(1)
      if (start(0) != "" && sss(0) != "") {
        // parse both timestamps and append the whole-day difference
        val newtime: Long = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(start(0)).getTime
        val endtime: Long = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(sss(0)).getTime
        val tianshu = (endtime - newtime) / 1000 / 3600 / 24
        result = result + "\t" + tianshu
      }
      (line._1, result)
    })
    rdd552.saveAsTextFile("/root/temp/fenxi10")
    sc.stop()
  }
}
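The day calculation on its own: parse both timestamps with SimpleDateFormat, subtract the epoch milliseconds, and divide down to days. The pattern "yyyy-MM-dd HH:mm:ss" simply ignores the trailing ".322" milliseconds, which is why it works on these timestamps. A small sketch with made-up values:

import java.text.SimpleDateFormat

val fmt  = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val t1   = fmt.parse("2018-01-01 23:59:39.322").getTime   // epoch millis; the ".322" part is ignored
val t2   = fmt.parse("2019-01-01 23:59:39.322").getTime
val days = (t2 - t1) / 1000 / 3600 / 24                   // whole days between the two timestamps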
Extension 5 (add a validity check before the toDouble conversion)
Note: everything that saveAsTextFile writes keeps the tuple parentheses ( ), including the output of the word count program; they can be filtered out in Excel.
package com.me.exe

import org.apache.spark.{SparkConf, SparkContext}
import java.text.SimpleDateFormat
import java.util.Date

object fenxi4 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("fenxi").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("/root/temp/all5.csv")
    val rdd111 = rdd.map(_.split(",")).map(x => (x(1), x(0) + "," + x(2))).groupByKey
    val rdd552 = rdd111.map(line => {
      val top = line._2.toArray
      var b = 0.0                       // smallest odometer value above 20000 seen so far
      var kaishi = "2100"
      var start = Array("", "")         // earliest record
      var sss = Array("", "")           // record where the vehicle first passes 20000 km
      for (s <- top) {
        val arr = s.toString.split(",")
        val va = arr(1)
        if (arr(0) < kaishi) {
          kaishi = arr(0)
          start = arr
        }
        val temp = va.replace(")", "")
        var va1 = 0.0
        // guard against empty fields, the literal "null" and the header value "odo"
        if (temp != null && temp != "null" && temp != "" && temp != "odo") {
          va1 = temp.toDouble
        }
        if (va1 > 20000.0) {
          // keep the smallest reading above 20000, i.e. the record closest to the threshold
          if ((b > 20000.0 && va1 < b) || b == 0.0) {
            b = va1
            sss = arr
          }
        }
      }
      var result = start(0) + ";" + start(1) + "\t" + sss(0) + ";" + sss(1)
      if (start(0) != "" && sss(0) != "") {
        val newtime: Long = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(start(0)).getTime
        val endtime: Long = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(sss(0)).getTime
        val tianshu = (endtime - newtime) / 1000 / 3600 / 24
        result = result + "\t" + tianshu
      }
      (line._1, result)
    })
    rdd552.saveAsTextFile("/root/temp/fenxi11")
    sc.stop()
  }
}
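The temp != "null" / temp != "odo" checks can be replaced by a generic guard. A sketch using scala.util.Try, which turns anything that is not a valid number (the header value "odo", "null", empty fields) into 0.0; the helper name parseOdo is just for illustration:

import scala.util.Try

def parseOdo(s: String): Double =
  Try(s.replace(")", "").trim.toDouble).getOrElse(0.0)   // non-numeric input falls back to 0.0

parseOdo("21721.0)")   // 21721.0
parseOdo("odo")        // 0.0
parseOdo("")           // 0.0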
The parentheses can be stripped in the following way.
Obviously the saved results contain those annoying parentheses, so we need a way to get rid of them: map each pair to a plain string before saving, as in this word count example:
val inputData = sc.textFile(inputFile)
inputData.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_+_)
.map(line => {
val word = line._1
val cnt = line._2
word + "\t" + cnt
})
.saveAsTextFile(outputFile)
Sample input data (the code uses fields 0-2: timestamp, VIN, odometer reading):
2019-01-01 23:59:49.649,LLV1CPA1XF0032426,28056.0,2019-01-01 23:59:46
2019-01-01 23:59:55.871,LLV1CRB17H0000272,22487.0,2019-01-01 23:59:52
2019-01-01 23:59:39.322,LLV1CPA14F0031952,21721.0,2019-01-01 23:59:35
2017-01-01 23:59:49.649,LLV1CPA1XF0032426,3.0,2019-01-01 23:59:46
2018-01-01 23:59:55.871,LLV1CRB17H0000272,1.0,2019-01-01 23:59:52
2018-01-01 23:59:39.322,LLV1CPA14F0031952,2.0,2019-01-01 23:59:35
The final, complete Spark code:
import java.text.SimpleDateFormat
import java.util.Date

import org.apache.spark.{SparkConf, SparkContext}

object fenxi5 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("fenxi").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("/root/temp/all5.csv")
    // (vin, "time,odometer") grouped by vin
    val rdd111 = rdd.map(_.split(",")).map(x => (x(1), x(0) + "," + x(2))).groupByKey
    val rdd552 = rdd111.map(line => {
      val top = line._2.toArray
      var b = 0.0                       // smallest odometer reading above 20000 seen so far
      var kaishi = "2100"               // earliest timestamp seen so far
      var start = Array("", "")         // earliest record
      var sss = Array("", "")           // record where the vehicle first passes 20000 km
      for (s <- top) {
        val arr = s.toString.split(",")
        val va = arr(1)
        if (arr(0) < kaishi) {
          kaishi = arr(0)
          start = arr
        }
        val temp = va.replace(")", "")
        var va1 = 0.0
        // guard against empty fields, the literal "null" and the header value "odo"
        if (temp != null && temp != "null" && temp != "" && temp != "odo") {
          va1 = temp.toDouble
        }
        // keep the smallest reading above 20000 (the first crossing), as in extension 5
        if (va1 > 20000.0) {
          if ((b > 20000.0 && va1 < b) || b == 0.0) {
            b = va1
            sss = arr
          }
        }
      }
      var result = start(0) + ";" + start(1) + "\t" + sss(0) + ";" + sss(1)
      if (start(0) != "" && sss(0) != "") {
        val newtime: Long = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(start(0)).getTime
        val endtime: Long = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(sss(0)).getTime
        val tianshu = (endtime - newtime) / 1000 / 3600 / 24   // operating days
        result = result + "\t" + tianshu
      }
      (line._1, result)
    })
    rdd552.saveAsTextFile("/root/temp/fenxi11")
    sc.stop()
  }
}
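To apply the parenthesis-removal trick from above to this program, format each pair as a single string before saving; a sketch, with an example output path that is not used elsewhere in this note:

// write "vin<TAB>result" instead of the (vin, result) tuple, so the text file contains no parentheses
rdd552.map(line => line._1 + "\t" + line._2)
      .saveAsTextFile("/root/temp/fenxi12")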
The corresponding MapReduce code is shown below:
package demo.mr.batterys.iov;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MultiTableQueryMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key1, Text value1, Context context)
            throws IOException, InterruptedException {
        String data = value1.toString();
        if (data == null || "".equals(data)) {
            return;
        }
        String[] words = data.split(",");
        // expect 4 fields (time, vin, odometer, ...); skip malformed lines and empty VINs
        if (words.length != 4 || "".equals(words[1])) {
            return;
        }
        // k2 = vin, v2 = "time,odometer"
        String result = words[0] + "," + words[2];
        context.write(new Text(words[1]), new Text(result));
    }
}
package demo.mr.batterys.iov;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MultiTableQueryReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text k3, Iterable<Text> v3, Context context)
            throws IOException, InterruptedException {
        // TreeMap keyed by the timestamp string, so iteration is in chronological order
        TreeMap<String, Double> h = new TreeMap<String, Double>();
        for (Text v : v3) {
            String str = "" + v;
            String[] arr = str.split(",");
            double d = 0;
            try {
                d = Double.parseDouble(arr[1]);
                h.put(arr[0], d);
            } catch (NumberFormatException e) {
                e.printStackTrace();
                d = 0;
            }
        }

        int i = 0;
        String kaishi = "";                            // start time
        String jieshu = "";                            // end time
        double kaishilc = 0, jieshulc = 0, lccha = 0;  // start/end odometer readings and their difference
        int isfindBig = 0;
        int isfindSmall = 0;
        for (Map.Entry<String, Double> en : h.entrySet()) {
            i++;
            // first record with a positive odometer reading -> start point
            if (isfindSmall == 0 && en.getValue() > 0) {
                kaishi = en.getKey().trim();
                kaishilc = en.getValue();
                isfindSmall = 1;
            }
            // first record past 20000 km -> end point; fall back to the last record if none qualifies
            if ((isfindBig == 0 && en.getValue() > 20000) || (i == h.size() && isfindBig == 0)) {
                jieshu = en.getKey().trim();
                jieshulc = en.getValue();
                isfindBig = 1;
                break;
            }
        }

        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        long ts1 = 0;
        try {
            Date date1 = simpleDateFormat.parse(kaishi);
            ts1 = date1.getTime();
        } catch (ParseException e) {
            e.printStackTrace();
        }
        long ts2 = 0;
        try {
            Date date2 = simpleDateFormat.parse(jieshu);
            ts2 = date2.getTime();
        } catch (ParseException e) {
            e.printStackTrace();
        }
        long cha = (ts2 - ts1) / 1000 / 60 / 60 / 24;   // operating days
        lccha = jieshulc - kaishilc;                    // distance difference
        String result = kaishi + "\t" + jieshu + "\t" + kaishilc + "\t" + jieshulc + "\t" + lccha + "\t" + cha;
        // output: vin -> start time, end time, start km, end km, km difference, days
        context.write(k3, new Text(result));
    }
}
package demo.mr.batterys.iov;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiTableQueryMain {

    public static void main(String[] args) throws Exception {
        // create a job: job = map + reduce
        Job job = Job.getInstance(new Configuration());
        // entry point of the job
        job.setJarByClass(MultiTableQueryMain.class);

        // Mapper and its output types: k2, v2
        job.setMapperClass(MultiTableQueryMapper.class);
        job.setMapOutputKeyClass(Text.class);    // k2
        job.setMapOutputValueClass(Text.class);  // v2

        // Reducer and its output types: k4, v4
        job.setReducerClass(MultiTableQueryReducer.class);
        job.setOutputKeyClass(Text.class);       // k4
        job.setOutputValueClass(Text.class);     // v4

        // input path (map) and output path (reduce)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // run the job
        job.waitForCompletion(true);
    }
}