1. Background
In the previous article (大数据数据挖掘系统可视化设计艺术-优快云博客) we covered building the visual data-mining system, in which users complete mining tasks by composing operators into a workflow. Such a workflow can be submitted either to a local engine or to a big-data cluster for computation; this article walks through building the local execution path in detail.
2. The Core Algorithm: Topological Sort
2.1 Algorithm Description
Topological sorting orders the vertices of a directed acyclic graph (DAG); it is commonly used for task scheduling and dependency resolution.
The core logic (Kahn's algorithm) is as follows:
1. Initialize an array inDegree that records each vertex's in-degree (the number of other vertices pointing to it).
2. Traverse all vertices of the graph, count each vertex's in-degree, and store it in inDegree.
3. Initialize a queue and enqueue every vertex whose in-degree is 0.
4. Initialize an empty array result to hold the sorted vertices.
5. While the queue is not empty:
- Dequeue the front vertex v and append it to result.
- For each vertex u adjacent to v:
- Decrement u's in-degree by 1.
- If u's in-degree becomes 0, enqueue u.
6. When the queue is empty: if the length of result equals the number of vertices in the graph, the sort succeeded, so return result; otherwise the graph contains a cycle and topological sorting fails.
The time complexity of topological sort is O(V+E), where V is the number of vertices and E the number of edges.
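The steps above can be sketched directly in Java. This is a minimal, self-contained version of Kahn's algorithm; the graph here is a plain adjacency map keyed by vertex name rather than the Operator structures used later in this article:

```java
import java.util.*;

public class TopoSort {
    // Kahn's algorithm: returns a topological order,
    // or an empty list if the graph contains a cycle.
    static List<String> sort(Map<String, List<String>> adj) {
        // steps 1-2: compute every vertex's in-degree
        Map<String, Integer> inDegree = new HashMap<>();
        for (String v : adj.keySet()) inDegree.putIfAbsent(v, 0);
        for (List<String> outs : adj.values())
            for (String u : outs) inDegree.merge(u, 1, Integer::sum);

        // step 3: enqueue all vertices with in-degree 0
        Deque<String> queue = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet())
            if (e.getValue() == 0) queue.add(e.getKey());

        // steps 4-5: repeatedly remove a source vertex and relax its edges
        List<String> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            String v = queue.poll();
            result.add(v);
            for (String u : adj.getOrDefault(v, Collections.emptyList()))
                if (inDegree.merge(u, -1, Integer::sum) == 0) queue.add(u);
        }
        // step 6: fewer vertices than the graph means a cycle was found
        return result.size() == inDegree.size() ? result : Collections.emptyList();
    }
}
```

Each vertex is enqueued and dequeued at most once and each edge is relaxed once, which gives the O(V+E) bound stated above.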
2.2 Code Implementation
/**
 * @param links    the link (edge) definitions between workflow nodes
 * @param opMapper map from operator id to Operator
 * @param flag     when true, keep operators even if they appear in exceptionMapper
 * @return the ordered list of operator ids to execute
 */
private List<String> splitLinks(JSONArray links, HashMap<String, Operator> opMapper, boolean flag) {
    List<String> flow = new ArrayList<>();
    List<String> fullFlow = new ArrayList<>();
    HashMap<String, Integer> fromMap = new HashMap<>();
    HashMap<String, Integer> toMap = new HashMap<>();
    for (Object obj : links) {
        JSONObject link = (JSONObject) obj;
        JSONObject from = link.getJSONObject("from");
        JSONObject to = link.getJSONObject("to");
        String fromId = from.getString("uuid");
        String toId = to.getString("uuid");
        // record the edge on both endpoints
        opMapper.get(fromId).addTo(toId);
        opMapper.get(toId).addFrom(fromId);
        // bind the downstream input port to the upstream output port
        Ports p = opMapper.get(toId).getPort(to.getString("name"));
        if (p != null) {
            p.setTargetData(fromId, from.getString("name"));
        }
        // count each operator's out-degree and in-degree
        fromMap.merge(fromId, 1, Integer::sum);
        toMap.merge(toId, 1, Integer::sum);
    }
    // source operators: they appear as a "from" but never as a "to" (in-degree 0)
    for (Entry<String, Integer> entry : fromMap.entrySet()) {
        if (toMap.get(entry.getKey()) == null) {
            flow.add(entry.getKey());
        }
    }
    for (String uuid : flow) {
        this.recursiveFlow(opMapper, uuid, fullFlow, flag);
    }
    return fullFlow;
}
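For reference, the links array that splitLinks consumes presumably has the following shape (the from/to, uuid, and name field names are taken from the code above; the concrete values are hypothetical):

```json
[
  {
    "from": { "uuid": "op-001", "name": "output" },
    "to":   { "uuid": "op-002", "name": "input" }
  }
]
```

Each entry is one directed edge: the output port of one operator feeding the input port of another.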
/**
 * Recursively walk downstream ("to") operators from a source node,
 * scheduling each node's upstream dependencies first via recursiveFrom
 *
 * @param opMapper map from operator id to Operator
 * @param id       the current operator id
 * @param fullFlow the accumulated execution order
 * @param flag     when true, keep operators even if they appear in exceptionMapper
 */
private void recursiveFlow(HashMap<String, Operator> opMapper, String id, List<String> fullFlow, boolean flag) {
    if (!fullFlow.contains(id) &&
            (flag || exceptionMapper.get(opMapper.get(id).getName()) == null)) {
        fullFlow.add(id);
    }
    Set<String> next = opMapper.get(id).getTo();
    for (String n : next) {
        // make sure all of n's upstream operators are scheduled before n itself
        this.recursiveFrom(opMapper, n, fullFlow, flag);
        this.recursiveFlow(opMapper, n, fullFlow, flag);
    }
}
/**
 * Recursively add all upstream ("from") operators before the current one
 *
 * @param opMapper map from operator id to Operator
 * @param id       the current operator id
 * @param fullFlow the accumulated execution order
 * @param flag     when true, keep operators even if they appear in exceptionMapper
 */
private void recursiveFrom(HashMap<String, Operator> opMapper, String id, List<String> fullFlow, boolean flag) {
    Operator o = opMapper.get(id);
    for (String fromId : o.getFrom()) {
        if (!fullFlow.contains(fromId)) {
            this.recursiveFrom(opMapper, fromId, fullFlow, flag);
        }
    }
    if (!fullFlow.contains(id) && (flag || exceptionMapper.get(opMapper.get(id).getName()) == null)) {
        fullFlow.add(id);
    }
}
3. Operator Definition
trait Operator {
  def description(): String
  def parameterTypes(): Seq[ParameterType[_]]
  def inputPorts(): InputPorts
  def outputPorts(): OutPutPorts
}
4. The Local Engine's Execution Logic
private JSONObject executeOperator(HashMap<String, Operator> opMapper, List<String> workflow, SQLContext sqlContext, StringBuffer sb) {
    String text = "";
    JSONObject returnMessage = new JSONObject();
    JSONArray executeResult = new JSONArray();
    JSONArray workflowResult = new JSONArray();
    JSONObject flow = new JSONObject();
    String msg = "execution succeeded";
    try {
        for (String id : workflow) {
            long start = System.currentTimeMillis();
            flow = new JSONObject();
            flow.put("uuid", id);
            flow.put("status", "wait");
            flow.put("msg", "not executed");
            Operator o = opMapper.get(id);
            Class<?> op = this.getOperatorClass(o);
            Constructor<?> c = op.getDeclaredConstructor(OperatorMeta.class);
            c.setAccessible(true);
            // instantiate the operator class defined in kr-spark-operator
            Object operator = c.newInstance(new OperatorMeta(o.getName(), "false"));
            text = o.getName() + ":" + o.getUuid();
            sb.append(text).append(" operator execution started......\n");
            this.getInitParameter(o, op, operator);
            this.setInPortData(opMapper, o, op, operator);
            Method m = op.getMethod("doWork", Parameters.class, OperatorIOHandler.class, SQLContext.class, org.slf4j.Logger.class);
            m.invoke(operator, o.getParameters(), ioHandler, sqlContext, logger);
            this.setOutPortData(executeResult, o, op, operator);
            long end = System.currentTimeMillis();
            log.info("executed operator: " + text + String.format(", duration: %dms", end - start));
            sb.append("executed operator: ").append(text).append(String.format(", duration: %dms", end - start)).append("\n");
            flow.put("status", "complete");
            flow.put("msg", "completed");
            flow.put("start", start);
            flow.put("end", end);
            workflowResult.add(flow);
        }
    } catch (Exception ex) {
        // reflective calls wrap the real failure in InvocationTargetException, so unwrap it
        Throwable targetEx = ex;
        if (ex instanceof InvocationTargetException) {
            targetEx = ((InvocationTargetException) ex).getTargetException();
        }
        log.error(text + ", execution failed;", targetEx);
        sb.append("execution error: ").append(text).append(", message: ").append(targetEx.getMessage()).append("\n");
        // the log-download feature reads this field; it records the operator's error stack trace
        msg = "operator [" + text + "] failed, error:\n" + this.getExceptionTrace(targetEx);
        flow.put("status", "failed");
        flow.put("msg", targetEx.getMessage());
        workflowResult.add(flow);
    }
    returnMessage.put("workflow", workflowResult);
    returnMessage.put("data", executeResult);
    returnMessage.put("msg", msg);
    return returnMessage;
}
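The executeOperator method above drives each operator through reflection: it looks up a declared constructor, instantiates the operator, then invokes doWork by name. Stripped of the Spark-specific types, the pattern looks roughly like this (the Echo class and its doWork signature are hypothetical stand-ins, not the real operator API):

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

public class ReflectiveCall {
    // A stand-in for an operator class with a non-public constructor.
    public static class Echo {
        private final String name;
        private Echo(String name) { this.name = name; }
        public String doWork(String input) { return name + ":" + input; }
    }

    public static String run(Class<?> op, String name, String input) throws Exception {
        // look up the constructor by parameter types, as executeOperator does with OperatorMeta
        Constructor<?> c = op.getDeclaredConstructor(String.class);
        c.setAccessible(true);                  // the constructor may be non-public
        Object instance = c.newInstance(name);
        // resolve the work method by name and signature, then invoke it
        Method m = op.getMethod("doWork", String.class);
        return (String) m.invoke(instance, input);
    }
}
```

Note that Method.invoke wraps any exception thrown inside the target method in an InvocationTargetException, which is exactly why the catch block in executeOperator unwraps getTargetException() before logging.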
5. Summary
This article only sketches the main steps of the build to offer one possible approach. If you would like the full details of the setup, feel free to contact me.