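Scenario
There is a large text of about 128 MB (roughly 130 million characters) in which IDs must be replaced. There are 489,814 IDs in total, and the data is split by resource type into 117 resources, so the original logic needed 117 × 489,814 = 57,308,238 replaceAll calls. It ran for almost 3 days without finishing.
The original replacement logic used a nested for loop: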
for (Map.Entry<String, Object> entry : result.entrySet()) {
    String valueString = JSONObject.toJSONString(entry.getValue());
    long startOneceCopyTime = System.currentTimeMillis();
    log.info("###version install {} copy start {} ", datappId, entry.getKey());
    if (StringUtils.isNotEmpty(valueString)) {
        for (Map.Entry<String, Object> idEntry : newIdMap.entrySet()) {
            String oldId = idEntry.getKey();
            String newId = String.valueOf(idEntry.getValue());
            if (StringUtils.isEmpty(oldId) || StringUtils.isEmpty(newId) || oldId.length() < 10) {
                // IDs shorter than 10 characters are invalid data and are not replaced
                log.warn("###version install {} unsupport oldId:{} ##", datappId, oldId);
                continue;
            }
            // Each call compiles oldId as a regex and rescans the whole string
            valueString = valueString.replaceAll(oldId, newId);
        }
        // Parse the rewritten JSON back into an object for this resource
        Object parse = JSONObject.parse(valueString);
        entry.setValue(parse);
    }
    long endOneceCopyTime = System.currentTimeMillis();
    log.info("###version install {} copy {} step {}/{} ,spend time {} ms", datappId, entry.getKey(), ++count, result.size(), endOneceCopyTime - startOneceCopyTime);
}
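- Frequent replaceAll calls: replacing each oldId with its newId is expensive on a large data set, because every call scans the whole string again.
- Nested loops: for every entry of result, the whole newIdMap is traversed. With m resources and n IDs this is O(m × n) replaceAll calls (117 × 489,814 here), so throughput collapses as the data grows.
- JSON parsing after every value change: the repeated JSONObject.parse calls are a further bottleneck.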
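One detail behind the replaceAll cost: String.replaceAll treats its first argument as a regular expression, compiles it on every call, and gives `$` and `\` a special meaning in the replacement. The sketch below assumes the IDs are meant as plain literals (the original text does not say what characters they contain) and shows two ways to keep the replacement literal. The class and method names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralReplaceDemo {
    // String.replace interprets neither argument as a regex.
    static String replaceLiteral(String text, String oldId, String newId) {
        return text.replace(oldId, newId);
    }

    // Regex-based equivalent: quote the pattern and the replacement explicitly.
    static String replaceQuoted(String text, String oldId, String newId) {
        return text.replaceAll(Pattern.quote(oldId), Matcher.quoteReplacement(newId));
    }

    public static void main(String[] args) {
        String text = "a.b1234567 aXb1234567";
        // An unquoted replaceAll would treat '.' as "any character" and rewrite both tokens.
        System.out.println(replaceLiteral(text, "a.b1234567", "NEW"));
        System.out.println(replaceQuoted(text, "a.b1234567", "NEW"));
    }
}
```

Either form only changes how the ID is interpreted. The number of passes over the text stays the same.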
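First optimization
Drop the per-table loop: serialize the whole result map into one large JSON string and run replaceAll on it. The number of replaceAll calls falls from 57,308,238 to 489,814.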
String resultString = JSONObject.toJSONString(result);
for (Map.Entry<String, Object> idEntry : newIdMap.entrySet()) {
    String oldId = idEntry.getKey();
    String newId = String.valueOf(idEntry.getValue());
    long startOneceCopyTime = System.currentTimeMillis();
    log.info("###version install {} copy start {},oldId:{},newId:{} ", datappId, count, oldId, newId);
    if (StringUtils.isEmpty(oldId) || StringUtils.isEmpty(newId) || oldId.length() < 10) {
        // IDs shorter than 10 characters are invalid data and are not replaced
        log.warn("###version install {} unsupport oldId:{} ##", datappId, oldId);
        continue;
    }
    // One full scan of the whole JSON string per ID
    resultString = resultString.replaceAll(oldId, newId);
    long endOneceCopyTime = System.currentTimeMillis();
    log.info("###version install {} copy {} end ,spend:{} ", datappId, count, endOneceCopyTime - startOneceCopyTime);
    count++;
}
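After this optimization each replacement took about 200 ms on average, so the whole job needs 489,814 × 200 / 1000 = 97,962.8 s, about 27 hours. That is still far too long.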
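The estimate is a straight multiplication of the ID count by the measured time per replacement. A minimal sketch of that arithmetic follows. The class and method names are illustrative and not part of the original code.

```java
final class ReplaceTimeEstimate {
    // Total hours = number of IDs x milliseconds per replacement, converted to hours.
    static double estimateHours(long idCount, long millisPerReplacement) {
        return idCount * millisPerReplacement / 1000.0 / 3600.0;
    }

    public static void main(String[] args) {
        long ids = 489_814L;
        System.out.printf("200 ms per ID -> %.1f h%n", estimateHours(ids, 200));
    }
}
```

The same function can be reused for any other per-replacement figure measured later.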
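Second optimization
Replace the per-ID replaceAll calls with a single combined regular expression (an alternation of all old IDs), so that the text is scanned once and each match is looked up in a map.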
String resultString = JSONObject.toJSONString(result);
Map<String, String> idReplaceMap = new HashMap<>();
// Pre-process the mapping and filter out invalid IDs
for (Map.Entry<String, Object> idEntry : newIdMap.entrySet()) {
    String oldId = idEntry.getKey();
    String newId = String.valueOf(idEntry.getValue());
    if (StringUtils.isNotEmpty(oldId) && StringUtils.isNotEmpty(newId) && oldId.length() >= 10) {
        idReplaceMap.put(oldId, newId);
    } else {
        log.warn("###version install {} unsupported oldId:{} ##", datappId, oldId);
    }
}
// Build one regular expression that matches any of the old IDs
StringBuilder regexBuilder = new StringBuilder();
for (String oldId : idReplaceMap.keySet()) {
    regexBuilder.append(Pattern.quote(oldId)).append("|");
}
if (regexBuilder.length() > 0) {
    // Drop the trailing "|"
    regexBuilder.setLength(regexBuilder.length() - 1);
    Pattern pattern = Pattern.compile(regexBuilder.toString());
    Matcher matcher = pattern.matcher(resultString);
    // Matcher.appendReplacement takes a StringBuffer (a StringBuilder overload exists only since Java 9)
    StringBuffer sb = new StringBuffer();
    while (matcher.find()) {
        long startOneceCopyTime = System.currentTimeMillis();
        log.info("###version install {} copy {} start ", datappId, count++);
        String oldId = matcher.group();
        String newId = idReplaceMap.getOrDefault(oldId, oldId);
        // quoteReplacement keeps '$' and '\' in newId from being read as group references
        matcher.appendReplacement(sb, Matcher.quoteReplacement(newId));
        long endOneceCopyTime = System.currentTimeMillis();
        log.info("###version install {} copy end ,spend:{} ", datappId, endOneceCopyTime - startOneceCopyTime);
    }
    matcher.appendTail(sb);
    resultString = sb.toString();
}
After this optimization each replacement took about 100 ms on average, roughly 14 hours in total. That is still too long.
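The combined-regex idea can be tried in isolation on a small input. The sketch below is a self-contained version under stated assumptions: the map is already filtered to valid IDs, and the class name and sample IDs are made up for illustration. It is not the project's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class AlternationReplaceDemo {
    // Replaces every key of idMap found in text with its value, in one pass over the text.
    static String replaceIds(String text, Map<String, String> idMap) {
        if (idMap.isEmpty()) {
            return text;
        }
        // Quote each ID so it is matched literally, then join into one alternation.
        String regex = idMap.keySet().stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Matcher matcher = Pattern.compile(regex).matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Every match is exactly one of the keys, so the lookup cannot miss.
            String newId = idMap.get(matcher.group());
            matcher.appendReplacement(out, Matcher.quoteReplacement(newId));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> ids = new LinkedHashMap<>();
        ids.put("old0000001", "new0000001");
        ids.put("old0000002", "new0000002");
        System.out.println(replaceIds("{\"a\":\"old0000001\",\"b\":\"old0000002\"}", ids));
    }
}
```

Two properties are worth keeping in mind. Replaced text is never rescanned, so a new ID that happens to equal another old ID is not replaced a second time, unlike with sequential replaceAll calls. Also, Java's alternation takes the first alternative that matches, so if one old ID is a prefix of another, the order of the alternatives decides which one wins.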
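Third optimization
Write the JSON to a file and replace the IDs in that file with a shell command (sed) on the server. Each replacement took about 500 ms on average, so the total is 489,814 × 500 / 1000 = 244,907 s, about 68 hours. This is slower than replacing the content in memory, but because this approach rewrites a single file on disk, the replacements can be issued in parallel, which led to the fourth optimization.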
String resultString = JSONObject.toJSONString(result);
com.sdata.foundation.core.util.StorageUtil.writeTextFile(resultString, datappIdFile.getPath(), "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON);
String filePath = datappIdFile.getPath() + File.separator + "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON;
for (Map.Entry<String, Object> idEntry : newIdMap.entrySet()) {
    String oldId = idEntry.getKey();
    String newId = String.valueOf(idEntry.getValue());
    long startOneceCopyTime = System.currentTimeMillis();
    log.info("###version install {} copy start {},oldId:{},newId:{} ", datappId, count, oldId, newId);
    if (StringUtils.isEmpty(oldId) || StringUtils.isEmpty(newId) || oldId.length() < 10) {
        log.warn("###version install {} unsupported oldId:{} ##", datappId, oldId);
        continue;
    }
    String script = String.format("s/%s/%s/g", oldId, newId);
    log.info("Running command: sed -i {} {}", script, filePath);
    try {
        // Pass the arguments as a list: Runtime.exec(String) splits on whitespace
        // and would hand the shell-style quote characters to sed literally.
        Process process = new ProcessBuilder("sed", "-i", script, filePath).start();
        int exitCode = process.waitFor(); // wait for the command to finish
        if (exitCode == 0) {
            log.info("Replaced oldId:{} with newId:{} in file", oldId, newId);
        } else {
            log.warn("sed exited with code {} for oldId:{}", exitCode, oldId);
        }
    } catch (IOException e) {
        log.error("Error executing sed command", e);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt status
        log.error("Interrupted while waiting for sed", e);
    }
    long endOneceCopyTime = System.currentTimeMillis();
    log.info("###version install {} copy {} end ,spend:{} ", datappId, count, endOneceCopyTime - startOneceCopyTime);
    count++;
}
// Read the rewritten file back and parse it
resultString = ImOrExportUtil.readFromFile(new File(filePath));
result = JSONObject.parseObject(resultString, Map.class);
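Fourth optimization
Keep the approach of the third optimization, but run the replacements in parallel on a thread pool of 10 threads (the listing below is configured with 5). With multiple threads running, CPU usage became so high that the system could no longer be reached.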
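Limitations:
This approach relies on the Linux sed command to rewrite the file, so it may not work when deployed on other kinds of servers. In addition, each sed -i run writes a temporary copy and renames it over the original, so when several runs target the same file concurrently, replacements made by one process can be overwritten by another.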
// Copy and replace IDs
long startCopyTime = System.currentTimeMillis();
log.info("###version install {},has {} resources, {} ids ready to copy", datappId, result.size(), newIdMap.size());
// Atomic counter: the tasks below update it from several threads
final AtomicInteger count = new AtomicInteger();
String resultString = JSONObject.toJSONString(result);
com.sdata.foundation.core.util.StorageUtil.writeTextFile(resultString, datappIdFile.getPath(), "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON);
String filePath = datappIdFile.getPath() + File.separator + "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON;
int batchSize = 1000; // Number of tasks per batch
ExecutorService executorService = Executors.newFixedThreadPool(5); // You can adjust the pool size based on your system
List<Map.Entry<String, Object>> entries = new ArrayList<>(newIdMap.entrySet()); // Convert the entry set to a list
for (int i = 0; i < entries.size(); i += batchSize) {
    // Process in batches of 1000
    List<Map.Entry<String, Object>> batch = entries.subList(i, Math.min(i + batchSize, entries.size()));
    log.info("####batch replace id,batchSize:{}###", batch.size());
    for (Map.Entry<String, Object> idEntry : batch) {
        executorService.submit(() -> {
            String oldId = idEntry.getKey();
            String newId = String.valueOf(idEntry.getValue());
            if (StringUtils.isEmpty(oldId) || StringUtils.isEmpty(newId) || oldId.length() < 10) {
                log.warn("###version install {} unsupported oldId:{} ##", datappId, oldId);
                return;
            }
            String command = String.format("sed -i 's/%s/%s/g' %s", oldId, newId, filePath);
            log.info("Running command: {}", command);
            try {
                shellCommandExecutor.exec(command);
            } catch (Exception e) {
                log.error("Error executing sed command", e);
            }
            log.info("#######completed {} task#######", count.incrementAndGet());
        });
    }
}
executorService.shutdown();
try {
    // Wait for all tasks to complete
    if (!executorService.awaitTermination(1, TimeUnit.HOURS)) {
        log.warn("Some tasks did not finish within the time limit.");
        executorService.shutdownNow();
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // preserve the interrupt status
    log.error("Task execution interrupted.", e);
    executorService.shutdownNow();
}
log.info("All tasks completed. Proceeding with the next steps...");
resultString = ImOrExportUtil.readFromFile(new File(filePath));
result = JSONObject.parseObject(resultString, Map.class);
long endCopyTime = System.currentTimeMillis();
log.info("###version install {} data copy end, time:{}ms", datappId, endCopyTime - startCopyTime);
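A progress counter shared by pooled tasks has to be updated atomically. Incrementing a plain int (or an int[] cell) from several threads can lose updates. The sketch below isolates that pattern with an AtomicInteger and a bounded wait for the pool to drain. The class name, the trivial task body, and the one-minute timeout are assumptions for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class PooledCounterDemo {
    // Runs taskCount trivial tasks on a fixed pool and returns how many completed.
    static int runTasks(int taskCount, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            pool.submit(() -> {
                // Real work (for example one replacement) would go here.
                completed.incrementAndGet();
            });
        }
        pool.shutdown(); // stop accepting new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runTasks(1000, 5));
    }
}
```

The atomic counter only makes the progress count reliable. It does not make concurrent writes to one shared file safe.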
Fifth optimization
public String replaceIds(String text, Map<String, Object> idMap) {
    StringBuilder result = new StringBuilder();
    int i = 0;
    // Walk through the whole text once
    long startOneceCopyTime = System.currentTimeMillis();
    while (i < text.length()) {
        boolean replaced = false;
        // Try every old ID in the map at the current position
        for (Map.Entry<String, Object> entry : idMap.entrySet()) {
            String oldId = entry.getKey();
            String newId = String.valueOf(entry.getValue());
            // The old ID matches at the current position
            // (an empty key is skipped, otherwise i would never advance)
            if (!oldId.isEmpty() && text.startsWith(oldId, i)) {
                result.append(newId);  // append the new ID
                i += oldId.length();   // skip over the matched old ID
                replaced = true;
                break;                 // stop checking other IDs after the first match
            }
        }
        // No old ID matched: copy the current character
        if (!replaced) {
            result.append(text.charAt(i));
            i++;
        }
        if (i % 100000 == 0) {
            long endOneceCopyTime = System.currentTimeMillis();
            log.info("#######completed {} task,spend:{}#######", i, endOneceCopyTime - startOneceCopyTime);
        }
    }
    return result.toString();
}
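Sixth optimization
Use a Trie (prefix tree): a Trie matches string prefixes efficiently, so there is no need to try every ID at every character position. The roughly 480,000 old IDs are stored in the Trie, and during replacement the text is matched character by character against it. The cost of a lookup at each position is then bounded by the length of an ID rather than by the number of IDs.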
import java.util.HashMap;
import java.util.Map;

// Trie node definition
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    String newId; // set only where a complete old ID ends; holds the corresponding new ID
}

class Trie {
    private TrieNode root;
    private int matchedOldIdLength; // length of the old ID matched by the last call to match()

    public Trie() {
        root = new TrieNode();
    }

    // Insert an old ID together with its new ID
    public void insert(String oldId, String newId) {
        TrieNode node = root;
        for (char c : oldId.toCharArray()) {
            node.children.putIfAbsent(c, new TrieNode());
            node = node.children.get(c);
        }
        node.newId = newId;
    }

    // Try to match an old ID starting at the given position; returns the new ID if one is found
    public String match(String text, int start) {
        TrieNode node = root;
        int i = start;
        matchedOldIdLength = 0;
        while (i < text.length() && node.children.containsKey(text.charAt(i))) {
            node = node.children.get(text.charAt(i));
            i++;
            matchedOldIdLength++;
            // A complete old ID was found (the shortest one at this position): return its new ID
            if (node.newId != null) {
                return node.newId;
            }
        }
        // No complete old ID found
        matchedOldIdLength = 0;
        return null;
    }

    // Length of the old ID matched by the last call to match()
    public int getMatchedOldIdLength() {
        return matchedOldIdLength;
    }
}

public String replaceIds(String text, Trie trie) {
    StringBuilder result = new StringBuilder();
    int i = 0;
    long startOneceCopyTime = System.currentTimeMillis();
    while (i < text.length()) {
        // Check whether an old ID starts at the current position
        String newId = trie.match(text, i);
        if (newId != null) {
            // Matched: append the new ID and skip over the old one
            result.append(newId);
            i += trie.getMatchedOldIdLength();
        } else {
            // No match: keep the current character
            result.append(text.charAt(i));
            i++;
        }
        if (i % 100000 == 0) {
            long endOneceCopyTime = System.currentTimeMillis();
            log.info("#######completed {} task,spend:{}#######", i, endOneceCopyTime - startOneceCopyTime);
        }
    }
    return result.toString();
}
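After this optimization the ID replacement took 10,579 ms (about 10 s), and the overall installation of the roughly 500,000 records took about 20 minutes.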
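A Trie matcher that returns as soon as it reaches the end of any stored ID picks the shortest ID at each position. If one old ID can be a prefix of another (an assumption; the original text does not say whether this occurs), a longest-match variant may be preferable. The following is a minimal, self-contained sketch of that variant with illustrative names. It is not the code that produced the timing above.

```java
import java.util.HashMap;
import java.util.Map;

final class LongestMatchTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String newId; // non-null only where a complete old ID ends
    }

    private final Node root = new Node();

    void insert(String oldId, String newId) {
        Node node = root;
        for (int i = 0; i < oldId.length(); i++) {
            node = node.children.computeIfAbsent(oldId.charAt(i), c -> new Node());
        }
        node.newId = newId;
    }

    // Single pass over text; at each position the longest stored old ID wins.
    String replaceAll(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            String best = null;
            int bestEnd = i;
            for (int j = i; j < text.length(); j++) {
                node = node.children.get(text.charAt(j));
                if (node == null) {
                    break; // no stored ID continues with this character
                }
                if (node.newId != null) {
                    best = node.newId; // remember the match and keep looking for a longer one
                    bestEnd = j + 1;
                }
            }
            if (best != null) {
                out.append(best);
                i = bestEnd;
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        LongestMatchTrie trie = new LongestMatchTrie();
        trie.insert("abc1234567", "X");
        trie.insert("abc12345678", "Y");
        System.out.println(trie.replaceAll("abc12345678 abc1234567!"));
    }
}
```

Keeping the match length in a local variable rather than in a field of the Trie also lets several threads share one Trie once all inserts are done.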
Problem solved.