Alternatives to replaceAll when replacing text in very large data sets

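Scenario

        There is a large text of 128 MB (about 130 million characters) whose IDs need to be replaced. There are 489,814 IDs in total and, grouped by resource type, 117 resources, so the text is subjected to 117 × 489,814 = 57,308,238 replaceAll calls. With the original logic the job ran for almost 3 days without finishing.

The original replacement logic used a double for loop: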

for (Map.Entry<String, Object> entry : result.entrySet()) {
    String valueString = JSONObject.toJSONString(entry.getValue());
    long startOneceCopyTime = System.currentTimeMillis();
    log.info("###version install {} copy start {} ", datappId,entry.getKey());
    if (StringUtils.isNotEmpty(valueString)) {
        for (Map.Entry<String, Object> idEntry : newIdMap.entrySet()) {
            String oldId = idEntry.getKey();
            String newId = String.valueOf(idEntry.getValue());
            if (StringUtils.isEmpty(oldId) || StringUtils.isEmpty(newId) || oldId.length() < 10) {
                // IDs shorter than 10 characters are invalid data; do not replace them
                log.warn("###version install {} unsupport oldId:{} ##",datappId, oldId);
                continue;
            }
            valueString = valueString.replaceAll(oldId, newId);
        }
        Object parse = JSONObject.parse(valueString);    
        entry.setValue(parse);
    }
    long endOneceCopyTime = System.currentTimeMillis();
    log.info("###version install {} copy {} step {}/{} ,spend time {} ms",datappId, entry.getKey(), ++count, result.size(), endOneceCopyTime - startOneceCopyTime);
}
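  • Frequent replaceAll calls: every oldId/newId pair triggers a regex compilation and a full scan of the string, which is costly on a large data set.

  • Nested loops: for every entry in result, the whole newIdMap is traversed, so the number of replaceAll calls is (number of resources) × (number of IDs), and efficiency drops sharply as the data grows.

  • JSON is re-parsed for every modified value: the repeated JSONObject.parse calls are another performance bottleneck.

  • First optimization

    Drop the per-table loop: convert the whole result map into one large JSON string and run replaceAll on it. The number of replaceAll calls falls from 57,308,238 to 489,814.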
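
One detail behind the first bullet: `String.replaceAll` interprets its first argument as a regular expression, whereas `String.replace` does a literal substitution. The sketch below is only an illustration with made-up IDs (the class and method names are hypothetical, not from the original code); it shows how an ID containing a regex metacharacter can match more than intended.

```java
final class ReplaceDemo {
    // Literal replacement: oldId is matched character for character.
    static String literalReplace(String text, String oldId, String newId) {
        return text.replace(oldId, newId);
    }

    // Regex replacement: oldId is compiled as a pattern on every call.
    static String regexReplace(String text, String oldId, String newId) {
        return text.replaceAll(oldId, newId);
    }

    public static void main(String[] args) {
        // Hypothetical data: the "." in the old ID is a regex wildcard.
        String text = "{\"id\":\"a.c1234567\",\"ref\":\"abc1234567\"}";
        System.out.println(regexReplace(text, "a.c1234567", "NEW"));
        System.out.println(literalReplace(text, "a.c1234567", "NEW"));
    }
}
```

If the IDs are purely alphanumeric the two calls give the same result, but the literal form avoids compiling a pattern for each ID.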

    String resultString = JSONObject.toJSONString(result);
    for (Map.Entry<String, Object> idEntry : newIdMap.entrySet()) {
        String oldId = idEntry.getKey();
        String newId = String.valueOf(idEntry.getValue());
        long startOneceCopyTime = System.currentTimeMillis();
        log.info("###version install {} copy start {},oldId:{},newId:{} ", datappId,count,oldId,newId);
        if (StringUtils.isEmpty(oldId) || StringUtils.isEmpty(newId) || oldId.length() < 10) {
            // IDs shorter than 10 characters are invalid data; do not replace them
            log.warn("###version install {} unsupport oldId:{} ##",datappId, oldId);
            continue;
        }
        resultString = resultString.replaceAll(oldId, newId);
        long endOneceCopyTime = System.currentTimeMillis();
        log.info("###version install {} copy {} end ,spend:{} ", datappId,count,endOneceCopyTime-startOneceCopyTime);
        count++;
    }

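  • After this optimization each replacement takes about 200 ms on average, so the total is 489,814 × 200 / 1000 = 97,962.8 s, roughly 27 hours. That is still far too long.

    Second optimization

    Replace the per-ID replaceAll calls with a single regular expression that matches all old IDs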
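
The estimate above is simply (number of replacements) × (average time per replacement). A small helper makes it easy to redo the calculation for other per-replacement timings; the class and method names are hypothetical and exist only for this illustration.

```java
final class ReplaceTimeEstimate {
    // Total run time in hours for a given number of sequential operations.
    static double estimateHours(long operations, double millisPerOperation) {
        double totalSeconds = operations * millisPerOperation / 1000.0;
        return totalSeconds / 3600.0;
    }

    public static void main(String[] args) {
        // 489,814 IDs at about 200 ms each, as measured in the article.
        System.out.printf("%.1f hours%n", estimateHours(489_814, 200));
    }
}
```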

    String resultString = JSONObject.toJSONString(result);
    Map<String, String> idReplaceMap = new HashMap<>();
    
    // Preprocess the mapping table and filter out invalid IDs
    for (Map.Entry<String, Object> idEntry : newIdMap.entrySet()) {
        String oldId = idEntry.getKey();
        String newId = String.valueOf(idEntry.getValue());
        if (StringUtils.isNotEmpty(oldId) && StringUtils.isNotEmpty(newId) && oldId.length() >= 10) {
            idReplaceMap.put(oldId, newId);
        } else {
            log.warn("###version install {} unsupported oldId:{} ##", datappId, oldId);
        }
    }
    
    // Build one regular expression that replaces all oldIds in a single pass
    StringBuilder regexBuilder = new StringBuilder();
    for (String oldId : idReplaceMap.keySet()) {
        regexBuilder.append(Pattern.quote(oldId)).append("|");
    }
    
    if (regexBuilder.length() > 0) {
        // Remove the trailing "|"
        regexBuilder.setLength(regexBuilder.length() - 1);
        Pattern pattern = Pattern.compile(regexBuilder.toString());
        Matcher matcher = pattern.matcher(resultString);
    
        // Perform the replacement, accumulating the output in a StringBuffer
        StringBuffer sb = new StringBuffer();
    
        while (matcher.find()) {
            long startOneceCopyTime = System.currentTimeMillis();
            log.info("###version install {} copy {} start ", datappId,count++);
            String oldId = matcher.group();
            String newId = idReplaceMap.getOrDefault(oldId, oldId);
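            // quoteReplacement keeps '$' and '\' in newId from being read as group references
            matcher.appendReplacement(sb, Matcher.quoteReplacement(newId));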
            long endOneceCopyTime = System.currentTimeMillis();
            log.info("###version install {} copy end ,spend:{} ", datappId, endOneceCopyTime - startOneceCopyTime);
        }
        matcher.appendTail(sb);
    
        resultString = sb.toString();
    
    }
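
The single-pass idea can be tried in isolation with a couple of made-up IDs. The following is a minimal sketch, not the production code: the class name, method name and sample data are assumptions for illustration. Since `java.util.regex` is a backtracking engine, it likely still tries many alternatives at each text position, which may explain why a pattern with hundreds of thousands of alternatives remains slow.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class SinglePassRegexReplace {
    // Replaces every key of idMap found in text with its mapped value, in one scan.
    static String replaceAllIds(String text, Map<String, String> idMap) {
        if (idMap.isEmpty()) {
            return text;
        }
        // Quote each ID so it is matched literally, then join with "|".
        String alternation = idMap.keySet().stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Matcher matcher = Pattern.compile(alternation).matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // The match is always one of the keys, so the lookup cannot miss.
            String newId = idMap.get(matcher.group());
            matcher.appendReplacement(out, Matcher.quoteReplacement(newId));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> ids = new LinkedHashMap<>();
        ids.put("1000000001", "A$1"); // '$' is safe thanks to quoteReplacement
        ids.put("1000000002", "B");
        System.out.println(replaceAllIds("x1000000001y1000000002z", ids));
    }
}
```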

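  • After this optimization each replacement takes about 100 ms on average, about 14 hours in total. That is still far too long.

    Third optimization

    Replace the IDs in a file with a shell command. Each replacement takes about 500 ms on average: 489,814 × 500 / 1000 = 244,907 s, roughly 68 hours. This is slower than replacing the content directly in memory, but because every replacement targets the same file, the work looked parallelizable, which led to the fourth optimization.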

    String resultString = JSONObject.toJSONString(result);
    com.sdata.foundation.core.util.StorageUtil.writeTextFile(resultString, datappIdFile.getPath(), "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON);
    for (Map.Entry<String, Object> idEntry : newIdMap.entrySet()) {
        String oldId = idEntry.getKey();
        String newId = String.valueOf(idEntry.getValue());
        long startOneceCopyTime = System.currentTimeMillis();
        log.info("###version install {} copy start {},oldId:{},newId:{} ", datappId,count,oldId,newId);
        if (StringUtils.isEmpty(oldId) || StringUtils.isEmpty(newId) || oldId.length() < 10) {
            log.warn("###version install {} unsupported oldId:{} ##", datappId, oldId);
            continue;
        }
    
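        // Pass the arguments as a list: Runtime.exec(String) splits on whitespace
        // and would hand the single quotes to sed literally
        String filePath = datappIdFile.getPath() + File.separator + "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON;
        String sedScript = String.format("s/%s/%s/g", oldId, newId);
        log.info("Running command: sed -i {} {}", sedScript, filePath);
    
        try {
            Process process = new ProcessBuilder("sed", "-i", sedScript, filePath).start();
            process.waitFor(); // Wait for the command to finish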
            log.info("Replaced oldId:{} with newId:{} in file", oldId, newId);
        } catch (IOException | InterruptedException e) {
            log.error("Error executing sed command", e);
        }
        long endOneceCopyTime = System.currentTimeMillis();
        log.info("###version install {} copy {} end ,spend:{} ", datappId,count,endOneceCopyTime-startOneceCopyTime);
        count++;
    }
    resultString = ImOrExportUtil.readFromFile(new File(datappIdFile.getPath() + File.separator + "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON));
    result = JSONObject.parseObject(resultString,Map.class);

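    Fourth optimization

    Same approach as the third optimization, but run in parallel with a thread pool of 10 threads (the listing below shows a pool size of 5). With multiple threads, CPU usage became so high that the system could no longer be accessed. Concurrent sed -i runs against one file are also unsafe in principle: each run rewrites the whole file, so one run can overwrite another's replacements.

    Limitation:

    The replacement relies on the Linux sed command, so it may not be supported if the service is deployed to other types of servers.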

                            // Copy and replace IDs
                            long startCopyTime = System.currentTimeMillis();
                            log.info("###version install {},has {} resources, {} ids ready to copy", datappId,result.size(), newIdMap.size());
                            final int[] count = {0};
                            String resultString = JSONObject.toJSONString(result);
                            com.sdata.foundation.core.util.StorageUtil.writeTextFile(resultString, datappIdFile.getPath(), "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON);
                            int batchSize = 1000;  // Number of tasks per batch
    
                            ExecutorService executorService = Executors.newFixedThreadPool(5);  // You can adjust the pool size based on your system
    
                            List<Map.Entry<String, Object>> entries = new ArrayList<>(newIdMap.entrySet());  // Convert the entry set to a list
                            for (int i = 0; i < entries.size(); i += batchSize) {
                                // Process in batches of 1000
                                List<Map.Entry<String, Object>> batch = entries.subList(i, Math.min(i + batchSize, entries.size()));
                                log.info("####batch replace id,batchSize:{}###",batch.size());
                                for (Map.Entry<String, Object> idEntry : batch) {
                                    executorService.submit(() -> {
                                        String oldId = idEntry.getKey();
                                        String newId = String.valueOf(idEntry.getValue());
    //                                    long startOneceCopyTime = System.currentTimeMillis();
                                        if (StringUtils.isEmpty(oldId) || StringUtils.isEmpty(newId) || oldId.length() < 10) {
                                            log.warn("###version install {} unsupported oldId:{} ##", datappId, oldId);
                                            return;
                                        }
                                        String command = String.format("sed -i 's/%s/%s/g' %s", oldId, newId, datappIdFile.getPath() + File.separator + "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON);
                                        log.info("Running command: {}", command);
    
                                        try {
                                            shellCommandExecutor.exec(command);
                                        } catch (Exception e) {
                                            log.error("Error executing sed command", e);
                                        }
    //                                    long endOneceCopyTime = System.currentTimeMillis();
                                        log.info("#######completed {} task#######", count[0]);
                                        count[0] = count[0] +1;
                                    });
                                }
                            }
                            executorService.shutdown();
                            try {
                                // Wait for all tasks to complete
                                if (!executorService.awaitTermination(1, TimeUnit.HOURS)) {
                                    log.warn("Some tasks did not finish within the time limit.");
                                    executorService.shutdownNow();
                                }
                            } catch (InterruptedException e) {
                                log.error("Task execution interrupted.", e);
                                executorService.shutdownNow();
                            }
    
    
                            log.info("All tasks completed. Proceeding with the next steps...");
                            resultString = ImOrExportUtil.readFromFile(new File(datappIdFile.getPath() + File.separator + "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON));
                            result = JSONObject.parseObject(resultString,Map.class);
                            long endCopyTime = System.currentTimeMillis();
                            log.info("###version install {} data copy end, time:{}ms",datappId, endCopyTime - startCopyTime);
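
One side note on the listing above: the progress counter is a plain `int[]` incremented from several pool threads, so updates can be lost and the logged count may be inaccurate. A minimal sketch of a thread-safe counter follows, assuming the per-ID work is replaced by a placeholder; the class and method names are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class CounterSketch {
    // Runs taskCount trivial tasks on a fixed pool and returns how many completed.
    static int runTasks(int taskCount, int poolSize) {
        AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < taskCount; i++) {
            pool.submit(() -> {
                // Placeholder for the per-ID work done in the real task.
                completed.incrementAndGet(); // atomic, so no increments are lost
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            pool.shutdownNow();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runTasks(10_000, 5));
    }
}
```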

    Fifth optimization

    Scan the text once in memory and, at each position, try every old ID in the map:

    public  String replaceIds(String text, Map<String, Object> idMap) {
        StringBuilder result = new StringBuilder();
        int i = 0;
    
        // Walk through the whole text
        long startOneceCopyTime = System.currentTimeMillis();
        while (i < text.length()) {
            boolean replaced = false;
    
            // Try to match every old ID in the map at the current position
            for (Map.Entry<String, Object> entry : idMap.entrySet()) {
                String oldId = entry.getKey();
                String newId = String.valueOf(entry.getValue());
    
                // If an old ID matches at the current position
                if (i + oldId.length() <= text.length() && text.substring(i, i + oldId.length()).equals(oldId)) {
                    result.append(newId); // Append the new ID
                    i += oldId.length(); // Skip past the matched old ID
                    replaced = true;
                    break; // Stop checking other IDs once one has matched
                }
            }
    
            // If no old ID matched, append the current character
            if (!replaced) {
                result.append(text.charAt(i));
                i++;
            }
            if(i%100000==0){
                long endOneceCopyTime = System.currentTimeMillis();
                log.info("#######completed {} task,spend:{}#######", i,endOneceCopyTime-startOneceCopyTime);
            }
        }
    
        return result.toString();
    }

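    Sixth optimization

    Use a Trie (prefix tree). A Trie matches string prefixes efficiently, so there is no need to loop over every ID at every character position. The roughly 480,000 old IDs are stored in the Trie and the text is matched against it character by character, so the work per position is bounded by the ID length rather than by the number of IDs.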

    // Trie node definition
    class TrieNode {
        Map<Character, TrieNode> children = new HashMap<>();
        String newId; // Set to the new ID when the path to this node spells a complete old ID
    }
    
    class Trie {
        private TrieNode root;
        private int matchedOldIdLength;  // Length of the most recently matched old ID
    
        public Trie() {
            root = new TrieNode();
        }
    
        // Insert an old ID together with its new ID
        public void insert(String oldId, String newId) {
            TrieNode node = root;
            for (char c : oldId.toCharArray()) {
                node.children.putIfAbsent(c, new TrieNode());
                node = node.children.get(c);
            }
            node.newId = newId;
        }
    
        // Try to match an old ID starting at the given position; return the new ID if one is found
        public String match(String text, int start) {
            TrieNode node = root;
            int i = start;
            matchedOldIdLength = 0;
    
            while (i < text.length() && node.children.containsKey(text.charAt(i))) {
                node = node.children.get(text.charAt(i));
                i++;
                matchedOldIdLength++;
    
                // A complete old ID was found: return its new ID
                if (node.newId != null) {
                    return node.newId;
                }
            }
    
            // No complete old ID was found
            matchedOldIdLength = 0;
            return null;
        }
    
        // Length of the old ID matched by the last call to match
        public int getMatchedOldIdLength() {
            return matchedOldIdLength;
        }
    }
    public  String replaceIds(String text, Trie trie) {
        StringBuilder result = new StringBuilder();
        int i = 0;
        long startOneceCopyTime = System.currentTimeMillis();
        while (i < text.length()) {
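            // Check whether an old ID matches starting at the current position
            String newId = trie.match(text, i);
    
            if (newId != null) {
                // An old ID matched: replace it with the new ID
                result.append(newId);
                i += trie.getMatchedOldIdLength();  // Advance the index past the old ID
            } else {
                // No old ID matched: keep the current character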
                result.append(text.charAt(i));
                i++;
            }
            if(i%100000==0){
                long endOneceCopyTime = System.currentTimeMillis();
                log.info("#######completed {} task,spend:{}#######", i,endOneceCopyTime-startOneceCopyTime);
            }
        }
    
        return result.toString();
    }
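
To see the approach end to end, here is a compact, self-contained sketch with two made-up IDs. It is a variant written for illustration, not the code above: the class name is hypothetical, and the match length is tracked in a local variable instead of a field. Like the listing above, it stops at the first complete ID it reaches, so if one old ID were a prefix of another, the shorter one would win.

```java
import java.util.HashMap;
import java.util.Map;

final class TrieReplaceSketch {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String newId; // non-null when the path to this node spells a complete old ID
    }

    private final Node root = new Node();

    void insert(String oldId, String newId) {
        Node node = root;
        for (int i = 0; i < oldId.length(); i++) {
            node = node.children.computeIfAbsent(oldId.charAt(i), c -> new Node());
        }
        node.newId = newId;
    }

    String replaceIds(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            int j = i;
            String replacement = null;
            // Walk the trie as long as the text keeps matching some old ID prefix.
            while (j < text.length()) {
                node = node.children.get(text.charAt(j));
                if (node == null) {
                    break;
                }
                j++;
                if (node.newId != null) {
                    replacement = node.newId;
                    break;
                }
            }
            if (replacement != null) {
                out.append(replacement);
                i = j; // skip past the matched old ID
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TrieReplaceSketch trie = new TrieReplaceSketch();
        trie.insert("1000000001", "2000000001");
        trie.insert("1000000002", "2000000002");
        System.out.println(trie.replaceIds("{\"a\":\"1000000001\",\"b\":\"1000000003\"}"));
    }
}
```

Each text position costs at most one walk down the trie, bounded by the length of the longest ID, regardless of how many IDs are stored.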

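  • After this optimization the ID replacement takes 10,579 ms, about 10 s, and the full installation of about 500,000 records takes about 20 minutes.

  • Problem solved.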