Faster alternatives to replaceAll when replacing text in large volumes of data

LEO_Loe

Last modified 2024-11-21 15:35:02

Article tags: java linux programming languages

First published 2024-11-21 15:33:47

Article link: https://blog.youkuaiyun.com/LEO_Loe/article/details/142217368

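Scenario

We have a 128 MB text of about 130 million characters in which IDs need to be replaced. There are 489,814 IDs in total, and the data is split by resource type into 117 resources, so the text needs 117 × 489,814 = 57,308,238 replaceAll calls. With the original logic the job ran for almost 3 days without finishing.

The original replacement logic uses a nested for loop: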

for (Map.Entry<String, Object> entry : result.entrySet()) {
    String valueString = JSONObject.toJSONString(entry.getValue());
    long startOneceCopyTime = System.currentTimeMillis();
    log.info("###version install {} copy start {} ", datappId,entry.getKey());
    if (StringUtils.isNotEmpty(valueString)) {
        for (Map.Entry<String, Object> idEntry : newIdMap.entrySet()) {
            String oldId = idEntry.getKey();
            String newId = String.valueOf(idEntry.getValue());
            if (StringUtils.isEmpty(oldId) || StringUtils.isEmpty(newId) || oldId.length() < 10) {
                // IDs shorter than 10 characters are treated as invalid data and are not replaced
                log.warn("###version install {} unsupport oldId:{} ##",datappId, oldId);
                continue;
            }
            valueString = valueString.replaceAll(oldId, newId);
        }
        Object parse = JSONObject.parse(valueString);    
        entry.setValue(parse);
    }
    long endOneceCopyTime = System.currentTimeMillis();
    log.info("###version install {} copy {} step {}/{} ,spend time {} ms",datappId, entry.getKey(), ++count, result.size(), endOneceCopyTime - startOneceCopyTime);
}

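Frequent replaceAll calls: every oldId-to-newId replacement rescans the text, and replaceAll also treats oldId as a regular expression, so the overhead on a large data set is high.
Nested loops: for every entry in result, the whole newIdMap is traversed, so the number of replaceAll calls is (number of resources) × (number of IDs), and efficiency drops sharply as the data grows.
JSON parsing after every value is modified: the repeated JSONObject.parse calls are another performance bottleneck.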
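
As a small illustration of the first point, the sketch below contrasts regex-based `replaceAll` with literal `replace`. The class name, the sample JSON, and the ID containing a `.` are made up for the example and do not come from the original project.

```java
import java.util.regex.Pattern;

public class LiteralReplaceDemo {
    // replaceAll compiles oldId as a regular expression on every call.
    public static String viaRegex(String text, String oldId, String newId) {
        return text.replaceAll(oldId, newId);
    }

    // replace treats oldId as a plain literal.
    public static String viaLiteral(String text, String oldId, String newId) {
        return text.replace(oldId, newId);
    }

    // If a regex API is required, quoting the ID makes it match literally.
    public static String viaQuotedRegex(String text, String oldId, String newId) {
        return text.replaceAll(Pattern.quote(oldId), newId);
    }

    public static void main(String[] args) {
        // Hypothetical data: the '.' in oldId is a regex metacharacter.
        String text = "{\"id\":\"a.b1234567\",\"ref\":\"axb1234567\"}";
        String oldId = "a.b1234567";
        String newId = "NEW_ID_001";

        // The regex version also rewrites "axb1234567", because '.' matches any character.
        System.out.println(viaRegex(text, oldId, newId));
        // The literal and quoted versions only rewrite the exact ID.
        System.out.println(viaLiteral(text, oldId, newId));
        System.out.println(viaQuotedRegex(text, oldId, newId));
    }
}
```

If the IDs are purely alphanumeric the results are the same either way, but the literal form avoids depending on that.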

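First optimization

Drop the per-table loop: convert the whole result map into one large JSON string and run replaceAll on that. This reduces the number of replaceAll calls from 57,308,238 to 489,814.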

String resultString = JSONObject.toJSONString(result);
for (Map.Entry<String, Object> idEntry : newIdMap.entrySet()) {
    String oldId = idEntry.getKey();
    String newId = String.valueOf(idEntry.getValue());
    long startOneceCopyTime = System.currentTimeMillis();
    log.info("###version install {} copy start {},oldId:{},newId:{} ", datappId,count,oldId,newId);
    if (StringUtils.isEmpty(oldId) || StringUtils.isEmpty(newId) || oldId.length() < 10) {
        // IDs shorter than 10 characters are treated as invalid data and are not replaced
        log.warn("###version install {} unsupport oldId:{} ##",datappId, oldId);
        continue;
    }
    resultString = resultString.replaceAll(oldId, newId);
    long endOneceCopyTime = System.currentTimeMillis();
    log.info("###version install {} copy {} end ,spend:{} ", datappId,count,endOneceCopyTime-startOneceCopyTime);
    count++;
}

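After this optimization each replacement takes about 200 ms on average, so the total is 489,814 × 200 / 1000 = 97,962.8 s, or about 27 hours, which is still far too long.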
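
The estimate above is simple arithmetic, and a tiny helper makes it easy to redo for other per-replacement timings. This is only a sketch: the class and method names are invented for illustration, and the only inputs taken from the article are the 489,814 IDs and the 200 ms average.

```java
public class ReplaceTimeEstimate {
    /** Estimated total hours for idCount sequential replacements at msPerReplace each. */
    public static double estimateHours(long idCount, long msPerReplace) {
        double totalSeconds = idCount * msPerReplace / 1000.0;
        return totalSeconds / 3600.0;
    }

    public static void main(String[] args) {
        long idCount = 489_814L; // number of IDs from the scenario
        System.out.printf("200 ms each: %.1f h%n", estimateHours(idCount, 200));
    }
}
```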

Second optimization

Use a single combined regular expression in place of the per-ID replaceAll calls.

String resultString = JSONObject.toJSONString(result);
Map<String, String> idReplaceMap = new HashMap<>();

// Preprocess the mapping table and filter out invalid IDs
for (Map.Entry<String, Object> idEntry : newIdMap.entrySet()) {
    String oldId = idEntry.getKey();
    String newId = String.valueOf(idEntry.getValue());
    if (StringUtils.isNotEmpty(oldId) && StringUtils.isNotEmpty(newId) && oldId.length() >= 10) {
        idReplaceMap.put(oldId, newId);
    } else {
        log.warn("###version install {} unsupported oldId:{} ##", datappId, oldId);
    }
}

// Build one regular expression that replaces all oldIds in a single pass
StringBuilder regexBuilder = new StringBuilder();
for (String oldId : idReplaceMap.keySet()) {
    regexBuilder.append(Pattern.quote(oldId)).append("|");
}

if (regexBuilder.length() > 0) {
    // Remove the trailing "|"
    regexBuilder.setLength(regexBuilder.length() - 1);
    Pattern pattern = Pattern.compile(regexBuilder.toString());
    Matcher matcher = pattern.matcher(resultString);

    // Use a StringBuffer to accumulate the replaced text
    StringBuffer sb = new StringBuffer();

    while (matcher.find()) {
        long startOneceCopyTime = System.currentTimeMillis();
        log.info("###version install {} copy {} start ", datappId,count++);
        String oldId = matcher.group();
        String newId = idReplaceMap.getOrDefault(oldId, oldId);
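        // quoteReplacement keeps '$' and '\' in newId from being read as group references
        matcher.appendReplacement(sb, Matcher.quoteReplacement(newId));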
        long endOneceCopyTime = System.currentTimeMillis();
        log.info("###version install {} copy end ,spend:{} ", datappId, endOneceCopyTime - startOneceCopyTime);
    }
    matcher.appendTail(sb);

    resultString = sb.toString();

}

After this optimization each step takes about 100 ms on average, so the whole job needs roughly 14 hours, which is still too long.
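
For reference, the single-pass regex idea can be condensed into a self-contained sketch like the one below. It leaves out the logging and the ID-length filter; the class name, method name, and sample IDs are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SinglePassRegexReplace {
    public static String replaceAllIds(String text, Map<String, String> idMap) {
        if (idMap.isEmpty()) {
            return text;
        }
        // Quote each ID so it is matched literally, then join them into one alternation.
        String regex = idMap.keySet().stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Matcher matcher = Pattern.compile(regex).matcher(text);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            String oldId = matcher.group();
            String newId = idMap.getOrDefault(oldId, oldId);
            // quoteReplacement keeps '$' and '\' in newId from being read as group references.
            matcher.appendReplacement(sb, Matcher.quoteReplacement(newId));
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> idMap = new LinkedHashMap<>();
        idMap.put("old0000001", "new$1"); // hypothetical IDs
        idMap.put("old0000002", "B");
        System.out.println(replaceAllIds("x old0000001 y old0000002 z", idMap));
    }
}
```

The text is scanned once, but the pattern still contains one alternative per ID.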

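Third optimization

Replace the IDs in a file by running a system command (sed). Each replacement takes about 500 ms on average, so 489,814 × 500 / 1000 = 244,907 s, or about 68 hours, which is slower than replacing the content directly in memory. However, this approach always operates on the same file, so the replacements can be run in parallel, which leads to the fourth optimization.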

String resultString = JSONObject.toJSONString(result);
com.sdata.foundation.core.util.StorageUtil.writeTextFile(resultString, datappIdFile.getPath(), "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON);
for (Map.Entry<String, Object> idEntry : newIdMap.entrySet()) {
    String oldId = idEntry.getKey();
    String newId = String.valueOf(idEntry.getValue());
    long startOneceCopyTime = System.currentTimeMillis();
    log.info("###version install {} copy start {},oldId:{},newId:{} ", datappId,count,oldId,newId);
    if (StringUtils.isEmpty(oldId) || StringUtils.isEmpty(newId) || oldId.length() < 10) {
        log.warn("###version install {} unsupported oldId:{} ##", datappId, oldId);
        continue;
    }

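    String filePath = datappIdFile.getPath() + File.separator + "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON;
    // Runtime.exec(String) does not go through a shell, so the single quotes would reach sed literally; pass the arguments separately instead
    String script = String.format("s/%s/%s/g", oldId, newId);
    log.info("Running command: sed -i {} {}", script, filePath);

    try {
        Process process = new ProcessBuilder("sed", "-i", script, filePath).start();
        process.waitFor(); // wait for the command to finish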
        log.info("Replaced oldId:{} with newId:{} in file", oldId, newId);
    } catch (IOException | InterruptedException e) {
        log.error("Error executing sed command", e);
    }
    long endOneceCopyTime = System.currentTimeMillis();
    log.info("###version install {} copy {} end ,spend:{} ", datappId,count,endOneceCopyTime-startOneceCopyTime);
    count++;
}
resultString = ImOrExportUtil.readFromFile(new File(datappIdFile.getPath() + File.separator + "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON));
result = JSONObject.parseObject(resultString,Map.class);

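Fourth optimization

Keep the approach from the third optimization but run it in parallel, using a thread pool with 10 threads. With multithreading enabled, CPU usage became so high that the system could no longer be accessed.

Limitations:

The file content is replaced with the Linux sed command, so this may not work when the service is deployed on another type of server that lacks the command. In addition, sed -i rewrites the file through a temporary copy, so concurrent runs against the same file can overwrite each other's replacements.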

                        // Copy the data and replace the IDs
                        long startCopyTime = System.currentTimeMillis();
                        log.info("###version install {},has {} resources, {} ids ready to copy", datappId,result.size(), newIdMap.size());
                        final int[] count = {0};
                        String resultString = JSONObject.toJSONString(result);
                        com.sdata.foundation.core.util.StorageUtil.writeTextFile(resultString, datappIdFile.getPath(), "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON);
                        int batchSize = 1000;  // Number of tasks per batch

                        ExecutorService executorService = Executors.newFixedThreadPool(5);  // You can adjust the pool size based on your system

                        List<Map.Entry<String, Object>> entries = new ArrayList<>(newIdMap.entrySet());  // Convert the entry set to a list
                        for (int i = 0; i < entries.size(); i += batchSize) {
                            // Process in batches of 1000
                            List<Map.Entry<String, Object>> batch = entries.subList(i, Math.min(i + batchSize, entries.size()));
                            log.info("####batch replace id,batchSize:{}###",batch.size());
                            for (Map.Entry<String, Object> idEntry : batch) {
                                executorService.submit(() -> {
                                    String oldId = idEntry.getKey();
                                    String newId = String.valueOf(idEntry.getValue());
//                                    long startOneceCopyTime = System.currentTimeMillis();
                                    if (StringUtils.isEmpty(oldId) || StringUtils.isEmpty(newId) || oldId.length() < 10) {
                                        log.warn("###version install {} unsupported oldId:{} ##", datappId, oldId);
                                        return;
                                    }
                                    String command = String.format("sed -i 's/%s/%s/g' %s", oldId, newId, datappIdFile.getPath() + File.separator + "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON);
                                    log.info("Running command: {}", command);

                                    try {
                                        shellCommandExecutor.exec(command);
                                    } catch (Exception e) {
                                        log.error("Error executing sed command", e);
                                    }
//                                    long endOneceCopyTime = System.currentTimeMillis();
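                                    // count[0] is updated by several threads without synchronization, so the logged value is only approximate
                                    log.info("#######completed {} task#######", count[0]);
                                    count[0] = count[0] +1;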
                                });
                            }
                        }
                        executorService.shutdown();
                        try {
                            // Wait for all tasks to complete
                            if (!executorService.awaitTermination(1, TimeUnit.HOURS)) {
                                log.warn("Some tasks did not finish within the time limit.");
                                executorService.shutdownNow();
                            }
                        } catch (InterruptedException e) {
                            log.error("Task execution interrupted.", e);
                            executorService.shutdownNow();
                        }


                        log.info("All tasks completed. Proceeding with the next steps...");
                        resultString = ImOrExportUtil.readFromFile(new File(datappIdFile.getPath() + File.separator + "allResource" + VersionControlConstants.PackageZipFileSuffix.JSON));
                        result = JSONObject.parseObject(resultString,Map.class);
                        long endCopyTime = System.currentTimeMillis();
                        log.info("###version install {} data copy end, time:{}ms",datappId, endCopyTime - startCopyTime);
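
The listing above counts finished tasks in a plain `int[]` that several pool threads update at once. One thread-safe option is an `AtomicInteger`, as in this minimal sketch. The class name, method name, and empty task body are placeholders for illustration, not code from the original project.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ProgressCounterSketch {
    /** Runs taskCount placeholder tasks on a fixed pool and returns how many completed. */
    public static int runTasks(int taskCount, int threads) {
        AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < taskCount; i++) {
            pool.submit(() -> {
                // Placeholder for the real per-ID work.
                completed.incrementAndGet(); // atomic increment, safe across threads
            });
        }
        pool.shutdown();
        try {
            // Wait for the submitted tasks; give up after one minute in this sketch.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runTasks(10_000, 5));
    }
}
```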

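Fifth optimization

Scan the text once in Java: at each position, try every old ID in the map, and append the new ID on a match or the current character otherwise.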

public  String replaceIds(String text, Map<String, Object> idMap) {
    StringBuilder result = new StringBuilder();
    int i = 0;

    // Walk through the whole text
    long startOneceCopyTime = System.currentTimeMillis();
    while (i < text.length()) {
        boolean replaced = false;

        // Try every old ID in the map at the current position
        for (Map.Entry<String, Object> entry : idMap.entrySet()) {
            String oldId = entry.getKey();
            String newId = String.valueOf(entry.getValue());

            // Check whether the old ID matches at the current position (startsWith avoids allocating a substring)
            if (text.startsWith(oldId, i)) {
                result.append(newId); // append the new ID
                i += oldId.length(); // skip past the matched old ID
                replaced = true;
                break; // stop checking other IDs once one matches
            }
        }

        // If no old ID matched, append the current character
        if (!replaced) {
            result.append(text.charAt(i));
            i++;
        }
        if(i%100000==0){
            long endOneceCopyTime = System.currentTimeMillis();
            log.info("#######completed {} task,spend:{}#######", i,endOneceCopyTime-startOneceCopyTime);
        }
    }

    return result.toString();
}

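Sixth optimization

Use a Trie (prefix tree). A Trie matches string prefixes efficiently, so there is no need to iterate over every ID at each character position. The 480,000-plus old IDs are stored in the Trie, and during replacement the text is matched against it character by character, which cuts lookup time and lowers the overall complexity.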

// Trie node definition
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    String newId; // set to the new ID on the node where a complete old ID ends
}

class Trie {
    private TrieNode root;
    private int matchedOldIdLength;  // length of the most recently matched old ID

    public Trie() {
        root = new TrieNode();
    }

    // Insert an old ID together with its new ID
    public void insert(String oldId, String newId) {
        TrieNode node = root;
        for (char c : oldId.toCharArray()) {
            node.children.putIfAbsent(c, new TrieNode());
            node = node.children.get(c);
        }
        node.newId = newId;
    }

    // Try to match an old ID starting at the given position in the text; returns the new ID, or null if none matches
    public String match(String text, int start) {
        TrieNode node = root;
        int i = start;
        matchedOldIdLength = 0;

        while (i < text.length() && node.children.containsKey(text.charAt(i))) {
            node = node.children.get(text.charAt(i));
            i++;
            matchedOldIdLength++;

            // A complete old ID has been matched, so return its new ID
            if (node.newId != null) {
                return node.newId;
            }
        }

        // No complete old ID was found
        matchedOldIdLength = 0;
        return null;
    }

    // Length of the old ID matched by the last call to match
    public int getMatchedOldIdLength() {
        return matchedOldIdLength;
    }
}
public  String replaceIds(String text, Trie trie) {
    StringBuilder result = new StringBuilder();
    int i = 0;
    long startOneceCopyTime = System.currentTimeMillis();
    while (i < text.length()) {
        // Check whether an old ID matches starting at the current position
        String newId = trie.match(text, i);

        if (newId != null) {
            // An old ID matched: replace it with the new ID
            result.append(newId);
            i += trie.getMatchedOldIdLength();  // advance the index past the old ID
        } else {
            // No old ID matched: keep the current character
            result.append(text.charAt(i));
            i++;
        }
        if(i%100000==0){
            long endOneceCopyTime = System.currentTimeMillis();
            log.info("#######completed {} task,spend:{}#######", i,endOneceCopyTime-startOneceCopyTime);
        }
    }

    return result.toString();
}
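
To try the Trie approach on a small input, the following self-contained sketch can be run directly. It is a variation and not the listing above: it keeps the longest matching old ID at each position, whereas the listing returns as soon as the first complete ID is found, and it carries the match length in local variables. The class name and sample IDs are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class TrieReplaceDemo {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String newId; // non-null only on the node where a complete old ID ends
    }

    // Build a trie whose paths spell the old IDs.
    static Node build(Map<String, String> idMap) {
        Node root = new Node();
        for (Map.Entry<String, String> e : idMap.entrySet()) {
            Node node = root;
            for (char c : e.getKey().toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.newId = e.getValue();
        }
        return root;
    }

    public static String replaceIds(String text, Map<String, String> idMap) {
        Node root = build(idMap);
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            int j = i;
            String match = null;
            int matchEnd = i;
            // Walk the trie as far as the text allows, remembering the longest complete ID seen.
            while (j < text.length() && (node = node.children.get(text.charAt(j))) != null) {
                j++;
                if (node.newId != null) {
                    match = node.newId;
                    matchEnd = j;
                }
            }
            if (match != null) {
                out.append(match); // emit the new ID and skip the old one
                i = matchEnd;
            } else {
                out.append(text.charAt(i)); // no ID starts here, copy the character
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> idMap = new LinkedHashMap<>();
        idMap.put("1234567890", "A");  // hypothetical IDs
        idMap.put("12345678901", "B");
        System.out.println(replaceIds("x1234567890y12345678901z", idMap));
    }
}
```

Each text position costs at most one walk down the trie, bounded by the longest ID, instead of one comparison per ID.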