49. Advanced Applications of Java Regular Expressions and Version Differences

bean

Licensed under CC 4.0 BY-SA

Column: Mastering Regular Expressions: The Art and Science of Text Processing. Tags: Java, regular expressions, advanced applications

Article link: https://blog.youkuaiyun.com/bean/article/details/149385789

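1. Basic Java Regular Expression Methods

1.1 The `quote` Method

quote is a static method added in Java 1.5. It returns a string suitable for use as the regular-expression argument to Pattern.compile, one that matches the literal text passed in. For example: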

Pattern.quote("main()");

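This call returns the string \Qmain()\E. Used as a regular expression, the \Q...\E span quotes everything between the two markers, so the pattern matches the original argument main() literally.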
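A minimal sketch of where quote tends to be useful: embedding arbitrary literal text inside a larger regular expression. The class name QuoteDemo and the helper containsCall are hypothetical names chosen for this illustration.

```java
import java.util.regex.Pattern;

final class QuoteDemo {
    // Hypothetical helper: true if `text` contains `literal` followed by
    // optional whitespace and a semicolon.
    static boolean containsCall(String literal, String text) {
        // Pattern.quote wraps the literal in \Q...\E, so characters such as
        // ( ) . * lose their special meaning inside the larger expression.
        Pattern p = Pattern.compile(Pattern.quote(literal) + "\\s*;");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(Pattern.quote("main()"));
        System.out.println(containsCall("main()", "x = main() ;"));
        System.out.println(containsCall("main()", "main ;"));
    }
}
```

Without the quote call, the parentheses in main() would form an empty capturing group, and the second input ("main ;") would match as well.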

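1.2 The `matches` Method

matches is also a static method. It returns a boolean indicating whether the regular expression matches the entire text. In essence, Pattern.matches(regex, text) is equivalent to Pattern.compile(regex).matcher(text).matches(). For example: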

boolean result = Pattern.matches("\\d+", "123");

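If you need to pass compile options, or want more information than whether the match succeeded, compile a Pattern and work with a Matcher explicitly instead. Likewise, if the check runs many times (in a loop or other frequently called code), precompiling the regular expression into a Pattern object is more efficient.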
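The precompiled alternative could look roughly like this. The class PrecompiledDemo and the method countAllDigits are hypothetical; the sketch also reuses a single Matcher through reset, which is optional.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrecompiledDemo {
    // Compiled once and reused; Pattern objects are immutable and can be shared.
    private static final Pattern ALL_DIGITS = Pattern.compile("\\d+");

    // Counts how many inputs consist entirely of digits.
    static int countAllDigits(List<String> inputs) {
        Matcher m = ALL_DIGITS.matcher(""); // one matcher, re-targeted with reset()
        int count = 0;
        for (String s : inputs) {
            if (m.reset(s).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countAllDigits(Arrays.asList("123", "12a", "", "7")));
    }
}
```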

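1.3 The `split` Method

1.3.1 Single-Argument `split`

split takes text as a CharSequence and returns an array of strings: the pieces of the text separated by matches of the pattern's regular expression. For example: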

String[] result = Pattern.compile("\\.").split("209.204.146.22");

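This returns an array of four strings, {"209", "204", "146", "22"}, separated by the three matches of \. in the text. The method can split on any regular expression, not only on a single character. For example, to split a string into "words" on runs of non-alphanumeric characters:

String[] result = Pattern.compile("\\W+").split("What's up, Doc");

The result is {"What", "s", "up", "Doc"}. For non-ASCII text, \W is usually too narrow, because by default it treats only ASCII letters, digits, and the underscore as word characters; \P{L}+ or [^\p{L}\p{N}]+ is likely a better choice.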
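To make the difference concrete, here is a small sketch that splits the same accented text both ways. The class and method names are hypothetical, and the accented letters are written as Unicode escapes so the source file encoding does not matter.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class UnicodeSplitDemo {
    private static final Pattern ASCII_NON_WORD = Pattern.compile("\\W+");
    private static final Pattern NON_LETTER_DIGIT = Pattern.compile("[^\\p{L}\\p{N}]+");

    // Splits on runs of characters that are not ASCII word characters.
    static String[] splitAscii(String text) {
        return ASCII_NON_WORD.split(text);
    }

    // Splits on runs of characters that are neither Unicode letters nor Unicode numbers.
    static String[] splitUnicode(String text) {
        return NON_LETTER_DIGIT.split(text);
    }

    public static void main(String[] args) {
        String text = "na\u00EFve caf\u00E9, 42"; // accented letters as escapes
        System.out.println(Arrays.toString(splitAscii(text)));   // accented letters act as separators
        System.out.println(Arrays.toString(splitUnicode(text))); // accented letters stay inside words
    }
}
```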

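1.3.2 Empty Elements from Adjacent Matches

If the regular expression matches at the very start of the text, the first string returned by split is empty. Likewise, if the regular expression matches twice in a row, the zero-length text "separated" by the adjacent matches comes back as an empty string. For example:

String[] result = Pattern.compile("\\s*,\\s*").split(", one, two , ,, 3");

This splits on each comma and any surrounding whitespace, returning an array of six strings: {"", "one", "two", "", "", "3"}. Finally, any empty strings at the end of the list are suppressed. For example: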

String[] result = Pattern.compile(":").split(":xx:");

This produces only two strings: {"", "xx"}. To keep trailing empty elements, use the two-argument version of split.

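1.3.3 Two-Argument `split`

The two-argument split gives control over how many times the pattern is applied and over what happens to trailing empty elements. The meaning of the limit argument depends on its value:
- limit less than 0: trailing empty elements are kept in the array. For example: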

String[] result = Pattern.compile(":").split(":xx:", -1);

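This returns an array of three strings: {"", "xx", ""}.
- limit equal to 0: the same as not specifying limit at all, so trailing empty elements are suppressed.
- limit greater than 0: split returns an array of at most limit elements, which means the regular expression is applied at most limit - 1 times. For example, to isolate the three name components from the string Friedl,Jeffrey,Eric Francis,America,Ohio,Rootstown, split it into four parts: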

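String[] nameInfo = Pattern.compile(",").split("Friedl,Jeffrey,Eric Francis,America,Ohio,Rootstown", 4);

The first three elements hold the name components, and the fourth holds the unsplit remainder, America,Ohio,Rootstown. Limiting the split this way is more efficient because it avoids matching and creating strings that are never used.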
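The three cases of limit can be compared side by side with a short sketch. SplitLimitDemo and its helper names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class SplitLimitDemo {
    private static final Pattern COLON = Pattern.compile(":");
    private static final Pattern COMMA = Pattern.compile(",");

    // Splits on ':' with the given limit.
    static String[] splitColons(String text, int limit) {
        return COLON.split(text, limit);
    }

    // Splits a record into at most four parts: three name fields plus the remainder.
    static String[] nameParts(String record) {
        return COMMA.split(record, 4);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitColons(":xx:", -1))); // trailing empty element kept
        System.out.println(Arrays.toString(splitColons(":xx:", 0)));  // trailing empty element dropped
        System.out.println(Arrays.toString(splitColons(":xx:", 2)));  // pattern applied only once
        System.out.println(Arrays.toString(
            nameParts("Friedl,Jeffrey,Eric Francis,America,Ohio,Rootstown")));
    }
}
```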

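2. Advanced Application Examples

2.1 Adding Width and Height Attributes to Image Tags

This example shows a fairly advanced in-place search-and-replace that updates HTML so that every image tag has both WIDTH and HEIGHT attributes. Because the code calls html.insert, html must be a mutable character sequence such as a StringBuilder. The code is as follows: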

// Matcher for isolating <img> tags
Matcher mImg = Pattern.compile("(?id)<IMG\\s+(.+?)/?>").matcher(html);
// Matchers that isolate the SRC, WIDTH, and HEIGHT attributes within a tag (with very naïve regexes)
Matcher mSrc = Pattern.compile("(?ix)\\bSRC =(\\S+)").matcher(html);
Matcher mWidth = Pattern.compile("(?ix)\\bWIDTH =(\\S+)").matcher(html);
Matcher mHeight = Pattern.compile("(?ix)\\bHEIGHT=(\\S+)").matcher(html);
int imgMatchPointer = 0; // The first search begins at the start of the string
while (mImg.find(imgMatchPointer)) {
    imgMatchPointer = mImg.end(); // Next image search starts from where this one ended
    // Look for our attributes within the body of the just-found image tag
    boolean hasSrc = mSrc.region(mImg.start(1), mImg.end(1)).find();
    boolean hasHeight = mHeight.region(mImg.start(1), mImg.end(1)).find();
    boolean hasWidth = mWidth.region(mImg.start(1), mImg.end(1)).find();
    // If we have a SRC attribute, but are missing WIDTH and/or HEIGHT...
    if (hasSrc && (!hasWidth || !hasHeight)) {
        java.awt.image.BufferedImage i = javax.imageio.ImageIO.read(new java.net.URL(mSrc.group(1)));
        String size; // Will hold the missing WIDTH and/or HEIGHT attributes
        if (hasWidth) {
            // We’re told the width, so compute the height that maintains the proper aspect ratio
            size = "height='" + (int)(Integer.parseInt(mWidth.group(1)) * i.getHeight() / i.getWidth()) + "' ";
        } else if (hasHeight) {
            // We’re told the height, so compute the width that maintains the proper aspect ratio
            size = "width='" + (int)(Integer.parseInt(mHeight.group(1)) * i.getWidth() / i.getHeight()) + "' ";
        } else {
            // We’re told neither, so just insert the actual size
            size = "width='" + i.getWidth() + "' " + "height='" + i.getHeight() + "' ";
        }
        html.insert(mImg.start(1), size); // Update the HTML in place
        imgMatchPointer += size.length(); // Account for the new text in mImg’s eyes
    }
}

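Note that, to keep the focus on the in-place search and replace, this example makes several simplifying assumptions about the HTML it processes. The regular expressions do not allow whitespace around an attribute's equals sign, and they do not allow quoted attribute values. The code also ignores relative URLs, malformed URLs, and any exceptions the image-fetching code may throw.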
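One way the first two assumptions might be relaxed is sketched below. It assumes attribute values are double-quoted, single-quoted, or bare, and it is not part of the original example; AttrDemo and attributeValue are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttrDemo {
    // Hypothetical helper: returns the value of attribute `name` found in the
    // body of a tag, or null if the attribute is absent.
    static String attributeValue(String tagBody, String name) {
        Pattern p = Pattern.compile(
            "(?i)(?<![\\w-])" + Pattern.quote(name) + "\\s*=\\s*" + // name, optional spaces around '='
            "(?:\"([^\"]*)\"|'([^']*)'|([^\\s>\"']+))");           // "value" | 'value' | bare value
        Matcher m = p.matcher(tagBody);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the three groups takes part in a successful match.
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String body = "src = \"pic.png\" WIDTH='10'";
        System.out.println(attributeValue(body, "src"));
        System.out.println(attributeValue(body, "width"));
        System.out.println(attributeValue(body, "height"));
    }
}
```

This is still a naive approach: it can be fooled by attribute-like text inside another attribute's quoted value, so a real HTML parser is the safer choice for untrusted input.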

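2.2 Validating HTML with Multiple Patterns

The following Java program validates a subset of HTML. It uses the usePattern method to change a matcher's pattern on the fly, which lets several patterns, each beginning with \G, take turns working through the string: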

Pattern pAtEnd = Pattern.compile("\\G\\z");
Pattern pWord = Pattern.compile("\\G\\w+");
Pattern pNonHtml = Pattern.compile("\\G[^\\w<>&]+");
Pattern pImgTag = Pattern.compile("\\G(?i)<img\\s+([^>]+)>");
Pattern pLink = Pattern.compile("\\G(?i)<A\\s+([^>]+)>");
Pattern pLinkX = Pattern.compile("\\G(?i)</A>");
Pattern pEntity = Pattern.compile("\\G&(#\\d+|\\w+);");
boolean needClose = false;
Matcher m = pAtEnd.matcher(html); // Any Pattern object can create our Matcher object
while (!m.usePattern(pAtEnd).find()) {
    if (m.usePattern(pWord).find()) {
        // have a word or number in m.group() -- can now check for profanity, etc...
    } else if (m.usePattern(pImgTag).find()) {
        // have an image tag -- can check that it’s appropriate...
    } else if (!needClose && m.usePattern(pLink).find()) {
        // have a link anchor -- can validate it...
        needClose = true;
    } else if (needClose && m.usePattern(pLinkX).find()) {
        System.out.println("/LINK [" + m.group() + "]");
        needClose = false;
    } else if (m.usePattern(pEntity).find()) {
        // Allow entities like &gt; and &#123;
    } else if (m.usePattern(pNonHtml).find()) {
        // Other (non-word) non-HTML stuff -- simply allow it
    } else {
        // Nothing matched at this point, so it must be an error. Grab a dozen or so characters
        // at our current location so that we can issue an informative error message
        m.usePattern(Pattern.compile("\\G(?s).{1,12}")).find();
        System.out.println("Bad char before '" + m.group() + "'");
        System.exit(1);
    }
}
if (needClose) {
    System.out.println("Missing Final </A>");
    System.exit(1);
}

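Because of a bug in java.util.regex in the Java versions covered here, an attempt to match the "non-HTML" pattern "consumes" one character of the target text even when the attempt fails. That is why the non-HTML check has been moved to the end of the chain.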
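The relay technique itself can be seen in a reduced, self-contained form. The sketch below uses hypothetical token patterns (words and whitespace only) and a hypothetical class name; it does not attempt to reproduce or work around the bug mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UsePatternDemo {
    private static final Pattern AT_END = Pattern.compile("\\G\\z");
    private static final Pattern WORD = Pattern.compile("\\G\\w+");
    private static final Pattern SPACE = Pattern.compile("\\G\\s+");

    // Tokenizes text into words and whitespace runs with one Matcher whose
    // pattern is swapped by usePattern; \G anchors each attempt to the end
    // of the previous successful match.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = AT_END.matcher(text);
        while (!m.usePattern(AT_END).find()) {
            if (m.usePattern(WORD).find()) {
                tokens.add("WORD:" + m.group());
            } else if (m.usePattern(SPACE).find()) {
                tokens.add("SPACE");
            } else {
                throw new IllegalArgumentException("Unexpected character in input");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ab 12"));
    }
}
```

usePattern discards group information from the previous match but keeps the matcher's position in the input, which is what allows the patterns to hand off to one another.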

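2.3 Using the One-Argument `find` to Control the Match Position

To work around the position errors caused by this bug, you can track the "current location" yourself and use the one-argument form of find to start each match attempt at the right place: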

Pattern pWord = Pattern.compile("\\G\\w+");
Pattern pNonHtml = Pattern.compile("\\G[^\\w<>&]+");
Pattern pImgTag = Pattern.compile("\\G(?i)<img\\s+([^>]+)>");
Pattern pLink = Pattern.compile("\\G(?i)<A\\s+([^>]+)>");
Pattern pLinkX = Pattern.compile("\\G(?i)</A>");
Pattern pEntity = Pattern.compile("\\G&(#\\d+|\\w+);");
boolean needClose = false;
Matcher m = pWord.matcher(html); // Any Pattern object can create our Matcher object
int currentLoc = 0; // Begin at the start of the string
while (currentLoc < html.length()) {
    if (m.usePattern(pWord).find(currentLoc)) {
        // have a word or number in m.group() -- can now check for profanity, etc...
    } else if (m.usePattern(pNonHtml).find(currentLoc)) {
        // Other (non-word) non-HTML stuff -- simply allow it
    } else if (m.usePattern(pImgTag).find(currentLoc)) {
        // have an image tag -- can check that it’s appropriate...
    } else if (!needClose && m.usePattern(pLink).find(currentLoc)) {
        // have a link anchor -- can validate it...
        needClose = true;
    } else if (needClose && m.usePattern(pLinkX).find(currentLoc)) {
        System.out.println("/LINK [" + m.group() + "]");
        needClose = false;
    } else if (m.usePattern(pEntity).find(currentLoc)) {
        // Allow entities like &gt; and &#123;
    } else {
        // Nothing matched at this point, so it must be an error. Grab a dozen or so characters
        // at our current location so that we can issue an informative error message
        m.usePattern(Pattern.compile("\\G(?s).{1,12}")).find(currentLoc);
        System.out.println("Bad char at '" + m.group() + "'");
        System.exit(1);
    }
    currentLoc = m.end(); // The ‘current location’ is now where the previous match ended
}
if (needClose) {
    System.out.println("Missing Final </A>");
    System.exit(1);
}

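This approach uses the form of find that resets the matcher, and a reset also discards any region that was set. If a region must be honored, re-establish it yourself before each attempt, for example by calling region(currentLoc, regionEnd) and then the no-argument find.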
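A region-aware variant might be sketched as follows. It assumes the approach of calling region before every attempt and then the no-argument find; the class name, token patterns, and the start and end parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegionTokenDemo {
    private static final Pattern WORD = Pattern.compile("\\G\\w+");
    private static final Pattern SPACE = Pattern.compile("\\G\\s+");

    // Tokenizes only text[start, end). The region is re-established before
    // every attempt, so each search begins at currentLoc and stops at end.
    static List<String> tokenize(String text, int start, int end) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        int currentLoc = start;
        while (currentLoc < end) {
            if (m.usePattern(WORD).region(currentLoc, end).find()
                    || m.usePattern(SPACE).region(currentLoc, end).find()) {
                tokens.add(m.group());
                currentLoc = m.end(); // offsets are relative to the whole text
            } else {
                throw new IllegalArgumentException("Unexpected character at offset " + currentLoc);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("xx ab 12 yy", 3, 8));
        System.out.println(tokenize("abcdef", 1, 4));
    }
}
```

Calling region resets the matcher, so the following find starts at the region's first character and \G matches there.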

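2.4 Parsing Comma-Separated Values (CSV) Text

The following is a java.util.regex version of a CSV parser. It uses possessive quantifiers in place of atomic grouping, which keeps the expression tidier: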

String regex = // Puts a double-quoted field into group(1), an unquoted field into group(2)
    "\\G(?:^|,)                              \n" +
    "(?:                                     \n" +
    "   # Either a double-quoted field...    \n" +
    "   \" # field's opening quote           \n" +
    "   ( [^\"]*+ (?: \"\" [^\"]*+ )*+ )     \n" +
    "   \" # field's closing quote           \n" +
    " | # ... or ...                         \n" +
    "   # some non-quote/non-comma text...   \n" +
    "   ( [^\",]*+ )                         \n" +
    ")                                       \n";
// Each line ends with \n so that a # comment stops at the end of its own line.
// Create a matcher for the CSV line of text, using the regex above.
Matcher mMain = Pattern.compile(regex, Pattern.COMMENTS).matcher(line);
// Create a matcher for "", with dummy text for the time being.
Matcher mQuote = Pattern.compile("\"\"").matcher("");
while (mMain.find()) {
    String field;
    if (mMain.start(2) >= 0) {
        field = mMain.group(2); // The field is unquoted, so we can use it as is.
    } else {
        // The field is quoted, so we must replace paired double quotes with one double quote.
        field = mQuote.reset(mMain.group(1)).replaceAll("\"");
    }
    // We can now work with field...
    System.out.println("Field [" + field + "]");
}

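This version is more efficient than a more straightforward Java version for two reasons: the regular expression itself is more efficient, and a single matcher is reused through its reset method, rather than a new matcher being created and discarded for every field.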
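For experimenting with the parser, the loop can be wrapped in a method that returns the fields of one line. This is a sketch under stated assumptions: it uses ^|, as the field-start alternation, *+ possessive quantifiers, and newline-terminated regex lines so that COMMENTS mode works. CsvDemo and parse are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CsvDemo {
    private static final Pattern FIELD = Pattern.compile(
        "\\G(?:^|,)                           \n" +
        "(?:                                  \n" +
        "   \" # opening quote                \n" +
        "   ( [^\"]*+ (?: \"\" [^\"]*+ )*+ )  \n" +
        "   \" # closing quote                \n" +
        " |                                   \n" +
        "   ( [^\",]*+ ) # unquoted text      \n" +
        ")                                    \n",
        Pattern.COMMENTS);
    private static final Pattern DOUBLED_QUOTE = Pattern.compile("\"\"");

    // Returns the fields of one CSV line; quoted fields have "" collapsed to ".
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        Matcher mMain = FIELD.matcher(line);
        Matcher mQuote = DOUBLED_QUOTE.matcher(""); // reused via reset()
        while (mMain.find()) {
            if (mMain.start(2) >= 0) {
                fields.add(mMain.group(2)); // unquoted field, usable as is
            } else {
                fields.add(mQuote.reset(mMain.group(1)).replaceAll("\""));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("a,\"b \"\"q\"\" c\",,d"));
    }
}
```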

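3. Java Version Differences

3.1 Differences Between Java 1.4.2 and 1.5.0

3.1.1 New Methods

Java 1.5.0 added many methods, mostly to support the new concept of a matcher region. The region-related matcher methods missing from Java 1.4.2 are: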
- region
- regionStart
- regionEnd
- useAnchoringBounds
- hasAnchoringBounds
- useTransparentBounds
- hasTransparentBounds

The other matcher methods missing from Java 1.4.2 are:
- toMatchResult
- hitEnd
- requireEnd
- usePattern
- toString

The static method Pattern.quote is also absent from Java 1.4.2.
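As a brief illustration of one of these additions, toMatchResult captures a snapshot of the current match that stays valid after the matcher moves on. MatchResultDemo and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchResultDemo {
    // Collects a snapshot of every match found in the text.
    static List<MatchResult> findAll(Pattern p, CharSequence text) {
        List<MatchResult> results = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            results.add(m.toMatchResult()); // unaffected by later find() calls
        }
        return results;
    }

    public static void main(String[] args) {
        Pattern keyValue = Pattern.compile("(\\w+)=(\\d+)");
        for (MatchResult r : findAll(keyValue, "a=1, bb=22")) {
            System.out.println(r.group(1) + " -> " + r.group(2) + " at offset " + r.start());
        }
    }
}
```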

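3.1.2 Differences in Unicode Support

Unicode support changed between Java 1.4.2 and 1.5.0 as follows:
- Version upgrade: Unicode support moved from Unicode Version 3.0.0 in Java 1.4.2 to Unicode Version 4.0.0 in Java 1.5.0. This affects character definitions and properties as well as the definitions of Unicode blocks.
- More ways to reference block names: Java 1.5.0 added two forms of block-name reference. One prefixes In to the official block name, as in \p{InHangul Jamo}; the other prefixes In to the Java-identifier form of the block name, as in \p{InHangul_Jamo}.
- Bug fix: Java 1.5.0 fixed an odd bug in Java 1.4.2 that required the Arabic Presentation Forms-B and Latin Extended-B blocks to be referenced with the trailing B written as Bound, that is, as \p{InArabicPresentationForms-Bound} and \p{InLatinExtended-Bound}.
- New support: Java 1.5.0 added regular-expression support for the isSomething methods of Java's Character class.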
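Two of these features can be tried out with a short sketch: \p{javaLowerCase}, which corresponds to Character.isLowerCase, and the identifier-style block reference \p{InHangul_Jamo}. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class UnicodePropsDemo {
    // \p{javaLowerCase} matches characters for which Character.isLowerCase is true.
    private static final Pattern JAVA_LOWER = Pattern.compile("\\p{javaLowerCase}+");
    // Block reference using the Java-identifier form of the block name.
    private static final Pattern HANGUL_JAMO = Pattern.compile("\\p{InHangul_Jamo}");

    static boolean isAllLowerCase(String s) {
        return JAVA_LOWER.matcher(s).matches();
    }

    static boolean containsHangulJamo(String s) {
        return HANGUL_JAMO.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isAllLowerCase("caf\u00E9"));   // e-acute is a lowercase letter
        System.out.println(isAllLowerCase("aBc"));
        System.out.println(containsHangulJamo("\u1100"));  // U+1100 lies in the Hangul Jamo block
    }
}
```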

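3.2 Differences Between Java 1.5.0 and 1.6

Java 1.6 (described here as of its second beta release) has only two small regex-related changes relative to Java 1.5.0:
- New support: Java 1.6 added support for the Pi and Pf Unicode categories.
- Fix: Java 1.6 fixed the \Q...\E construct so that it also works reliably inside a character class.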
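For reference, Pi and Pf are the Unicode categories for initial and final quotation punctuation. A minimal sketch with hypothetical class and method names, using escapes for the curly quotes and guillemets:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteCategoryDemo {
    private static final Pattern INITIAL_QUOTE = Pattern.compile("\\p{Pi}");
    private static final Pattern FINAL_QUOTE = Pattern.compile("\\p{Pf}");

    private static int count(Pattern p, String s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    static int countInitialQuotes(String s) {
        return count(INITIAL_QUOTE, s);
    }

    static int countFinalQuotes(String s) {
        return count(FINAL_QUOTE, s);
    }

    public static void main(String[] args) {
        // Left/right double quotation marks and left/right guillemets.
        String text = "\u201CHi\u201D \u00ABx\u00BB";
        System.out.println(countInitialQuotes(text));
        System.out.println(countFinalQuotes(text));
    }
}
```

The plain ASCII double quote belongs to neither category, so it is not counted by either method.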

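In short, Java's regular expression support has continued to evolve from version to version, and developers need to choose methods according to their requirements and the Java version they target. Understanding these advanced applications and version differences makes it easier to put Java regular expressions to work on real problems.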

Summary

The following simple flowchart shows the flow of the multi-pattern HTML validation:

graph TD;
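    A[Start] --> B{Matched end of input};
    B -- No --> C{Matched a word};
    C -- Yes --> D[Handle word];
    C -- No --> E{Matched an image tag};
    E -- Yes --> F[Handle image tag];
    E -- No --> G{Matched a link opening tag};
    G -- Yes --> H[Handle link opening tag];
    G -- No --> I{Matched a link closing tag};
    I -- Yes --> J[Handle link closing tag];
    I -- No --> K{Matched an entity};
    K -- Yes --> L[Handle entity];
    K -- No --> M{Matched non-HTML content};
    M -- Yes --> N[Handle non-HTML content];
    M -- No --> O[Print error message and exit];
    D --> B;
    F --> B;
    H --> B;
    J --> B;
    L --> B;
    N --> B;
    B -- Yes --> P[End];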

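I hope this helps you get a better grip on using Java regular expressions and on the differences between versions. If you have any questions or need further help, feel free to leave a comment.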

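4. Practical Considerations and Optimization Tips

4.1 Regular Expression Performance

In real applications, regular expression performance can matter a great deal. Some suggestions:
- Precompile regular expressions: as noted earlier, when a regular expression is used many times it should be precompiled into a Pattern object. For example: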

Pattern pattern = Pattern.compile("\\d+");
Matcher matcher = pattern.matcher("123");
boolean isMatch = matcher.matches();

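Compiling once and reusing the Pattern avoids recompiling the expression on every use.
- Use appropriate quantifiers: choosing the right quantifier reduces needless backtracking. A possessive quantifier (such as *+ or ++) never gives back what it has matched, which can speed up matching. The CSV parsing example uses possessive quantifiers:

String regex = // Compile with Pattern.COMMENTS
    "\\G(?:^|,)                          \n" +
    "(?:                                 \n" +
    "   \" # field's opening quote       \n" +
    "   ( [^\"]*+ (?: \"\" [^\"]*+ )*+ ) \n" +
    "   \" # field's closing quote       \n" +
    " |                                  \n" +
    "   ( [^\",]*+ )                     \n" +
    ")                                   \n";

- Avoid complex nested structures: deeply nested constructs make a regular expression harder to read and slower to match. Keep the structure as simple as possible so that it stays easy to understand and maintain.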
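Possessive quantifiers change what a pattern can match, not only how fast it runs, so they need some care. The following sketch (hypothetical class and method names) contrasts a greedy and a possessive quantifier on the same input.

```java
import java.util.regex.Pattern;

final class PossessiveDemo {
    private static final Pattern GREEDY = Pattern.compile("\\d+5");
    private static final Pattern POSSESSIVE = Pattern.compile("\\d++5");

    // Greedy \d+ first takes every digit, then backtracks one so the final 5 can match.
    static boolean greedyMatches(String s) {
        return GREEDY.matcher(s).matches();
    }

    // Possessive \d++ takes every digit and never gives one back, so the final 5 cannot match.
    static boolean possessiveMatches(String s) {
        return POSSESSIVE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(greedyMatches("12345"));
        System.out.println(possessiveMatches("12345"));
    }
}
```

A possessive quantifier is safe when whatever follows it cannot match anything the quantifier consumes, as with [^"]*+ followed by a double quote in the CSV expression.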

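4.2 Exception Handling

Advanced uses of Java regular expressions often sit next to code that can fail, so exception handling is essential. In the example that adds width and height attributes to image tags, for instance, fetching the image may fail with a network error or an image-decoding error: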

try {
    java.awt.image.BufferedImage i = javax.imageio.ImageIO.read(new java.net.URL(mSrc.group(1)));
    // Process the image
} catch (java.io.IOException e) {
    System.err.println("Failed to read image: " + e.getMessage());
}

Catching exceptions and handling them appropriately makes the program more robust.
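One possible way to package this is a helper that reports failure through its return value, so the calling loop can simply skip the tag. This is only a sketch: ImageSizeDemo and imageSize are hypothetical names, and returning null on failure is an assumption about what the caller wants.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.net.URL;
import javax.imageio.ImageIO;

final class ImageSizeDemo {
    // Hypothetical helper: returns {width, height}, or null if the image
    // cannot be fetched or decoded.
    static int[] imageSize(String src) {
        try {
            // A malformed URL raises MalformedURLException, a subclass of IOException.
            BufferedImage img = ImageIO.read(new URL(src));
            if (img == null) {
                return null; // ImageIO.read returns null when no reader can decode the data
            }
            return new int[] { img.getWidth(), img.getHeight() };
        } catch (IOException e) {
            System.err.println("Failed to read image " + src + ": " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(imageSize("not a url") == null);
    }
}
```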

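4.3 Compatibility

Because regex support differs between Java versions, compatibility has to be considered during development. For example, if an application must run on Java 1.4.2, it cannot use the methods added in Java 1.5.0. The version can be checked like this: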

String javaVersion = System.getProperty("java.version");
if (javaVersion.startsWith("1.4")) {
    // Use code compatible with Java 1.4.2
} else {
    // Use code for Java 1.5.0 or later
}
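Version strings are fragile to parse, and since Java 9 they no longer begin with "1." (for example "11.0.2" or "17"). An alternative is to test for the method itself by reflection. The sketch below illustrates the idea only: it is written for a modern JDK (it uses generics and varargs, which Java 1.4 lacks), and RegexFeatureCheck and matcherHas are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexFeatureCheck {
    // True if the running JVM's Matcher class has a public method with this
    // name and these parameter types.
    static boolean matcherHas(String methodName, Class<?>... parameterTypes) {
        try {
            Matcher.class.getMethod(methodName, parameterTypes);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(matcherHas("usePattern", Pattern.class));
        System.out.println(matcherHas("region", int.class, int.class));
        System.out.println(matcherHas("noSuchMethod"));
    }
}
```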

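5. Frequently Asked Questions

5.1 How to Handle Special Characters

Some characters have special meaning in a regular expression, such as ., *, and +. To match one of these characters literally, escape it with a backslash. For example, to match a literal period, use the regex \. which is written "\\." in a Java string literal. Alternatively, use Pattern.quote to handle a string that contains special characters: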

String text = "main()";
String quotedText = Pattern.quote(text);
Pattern pattern = Pattern.compile(quotedText);
Matcher matcher = pattern.matcher(text);
boolean isMatch = matcher.matches();

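5.2 How to Handle Multi-Line Text

By default, ^ and $ match only at the beginning and end of the whole string (with $ also matching just before a final line terminator). To match at the start and end of every line in multi-line text, use the Pattern.MULTILINE flag: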

String text = "line1\nline2\nline3";
Pattern pattern = Pattern.compile("^line", Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println("Found: " + matcher.group());
}

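5.3 How to Make Regular Expressions More Readable

Regular expressions tend to be dense. To make them more readable, add comments and spread the expression over several lines. The CSV parsing example, for instance, uses the Pattern.COMMENTS flag to allow comments and whitespace inside the expression. Each line of the string ends with \n so that a # comment stops at the end of its own line:

String regex =
    "\\G(?:^|,)                              \n" +
    "(?:                                     \n" +
    "   # Either a double-quoted field...    \n" +
    "   \" # field's opening quote           \n" +
    "   ( [^\"]*+ (?: \"\" [^\"]*+ )*+ )     \n" +
    "   \" # field's closing quote           \n" +
    " | # ... or ...                         \n" +
    "   # some non-quote/non-comma text...   \n" +
    "   ( [^\",]*+ )                         \n" +
    ")                                       \n";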
Pattern pattern = Pattern.compile(regex, Pattern.COMMENTS);

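6. Summary and Outlook

6.1 Summary

This article covered the basic methods of Java regular expressions, several advanced applications, and the differences between Java versions. With methods such as quote, matches, and split, developers can match and split strings conveniently. The advanced examples showed how regular expressions solve practical problems: adding attributes to image tags, validating HTML, and parsing CSV text. Knowing how regex support differs between Java versions also helps in choosing suitable versions and methods.

6.2 Outlook

As Java continues to evolve, its regular expression support may be enhanced further, perhaps with support for more Unicode categories or more efficient matching algorithms. Developers can follow the official Java documentation and technical forums to keep up with new developments, and apply regular expressions flexibly as their needs dictate.

The following flowchart shows the steps for optimizing regular expression performance: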

graph TD;
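    A[Start] --> B{Used many times};
    B -- Yes --> C[Precompile the regular expression];
    B -- No --> D{Is the regular expression complex};
    C --> D;
    D -- Yes --> E[Simplify the structure and use suitable quantifiers];
    D -- No --> F[Use as is];
    E --> F;
    F --> G[End];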

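I hope this article has given you a deeper understanding of Java regular expressions and that you can apply it to real problems in your own development work. If you run into any problems along the way, feel free to get in touch.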