java.util.regex.Pattern

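This article takes an in-depth look at regular expressions in Java, from basic syntax, quantifiers, groups, and metacharacters to advanced features such as zero-width assertions, backtracking, and non-greedy matching, with examples showing how to use regular expressions efficiently for string operations.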

Patterns are compiled regular expressions. In many cases, convenience methods such as String.matches, String.replaceAll and String.split will be preferable, but if you need to do a lot of work with the same regular expression, it may be more efficient to compile it once and reuse it. The Pattern class and its companion, Matcher, also offer more functionality than the small amount exposed by String.

// String convenience methods:
boolean sawFailures = s.matches("Failures: \\d+");
String farewell = s.replaceAll("Hello, (\\S+)", "Goodbye, $1");
String[] fields = s.split(":");

// Direct use of Pattern:
Pattern p = Pattern.compile("Hello, (\\S+)");
Matcher m = p.matcher(inputString);
while (m.find()) { // Find each match in turn; String can't do this.
    String name = m.group(1); // Access a submatch group; String can't do this.
}

Regular expression syntax
Java supports a subset of Perl 5 regular expression syntax. An important gotcha is that Java has no regular expression literals, and uses plain old string literals instead. This means that you need an extra level of escaping. For example, the regular expression \s+ has to be represented as the string "\\s+".
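
The extra level of escaping is easy to get wrong, so here is a minimal sketch of it. The class and method names (EscapingDemo, splitOnWhitespace, containsBackslash) are made up for illustration; only Pattern and its methods come from the library.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // The regex \s+ is written "\\s+" in Java source: the compiler turns
    // each doubled backslash into one, and the regex engine sees \s+.
    static String[] splitOnWhitespace(String input) {
        return Pattern.compile("\\s+").split(input);
    }

    // A regex that matches one literal backslash is \\ which takes
    // four backslashes in a Java string literal.
    static boolean containsBackslash(String input) {
        return Pattern.compile("\\\\").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(splitOnWhitespace("a  b\tc").length); // prints 3
        System.out.println(containsBackslash("C:\\temp"));       // prints true
    }
}
```

Counting backslashes in pairs is a useful habit: every backslash the regex engine should see costs two in the source file.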

Escape sequences
\ Quote the following metacharacter (so \. matches a literal .).
\Q Quote all following metacharacters until \E.
\E Stop quoting metacharacters (started by \Q).
\\ A literal backslash.
\uhhhh The Unicode character U+hhhh (in hex).
\xhh The Unicode character U+00hh (in hex).
\cx The ASCII control character ^x (so \cH would be ^H, U+0008).
\a The ASCII bell character (U+0007).
\e The ASCII ESC character (U+001b).
\f The ASCII form feed character (U+000c).
\n The ASCII newline character (U+000a).
\r The ASCII carriage return character (U+000d).
\t The ASCII tab character (U+0009).
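
As a rough illustration of quoting, the sketch below matches text containing a metacharacter both with \Q...\E and with Pattern.quote, and checks that the hex and Unicode escapes name the same character. The class and method names are hypothetical; behavior is as I would expect on a standard JDK.

```java
import java.util.regex.Pattern;

class QuotingDemo {
    // Unquoted: the + is a quantifier, so the regex 1+1=2 means
    // "one or more 1s, then 1=2".
    static boolean matchesUnquoted(String input) {
        return Pattern.matches("1+1=2", input);
    }

    // Quoted with \Q...\E: every character in between is literal.
    static boolean matchesQuoted(String input) {
        return Pattern.matches("\\Q1+1=2\\E", input);
    }

    // Pattern.quote builds a literal pattern for arbitrary text.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    // The hex escape for 41 and the Unicode escape for 0041 both denote 'A'.
    static boolean isCapitalA(String input) {
        return input.matches("\\x41") && input.matches("\\u0041");
    }

    public static void main(String[] args) {
        System.out.println(matchesUnquoted("11=2"));          // prints true
        System.out.println(matchesUnquoted("1+1=2"));         // prints false
        System.out.println(matchesQuoted("1+1=2"));           // prints true
        System.out.println(matchesLiterally("a.b", "axb"));   // prints false
        System.out.println(isCapitalA("A"));                  // prints true
    }
}
```

Pattern.quote is usually the safer choice when the literal text comes from a variable, since the text itself might contain \E.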

Character classes
It's possible to construct arbitrary character classes using set operations:
[abc] Any one of a, b, or c. (Enumeration.)
[a-c] Any one of a, b, or c. (Range.)
[^abc] Any character except a, b, or c. (Negation.)
[[a-f][0-9]] Any character in either range. (Union.)
[[a-z]&&[jkl]] Any character in both ranges. (Intersection.)
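
The set operations can be tried out with a small sketch like the one below. The class and method names are invented for the example. The last method combines intersection with negation to subtract one set from another, which is an idiom from the standard JDK documentation rather than something stated above.

```java
class CharClassDemo {
    // Union: any character from a-f or from 0-9.
    static boolean isLowerHexDigit(char c) {
        return String.valueOf(c).matches("[[a-f][0-9]]");
    }

    // Intersection: characters that are in a-z AND in the set j, k, l.
    static boolean isJkl(char c) {
        return String.valueOf(c).matches("[[a-z]&&[jkl]]");
    }

    // Intersection with a negated class acts as subtraction:
    // lowercase ASCII letters except the vowels.
    static boolean isLowerConsonant(char c) {
        return String.valueOf(c).matches("[a-z&&[^aeiou]]");
    }

    public static void main(String[] args) {
        System.out.println(isLowerHexDigit('c'));  // prints true
        System.out.println(isLowerHexDigit('g'));  // prints false
        System.out.println(isJkl('k'));            // prints true
        System.out.println(isLowerConsonant('e')); // prints false
    }
}
```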

Most of the time, the built-in character classes are more useful:
\d Any digit character.
\D Any non-digit character.
\s Any whitespace character.
\S Any non-whitespace character.
\w Any word character.
\W Any non-word character.
\p{NAME} Any character in the class with the given NAME.
\P{NAME} Any character not in the named class.

There are a variety of named classes:

Unicode category names, prefixed by Is. For example \p{IsLu} for all uppercase letters.
POSIX class names. These are 'Alnum', 'Alpha', 'ASCII', 'Blank', 'Cntrl', 'Digit', 'Graph', 'Lower', 'Print', 'Punct', 'Upper', 'XDigit'.
Unicode block names, as used by java.lang.Character.UnicodeBlock.forName prefixed by In. For example \p{InHebrew} for all characters in the Hebrew block.
Character method names. These are all non-deprecated methods from java.lang.Character whose name starts with is, but with the is replaced by java. For example, \p{javaLowerCase}.
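
A short sketch of the four kinds of named class follows. The class and method names are hypothetical, and the results noted in the comments are what I would expect on a standard JDK; the Android implementation may differ in edge cases.

```java
class NamedClassDemo {
    // Unicode category with the Is prefix: Lu is "uppercase letter".
    static boolean isUppercaseLetter(String s) {
        return s.matches("\\p{IsLu}");
    }

    // POSIX class: Alnum covers ASCII letters and digits by default.
    static boolean isAsciiAlnum(String s) {
        return s.matches("\\p{Alnum}+");
    }

    // Unicode block with the In prefix.
    static boolean isHebrew(String s) {
        return s.matches("\\p{InHebrew}+");
    }

    // Character method name: backed by Character.isLowerCase.
    static boolean isLowerCase(String s) {
        return s.matches("\\p{javaLowerCase}+");
    }

    public static void main(String[] args) {
        System.out.println(isUppercaseLetter("A")); // prints true
        System.out.println(isAsciiAlnum("abc-123")); // prints false
        System.out.println(isHebrew("\u05D0"));      // prints true (the letter alef)
        System.out.println(isLowerCase("aBc"));      // prints false
    }
}
```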

Quantifiers
Quantifiers match some number of instances of the preceding regular expression.
* Zero or more.
? Zero or one.
+ One or more.
{n} Exactly n.
{n,} At least n.
{n,m} At least n but not more than m.

Quantifiers are "greedy" by default, meaning that they will match the longest possible input sequence. There are also non-greedy quantifiers that match the shortest possible input sequence. They're the same as the greedy ones but with a trailing ?:
*? Zero or more (non-greedy).
?? Zero or one (non-greedy).
+? One or more (non-greedy).
{n}? Exactly n (non-greedy).
{n,}? At least n (non-greedy).
{n,m}? At least n but not more than m (non-greedy).
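
The difference between greedy and non-greedy matching shows up most clearly when a delimiter appears more than once in the input. The sketch below is only an illustration; GreedyDemo and firstMatch are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        // Greedy: .+ runs to the last > in the input.
        System.out.println(firstMatch("<.+>", html));  // prints <b>bold</b>
        // Non-greedy: .+? stops at the first > that lets the match succeed.
        System.out.println(firstMatch("<.+?>", html)); // prints <b>
    }
}
```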

Quantifiers allow backtracking by default. There are also possessive quantifiers to prevent backtracking. They're the same as the greedy ones but with a trailing +:
*+ Zero or more (possessive).
?+ Zero or one (possessive).
++ One or more (possessive).
{n}+ Exactly n (possessive).
{n,}+ At least n (possessive).
{n,m}+ At least n but not more than m (possessive).
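
To see what "prevent backtracking" means in practice, compare a greedy and a possessive quantifier on input where the greedy one has to give characters back. This is a minimal sketch with a hypothetical class name.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // True if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Greedy: .* first takes "xfoo", then backs off three characters
        // so that foo can match.
        System.out.println(fullMatch(".*foo", "xfoo"));  // prints true
        // Possessive: .*+ takes "xfoo" and never gives anything back,
        // so nothing is left for foo.
        System.out.println(fullMatch(".*+foo", "xfoo")); // prints false
        // Possessive quantifiers still work when no backtracking is needed.
        System.out.println(fullMatch("\\d++[a-z]", "123a")); // prints true
    }
}
```

Possessive quantifiers are mainly a performance tool: they let a match fail fast instead of retrying every shorter alternative.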

Zero-width assertions
^ At beginning of line.
$ At end of line.
\A At beginning of input.
\b At word boundary.
\B At non-word boundary.
\G At end of previous match.
\z At end of input.
\Z At end of input, or before newline at end.
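
The sketch below exercises a few of these assertions by counting matches. Note that ^ and $ only match at line boundaries when the MULTILINE flag (m) is enabled; without it they match at the beginning and end of input. AnchorDemo and countMatches are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Counts how many times the regex is found in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // \b: "cat" as a whole word, so the one inside "concat" is skipped.
        System.out.println(countMatches("\\bcat\\b", "cat concat cat.")); // prints 2
        // ^ without MULTILINE matches only at the start of the input.
        System.out.println(countMatches("^\\w+", "one\ntwo"));     // prints 1
        // With (?m), ^ matches at the start of every line.
        System.out.println(countMatches("(?m)^\\w+", "one\ntwo")); // prints 2
        // \Z tolerates a final newline; \z does not.
        System.out.println(countMatches("end\\Z", "the end\n")); // prints 1
        System.out.println(countMatches("end\\z", "the end\n")); // prints 0
    }
}
```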

Look-around assertions
Look-around assertions assert that the subpattern does (positive) or doesn't (negative) match after (look-ahead) or before (look-behind) the current position, without including the matched text in the containing match. The maximum length of possible matches for look-behind patterns must not be unbounded.

(?=a) Zero-width positive look-ahead.
(?!a) Zero-width negative look-ahead.
(?<=a) Zero-width positive look-behind.
(?<!a) Zero-width negative look-behind.
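
A small sketch of the look-around forms: in each case the asserted text constrains where the match may occur but is not part of the returned match. The class and helper names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Positive look-behind: digits preceded by a dollar sign.
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost: $42")); // prints 42
        // Positive look-ahead: digits followed by a percent sign.
        System.out.println(firstMatch("\\d+(?=%)", "up 15% from 120")); // prints 15
        // Negative look-ahead: foo not followed by bar.
        System.out.println(firstMatch("foo(?!bar)", "foobar")); // prints null
        System.out.println(firstMatch("foo(?!bar)", "foobaz")); // prints foo
    }
}
```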

Groups
(a) A capturing group.
(?:a) A non-capturing group.
(?>a) An independent non-capturing group. (The first match of the subgroup is the only match tried.)
\n The text already matched by capturing group n.

See Matcher.group for details of how capturing groups are numbered and accessed.
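
As an illustration of capturing groups and back-references, the sketch below finds a doubled word and shows that non-capturing groups are not counted. The class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Finds the first word that is immediately repeated, using \1 to
    // require the same text that group 1 captured. Returns null if none.
    static String firstRepeatedWord(String text) {
        Matcher m = Pattern.compile("\\b(\\w+) \\1\\b").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Number of capturing groups in a pattern; (?:...) groups are not counted.
    static int capturingGroupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test")); // prints is
        System.out.println(capturingGroupCount("(?:ab)+(c)"));      // prints 1
        System.out.println(capturingGroupCount("(a)(b)"));          // prints 2
    }
}
```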

Operators
ab Expression a followed by expression b.
a|b Either expression a or expression b.

Flags
(?dimsux-dimsux:a) Evaluates the expression a with the given flags enabled/disabled.
(?dimsux-dimsux) Evaluates the rest of the pattern with the given flags enabled/disabled.

The flags are:
i CASE_INSENSITIVE case insensitive matching
d UNIX_LINES only accept '\n' as a line terminator
m MULTILINE allow ^ and $ to match beginning/end of any line
s DOTALL allow . to match '\n' ("s" for "single line")
u UNICODE_CASE enable Unicode case folding
x COMMENTS allow whitespace and comments

Either set of flags may be empty. For example, (?i-m) would turn on case-insensitivity and turn off multiline mode, (?i) would just turn on case-insensitivity, and (?-m) would just turn off multiline mode.
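
Flags can be set inline, scoped to a group, or passed to Pattern.compile. The sketch below shows all three; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // True if the whole input matches the regex (flags given inline, if any).
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Same, but with flags passed to Pattern.compile.
    static boolean fullMatchWithFlags(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // (?i) applies to the rest of the pattern.
        System.out.println(fullMatch("(?i)hello", "HELLO"));   // prints true
        // (?i:h) applies only inside the group.
        System.out.println(fullMatch("(?i:h)ello", "Hello"));  // prints true
        System.out.println(fullMatch("(?i:h)ello", "HELLO"));  // prints false
        // By default . does not match a newline; (?s) changes that.
        System.out.println(fullMatch(".+", "a\nb"));           // prints false
        System.out.println(fullMatch("(?s).+", "a\nb"));       // prints true
        // The same flag as a compile-time constant.
        System.out.println(
            fullMatchWithFlags("hello", Pattern.CASE_INSENSITIVE, "HELLO")); // prints true
    }
}
```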

Note that on Android, UNICODE_CASE is always on: case-insensitive matching will always be Unicode-aware.

There are two other flags not settable via this mechanism: CANON_EQ and LITERAL. Attempts to use CANON_EQ on Android will throw an exception.

Implementation notes
The regular expression implementation used in Android is provided by ICU. The notation for the regular expressions is mostly a superset of those used in other Java language implementations. This means that existing applications will normally work as expected, but in rare cases Android may accept a regular expression that is not accepted by other implementations.

In some cases, Android will recognize that a regular expression is a simple special case that can be handled more efficiently. This is true of both the convenience methods in String and the methods in Pattern.

See Also:
Matcher
For the `PatternSyntaxException: Illegal repetition near index 1` error caused by the Java regular expression `^.{GCBH}`, a detailed solution follows.

### Cause of the error

1. **Illegal repetition syntax**:
   ```java
   TH.replaceAll("^.{GCBH}", ""); // Wrong!
   ```
   - `{GCBH}` is not a valid regex quantifier; a quantifier must be a number (such as `{4}`) or a range (such as `{1,5}`)
   - Java treats `{G` as illegal repetition syntax and throws the exception[^1][^3]
2. **The variable is not actually referenced**:
   - The intent was probably to use the value of the variable `GCBH` ("S186701S") as the pattern to match
   - But `{GCBH}` written inside the string literal is treated as pattern text, not as a reference to the variable

---

### Solutions

#### Option 1: use the variable correctly (recommended)

```java
String GCBH = "S186701S";
String TH = "S186701S-T0101";
// Quote any special characters, then concatenate the variable into the pattern
String regex = "^" + Pattern.quote(GCBH);
String X = TH.replaceFirst(regex, "");
System.out.println(X); // Output: -T0101
```

- **Key points**:
  - `Pattern.quote()` escapes regex special characters in the variable's value (such as `.` and `*`)[^3]
  - `replaceFirst()` is a better fit than `replaceAll()` here, since the prefix only needs to be matched once at the start

#### Option 2: fixed-length match (for a generic prefix)

```java
// Match the first 8 characters (the length of GCBH)
String X = TH.replaceFirst("^.{8}", "");
```

- When to use: the length of `GCBH` is fixed and any prefix of that length should be removed

#### Option 3: plain string operations, no regex

```java
String X = TH.startsWith(GCBH) ? TH.substring(GCBH.length()) : TH;
```

- **Advantage**: avoids regular expressions entirely and is faster
- **Output**: `-T0101`

---

### Error prevention guide

| **Incorrect form** | **Fix** | **Reason** |
|---------------------|----------------------------------|------------------------------|
| `^.{GCBH}` | `"^" + Pattern.quote(GCBH)` | Quote special characters and concatenate the variable properly |
| `{ABC}` | Use a numeric quantifier such as `{3}` | Quantifiers must be numeric |
| Anchor `^` not handled | Add `(?m)` to enable multiline mode | Lets `^` match at the start of every line |
| Metacharacter `.` not handled | Escape it as `\\.` | `.` matches any character in a regex |

> **Best practice**:
> Test regular expressions online with [Regex Tester](https://regex101.com/), or add exception handling:
> ```java
> try {
>     TH.replaceAll(regex, "");
> } catch (PatternSyntaxException e) {
>     System.err.println("Regex error: " + e.getMessage());
> }
> ```

---

### Root causes

An `Illegal repetition` exception is usually triggered by one of the following:

1. **Malformed quantifier**: the `x` in `{x}` must be an integer
   ```java
   // Wrong: letters in place of a number
   "a{xyz}".matches("a{xyz}"); // throws PatternSyntaxException[^1][^3]
   ```
2. **Unescaped special characters**: `{`, `}`, `.`, `*`, and so on were not escaped with a backslash
   ```java
   // Correctly escaped
   "price: $5.99".replaceAll("\\$", "¥"); // Output: price: ¥5.99
   ```
3. **Unicode property error**: a `\p{...}` with an invalid property name triggers `Incorrect Unicode property`[^2]