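Filtering four-byte and six-byte special characters in Java

This article discusses how to use regular expressions to match Unicode characters in Java 6 and later, in particular characters in the supplementary planes (astral planes). It explains how valid surrogate pairs are matched and gives regex examples that handle these characters correctly.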

In Java 7 you can write:

source.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "*");

In both Java 6 and Java 7 you can write:

source.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "*");

 

Matching characters in astral planes (code points U+10000 to U+10FFFF) has been an under-documented feature in Java regex.

 

This answer mainly deals with Oracle's implementation (reference implementation, which is also used in OpenJDK) for Java version 6 and above.

 

Please test the code yourself if you happen to use GNU Classpath or Android, since they use their own implementation.

 

Behind the scenes

 

Assuming that you are running your regex on Oracle's implementation, your regex

 

"([\ud800-\udbff\udc00-\udfff])"

 

is compiled as such:

 

StartS. Start unanchored match (minLength=1)
java.util.regex.Pattern$GroupHead
Pattern.union. A ∪ B:
  Pattern.union. A ∪ B:
    Pattern.rangeFor. U+D800 <= codePoint <= U+10FC00.
    BitClass. Match any of these 1 character(s):
      [U+002D]
  SingleS. Match code point: U+DFFF LOW SURROGATES DFFF
java.util.regex.Pattern$GroupTail
java.util.regex.Pattern$LastNode
Node. Accept match

 

The character class is parsed as \ud800-\udbff\udc00, -, \udfff. Since \udbff\udc00 forms a valid surrogate pair, it represents the code point U+10FC00.
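
The parse described above can be checked with a small program. This is only a sketch: the class name SurrogateRangeDemo and its helper methods are made up for illustration, and the behaviour shown assumes an Oracle/OpenJDK runtime as discussed in this answer.

```java
import java.util.regex.Pattern;

public class SurrogateRangeDemo {
    // The pattern from the text, written with string-literal Unicode escapes.
    static final Pattern P = Pattern.compile("([\ud800-\udbff\udc00-\udfff])");

    // Code point formed by the adjacent code units \udbff and \udc00.
    static int mergedCodePoint() {
        return Character.toCodePoint('\udbff', '\udc00');
    }

    // True if the pattern finds a match anywhere in the input.
    static boolean matches(String input) {
        return P.matcher(input).find();
    }

    public static void main(String[] args) {
        // The upper bound of the first range in the compiled pattern.
        System.out.printf("U+%X%n", mergedCodePoint()); // U+10FC00
        // The hyphen is a literal member of the class, so it matches.
        System.out.println(matches("a-b"));
        // Ordinary letters are outside every member of the class.
        System.out.println(matches("abc"));
    }
}
```

The hyphen-minus match is the surprising part: because the first range swallows \udc00 as part of its upper bound, the second "-" no longer sits between two range endpoints and is treated as an ordinary character.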

 

Wrong solution

 

There is no point in writing:

 

"[\ud800-\udbff][\udc00-\udfff]"

 

Oracle's implementation matches by code point, and a valid surrogate pair is converted to a single code point before matching. The regex above therefore can't match anything, since it is searching for 2 consecutive lone surrogates that together would form a valid pair.
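
A quick way to see this for yourself is the sketch below. The class name TwoClassDemo and the sample string are illustrative assumptions, and the result is what I expect on Oracle/OpenJDK; other implementations may differ, as noted earlier.

```java
import java.util.regex.Pattern;

public class TwoClassDemo {
    // The two-class pattern from the text: a high surrogate followed by a low one.
    static final Pattern TWO_CLASSES =
            Pattern.compile("[\ud800-\udbff][\udc00-\udfff]");

    // True if the pattern finds a match anywhere in the input.
    static boolean found(String input) {
        return TWO_CLASSES.matcher(input).find();
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600 encoded as a valid surrogate pair
        // Two chars, but a single code point as far as the matcher is concerned.
        System.out.println(emoji.length());
        System.out.println(emoji.codePointCount(0, emoji.length()));
        // The matcher reads U+1F600 as one unit, so neither class ever sees a surrogate.
        System.out.println(found(emoji));
    }
}
```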

 

Solution

 

If you want to match and remove all code points above U+FFFF in the astral planes (formed by a valid surrogate pair), plus the lone surrogates (which can't form a valid surrogate pair), you should write:

 

input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");

 

This solution has been tested to work in Java 6 and 7 (Oracle implementation).
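
As a minimal sketch of how this might be packaged, the class below precompiles the same pattern and applies it to a few sample strings. The names AstralStripper and strip are hypothetical, chosen only for this example.

```java
import java.util.regex.Pattern;

public class AstralStripper {
    // Same character class as above. The string-literal escapes are resolved by
    // the Java compiler, so the regex compiler receives U+10000 and U+10FFFF.
    private static final Pattern ASTRAL_OR_LONE =
            Pattern.compile("[\ud800\udc00-\udbff\udfff\ud800-\udfff]");

    // Removes astral code points and lone surrogates from the input.
    static String strip(String input) {
        return ASTRAL_OR_LONE.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("a\uD83D\uDE00b"));     // astral code point removed
        System.out.println(strip("x\uD800y"));           // lone high surrogate removed
        System.out.println(strip("plain \u4E2D\u6587")); // BMP text is left untouched
    }
}
```

Compiling the pattern once and reusing it avoids recompiling on every call, which is what String.replaceAll does internally.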

 

The regex above compiles to:

 

StartS. Start unanchored match (minLength=1)
Pattern.union. A ∪ B:
  Pattern.rangeFor. U+10000 <= codePoint <= U+10FFFF.
  Pattern.rangeFor. U+D800 <= codePoint <= U+DFFF.
java.util.regex.Pattern$LastNode
Node. Accept match

 

Note that I am specifying the characters with string-literal Unicode escape sequences, not with escape sequences in regex syntax.

 

// Only works in Java 7 and above
input.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "");

 

Java 6 doesn't recognize surrogate pairs when they are specified with regex syntax, so the regex recognizes \\ud800 as one character and tries to compile the range \\udc00-\\udbff, where it fails. We are lucky that it throws an exception for this input; otherwise, the error would go undetected. Java 7 parses this regex correctly and compiles it to the same structure as above.

 


 

Since Java 7, the syntax \x{h...h} has been available for specifying characters beyond the BMP (Basic Multilingual Plane), and it is the recommended way to specify characters in astral planes.

 

input.replaceAll("[\\x{10000}-\\x{10ffff}\ud800-\udfff]", "");

 

This regex also compiles to the same structure as above.
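
To build some confidence in the \x{h...h} form, the sketch below compares it against a regex-free loop over code points. The loop is not part of the original answer; it is a cross-check added here for illustration, and the names HexNotationDemo, stripWithRegex and stripWithLoop are hypothetical. It needs Java 7 or later.

```java
import java.util.regex.Pattern;

public class HexNotationDemo {
    // \x{h...h} is regex syntax, so the backslash is doubled in the string literal.
    // The lone-surrogate range still uses string-literal escapes.
    private static final Pattern ASTRAL_OR_LONE =
            Pattern.compile("[\\x{10000}-\\x{10ffff}\ud800-\udfff]");

    static String stripWithRegex(String input) {
        return ASTRAL_OR_LONE.matcher(input).replaceAll("");
    }

    // Cross-check without regex: keep only BMP code points that are not surrogates.
    static String stripWithLoop(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i); // a surrogate value here means it is unpaired
            i += Character.charCount(cp);
            boolean astral = Character.isSupplementaryCodePoint(cp);
            boolean loneSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (!astral && !loneSurrogate) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'a', U+1F600, 'b', a lone low surrogate, 'c'
        String sample = "a\uD83D\uDE00b\uDC00c";
        System.out.println(stripWithRegex(sample));
        System.out.println(stripWithRegex(sample).equals(stripWithLoop(sample)));
    }
}
```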

 

Reposted from: http://stackoverflow.com/questions/27820971/why-a-surrogate-java-regexp-finds-hypen-minus

 
