Normalize strings before validating them
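Applications routinely validate strings against various rules. Strings in Java are Unicode text: Java 6 is based on Unicode 4.0, and Java 7 on Unicode 6.0.0. Because Unicode allows the same text to be represented by different character sequences, a check on the raw input can miss characters that only appear once the string has been normalized.
Unicode defines several normalization forms, and each one processes the text slightly differently.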
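NFC
Unicode Normalization Form C: canonical decomposition followed by canonical composition. Some APIs fall back to NFC when no form is specified; java.text.Normalizer has no such default and always requires the form argument.
NFD
Unicode Normalization Form D: canonical decomposition.
NFKC
Unicode Normalization Form KC: compatibility decomposition followed by canonical composition.
NFKD
Unicode Normalization Form KD: compatibility decomposition.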
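The difference between the four forms is easier to see on a concrete string. The following is a minimal sketch, not part of the original article: the class name NormalizationFormsDemo and the helper escape are hypothetical, and the sample input (a precomposed e-acute followed by the "fi" ligature) is chosen only for illustration.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical demo class; the names here are illustrative only.
final class NormalizationFormsDemo {

    // Renders every UTF-16 code unit as a four-digit hex escape so that
    // otherwise invisible differences between the forms become visible.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (precomposed e-acute) followed by U+FB01 (the "fi" ligature)
        String input = "\u00E9\uFB01";
        for (Form form : Form.values()) {
            System.out.println(form + " -> " + escape(Normalizer.normalize(input, form)));
        }
    }
}
```

The canonical forms (NFC, NFD) only compose or decompose the accented letter and leave the ligature alone, while the compatibility forms (NFKC, NFKD) also replace the ligature with the plain letters "f" and "i". That extra folding is why the validation example in this article uses NFKC.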
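Normalizer.normalize converts Unicode text into an equivalent normalized form. If we validate the input string first and normalize it only afterwards, the check can be bypassed. In the code below, Pattern.compile("[<>]") finds no angle brackets in the raw input, so validation passes, yet the subsequent normalization turns the string into <script>: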
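import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// String s may be user controllable
// \uFE64 is normalized to < and \uFE65 is normalized to > using NFKC
String s = "\uFE64" + "script" + "\uFE65";

// Validate (too early: the raw string contains no '<' or '>' yet)
Pattern pattern = Pattern.compile("[<>]"); // Check for angle brackets
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    // Found blacklisted tag
    throw new IllegalStateException();
} else {
    // . . .
}

// Normalize (s now becomes "<script>", after the check has already passed)
s = Normalizer.normalize(s, Form.NFKC);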
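To make the failure concrete, the following sketch runs the same angle-bracket check on the raw string and on its NFKC form. The class name ValidateBeforeNormalizeDemo and the helper containsAngleBrackets are hypothetical names introduced here for illustration.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original article.
final class ValidateBeforeNormalizeDemo {

    private static final Pattern ANGLE_BRACKETS = Pattern.compile("[<>]");

    // Returns true if the string contains an ASCII '<' or '>'.
    static boolean containsAngleBrackets(String s) {
        return ANGLE_BRACKETS.matcher(s).find();
    }

    public static void main(String[] args) {
        String raw = "\uFE64" + "script" + "\uFE65";
        String normalized = Normalizer.normalize(raw, Form.NFKC);

        System.out.println(containsAngleBrackets(raw));        // false: the check is bypassed
        System.out.println(containsAngleBrackets(normalized)); // true: brackets appear after NFKC
        System.out.println(normalized);                        // <script>
    }
}
```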
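If the input string is normalized first and validated afterwards, the same Pattern.compile("[<>]") check detects the angle brackets and throws IllegalStateException, so the malicious input is correctly rejected: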
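import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// String s may be user controllable
String s = "\uFE64" + "script" + "\uFE65";

// Normalize (s becomes "<script>")
s = Normalizer.normalize(s, Form.NFKC);

// Validate (now sees the real '<' and '>')
Pattern pattern = Pattern.compile("[<>]");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    // Found blacklisted tag
    throw new IllegalStateException();
} else {
    // . . .
}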
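One way to keep the two steps in the right order is to put them behind a single method, so that callers cannot validate without normalizing. The sketch below assumes a hypothetical class InputSanitizer with a hypothetical method normalizeAndValidate; it throws IllegalStateException only to mirror the snippet above.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class InputSanitizer {

    private static final Pattern ANGLE_BRACKETS = Pattern.compile("[<>]");

    private InputSanitizer() {}

    // Normalizes to NFKC first, then validates the normalized text.
    // Returns the normalized string so callers keep working with the
    // exact value that was checked.
    static String normalizeAndValidate(String input) {
        String normalized = Normalizer.normalize(input, Form.NFKC);
        if (ANGLE_BRACKETS.matcher(normalized).find()) {
            throw new IllegalStateException("angle brackets found after normalization");
        }
        return normalized;
    }

    public static void main(String[] args) {
        System.out.println(normalizeAndValidate("hello"));
        try {
            normalizeAndValidate("\uFE64" + "script" + "\uFE65");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Returning the normalized value matters: if the caller went back to the original string after the check, the validated text and the text actually used would differ again.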
The Normalizer class in Java
public final class Normalizer {

    private Normalizer() {};

    /**
     * This enum provides constants of the four Unicode normalization forms
     * that are described in
     * <a href="http://www.unicode.org/unicode/reports/tr15/tr15-23.html">
     * Unicode Standard Annex #15 &mdash; Unicode Normalization Forms</a>
     * and two methods to access them.
     *
     * @since 1.6
     */
    public static enum Form {

        /**
         * Canonical decomposition.
         */
        NFD,

        /**
         * Canonical decomposition, followed by canonical composition.
         */
        NFC,

        /**
         * Compatibility decomposition.
         */
        NFKD,

        /**
         * Compatibility decomposition, followed by canonical composition.
         */
        NFKC
    }

    /**
     * Normalize a sequence of char values.
     * The sequence will be normalized according to the specified normalization
     * form.
     * @param src   The sequence of char values to normalize.
     * @param form  The normalization form; one of
     *              {@link java.text.Normalizer.Form#NFC},
     *              {@link java.text.Normalizer.Form#NFD},
     *              {@link java.text.Normalizer.Form#NFKC},
     *              {@link java.text.Normalizer.Form#NFKD}
     * @return The normalized String
     * @throws NullPointerException If <code>src</code> or <code>form</code>
     * is null.
     */
    public static String normalize(CharSequence src, Form form) {
        return NormalizerBase.normalize(src.toString(), form);
    }

    /**
     * Determines if the given sequence of char values is normalized.
     * @param src   The sequence of char values to be checked.
     * @param form  The normalization form; one of
     *              {@link java.text.Normalizer.Form#NFC},
     *              {@link java.text.Normalizer.Form#NFD},
     *              {@link java.text.Normalizer.Form#NFKC},
     *              {@link java.text.Normalizer.Form#NFKD}
     * @return true if the sequence of char values is normalized;
     * false otherwise.
     * @throws NullPointerException If <code>src</code> or <code>form</code>
     * is null.
     */
    public static boolean isNormalized(CharSequence src, Form form) {
        return NormalizerBase.isNormalized(src.toString(), form);
    }
}
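The class exposes only two static methods, normalize and isNormalized. As a small usage sketch, isNormalized can be used to check whether a string is already in the target form before normalizing it. The class NormalizeIfNeeded and its method toNfkc are hypothetical names, and whether the extra check pays off in practice is an assumption that would need measuring.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical helper; the names are illustrative only.
final class NormalizeIfNeeded {

    private NormalizeIfNeeded() {}

    // Returns the input unchanged when it is already in NFKC,
    // otherwise returns its NFKC-normalized form.
    static String toNfkc(String s) {
        return Normalizer.isNormalized(s, Form.NFKC)
                ? s
                : Normalizer.normalize(s, Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(Normalizer.isNormalized("script", Form.NFKC)); // true
        System.out.println(Normalizer.isNormalized("\uFE64", Form.NFKC)); // false
        System.out.println(toNfkc("\uFE64" + "script" + "\uFE65"));       // <script>
    }
}
```

As the Javadoc above states, both methods throw NullPointerException when src or form is null, so user input that may be null has to be checked before it reaches them.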