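Java Regular Expressions: A Comprehensive Guide with Practical Applications
Regular Expressions in String Processing
In Java, string handling and regular expressions go hand in hand. The following example shows how to use a regular expression to match and replace text in a string: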
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Class name is illustrative; the original listing shows only main().
public class RegexReplaceExample
{
    public static void main (String [] argv)
    {
        String input = "Thanks, thanks very much";
        String regex = "([Tt])hanks";
        Pattern pattern = Pattern.compile (regex);
        Matcher matcher = pattern.matcher (input);
        StringBuffer sb = new StringBuffer();
        // Loop while matches are encountered
        while (matcher.find()) {
            // Group 1 holds the first letter; keep its case in the replacement
            if (matcher.group(1).equals ("T")) {
                matcher.appendReplacement (sb, "Thank you");
            } else {
                matcher.appendReplacement (sb, "thank you");
            }
        }
        // Complete the transfer to the StringBuffer
        matcher.appendTail (sb);
        // Print the result
        System.out.println (sb.toString());
        // Let's try that again using the $n escape in the replacement
        sb.setLength (0);
        matcher.reset();
        String replacement = "$1hank you";
        // Loop while matches are encountered
        while (matcher.find()) {
            matcher.appendReplacement (sb, replacement);
        }
        // Complete the transfer to the StringBuffer
        matcher.appendTail (sb);
        // Print the result
        System.out.println (sb.toString());
        // and once more, the easy way (because this example is simple)
        System.out.println (matcher.replaceAll (replacement));
        // one last time, using only the String
        System.out.println (input.replaceAll (regex, replacement));
    }
}
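The code above works as follows:
1. Define the input string input and the regular expression regex.
2. Compile the regular expression with Pattern.compile(regex) to obtain a Pattern object.
3. Create a Matcher object with pattern.matcher(input).
4. Call matcher.find() to locate each match and choose the replacement text for it.
5. Call matcher.appendReplacement for each match to copy the skipped text plus the replacement into the buffer, then call matcher.appendTail once to copy the text after the last match.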
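The find/appendReplacement/appendTail loop is most useful when the replacement has to be computed per match. The sketch below is one possible illustration, not part of the original listing: the class name ReplacementLoopSketch and the method capitalizeWords are hypothetical. It upper-cases the first letter of each lowercase word and passes the computed text through Matcher.quoteReplacement so that a stray $ or backslash is not read as a group reference.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate the loop.
public class ReplacementLoopSketch {
    // Group 1: first letter of a lowercase word; group 2: the rest of it.
    private static final Pattern WORD = Pattern.compile("\\b([a-z])([a-z]*)\\b");

    /** Upper-cases the first letter of every all-lowercase word. */
    public static String capitalizeWords(String input) {
        Matcher matcher = WORD.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            // Build the replacement in code, then quote it so it is
            // inserted literally.
            String replacement =
                matcher.group(1).toUpperCase(Locale.ROOT) + matcher.group(2);
            matcher.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        // Copy whatever follows the last match.
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeWords("thanks, thanks very much"));
    }
}
```

When the replacement is a fixed string or only needs $n references, replaceAll remains the shorter option.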
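Regular Expression Methods of the String Class
Java's String class provides several convenience methods for regular expression operations. They are wrappers around methods of the Pattern and Matcher classes. Here is a partial API listing of the relevant String methods: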
package java.lang;
public final class String
implements java.io.Serializable, Comparable, CharSequence
{
// This is a partial API listing
public boolean matches (String regex)
public String [] split (String regex)
public String [] split (String regex, int limit)
public String replaceFirst (String regex, String replacement)
public String replaceAll (String regex, String replacement)
}
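The table below shows how these methods map to java.util.regex, where pat is a Pattern compiled from regex and match is a Matcher created from that Pattern for input:
| String method signature | java.util.regex equivalent |
|---|---|
| input.matches (String regex) | Pattern.matches (String regex, CharSequence input) |
| input.split (String regex) | pat.split (CharSequence input) |
| input.split (String regex, int limit) | pat.split (CharSequence input, int limit) |
| input.replaceFirst (String regex, String replacement) | match.replaceFirst (String replacement) |
| input.replaceAll (String regex, String replacement) | match.replaceAll (String replacement) |
Note that, as of JDK 1.4, these convenience methods do not cache compiled expressions or apply any other optimization. If you apply the same pattern repeatedly, using the java.util.regex classes directly is more efficient.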
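To make the mapping concrete, here is a minimal sketch that performs the same operations both ways. The class name ConvenienceVsPattern and its methods are illustrative assumptions; the point is only that the Pattern is compiled once and reused, whereas the String methods typically compile the expression on each call.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical class comparing String convenience methods with a reused Pattern.
public class ConvenienceVsPattern {
    // Compiled once and shared; Pattern objects are immutable.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    /** Convenience form: the regex string is handed to String.split. */
    public static String[] splitWithString(String input) {
        return input.split("\\s*,\\s*");
    }

    /** Equivalent form using the precompiled Pattern. */
    public static String[] splitWithPattern(String input) {
        return COMMA.split(input);
    }

    /** Equivalent of input.replaceAll(regex, replacement). */
    public static String replaceWithPattern(String input, String replacement) {
        return COMMA.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String input = "a , b,c ,  d";
        System.out.println(Arrays.toString(splitWithString(input)));
        System.out.println(Arrays.toString(splitWithPattern(input)));
        System.out.println(replaceWithPattern(input, ";"));
    }
}
```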
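Java Regular Expression Syntax
The regular expression syntax supported by the java.util.regex package was released with JDK 1.4. The library fully supports Unicode and follows the guidelines in Unicode Technical Report #18. The syntax is similar to Perl's but not identical: the main differences are that java.util.regex does not support embedding Perl code in an expression and that it adds possessive quantifiers.
The following is a quick reference to the Java regular expression syntax: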
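Single Characters
| Syntax | Matches |
|---|---|
| x | The character x, as long as x is not a punctuation character with special meaning in regular expression syntax |
| \p | The punctuation character p, where p stands for any punctuation character (for example, \. matches a period); this is not the \p{...} property syntax |
| \\ | The backslash character |
| \n | Newline (line feed) \u000A |
| \t | Tab \u0009 |
| \r | Carriage return \u000D |
| \f | Form feed \u000C |
| \e | Escape \u001B |
| \a | Bell (alert) \u0007 |
| \uhhhh | The Unicode character with hexadecimal code hhhh |
| \xhh | The character with hexadecimal code hh |
| \0n | The character with octal code n |
| \0nn | The character with octal code nn |
| \0nnn | The character with octal code nnn (nnn <= 377) |
| \cx | The control character ^x |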
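One practical wrinkle with this table is that a regular expression usually reaches the engine through a Java string literal, so each backslash in the regex has to be doubled in source code. The sketch below illustrates this under that assumption; the class EscapeSketch and its method names are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical class showing how regex escapes look inside Java string literals.
public class EscapeSketch {
    /** The regex for a literal period is backslash-dot, written "\\." in source. */
    public static boolean matchesLiteralDot(String s) {
        return Pattern.matches("a\\.b", s);
    }

    /** One literal backslash in the input needs four backslashes in source. */
    public static boolean matchesBackslash(String s) {
        return Pattern.matches("c:\\\\temp", s);
    }

    /** The regex escape for tab and a real tab character match the same input. */
    public static boolean matchesTab(String s) {
        return Pattern.matches("x\\ty", s) && Pattern.matches("x\ty", s);
    }

    /** Hex and Unicode escapes: both sequences below denote the letter A. */
    public static boolean matchesHexAndUnicode(String s) {
        return Pattern.matches("\\x41\\u0041", s);
    }

    public static void main(String[] args) {
        System.out.println(matchesLiteralDot("a.b"));
        System.out.println(matchesLiteralDot("axb"));
        System.out.println(matchesBackslash("c:\\temp"));
        System.out.println(matchesTab("x\ty"));
        System.out.println(matchesHexAndUnicode("AA"));
    }
}
```

For text that should be matched entirely literally, Pattern.quote avoids hand-escaping altogether.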
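Character Classes
| Syntax | Matches |
|---|---|
| [...] | Any one character listed inside the brackets. Characters can be listed individually or as ranges, and the intersection, union, and subtraction operators are supported |
| [^...] | Any one character not listed inside the brackets |
| [a-z0-9] | Character ranges: a character from a through z (inclusive) or from 0 through 9 (inclusive) |
| [0-9[a-fA-F]] | Union of classes: same as [0-9a-fA-F] |
| [a-z&&[aeiou]] | Intersection of classes: same as [aeiou] |
| [a-z&&[^aeiou]] | Subtraction: the characters a through z except the vowels |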
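The union, intersection, and subtraction rows can be checked directly. The following is a small sketch using the three class expressions from the table; the class name CharClassSketch and its methods are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical class exercising the class operators from the table above.
public class CharClassSketch {
    // Union: digits plus the hex letters, same as [0-9a-fA-F]
    private static final Pattern HEX = Pattern.compile("[0-9[a-fA-F]]+");
    // Intersection: lowercase letters that are also vowels
    private static final Pattern VOWELS = Pattern.compile("[a-z&&[aeiou]]+");
    // Subtraction: lowercase letters that are not vowels
    private static final Pattern CONSONANTS = Pattern.compile("[a-z&&[^aeiou]]+");

    public static boolean isHex(String s) {
        return HEX.matcher(s).matches();
    }

    public static boolean isAllVowels(String s) {
        return VOWELS.matcher(s).matches();
    }

    public static boolean isAllConsonants(String s) {
        return CONSONANTS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHex("1aF9"));          // only hex digits
        System.out.println(isHex("1g"));            // 'g' is outside the union
        System.out.println(isAllVowels("aeiou"));
        System.out.println(isAllConsonants("xyz"));
        System.out.println(isAllConsonants("xay")); // contains a vowel
    }
}
```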
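Other Syntax
Other important syntax covers sequences, alternation, grouping and references, as well as repetition and anchors:
- Sequences, alternation, grouping, and references
  - xy : matches x followed by y.
  - x|y : matches x or y.
  - (...) : grouping. Combines the subexpression in parentheses into a single unit that can be used with *, +, ?, | and so on, and also "captures" the text matched by the group for later use.
  - (?:...) : grouping only; the matched text is not captured.
  - \n : backreference, where n is a group number. Matches the same text that capturing group n matched.
- Repetition
  - x? : x occurs zero times or once.
  - x* : x occurs zero or more times.
  - x+ : x occurs one or more times.
  - x{n} : x occurs exactly n times.
  - x{n,} : x occurs n or more times.
  - x{n,m} : x occurs at least n and at most m times.
- Anchors
  - ^ : the start of the input string or, if the MULTILINE flag is set, the start of the string or of any line.
  - $ : the end of the input string (by default also just before a final line terminator) or, if the MULTILINE flag is set, the end of the string or of any line.
  - \b : a word boundary.
  - \B : a non-word boundary.
  - \A : the start of the input string. Like ^, but it never matches the start of a later line, whatever flags are set.
  - \Z : the end of the input string, ignoring a trailing line terminator.
  - \z : the very end of the input string, after any line terminators.
  - \G : the end of the previous match.
  - (?=x) : positive lookahead. Requires the following characters to match x, without including them in the match.
  - (?!x) : negative lookahead. Requires the following characters not to match x.
  - (?<=x) : positive lookbehind. Requires the characters immediately before this position to match x, without including them in the match. A fixed-length x is the safe choice; many JDK versions reject a lookbehind with no obvious maximum length.
  - (?<!x) : negative lookbehind. Requires the characters immediately before this position not to match x, with the same length restriction on x.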
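Because anchors and lookarounds consume no characters, their effect is easiest to see by printing what actually ends up in the match. The sketch below is an illustration under assumed names (LookaroundSketch, findAll, and the four sample patterns are all hypothetical); the expected results in the comments follow from the definitions in the list above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class illustrating zero-width assertions.
public class LookaroundSketch {
    // Digits preceded by '$' and followed by ".00"; the lookbehind and
    // lookahead only test the context and are not part of the match.
    public static final Pattern WHOLE_DOLLARS =
        Pattern.compile("(?<=\\$)\\d+(?=\\.00)");

    // "cat" as a whole word only.
    public static final Pattern WORD_CAT = Pattern.compile("\\bcat\\b");

    // With MULTILINE, ^ also matches after a line terminator; \A never does.
    public static final Pattern LINE_START =
        Pattern.compile("^b", Pattern.MULTILINE);
    public static final Pattern INPUT_START =
        Pattern.compile("\\Ab", Pattern.MULTILINE);

    /** Collects the text of every match of p in input. */
    public static List<String> findAll(Pattern p, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll(WHOLE_DOLLARS, "$15.00, $7.50, 20.00")); // [15]
        System.out.println(findAll(WORD_CAT, "cat concat cat's category")); // [cat, cat]
        System.out.println(findAll(LINE_START, "a\nb"));                     // [b]
        System.out.println(findAll(INPUT_START, "a\nb"));                    // []
    }
}
```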
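There is also some miscellaneous syntax, such as the possessive (atomic) group (?>x) and inline flag settings (?onflags-offflags). The repetition operators listed above are greedy quantifiers by default; appending a question mark turns a quantifier into a "reluctant" quantifier, and appending a plus sign turns it into a "possessive" quantifier. Anchors do not match characters; they match zero-width positions between characters.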
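The three quantifier modes are easiest to tell apart on one input. The following sketch is an illustration only; the class QuantifierSketch and the helper firstMatch are hypothetical names. It runs a greedy, a reluctant, and a possessive variant of the same pattern, then an atomic group and an inline flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class contrasting greedy, reluctant, and possessive matching.
public class QuantifierSketch {
    /** Returns the text of the first match, or null if there is none. */
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>bold</b> and <i>italic</i>";

        // Greedy: takes as much as possible, then backs off to the last '>'.
        System.out.println(firstMatch("<.+>", input));   // the whole input

        // Reluctant: takes as little as possible, stops at the first '>'.
        System.out.println(firstMatch("<.+?>", input));  // <b>

        // Possessive: consumes to the end of the line and never gives the
        // final '>' back, so the overall match fails.
        System.out.println(firstMatch("<.++>", input));  // null

        // Atomic group: a+ keeps all three 'a's, so the following 'a' fails.
        System.out.println(firstMatch("(?>a+)ab", "aaab")); // null
        System.out.println(firstMatch("a+ab", "aaab"));     // aaab

        // Inline flag: (?i) switches on case-insensitive matching.
        System.out.println(firstMatch("(?i)thanks", "THANKS")); // THANKS
    }
}
```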
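An Object-Oriented File grep Implementation
This section presents an object-oriented file grep implementation. A Grep instance is bound to one regular expression and can be used to scan different files for that same pattern. The complete listing follows: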
package com.ronsoft.books.nio.regex;
import java.io.File;
import java.io.FileReader;
import java.io.LineNumberReader;
import java.io.IOException;
import java.util.List;
import java.util.LinkedList;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
* A file searching class, similar to grep, which returns information
* about lines matched in the specified files. Instances of this class
* are tied to a specific regular expression pattern and may be applied
* repeatedly to multiple files. Instances of Grep are thread safe,
* they may be shared.
*
* @author Michael Daudel (mgd@ronsoft.com) (original)
* @author Ron Hitchens (ron@ronsoft.com) (hacked)
*/
public class Grep
{
// the pattern to use for this instance
private Pattern pattern;
/**
* Instantiate a Grep object for the given pre-compiled Pattern
* object.
* @param pattern A java.util.regex.Pattern object specifying the
* pattern to search for.
*/
public Grep (Pattern pattern)
{
this.pattern = pattern;
}
/**
* Instantiate a Grep object and compile the given regular
* expression string.
* @param regex The regular expression string to compile into a
* Pattern for internal use.
* @param ignoreCase If true, pass Pattern.CASE_INSENSITIVE to the
* Pattern.compile() call so that searches will be done without regard
* to alphabetic case. Note, this only applies to the ASCII
* character set. Use embedded expressions to set other options.
*/
public Grep (String regex, boolean ignoreCase)
{
this.pattern = Pattern.compile (regex,
(ignoreCase) ? Pattern.CASE_INSENSITIVE : 0);
}
/**
* Instantiate a Grep object with the given regular expression
* string, with default options.
*/
public Grep (String regex)
{
this (regex, false);
}
/**
* Perform a grep on the given file.
* @param file A File object denoting the file to scan for the
* regex given when this Grep instance was constructed.
* @return A type-safe array of Grep.MatchedLine objects describing
* the lines of the file matched by the pattern.
* @exception IOException If there is a problem reading the file.
*/
public MatchedLine [] grep (File file)
throws IOException
{
List list = grepList (file);
MatchedLine matches [] = new MatchedLine [list.size()];
list.toArray (matches);
return (matches);
}
/**
* Perform a grep on the given file.
* @param fileName A String filename denoting the file to scan for the
* regex given when this Grep instance was constructed.
* @return A type-safe array of Grep.MatchedLine objects describing
* the lines of the file matched by the pattern.
* @exception IOException If there is a problem reading the file.
*/
public MatchedLine [] grep (String fileName)
throws IOException
{
return (grep (new File (fileName)));
}
/**
* Perform a grep on the given list of files. If a given file
* cannot be read, it will be ignored as if empty.
* @param files An array of File objects to scan.
* @return A type-safe array of Grep.MatchedLine objects describing
* the lines of the file matched by the pattern.
*/
public MatchedLine [] grep (File [] files)
{
List aggregate = new LinkedList();
for (int i = 0; i < files.length; i++) {
try {
List temp = grepList (files [i]);
aggregate.addAll (temp);
} catch (IOException e) {
// ignore I/O exceptions
}
}
MatchedLine matches [] = new MatchedLine [aggregate.size()];
aggregate.toArray (matches);
return (matches);
}
/**
* Encapsulation of a matched line from a file. This immutable
* object has five read-only properties:
* <li>getFile(): The File this match pertains to.</li>
* <li>getLineNumber(): The line number (1-relative) within the
* file where the match was found.</li>
* <li>getLineText(): The text of the matching line</li>
* <li>start(): The index within the line where the matching
* pattern begins.</li>
* <li>end(): The index, plus one, of the end of the matched
* character sequence.</li>
*/
public static class MatchedLine
{
private File file;
private int lineNumber;
private String lineText;
private int start;
private int end;
MatchedLine (File file, int lineNumber, String lineText,
int start, int end)
{
this.file = file;
this.lineNumber = lineNumber;
this.lineText = lineText;
this.start = start;
this.end = end;
}
public File getFile()
{
return (this.file);
}
public int getLineNumber()
{
return (this.lineNumber);
}
public String getLineText()
{
return (this.lineText);
}
public int start()
{
return (this.start);
}
public int end()
{
return (this.end);
}
}
/**
* Run the grepper on the given File.
* @return A (non-type-safe) List of MatchedLine objects.
*/
private List grepList (File file)
throws IOException
{
if ( ! file.exists()) {
throw new IOException ("Does not exist: " + file);
}
if ( ! file.isFile()) {
throw new IOException ("Not a regular file: " + file);
}
if ( ! file.canRead()) {
throw new IOException ("Unreadable file: " + file);
}
LinkedList list = new LinkedList();
FileReader fr = new FileReader (file);
LineNumberReader lnr = new LineNumberReader (fr);
Matcher matcher = this.pattern.matcher ("");
String line;
while ((line = lnr.readLine()) != null) {
matcher.reset (line);
if (matcher.find()) {
list.add (new MatchedLine (file,
lnr.getLineNumber(), line,
matcher.start(), matcher.end()));
}
}
lnr.close();
return (list);
}
/**
* Test code to run grep operations. Accepts two command-line
* options: -i or --ignore-case, compile the given pattern so
* that case of alpha characters is ignored. Or -1, which runs
* the grep operation on each individual file, rather than passing
* them all to one invocation. This is just to test the different
* methods. The printed output is slightly different when -1 is
* specified.
*/
public static void main (String [] argv)
{
// Set defaults
boolean ignoreCase = false;
boolean onebyone = false;
List argList = new LinkedList(); // to gather args
// Loop through the args, looking for switches and saving
// the patterns and filenames
for (int i = 0; i < argv.length; i++) {
if (argv [i].startsWith ("-")) {
if (argv [i].equals ("-i")
|| argv [i].equals ("--ignore-case"))
{
ignoreCase = true;
}
if (argv [i].equals ("-1")) {
onebyone = true;
}
continue;
}
// not a switch, add it to the list
argList.add (argv [i]);
}
// Enough args to run?
if (argList.size() < 2) {
System.err.println ("usage: [options] pattern filename ...");
return;
}
// First arg on the list will be taken as the regex pattern.
// Pass the pattern to the new Grep object, along with the
// current value of the ignore case flag.
Grep grepper = new Grep ((String) argList.remove (0),
ignoreCase);
// somewhat arbitrarily split into two ways of calling the
// grepper and printing out the results
if (onebyone) {
Iterator it = argList.iterator();
// Loop through the filenames and grep them
while (it.hasNext()) {
String fileName = (String) it.next();
// Print the filename once before each grep
System.out.println (fileName + ":");
MatchedLine [] matches = null;
// Catch exceptions
try {
matches = grepper.grep (fileName);
} catch (IOException e) {
System.err.println ("\t*** " + e);
}
if (matches != null) {
for (MatchedLine match : matches) {
System.out.println (" Line " + match.getLineNumber() + ": " + match.getLineText());
}
}
}
} else {
File [] files = new File[argList.size()];
for (int i = 0; i < argList.size(); i++) {
files[i] = new File((String) argList.get(i));
}
MatchedLine [] matches = grepper.grep(files);
for (MatchedLine match : matches) {
System.out.println (match.getFile().getName() + ": Line " + match.getLineNumber() + ": " + match.getLineText());
}
}
}
}
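The Grep class is used as follows:
1. Create a Grep object. You can pass a precompiled Pattern object or a regular expression string, and with a string you can also specify whether to ignore case.
2. Call the grep method with a File object, a file name, or an array of File objects. It returns an array of Grep.MatchedLine objects describing the lines that matched.
3. Each MatchedLine encapsulates the information about one matching line: the File, the line number, the line text, and the start and end indexes of the match within the line.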
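For readers on a newer JDK, the same line-scanning idea can be written more compactly. The sketch below is not the Grep class from the listing; MiniGrep and Hit are hypothetical names, and it assumes Java 7 or later (java.nio.file, generics, try-with-resources) and UTF-8 input. It keeps the two design points of the original: one immutable Pattern per instance, and one Matcher that is reset for each line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified variant of the Grep class shown above.
public class MiniGrep {
    /** One matching line: 1-based line number plus the match offsets. */
    public static final class Hit {
        public final int lineNumber;
        public final String text;
        public final int start;
        public final int end;

        Hit(int lineNumber, String text, int start, int end) {
            this.lineNumber = lineNumber;
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    private final Pattern pattern;

    public MiniGrep(String regex, boolean ignoreCase) {
        this.pattern = Pattern.compile(regex,
            ignoreCase ? Pattern.CASE_INSENSITIVE : 0);
    }

    /** Returns one Hit for every line containing at least one match. */
    public List<Hit> grep(Path file) throws IOException {
        List<Hit> hits = new ArrayList<>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader =
                 Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            Matcher matcher = pattern.matcher(""); // reused for every line
            String line;
            int lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                if (matcher.reset(line).find()) {
                    hits.add(new Hit(lineNumber, line,
                        matcher.start(), matcher.end()));
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample file so the sketch runs on its own.
        Path tmp = Files.createTempFile("minigrep", ".txt");
        Files.write(tmp, Arrays.asList("alpha", "Beta", "gamma beta"),
            StandardCharsets.UTF_8);
        for (Hit hit : new MiniGrep("beta", true).grep(tmp)) {
            System.out.println(hit.lineNumber + ": " + hit.text);
        }
        Files.delete(tmp);
    }
}
```

Because each call to grep creates its own Matcher, an instance can be shared between threads in the same way the Javadoc of the original class describes.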
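The sections above covered the use of regular expressions in Java: string processing, the syntax rules, and an object-oriented file search implementation.
Java Regular Expressions: A Comprehensive Guide with Practical Applications (continued)
Optimization Strategies for Regular Expressions in Practice
In real-world development, regular expression performance matters. Here are some optimization strategies:
1. Avoid unnecessary backtracking: Backtracking is a common performance bottleneck in regular expression matching. A greedy quantifier first takes as many characters as it can and gives them back one at a time when the rest of the pattern fails. A possessive quantifier never gives characters back, and a reluctant quantifier starts with as few characters as possible, which can shorten the search when the terminating text is close. Neither is a drop-in replacement for the greedy form: .*? and .*+ can match different text than .* does (for example, .*+x can never match, because .*+ consumes the rest of the line and refuses to give the x back), so verify that the rewritten pattern still matches what you intend.
2. Cache compiled expressions: If the same regular expression is used many times, cache the compiled Pattern object instead of compiling it again. For example: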
import java.util.regex.Pattern;
public class RegexCacheExample {
private static final Pattern CACHED_PATTERN = Pattern.compile("your_regex_here");
public static void main(String[] args) {
// Match using the cached Pattern object
java.util.regex.Matcher matcher = CACHED_PATTERN.matcher("your_input_string");
if (matcher.find()) {
System.out.println("Match found!");
}
}
}
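3. Simplify the expression: Prefer simple regular expressions and avoid deep nesting and unnecessary groups. A complex expression is harder to understand and can also increase matching time.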
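The backtracking strategy can be illustrated with nested quantifiers, a classic trouble spot. The sketch below is an assumption-laden illustration: the class BacktrackingSketch and both patterns are hypothetical examples, and actual running times depend on the JDK version, so it compares only the results. In a backtracking engine the nested greedy form may try many ways of dividing a run of a's between the inner and outer loop before it reports a failure, while the possessive form keeps what it consumed and fails after a single attempt.

```java
import java.util.regex.Pattern;

// Hypothetical class comparing a nested greedy pattern with a possessive one.
public class BacktrackingSketch {
    // Nested greedy quantifiers: on a failing input the engine may retry
    // many different splits of the 'a's between the two loops.
    private static final Pattern NESTED_GREEDY = Pattern.compile("(?:a+)+b");

    // Possessive inner quantifier: the run of 'a's is consumed once and
    // never handed back, so there is nothing to retry.
    private static final Pattern POSSESSIVE = Pattern.compile("(?:a++)+b");

    public static boolean greedyMatches(String s) {
        return NESTED_GREEDY.matcher(s).matches();
    }

    public static boolean possessiveMatches(String s) {
        return POSSESSIVE.matcher(s).matches();
    }

    /** Builds a string of count 'a' characters followed by last. */
    public static String runOfA(int count, char last) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append('a');
        }
        return sb.append(last).toString();
    }

    public static void main(String[] args) {
        String ok = runOfA(20, 'b');   // should match
        String bad = runOfA(20, 'c');  // should not match

        // Both patterns accept and reject the same inputs here; only the
        // amount of work on the failing input differs.
        System.out.println(greedyMatches(ok) + " " + possessiveMatches(ok));
        System.out.println(greedyMatches(bad) + " " + possessiveMatches(bad));
    }
}
```

The two patterns agree on these inputs, which is what makes the possessive rewrite safe in this case; that equivalence has to be checked for each pattern individually.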
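Regular Expressions in Data Validation
Regular expressions are widely used for data validation, for example to check email addresses and mobile phone numbers. The patterns below are pragmatic checks rather than complete specifications of either format. Some common examples:
1. Validating an email address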
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class EmailValidator {
private static final String EMAIL_REGEX = "^[a-zA-Z0-9_+&*-]+(?:\\.[a-zA-Z0-9_+&*-]+)*@(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,7}$";
private static final Pattern EMAIL_PATTERN = Pattern.compile(EMAIL_REGEX);
public static boolean isValidEmail(String email) {
Matcher matcher = EMAIL_PATTERN.matcher(email);
return matcher.matches();
}
public static void main(String[] args) {
String testEmail = "example@example.com";
if (isValidEmail(testEmail)) {
System.out.println(testEmail + " is a valid email address.");
} else {
System.out.println(testEmail + " is not a valid email address.");
}
}
}
2. Validating a mobile phone number (the 11-digit mainland China format)
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class PhoneValidator {
private static final String PHONE_REGEX = "^1[3-9]\\d{9}$";
private static final Pattern PHONE_PATTERN = Pattern.compile(PHONE_REGEX);
public static boolean isValidPhone(String phone) {
Matcher matcher = PHONE_PATTERN.matcher(phone);
return matcher.matches();
}
public static void main(String[] args) {
String testPhone = "13800138000";
if (isValidPhone(testPhone)) {
System.out.println(testPhone + " is a valid phone number.");
} else {
System.out.println(testPhone + " is not a valid phone number.");
}
}
}
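Advanced Uses of Regular Expressions in Text Processing
Beyond basic matching and replacement, regular expressions can handle more involved text processing tasks, such as extracting specific information and splitting text.
1. Extracting specific information: extract all URLs from a piece of text.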
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class UrlExtractor {
private static final String URL_REGEX = "(https?|ftp)://[^\\s/$.?#].[^\\s]*";
private static final Pattern URL_PATTERN = Pattern.compile(URL_REGEX);
public static void extractUrls(String text) {
Matcher matcher = URL_PATTERN.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
public static void main(String[] args) {
String text = "Check out these websites: https://www.example.com and http://test.org";
extractUrls(text);
}
}
2. Splitting text: split text on a given delimiter.
import java.util.Arrays;
public class TextSplitter {
public static void main(String[] args) {
String text = "apple,orange,banana";
String[] fruits = text.split(",");
System.out.println(Arrays.toString(fruits));
}
}
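Since the argument to split is itself a regular expression, the delimiter can be a pattern rather than a single character, and a limit can cap the number of pieces. The sketch below shows both, plus Pattern.quote for a delimiter that should be taken literally. The class SplitSketch and its methods are hypothetical names chosen for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical class showing regex delimiters, limits, and literal delimiters.
public class SplitSketch {
    // A comma or semicolon with optional surrounding whitespace.
    private static final Pattern DELIMITER = Pattern.compile("\\s*[,;]\\s*");

    /** Splits on every delimiter. */
    public static String[] splitAll(String text) {
        return DELIMITER.split(text);
    }

    /** Splits into at most limit pieces; the last piece keeps the remainder. */
    public static String[] splitAtMost(String text, int limit) {
        return DELIMITER.split(text, limit);
    }

    /** Splits on a delimiter taken literally, even if it is a metacharacter. */
    public static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String text = "apple, orange;banana ,kiwi";
        System.out.println(Arrays.toString(splitAll(text)));
        System.out.println(Arrays.toString(splitAtMost(text, 2)));
        // An unquoted "." would be the any-character metacharacter.
        System.out.println(Arrays.toString(splitLiteral("a.b.c", ".")));
        System.out.println("a.b.c".split(".").length);
    }
}
```

The last line prints 0: with an unquoted dot every character is a delimiter, and split drops the resulting trailing empty strings.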
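Combining Regular Expressions with Other Techniques
In practice, regular expressions are often combined with other techniques to build more capable tools.
1. Combined with file processing: Building on the Grep class introduced earlier, we can process file contents in more elaborate ways, for example by counting the lines in a file that contain a given keyword.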
import com.ronsoft.books.nio.regex.Grep;
import java.io.IOException;
public class KeywordLineCounter {
public static void main(String[] args) {
String regex = "your_keyword_here";
Grep grepper = new Grep(regex);
try {
com.ronsoft.books.nio.regex.Grep.MatchedLine[] matches = grepper.grep("your_file_path_here");
System.out.println("Number of lines containing the keyword: " + matches.length);
} catch (IOException e) {
System.err.println("Error reading file: " + e.getMessage());
}
}
}
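2. Combined with network programming: In a web crawler, a regular expression can pull simple pieces of information out of an HTML page. The following simple example extracts links from a page, specifically those where href is the first attribute of an a tag and its value is in double quotes: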
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class HtmlLinkExtractor {
public static void main(String[] args) {
try {
URL url = new URL("https://www.example.com");
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
StringBuilder htmlContent = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
// Keep the line breaks so that tokens on adjacent lines do not run together
htmlContent.append(line).append('\n');
}
reader.close();
String html = htmlContent.toString();
String linkRegex = "<a\\s+href\\s*=\\s*\"([^\"]+)\"";
Pattern pattern = Pattern.compile(linkRegex);
Matcher matcher = pattern.matcher(html);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
} catch (IOException e) {
System.err.println("Error fetching HTML page: " + e.getMessage());
}
}
}
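The extraction step can also be exercised without a network connection. The sketch below is a slightly more tolerant variant offered as an assumption, not as the article's method: LinkSketch and extractLinks are hypothetical names, and the pattern additionally accepts single quotes, upper-case tags, and attributes before href. A regular expression is still not an HTML parser, so this suits quick extraction from well-behaved markup only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, offline variant of the link extraction step.
public class LinkSketch {
    // Group 1 is the opening quote and group 2 the URL; the backreference
    // \1 requires the closing quote to be the same character.
    private static final Pattern LINK = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE);

    /** Returns the href value of every anchor tag found in html. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<A HREF='/one'>1</A> "
            + "<a class=\"x\" href=\"https://example.com/two\">2</a>";
        System.out.println(extractLinks(html));
    }
}
```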
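Summary
Regular expressions are a powerful tool in Java, with broad use in string processing, data validation, and text processing. Understanding the syntax rules, applying the optimization strategies, and combining regular expressions with other techniques all help you use them more effectively in real projects.
Appendix: Regular Expression Syntax Diagram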
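graph TD;
%% Labels are quoted so that brackets, parentheses and the pipe are taken literally.
%% #92; is the Mermaid entity code for a backslash.
A["Single characters"] --> B["Ordinary characters"];
A --> C["Escape sequences"];
C --> C1["#92;n newline"];
C --> C2["#92;t tab"];
C --> C3["#92;r carriage return"];
C --> C4["#92;uhhhh Unicode character"];
D["Character classes"] --> E["Bracketed classes"];
D --> F["Predefined classes"];
E --> E1["[a-z] character range"];
E --> E2["[^a-z] negation"];
F --> F1["#92;d digit"];
F --> F2["#92;s whitespace"];
F --> F3["#92;w word character"];
G["Repetition"] --> H["Greedy quantifiers"];
G --> I["Reluctant quantifiers"];
G --> J["Possessive quantifiers"];
H --> H1["x* zero or more"];
H --> H2["x+ one or more"];
H --> H3["x? zero or one"];
I --> I1["x*? reluctant zero or more"];
J --> J1["x*+ possessive zero or more"];
K["Anchors"] --> L["^ start"];
K --> M["$ end"];
K --> N["#92;b word boundary"];
O["Grouping and references"] --> P["(...) capturing group"];
O --> Q["(?:...) non-capturing group"];
O --> R["#92;n backreference"];
S["Alternation"] --> T["x|y matches x or y"];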
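This diagram shows the main categories of Java regular expression syntax: single characters, character classes, repetition, anchors, grouping and references, and alternation. It gives a quick visual overview of how the syntax is organized.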