Understanding tr1.regex


License: CC 4.0 BY-SA


// 2014.1.21

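As I understand it, regex comes down to three things: writing the regular expression, matching/searching, and replacing.

If you do not need anything too complex, the material below is usually enough: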

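tr1.regex metacharacters
1: . matches any single character except \n; to match any character including \n, use [\s\S] (inside brackets '.' is a literal dot, so [.\n] matches only '.' or \n)
2: ^ matches the beginning of the target sequence (the start of the line)
3: $ matches the end of the target sequence (the end of the line)
4: () defines a subexpression (capture group)
5: * the preceding element may repeat any number of times (including zero)
6: + the preceding element may repeat one or more times
7: ? the preceding element may appear zero times or once
8: {} explicit repetition count: {n} exactly n times, {n,} at least n times, {n,m} from n to m times
9: [] defines a character set: it can list individual characters, define ranges, or be the complement of a set
10: \ escape character (written as \\ inside a C++ string literal)
11: | logical OR: matches either of the elements on its two sides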

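The entries below are, in order: alphanumerics, letters, blank (space or tab), control characters, digits, digits, graphic characters (alphanumerics plus punctuation), lowercase, printable characters, punctuation, whitespace (including \r, \n, etc.), whitespace, uppercase, w (word characters, i.e. alphanumerics plus '_'; the table only stores a placeholder mask of -1 for it), hexadecimal digits (0-9 A-F a-f).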
template<>
const _Cl_names<char> _Regex_traits<char>::_Names[] =
	{	// map class names to numeric constants
	_REGEX_CHAR_CLASS_NAME("alnum", _Regex_traits<char>::_Ch_alnum),
	_REGEX_CHAR_CLASS_NAME("alpha", _Regex_traits<char>::_Ch_alpha),
	_REGEX_CHAR_CLASS_NAME("blank", _Regex_traits<char>::_Ch_blank),
	_REGEX_CHAR_CLASS_NAME("cntrl", _Regex_traits<char>::_Ch_cntrl),
	_REGEX_CHAR_CLASS_NAME("d", _Regex_traits<char>::_Ch_digit),
	_REGEX_CHAR_CLASS_NAME("digit", _Regex_traits<char>::_Ch_digit),
	_REGEX_CHAR_CLASS_NAME("graph", _Regex_traits<char>::_Ch_graph),
	_REGEX_CHAR_CLASS_NAME("lower", _Regex_traits<char>::_Ch_lower),
	_REGEX_CHAR_CLASS_NAME("print", _Regex_traits<char>::_Ch_print),
	_REGEX_CHAR_CLASS_NAME("punct", _Regex_traits<char>::_Ch_punct),
	_REGEX_CHAR_CLASS_NAME("space", _Regex_traits<char>::_Ch_space),
	_REGEX_CHAR_CLASS_NAME("s", _Regex_traits<char>::_Ch_space),
	_REGEX_CHAR_CLASS_NAME("upper", _Regex_traits<char>::_Ch_upper),
	_REGEX_CHAR_CLASS_NAME("w", (_STD ctype_base::mask)(-1)),
	_REGEX_CHAR_CLASS_NAME("xdigit", _Regex_traits<char>::_Ch_xdigit),
	{0, 0, 0},
	};

The detailed documentation follows:
Regular Expression Grammar
1:Element
An ordinary character that matches the same character in the target sequence
A wildcard character '.' that matches any character in the target sequence except a newline (that is, except \n)
A bracket expression of the form "[expr]", which matches a character or a collation element in the target sequence that is also in the set defined by the expression expr, or of the form "[^expr]", which matches a character or a collation element in the target sequence that is not in the set defined by the expression expr.
The expression expr can contain any combination of the following things:
An individual character. Adds that character to the set defined by expr.
A character range of the form "ch1-ch2". Adds the characters that are represented by values in the closed range [ch1, ch2] to the set defined by expr.
A character class of the form "[:name:]". Adds the characters in the named class to the set defined by expr.
An equivalence class of the form "[=elt=]". Adds the collating elements that are equivalent to elt to the set defined by expr.
A collating symbol of the form "[.elt.]". Adds the collation element elt to the set defined by expr.
An anchor. Anchor '^' matches the beginning of the target sequence; anchor '$' matches the end of the target sequence.
An identity escape of the form "\k", which matches the character k in the target sequence.
Other, more complex cases are not considered here.
2:Repetition
"{min,max}". At least min and at most max repetitions.
"*". Equivalent to "{0,unbounded}".
"?". Equivalent to "{0,1}".
"+". Equivalent to "{1,unbounded}".
3:Concatenation
Regular expression elements, with or without repetition counts, can be concatenated to form longer regular expressions. The resulting expression matches a target sequence that is a concatenation of the sequences that are matched by the individual elements. For example, "a{2,3}b" matches the target sequence "aab" and the target sequence "aaab", but does not match the target sequence "ab" or the target sequence "aaaab".
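The passage above makes two points: first, the matched sequence is the matches of the individual elements joined in order; second, a repetition applies to the whole element it follows, so (ab){2} matches abab, not aabb or anything else.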
4:Alternation
In all regular expression grammars except BRE and grep, a concatenated regular expression can be followed by the character '|' and another concatenated regular expression. Any number of concatenated regular expressions can be combined in this manner. The resulting expression matches any target sequence that matches one or more of the concatenated regular expressions.
When more than one of the concatenated regular expressions matches the target sequence, ECMAScript chooses the first of the concatenated regular expressions that matches the sequence as the match (first match); the other regular expression grammars choose the one that achieves the longest match. For example, "ab|cd" matches the target sequence "ab" and the target sequence "cd", but does not match the target sequence "abd" or the target sequence "acd".
In grep and egrep, a newline character ('\n') can be used to separate alternations.
5:Subexpression
In BRE and grep, a subexpression is a concatenation. In the other regular expression grammars, a subexpression is an alternation.
This still needs to be verified, to determine whether a subexpression is a concatenation or an alternation.

Semantic Details
Anchor
An anchor matches a position in the target string, not a character. A '^' matches the beginning of the target string, and a '$' matches the end of the target string.

Back Reference
A back reference is a backslash that is followed by a decimal value N. It matches the contents of the Nth capture group. The value of N must not be more than the number of capture groups that precede the back reference. In BRE and grep, the value of N is determined by the decimal digit that follows the backslash. In ECMAScript, the value of N is determined by all the decimal digits that immediately follow the backslash. Therefore, in BRE and grep, the value of N is never more than 9, even if the regular expression has more than nine capture groups. In ECMAScript, the value of N is unbounded.
Bracket Expression
A bracket expression defines a set of characters and collating elements. When the bracket expression begins with the character '^' the match succeeds if no elements in the set match the current character in the target sequence. Otherwise, the match succeeds if any one of the elements in the set matches the current character in the target sequence.

The set of characters can be defined by listing any combination of individual characters, character ranges, character classes, equivalence classes, and collating symbols.

Capture Group
A capture group marks its contents as a single unit in the regular expression grammar and labels the target text that matches its contents. The label that is associated with each capture group is a number, which is determined by counting the opening parentheses that mark capture groups up to and including the opening parenthesis that marks the current capture group. In this implementation, the maximum number of capture groups is 31.
"((a+)(b+))(c+)" matches the target sequence "aabbbc" and associates capture group 1 with the subsequence "aabbb", capture group 2 with the subsequence "aa", capture group 3 with "bbb", and capture group 4 with the subsequence "c".
Put simply, a capture group is a pair of parentheses.
Character Class
A character class in a bracket expression adds all the characters in the named class to the character set that is defined by the bracket expression. To create a character class, use "[:" followed by the name of the class followed by ":]". Internally, names of character classes are recognized by calling id = traits.lookup_classname. A character ch belongs to such a class if traits.isctype(ch, id) returns true. The default regex_traits template supports the class names in the following table.
This table is fairly important:
[:alnum:] lowercase letters, uppercase letters, and digits
[:alpha:] lowercase letters and uppercase letters
[:blank:] space or tab
[:cntrl:] the file format escape characters
[:digit:] digits
[:graph:] lowercase letters, uppercase letters, digits, and punctuation
[:lower:] lowercase letters
[:print:] lowercase letters, uppercase letters, digits, punctuation, and space
[:punct:] punctuation
[:space:] space
[:upper:] uppercase characters
[:xdigit:] digits, 'a', 'b', 'c', 'd', 'e', 'f', 'A', 'B', 'C', 'D', 'E', 'F'
[:d:] same as digit
[:s:] same as space
[:w:] same as alnum, plus the underscore '_'
Character Range
[a-b]
A character range in a bracket expression adds all the characters in the range to the character set that is defined by the bracket expression. To create a character range, put the character '-' between the first and last characters in the range. Doing this puts into the set all characters that have a numeric value that is more than or equal to the numeric value of the first character, and less than or equal to the numeric value of the last character. Notice that this set of added characters depends on the platform-specific representation of characters. If the character '-' occurs at the beginning or the end of a bracket expression, or as the first or last character of a character range, it represents itself.
The examples for this section are classic; it is best to look them up on MSDN.
Collating Element
A collating element is a multi-character sequence that is treated as a single character.
Collating Symbol
A collating symbol in a bracket expression adds a collating element to the set that is defined by the bracket expression. To create a collating symbol, use "[." followed by the collating element followed by ".]".
Form: [.elt.]
Control Escape Sequence
A control escape sequence is a backslash followed by the letter 'c' followed by one of the letters 'a' through 'z' or 'A' through 'Z'. It matches the ASCII control character that is named by that letter. For example, "\ci" matches the target sequence "\x09", because <ctrl-i> has the value 0x09.
DSW Character Escape
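Above, \c followed by a letter denotes a control escape sequence; the DSW escapes are the ordinary, familiar ones.
They are simply shorthand for the character classes listed above.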
Escape Sequence    Equivalent Named Class    Default Named Class
"\d"               "[[:d:]]"                 "[[:digit:]]"
"\D"               "[^[:d:]]"                "[^[:digit:]]"
"\s"               "[[:s:]]"                 "[[:space:]]"
"\S"               "[^[:s:]]"                "[^[:space:]]"
"\w"               "[[:w:]]"                 "[a-zA-Z0-9_]"*
"\W"               "[^[:w:]]"                "[^a-zA-Z0-9_]"*
*ASCII character set
Equivalence Class
An equivalence class in a bracket expression adds all the characters and collating elements that are equivalent to the collating element in the equivalence class definition to the set that is defined by the bracket expression. To create an equivalence class, use "[=" followed by a collating element followed by "=]". Internally, two collating elements elt1 and elt2 are equivalent if traits.transform_primary(elt1.begin(), elt1.end()) == traits.transform_primary(elt2.begin(), elt2.end()).
File Format Escape
A file format escape consists of the usual C language character escape sequences, "\\", "\a", "\b", "\f", "\n", "\r", "\t", "\v". These have the usual meanings, that is, backslash, alert, backspace, form feed, newline, carriage return, horizontal tab, and vertical tab, respectively. In ECMAScript, "\a" and "\b" are not allowed. ("\\" is allowed, but it is an identity escape, not a file format escape).
Hexadecimal Escape Sequence
A hexadecimal escape sequence is a backslash followed by the letter 'x' followed by two hexadecimal digits (0-9a-fA-F). It matches a character in the target sequence that has the value that is specified by the two digits. For example, "\x41" matches the target sequence "A" when ASCII character encoding is used.

Identity Escape
An identity escape is a backslash followed by a single character. It matches that character. It is required when the character has a special meaning; by using the identity escape, the special meaning is removed.
Individual Character
An individual character in a bracket expression adds that character to the character set that is defined by the bracket expression. Anywhere in a bracket expression except at the beginning, a '^' represents itself.
Negative Assert

Matching and Searching
Format Flags

An example (found online):

#include <iostream>
#include <regex>		// TR1-era compilers put these classes in namespace std::tr1
#include <string>

using namespace std;

int main()
{
	string str_ip     = "127.0.0.1";
	string str_new_ip = "192.168.1.106";
	string str_regex  = "^\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}$";
	// Other patterns to try:
	//   "(\\d{1,3}\\.){3}\\d{1,3}"
	//   "(\\d+)\\.(\\d+)\\.(\\d+)\\.(\\d+)"

	// Expression option: ignore case (no effect on this digits-only pattern)
	regex_constants::syntax_option_type f1 = regex_constants::icase;

	// Compile the regular expression
	regex reg_Express(str_regex, f1);

	// Holds the match results
	smatch ms;

	// Check whether the whole string matches
	if (regex_match(str_ip, ms, reg_Express))
		cout << "regex: " << str_regex << " matches: " << str_ip
			<< " success" << endl;

	// Search
	if (regex_search(str_ip, ms, reg_Express))
	{
		for (size_t i = 0; i < ms.size(); ++i)
		{
			cout << "result " << i << ": \"" << ms.str(i) << "\" - ";
			cout << "position: " << ms.position(i) << ", length: " << ms.length(i) << std::endl;
		}

		// Replace the matched range; this invalidates the iterators held by ms
		str_ip.replace(ms[0].first, ms[0].second, str_new_ip);
		cout << str_ip << endl;
	}

	// Replace every match with a fixed string
	string str_new_ip_2 = "abc";
	string str_new_text = regex_replace(str_ip, reg_Express, str_new_ip_2);
	cout << str_new_text << endl;

	cin.get();
	return 0;
}

The code is fairly self-explanatory.