Regular Expressions 正则表达式(一)

本文介绍了正则表达式的基本概念,包括元字符和字面量,并讨论了Oracle数据库中支持的POSIX元字符和Perl影响的扩展。通过示例展示了如何使用正则表达式进行模式匹配,以及在数据完整性约束中的应用。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

 Regular Expressions

Regular expressions enable you to search for patterns in string data by using standardized syntax conventions. You specify a regular expression by means of the following types of characters:

  • Metacharacters, which are operators that specify search algorithms

  • Literals, which are the characters for which you are searching

    ***正则表达式由2部分组成,一个是元字符+需要寻找的字符


A regular expression can specify complex patterns of character sequences. For example, the following regular expression searches for the literals f or ht, thet literal, the p literal optionally followed by the s literal, and finally the colon (:) literal:

(f|ht)tps?:

The parentheses are metacharacters that group a series of pattern elements to a single element; the pipe symbol (|) matches one of the alternatives in the group. The question mark (?) is a metacharacter indicating that the preceding pattern, in this case the s character, is optional. Thus, the preceding regular expression matches the http:https:ftp:, and ftps: strings.


Metacharacters:


POSIX Metacharacters in Oracle Database Regular Expressions

Table 4-2 lists the list of metacharacters supported for use in regular expressions passed to SQL regular expression functions and conditions. These metacharacters conform to the POSIX standard; any differences in behavior from the standard are noted in the "Description" column.

Table 4-2 POSIX Metacharacters in Oracle Database Regular Expressions

SyntaxOperator NameDescriptionExample

.

Any Character — Dot

Matches any character in the database character set. If the nflag is set, it matches the newline character. The newline is recognized as the linefeed character (\x0a) on UNIX and Windows or the carriage return character (\x0d) on Macintosh platforms.

Note: In the POSIX standard, this operator matches any English character except NULL and the newline character.

The expression a.b matches the strings abbacb, and adb, but does not match acc.

+

One or More — Plus Quantifier

Matches one or more occurrences of the preceding subexpression.

The expression a+ matches the strings aaa, and aaa, but does not match bbb.

?

Zero or One — Question Mark Quantifier

Matches zero or one occurrence of the preceding subexpression.

The expression ab?c matches the strings abc and ac, but does not match abbc.

*

Zero or More — Star Quantifier

Matches zero or more occurrences of the preceding subexpression. By default, a quantifier match is greedy because it matches as many times as possible while still allowing the rest of the match to succeed.

The expression ab*c matches the strings acabc, and abbc, but does not match abb.

{m}

Interval—Exact Count

Matches exactly m occurrences of the preceding subexpression.

The expression a{3} matches the strings aaa, but does not match aa.

{m,}

Interval—At Least Count

Matches at least m occurrences of the preceding subexpression.

The expression a{3,} matches the strings aaa and aaaa, but does not match aa.

{m,n}

Interval—Between Count

Matches at least m, but not more than n occurrences of the preceding subexpression.

The expression a{3,5} matches the strings aaaaaaa, andaaaaa, but does not match aa.

[ ... ]

Matching Character List

Matches any single character in the list within the brackets. The following operators are allowed within the list, but other metacharacters included are treated as literals:

  • Range operator: -

  • POSIX character class: [: :]

  • POSIX collation element: [. .]

  • POSIX character equivalence class: [= =]

A dash (-) is a literal when it occurs first or last in the list, or as an ending range point in a range expression, as in [#--]. A right bracket (]) is treated as a literal if it occurs first in the list.

Note: In the POSIX standard, a range includes all collation elements between the start and end of the range in the linguistic definition of the current locale. Thus, ranges are linguistic rather than byte values ranges; the semantics of the range expression are independent of character set. In Oracle Database, the linguistic range is determined by the NLS_SORTinitialization parameter.

The expression [abc] matches the first character in the stringsallbill, and cold, but does not match any characters in doll.

[^ ... ]

Non-Matching Character List

Matches any single character not in the list within the brackets. Characters not in the non-matching character list are returned as a match. Refer to the description of the Matching Character List operator for an account of metacharacters allowed in the character list.

The expression [^abc] matches the character d in the stringabcdef, but not the character ab, or c. The expression [^abc]+matches the sequence def in the string abcdef, but not ab, orc.

The expression [^a-i] excludes any character between a and ifrom the search result. This expression matches the characterj in the string hij, but does not match any characters in the string abcdefghi.

|

Or

Matches one of the alternatives.

The expression a|b matches character a or character b.

( ... )

Subexpression or Grouping

Treats the expression within parentheses as a unit. The subexpression can be a string of literals or a complex expression containing operators.

The expression (abc)?def matches the optional string abc, followed by def. Thus, the expression matches abcdefghi anddef, but does not match ghi.

\n

Backreference

Matches the nth preceding subexpression, that is, whatever is grouped within parentheses, where n is an integer from 1 to 9. The parentheses cause an expression to be remembered; a backreference refers to it. A backreference counts subexpressions from left to right, starting with the opening parenthesis of each preceding subexpression. The expression is invalid if the source string contains fewer than nsubexpressions preceding the \n.

Oracle supports the backreference expression in the regular expression pattern and the replacement string of theREGEXP_REPLACE function.

The expression (abc|def)xy\1 matches the strings abcxyabcand defxydef, but does not match abcxydef or abcxy.

A backreference enables you to search for a repeated string without knowing the actual string ahead of time. For example, the expression ^(.*)\1$ matches a line consisting of two adjacent instances of the same string.

\

Escape Character

Treats the subsequent metacharacter in the expression as a literal. Use a backslash (\) to search for a character that is normally treated as a metacharacter. Use consecutive backslashes (\\) to match the backslash literal itself.

The expression \+ searches for the plus character (+). It matches the plus character in the string abc+def, but does not match abcdef.

^

Beginning of Line Anchor

Matches the beginning of a string (default). In multiline mode, it matches the beginning of any line within the source string.

The expression ^def matches def in the string defghi but does not match def in abcdef.

$

End of Line Anchor

Matches the end of a string (default). In multiline mode, it matches the beginning of any line within the source string.

The expression def$ matches def in the string abcdef but does not match def in the string defghi.

[:class:]

POSIX Character Class

Matches any character belonging to the specified POSIX character class. You can use this operator to search for characters with specific formatting such as uppercase characters, or you can search for special characters such as digits or punctuation characters. The full set of POSIX character classes is supported.

Note: In English regular expressions, range expressions often indicate a character class. For example, [a-z] indicates any lowercase character. This convention is not useful in multilingual environments, where the first and last character of a given character class may not be the same in all languages. Oracle supports the character classes in Table 4-3 based on character class definitions in Globalization classification data.

The expression [[:upper:]]+ searches for one or more consecutive uppercase characters. This expression matchesDEF in the string abcDEFghi but does not match the stringabcdefghi.

[.element.]

POSIX Collating Element Operator

Specifies a collating element to use in the regular expression. The element must be a defined collating element in the current locale. Use any collating element defined in the locale, including single-character and multicharacter elements. TheNLS_SORT initialization parameter determines supported collation elements.This operator lets you use a multicharacter collating element in cases where only one character would otherwise be allowed. For example, you can ensure that the collating element ch, when defined in a locale such as Traditional Spanish, is treated as one character in operations that depend on the ordering of characters.

The expression [[.ch.]] searches for the collating element chand matches ch in string chabc, but does not match cdefg. The expression [a-[.ch.]] specifies the range a to ch.

[=character=]

POSIX Character Equivalence Class

Matches all characters that are members of the same character equivalence class in the current locale as the specified character.

The character equivalence class must occur within a character list, so the character equivalence class is always nested within the brackets for the character list in the regular expression.

Usage of character equivalents depends on how canonical rules are defined for your database locale. Refer to the Oracle Database Globalization Support Guide for more information on linguistic sorting and string searching.

The expression [[=n=]] searches for characters equivalent to nin a Spanish locale. It matches both N and ñ in the string El Niño.



Perl-Influenced Extensions in Oracle Regular Expressions

Table 4-4 describes Perl-influenced metacharacters supported in Oracle Database regular expression functions and conditions. These metacharacters are not in the POSIX standard, but are common at least partly due to the popularity of Perl. Note that Perl character class matching is based on the locale model of the operating system, whereas Oracle Database regular expressions are based on the language-specific data of the database. In general, a regular expression involving locale data cannot be expected to produce the same results between Perl and Oracle Database.

Table 4-4 Perl-Influenced Extensions in Oracle Regular Expressions

Reg. Exp.Matches . . .Example

\d

A digit character. It is equivalent to the POSIX class [[:digit:]].

The expression ^\(\d{3}\) \d{3}-\d{4}$ matches (650) 555-1212 but does not match 650-555-1212.

\D

A non-digit character. It is equivalent to the POSIX class [^[:digit:]].

The expression \w\d\D matches b2b and b2_ but does not match b22.

\w

A word character, which is defined as an alphanumeric or underscore (_) character. It is equivalent to the POSIX class [[:alnum:]_]. Note that if you do not want to include the underscore character, you can use the POSIX class [[:alnum:]].

The expression \w+@\w+(\.\w+)+ matches the string jdoe@company.co.uk but not the string jdoe@company.

\W

A non-word character. It is equivalent to the POSIX class [^[:alnum:]_].

The expression \w+\W\s\w+ matches the string to: bill but not the stringto bill.

\s

A whitespace character. It is equivalent to the POSIX class [[:space:]].

The expression \(\w\s\w\s\) matches the string (a b ) but not the string(ab).

\S

A non-whitespace character. It is equivalent to the POSIX class[^[:space:]].

The expression \(\w\S\w\S\) matches the string (abde) but not the string(a b d e).

\A

Only at the beginning of a string. In multi-line mode, that is, when embedded newline characters in a string are considered the termination of a line, \A does not match the beginning of each line.

The expression \AL matches only the first L character in the stringLine1\nLine2\n, regardless of whether the search is in single-line or multi-line mode.

\Z

Only at the end of string or before a newline ending a string. In multi-line mode, that is, when embedded newline characters in a string are considered the termination of a line, \Z does not match the end of each line.

In the expression \s\Z, the \s matches the last space in the string L i n e \n, regardless of whether the search is in single-line or multi-line mode.

\z

Only at the end of a string.

In the expression \s\z, the \s matches the newline in the string L i n e \n, regardless of whether the search is in single-line or multi-line mode.

*?

The preceding pattern element 0 or more times (non-greedy). Note that this quantifier matches the empty string whenever possible.

The expression \w*?x\w is "non-greedy" and so matches abxc in the stringabxcxd. The expression \w*x\w is "greedy" and so matches abxcxd in the string abxcxd. The expression \w*?x\w also matches the string xa.

+?

The preceding pattern element 1 or more times (non-greedy).

The expression \w+?x\w is "non-greedy" and so matches abxc in the stringabxcxd. The expression \w+x\w is "greedy" and so matches abxcxd in the string abxcxd. The expression \w+?x\w does not match the string xa, but does match the string axa.

??

The preceding pattern element 0 or 1 time (non-greedy). Note that this quantifier matches the empty string whenever possible.

The expression a??aa is "non-greedy" and matches aa in the string aaaa. The expression a?aa is "greedy" and so matches aaa in the string aaaa.

{n}?

The preceding pattern element exactly n times (non-greedy). In this case{n}? is equivalent to {n}.

The expression (a|aa){2}? matches aa in the string aaaa.

{n,}?

The preceding pattern element at least n times (non-greedy).

The expression a{2,}? is "non-greedy" and matches aa in the string aaaaa. The expression a{2,} is "greedy" and so matches aaaaa.

{n,m}?

At least n but not more than m times (non-greedy). Note that {0,m}?matches the empty string whenever possible.

The expression a{2,4}? is "non-greedy" and matches aa in the string aaaaa. The expression a{2,4} is "greedy" and so matches aaaa.




egular expressions are a useful way to enforce integrity constraints.

 For example regexp_like()

CREATE TABLE contacts
(
  l_name    VARCHAR2(30), 
  p_number  VARCHAR2(30)
    CONSTRAINT p_number_format      
      CHECK ( REGEXP_LIKE ( p_number, '^\(\d{3}\) \d{3}-\d{4}$' ) )  -----约束检查p_number列 的value是否符合表达式要求:
);

表达式解说:

Regular Expression ElementMatches . . .

^

The beginning of the string.  字符开始符号

\(

A left parenthesis. The backward slash (\) is an escape character that indicates that the left parenthesis following it is a literal rather than a grouping expression. 转义字符\  将( 转为字符串内容

\d{3}

Exactly three digits.   3个数字

\)

A right parenthesis. The backward slash (\) is an escape character that indicates that the right parenthesis following it is a literal rather than a grouping expression.   转义字符\  将) 转为字符串内容

(space character)

A space character. 空格字符

\d{3}

Exactly three digits.3个数字

-

A hyphen.- 为连接字符

\d{4}

Exactly four digits. 4个数字

$

The end of the string.  字符串结束符号


所以P_number  的values 类似: (023) 245-5458

Example 4-2 shows a SQL script that attempts to insert seven phone numbers into the contacts table. Only the first two INSERT statements use a format that conforms to the p_number_format constraint; the remaining statements generate check constraint errors.

Example 4-2 insert_contacts.sql

-- first two statements use valid phone number format
INSERT INTO contacts (p_number)
  VALUES(  '(650) 555-5555'   );
INSERT INTO contacts (p_number)
  VALUES(  '(215) 555-3427'   );   插入正常
-- remaining statements generate check contraint errors
INSERT INTO contacts (p_number)
  VALUES(  '650 555-5555'     );
INSERT INTO contacts (p_number)
  VALUES(  '650 555 5555'     );
INSERT INTO contacts (p_number)
  VALUES(  '650-555-5555'     );
INSERT INTO contacts (p_number)
  VALUES(  '(650)555-5555'    );
INSERT INTO contacts (p_number)
  VALUES(  ' (650) 555-5555'  );----违反约束

 查询结果


For example regexp_replace()

SELECT names "names",
  REGEXP_REPLACE(names,
                 '^(\S+)\s(\S+)\s(\S+)$',
                 '\3, \1 \2')
  AS "names after regexp"
FROM famous_people;

Table 4-7 explains the elements of the regular expression.

Table 4-7 Explanation of the Regular Expression Elements in Example 4-4

Regular Expression ElementDescription

^

Matches the beginning of the string.

$

Matches the end of the string.

(\S+)

Matches one or more non-space characters. The parentheses are not escaped so they function as a grouping expression.

\s

Matches a whitespace character.

\1

Substitutes the first subexpression, that is, the first group of parentheses in the matching pattern.

\2

Substitutes the second subexpression, that is, the second group of parentheses in the matching pattern.

\3

Substitutes the third subexpression, that is, the third group of parentheses in the matching pattern.

,

Inserts a comma character.


Example 4-5 shows the result set of the query in Example 4-4. The regular expression matched only the first two rows.

Example 4-5 Result Set of Regular Expression Query





评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值