Google Python Class Study Notes (2): Regular Expressions

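These notes cover how to use regular expressions: ordinary characters, meta-characters, repetition, and techniques for matching email addresses precisely. Worked examples show how to build more complex patterns, with a focus on parsing email addresses.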


1、Regular Expression
a, X, 9
   -- ordinary characters just match themselves exactly. 
   The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
. (a period) 
   -- matches any single character except newline '\n'
\w 
   -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. 
   Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. 
\W
   -- (upper case W) matches any non-word character.
\b 
   -- boundary between word and non-word
\s
   -- (lowercase s) matches a single whitespace character: space, newline, return, tab, form feed [ \n\r\t\f].
\S
   -- (upper case S) matches any non-whitespace character.
\t, \n, \r 
   -- tab, newline, return
\d 
   -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)

^ = start, $ = end
   -- match the start or end of the string
\
   -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash.
   If you are unsure whether a character has special meaning, such as '@',
   you can put a backslash in front of it, \@, to make sure it is treated just as a character.
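
The basic patterns above can be tried out with a short script. This is only a sketch: the helper name first_match and the sample strings are made up for illustration and are not part of the re module.

```python
import re

def first_match(pattern, text):
    """Hypothetical helper: return the first matching substring, or None."""
    match = re.search(pattern, text)
    return match.group() if match else None

print(first_match(r'iii', 'piiig'))        # ordinary characters match themselves: iii
print(first_match(r'..g', 'piiig'))        # . matches any single character: iig
print(first_match(r'\d\d\d', 'p123g'))     # \d matches a digit: 123
print(first_match(r'\w\w\w', '@@abcd!!'))  # \w matches a word character: abc
print(first_match(r'^b\w', 'foobar'))      # ^ anchors at the start of the string: None
print(first_match(r'b\w', 'foobar'))       # without the anchor: ba
print(first_match(r'\.', 'a.b'))           # \. matches a literal period: .
```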


2、Repetition


Things get more interesting when you use +, * and ? to specify repetition in the pattern.


+ -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
* -- 0 or more occurrences of the pattern to its left
? -- match 0 or 1 occurrences of the pattern to its left
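
A small sketch of the three repetition operators in action. The sample strings and variable names are chosen here purely for illustration.

```python
import re

# + needs at least one 'i' after the 'p'
plus = re.search(r'pi+', 'piiig').group()

# * allows zero or more whitespace characters between the digits
star_spaced = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx').group()
star_tight = re.search(r'\d\s*\d\s*\d', 'xx123xx').group()

# ? makes the 'u' optional, so both spellings match
optional = [re.search(r'colou?r', s).group() for s in ('my color', 'my colour')]

print(plus)         # piii
print(star_spaced)  # 1 2   3
print(star_tight)   # 123
print(optional)     # ['color', 'colour']
```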


3、Demo
import re

str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print(match.group())  ## 'alice-b@google.com'


# Parentheses add groups, so the username and host can be extracted separately
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)


# Here re.findall() returns a list of all the found email strings
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str)   ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)


# Open file (the with block closes it automatically)
with open('text.txt', 'r') as f:
    # Feed the file text into findall(); it returns a list of all the found strings
    strings = re.findall(r'[\w\.-]+@[\w\.-]+', f.read())
print(strings)


(Obscure optional feature: Sometimes the pattern contains paren ( ) groupings that you do not want to extract. In that case, write the parens with ?: at the start, e.g. (?: ), and that left paren will not count as a group result.)
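
To make the effect of (?: ) concrete, the sketch below runs the email pattern twice on the sample string used earlier, once with two capturing groups and once with the username group made non-capturing. The variable names are illustrative.

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'

# Two capturing groups: both the username and the host are group results
both = re.search(r'([\w.-]+)@([\w.-]+)', text).groups()

# (?: ) still groups the username pattern, but it no longer counts as a group,
# so the host becomes group 1
host_only = re.search(r'(?:[\w.-]+)@([\w.-]+)', text).groups()

print(both)       # ('alice-b', 'google.com')
print(host_only)  # ('google.com',)
```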


# re.sub(pat, replacement, str) -- returns new string with all replacements,
# \1 is group(1), \2 group(2) in the replacement
strs = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
print(strs)
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', strs))
# purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
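
The example above only uses \1. As a complementary sketch, the replacement can refer to \2 instead, keeping the host and replacing the username. The '***' placeholder is an arbitrary choice for illustration.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# \2 in the replacement is the text matched by group 2 (the host)
masked = re.sub(r'([\w.-]+)@([\w.-]+)', r'***@\2', text)

print(masked)
# purple ***@google.com, blah monkey ***@abc.com blah dishwasher
```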



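### Python Regular Expression Study Notes

#### 1. Basic Concepts

A regular expression is a powerful tool for matching character patterns in strings. By defining a pattern in a specific syntax, you can perform complex string searches, replacements, and similar operations.

#### 2. Compiling a Regular Expression Object

When the same pattern is used many times, it can be compiled into a regular expression object first for efficiency. This is done with the `re.compile()` function[^2]:

```python
import re

pattern = re.compile('www')
matches = pattern.findall('www.baidu.www')
# Result: ['www', 'www']
```

#### 3. Using Special Characters

Some characters have special meanings; for example, the period `.` stands for any single character except a newline. To match such a character literally, escape it with a backslash `\`[^3]:

```python
result = re.split(r'\.', "www.www.baidu.com")
# Result: ['www', 'www', 'baidu', 'com']
```

#### 4. Finding All Matches

The `findall()` function finds every part of the input string that matches the given pattern and returns the results as a list:

```python
text = 'www.baidu.www'
regex = re.compile('www')
found_items = regex.findall(text)
# Returns ['www', 'www']
```

#### 5. Practical Example - Web Scraping

The following simple example shows how to use regular expressions to extract information from an HTML document. The goal is to fetch the chapter titles from the page at a given URL and print them[^4]:

```python
import re
import requests

url = "https://example.com"
response = requests.get(url)
html_content = response.content.decode("utf-8")

# Assumes the page contains this chapter list; [0] raises IndexError otherwise
chapter_list_html = re.findall(
    r'<ul class="chapter-list clearfix">.*?</ul>',
    html_content,
    flags=re.DOTALL)[0]

titles = re.findall(
    r'<a.*?title="(.*?)".*?>',
    chapter_list_html)

for title in titles:
    print(title)

print(f"Scraped {len(titles)} records in total")
```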