Python 3 Web Scraping Basics: Regular Expressions

License: CC 4.0 BY-SA

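This document offers a quick-start tutorial on regular expressions for learners with no web-scraping background. It walks through examples of the basic symbols (dot, asterisk, question mark, and parentheses) and compares the findall and search functions.

This article gives a brief introduction to regular expressions; readers with no scraping background at all can get started in about 5 minutes.

First, import the re library that the scraping code needs.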

import re

Using the dot (.):

# The dot is a placeholder: it matches any single character except a newline
a='xz123'
b=re.findall('x.',a)
c=re.findall('x..',a)
print(b)
print(c)

['xz']
['xz1']
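
To make the "placeholder" idea concrete, the short sketch below checks which two-character strings fit the pattern x. as a whole. The sample strings are made up for illustration.

```python
import re

# Each '.' consumes exactly one character, whatever it is (letter, digit, space).
samples = ['xa', 'x1', 'x ', 'x\n']
matched = [s for s in samples if re.fullmatch('x.', s)]
print(matched)  # 'x\n' is left out: by default '.' does not match a newline
```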

Using the asterisk (*):

# * matches the preceding character zero or more times
a='xyxy123'
b=re.findall('x*',a)
print(b)

['x', '', 'x', '', '', '', '', '']
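
The empty strings in that result come from the "zero times" case: at every position where no x starts, x* still succeeds with an empty match. As a small sketch, the comparison below uses +, a quantifier this article does not otherwise cover, which requires at least one occurrence.

```python
import re

text = 'xyxy123'
# 'x*' may match zero characters, so findall reports an empty match
# wherever no 'x' starts, including at the very end of the string.
star = re.findall('x*', text)
# 'x+' requires at least one 'x', so the empty matches disappear.
plus = re.findall('x+', text)
print(star)
print(plus)  # ['x', 'x']
```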

Using the question mark (?):

# ? matches the preceding character zero or one time
a='xz123'
b=re.findall('x?',a)
print(b)

['x', '', '', '', '', '']
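
In practice ? is more useful inside a longer pattern, where it marks one character as optional. The sample sentence below is invented for illustration.

```python
import re

text = 'color colour colr'
# 'u?' makes the 'u' optional, so both spellings match.
# 'colr' is skipped because the second 'o' is still required.
words = re.findall('colou?r', text)
print(words)  # ['color', 'colour']
```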

Using .* and .*? (greedy vs. non-greedy matching):

secret_code='hadkfalifexxIxxhfkfkjhjkh134xxlovexx4543367dsaxxyouxx8gffj'

b=re.findall('xx.*xx',secret_code)
print('b',b)
c=re.findall('xx.*?xx',secret_code)
print('c',c)

b ['xxIxxhfkfkjhjkh134xxlovexx4543367dsaxxyouxx']
c ['xxIxx', 'xxlovexx', 'xxyouxx']
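
The same difference shows up when scraping markup. The sketch below assumes a made-up HTML fragment: the greedy .* runs to the last closing tag, while the non-greedy .*? stops at the first one it can.

```python
import re

html = '<b>first</b> and <b>second</b>'  # hypothetical sample markup

# Greedy: '.*' grabs as much as possible, so one match spans both tags.
greedy = re.findall('<b>.*</b>', html)
# Non-greedy: '.*?' grabs as little as possible, giving one match per tag.
lazy = re.findall('<b>.*?</b>', html)

print(greedy)  # ['<b>first</b> and <b>second</b>']
print(lazy)    # ['<b>first</b>', '<b>second</b>']
```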

Using parentheses (findall returns only the part inside them):

b=re.findall('xx(.*?)xx',secret_code)
print(b)
for each in b:
    print(each)

['I', 'love', 'you']
I
love
you
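
A typical scraping use of this is pulling attribute values out of links. This is only a sketch on an invented HTML snippet; the paths in it are placeholders.

```python
import re

html = '<a href="/page/1">One</a><a href="/page/2">Two</a>'  # hypothetical markup

# The text around the parentheses must match, but only the part
# inside the parentheses is returned by findall.
hrefs = re.findall('href="(.*?)"', html)
print(hrefs)  # ['/page/1', '/page/2']
```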

Matching across lines (re.S lets the dot match newlines too):

s='''sddfdxxhello
xxhjgjxxworldxxasd'''

b=re.findall('xx(.*?)xx',s)
print('b',b)
c=re.findall('xx(.*?)xx',s,re.S)
print('c',c)

b ['hjgj']
c ['hello\n', 'world']
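
re.S is the short name for re.DOTALL, and the same behavior can be switched on inside the pattern with (?s). The sketch below compares the three spellings on a made-up two-line string.

```python
import re

s = 'xxline one\nline twoxx'  # sample text containing a newline

without_flag = re.findall('xx(.*?)xx', s)          # '.' stops at the newline
with_flag = re.findall('xx(.*?)xx', s, re.DOTALL)  # same flag as re.S
inline = re.findall('(?s)xx(.*?)xx', s)            # flag written inside the pattern

print(without_flag)  # []
print(with_flag)     # ['line one\nline two']
print(inline)        # ['line one\nline two']
```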

Comparing findall and search

s='sdaxxIxx123xxlovexxjhk'
# The number passed to group() must not exceed the number of parenthesized groups in the pattern
b=re.search('xx(.*?)xx123xx(.*?)xx',s).group(2)
print('b',b)
c=re.findall('xx(.*?)xx123xx(.*?)xx',s)
print('c',c)
print(type(c))
print(c[0][1])

b love
c [('I', 'love')]
<class 'list'>
love
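
One more difference matters on real pages: when nothing matches, findall returns an empty list, but search returns None, so calling .group() on it raises AttributeError. A minimal sketch of a guard follows; second_group is a hypothetical helper name, not part of the re library.

```python
import re

def second_group(pattern, text):
    """Return group 2 of the first match, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(2) if m else None

pattern = 'xx(.*?)xx123xx(.*?)xx'
print(second_group(pattern, 'sdaxxIxx123xxlovexxjhk'))  # love
print(second_group(pattern, 'no markers here'))         # None
print(re.findall(pattern, 'no markers here'))           # []
```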

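Using sub (often used for automatic pagination):

s='123ghjjsnak123'
# sub replaces the whole match, including the 123 markers on both sides
b=re.sub('123(.*?)123','789',s)
print(b)

789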
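
As a sketch of the pagination use mentioned in the heading, the example below rewrites the page number in a URL. The URL and its page parameter are assumptions made up for illustration, and \d+ (one or more digits) is explained in the next section.

```python
import re

base_url = 'https://example.com/list?page=1'  # hypothetical listing URL

# Replace the page number with 1, 2, 3 to build the URLs of the first three pages.
urls = [re.sub(r'page=\d+', f'page={n}', base_url) for n in range(1, 4)]
for url in urls:
    print(url)
```

Each generated URL could then be passed to whatever download code the scraper uses.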

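A handy tool for matching runs of digits:

a='ashgaj47865432578jhbkj657576hkj'
# \d+ matches one or more digits; the r prefix keeps Python from treating \d as a string escape
b=re.findall(r'(\d+)',a)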
print(b)

['47865432578', '657576']
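
findall always returns strings, so the digits usually need converting before any arithmetic. A small sketch reusing the same sample string:

```python
import re

a = 'ashgaj47865432578jhbkj657576hkj'
# Convert each matched run of digits from str to int.
numbers = [int(n) for n in re.findall(r'\d+', a)]
print(numbers)       # [47865432578, 657576]
print(max(numbers))  # 47865432578
```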

That covers all the basic regular expressions needed for web scraping.