title: "Basic Applications of Python Regular Expressions"
date: 2016-03-29 21:32:22
tags: Python
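Installing and Configuring PyCharm
PyCharm is a Python integrated development environment (IDE) developed by JetBrains. It supports common features such as debugging, syntax highlighting, project management, code navigation, smart hints, auto-completion, unit testing, and version control.
PyCharm comes in a Professional Edition and a Community Edition. The former is paid; the latter is free and open source, and individual developers can download it from the JetBrains website.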
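I find PyCharm's default font very small and hard to read. The font and font size can be changed under File->Settings->Editor->Colors&Fonts->Font. Note that the default Scheme cannot be modified: use Save As... to create a new scheme first, and then change the font and size there. I recommend the monospaced font Consolas with Size set to 14, which works well for programming.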
You can also choose PyCharm's key binding style under Appearance&Behavior->Keymap. For example, I am used to Eclipse key bindings, so I set Keymaps to Eclipse.
PyCharm supports several version control tools, including CVS, Subversion, Git, and GitHub. They can be configured under Settings->Version Control; I won't go through them one by one here.
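Basic Regular Expression Symbols
The most commonly used symbols:
. : matches any character except the newline \n.
* : matches the preceding character 0 or more times
? : matches the preceding character 0 or 1 times
.* : greedy match (consumes as many characters as possible)
.*? : non-greedy match (consumes as few characters as possible)
() : the text matched inside the parentheses is returned as the result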
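As a small illustration of the note that `.` does not match a newline, the sketch below compares the default behaviour with the `re.S` (DOTALL) flag, which the crawler example later relies on. The sample string `text` is made up for this example.

```python
import re

# Hypothetical sample text with a newline between the two "xx" markers.
text = 'xxhello\nworldxx'

# By default "." stops at "\n", so the pattern cannot span both lines.
default_result = re.findall('xx(.*?)xx', text)

# With re.S (DOTALL), "." also matches "\n".
dotall_result = re.findall('xx(.*?)xx', text, re.S)

print(default_result)
print(dotall_result)
```

Without the flag nothing is found; with `re.S` the captured group spans the line break.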
Regular Expression Functions in Python
Python's standard regular expression module is re. Its core functions and methods are as follows:
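#Function/Method  Description
#
#####re module function only
#compile(pattern, flags=0)  Compiles the regular expression pattern, with optional flags, and returns a regex object
#
#####re module functions and regex object methods
#match(pattern, string, flags=0)  Tries to match pattern at the start of string, with optional flags; returns a match object on success, otherwise None
#search(pattern, string, flags=0)  Looks for the first occurrence of pattern anywhere in string, with optional flags; returns a match object on success, otherwise None
#findall(pattern, string[, flags])  Finds all non-overlapping occurrences of pattern in string; returns a list of matched strings (tuples of strings if the pattern has more than one group)
#finditer(pattern, string[, flags])  Same as findall(), but returns an iterator instead of a list; the iterator yields a match object for each match
#split(pattern, string, maxsplit=0)  Splits string into a list at each occurrence of pattern, performing at most maxsplit splits (by default it splits at every match)
#sub(pattern, repl, string, count=0)  Replaces the occurrences of pattern in string with repl; if count is not given, every occurrence is replaced
#
#####Match object methods
#group(num=0)  Returns the entire match (or the subgroup numbered num)
#groups()  Returns a tuple containing all matched subgroups (an empty tuple if the pattern has no groups)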
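The functions listed above can be tried together in one short sketch. The sample string `text` and the phone-number-like pattern are assumptions chosen only for illustration; the limiting parameters of `re.split` and `re.sub` are spelled `maxsplit` and `count` in the `re` module.

```python
import re

text = 'tel: 010-1234, fax: 010-5678'  # hypothetical sample string

# compile() returns a reusable pattern object.
pattern = re.compile(r'(\d{3})-(\d{4})')

# match() only succeeds at the start of the string, so this is None.
m_start = pattern.match(text)

# search() scans forward and returns the first match object.
m_first = pattern.search(text)
whole = m_first.group()      # the entire match
area = m_first.group(1)      # the first subgroup
parts = m_first.groups()     # all subgroups as a tuple

# findall() returns strings; with two groups each item is a tuple.
found = pattern.findall(text)

# finditer() yields one match object per match.
spans = [m.span() for m in pattern.finditer(text)]

# maxsplit and count limit how many splits / replacements are made.
pieces = re.split(r',\s*', text, maxsplit=1)
masked = pattern.sub('***', text, count=1)

print(m_start, whole, area, parts)
print(found, spans)
print(pieces, masked)
```

Compiling once and calling methods on the pattern object is mainly a convenience when the same expression is reused several times.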
Basic Examples
#encoding=utf-8
# Declare UTF-8 source encoding so that Chinese text can be typed and displayed
import re
# Import re, the regular expression module
a = 'xz123awxc1332'
b = re.findall('x.', a)
print(b)
#output: ['xz', 'xc']
c = re.findall('x*', a)
print(c)
#output: ['x', '', '', '', '', '', '', 'x', '', '', '', '', '', '']
d = re.findall('x?', a)
print(d)
#output: ['x', '', '', '', '', '', '', 'x', '', '', '', '', '', '']
e = re.findall('x.*', a)
print(e)
#output: ['xz123awxc1332']
code = 'aqweqwxxIxxlaqwewqujxxLovexxasdqwxxYouxx'
g = re.findall('xx.*xx', code)
print(g)
#output: ['xxIxxlaqwewqujxxLovexxasdqwxxYouxx']
h = re.findall('xx.*?xx', code)
print(h)
#output: ['xxIxx', 'xxLovexx', 'xxYouxx']
i = re.findall('xx(.*?)xx', code)
print(i)
#output:['I', 'Love', 'You']
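The empty strings in the `x*` result can look surprising. A rough way to see where they come from is to list the position of every match with `re.finditer`: because `x*` is allowed to match zero characters, it succeeds with an empty match at every position that does not start with `x`, including the end of the string. This is only an illustrative sketch that reuses the string `a` from the example above.

```python
import re

a = 'xz123awxc1332'

# Collect (start, end, matched text) for every match of 'x*'.
matches = [(m.start(), m.end(), m.group()) for m in re.finditer('x*', a)]
for start, end, matched in matches:
    print(start, end, repr(matched))

# '+' requires at least one character, so no empty matches are produced.
non_empty = re.findall('x+', a)
print(non_empty)
```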
A Simple Crawler
As an example, we scrape the course image links from the jikexueyuan (极客学院) home page:
import re
from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen
url = 'http://www.jikexueyuan.com/'
content = urlopen(url).read().decode("utf-8")
pic_urls = re.findall('<div class="lessonimg-box.*?img src="(.*?)" class="lessonimg"', content, re.S)
for pic_url in pic_urls:
    print(pic_url)
#output:
http://a1.jikexueyuan.com/home/201603/24/d460/56f35cb427ae4.jpg
http://a1.jikexueyuan.com/home/201603/24/c8c9/56f35c7525600.jpg
http://a1.jikexueyuan.com/home/201603/24/34b5/56f35ce68f302.jpg
http://a1.jikexueyuan.com/home/201603/21/aee5/56ef575e4745e.jpg
http://a1.jikexueyuan.com/home/201603/22/5728/56f0a6e737976.png
http://a1.jikexueyuan.com/home/201603/23/121d/56f1f77a76468.png
http://a1.jikexueyuan.com/home/201603/17/d744/56ea1e4f7ecaf.jpg
http://a1.jikexueyuan.com/home/201603/15/eb10/56e76c8c10d5a.jpg
...
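Since the live page may change or be unreachable, the same pattern can be tried offline. The sketch below applies it to a small hand-written HTML fragment; the markup and the example.com URLs are assumptions made up for this illustration and are not taken from the real page.

```python
import re

# Hypothetical HTML fragment standing in for the downloaded page.
content = '''
<div class="lessonimg-box">
  <a href="/course/1.html">
    <img src="http://example.com/img/one.jpg" class="lessonimg" title="One">
  </a>
</div>
<div class="lessonimg-box">
  <a href="/course/2.html">
    <img src="http://example.com/img/two.png" class="lessonimg" title="Two">
  </a>
</div>
'''

pattern = re.compile(
    r'<div class="lessonimg-box.*?img src="(.*?)" class="lessonimg"',
    re.S,  # let "." cross the line breaks between <div> and <img>
)
pic_urls = pattern.findall(content)
for pic_url in pic_urls:
    print(pic_url)
```

The non-greedy `.*?` keeps each match inside one `lessonimg-box` block, and the single group means `findall` returns only the `src` values.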
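Other
Regular expressions have many more symbols and matching rules than the ones covered here. To learn about them, open the re module and read its documentation: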
The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
    "$"      Matches the end of the string or just before the newline at
             the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
    "\\"     Either escapes special characters or signals a special sequence.
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
    "|"      A|B, creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
    (?iLmsux) Set the I, L, M, S, U, or X flag for the RE (see below).
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name)     Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.
    (?<=...) Matches if preceded by ... (must be fixed length).
    (?<!...) Matches if not preceded by ... (must be fixed length).
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                       the (optional) no pattern otherwise.

The special sequences consist of "\\" and a character from the list
below. If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9].
    \D       Matches any non-digit character; equivalent to the set [^0-9].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v].
    \S       Matches any non-whitespace character; equiv. to [^ \t\n\r\f\v].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.
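
A few of the entries listed above can be tried out as follows. This is only a sketch: the sample string `log` and its `user=... id=...` layout are made up for illustration.

```python
import re

log = 'user=alice id=42; user=bob id=7'  # hypothetical sample string

# \d with \b word boundaries: whole numbers only.
numbers = re.findall(r'\b\d+\b', log)

# (?P<name>...) makes subgroups accessible by name.
m = re.search(r'user=(?P<user>\w+) id=(?P<id>\d+)', log)
first_user = m.group('user')
first_id = m.group('id')

# (?<=...) lookbehind: words preceded by "user=", without consuming it.
users = re.findall(r'(?<=user=)\w+', log)

# {m,n}: ids that are exactly two digits long.
two_digit = re.findall(r'\b\d{2}\b', log)

print(numbers, first_user, first_id, users, two_digit)
```

Writing the patterns as raw strings (`r'...'`) avoids having to double every backslash in the special sequences.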