First of all, Python ships with the urllib and urllib2 modules, which are basically sufficient for ordinary page scraping; beyond the standard library, requests is also very useful.
Requests:
import requests

url = 'http://example.com'  # example target URL, reused by the snippets below
response = requests.get(url)
content = response.content  # reuse the response instead of fetching the page twice
print "response headers:", response.headers
print "content:", content
Urllib2:
import urllib2

response = urllib2.urlopen(url)
content = response.read()  # read the body from the same response object
print "response headers:", response.headers
print "content:", content
Httplib2:
import httplib2

http = httplib2.Http()
response_headers, content = http.request(url, 'GET')  # returns a (response, content) tuple
print "response headers:", response_headers
print "content:", content
For a URL with query fields, a GET request generally appends the request data to the URL, with ? separating the URL from the transmitted data and & joining multiple parameters. A quick check of the resulting URL is shown after the two snippets below.
data = {'data1':'XXXXX', 'data2':'XXXXX'}
Requests: data is a dict (or a JSON-style mapping)
import requests

response = requests.get(url=url, params=data)  # requests url-encodes the dict for you
Urllib2: data must be a string
import urllib, urllib2

data = urllib.urlencode(data)  # encode the dict into 'data1=XXXXX&data2=XXXXX'
full_url = url + '?' + data    # build the query string by hand
response = urllib2.urlopen(full_url)
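To confirm what the GET request actually sent, you can print the URL that requests constructed. A minimal sketch, where 'http://example.com' is just a placeholder for the real target:
import requests

url = 'http://example.com'  # placeholder, as above
data = {'data1': 'XXXXX', 'data2': 'XXXXX'}
response = requests.get(url, params=data)
print response.url  # e.g. http://example.com/?data1=XXXXX&data2=XXXXX (parameter order may vary)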
Re library .group():
import re

a = "123abc456"
print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(0)  # 123abc456, the whole match
print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(1)  # 123, first group
print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(2)  # abc, second group
print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(3)  # 456, third group
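Note that re.search returns None when the pattern does not match, so in real scraping code it is safer to check the match object before calling .group(). A minimal sketch:
import re

a = "123abc456"
match = re.search("([0-9]*)([a-z]*)([0-9]*)", a)
if match:
    print match.group(2)  # abc
else:
    print "no match"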