Lately I keep needing to scrape things, so I put together a summary of Python web scraping in one place; doing it yourself really is the best way to learn.
(1) Basic content scraping
(2) Saving scraped images/videos, files, and web pages
(3) Simple simulated login
(4) Handling CAPTCHA logins
(5) Scraping JavaScript-rendered sites
(6) Whole-web crawling
(7) Crawling every page within a single site
(8) Multithreading
(9) The Scrapy crawler framework
1. Basic content scraping
# coding=utf-8
import urllib2

url = 'https://www.dataanswer.top'
headers = {
    'Host':'www.dataanswer.top',
    'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:31.0) Gecko/20100101 Firefox/31.0',
    #'Accept':'application/json, text/javascript, */*; q=0.01',
    #'Accept-Language':'zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3',
    #'Accept-Encoding':'gzip,deflate',
    #'Referer':'https://www.dataanswer.top'
}
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
page = response.read()
print page
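The same request in Python 3, where urllib2 was merged into urllib.request but the flow (build a Request with headers, open it, read the body) is unchanged; a sketch, with the network call left as a comment so it runs offline:

```python
# Python 3 equivalent of the Python 2 example above.
from urllib.request import Request, urlopen

url = 'https://www.dataanswer.top'
headers = {
    'Host': 'www.dataanswer.top',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:31.0) Gecko/20100101 Firefox/31.0',
}
request = Request(url, headers=headers)
# page = urlopen(request).read() would fetch the body, as in the Python 2 code
print(request.full_url)
```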
2. Saving scraped images/videos, files, and web pages
Once the address of the image/video, file, or web page has been scraped, you can either write the fetched content to a file yourself, or download it with the urlretrieve() method from the urllib module:
# coding=utf-8
import urllib2
import os

def getPage(url):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    return response.read()

url = 'https://www.dataanswer.top/'
result = getPage(url)
file_name = 'test.doc'
file_path = 'doc'
if not os.path.exists(file_path):
    os.makedirs(file_path)
local = os.path.join(file_path, file_name)
f = open(local, 'wb')  # binary mode, so images/videos are not corrupted
f.write(result)
f.close()
# coding=utf-8
import urllib
import os

url = 'https://www.dataanswer.top/'  # replace with the address of the image/file/video/page
file_name = 'test.doc'
file_path = 'doc'
if not os.path.exists(file_path):
    os.makedirs(file_path)
local = os.path.join(file_path, file_name)
urllib.urlretrieve(url, local)  # download url and save it to local
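The same download step in Python 3, where urlretrieve() lives in urllib.request and still takes (url, filename). In this sketch a local file:// URL and a temporary directory stand in for the real media address and the 'doc' folder, so it runs without a network:

```python
# Python 3 version of the urlretrieve() step.
import os
import tempfile
from urllib.request import urlretrieve, pathname2url

file_path = tempfile.mkdtemp()            # stands in for the 'doc' directory
src = os.path.join(file_path, 'src.bin')
with open(src, 'wb') as f:                # a local stand-in for the remote file
    f.write(b'sample bytes')

local = os.path.join(file_path, 'test.doc')
url = 'file://' + pathname2url(src)       # replace with the real image/file/video URL
urlretrieve(url, local)                   # download url and save it as local

with open(local, 'rb') as f:
    data = f.read()
print(len(data))
```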
3. Simple simulated login
import urllib
import urllib2
import cookielib

filename = 'cookie.txt'
# Declare a MozillaCookieJar instance to hold the cookies, then write them to a file
cookie = cookielib.MozillaCookieJar(filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
postdata = urllib.urlencode({
    'name':'春天里',
    'pwd':'1222222'
})
# The login URL
loginUrl = 'https://www.dataanswer.top/Log
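The snippet above is cut off; for reference, a Python 3 sketch of the same cookie-saving login flow (cookielib is now http.cookiejar, and urllib2's opener machinery is in urllib.request; the form fields here are placeholder values, not real credentials):

```python
# Python 3 sketch of the cookie-based login flow.
from http.cookiejar import MozillaCookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener

filename = 'cookie.txt'
cookie = MozillaCookieJar(filename)                  # holds cookies; can save them to filename
opener = build_opener(HTTPCookieProcessor(cookie))   # attaches/collects cookies on requests

postdata = urlencode({'name': 'user', 'pwd': 'password'}).encode('utf-8')
# opener.open(login_url, postdata) would POST the form; afterwards
# cookie.save(ignore_discard=True, ignore_expires=True) persists the cookies.
print(postdata)
```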