1. Install the Python 2.7.5 environment
On Windows, just download the Python 2.7.5 installer and click through the setup wizard.
On Linux, follow the installation guide at https://www.jianshu.com/p/c8d520553893
2.安装pip
win:在安装pip前,请确认你win系统中已经安装好了python,和easy_install工具,如果系统安装成功,easy_install在目录C:\Python27\Scripts 下面,进入命令行,然后把目录切换到python的安装目录下的Script文件夹下,运行 easy_inatall pip,pip安装成功后,在cmd下执行pip list可以成功就可以
linux:如果有需要升级的用:sudo pip install --upgrade pip
没有:
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py
3. Install Scrapy
sudo pip install scrapy
4. Install Selenium
sudo pip install selenium
5. Install pyOpenSSL
sudo pip install pyopenssl
6. On Windows you also need the win32api module, which is provided by pywin32
In cmd:
pip install pywin32
If anything else fails with an import error, look up whichever module is reported missing and install it the same way.
7. Install Firefox. The crawler drives a real browser, so on Linux install one:
yum install firefox
firefox --version
8. Download the Firefox driver (geckodriver)
How to install geckodriver:
Download it from the geckodriver releases page on GitHub.
I downloaded geckodriver-v0.20.0-linux64.tar.gz.
Installation is similar to chromedriver:
In a terminal, change into the download directory and run the following to unpack the driver and move it into /usr/bin so it is on the PATH (a quick smoke test follows the commands):
tar -xvzf geckodriver*
chmod +x geckodriver
sudo mv geckodriver /usr/bin/
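To confirm that Firefox, geckodriver, and Selenium can actually talk to each other, here is a minimal headless smoke test; the URL is only an example and any page will do:

from selenium.webdriver import Firefox, FirefoxOptions

opts = FirefoxOptions()
opts.add_argument("--headless")
driver = Firefox(firefox_options=opts)  # picks up geckodriver from /usr/bin via PATH
driver.get("https://www.python.org")
print(driver.title)
driver.quit()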
Now on to the crawler itself:
I use the PyCharm IDE.
1. middlewares.py
from random import choice

from scrapy.http import HtmlResponse
from selenium import webdriver as wb
from selenium.webdriver import FirefoxOptions, FirefoxProfile
ua_list = [
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/48.0.2564.82 Chrome/48.0.2564.82 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36",
"Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36"
]
# Apply a random user agent and skip image loading through Firefox preferences.
profile = FirefoxProfile()
profile.set_preference("general.useragent.override", choice(ua_list))
profile.set_preference("permissions.default.image", 2)  # 2 = never load images
class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        # Headless mode keeps the browser window from popping up. While you
        # are still debugging, leave this argument out: a visible window is
        # the easiest way to confirm the page really loads.
        opts = FirefoxOptions()
        opts.add_argument("--headless")
        driver = wb.Firefox(firefox_profile=profile, firefox_options=opts)
        try:
            driver.set_page_load_timeout(15)  # give slow pages up to 15 seconds
            driver.get(request.url)
            driver.implicitly_wait(3)
            page = driver.page_source
        finally:
            driver.quit()  # quit() also shuts down the geckodriver process
        return HtmlResponse(request.url, body=page, encoding='utf-8', request=request)
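Scrapy will only route requests through this middleware once it is registered in the project's settings.py. A minimal sketch, assuming the project package is called spiderdemo (matching the run path further down); adjust the dotted path and the priority to your own project:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'spiderdemo.middlewares.SeleniumMiddleware': 543,  # assumed package name
}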
2. items.py
import scrapy


class TopWenItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
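The spider in step 4 also fills a MovieItem with rank, title, link, rate and quote fields, which is not shown above; a minimal sketch of that class, with the field names taken straight from the spider code:

class MovieItem(scrapy.Item):
    rank = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    rate = scrapy.Field()
    quote = scrapy.Field()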
3. pipelines.py
import json
import codecs


# Store each item as one JSON object per line.
class JsonWithEncodingCnblogsPipeline(object):
    def __init__(self):
        self.file = codecs.open('xx.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):  # called automatically when the spider finishes
        self.file.close()
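Like the middleware, the pipeline only runs once it is enabled in settings.py. Again a minimal sketch assuming the spiderdemo package name:

# settings.py
ITEM_PIPELINES = {
    'spiderdemo.pipelines.JsonWithEncodingCnblogsPipeline': 300,  # assumed package name
}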
4. The spider, e.g. spiders/gsspider.py (don't name this file scrapy.py, or it will shadow the scrapy package)
import scrapy

from spiderdemo.items import MovieItem  # package name taken from the project path below


class GsSpider(scrapy.Spider):
    name = "gsspider"
    start_urls = [
        'http://movie.douban.com/top250/'
    ]
    # start_urls = []
    # file = open('C:/test-scrapy/tutorial/tutorial/url.txt')
    # for word in file:
    #     word = word.strip()
    #     url = 'http://www.gsdata.cn/query/wx?q=' + word
    #
    #     start_urls.append(url)

    def parse(self, response):
        for info in response.xpath('//div[@class="item"]'):
            item = MovieItem()
            item['rank'] = info.xpath('div[@class="pic"]/em/text()').extract()
            item['title'] = info.xpath('div[@class="pic"]/a/img/@alt').extract()
            item['link'] = info.xpath('div[@class="pic"]/a/@href').extract()
            item['rate'] = info.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span/text()').extract()
            item['quote'] = info.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span/text()').extract()
            yield item
        # Follow the "next page" link and parse it with the same callback.
        next_page = response.xpath('//span[@class="next"]/a/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)
Run scrapy crawl gsspider from the project directory, e.g. /usr/home/spiderdemo/spiderdemo/. The last argument is not a file name; it is the name you set on the spider.
Troubleshooting:
1. Selenium error: no display specified
This also happens because the Linux machine has no graphical display; fix it the same way as problem 3 below.
2. from OpenSSL._util import lib as pyOpenSSLlib
ImportError: No module named _util
There are two likely causes: either your pyOpenSSL version is too old, or the installation is broken and missing files. In the latter case, delete the OpenSSL directory inside your Python site-packages and reinstall pyOpenSSL.
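A quick way to check which case applies, using the same import that fails in the traceback above:

import OpenSSL
print(OpenSSL.__version__)  # if this is very old, upgrade pyOpenSSL

from OpenSSL._util import lib as pyOpenSSLlib  # the import from the traceback
print(pyOpenSSLlib)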
3. WebDriverException: Process unexpectedly closed with status: 1
This is also caused by Linux having no graphical display. There are two ways around it:
First: install Xvfb
yum install -y Xvfb
Start Xvfb:
Xvfb -ac :7 -screen 0 1280x1024x8 &
export DISPLAY=:7   (the display number must match the :7 passed to Xvfb above)
Second: run Firefox headless:
from selenium import webdriver
from selenium.webdriver import FirefoxOptions

opts = FirefoxOptions()
opts.add_argument("--headless")
browser = webdriver.Firefox(firefox_options=opts)