Python Monitor Water Falls(4)Crawler and Scrapy

This post walks through web scraping with Scrapy and Selenium in a Python environment: creating a virtual environment, installing Scrapy and Selenium step by step, configuring headless Chrome for data scraping, and finally deploying the Scrapy project and running the spider.



Create a virtual env
> python3 -m venv ./pythonenv

Use that ENV
> source ./pythonenv/bin/activate

> pip install scrapy
> pip install scrapyd

Check version
> scrapy --version
Scrapy 1.5.0 - project: scrapy_clawer

> scrapyd --version
twistd (the Twisted daemon) 17.9.0
Copyright (c) 2001-2016 Twisted Matrix Laboratories.
See LICENSE for details.

> pip install selenium

Install Phantomjs
http://phantomjs.org/download.html
Download the zip file and place in the working directory

Check the version
> phantomjs --version
2.1.1

Warning Message:
UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
As of Mar 21, 2018, PhantomJS has had no new releases.

Solution:
Run the headless chrome
https://intoli.com/blog/running-selenium-with-headless-chrome/
Install on MAC
> brew install chromedriver

Check if it is there
> chromedriver
Starting ChromeDriver 2.36.540469 (1881fd7f8641508feb5166b7cae561d87723cfa8) on port 9515
Only local connections are allowed.

Change the Python code as follows, and it works again:
from selenium import webdriver

options = webdriver.ChromeOptions()
#options.binary_location = '/usr/bin/google-chrome-unstable'
options.add_argument('headless')
options.add_argument('window-size=1200x600')
browser = webdriver.Chrome(chrome_options=options)

browser.get('https://hydromet.lcra.org/riverreport')

tables = browser.find_elements_by_css_selector('table.table-condensed')

tbody = tables[5].find_element_by_tag_name("tbody")
for row in tbody.find_elements_by_tag_name("tr"):
    cells = row.find_elements_by_tag_name("td")
    if cells[0].text == 'Marble Falls (Starcke)':
        print(cells[1].text)

browser.quit()
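The row-matching loop above can also be factored into a small pure function, so the table-scanning logic is testable without launching a browser. This is only an illustrative sketch: the function name and the list-of-lists row format are my own, not part of the original script.

```python
def find_gauge_reading(rows, site_name):
    """Return the second cell of the first row whose first cell matches
    site_name, or None when the site is not present.

    rows: an iterable of lists of cell texts (one list per table row).
    """
    for cells in rows:
        if cells and cells[0] == site_name:
            return cells[1]
    return None


# With Selenium, rows would be built roughly like:
#   rows = [[td.text for td in tr.find_elements_by_tag_name("td")]
#           for tr in tbody.find_elements_by_tag_name("tr")]
print(find_gauge_reading(
    [["Buchanan", "1020.35"], ["Marble Falls (Starcke)", "736.20"]],
    "Marble Falls (Starcke)"))  # → 736.20
```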

Run everything with Scrapy and check that it works well.

Another way to create ENV
> virtualenv env
> . env/bin/activate

Check the version of a Python module with pip show (a built-in pip subcommand)

> pip install selenium
> pip show selenium | grep Version
Version: 3.11.0

I have a file named requirements.txt
selenium==3.11.0

I can run
> pip install -r requirements.txt

Here is how to generate the requirements.txt
> pip freeze > requirements.txt
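pip freeze emits one `name==version` pin per line, which is exactly the format `pip install -r` consumes. A tiny parser sketch (purely illustrative, the function name is my own) shows the shape of each line:

```python
def parse_requirement(line):
    """Split a pinned requirement line like 'selenium==3.11.0'
    into a (name, version) tuple."""
    name, _, version = line.strip().partition("==")
    return name, version


print(parse_requirement("selenium==3.11.0"))  # → ('selenium', '3.11.0')
```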

How to run the spider locally
> scrapy crawl quotes

Prepare the Deployment ENV
> pip install scrapyd
> pip install scrapyd-client

Start the Server
> scrapyd

Deploy the spider
> scrapyd-deploy

List the Projects and spiders
curl http://localhost:6800/listprojects.json
curl "http://localhost:6800/listspiders.json?project=default"
curl http://localhost:6800/schedule.json -d project=default -d spider=quotes
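The same scrapyd endpoints can be called from Python with the standard library. A hedged sketch, assuming scrapyd is running on localhost:6800 with a project named default (the function name is my own):

```python
import json
from urllib import parse, request

SCRAPYD = "http://localhost:6800"


def schedule_spider(project, spider):
    """POST to scrapyd's schedule.json and return the parsed JSON reply."""
    data = parse.urlencode({"project": project, "spider": spider}).encode()
    with request.urlopen(f"{SCRAPYD}/schedule.json", data=data) as resp:
        return json.loads(resp.read())


# Equivalent to:
#   curl http://localhost:6800/schedule.json -d project=default -d spider=quotes
# schedule_spider("default", "quotes")
```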

Investigate the Requests
https://requestbin.fullcontact.com/
https://hookbin.com/

More details are in the project monitor-water

References:
mysql
http://sillycat.iteye.com/blog/2393787
haproxy
http://sillycat.iteye.com/blog/2066118
http://sillycat.iteye.com/blog/1055846
http://sillycat.iteye.com/blog/562645

https://intoli.com/blog/running-selenium-with-headless-chrome/