Python Monitor Water Falls(4)Crawler and Scrapy

This post walks through web scraping with Scrapy and Selenium in a Python environment: creating a virtual environment, installing Scrapy and Selenium step by step, configuring headless Chrome for data scraping, and finally deploying the Scrapy project and running the spider.



Create a virtual env
> python3 -m venv ./pythonenv

Use that ENV
> source ./pythonenv/bin/activate

> pip install scrapy
> pip install scrapyd

Check version
> scrapy --version
Scrapy 1.5.0 - project: scrapy_clawer

> scrapyd --version
twistd (the Twisted daemon) 17.9.0
Copyright (c) 2001-2016 Twisted Matrix Laboratories.
See LICENSE for details.

> pip install selenium

Install Phantomjs
http://phantomjs.org/download.html
Download the zip file and place in the working directory

Check the version
> phantomjs --version
2.1.1

Warning Message:
UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
As of Mar 21, 2018, PhantomJS has had no new releases.

Solution:
Run the headless chrome
https://intoli.com/blog/running-selenium-with-headless-chrome/
Install on MAC
> brew install chromedriver

Check if it is there
> chromedriver
Starting ChromeDriver 2.36.540469 (1881fd7f8641508feb5166b7cae561d87723cfa8) on port 9515
Only local connections are allowed.

Change the Python code as follows, and it works again:
from selenium import webdriver

options = webdriver.ChromeOptions()
#options.binary_location = '/usr/bin/google-chrome-unstable'
options.add_argument('headless')
options.add_argument('window-size=1200x600')
browser = webdriver.Chrome(chrome_options=options)

browser.get('https://hydromet.lcra.org/riverreport')

tables = browser.find_elements_by_css_selector('table.table-condensed')

tbody = tables[5].find_element_by_tag_name("tbody")
for row in tbody.find_elements_by_tag_name("tr"):
    cells = row.find_elements_by_tag_name("td")
    if cells[0].text == 'Marble Falls (Starcke)':
        print(cells[1].text)

browser.quit()
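The row-matching loop above can also be factored into a small pure function, so the table-scanning logic is testable without launching a browser. This is only an illustrative sketch: the function name and the list-of-lists row format are my own, not part of the original script.

```python
def find_gauge_reading(rows, site_name):
    """Return the second cell of the first row whose first cell matches
    site_name, or None when the site is not present.

    rows: an iterable of lists of cell texts (one list per table row).
    """
    for cells in rows:
        if cells and cells[0] == site_name:
            return cells[1]
    return None


# With Selenium, rows would be built roughly like:
#   rows = [[td.text for td in tr.find_elements_by_tag_name("td")]
#           for tr in tbody.find_elements_by_tag_name("tr")]
print(find_gauge_reading(
    [["Buchanan", "1020.35"], ["Marble Falls (Starcke)", "736.20"]],
    "Marble Falls (Starcke)"))  # → 736.20
```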

Run everything with Scrapy and check that it works well.

Another way to create ENV
> virtualenv env
> . env/bin/activate

Check the version of a Python module with pip show (a built-in pip subcommand)

> pip install selenium
> pip show selenium | grep Version
Version: 3.11.0

I have a file named requirements.txt
selenium==3.11.0

I can run
> pip install -r requirements.txt

Here is how to generate the requirements.txt
> pip freeze > requirements.txt
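pip freeze emits one `name==version` pin per line, which is exactly the format `pip install -r` consumes. A tiny parser sketch (purely illustrative, the function name is my own) shows the shape of each line:

```python
def parse_requirement(line):
    """Split a pinned requirement line like 'selenium==3.11.0'
    into a (name, version) tuple."""
    name, _, version = line.strip().partition("==")
    return name, version


print(parse_requirement("selenium==3.11.0"))  # → ('selenium', '3.11.0')
```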

How to run the spider locally
> scrapy crawl quotes

Prepare the Deployment ENV
> pip install scrapyd
> pip install scrapyd-client

Start the Server
> scrapyd

Deploy the spider
> scrapyd-deploy

List the Projects and spiders
curl http://localhost:6800/listprojects.json
curl "http://localhost:6800/listspiders.json?project=default"
curl http://localhost:6800/schedule.json -d project=default -d spider=quotes
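The same scrapyd endpoints can be called from Python with the standard library. A hedged sketch, assuming scrapyd is running on localhost:6800 with a project named default (the function name is my own):

```python
import json
from urllib import parse, request

SCRAPYD = "http://localhost:6800"


def schedule_spider(project, spider):
    """POST to scrapyd's schedule.json and return the parsed JSON reply."""
    data = parse.urlencode({"project": project, "spider": spider}).encode()
    with request.urlopen(f"{SCRAPYD}/schedule.json", data=data) as resp:
        return json.loads(resp.read())


# Equivalent to:
#   curl http://localhost:6800/schedule.json -d project=default -d spider=quotes
# schedule_spider("default", "quotes")
```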

Investigate the Requests
https://requestbin.fullcontact.com/
https://hookbin.com/

More details are in the project monitor-water

References:
mysql
http://sillycat.iteye.com/blog/2393787
haproxy
http://sillycat.iteye.com/blog/2066118
http://sillycat.iteye.com/blog/1055846
http://sillycat.iteye.com/blog/562645

https://intoli.com/blog/running-selenium-with-headless-chrome/