python3 Scrapy getting-started notes (LTS)
References
https://scrapy.org/
https://docs.scrapy.org/en/latest/
https://docs.scrapy.org/en/latest/topics/commands.html
Using Python to scrape Yahoo Finance constituent-stock data, build a GUI, and plot charts
https://zhuanlan.zhihu.com/p/26394996
https://docs.scrapy.org/en/latest/topics/debug.html
Python basics for beginners
https://docs.python.org/3/tutorial
Automate the Boring Stuff With Python
How To Think Like a Computer Scientist
Learn Python 3 The Hard Way
Also see this list of Python resources for non-programmers, as well as the suggested resources in the learnpython-subreddit.
Test sites:
Simple example 1
pip install scrapy
myspider.py
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Yield one item per post title on the page.
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}

        # Follow the "next posts" link and parse it with this same method.
        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)
scrapy runspider myspider.py
It runs, but what does it actually do?
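In short: the spider starts at https://blog.scrapinghub.com, parse() extracts each post title from the .post-header>h2 elements and yields it as an item, then follows the a.next-posts-link pagination link and runs parse() again on the next page, until there are no more pages.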
Scrapy at a glance
https://docs.scrapy.org/en/latest/intro/overview.html
The site to crawl is:
http://quotes.toscrape.com
quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
scrapy runspider quotes_spider.py -o quotes.json
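The -o quotes.json option serializes every yielded item into quotes.json as a JSON array. The exact quotes depend on what the site serves, but the output looks roughly like:

[
    {"author": "Jane Austen", "text": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"},
    {"author": "Steve Martin", "text": "“A day without sunshine is like, you know, night.”"}
]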
Scrapy Tutorial
https://docs.scrapy.org/en/latest/intro/tutorial.html
http://quotes.toscrape.com/
scrapy startproject tutorial
test5> tree /f
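tree /f prints the skeleton that startproject generated; per the official tutorial it looks like this:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # project pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py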
Create the file quotes_spider.py under the tutorial/spiders directory with the following content:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
The full path to the file is tutorial\tutorial\spiders\quotes_spider.py. Then, from the top-level tutorial directory (the one that contains scrapy.cfg), run:
scrapy crawl quotes
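If the crawl succeeds, the spider writes quotes-1.html and quotes-2.html (one file per start URL) into the directory you ran the command from.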
Extracting data
Modify the code: change tutorial/spiders/quotes_spider.py to read:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
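Note that this shorter version (start_urls instead of start_requests) still only saves raw HTML. The next step in the official tutorial swaps parse() for selector-based extraction; run it with scrapy crawl quotes -o quotes.json to get structured items:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # Each quote on the page sits in a div with class "quote".
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }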
On Windows 7, run (note the double quotes):
scrapy shell "http://quotes.toscrape.com/page/1/"
On Linux, run:
scrapy shell 'http://quotes.toscrape.com/page/1/'
response.css('title')
response.css('title::text').getall()
response.css('title::text').get()
response.css('title::text')[0].get()
response.css('title::text').re(r'Quotes.*')
response.css('title::text').re(r'Q\w+')
response.css('title::text').re(r'(\w+) to (\w+)')
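For reference, these selector calls return the following (output reproduced from the official tutorial; the page title is "Quotes to Scrape", and the exact Selector repr can vary by Scrapy version):

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
>>> response.css('title::text').getall()
['Quotes to Scrape']
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.css('title::text')[0].get()
'Quotes to Scrape'
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']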
view(response)    # opens the downloaded page in your default browser
response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').getall()
response.xpath('//span[has-class("text")]/text()').getall()
https://docs.scrapy.org/en/latest/topics/developer-tools.html#topics-developer-tools
Selector Gadget is also a nice tool to quickly find CSS selectors for visually selected elements; it works in many browsers.
https://selectorgadget.com/
SelectorGadget:
point and click CSS selectors
Then, back in your web browser, right-click on the span tag and select Copy > XPath (which yields /html/body/div/div[2]/div[1]/div[1]/span[1]), then paste it into the Scrapy shell like so:
response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').getall()
Infinite-scroll test page: http://quotes.toscrape.com/scroll
Selecting dynamically-loaded content
https://docs.scrapy.org/en/latest/topics/dynamic-content.html
https://docs.scrapy.org/en/latest/topics/dynamic-content.html#topics-finding-data-source
https://docs.scrapy.org/en/latest/topics/dynamic-content.html#topics-javascript-rendering
scrapy fetch --nolog https://example.com > response.html
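scrapy fetch saves the page exactly as Scrapy's downloader sees it; if the quotes are missing from response.html, they are being loaded by JavaScript. For the /scroll page above, the browser's Network tab shows the data coming from a JSON endpoint. A minimal sketch of the find-the-data-source approach, assuming the endpoint is /api/quotes?page=N and the response carries quotes, page, and has_next fields (verify this in your own developer tools):

import json
import scrapy

class ScrollSpider(scrapy.Spider):
    name = 'scroll'
    start_urls = ['http://quotes.toscrape.com/api/quotes?page=1']

    def parse(self, response):
        # The endpoint returns JSON, not HTML, so parse the body directly.
        data = json.loads(response.text)
        for quote in data['quotes']:
            yield {
                'author': quote['author']['name'],
                'text': quote['text'],
            }
        # Keep requesting the next page until the API says there is none.
        if data.get('has_next'):
            url = 'http://quotes.toscrape.com/api/quotes?page=%d' % (data['page'] + 1)
            yield scrapy.Request(url, callback=self.parse)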
Using a headless browser
The easiest way to use a headless browser with Scrapy is to use Selenium, along with scrapy-selenium for seamless integration.
https://www.selenium.dev/
https://github.com/clemfromspace/scrapy-selenium
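Following the scrapy-selenium README, wiring it up looks roughly like this sketch (the driver name, executable path, and browser arguments are assumptions to adapt to your own setup):

# settings.py (option names per the scrapy-selenium README)
SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/geckodriver'   # adjust to your install
SELENIUM_DRIVER_ARGUMENTS = ['-headless']                  # run the browser headless
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}

# spider: use SeleniumRequest instead of scrapy.Request
import scrapy
from scrapy_selenium import SeleniumRequest

class SeleniumDemoSpider(scrapy.Spider):
    name = 'selenium_demo'

    def start_requests(self):
        # The response handed to the callback contains the browser-rendered HTML.
        yield SeleniumRequest(url='http://quotes.toscrape.com/scroll',
                              callback=self.parse)

    def parse(self, response):
        for text in response.css('span.text::text').getall():
            yield {'text': text}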