python3 Scrapy getting-started notes (LTS)
References
https://scrapy.org/
https://docs.scrapy.org/en/latest/
https://docs.scrapy.org/en/latest/topics/commands.html
Using Python to scrape Yahoo Finance constituent-stock data, build a GUI, and plot charts
https://zhuanlan.zhihu.com/p/26394996
https://docs.scrapy.org/en/latest/topics/debug.html
Python basics for beginners
https://docs.python.org/3/tutorial
Automate the Boring Stuff With Python
How To Think Like a Computer Scientist
Learn Python 3 The Hard Way
Also see this list of Python resources for non-programmers, as well as the suggested resources in the learnpython-subreddit.
Test sites:
Simple example 1
pip install scrapy
myspider.py
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Yield one item per post title on the page.
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}

        # Follow the "next posts" link and parse it with this same method.
        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)
scrapy runspider myspider.py
It runs, but what does it actually do?
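In short: the spider starts at https://blog.scrapinghub.com, parse() extracts each post title from the .post-header>h2 elements and yields it as an item, then follows the a.next-posts-link pagination link and runs parse() again on the next page, until there are no more pages.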
Scrapy at a glance
https://docs.scrapy.org/en/latest/intro/overview.html
The site to crawl is:
http://quotes.toscrape.com
quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
scrapy runspider quotes_spider.py -o quotes.json
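The -o quotes.json option serializes every yielded item into quotes.json as a JSON array. The exact quotes depend on what the site serves, but the output looks roughly like:

[
    {"author": "Jane Austen", "text": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"},
    {"author": "Steve Martin", "text": "“A day without sunshine is like, you know, night.”"}
]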
Scrapy Tutorial
https://docs.scrapy.org/en/latest/intro/tutorial.html
http://quotes.toscrape.com/
scrapy startproject tutorial
test5> tree /f
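tree /f prints the skeleton that startproject generated; per the official tutorial it looks like this:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # project pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py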
Create the file quotes_spider.py under the tutorial/spiders directory with the following content:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
The full path to the file is tutorial\tutorial\spiders\quotes_spider.py. Then, from the top-level tutorial directory (the one that contains scrapy.cfg), run:
scrapy crawl quotes
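If the crawl succeeds, the spider writes quotes-1.html and quotes-2.html (one file per start URL) into the directory you ran the command from.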
Extracting data
Modify the code: change tutorial/spiders/quotes_spider.py to read:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
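Note that this shorter version (start_urls instead of start_requests) still only saves raw HTML. The next step in the official tutorial swaps parse() for selector-based extraction; run it with scrapy crawl quotes -o quotes.json to get structured items:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # Each quote on the page sits in a div with class "quote".
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }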
On Windows 7, run (note the double quotes):
scrapy shell "http://quotes.toscrape.com/page/1/"
On Linux, run:
scrapy shell 'http://quotes.toscrape.com/page/1/'
response.css('title')
response.css('title::text').getall()
response.css('title::text').get()
response.css('title::text')[0].get()
response.css('title::text').re(r'Quotes.*')
response.css('title::text').re(r'Q\w+')
response.css('title::text').re(r'(\w+) to (\w+)')
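For reference, these selector calls return the following (output reproduced from the official tutorial; the page title is "Quotes to Scrape", and the exact Selector repr can vary by Scrapy version):

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
>>> response.css('title::text').getall()
['Quotes to Scrape']
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.css('title::text')[0].get()
'Quotes to Scrape'
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']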
view(response)    # opens the downloaded page in your default browser
response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').getall()
response.xpath('//span[has-class("text")]/text()').getall()
https://docs.scrapy.org/en/latest/topics/developer-tools.html#topics-developer-tools
Selector Gadget is also a nice tool to quickly find CSS selectors for visually selected elements; it works in many browsers.
https://selectorgadget.com/
SelectorGadget:
point and click CSS selectors
Then, back in your web browser, right-click on the span tag and select Copy > XPath (which yields /html/body/div/div[2]/div[1]/div[1]/span[1]), then paste it into the Scrapy shell like so:
response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').getall()
Infinite-scroll test page: http://quotes.toscrape.com/scroll
Selecting dynamically-loaded content
https://docs.scrapy.org/en/latest/topics/dynamic-content.html
https://docs.scrapy.org/en/latest/topics/dynamic-content.html#topics-finding-data-source
https://docs.scrapy.org/en/latest/topics/dynamic-content.html#topics-javascript-rendering
scrapy fetch --nolog https://example.com > response.html
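scrapy fetch saves the page exactly as Scrapy's downloader sees it; if the quotes are missing from response.html, they are being loaded by JavaScript. For the /scroll page above, the browser's Network tab shows the data coming from a JSON endpoint. A minimal sketch of the find-the-data-source approach, assuming the endpoint is /api/quotes?page=N and the response carries quotes, page, and has_next fields (verify this in your own developer tools):

import json
import scrapy

class ScrollSpider(scrapy.Spider):
    name = 'scroll'
    start_urls = ['http://quotes.toscrape.com/api/quotes?page=1']

    def parse(self, response):
        # The endpoint returns JSON, not HTML, so parse the body directly.
        data = json.loads(response.text)
        for quote in data['quotes']:
            yield {
                'author': quote['author']['name'],
                'text': quote['text'],
            }
        # Keep requesting the next page until the API says there is none.
        if data.get('has_next'):
            url = 'http://quotes.toscrape.com/api/quotes?page=%d' % (data['page'] + 1)
            yield scrapy.Request(url, callback=self.parse)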
Using a headless browser
The easiest way to use a headless browser with Scrapy is to use Selenium, along with scrapy-selenium for seamless integration.
https://www.selenium.dev/
https://github.com/clemfromspace/scrapy-selenium
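Following the scrapy-selenium README, wiring it up looks roughly like this sketch (the driver name, executable path, and browser arguments are assumptions to adapt to your own setup):

# settings.py (option names per the scrapy-selenium README)
SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/geckodriver'   # adjust to your install
SELENIUM_DRIVER_ARGUMENTS = ['-headless']                  # run the browser headless
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}

# spider: use SeleniumRequest instead of scrapy.Request
import scrapy
from scrapy_selenium import SeleniumRequest

class SeleniumDemoSpider(scrapy.Spider):
    name = 'selenium_demo'

    def start_requests(self):
        # The response handed to the callback contains the browser-rendered HTML.
        yield SeleniumRequest(url='http://quotes.toscrape.com/scroll',
                              callback=self.parse)

    def parse(self, response):
        for text in response.css('span.text::text').getall():
            yield {'text': text}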