Python Crawler: Collecting App URLs

Scrapy Crawler in Practice

Latest recommended article published on 2024-04-29 17:09:12

License: CC 4.0 BY-SA

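This article describes how to install and configure the Scrapy crawler framework under Python 2.7. It walks through the whole process from installation to writing a spider script: environment setup, command-line operations, script writing, and debugging tips.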

1. Set up the environment: Python 2.7

2. Install Scrapy

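Run pip2.7 install scrapy. If Scrapy was installed some other way, the scrapy command may not be available on Windows; in that case, run pip2.7 uninstall scrapy first and then install it again.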
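
As a complement to the command-line check, the following minimal sketch asks the Python interpreter itself whether Scrapy can be imported. The helper name scrapy_version is hypothetical (it is not part of Scrapy); the sketch only assumes that the package exposes a __version__ attribute and falls back to "unknown" otherwise.

```python
def scrapy_version():
    """Return the installed Scrapy version string, or None if Scrapy
    cannot be imported by the interpreter running this code.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    try:
        import scrapy
    except ImportError:
        # Scrapy is not installed for this interpreter.
        return None
    # Fall back to "unknown" if the attribute is missing.
    return getattr(scrapy, "__version__", "unknown")
```

Calling scrapy_version() from the same interpreter that pip2.7 installs into returns None when the package is missing, which usually means pip installed it for a different Python installation.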

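3. Type scrapy at the command prompt; if the command help is printed, the installation is correct.
4. On Windows, change into the crawler workspace with cd D:\PythonWorkspace\spider, then run: scrapy startproject tutorial
5. The command generates the following project files: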

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
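
To confirm that startproject produced the layout listed above, a small check along the following lines can help. This is only a sketch: EXPECTED_FILES and missing_project_files are hypothetical names, and the file list is copied from the tree in this article, so other Scrapy versions may generate additional files.

```python
import os

# Files taken from the project tree shown in this article (an assumption:
# other Scrapy versions may create more files than these).
EXPECTED_FILES = [
    "scrapy.cfg",
    os.path.join("tutorial", "__init__.py"),
    os.path.join("tutorial", "items.py"),
    os.path.join("tutorial", "pipelines.py"),
    os.path.join("tutorial", "settings.py"),
    os.path.join("tutorial", "spiders", "__init__.py"),
]


def missing_project_files(project_root):
    """Return the expected files that do not exist under project_root."""
    return [
        rel for rel in EXPECTED_FILES
        if not os.path.isfile(os.path.join(project_root, rel))
    ]
```

For the project created in step 4, missing_project_files(r"D:\PythonWorkspace\spider\tutorial") should return an empty list.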

6. Write the spider script

# -*- coding: utf-8 -*-
import io

import scrapy


class WanDouJia_browser_Spider(scrapy.Spider):
    name = "Spider-appLabel_URL"

    def start_requests(self):
        # Entry page for the top-level app categories (not used here):
        # url = "http://www.wandoujia.com/category/app"

        # The first column of each CSV row is an app label to search for.
        labels_path = u"D:\\PythonWorkspace\\spider\\Resources\\appLabels.csv"
        with io.open(labels_path, 'r', encoding='utf-8') as labels:
            for line in labels:
                keyWord = line.split(",")[0].strip()
                if not keyWord:
                    continue  # skip blank rows
                url = u"http://www.wandoujia.com/search?key=" + keyWord
                yield scrapy.Request(url=url, meta={'appname': keyWord},
                                     callback=self.parse_big_class)

    # Extract the name and entry URL of the first app in the search results.
    def parse_big_class(self, response):
        appName = response.xpath('//h2[@class="app-title-h2"]/a/text()').extract_first()
        url = response.xpath('//h2[@class="app-title-h2"]/a/@href').extract_first()
        if appName is None or url is None:
            return  # no matching result on this page
        self.logger.info(u"%s %s", appName, url)

        # Append the result; the with block closes the file after writing.
        with io.open("appListURLs.csv", 'a', encoding='utf-8') as f:
            f.write(appName + u"," + url + u"\n")
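
The keyword handling in start_requests can also be tried outside Scrapy. The sketch below isolates the two steps, taking the first CSV column and building the search URL, using only the standard library. The function names label_keywords and search_url are hypothetical, and the explicit percent-encoding is an addition for illustration so that the resulting URL is easy to inspect or log; the spider above simply concatenates the keyword.

```python
# -*- coding: utf-8 -*-
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2.7

SEARCH_URL = "http://www.wandoujia.com/search?key="


def label_keywords(lines):
    """Return the stripped first column of each CSV line, skipping blanks."""
    keywords = (line.split(",")[0].strip() for line in lines)
    return [kw for kw in keywords if kw]


def search_url(keyword):
    """Build the search URL for a keyword, percent-encoding it as UTF-8."""
    if not isinstance(keyword, bytes):
        keyword = keyword.encode("utf-8")
    return SEARCH_URL + quote(keyword)
```

For example, search_url(u"QQ") returns http://www.wandoujia.com/search?key=QQ, and a Chinese label such as u"微信" is encoded as %E5%BE%AE%E4%BF%A1.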

7. Configure debugging.

Open the PyCharm run/debug configuration dialog (Run -> Edit Configurations).

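Select the project: choose the project to debug, Spider.
Set the script to run (Script) to D:\Python27\Lib\site-packages\scrapy\cmdline.py. cmdline.py is the command-line entry script that Scrapy provides; with it as the startup script, the spider to debug is passed to it as an argument.
Set the script parameters (Script parameters) to crawl Spider-appLabel_URL. The arguments follow the crawl command described in the official documentation.
Set the working directory (Work Directory) to the project root D:\PythonWorkspace\spider\tutorial, which contains the crawler configuration file scrapy.cfg.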
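
An alternative to pointing PyCharm at cmdline.py is a small launcher script placed in the project root next to scrapy.cfg. The sketch below assumes such a file (for instance one named run.py, a hypothetical name); crawl_argv and run_spider are likewise illustrative names. It relies on scrapy.cmdline.execute, which accepts an argv-style list whose first element is the program name.

```python
def crawl_argv(spider_name):
    """Build the argument list equivalent to `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


def run_spider(spider_name):
    """Start the crawl in the current process so breakpoints are hit.

    Must be called with the working directory set to the project root
    (the folder that contains scrapy.cfg).
    """
    # Imported here so the module can be loaded even without Scrapy.
    from scrapy.cmdline import execute
    execute(crawl_argv(spider_name))
```

Calling run_spider("Spider-appLabel_URL") at the end of the launcher and debugging that file in PyCharm should behave like the configuration described above, without hard-coding the site-packages path.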

The configuration is shown in the figure below: