Hand-writing myscrapy (Part 7)

License: CC 4.0 BY-SA

Tags: #python #scrapy

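This post gives the myscrapy project address and the complete source code from Part 6, then introduces the project configuration. The config directory contains logging.conf and setting.py: the former is the configuration file for Python's logging library and is used for log output; the latter holds the crawl settings, each documented with a comment and adjustable as needed.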

Project address: https://gitee.com/wyu_001/myscrapy

First, here is the complete source code from Part 6:

from common.myspider import MySpider
from common.log import loger


class DemoSpider(MySpider):
    name = 'demo'  # spider name; must match the spider name used in Excel and the database
    start_urls = ['https://www.baidu.com']  # list of start URLs

    def __init__(self):
        super().__init__()

    def parse(self, response):
        # Parse callback: response is a common.myresponse object; the method must be named parse
        title = response.xpath('//title/text()').getall()
        loger.info(title)

        url = response.xpath("//div[@id='s-top-left']/a[1]/@href").get()
        loger.info(url)
        response.follow(url, callback=self.parse_new)

    def parse_new(self, response):
        title = response.xpath('//div[@class="hotnews"]//a/text()').getall()

        for node in response.xpath('//div[@class="hotnews"]//a'):
            text = node.xpath('text()').get()
            url = node.xpath('@href').get()
            loger.info(f'{text}:{url}')
        loger.info(title)


if __name__ == '__main__':
    DemoSpider().start_request()

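Now let's continue with the project configuration, which consists of two files under config: logging.conf and setting.py.

logging.conf is the configuration file for Python's logging library, which this project uses for its logging.
setting.py holds the crawl settings.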
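
For logging.conf, files of this kind are normally loaded with logging.config.fileConfig from the standard library. The sketch below shows that mechanism under stated assumptions: the post does not show the project's real logging.conf, so SAMPLE_CONF is a hypothetical minimal file, and load_logger is an illustrative helper rather than part of myscrapy. A temporary file stands in for the path that a setting such as LOG_CONFIG would point to.

```python
import logging
import logging.config
import os
import tempfile

# Hypothetical minimal logging.conf; the project's real file is not shown in the post.
SAMPLE_CONF = """\
[loggers]
keys=root

[handlers]
keys=console

[formatters]
keys=simple

[logger_root]
level=INFO
handlers=console

[handler_console]
class=StreamHandler
level=INFO
formatter=simple
args=(sys.stdout,)

[formatter_simple]
format=%(asctime)s %(levelname)s %(name)s: %(message)s
"""


def load_logger(conf_path, name="myscrapy"):
    """Configure logging from an INI-style file and return a named logger."""
    logging.config.fileConfig(conf_path, disable_existing_loggers=False)
    return logging.getLogger(name)


# Write the sample config to a temporary file, standing in for the LOG_CONFIG path.
with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as f:
    f.write(SAMPLE_CONF)
    conf_path = f.name

loger = load_logger(conf_path)  # same spelling as the project's `loger` object
loger.info("logging configured from %s", conf_path)
os.remove(conf_path)  # clean up the temporary file
```

Keeping the format and handlers in a file rather than in code means the log level or destination can be changed without touching the spiders. The setting.py listing follows.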


# Location of the logging configuration file (absolute path)
LOG_CONFIG = "D:/myscrapy/config/logging.conf"
####################################################################################################
# Request parameter settings

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'


# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': '*/*',
    'Connection':'keep-alive',
    # 'Accept-Encoding':'gzip, deflate, br',
    'Accept-Language':'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'User-Agent' : USER_AGENT
}
# Whether to write results to an Excel file: True = generate, False = do not generate
EXCEL_FILE = True

# Directory where generated Excel files are stored (absolute path)
EXCEL_FILE_DIR = "D:/myscrapy/file"

# Whether to store results in the database: True = store, False = do not store
DATABASE = False
MYSQL = {
    "host": "localhost",
    "user": "mysql",
    "passwd": "mysql123",
    "db": "demo",
    "charset": "utf8"
}

# Whether to deduplicate URLs: True = deduplicate, False = do not deduplicate
URL_DUPLICATE = False
########################################################################################
# Baidu OCR application settings: app_id, api_key, secret_key
OCR = {
    "app_id" : "xxxxx",
    "api_key" : "xxxxxxxxxxxxxx",
    "secret_key" : "xxxxxxxxxxxxxxxxxx"
}
#########################################################################################
# WebDriverWait duration, in seconds

WEB_DRIVER_TIME = 10
# Browser location
SELENIUM_LOCATION = 'C:/Users/spring/AppData/Local/Google/Chrome/Application/chrome.exe'

# Browser driver location
SELENIUM_EXCUTEPATH = r'D:/chromedriver_win32/chromedriver.exe'
########################################################################################
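# Batch runs
# By default, a batch run executes every subclass of the MySpider class under spider
# Batch-run parameter: number of concurrent threads per run

BATCH_THREADS = 10

# Custom batch run: the script files under spider to run
BATCH_FILES = ['dxyqueryhospital.py',
               'haodfqueryhospital.py'
               ]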
#######################################################################################
# Switch: collect keyword dictionaries into the local dictionary
KEY_WORD_COLLECT = False

# Retrieval rules for résumé text, written as regular expressions
RULES = []
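
How the framework reads these values is not shown in this post, so the following is only a rough sketch of how two of them, DEFAULT_REQUEST_HEADERS and URL_DUPLICATE, might be consumed. UrlFilter and build_request are hypothetical names, not part of myscrapy. The User-Agent string is shortened for brevity, and URL_DUPLICATE is set to True here (the listing ships with False) so that the deduplication is visible. Requests are built with the standard library's urllib.request.Request but never sent.

```python
from urllib.request import Request

# Values modelled on the setting.py listing (User-Agent shortened for brevity).
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
DEFAULT_REQUEST_HEADERS = {
    "Accept": "*/*",
    "Connection": "keep-alive",
    "User-Agent": USER_AGENT,
}
URL_DUPLICATE = True  # assumption: switched on here to show the effect


class UrlFilter:
    """Hypothetical helper that remembers URLs already scheduled."""

    def __init__(self, enabled):
        self.enabled = enabled
        self._seen = set()

    def should_fetch(self, url):
        """Return True if the URL should be requested."""
        if not self.enabled:
            return True  # deduplication off: every URL is fetched
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


def build_request(url, extra_headers=None):
    """Build (but do not send) a request carrying the default headers."""
    headers = dict(DEFAULT_REQUEST_HEADERS)  # copy so the defaults stay untouched
    headers.update(extra_headers or {})
    return Request(url, headers=headers)


url_filter = UrlFilter(URL_DUPLICATE)
urls = [
    "https://www.baidu.com",
    "https://www.baidu.com",  # duplicate, skipped when deduplication is on
    "https://gitee.com/wyu_001/myscrapy",
]
requests_to_send = [build_request(u) for u in urls if url_filter.should_fetch(u)]
print(len(requests_to_send))  # prints 2
```

With URL_DUPLICATE left at False, the filter lets every URL through and all three requests would be built.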