Innovation Practical Training (6): cnblogs Home Page Crawler (Part 2)

Article link: https://blog.youkuaiyun.com/qq_34842847/article/details/106918212

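This is the sixth post in the Innovation Practical Training series. It covers how to fetch the tags of a cnblogs post. Inspecting the page shows that the tags are loaded through a separate AJAX request. The post walks through locating blogId and postId, building the URL, and sending a GET request to retrieve the tags, then saving the data to a MySQL database.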



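Continuing the analysis of cnblogs posts.

2.6 Fetching post tags

Fetching the tags turned out to be more troublesome. At first I did the same as before: open the developer tools with F12 and look for the matching HTML. But running the spider failed, because the element could not be found. I then searched the response body, and sure enough the tags were not there.

My guess was that they are fetched separately via AJAX, so I searched the network requests again and found another request: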

Inspecting that request:

[Screenshot: blog tags 03]

It is a GET request, and the URL follows this pattern:

https://www.cnblogs.com/ + user nickname + /ajax/CategoriesTags.aspx?blogId= + blogId (416394) + &postId= + postId (13180589)

So once blogId and postId are known, the tags for a post can be requested.
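The URL pattern above can be sketched as a small helper. This is only an illustration: the function name `build_tags_url` and the nickname `someuser` are made up for the example, while the two IDs are the ones shown in the pattern.

```python
def build_tags_url(nickname, blog_id, post_id):
    """Assemble the CategoriesTags URL from its three variable parts."""
    return (
        f"https://www.cnblogs.com/{nickname}"
        f"/ajax/CategoriesTags.aspx?blogId={blog_id}&postId={post_id}"
    )


# "someuser" is a hypothetical nickname used only for illustration.
tags_url = build_tags_url("someuser", 416394, 13180589)
print(tags_url)
```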

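Searching the response of the post page again,

[Screenshot: blog tags 04]

both blogId and postId turn up there. All that remains is to extract these two values, build the URL, and send the request to get the post's tags.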

Here I use regular expressions to extract blogId and postId:

# blogId, used to fetch categories and tags; \d+ captures the whole number
blogid = response.xpath('//script').re(r'cb_blogId = (\d+)')[0]
# postId
postid = response.xpath('//script').re(r'cb_entryId = (\d+)')[0]
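The extraction step can be tried outside Scrapy with the standard `re` module. The sketch below is an assumption-laden illustration: `SAMPLE_SCRIPT` is a made-up fragment standing in for the page's inline script (the real page may format it differently), and `extract_id` is a hypothetical helper. It also shows why the quantifier matters: a group such as `(\d[0-9])` matches exactly two digits, whereas `(\d+)` captures the full ID.

```python
import re

# Hypothetical inline-script fragment; the real page markup may differ.
SAMPLE_SCRIPT = "var cb_blogId = 416394, cb_entryId = 13180589;"


def extract_id(text, name):
    """Return the digits assigned to `name` as a string, or None if absent."""
    match = re.search(rf"{re.escape(name)}\s*=\s*(\d+)", text)
    return match.group(1) if match else None


blog_id = extract_id(SAMPLE_SCRIPT, "cb_blogId")
post_id = extract_id(SAMPLE_SCRIPT, "cb_entryId")

# For comparison: a two-character group only captures the first two digits.
truncated = re.search(r"cb_blogId = (\d[0-9])", SAMPLE_SCRIPT).group(1)
```

Returning None instead of indexing `[0]` directly makes it easier to skip pages where the variable is missing rather than failing with an IndexError.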

Then build the URL:

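# Build the URL for fetching categories and tags.
# url[:url.index('/', 24)] keeps 'https://www.cnblogs.com/<nickname>'
category_url = url[:url.index('/', 24)] + \
    '/ajax/CategoriesTags.aspx?blogId=' + \
    blogid + '&postId=' + postid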
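The offset 24 works because `https://www.cnblogs.com/` is exactly 24 characters long, so the search for the next `/` starts at the nickname. As an alternative sketch that avoids the hard-coded offset, the same prefix can be derived with `urllib.parse`. The helper name `blog_base` and the sample post URL are assumptions for illustration, not taken from the original project.

```python
from urllib.parse import urlsplit


def blog_base(post_url):
    """Return 'scheme://host/<nickname>' for a post URL."""
    parts = urlsplit(post_url)
    # The nickname is assumed to be the first path segment.
    nickname = parts.path.strip("/").split("/")[0]
    return f"{parts.scheme}://{parts.netloc}/{nickname}"


# Hypothetical post URL used only for illustration.
post_url = "https://www.cnblogs.com/someuser/p/13180589.html"
base = blog_base(post_url)

# The slicing approach gives the same prefix for this URL.
sliced = post_url[:post_url.index("/", 24)]
```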

Finally, extract the tags and categories:
[Screenshot: blog tags 05]

item['tags'] = (
    response.xpath('//div[@id="BlogPostCategory"]/a/text()').extract()
    + response.xpath('//div[@id="EntryTag"]/a/text()').extract()
)
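To see what the two XPath expressions select, here is a standalone sketch that uses the standard library's `xml.etree.ElementTree` in place of Scrapy selectors. The `SAMPLE` markup is a simplified, hypothetical stand-in for the CategoriesTags response; only the two div ids come from the XPath above, and the real response may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified response body wrapped in a root element so that
# it parses as well-formed XML. Only the div ids come from the XPath above.
SAMPLE = """
<root>
  <div id="BlogPostCategory">Categories: <a href="#">Python</a></div>
  <div id="EntryTag">Tags: <a href="#">scrapy</a>, <a href="#">crawler</a></div>
</root>
"""


def extract_labels(xml_text):
    """Collect the link texts of the category div, then of the tag div."""
    root = ET.fromstring(xml_text.strip())
    categories = [a.text for a in root.findall(".//div[@id='BlogPostCategory']/a")]
    tags = [a.text for a in root.findall(".//div[@id='EntryTag']/a")]
    return categories + tags


labels = extract_labels(SAMPLE)
```

In the spider itself, the XPath is presumably evaluated against the response of the request sent to category_url, not the response of the post page.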

3. Saving the data with a Pipeline

Here I save the items directly to a MySQL database:

import pymysql

class MysqlPipeline(object):
    def __init__(self):
        # Connect to the database
        self.connect = pymysql.connect(host='xxx', user='xxx', passwd='xxx',
                                       db='blogstorm')  # user, passwd, and db are the database user name, password, and database name
        # Get a cursor
        self.cursor = self.connect.cursor()
        print("Connected to the database")

    def process_item(self, item, spider):
        title = item['title'][0]
        url = item['url'][0]
        content = item['content'][0]
        tags = item['tags']
        if tags:
            tags = ','.join(tags)
        else:
            tags = ''
        update_time = item['update_time'][0].split()[0]

        # SQL statement
        insert_sql = """
        insert into article(title, url, content, tags, update_time) 
        VALUES (%s,%s,%s,%s,str_to_date(%s,'%%Y-%%m-%%d'))
        """
        # Execute the insert
        self.cursor.execute(insert_sql, (title, url, content, tags,
                                         update_time))
        # Commit; without a commit the row is not saved to the database
        self.connect.commit()
        return item

    def close_spider(self, spider):
        # Close the cursor and the connection
        self.cursor.close()
        self.connect.close()
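For Scrapy to call this pipeline, it has to be enabled in the project's settings.py. A minimal sketch follows; the module path `blogstorm.pipelines` is an assumption (the project name is not given in the post, so the database name is borrowed here), and 300 is just a conventional priority value.

```python
# settings.py
# The module path below is hypothetical; replace it with the real
# location of MysqlPipeline in your project.
ITEM_PIPELINES = {
    "blogstorm.pipelines.MysqlPipeline": 300,
}
```

Lower numbers run earlier when several pipelines are enabled, so any cleaning pipeline should get a smaller value than the one that writes to MySQL.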