Crawling a given CSDN blogger's articles and saving them as HTML and PDF

qq_42307546

Last modified: 2023-02-12 20:55:48

Category: python. Tags: python, crawler

First published: 2023-01-09 21:02:21

Article link: https://blog.youkuaiyun.com/qq_42307546/article/details/128606844

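Approach:
step1: Collect the article_ids of all of the blogger's posts.
step2: For each article_id, fetch the article's HTML and extract the part we want.
step3: Save it as HTML, then also save a more readable PDF copy.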

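First, open one of a blogger's articles and look at its URL.
The URL of a specific article has the form
https://blog.youkuaiyun.com/{blogger_id}/article/details/{article_id}
So if we can obtain every article ID and build URLs in this format, we can fetch each article's content. How do we get all of the IDs?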
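
As a minimal sketch of that URL pattern (the helper name `build_article_url` is hypothetical; the example values are the blogger ID and article ID of this very post):

```python
def build_article_url(author_name: str, article_id) -> str:
    """Build an article URL from a blogger ID and an article ID."""
    return f"https://blog.youkuaiyun.com/{author_name}/article/details/{article_id}"


print(build_article_url("qq_42307546", 128606844))
# https://blog.youkuaiyun.com/qq_42307546/article/details/128606844
```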

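Click the blogger's avatar to open the page that lists all of their articles. Then right-click and choose Inspect, switch to the Network tab, and scroll down the page; the request to the list API will show up.
Its URL has the form https://blog.youkuaiyun.com/community/home-api/v1/get-business-list?page=10&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username=weixin_44799217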

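Click Preview.
The data is sent with a GET request: page is the current page number, size is the number of articles per page, and username is the blogger's ID. Where is the article ID? In the Preview pane you can see the IDs of the 20 articles on that page.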
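
The sketch below shows how the IDs might be pulled out of one page of that response. The sample payload is invented for illustration and only mirrors the fields the crawler relies on (`data`, `list`, `articleId`, `title`); the real response carries more fields. The function name `extract_articles` is hypothetical.

```python
import json

# Hypothetical, trimmed-down response body; the values are invented.
sample_response = json.loads("""
{"data": {"list": [
    {"articleId": 111, "title": "first post"},
    {"articleId": 222, "title": "second post"}
]}}
""")


def extract_articles(page_dict):
    """Return (articleId, title) pairs from one page of the list API."""
    return [(a["articleId"], a["title"]) for a in page_dict["data"]["list"]]


print(extract_articles(sample_response))
# [(111, 'first post'), (222, 'second post')]
```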

So we can get every article ID from this API, build each article's URL, and request it to get the content.
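
One way to sketch the "walk every page until nothing comes back" idea without touching the network is to pass the page-fetching function in as a parameter. Here `iter_article_ids` and `fake_fetch` are hypothetical names, and treating an empty list as the end of the data is an assumption about how the API signals the last page.

```python
def iter_article_ids(fetch_page, max_pages=200):
    """Yield article IDs page by page until a page comes back empty."""
    for page in range(1, max_pages + 1):
        articles = fetch_page(page)["data"]["list"]
        if not articles:  # assumed end-of-list signal
            break
        for article in articles:
            yield article["articleId"]


# Stand-in for the real request: two pages of invented data, then nothing.
FAKE_PAGES = {1: [{"articleId": 1}, {"articleId": 2}], 2: [{"articleId": 3}]}


def fake_fetch(page):
    return {"data": {"list": FAKE_PAGES.get(page, [])}}


print(list(iter_article_ids(fake_fetch)))
# [1, 2, 3]
```

In a real crawler, `fetch_page` would wrap the GET request to the get-business-list endpoint with `page` set accordingly.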
Full code


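import os
import random
import time
import requests
from lxml import etree
import pdfkit

author_name = input('Enter the blogger ID: ')
MAX_PAGE_NUM = 200  # upper bound on the number of list pages to request
i = 1  # running count of saved articles
sess = requests.Session()  # create a session so the connection is reused
agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'  # User-Agent header so the requests look like they come from a browser
sess.headers['User-Agent'] = agent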


def crawler_blog_by(author_name, article_id, title):
    """
    Build the article URL from the blogger ID and article ID, fetch the page,
    and save the article body as HTML (plus a PDF copy).
    :param author_name: blogger ID
    :param article_id: article ID
    :param title: article title, used in the output file name
    :return: None
    """
    global i
    article_request_url = f'https://blog.youkuaiyun.com/{author_name}/article/details/{article_id}'
    response = sess.get(article_request_url)
    select = etree.HTML(response.text)
    body_nodes = select.xpath(r"//div[@id='content_views']")
    if not body_nodes:
        print(f'[warn] no article body found for {article_id}, skipped')
        return
    body_str = etree.tostring(body_nodes[0], encoding='utf8', method='html').decode()  # convert the element to a str
    os.makedirs('craw_url', exist_ok=True)
    # Replace characters that are not allowed in file names
    for ch in '\\/:*?"<>|：':
        title = title.replace(ch, '-')
    sane_name = f'{author_name}-{title}-{article_id}.html'
    html_path = os.path.join('craw_url', sane_name)
    with open(html_path, 'w', encoding='utf-8') as f:
        f.write(f"""
        <head>
        <meta charset="UTF-8">
        </head>

        {body_str}

        """)
    # Convert only after the with block has closed (and flushed) the file
    html_to_pdf(html_path)

    print(f'[info] article {i} saved: {author_name}-{title}-{article_id}')
    i += 1


def html_to_pdf(file_html_name):
    """Convert a saved HTML file to a PDF next to it (pdfkit needs wkhtmltopdf installed)."""
    pre_file_name = os.path.splitext(file_html_name)[0]
    wkhtmltopdf_options = {'enable-local-file-access': None}
    pdfkit.from_file(file_html_name, pre_file_name + '.pdf', options=wkhtmltopdf_options)


# Loop over the paginated article list
for each in range(1, MAX_PAGE_NUM + 1):
    try:
        data = {'page': each,
                'size': 20,
                'businessType': 'blog',
                'orderby': '',
                'noMore': 'false',  # matches the lowercase value seen in the observed URL
                'year': '',
                'month': '',
                'username': author_name}
        page_dict = sess.get('https://blog.youkuaiyun.com/community/home-api/v1/get-business-list', params=data).json()
        articles = page_dict['data']['list']
        if not articles:
            break  # an empty page means there is nothing left to fetch
        for article in articles:
            article_id = article['articleId']
            title = article['title']
            crawler_blog_by(author_name, article_id, title)
            time.sleep(random.uniform(0.4, 1.0))  # random pause between requests
    except Exception as e:
        print(e)  # could be written to a log file instead
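
Because the article title ends up in a file name, characters that are illegal on common file systems have to be replaced first. A minimal sketch of that cleanup as a standalone helper (the name `safe_filename` and the exact character set are assumptions, not part of the script above):

```python
def safe_filename(title: str, replacement: str = "-") -> str:
    """Replace characters that commonly break file names."""
    for ch in '\\/:*?"<>|':
        title = title.replace(ch, replacement)
    return title.strip()


print(safe_filename("python: requests/lxml?"))
# python- requests-lxml-
```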