A Small Web Crawler Example: Scraping Books and Saving Them Locally

Latest recommended article published on 2025-08-04 17:16:47

weixin_30725315

License: CC 4.0 BY-SA

Tags: web crawler, operating system

Original link: http://www.cnblogs.com/he-qing-qing/p/11502543.html

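This article presents an example of a Python crawler that scrapes book content from a classical Chinese poetry website. Using the requests and lxml libraries, the crawler walks the site's book catalog, downloads every chapter of each book, and saves each book to a local file. It first fetches the book list page and parses out the name and link of every book; then, for each book, it fetches the chapter list and downloads the content of each chapter.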
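The first step described above, pulling each book's name and link out of the list page, can be tried offline with a small sketch. To keep it runnable without network access or third-party packages, it uses the standard library's xml.etree.ElementTree in place of lxml, so names and links are read with `.text` and `.get("href")` rather than `./text()` and `./@href`. The sample markup, the second book entry, and the helper name `extract_books` are assumptions made for illustration; the real page is larger and may not be well-formed XML, which is one reason the article's own code uses lxml.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, heavily simplified stand-in for the book list page.
SAMPLE_HTML = """
<html><body>
  <div class="bookmark-list">
    <ul>
      <li><a href="/book/sanguoyanyi.html">三国演义</a></li>
      <li><a href="/book/example-book.html">Example Book</a></li>
    </ul>
  </div>
</body></html>
"""

BASE_URL = "http://www.shicimingju.com"


def extract_books(html_text, base_url=BASE_URL):
    """Return (name, absolute_url) pairs for every link in the book list."""
    root = ET.fromstring(html_text.strip())
    books = []
    # Same idea as the XPath //div[@class="bookmark-list"]//a in the article.
    for a in root.findall(".//div[@class='bookmark-list']//a"):
        name = (a.text or "").strip()
        href = a.get("href")
        if name and href:
            # urljoin resolves the relative href against the site root.
            books.append((name, urljoin(base_url, href)))
    return books


if __name__ == "__main__":
    for name, link in extract_books(SAMPLE_HTML):
        print(name, link)
```

Using `urljoin` instead of string concatenation is a design choice of this sketch: it also copes with hrefs that are already absolute.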

A Small Web Crawler Example

import os

import requests
from lxml import etree

# Directory where the downloaded books are stored.
dirName = './books'
os.makedirs(dirName, exist_ok=True)

headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}

url = 'http://www.shicimingju.com/book'

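# Fetch the page that lists all books; the timeout keeps a stalled request from hanging forever.
page_text = requests.get(url, headers=headers, timeout=10).text

# print(page_text)  # uncomment to inspect the raw HTML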

# Each <a> under div.bookmark-list links to one book.
tree = etree.HTML(page_text)
a_list = tree.xpath('//div[@class="bookmark-list"]//a')
for a in a_list:
    # xpath() returns lists, e.g. ['三国演义'] and ['/book/sanguoyanyi.html'], so take element [0].
    bookname = a.xpath('./text()')[0]
    book_path = "http://www.shicimingju.com" + a.xpath('./@href')[0]
    book_page = requests.get(book_path, headers=headers, timeout=10).text
    # Each <a> under div.book-mulu links to one chapter of this book.
    book_tree = etree.HTML(book_page)
    chapter_a_list = book_tree.xpath('//div[@class="book-mulu"]//a')
    path = os.path.join(dirName, bookname)
    # Write every chapter of the book into a single UTF-8 text file.
    with open(path, 'w', encoding='utf-8') as f:
        for chapter_a in chapter_a_list:
            title = chapter_a.xpath('./text()')[0]
            detail_path = 'http://www.shicimingju.com' + chapter_a.xpath('./@href')[0]
            detail_page = requests.get(detail_path, headers=headers, timeout=10).text
            content = etree.HTML(detail_page).xpath('//div[@class="chapter_content"]//text()')
            content = ''.join(content)
            f.write(title + ':' + content + '\n')

            print(title, "downloaded successfully")
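The write step in the script uses the book name directly as a file name and joins the chapter's text nodes as they come. A minimal sketch of two helpers that could make that step more robust is shown below. The names `safe_filename`, `clean_chapter_text`, and `format_chapter`, and the choice of characters to replace, are assumptions for illustration and are not part of the original script.

```python
import re


def safe_filename(name, replacement="_"):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', replacement, name.strip())
    cleaned = cleaned.strip(replacement)
    # Fall back to a placeholder if nothing usable is left.
    return cleaned or "untitled"


def clean_chapter_text(fragments):
    """Join text fragments (as returned by //text()) and drop blank pieces."""
    lines = []
    for fragment in fragments:
        # Pages may pad paragraphs with non-breaking spaces; normalise and trim.
        text = fragment.replace("\xa0", " ").strip()
        if text:
            lines.append(text)
    return "\n".join(lines)


def format_chapter(title, fragments):
    """Build the block written to the book file for one chapter."""
    return f"{title}:\n{clean_chapter_text(fragments)}\n"


if __name__ == "__main__":
    print(safe_filename("a/b: c"))
    print(format_chapter("Chapter 1", ["\n  ", "\xa0\xa0first paragraph", "\n", "second paragraph  "]))
```

In the script, `safe_filename(bookname)` could be used when building `path`, and `format_chapter(title, content)` could replace the string concatenation passed to `f.write`.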

Reposted from: https://www.cnblogs.com/he-qing-qing/p/11502543.html