Disclaimer: All content in this work is for learning and reference only and must not be used for any commercial activity. If anything here infringes your rights, please get in touch so it can be handled promptly.
Approach:
1. First, search for the e-book and find a free source.
For example, for the e-book 《都市古仙医》, a free online source is: 小说在线阅读-都市古仙医
2. Open the (free) online resource.
3. View the page source: right-click on the page and choose Inspect Element.
Open the first three chapters and compare their URLs. They are identical except for the trailing number (e.g. /89352331):
https://www.000zww.com/1106/1106863/89352331.html
https://www.000zww.com/1106/1106863/89352394.html
https://www.000zww.com/1106/1106863/89352422.html
4. First, try fetching every chapter URL with Python.
import requests
from lxml import html

# Base URL of the book's table of contents
base_url = "https://www.000zww.com/1106/1106863/index.html"
headers = {"User-Agent": "Mozilla/5.0"}

# Request the index page
response = requests.get(base_url, headers=headers)
response.raise_for_status()

# Parse the page content
tree = html.fromstring(response.content)

# Use XPath to locate all chapter links
chapter_links = tree.xpath('//*[@id="list"]/dl/dt[2]/following-sibling::dd/a')

# Print the absolute URL of each chapter
for link in chapter_links:
    chapter_url = link.get("href")
    # If the link is a relative path, make it absolute
    if chapter_url.startswith("/"):
        chapter_url = "https://www.000zww.com" + chapter_url
    print(chapter_url)
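If the XPath matches, running this prints one absolute chapter URL per line, matching the pattern observed in step 3 (e.g. https://www.000zww.com/1106/1106863/89352331.html).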
5. Visit each chapter URL, download the content, and save it as txt files.
First open a single chapter URL to look at how its content is structured.
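For example, here is a minimal sketch that fetches a single chapter and previews its text; it assumes the chapter body sits in a <div id="content">, the same element the full script below relies on:

import requests
from bs4 import BeautifulSoup

# First chapter URL observed in step 3
url = "https://www.000zww.com/1106/1106863/89352331.html"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')

# Preview the first 200 characters of the chapter body
content_div = soup.find("div", {"id": "content"})
if content_div:
    print(content_div.get_text(separator=" ")[:200])
else:
    print("No div with id 'content' found; re-inspect the page structure.")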
Code to download every chapter and save each one as a txt file:
import requests
from lxml import html
from bs4 import BeautifulSoup
import os

# Base URL of the book's table of contents
base_url = "https://www.000zww.com/1106/1106863/index.html"
headers = {"User-Agent": "Mozilla/5.0"}

# Request the main page
response = requests.get(base_url, headers=headers)
response.raise_for_status()

# Parse the page content
tree = html.fromstring(response.content)

# Locate all chapter links
chapter_links = tree.xpath('//*[@id="list"]/dl/dt[2]/following-sibling::dd/a')

# Create a directory to save files if it doesn't exist
output_directory = "chapters"
os.makedirs(output_directory, exist_ok=True)

# Loop through each chapter link and save each chapter to a text file
for link in chapter_links:
    chapter_url = link.get("href")
    # Fall back to the link text if the <a> tag has no title attribute
    chapter_title = (link.get("title") or link.text_content().strip()).replace(" ", "_") + ".txt"
    # Complete the URL if needed
    if chapter_url.startswith("/"):
        chapter_url = "https://www.000zww.com" + chapter_url
    # Fetch the chapter content
    chapter_response = requests.get(chapter_url, headers=headers)
    chapter_response.encoding = 'utf-8'
    chapter_soup = BeautifulSoup(chapter_response.text, 'html.parser')
    # Extract the main content
    content_div = chapter_soup.find("div", {"id": "content"})
    if content_div:
        # Remove newline characters from the text content
        text_content = content_div.get_text(separator=" ").replace("\n", "")
        # Save to a .txt file with the chapter title as the filename
        file_path = os.path.join(output_directory, chapter_title)
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_content)
        print(f"Saved content of {chapter_title} to {file_path}")
Download process:
Libraries the code depends on:
(1) requests: sends HTTP requests; used to fetch page content.
pip install requests -i https://mirrors.aliyun.com/pypi/simple/
(2) lxml: parses HTML and supports XPath selectors, making it easy to extract specific content.
pip install lxml -i https://mirrors.aliyun.com/pypi/simple/
(3) BeautifulSoup (bs4): handles HTML parsing and simplifies extracting the contents of specific tags.
pip install beautifulsoup4 -i https://mirrors.aliyun.com/pypi/simple/
(4) chardet: automatically detects the encoding of text files, which helps determine each file's character encoding so the text can be read and processed correctly.
pip install chardet -i https://mirrors.aliyun.com/pypi/simple/
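If you prefer, all four dependencies can be installed with a single command:
pip install requests lxml beautifulsoup4 chardet -i https://mirrors.aliyun.com/pypi/simple/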
6. Merge all the txt files into one complete txt file.
[You could also convert the merged result to PDF, since txt files hit length limits in some readers, so this is not a perfect solution; a conversion sketch follows this note.]
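For reference, here is a minimal conversion sketch for after the merge script below has produced 《都市古仙医》.txt. It assumes the fpdf2 package (pip install fpdf2) and a locally available CJK-capable TrueType font such as NotoSansSC-Regular.ttf; both the package and the font path are assumptions, not part of the original workflow:

from fpdf import FPDF

pdf = FPDF()
pdf.add_page()
# Assumption: a CJK-capable TrueType font file exists at this path
pdf.add_font("cjk", fname="NotoSansSC-Regular.ttf")
pdf.set_font("cjk", size=11)

# Write the merged novel line by line, letting fpdf2 wrap and paginate
with open("《都市古仙医》.txt", encoding="utf-8") as f:
    for line in f:
        pdf.multi_cell(0, 6, line.rstrip("\n"))

pdf.output("《都市古仙医》.pdf")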
import os
import chardet

# Folder containing the individual chapter .txt files
input_directory = "chapters"
# Path of the merged output file
output_file_path = "《都市古仙医》.txt"

# Open the output file for writing with UTF-8 encoding
with open(output_file_path, "w", encoding="utf-8") as output_file:
    # Iterate over the .txt files in the folder in sorted order
    for filename in sorted(os.listdir(input_directory)):
        if filename.endswith(".txt"):  # only process .txt files
            file_path = os.path.join(input_directory, filename)
            # Detect the file's encoding
            with open(file_path, 'rb') as file:
                raw_data = file.read()
                result = chardet.detect(raw_data)  # detect the encoding with chardet
            encoding = result['encoding'] if result['encoding'] else 'utf-8'
            # Report which file is currently being written
            print(f"Writing {filename} to the merged file.")
            # Write the chapter title to the output file
            chapter_title = filename.replace("_", " ").replace(".txt", "")
            output_file.write(f"Chapter: {chapter_title}\n")
            output_file.write("-" * 50 + "\n")
            # Read each file with the detected encoding
            with open(file_path, "r", encoding=encoding, errors='replace') as file:
                output_file.write(file.read())  # write the chapter content
            output_file.write("\n\n")  # blank line between chapters

print(f"All chapters merged into {output_file_path}")
7. Use NaturalReader to play the merged book back as audio.