[Test] Downloading an e-book with Python and listening to it via NaturalReader

Disclaimer: all content in this post is for learning and reference only and must not be used for any commercial activity. If anything here infringes your rights, please get in touch so it can be handled promptly.

Approach:

1. First, search for the e-book and find a free source.

For example, for the e-book 《都市古仙医》, a free online source: 小说在线阅读-都市古仙医

2. Open the free online source.
3. View the page source: right-click on the page and choose Inspect Element.

Expand and compare the first three chapters: the URLs are identical except for the trailing number such as /89352331:

https://www.000zww.com/1106/1106863/89352331.html
https://www.000zww.com/1106/1106863/89352394.html
https://www.000zww.com/1106/1106863/89352422.html
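As a quick sanity check of that observation, the chapter URLs can be rebuilt from a common prefix plus the numeric ID (the IDs here are the three listed above):

```python
# The three chapter URLs observed above share everything except the final ID.
base = "https://www.000zww.com/1106/1106863"
chapter_ids = [89352331, 89352394, 89352422]

urls = [f"{base}/{cid}.html" for cid in chapter_ids]
for url in urls:
    print(url)
```

Since the IDs are not strictly sequential, they cannot simply be counted up; scraping the index page (next step) is the reliable way to get the full list.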

4. Try using Python to collect all the chapter URLs first.

import requests
from lxml import html

# Base URL of the chapter index page
base_url = "https://www.000zww.com/1106/1106863/index.html"
headers = {"User-Agent": "Mozilla/5.0"}

# Request the index page
response = requests.get(base_url, headers=headers)
response.raise_for_status()

# Parse the page content
tree = html.fromstring(response.content)

# Locate all chapter links with XPath
chapter_links = tree.xpath('//*[@id="list"]/dl/dt[2]/following-sibling::dd/a')

# Print each chapter URL
for link in chapter_links:
    chapter_url = link.get("href")
    # If the link is a relative path, expand it to an absolute URL
    if chapter_url.startswith("/"):
        chapter_url = "https://www.000zww.com" + chapter_url
    print(chapter_url)
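The manual prefixing above only covers hrefs that start with "/". The standard library's urllib.parse.urljoin handles relative hrefs more generally; a minimal sketch using the same index URL:

```python
from urllib.parse import urljoin

base_url = "https://www.000zww.com/1106/1106863/index.html"

# An absolute path is resolved against the site root:
print(urljoin(base_url, "/1106/1106863/89352331.html"))
# A bare relative href is resolved against the index page's directory:
print(urljoin(base_url, "89352394.html"))
```

Both calls yield full https://www.000zww.com/1106/1106863/… URLs, so the loop no longer needs to special-case the leading slash.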

5. Visit each URL, download the chapter, and save it as a .txt file.

First open one of the URLs to inspect the content structure.
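On this site the chapter body sits in a div with id "content" (this is the selector the download code below relies on). A self-contained sketch against a minimal mock of that structure, so the extraction can be tried without hitting the site; the sample markup is an assumption, not the site's actual HTML:

```python
from bs4 import BeautifulSoup

# Minimal mock of a chapter page: the text is assumed to sit
# inside <div id="content">, matching the scraping code below.
sample_html = """
<html><body>
  <div id="list">(chapter index lives here)</div>
  <div id="content">第一章 神秘来客<br/>夜色深沉,城市的霓虹……</div>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
content_div = soup.find("div", {"id": "content"})
# get_text() flattens nested tags like <br/> into plain text
text = content_div.get_text(separator=" ").strip()
print(text)
```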

Code to download the content and save it as .txt files:

import requests
from lxml import html
from bs4 import BeautifulSoup
import os


# Base URL
base_url = "https://www.000zww.com/1106/1106863/index.html"
headers = {"User-Agent": "Mozilla/5.0"}

# Request the main page
response = requests.get(base_url, headers=headers)
response.raise_for_status()

# Parse the page content
tree = html.fromstring(response.content)

# Locate all chapter links
chapter_links = tree.xpath('//*[@id="list"]/dl/dt[2]/following-sibling::dd/a')

# Create a directory to save files if it doesn't exist
output_directory = "chapters"
os.makedirs(output_directory, exist_ok=True)

# Loop through each chapter link and save to text files
for link in chapter_links:
    chapter_url = link.get("href")
    # Fall back to the link text if the <a> tag has no title attribute
    chapter_title = (link.get("title") or link.text_content()).strip().replace(" ", "_") + ".txt"

    # Complete the URL if needed
    if chapter_url.startswith("/"):
        chapter_url = "https://www.000zww.com" + chapter_url

    # Fetch the chapter content
    chapter_response = requests.get(chapter_url, headers=headers)
    chapter_response.raise_for_status()
    chapter_response.encoding = 'utf-8'
    chapter_soup = BeautifulSoup(chapter_response.text, 'html.parser')

    # Extract the main content
    content_div = chapter_soup.find("div", {"id": "content"})
    if content_div:
        # Remove newline characters from the text content
        text_content = content_div.get_text(separator=" ").replace("\n", "")

        # Save to a .txt file with the chapter title as the filename
        file_path = os.path.join(output_directory, chapter_title)
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_content)

        print(f"Saved content of {chapter_title} to {file_path}")

Download process:

Libraries the code depends on:

(1) requests: sends HTTP requests, used to fetch page content.

pip install requests -i https://mirrors.aliyun.com/pypi/simple/

(2) lxml: parses HTML and supports XPath selectors, making it easy to extract specific content.

pip install lxml -i https://mirrors.aliyun.com/pypi/simple/

(3) BeautifulSoup (bs4): handles HTML parsing and simplifies extracting specific tags.

pip install beautifulsoup4 -i https://mirrors.aliyun.com/pypi/simple/

(4) chardet: auto-detects a text file's character encoding, so files with different encodings can be read and processed correctly.

pip install chardet -i https://mirrors.aliyun.com/pypi/simple/

6. Merge all the .txt files into one complete .txt file.

[The result could also be converted to PDF; a single large .txt runs into length limits in some readers, so this is not a perfect solution.]

import os
import chardet

# Folder containing the per-chapter .txt files
input_directory = "chapters"
# Path of the merged output file
output_file_path = "《都市古仙医》.txt"

# Open the output file for writing with UTF-8 encoding
with open(output_file_path, "w", encoding="utf-8") as output_file:
    # Walk the .txt files in the folder in alphabetical order
    for filename in sorted(os.listdir(input_directory)):
        if filename.endswith(".txt"):  # Only process .txt files
            file_path = os.path.join(input_directory, filename)

            # Detect the file's encoding
            with open(file_path, 'rb') as file:
                raw_data = file.read()
                result = chardet.detect(raw_data)  # Use chardet to detect the encoding
                encoding = result['encoding'] if result['encoding'] else 'utf-8'

            # Report which file is currently being written
            print(f"Writing {filename} to the merged file.")

            # Write the chapter title to the output file
            chapter_title = filename.replace("_", " ").replace(".txt", "")
            output_file.write(f"Chapter: {chapter_title}\n")
            output_file.write("-" * 50 + "\n")

            # Read each file with the detected encoding
            with open(file_path, "r", encoding=encoding, errors='replace') as file:
                output_file.write(file.read())  # Write the chapter body
                output_file.write("\n\n")  # Blank line between chapters

print(f"All chapters merged into {output_file_path}")
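One caveat: sorted() orders filenames lexicographically, so a file named 第10章… sorts before 第2章…. If the chapter numbers appear in the filenames, a natural-sort key (a sketch using the standard re module) keeps them in reading order:

```python
import re

def natural_key(name):
    # Split the name into digit and non-digit runs,
    # so digit runs compare numerically instead of character by character
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

files = ["第10章.txt", "第2章.txt", "第1章.txt"]
print(sorted(files, key=natural_key))
```

Passing key=natural_key to the sorted() call in the merge loop would apply the same fix there.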

7. Use NaturalReader to play the book as speech.
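Many TTS tools cap how much text can be pasted or imported at once (whether and where NaturalReader's limit sits depends on the plan; the 5000-character figure below is a made-up placeholder). A sketch that splits the merged text into chunks at paragraph boundaries so each piece fits under such a cap:

```python
def split_text(text, max_chars=5000):
    """Split text into chunks of at most max_chars, breaking at blank lines
    where possible. max_chars=5000 is a placeholder, not a real product limit."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = (current + "\n\n" + para) if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

# Toy input: ten short paragraphs separated by blank lines
sample = "\n\n".join(f"段落{i}" for i in range(10))
parts = split_text(sample, max_chars=20)
print(len(parts), [len(p) for p in parts])
```

Each chunk can then be pasted or exported to the TTS tool one at a time.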
