Disclaimer: All content in this work is for learning and reference only and must not be used for any commercial activity. If anything here infringes your rights, please get in touch so it can be handled promptly.
Approach:
1. First, search for the e-book and find a free source.
For example, for the e-book 《都市古仙医》, a free online source is: 小说在线阅读-都市古仙医
2. Open the (free) online resource.
3. View the page source: right-click on the page and choose Inspect Element.
Open the first three chapters and compare their URLs. They are identical except for the trailing number (e.g. /89352331):
https://www.000zww.com/1106/1106863/89352331.html
https://www.000zww.com/1106/1106863/89352394.html
https://www.000zww.com/1106/1106863/89352422.html
4. First, try fetching every chapter URL with Python.
import requests
from lxml import html

# Base URL of the book's table of contents
base_url = "https://www.000zww.com/1106/1106863/index.html"
headers = {"User-Agent": "Mozilla/5.0"}

# Request the index page
response = requests.get(base_url, headers=headers)
response.raise_for_status()

# Parse the page content
tree = html.fromstring(response.content)

# Use XPath to locate all chapter links
chapter_links = tree.xpath('//*[@id="list"]/dl/dt[2]/following-sibling::dd/a')

# Print the absolute URL of each chapter
for link in chapter_links:
    chapter_url = link.get("href")
    # If the link is a relative path, make it absolute
    if chapter_url.startswith("/"):
        chapter_url = "https://www.000zww.com" + chapter_url
    print(chapter_url)
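If the XPath matches, running this prints one absolute chapter URL per line, matching the pattern observed in step 3 (e.g. https://www.000zww.com/1106/1106863/89352331.html).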
5. Visit each chapter URL, download the content, and save it as txt files.
First open a single chapter URL to look at how its content is structured.
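For example, here is a minimal sketch that fetches a single chapter and previews its text; it assumes the chapter body sits in a <div id="content">, the same element the full script below relies on:

import requests
from bs4 import BeautifulSoup

# First chapter URL observed in step 3
url = "https://www.000zww.com/1106/1106863/89352331.html"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')

# Preview the first 200 characters of the chapter body
content_div = soup.find("div", {"id": "content"})
if content_div:
    print(content_div.get_text(separator=" ")[:200])
else:
    print("No div with id 'content' found; re-inspect the page structure.")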
Code to download every chapter and save each one as a txt file:
import requests
from lxml import html
from bs4 import BeautifulSoup
import os

# Base URL of the book's table of contents
base_url = "https://www.000zww.com/1106/1106863/index.html"
headers = {"User-Agent": "Mozilla/5.0"}

# Request the main page
response = requests.get(base_url, headers=headers)
response.raise_for_status()

# Parse the page content
tree = html.fromstring(response.content)

# Locate all chapter links
chapter_links = tree.xpath('//*[@id="list"]/dl/dt[2]/following-sibling::dd/a')

# Create a directory to save files if it doesn't exist
output_directory = "chapters"
os.makedirs(output_directory, exist_ok=True)

# Loop through each chapter link and save each chapter to a text file
for link in chapter_links:
    chapter_url = link.get("href")
    # Fall back to the link text if the <a> tag has no title attribute
    chapter_title = (link.get("title") or link.text_content().strip()).replace(" ", "_") + ".txt"
    # Complete the URL if needed
    if chapter_url.startswith("/"):
        chapter_url = "https://www.000zww.com" + chapter_url
    # Fetch the chapter content
    chapter_response = requests.get(chapter_url, headers=headers)
    chapter_response.encoding = 'utf-8'
    chapter_soup = BeautifulSoup(chapter_response.text, 'html.parser')
    # Extract the main content
    content_div = chapter_soup.find("div", {"id": "content"})
    if content_div:
        # Remove newline characters from the text content
        text_content = content_div.get_text(separator=" ").replace("\n", "")
        # Save to a .txt file with the chapter title as the filename
        file_path = os.path.join(output_directory, chapter_title)
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_content)
        print(f"Saved content of {chapter_title} to {file_path}")
Download process:
Libraries the code depends on:
(1) requests: sends HTTP requests; used to fetch page content.
pip install requests -i https://mirrors.aliyun.com/pypi/simple/
(2) lxml: parses HTML and supports XPath selectors, making it easy to extract specific content.
pip install lxml -i https://mirrors.aliyun.com/pypi/simple/
(3) BeautifulSoup (bs4): handles HTML parsing and simplifies extracting the contents of specific tags.
pip install beautifulsoup4 -i https://mirrors.aliyun.com/pypi/simple/
(4) chardet: automatically detects the encoding of text files, which helps determine each file's character encoding so the text can be read and processed correctly.
pip install chardet -i https://mirrors.aliyun.com/pypi/simple/
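If you prefer, all four dependencies can be installed with a single command:
pip install requests lxml beautifulsoup4 chardet -i https://mirrors.aliyun.com/pypi/simple/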
6. Merge all the txt files into one complete txt file.
[You could also convert the merged result to PDF, since txt files hit length limits in some readers, so this is not a perfect solution; a conversion sketch follows this note.]
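For reference, here is a minimal conversion sketch for after the merge script below has produced 《都市古仙医》.txt. It assumes the fpdf2 package (pip install fpdf2) and a locally available CJK-capable TrueType font such as NotoSansSC-Regular.ttf; both the package and the font path are assumptions, not part of the original workflow:

from fpdf import FPDF

pdf = FPDF()
pdf.add_page()
# Assumption: a CJK-capable TrueType font file exists at this path
pdf.add_font("cjk", fname="NotoSansSC-Regular.ttf")
pdf.set_font("cjk", size=11)

# Write the merged novel line by line, letting fpdf2 wrap and paginate
with open("《都市古仙医》.txt", encoding="utf-8") as f:
    for line in f:
        pdf.multi_cell(0, 6, line.rstrip("\n"))

pdf.output("《都市古仙医》.pdf")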
import os
import chardet

# Folder containing the individual chapter .txt files
input_directory = "chapters"
# Path of the merged output file
output_file_path = "《都市古仙医》.txt"

# Open the output file for writing with UTF-8 encoding
with open(output_file_path, "w", encoding="utf-8") as output_file:
    # Iterate over the .txt files in the folder in sorted order
    for filename in sorted(os.listdir(input_directory)):
        if filename.endswith(".txt"):  # only process .txt files
            file_path = os.path.join(input_directory, filename)
            # Detect the file's encoding
            with open(file_path, 'rb') as file:
                raw_data = file.read()
                result = chardet.detect(raw_data)  # detect the encoding with chardet
            encoding = result['encoding'] if result['encoding'] else 'utf-8'
            # Report which file is currently being written
            print(f"Writing {filename} to the merged file.")
            # Write the chapter title to the output file
            chapter_title = filename.replace("_", " ").replace(".txt", "")
            output_file.write(f"Chapter: {chapter_title}\n")
            output_file.write("-" * 50 + "\n")
            # Read each file with the detected encoding
            with open(file_path, "r", encoding=encoding, errors='replace') as file:
                output_file.write(file.read())  # write the chapter content
            output_file.write("\n\n")  # blank line between chapters

print(f"All chapters merged into {output_file_path}")
7. Use NaturalReader to play the merged book back as audio.