Parsing EPUB files with epublib (chapter content and book menu)

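A while ago I needed to parse EPUB books on Android and came across this open-source EPUB parsing library. There is very little material about it! After fiddling with it for a while, I found that simply using it is actually quite easy. Rather cute, really~ Below is a brief introduction to epublib. PS: this is my first blog post on CSDN, so please forgive the rough layout~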

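epublib can not only parse EPUB books but also generate them. Since I only use it for reading, this post covers parsing only. Of course, to understand EPUB parsing you first need some familiarity with the EPUB format specification.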

Enough talk, let's get to it!

(I) Related resources:

1. Project page:

https://github.com/psiegman/epublib

2. Official API documentation:

http://www.siegmann.nl/static/epublib/apidocs/

3. Official website:

http://www.siegmann.nl/epublib

4. Dependencies needed on Android

slf4j-android: http://www.slf4j.org/android/

epublib-core-latest.jar: https://github.com/downloads/psiegman/epublib/epublib-core-latest.jar

5. An open-source Android project that uses epublib

PageTurner (I may write a separate post introducing it later)

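(II) Important classes:

1. Book

Represents a book. All of the book's content lives here. Through a Book object you can get at the various parts of the book, such as its Resources and Metadata.


2. Resource

All of the Resource objects together make up a book. A Resource is therefore one piece of the book's resources, which may be HTML, CSS, JS, an image, and so on.

3. Metadata

The book's header information, such as author, publisher, and language.

4. Spine

The reading order of the book, which is linear. The Spine tells you in what order the chapters should be read (note: a "chapter" here is really a Resource, not only the book's text content; the same applies below), and through the Spine you can get the content of each chapter.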



5. TableOfContents

This differs from the Spine. It also gives access to the content of each chapter, but it is a tree structure.
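To make the difference concrete, here is a minimal sketch in plain Java that does not use epublib at all. `Entry` and `render` are hypothetical names made up for this illustration. It only shows how a tree-shaped menu can be printed as indented lines, whereas a spine would already be a flat list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SpineVsTocSketch {

    // Hypothetical tree node standing in for one table-of-contents entry.
    static class Entry {
        final String title;
        final List<Entry> children;

        Entry(String title, Entry... children) {
            this.title = title;
            this.children = Arrays.asList(children);
        }
    }

    // Walk the tree depth-first and emit one line per entry,
    // indented by two spaces per nesting level.
    static List<String> render(List<Entry> entries, int level, List<String> out) {
        for (Entry entry : entries) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < level; i++) {
                line.append("  ");
            }
            out.add(line.append(entry.title).toString());
            render(entry.children, level + 1, out);
        }
        return out;
    }
}
```

For a tree in which "Part One" holds "Chapter 1" and "Chapter 2", `render(toc, 0, new ArrayList<String>())` yields the lines "Part One", "  Chapter 1", "  Chapter 2".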





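6. Resources

Holds all of the Resource objects. Through the Resources object you can easily look up the Resource you want.

A Resource can be located by id (getById) or by href (getByHref), and the Resources of a given MediaType can also be retrieved.

7. MediaType

Describes the type of a Resource, i.e. what kind of Resource it is (CSS, JS, image, HTML, video, and so on).

8. MediatypeService

This class contains the various MediaType constants, such as MediatypeService.CSS, MediatypeService.JPG, and MediatypeService.PNG.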


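(III) Basic usage

1. Reading a book

(1) The simplest way is to read the book directly from a stream

try {
    EpubReader reader = new EpubReader();
    // "activity" is the current Android Activity; 1.epub sits in the assets folder
    InputStream in = activity.getAssets().open("1.epub");
    // keep the returned Book; everything else is read from it
    Book book = reader.readEpub(in);
    in.close();
} catch (Exception e) {
    e.printStackTrace();
}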

(2) Loading too slowly? Hold off on reading images, video, and the like.

try {
    EpubReader epubReader = new EpubReader();

    // resources of these types are deferred instead of being read up front
    MediaType[] lazyTypes = {
            MediatypeService.CSS,
            MediatypeService.GIF, MediatypeService.JPG,
            MediatypeService.PNG,
            MediatypeService.MP3,
            MediatypeService.MP4};
    String fileName = "1.epub";
    Book book = epubReader.readEpubLazy(fileName, "UTF-8", Arrays.asList(lazyTypes));

} catch (Exception e) {
    e.printStackTrace();
}

2. Getting chapter content

(1) Get the content of all chapters.

List<Resource> contents = book.getContents();

(2) Get the content of a single chapter.

// by index
int index = 0;
Resource byIndex = book.getSpine().getResource(index);

// by href
String href = "/images/1.png";
Resource byHref = book.getResources().getByHref(href);

// by id
String id = "chapter01";
Resource byId = book.getResources().getById(id);

// special resources that can be fetched directly
book.getCoverImage();
book.getCoverPage();
book.getNcxResource();
book.getOpfResource();

// others
book.getSpine().getSpineReferences().get(0).getResource();
book.getGuide().getReferences().get(0).getResource();
                    

3. Getting the book menu

(1) Two different menus

// The spine gives a linear reading menu, sorted in reading order
List<SpineReference> spineReferences = book.getSpine().getSpineReferences();
// get the corresponding resource
spineReferences.get(0).getResource();

// TableOfContents gives a tree-shaped menu, arranged by the (tree) relationship between chapters
TableOfContents tableOfContents = book.getTableOfContents();
// get the corresponding resource
tableOfContents.getTocReferences().get(0).getResource();

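For TableOfContents, here is a reference way to traverse it.

First, the JavaBean for the menu:

import java.util.ArrayList;
import java.util.List;

/**
 * A menu item: holds the title, the href, and the child nodes
 * @author Administrator
 *
 */
public class ContentItem{

    private String title;
    private String url; // href
    private int size; // size of the resource

    // child nodes
    private List<ContentItem> children;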

    public ContentItem() {
        super();
    }

    public ContentItem(String title, String url,int size) {
        super();
        this.title = title;
        this.url = url;
        this.size = size;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public List<ContentItem> getChildren() {
        if(this.children==null){
            children = new ArrayList<ContentItem>();
        }
        return children;
    }

    public void setChildren(List<ContentItem> children) {
        this.children = children;
    }

    public int getSize() {
        return size;
    }

    public void setSize(int size) {
        this.size = size;
    }

}

The url in ContentItem above is the href. When you later need to load a particular chapter's content, the href is all you need to get the corresponding resource.
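One thing to watch: an href taken from a menu entry may end in a fragment such as "#sec2" that points at an anchor inside the chapter file. Here is a minimal sketch, assuming you want the file path for the resource lookup and the fragment for scrolling to the anchor afterwards. `HrefParts` is a hypothetical helper written for this post, not part of epublib.

```java
// Hypothetical helper, not part of epublib: splits an href such as
// "chapter1.html#sec2" into the file path and the fragment.
class HrefParts {
    final String path;      // part before '#'
    final String fragment;  // part after '#', or null when there is none

    private HrefParts(String path, String fragment) {
        this.path = path;
        this.fragment = fragment;
    }

    static HrefParts parse(String href) {
        int hash = href.indexOf('#');
        if (hash < 0) {
            // no fragment: the whole string is the path
            return new HrefParts(href, null);
        }
        return new HrefParts(href.substring(0, hash), href.substring(hash + 1));
    }
}
```

With this, `HrefParts.parse(item.getUrl()).path` would be the value to hand to the href lookup, and `fragment` tells the reader view where to scroll.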

Then traverse the TableOfContents to build the tree menu we want.

import java.util.List;

import nl.siegmann.epublib.domain.Book;
import nl.siegmann.epublib.domain.TOCReference;

public class EpubMenuParser {

    private ContentItem menu = new ContentItem();

    public ContentItem startParse(Book book){
        // start traversing at depth 0
        parseMenu(book.getTableOfContents().getTocReferences(), 0);
        return menu;

    }

    /**
     * Traverse the table of contents of the EPUB book
     * @param refs entries at the current level
     * @param level menu depth
     */
    private void parseMenu(List<TOCReference> refs, int level) {

        if (refs == null || refs.isEmpty()) {
            return;
        }

        for (TOCReference ref : refs) {

            if (ref.getResource() != null) {
                if (level == 0) {
                    // first layer: a level-1 node whose parent is the root
                    ContentItem item = new ContentItem(ref.getTitle(),
                             ref.getCompleteHref(),(int)ref.getResource().getSize());
                    menu.getChildren().add(item);

                } else if (level == 1) {

                    int lastIndexOf_depth1 = menu.getChildren().size() - 1;// position of the current last level-1 node

                    // store it under the last level-1 node among the root's children
                    ContentItem item2 = new ContentItem(ref.getTitle(),
                            ref.getCompleteHref(),(int)ref.getResource().getSize());

                    menu.getChildren().get(lastIndexOf_depth1).getChildren()
                            .add(item2);

                } else if (level == 2) {

                    int lastIndexOf_depth1 = menu.getChildren().size() - 1;// position of the current last level-1 node
                    int lastIndexOf_depth2 = menu.getChildren()
                            .get(lastIndexOf_depth1).getChildren().size() - 1;// position of the current last level-2 node

                    // the parent is the last of the level-2 nodes
                    ContentItem item = new ContentItem(ref.getTitle(),
                        ref.getCompleteHref(),(int)ref.getResource().getSize());

                    menu.getChildren().get(lastIndexOf_depth1).getChildren()
                            .get(lastIndexOf_depth2).getChildren().add(item);
                }
            }
            // continue with its children
            parseMenu(ref.getChildren(), level + 1);
        }
    }

}

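Of course, this traversal still has some problems. For example, as written it only supports menus up to three levels deep, and deeper entries are silently dropped. That is enough for most books, though. Hard-coding the depth is not great, so adapt it to your needs~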
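If you do need deeper menus, one option is to let the recursion build the tree directly instead of switching on the level. The following is only a sketch in plain Java under stated assumptions: `TocNode` is a hypothetical stand-in for TOCReference (title, href, children), `MenuNode` is a hypothetical stand-in for ContentItem, and an entry without a resource is modelled as a null href. It is not the code I used in the app.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for epublib's TOCReference: a title, an href
// (null when the entry has no resource), and child entries.
class TocNode {
    final String title;
    final String href;
    final List<TocNode> children = new ArrayList<>();

    TocNode(String title, String href) {
        this.title = title;
        this.href = href;
    }

    TocNode add(TocNode child) {
        children.add(child);
        return this;
    }
}

// Hypothetical stand-in for the ContentItem bean.
class MenuNode {
    final String title;
    final String href;
    final List<MenuNode> children = new ArrayList<>();

    MenuNode(String title, String href) {
        this.title = title;
        this.href = href;
    }
}

class RecursiveMenuSketch {

    // Convert one level of the table of contents, recursing into children,
    // so the depth of the result is limited only by the depth of the input.
    static List<MenuNode> convert(List<TocNode> refs) {
        List<MenuNode> result = new ArrayList<>();
        if (refs == null) {
            return result;
        }
        for (TocNode ref : refs) {
            List<MenuNode> convertedChildren = convert(ref.children);
            if (ref.href == null) {
                // No resource behind this entry: skip it, but keep its
                // children by moving them up to the current level.
                result.addAll(convertedChildren);
            } else {
                MenuNode item = new MenuNode(ref.title, ref.href);
                item.children.addAll(convertedChildren);
                result.add(item);
            }
        }
        return result;
    }

    // Depth of the deepest branch; an empty list has depth 0.
    static int depth(List<MenuNode> items) {
        int max = 0;
        for (MenuNode item : items) {
            max = Math.max(max, 1 + depth(item.children));
        }
        return max;
    }
}
```

Each call returns the finished list for its own level, so there is no "last node at depth N" bookkeeping. Moving the children of a resource-less entry up one level is a design choice of this sketch. Keeping such entries as title-only nodes would work just as well.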

There are many other uses, which I will look into later.



