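Python MHT to HTML: using the email library to convert MHT files saved by QQ Browser on Android

This may be the first code published anywhere for converting the MHT files that QQ Browser on Android saves for web pages into HTML files.

Data safety notice: I am a student and my skills are limited. You are responsible for your own data. Until you have confirmed that the converted files are not corrupted, do not delete the source files from the recycle bin!!!

Code first

-------------divider----------------

The following was updated on October 20, 2024:

1. Added threading: the writes for the parts split out of each file are handed to background threads, which speeds things up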

    @staticmethod
    def write_down_file_content(file_path,file_content):
        # The content may be raw bytes or already-decoded text, so check the type before writing
        if isinstance(file_content,ByteString):
            with open(file_path, 'wb') as f:
                f.write(file_content)
        else:
            with open(file_path, 'w', encoding="utf-8") as f:
                f.write(file_content)

    @staticmethod
    def write_down_file_content_threaded(file_path, file_content):
        thread = threading.Thread(target=Tool.write_down_file_content, args=(file_path, file_content))
        thread.start()
        return thread

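Note, however: do not force-quit the program before it has finished. With threading, the converted html file may be written first, and the images and other resources from the mht are only written to the locations it depends on afterwards. If you force-quit, the html file may reference files that are not there yet.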
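One way to narrow that window is to keep the thread objects that write_down_file_content_threaded returns and join them before moving on. A minimal sketch, assuming a hypothetical helper write_all_and_wait that is not part of the original code:

```python
import threading


def write_file(file_path, file_content):
    # Bytes are written as-is; text is written as UTF-8, as in Tool.write_down_file_content
    if isinstance(file_content, (bytes, bytearray)):
        with open(file_path, "wb") as f:
            f.write(file_content)
    else:
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(file_content)


def write_all_and_wait(items):
    """Write every (file_path, file_content) pair on its own thread, then wait for all of them."""
    threads = [threading.Thread(target=write_file, args=item) for item in items]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # returns only once this file has been fully written
```

Calling something like this before send2trash would mean the source mht is only moved to the recycle bin once every output file exists.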

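Also, when processing a large number of files, I recommend running mht.convert() in a process pool; otherwise you may run into the following problem, which makes things even slower:

After running highly concurrent I/O code with Python's thread module, Windows Task Manager shows processes as suspended and everything runs slowly (CSDN blog)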
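A process pool along those lines could be sketched with the standard library's concurrent.futures. The convert_all helper and its parameters are my own illustration, not part of the original code; convert_one stands for a module-level function that builds an Mht2Html for one path and calls convert():

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed


def convert_all(paths, convert_one, max_workers=None, executor_cls=ProcessPoolExecutor):
    """Run convert_one(path) for every path in a pool; return {path: result or exception}."""
    results = {}
    with executor_cls(max_workers=max_workers) as pool:
        # Map each future back to the path it was submitted for
        futures = {pool.submit(convert_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One damaged file should not stop the rest of the batch
                results[path] = exc
    return results
```

With the default ProcessPoolExecutor, convert_one has to be picklable (a top-level function, not a lambda), and on Windows the call belongs under an `if __name__ == "__main__":` guard. Passing executor_cls=ThreadPoolExecutor runs the same loop on threads instead, which is handy for a quick comparison.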

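2. Made some small optimizations

I may add a GUI later

To stress this again: I am a student and my skills are limited. You are responsible for your own data. Until you have confirmed that the converted files are not corrupted, do not delete the source files from the recycle bin!!!

Main class Mht2Html: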

import base64
import email
import hashlib
import os
import quopri
import re
from datetime import datetime
from email.policy import default
# collections.abc.ByteString is deprecated since Python 3.12; a plain tuple works with isinstance()
ByteString = (bytes, bytearray)
from send2trash import send2trash

from Tool import Tool


def format_filename(filepath):
    dir, filename = os.path.split(filepath)
    invalid_chars_regex = r'[<>:"/\\|?*]'
    filename = re.sub(invalid_chars_regex, '', filename)
    return os.path.join(dir, filename)


def format_str(str_object):
    illegal_chars = ['/', '*', ':', '?', '"', '<', '>', '|','\\']
    # Replace every illegal character with an underscore '_'
    for char in illegal_chars:
        str_object = str_object.replace(char, "_")
    return str_object


class Mht2Html:
    path = ""
    fileName = ""
    url = ""
    title = ""
    savedTime = ""
    boundary = ""
    message = ""
    test = ""

    def extract_info(self, text, key):
        start = text.find(key) + len(key)
        end = text.find("\n", start)
        value = text[start:end].strip()
        return value

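    def decode_mime_encoded_text(self, encoded_text):
        # Fall back to the raw value so that a plain, non-encoded Subject is not lost
        decoded_text = encoded_text
        try:
            # Check whether this is a MIME encoded-word: =?charset?encoding?text?=
            if encoded_text.startswith('=?') and encoded_text.endswith('?='):
                # Split into charset, encoding method and encoded text
                parts = encoded_text[2:-2].split('?')
                charset = parts[0]
                encoding = parts[1].upper()
                text = parts[2]

                # Decode the text according to the encoding method
                if encoding == 'Q':
                    # Quoted-Printable; header=True also turns '_' back into a space
                    decoded_text = quopri.decodestring(text, header=True).decode(charset)
                elif encoding == 'B':
                    # Base64
                    decoded_text = base64.b64decode(text).decode(charset)
        except Exception:
            # If decoding fails, use the mht file name (without extension) as the title
            decoded_text = os.path.splitext(self.fileName)[0]
        return decoded_text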

    def __init__(self, path, test=False):
        self.path = path
        self.test = test
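        if self.path == "":
            # __init__ must not return a value, so signal the bad argument with an exception
            raise ValueError("path must not be empty")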
        self.fileName = os.path.basename(self.path)
        self.main_encoding = Tool.get_file_encoding_form_file_path(self.path)
        with open(self.path, 'r', encoding=self.main_encoding) as file:
            file_content = file.read()
            self.message = email.message_from_string(file_content, policy=default)

        self.url = self.extract_info(file_content, "Snapshot-Content-Location: ")
        self.title = format_str(self.decode_mime_encoded_text(self.extract_info(file_content, "Subject: ")))
        self.savedTime = Tool.rfc2822_to_timestamp(self.extract_info(file_content, "Date: "))  # RFC 2822
        self.boundary = self.extract_info(file_content, "boundary=")

    def addMainDistribute(self, file_content):
        file_content = "<!--" + ("url: " + self.url + "\n") + (
                "savedTime: " + str(self.savedTime)) + "-->\n" + file_content
        return file_content

    def convert(self):
        if self.test:
            print("url: " + self.url)
            print("title: " + self.title)
            print("savedTime: " + str(self.savedTime))
            print("boundary: " + self.boundary)
            # print("message: " + str(self.message))
        main_filePath = ''
        main_file_content = ''
        md5_value = ''
        num = 0
        all_message = list(self.message.walk())
        all_message_size = len(all_message) - 1
        for part in all_message:
            # A multipart part is only a container whose children are visited later by walk(), so skip it
            if part.get_content_maintype() == 'multipart':
                continue
            num = num + 1
            content_type = part.get_content_type()
            file_content = part.get_payload(decode=True)
            extension = part.get_content_subtype()

            if extension == "*":
                extension = Tool.getFileExtension_from_binary_data(file_content)
                if not extension:
                    continue

            if num > 1:
                md5_value = hashlib.md5(file_content).hexdigest()
            if content_type in ['text/plain', 'text/html', 'text/css', 'application/javascript']:
                encoding, file_content = Tool.get_file_encoding_and_content_form_file_content(file_content)
            else:
                encoding = Tool.get_file_encoding_and_content_form_file_content(file_content)[0]
            if num == 1:
                file_path = os.path.join(os.path.dirname(self.path), datetime.fromtimestamp(self.savedTime).strftime(
                    '%Y%m%d_%H%M%S_') + self.title + '.' + extension)
                main_filePath = format_filename(file_path)
                main_file_content = file_content
                continue

            elif content_type in ['text/plain', 'text/html', 'text/css', 'application/javascript']:
                file_path = os.path.join(os.path.dirname(self.path), 'text',
                                         md5_value + '.' + extension)
            else:
                file_path = os.path.join(os.path.dirname(self.path), 'imgs',
                                         md5_value + '.' + extension)
            if file_path != '' and part.get('Content-Location') != None:
                main_file_content = Tool.replace_link(main_file_content, str(part.get('Content-Location')),
                                                      os.path.relpath(file_path,
                                                                          os.path.dirname(main_filePath)).replace('\\',
                                                                                                                  '/'))
            if os.path.exists(file_path):
                continue
            directory = os.path.dirname(file_path)
            if not os.path.exists(directory):
                os.makedirs(directory)

            Tool.write_down_file_content_threaded(file_path,file_content)


        # Guard against damaged input: in rare cases the saved file is incomplete and no main html file can be identified
        if main_filePath == '':
            return False

        if isinstance(main_file_content,ByteString):
            # Special case: the file holds a single non-html part (e.g. an image), so it is written as binary
            Tool.write_down_file_content_threaded(main_filePath ,main_file_content)
        else:
            Tool.write_down_file_content_threaded(main_filePath ,self.addMainDistribute(main_file_content))



        send2trash(self.path)
        return main_filePath

Utility class Tool:

import os
import threading
from datetime import datetime, timezone

import magic
import mimetypes
import chardet
# collections.abc.ByteString is deprecated since Python 3.12; a plain tuple works with isinstance()
ByteString = (bytes, bytearray)

mime = magic.Magic(mime=True)

# Extra MIME type to extension mappings
custom_types = {
    'image/svg': '.svg',
    'text/x-c': '.c',
    'text/x-h': '.h',
    'text/x-Algol68': '.a68',
    'application/vnd.ms-opentype': '.otf',
    'application/vnd.example+json': '.exjson',

}

# Register each MIME type to extension mapping
for mime_type, extension in custom_types.items():
    mimetypes.add_type(mime_type, extension, strict=True)

class Tool:
    @staticmethod
    def rfc2822_to_timestamp(rfc2822):
        # Convert an RFC 2822 date string to a Unix timestamp
        date_format = "%a, %d %b %Y %H:%M:%S %z"
        # %z makes the result timezone-aware, so the offset in the header is honoured
        date_time = datetime.strptime(rfc2822, date_format)
        timestamp = date_time.timestamp()
        return timestamp
    @staticmethod
    def write_down_file_content(file_path,file_content):
        # The content may be raw bytes or already-decoded text, so check the type before writing
        if isinstance(file_content,ByteString):
            with open(file_path, 'wb') as f:
                f.write(file_content)
        else:
            with open(file_path, 'w', encoding="utf-8") as f:
                f.write(file_content)

    @staticmethod
    def write_down_file_content_threaded(file_path, file_content):
        thread = threading.Thread(target=Tool.write_down_file_content, args=(file_path, file_content))
        thread.start()
        return thread
    @staticmethod
    def generate_file_path(file_name,file_extension,file_dir,parent_dir):
        if file_extension == 'bin':
            file_name = f"{file_name}"
        else:
            file_name = f"{file_name}.{file_extension}"
        save_dir = os.path.join(parent_dir, file_dir)
        os.makedirs(save_dir, exist_ok=True)
        generated_path = os.path.join(save_dir, file_name)
        return generated_path

    @staticmethod
    def getFileExtension_from_mime_type(mime_type):
        if mime_type == "application/octet-stream":
            return False
        split = mime_type.split('/')
        if len(split) == 2:
            return split[1].lower()
        else:
            return False

    @staticmethod
    def getFileExtension_from_binary_data(data):
        # Sniff the MIME type from the bytes, then map it to an extension without the leading dot
        mime_type = mime.from_buffer(data)
        extension = mimetypes.guess_extension(mime_type)
        if extension is None:
            # Unknown type: print it so that a mapping can be added to custom_types
            print(mime_type)
            return False
        return extension[1:]

    @staticmethod
    def replace_link(html_content, link, path):
        new_html_content = html_content.replace(link, path)
        if new_html_content == html_content:
            new_html_content = html_content.replace(link.replace("https:", ""), path)
        return new_html_content

    @staticmethod
    def get_file_encoding_and_content_form_file_content(file_content):
        # Try chardet's guess first, then fall back to a fixed list of encodings
        guess_encoding = chardet.detect(file_content)['encoding']
        if guess_encoding == 'GB2312':
            # GBK is a superset of GB2312, so it decodes more pages correctly
            guess_encoding = 'GBK'
        candidates = [guess_encoding] if guess_encoding else []
        candidates += ['utf-8', 'GBK', 'ISO-8859-1', 'Windows-1252']
        for candidate in candidates:
            try:
                # errors='ignore' drops undecodable bytes instead of raising
                return candidate, file_content.decode(candidate, errors='ignore')
            except (LookupError, TypeError):
                # Unknown codec name: try the next candidate
                continue
        return '', file_content

    @staticmethod
    def get_file_encoding_form_file_path(file_path):
        # Try the following encodings in turn
        encodings = ['utf-8', 'GBK', 'ISO-8859-1', 'Windows-1252']
        for encoding in encodings:
            try:
                with open(file_path, 'r', encoding=encoding) as file:
                    file_content = file.read()
                    return encoding
            except Exception as e:
                pass

        else:
            return ''

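-------------divider----------------

The following was updated on October 11, 2024:

Major changes:

I found problems in the original code:

1. It did not account for sites that store their pages in encodings other than utf-8, which caused failures such as files that could not be opened, or garbled text being written after a successful open.

2. Added handling for special cases that produced invalid file names:

For example, the titles of some saved pages contain characters such as '/' or '\'. I put together a pattern of invalid characters: invalid_chars_regex = r'[<>:"/\\|?*]'.

I also split the code up and moved some static functions out into a separate class

In fact I noticed scattered bugs right after publishing in September. Since I did not have enough data, it took runs over tens of thousands of files to collect 400-odd files that triggered bugs (and, honestly, I was being lazy), and with the runs stopping and starting, that dragged on until after the National Day holiday. I apologize that the fixes only arrive now.

I may add a GUI later

To stress this again: I am a student and my skills are limited. You are responsible for your own data. Until you have confirmed that the converted files are not corrupted, do not delete the source files from the recycle bin!!!

Main class Mht2Html: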

import base64
import email
import hashlib
import os
import quopri
import re
from datetime import datetime, timezone
from email.policy import default

from send2trash import send2trash

from PathDeal import PathDeal


def format_filename(filepath):
    dir, filename = os.path.split(filepath)
    invalid_chars_regex = r'[<>:"/\\|?*]'
    filename = re.sub(invalid_chars_regex, '', filename)
    return os.path.join(dir, filename)


def format_str(str_object):
    illegal_chars = ['/', '*', ':', '?', '"', '<', '>', '|']
    # Replace every illegal character with an underscore '_'
    for char in illegal_chars:
        str_object = str_object.replace(char, "_")
    return str_object


class Mht2Html:
    path = ""
    fileName = ""
    url = ""
    title = ""
    savedTime = ""
    boundary = ""
    message = ""
    test = ""

    def extract_info(self, text, key):
        start = text.find(key) + len(key)
        end = text.find("\n", start)
        value = text[start:end].strip()
        return value

    def decode_mime_encoded_text(self, encoded_text):
        # Check whether this is MIME-encoded
        decoded_text = ''
        try:
            if encoded_text.startswith('=?') and encoded_text.endswith('?='):
                # Split into charset, encoding method and encoded text
                parts = encoded_text[2:-2].split('?')
                charset = parts[0]
                encoding = parts[1]
                text = parts[2]

                # Decode the text according to the encoding method
                if encoding == 'Q':
                    # Quoted-Printable encoding
                    decoded_text = quopri.decodestring(text).decode(charset)
                elif encoding == 'B':
                    # Base64 encoding
                    decoded_text = base64.b64decode(text).decode(charset)
        except Exception:
            decoded_text = os.path.splitext(self.fileName)[0]
        return decoded_text

    def rfc2822TOtimestamp(self, rfc2822):
        date_format = "%a, %d %b %Y %H:%M:%S %z"
        date_time = datetime.strptime(rfc2822, date_format)
        # Convert the datetime object to a UTC timestamp
        timestamp = date_time.replace(tzinfo=timezone.utc).timestamp()
        return timestamp

    def __init__(self, path, test=False):
        self.path = path
        self.test = test
        if (self.path == ""):
            return False
        self.fileName = os.path.basename(self.path)
        self.main_encoding = PathDeal.get_file_encoding_form_file_path(self.path)
        with open(self.path, 'r', encoding=self.main_encoding) as file:
            file_content = file.read()
            self.message = email.message_from_string(file_content, policy=default)

        self.url = self.extract_info(file_content, "Snapshot-Content-Location: ")
        self.title = format_str(self.decode_mime_encoded_text(self.extract_info(file_content, "Subject: ")))
        self.savedTime = self.rfc2822TOtimestamp(self.extract_info(file_content, "Date: "))  # RFC 2822
        self.boundary = self.extract_info(file_content, "boundary=")

    def addMainDistribute(self, file_content):
        file_content = "<!--" + ("url: " + self.url + "\n") + (
                "savedTime: " + str(self.savedTime)) + "-->\n" + file_content
        return file_content

    def convert(self):
        if self.test:
            print("url: " + self.url)
            print("title: " + self.title)
            print("savedTime: " + str(self.savedTime))
            print("boundary: " + self.boundary)
            # print("message: " + str(self.message))
        main_filePath = ''
        main_file_content = ''
        md5_value = ''
        num = 0
        all_message = list(self.message.walk())
        all_message_size = len(all_message) - 1
        for part in all_message:
            # A multipart part is only a container whose children are visited later by walk(), so skip it
            if part.get_content_maintype() == 'multipart':
                continue
            num = num + 1
            content_type = part.get_content_type()
            file_content = part.get_payload(decode=True)
            extension = part.get_content_subtype()

            if extension == "*":
                extension = PathDeal.getFileExtension(file_content)

            if num > 1:
                md5_value = hashlib.md5(file_content).hexdigest()
            if content_type in ['text/plain', 'text/html', 'text/css', 'application/javascript']:
                encoding, file_content = PathDeal.get_file_encoding_and_content_form_file_content(file_content)
            else:
                encoding = PathDeal.get_file_encoding_and_content_form_file_content(file_content)[0]
            if num == 1:
                file_path = os.path.join(os.path.dirname(self.path), datetime.fromtimestamp(self.savedTime).strftime(
                    '%Y%m%d_%H%M%S_') + self.title + '.' + extension)
                main_filePath = format_filename(file_path)
                main_file_content = file_content
                continue

            elif content_type in ['text/plain', 'text/html', 'text/css', 'application/javascript']:
                file_path = os.path.join(os.path.dirname(self.path), 'text',
                                         md5_value + '.' + extension)
            else:
                file_path = os.path.join(os.path.dirname(self.path), 'imgs',
                                         md5_value + '.' + extension)
            if file_path != '' and part.get('Content-Location') != None:
                main_file_content = PathDeal.replace_link(main_file_content, str(part.get('Content-Location')),
                                                          os.path.relpath(file_path,
                                                                          os.path.dirname(main_filePath)).replace('\\',
                                                                                                                  '/'))
            if os.path.exists(file_path):
                continue
            directory = os.path.dirname(file_path)
            if not os.path.exists(directory):
                os.makedirs(directory)
            if encoding == '':
                with open(file_path, 'wb') as f:
                    f.write(file_content)
                continue
            if content_type in ['text/plain', 'text/html', 'text/css', 'application/javascript']:
                # Write a text file
                with open(file_path, 'w', encoding=encoding) as f:
                    f.write(file_content)
            else:
                # Write a binary file
                with open(file_path, 'wb') as f:
                    f.write(file_content)

        # Guard against damaged input: in rare cases the saved file is incomplete and no main html file can be identified
        if main_filePath == '':
            return False
        if all_message_size > 1:
            with open(main_filePath, 'w', encoding="utf-8") as f:
                f.write(self.addMainDistribute(main_file_content))
        else:
            # Special case: the file holds a single non-html part (e.g. an image), so it is written as binary
            with open(main_filePath, 'wb') as f:
                f.write(file_content)

        send2trash(self.path)
        return main_filePath

The split-out utility class PathDeal:

import magic
import mimetypes
import chardet


class PathDeal:
    @staticmethod
    def getFileExtension(data):
        mime = magic.Magic(mime=True)
        mime_type = mime.from_buffer(data)
        return mimetypes.guess_extension(mime_type)[1::]

    @staticmethod
    def replace_link(html_content, link, path):
        new_html_content = html_content.replace(link, path)
        if new_html_content == html_content:
            new_html_content = html_content.replace(link.replace("https:", ""), path)
        return new_html_content

    @staticmethod
    def get_file_encoding_and_content_form_file_content(file_content):
        encoding = ''
        guess_encoding = chardet.detect(file_content)['encoding']
        if guess_encoding == 'GB2312':
            guess_encoding = 'GBK'
        try:
            file_content = file_content.decode(guess_encoding, errors='ignore')
            encoding = guess_encoding
        except:
            pass

        for guess_encoding in ['utf-8', 'GBK', 'ISO-8859-1', 'Windows-1252']:
            try:
                file_content = file_content.decode(guess_encoding, errors='ignore')
                encoding = guess_encoding
            except:
                pass

        return encoding, file_content

    @staticmethod
    def get_file_encoding_form_file_path(file_path):
        # Try the following encodings in turn
        encodings = ['utf-8', 'GBK', 'ISO-8859-1', 'Windows-1252']
        for encoding in encodings:
            try:
                with open(file_path, 'r', encoding=encoding) as file:
                    file_content = file.read()
                    return encoding
            except Exception as e:
                pass

        else:
            return ''

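-------------divider----------------

The following code was written on September 20, 2024. It does not handle encodings other than utf-8, so it is deprecated and kept only for reference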

import base64
import email
import hashlib
import os
import quopri
import re
from datetime import datetime, timezone
from email.policy import default
from send2trash import send2trash


class Mht2Html:
    path = ""
    fileName = ""
    url = ""
    title = ""
    savedTime = ""
    boundary = ""
    message = ""
    test = ""

    def extract_info(self, text, key):
        start = text.find(key) + len(key)
        end = text.find("\n", start)
        value = text[start:end].strip()
        return value

    def decode_mime_encoded_text(self, encoded_text):
        # Check whether this is MIME-encoded
        try:
            if encoded_text.startswith('=?') and encoded_text.endswith('?='):
                # Split into charset, encoding method and encoded text
                parts = encoded_text[2:-2].split('?')
                charset = parts[0]
                encoding = parts[1]
                text = parts[2]

                # Decode the text according to the encoding method
                if encoding == 'Q':
                    # Quoted-Printable encoding
                    decoded_text = quopri.decodestring(text).decode(charset)
                elif encoding == 'B':
                    # Base64 encoding
                    decoded_text = base64.b64decode(text).decode(charset)
        except Exception:
            decoded_text = os.path.splitext(self.fileName)[0]
        return decoded_text

    def rfc2822TOtimestamp(self, rfc2822):
        date_format = "%a, %d %b %Y %H:%M:%S %z"
        date_time = datetime.strptime(rfc2822, date_format)
        # Convert the datetime object to a UTC timestamp
        timestamp = date_time.replace(tzinfo=timezone.utc).timestamp()
        return timestamp

    def __init__(self, path, test=False):
        self.path = path
        self.test = test
        if (self.path == ""):
            return False
        self.fileName = os.path.basename(self.path)
        with open(self.path, 'r', encoding='utf-8') as file:
            file_content = file.read()
            self.message = email.message_from_string(file_content, policy=default)

        self.url = self.extract_info(file_content, "Snapshot-Content-Location: ")
        self.title = self.decode_mime_encoded_text(self.extract_info(file_content, "Subject: "))
        self.savedTime = self.rfc2822TOtimestamp(self.extract_info(file_content, "Date: "))  # RFC 2822
        self.boundary = self.extract_info(file_content, "boundary=")

    def replace_link(self, html_content, link, path):
        # Find and replace the link with a regular expression (re.escape makes the URL match literally)
        html_content = re.sub(re.escape(link), path, html_content)
        return html_content

    def addMainDiscribe(self, file_content):
        file_content = "<!--" + ("url: " + self.url + "\n") + (
                "savedTime: " + str(self.savedTime)) + "-->" + file_content
        return file_content

    def convert(self):
        if self.test:
            print("url: " + self.url)
            print("title: " + self.title)
            print("savedTime: " + str(self.savedTime))
            print("boundary: " + self.boundary)
            # print("message: " + str(self.message))
        main_filePath = ''
        main_file_content = ''
        num = 0
        for part in self.message.walk():
            if part.get_content_maintype() == 'multipart':
                continue
            num = num + 1
            content_type = part.get_content_type()
            extension = part.get_content_subtype()
            file_content = part.get_payload(decode=True)
            if num == 1:
                file_path = os.path.join(os.path.dirname(self.path), datetime.fromtimestamp(self.savedTime).strftime(
                    '%Y%m%d %H%M%S') + self.title + '.' + extension)
                main_filePath = file_path
                main_file_content = file_content.decode('utf-8')
                continue

            elif content_type in ['text/plain', 'text/html', 'text/css', 'application/javascript']:
                file_path = os.path.join(os.path.dirname(self.path), 'text',
                                         hashlib.md5(file_content).hexdigest() + '.' + extension)
            else:
                file_path = os.path.join(os.path.dirname(self.path), 'imgs',
                                         hashlib.md5(file_content).hexdigest() + '.' + extension)
            if os.path.exists(file_path):
                continue
            directory = os.path.dirname(file_path)
            if not os.path.exists(directory):
                os.makedirs(directory)
            if content_type in ['text/plain', 'text/html', 'text/css', 'application/javascript']:
                # Write a text file
                with open(file_path, 'w', encoding='utf-8') as f:
                    f.write(file_content.decode('utf-8'))
            else:
                # Write a binary file
                with open(file_path, 'wb') as f:
                    f.write(file_content)
            if file_path != '' and part.get('Content-Location') != None:
                main_file_content = self.replace_link(main_file_content, part.get('Content-Location'),
                                                      os.path.relpath(file_path, main_filePath).lstrip('../'))
        with open(main_filePath, 'w', encoding='utf-8') as f:
            f.write(self.addMainDiscribe(main_file_content))
        send2trash(self.path)
        return main_filePath

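I originally wanted to build this because I had saved a number of web pages on my phone. I got used to QQ Browser early on and simply kept using it. To this day I do not fully understand why QQ Browser saves pages as mht rather than html

Perhaps the

From: <Saved by Blink>

line in the mht file header is a hint: Blink is the rendering engine of Chromium, and Chromium-based browsers write this header when they save a page as MHTML

I went through a number of tutorials and open-source mht-to-html converters, on CSDN and on GitHub alike, and found that most of them target emails. The most absurd part is that I could not find a general standard for this format. I suspect this is also why none of the open-source code I found could convert my files to html.

After asking an AI also got me nowhere, I had to start tinkering on my own

First, find a file and look at its structure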

From: <Saved by Blink>
Snapshot-Content-Location: https:
Subject: =?utf-8?Q?xxx?=
Date: week, day ,time with gmt-0
MIME-Version: 1.0
Content-Type: multipart/related;
	type="text/html";
	boundary="----MultipartBoundary--xxx----"

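The important information in the header is:

Snapshot-Content-Location: records which web page this file was saved from

Subject: the title of the page

Date: the date; week is the day of the week, followed by the exact time. The whole timestamp is stored in GMT+0, i.e. the zero time zone

boundary: the delimiter that marks each part. xxx is generated randomly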
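These fields can also be read through the email library instead of by string searching. The sketch below uses a made-up header in the same shape as the sample; read_header_info and the example URL are assumptions for illustration, not part of the original code:

```python
import email
from email.policy import default

# A made-up header in the same shape as the sample above
SAMPLE = (
    "From: <Saved by Blink>\n"
    "Snapshot-Content-Location: https://example.com/page\n"
    "Subject: =?utf-8?Q?=E4=BD=A0=E5=A5=BD?=\n"
    "Date: Fri, 20 Sep 2024 08:30:00 +0000\n"
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/related;\n"
    '\ttype="text/html";\n'
    '\tboundary="----MultipartBoundary--abc----"\n'
    "\n"
)


def read_header_info(mht_text):
    """Return the url, title, save time and boundary from an mht file's top-level header."""
    message = email.message_from_string(mht_text, policy=default)
    return {
        "url": str(message["Snapshot-Content-Location"]),
        # With policy=default the encoded-word in Subject is already decoded
        "title": str(message["Subject"]),
        # Date headers parsed with policy=default expose a datetime attribute
        "saved": message["Date"].datetime,
        "boundary": message.get_boundary(),
    }


info = read_header_info(SAMPLE)
print(info["title"], info["saved"].isoformat(), info["boundary"])
```

Letting the library decode Subject and Date avoids hand-written decoding of the `=?utf-8?Q?...?=` form and of the date string.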

Further down there are a number of small parts, each separated by

------MultipartBoundary--xxx----

lines.

The header of each part contains

Content-Type: 
Content-ID: 
Content-Transfer-Encoding: quoted-printable
Content-Location: https://

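Content-Type tells the browser what type of file the following content is; common values are text/html, text/css, image/png and so on

Content-Transfer-Encoding states how the data below is encoded: text parts are generally quoted-printable and image parts are generally base64. If these two encodings are unfamiliar, look them up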
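As a quick illustration of the two encodings, here is a small sketch using the standard library's quopri and base64 modules. The sample strings are made up; in the converter itself, part.get_payload(decode=True) performs this decoding:

```python
import base64
import quopri

# quoted-printable: bytes outside printable ASCII are written as =XX hex escapes
qp_text = b"<p>=E4=BD=A0=E5=A5=BD</p>"
decoded_qp = quopri.decodestring(qp_text).decode("utf-8")

# base64: every 3 bytes of binary data become 4 ASCII characters
b64_text = b"iVBORw0KGgo="
decoded_b64 = base64.b64decode(b64_text)

print(decoded_qp)
print(decoded_b64)
```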

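Looking further, you will find that the first part of a file saved by QQ Browser is always the HTML file. I call it the main file, because it is the core of the whole page, and everything after it is supplementary content for the main file, such as css files and image files

With that, the approach for converting an mht file to HTML becomes clear: extract the first html part, then link the main file to the other small files

(Note: there are two ways to link the main file to the other small files. The conventional one is to embed the small files into the HTML so that everything is a single file. Because I have a large number of small files, I chose the other one to save space: store each small file separately under its hash, and point the main file at it with a relative path.)

Next, consider how to split the individual files apart. Since the files are divided by the boundary, we could match the boundaries with a regular expression and parse from there

But regular expressions are tiring. After matching, each part would have to be split again to get Content-Type and the other fields. An mht file is a MIME message, so why not use the ready-made email library?

Read the file and hand it straight to the email library to get a message object: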

        with open(self.path, 'r', encoding='utf-8') as file:
            file_content = file.read()
            self.message = email.message_from_string(file_content, policy=default)

(Of course, the email library offers several ways to get a message object; you can also pass the file object in directly with message_from_file)

Use

for part in self.message.walk():

to get each part
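To make the walk concrete, here is a small self-contained sketch on a made-up two-part document. The sample bytes and the list_parts helper are my own assumptions for illustration; email.message_from_bytes is used so that no file encoding has to be guessed up front:

```python
import email
from email.policy import default

BOUNDARY = "----MultipartBoundary--abc----"
SAMPLE = (
    "From: <Saved by Blink>\r\n"
    "MIME-Version: 1.0\r\n"
    'Content-Type: multipart/related; type="text/html"; boundary="' + BOUNDARY + '"\r\n'
    "\r\n"
    "--" + BOUNDARY + "\r\n"
    "Content-Type: text/html\r\n"
    "Content-Transfer-Encoding: quoted-printable\r\n"
    "Content-Location: https://example.com/page\r\n"
    "\r\n"
    '<img src=3D"https://example.com/a.png">\r\n'
    "--" + BOUNDARY + "\r\n"
    "Content-Type: image/png\r\n"
    "Content-Transfer-Encoding: base64\r\n"
    "Content-Location: https://example.com/a.png\r\n"
    "\r\n"
    "iVBORw0KGgo=\r\n"
    "--" + BOUNDARY + "--\r\n"
).encode("ascii")


def list_parts(mht_bytes):
    """Return (content type, Content-Location, decoded payload) for every non-container part."""
    message = email.message_from_bytes(mht_bytes, policy=default)
    parts = []
    for part in message.walk():
        if part.get_content_maintype() == "multipart":
            continue  # the container itself carries no file content
        parts.append((part.get_content_type(),
                      part.get("Content-Location"),
                      part.get_payload(decode=True)))
    return parts


for content_type, location, payload in list_parts(SAMPLE):
    print(content_type, location, len(payload))
```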

The first part is handled specially: it is pulled out directly as the main file:

            if num == 1:
                file_path = os.path.join(os.path.dirname(self.path), datetime.fromtimestamp(self.savedTime).strftime(
                    '%Y%m%d %H%M%S') + self.title + '.' + extension)
                main_filePath = file_path
                main_file_content = file_content.decode('utf-8')
                continue

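Here I name the file as time + title. Change this if you have other naming needs

The remaining small parts are split into two broad categories, text and images etc., and stored like this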

            elif content_type in ['text/plain', 'text/html', 'text/css', 'application/javascript']:
                file_path = os.path.join(os.path.dirname(self.path), 'text',
                                         hashlib.md5(file_content).hexdigest() + '.' + extension)
            else:
                file_path = os.path.join(os.path.dirname(self.path), 'imgs',
                                         hashlib.md5(file_content).hexdigest() + '.' + extension)

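Likewise, change this if you have other naming or storage needs. Here I use

hashlib.md5(file_content).hexdigest()

to compute the md5 value as a quasi-unique identifier for the file (there is a risk of hash collisions; if that risk bothers you, change the later file-exists check into a byte-by-byte comparison of the contents)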
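A minimal sketch of that stricter check, assuming a hypothetical helper store_deduplicated that is not in the original code. It keeps the md5-based name, but when a file with that name already exists it compares the bytes and falls back to a numbered name if they differ:

```python
import hashlib
import os


def store_deduplicated(directory, data, extension):
    """Write data under an md5-based name and return the path that holds exactly these bytes."""
    os.makedirs(directory, exist_ok=True)
    digest = hashlib.md5(data).hexdigest()
    counter = 0
    while True:
        suffix = "" if counter == 0 else "_" + str(counter)
        file_path = os.path.join(directory, digest + suffix + "." + extension)
        if not os.path.exists(file_path):
            with open(file_path, "wb") as f:
                f.write(data)
            return file_path
        with open(file_path, "rb") as f:
            if f.read() == data:
                return file_path  # identical content is already stored, reuse it
        counter += 1  # same name but different bytes: try the next numbered name
```

Reading the existing file back costs some I/O, so whether this is worth it depends on how much the collision risk matters for your data.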

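For each small part, after storing it we still have to link it to the main file. In the mht files saved by QQ Browser, this link is made through Content-Location. So we read the location from the small part

part.get('Content-Location')

and then look in the main file's content for every reference to that location, replacing each one with the relative path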

            if file_path != '' and part.get('Content-Location') != None:
                main_file_content = self.replace_link(main_file_content, part.get('Content-Location'),
                                                      os.path.relpath(file_path, main_filePath).lstrip('../'))

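One note here. During testing I found that for a few pages part.get('Content-Location') == None, meaning no Content-Location is defined. I figured the browser could not locate such a part either, so I just skip it. Data is at risk, so convert with care! That said, I have not found any problem caused by skipping it so far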
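Put together, the replacement step might look like the sketch below. The function name link_part and its parameters are hypothetical. Unlike the snippet above, the relative path is computed from the main file's directory (as the later versions of the code do), and a part without Content-Location is skipped:

```python
import os


def link_part(main_html, content_location, part_path, main_path):
    """Replace references to content_location in main_html with a relative path to part_path."""
    if content_location is None:
        return main_html  # nothing in the page can point at this part
    # Relative to the folder that holds the main html file, with forward slashes for HTML
    relative = os.path.relpath(part_path, os.path.dirname(main_path)).replace("\\", "/")
    replaced = main_html.replace(content_location, relative)
    if replaced == main_html and content_location.startswith("https:"):
        # Pages sometimes use protocol-relative links such as //example.com/a.png
        replaced = main_html.replace(content_location[len("https:"):], relative)
    return replaced
```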

Once everything is done, write the main file and delete the original mht file

        with open(main_filePath, 'w', encoding='utf-8') as f:
            f.write(self.addMainDiscribe(main_file_content))
        send2trash(self.path)

For batch processing, you can use

def main(directory):
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if file_path.endswith('.mht'):
            mht = Mht2Html(path=file_path)
            mht.convert()

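to process all the mht files in a folder

Once more, the data safety notice: I am a student and my skills are limited. You are responsible for your own data. Until you have confirmed that the converted files are not corrupted, do not delete the source files from the recycle bin!!!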
