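Python BZ2 "IOError: invalid data stream" raised by for text in wiki.get_texts():

This post documents an OSError encountered while processing the Chinese Wikipedia corpus with WikiCorpus from the gensim library, explains that the cause was passing the wrong file (WikiCorpus expects the .bz2 archive itself, not the decompressed XML), and gives the corrected code.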

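I have recently been training word vectors on the Chinese Wikipedia corpus. The first step is to convert the XML dump into a plain-text file.

Looking at other people's code, many of them simply use gensim's WikiCorpus. The original code was as follows: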

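# coding: utf-8

from gensim.corpora import WikiCorpus

if __name__ == '__main__':

    print('Main program starting...')

    input_file_name = 'zhwiki-20191120-pages-articles-multistream.xml'
    output_file_name = 'wiki.cn.txt'
    print('Start reading wiki data...')
    output_file = open(output_file_name, 'w', encoding="utf-8")
    # lemmatize=False is accepted by older gensim releases; gensim 4.x dropped this parameter
    input_file = WikiCorpus(input_file_name, lemmatize=False, dictionary={})
    print('Finished reading wiki data!')
    print('Processing starts...')
    count = 0
    for texts in input_file.get_texts():
        # Each article is a list of tokens; write one article per line.
        # b' '.join assumes byte tokens; if your gensim yields str tokens, use ' '.join(texts)
        output_file.write(b' '.join(texts).decode('utf-8') + '\n')
        count = count + 1
        if count % 10000 == 0:
            print('Processed %d articles so far' % count)
    print('Processing finished!')

    output_file.close()
    print('Main program finished!')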

The result was a baffling string of errors, ending with:

OSError: Invalid data stream

The traceback starts here:

Traceback (most recent call last):
  File "C:/PycharmProjects/zhwiki/xml2txt.py", line 23, in <module>
    for texts in input_file.get_texts():

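After several hours of searching, I finally found a clue in gensim's wikicorpus.py (Ctrl+click on the function name to open it):

The WikiCorpus class:

The get_texts() function:

Note the line marked in red in the screenshot of that function: it opens the input with BZ2File!

So the input file has to be the file exactly as downloaded, the one that has not been decompressed.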
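The failure can be reproduced without gensim at all. The sketch below uses only the standard-library bz2 module; the file names and the tiny sample XML are made up for illustration and are not taken from the real dump. It reads the same content twice through bz2.BZ2File, once as plain XML and once as a real bz2 archive, to show that only the plain file triggers the OSError.

```python
import bz2
import os
import tempfile

# Hypothetical sample content standing in for a Wikipedia dump.
SAMPLE_XML = b"<mediawiki><page><title>demo</title></page></mediawiki>"


def try_read_as_bz2(path):
    """Read `path` through bz2.BZ2File and return (data, error_message)."""
    try:
        with bz2.BZ2File(path) as fh:
            return fh.read(), None
    except OSError as exc:
        # Raised when the bytes on disk are not a bz2 stream.
        return None, str(exc)


def demo():
    """Write a plain and a compressed copy of the sample, then read both."""
    with tempfile.TemporaryDirectory() as tmp:
        plain_path = os.path.join(tmp, "dump.xml")
        packed_path = os.path.join(tmp, "dump.xml.bz2")
        with open(plain_path, "wb") as fh:
            fh.write(SAMPLE_XML)
        with open(packed_path, "wb") as fh:
            fh.write(bz2.compress(SAMPLE_XML))
        return try_read_as_bz2(plain_path), try_read_as_bz2(packed_path)


if __name__ == "__main__":
    plain_result, packed_result = demo()
    print("plain .xml  ->", plain_result)
    print("packed .bz2 ->", packed_result)
```

The plain file yields no data and an error message, while the compressed copy decompresses back to the original bytes. This mirrors what appears to happen inside get_texts() when it is handed the decompressed .xml file.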

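With that finally understood, I changed the line as shown below, and the error no longer appeared on the next run (although there were other errors...).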

input_file_name = 'zhwiki-20191120-pages-articles-multistream.xml.bz2'  # the input must be the original, still-compressed .bz2 file
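To catch this mistake before the long-running conversion starts, one option is to check the file signature up front. The helper names below (looks_like_bz2, check_dump_path) are hypothetical and not part of gensim; the sketch only relies on the fact that a bz2 stream begins with the bytes "BZh".

```python
import bz2
import os
import tempfile

BZ2_MAGIC = b"BZh"  # a bz2 stream starts with these three bytes


def looks_like_bz2(path):
    """Return True if the file at `path` starts with the bz2 signature."""
    with open(path, "rb") as fh:
        return fh.read(len(BZ2_MAGIC)) == BZ2_MAGIC


def check_dump_path(path):
    """Fail early with a readable message instead of deep inside get_texts()."""
    if not looks_like_bz2(path):
        raise ValueError(
            "%s is not bz2-compressed; pass the original .xml.bz2 download" % path
        )
    return path


if __name__ == "__main__":
    # Small demonstration with throwaway files (names are illustrative only).
    with tempfile.TemporaryDirectory() as tmp:
        packed = os.path.join(tmp, "dump.xml.bz2")
        with open(packed, "wb") as fh:
            fh.write(bz2.compress(b"<mediawiki/>"))
        print(check_dump_path(packed))
```

Calling check_dump_path(input_file_name) before constructing WikiCorpus would turn the confusing traceback into an immediate, explicit message. Checking the content rather than the file extension also covers the case where a decompressed file was renamed to end in .bz2.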
