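For working with Word documents (.doc, .docx) in Python, two projects are worth a look: python-docx and python-docx2txt. Both target the .docx format. python-docx is the more complex of the two and is suited to creating documents, while python-docx2txt makes it easy to convert a document to plain text: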
https://python-docx.readthedocs.org/en/latest/
https://github.com/python-openxml/python-docx
In addition, a .docx file is itself a ZIP archive whose actual document content is stored as XML, so it can be unpacked with unzip:
# unzip test.docx
Archive: test.docx
inflating: _rels/.rels
inflating: word/settings.xml
inflating: word/_rels/document.xml.rels
inflating: word/fontTable.xml
inflating: word/styles.xml
inflating: word/document.xml
inflating: docProps/app.xml
inflating: docProps/core.xml
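The same unpacking can be done from Python with only the standard library. The sketch below opens the archive, parses word/document.xml, and joins the text runs of each paragraph. The helpers `docx_text` and `make_sample` are hypothetical names for this example, and the sample archive contains only word/document.xml, so it is enough for this demo but is not a complete file that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(source):
    """Return paragraph texts from word/document.xml inside a .docx archive."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


def make_sample():
    """Build a tiny in-memory archive holding only word/document.xml."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(docx_text(make_sample()))
```

`docx_text` also accepts a path such as `test.docx`, since `zipfile.ZipFile` takes either a file name or a file-like object.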