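Some Word documents turned up with this nasty thing in them: embedded ActiveX controls (option buttons).
There were a lot of questionnaires, and a colleague asked me to help pull the answers out. After some digging I found the best way to do it in Python, so here's the code.
The rough idea: convert the document to HTML first, after which extracting from the HTML is easy. Be sure to do the conversion with pywin32; other conversion methods lose content. These are Microsoft's ActiveX controls, and only Microsoft's own software converts them properly.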
#!/usr/bin/env python
# coding=utf-8
import os

import win32com.client as win32
from bs4 import BeautifulSoup
from openpyxl import Workbook


def docx2html():
    # Drive Word itself over COM so the ActiveX controls survive the conversion.
    word = win32.gencache.EnsureDispatch('Word.Application')
    p = os.path.abspath("./test.docx")
    print(p)
    doc = word.Documents.Open(p)
    # doc.SaveAs('./2.pdf', 17)
    # 10 = filtered HTML; raw string so the backslashes are taken literally.
    doc.SaveAs(r'D:\mypy\dataWash\wang_han\zz.html', 10)
    doc.Close()
    word.Quit()


def parse(f="./zz.html"):
    wb = Workbook()
    ws = wb.active
    ws.title = "res"
    start = 1
    with open(f, "rb") as fh:
        data = fh.read()
    soup = BeautifulSoup(data, 'html.parser')
    # Each ActiveX control is exported as an <object> with <param> children.
    oblist = soup.select("object")
    for ob in oblist:
        paramlist = ob.select("param")
        for param in paramlist:
            name = param.get("name")
            if name == "Value":
                # One spreadsheet row per control: the param name and its value.
                ws.cell(row=start, column=1).value = name
                ws.cell(row=start, column=2).value = param.get("value")
                start += 1
    wb.save("caoyang.xlsx")


parse()
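With many questionnaires to get through, the conversion step could be wrapped up roughly like this. This is only a sketch: html_target and docx_to_html are names I made up, and I'm assuming the HTML should land next to the source .docx. Word is given absolute paths here because it resolves relative ones against its own working directory, not Python's. The try/finally is there so Word gets closed even if a file fails to convert.

```python
import os
from pathlib import Path

# Word's WdSaveFormat value for filtered HTML (the 10 used in the script above).
WD_FORMAT_FILTERED_HTML = 10


def html_target(docx_path):
    """Return the absolute .html path sitting next to the given .docx (hypothetical helper)."""
    return Path(os.path.abspath(docx_path)).with_suffix(".html")


def docx_to_html(docx_path):
    """Convert one .docx to filtered HTML through Word. Needs Windows, Word and pywin32."""
    import win32com.client as win32  # imported lazily so the helper above works anywhere

    src = os.path.abspath(docx_path)
    dst = html_target(src)
    word = win32.gencache.EnsureDispatch("Word.Application")
    try:
        doc = word.Documents.Open(src)
        try:
            doc.SaveAs(str(dst), WD_FORMAT_FILTERED_HTML)
        finally:
            doc.Close()
    finally:
        # Quit even on failure, otherwise stray WINWORD.EXE processes pile up.
        word.Quit()
    return dst
```

Calling docx_to_html("./test.docx") should then leave test.html beside the original, ready for parsing.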
The converted HTML:
</object></span><span style='font-size:16.0pt;font-family:仿宋_GB2312'>省公司级单位本部</span></p>
</td>
</tr>
<tr style='height:25.0pt'>
<td width=590 style='width:442.8pt;border:solid white 1.0pt;border-top:none;
background:white;padding:0cm 5.4pt 0cm 5.4pt;height:25.0pt'>
<p class=MsoNormal style='text-indent:32.0pt;line-height:30.0pt'><span
lang=EN-US style='font-size:16.0pt;font-family:仿宋_GB2312'><object
classid="CLSID:8BD21D50-EC42-11CE-9E0D-00AA006002F3" id=OptionButton2
width=17 height=37>
<param name=DisplayStyle value=5>
<param name=Size value="462;990">
<param name=Value value=0>
<param name=GroupName value=1>
<param name=FontName value=微软雅黑>
<param name=FontHeight value=315>
<param name=FontCharSet value=134>
<param name=FontPitchAndFamily value=34>
</object></span><span style='font-size:16.0pt;font-family:仿宋_GB2312'>地市公司级单位</span></p>
</td>
</tr>
<tr style='height:25.0pt'>
<td width=590 style='width:442.8pt;border:solid white 1.0pt;border-top:none;
background:white;padding:0cm 5.4pt 0cm 5.4pt;height:25.0pt'>
<p class=MsoNormal style='text-indent:32.0pt;line-height:30.0pt'><span
lang=EN-US style='font-size:16.0pt;font-family:仿宋_GB2312'><object
classid="CLSID:8BD21D50-EC42-11CE-9E0D-00AA006002F3" id=OptionButton3
width=17 height=37>
<param name=DisplayStyle value=5>
<param name=Size value="462;990">
<param name=Value value=0>
<param name=GroupName value=1>
<param name=FontName value=微软雅黑>
<param name=FontHeight value=315>
<param name=FontCharSet value=134>
<param name=FontPitchAndFamily value=34>
</object>
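Look closely and all you really need to find is this: <param name=Value value=0>, or the same line with value=1. After that, everything is sorted.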
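A bare 0 or 1 is more useful when you know which option it belongs to. Below is a rough sketch that pairs each control's id, GroupName and Value with the label text following it. It uses only the standard library's html.parser instead of BeautifulSoup, so it is a variant of the approach above, not the same code. OptionCollector and extract_options are names I made up, SAMPLE is trimmed from the output above, and the second button's value=1 and its "Option B" label are invented for illustration. I'm also assuming value=1 means the button is selected.

```python
from html.parser import HTMLParser

# Trimmed from the converted HTML; the second entry's value=1 and label are made up.
SAMPLE = """
<p class=MsoNormal><span><object
classid="CLSID:8BD21D50-EC42-11CE-9E0D-00AA006002F3" id=OptionButton2
width=17 height=37>
<param name=DisplayStyle value=5>
<param name=Value value=0>
<param name=GroupName value=1>
</object></span><span>地市公司级单位</span></p>
<p class=MsoNormal><span><object
classid="CLSID:8BD21D50-EC42-11CE-9E0D-00AA006002F3" id=OptionButton3
width=17 height=37>
<param name=Value value=1>
<param name=GroupName value=1>
</object></span><span>Option B</span></p>
"""


class OptionCollector(HTMLParser):
    """Collect one record per <object>: id, group, value and trailing label text."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cur = None        # record for the control currently being read
        self._in_label = False  # True between </object> and the closing </p>

    def _flush(self):
        if self._cur is not None:
            self._cur["label"] = self._cur["label"].strip()
            self.rows.append(self._cur)
        self._cur = None
        self._in_label = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "object":
            self._flush()  # a new control ends any unfinished one
            self._cur = {"id": a.get("id"), "group": None, "value": None, "label": ""}
        elif tag == "param" and self._cur is not None and not self._in_label:
            if a.get("name") == "Value":
                self._cur["value"] = a.get("value")
            elif a.get("name") == "GroupName":
                self._cur["group"] = a.get("value")

    def handle_endtag(self, tag):
        if tag == "object" and self._cur is not None:
            self._in_label = True
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_label:
            self._cur["label"] += data


def extract_options(html):
    collector = OptionCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # keep a control whose paragraph was cut off
    return collector.rows


if __name__ == "__main__":
    for row in extract_options(SAMPLE):
        print(row["id"], row["group"], row["value"], row["label"])
```

Filtering the rows on value == "1" would then give the chosen option for each group, if the assumption about 1 meaning selected holds.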