Copying Word paragraphs with Python: find a heading in a Word file, then use Python to copy the entire paragraph to a new Word file...

I have the following situation:

I have several hundred Word files that contain company information. I would like to search these files for specific words to find specific paragraphs and copy only those paragraphs to new Word files. Basically, I just need to reduce each of the original couple hundred documents to a more readable size.

The documents are located in one directory and have different names. From each of them I want to extract particular information, which I need to define individually for each file.

To go about this I started with the following code to first write all file names into a .csv file:

# list all transcript files and write their names to a .csv file
import os
import csv

# newline='' stops the csv module from writing blank rows on Windows (Python 3)
with open("C:\\Users\\Stef\\Desktop\\Files.csv", 'w', newline='') as f:
    writer = csv.writer(f)
    for path, dirs, files in os.walk("C:\\Users\\Stef\\Desktop\\Files"):
        for filename in files:
            writer.writerow([filename])

This works perfectly. Next I open Files.csv and fill in the second column with the keywords that I need to search for in each document.

See picture below for how the .csv file looks:

The couple hundred Word files I have are structured with several levels of headings. What I want to do now is search for specific headings using the keywords I manually defined in the .csv and then copy the content of the passage that follows to a new file. I uploaded an extract from a Word file: "Presentation" is a 'Heading 1', and "North America" and "China" are 'Heading 2'.

In this case I would like, for example, to search for the 'Heading 2' "North America" and then copy the text below it ("In total [...] diluted basis.") to a new Word file that has the same name as the old one, just with "_clean.docx" added.
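To make the goal concrete, the extraction described above might be sketched roughly as follows. This is only an illustration, not a tested solution for these files: the function name `paragraphs_under_heading` is made up here, and it assumes the headings use Word's built-in styles, whose names python-docx exposes as `paragraph.style.name` (for example "Heading 2"). As a simplification, the section is taken to end at the next heading of any level.

```python
def paragraphs_under_heading(paragraphs, heading_text, heading_style="Heading 2"):
    """Collect the text of the paragraphs below a matching heading.

    `paragraphs` is any sequence of objects exposing `.text` and `.style.name`,
    which is what python-docx's `Document.paragraphs` provides. Collection
    stops at the next paragraph whose style name starts with "Heading".
    If the heading occurs several times, every matching section is collected.
    """
    collected = []
    inside = False
    for para in paragraphs:
        style_name = para.style.name
        if style_name.startswith("Heading"):
            # A heading either opens the wanted section or closes the current one.
            inside = (style_name == heading_style
                      and para.text.strip() == heading_text)
            continue
        if inside and para.text.strip():
            # Skip empty paragraphs, keep everything else in the section.
            collected.append(para.text)
    return collected


# Hypothetical usage with python-docx (assumed to be installed):
# from docx import Document
# texts = paragraphs_under_heading(Document("example.docx").paragraphs,
#                                  "North America")
```

Because the function only relies on `.text` and `.style.name`, it can be tried out on simple stand-in objects before running it against real documents.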

I started with my code as follows:

import os
import glob
import csv
import docx

os.chdir('C:\\Users\\Stef\\Desktop')

f = open('Files.csv')
csv_f = csv.reader(f)

file_name = []
matched_keyword = []

for row in csv_f:
    file_name.append(row[0])
    matched_keyword.append(row[1])

filelist = file_name
filelist2 = matched_keyword

for i, j in zip(filelist, filelist2):
    rootdir = 'C:\\Users\\Stef\\Desktop\\Files'
    doc = docx.Document(os.path.join(rootdir, i))

After this I was not able to find any working solution. I tried a few things but could not succeed at all. I would greatly appreciate further help.

I think the end should then look something like this, but I am not quite sure:

output =

output.save(i +"._clean.docx")
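One thing to watch with the line above: since `i` still carries its extension, `i + "._clean.docx"` would produce names such as `report.docx._clean.docx` (the file name is a made-up example). A minimal sketch of building the intended name, using a hypothetical helper `clean_name` and the standard library's `pathlib`, could be:

```python
from pathlib import Path


def clean_name(filename):
    """Return the file name with its extension replaced by "_clean.docx".

    Path.stem drops only the last suffix, so dots elsewhere in the name
    are preserved.
    """
    return Path(filename).stem + "_clean.docx"


print(clean_name("report.docx"))
```

The result could then be passed to `save()` in place of the concatenated string.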

I have considered the following questions and ideas:

Solution

I just figured out something similar for myself, so here is a complete working example for you. There might be a more Pythonic way of doing it…

import os
import sys

from docx import Document

inputFile = 'soTest.docx'

try:
    doc = Document(inputFile)
except Exception:
    print(
        "There was some problem with the input file.\nThings to check…\n"
        "- Make sure the file is a .docx (with no macros)"
    )
    sys.exit(1)

# build the output name from the input file name, whatever its directory
outFile = os.path.splitext(os.path.basename(inputFile))[0] + "_clean.docx"
strFind = 'North America'

# paraOffset is the distance from the matched paragraph to the one to copy;
# increase it if the paragraphs are not adjacent
paraOffset = 1

# doc.paragraphs returns a list of Paragraph objects
paras = doc.paragraphs

# use the list index to find the paragraph immediately after the known string;
# keep a list of found paras in case there is more than one match, and stop
# early enough that index + paraOffset never runs past the end of the list
parasFound = [paras[index + paraOffset]
              for index in range(len(paras) - paraOffset)
              if paras[index].text == strFind]

# add the found paragraphs to a new document
docOut = Document()
for para in parasFound:
    docOut.add_paragraph(para.text)

docOut.save(outFile)

I've also added an image of the input file, showing that North America appears in more than one place.
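The example above handles a single file with a hard-coded search string. To apply the same idea to every file listed in Files.csv, something along these lines might work. Treat it as a sketch under assumptions rather than a finished tool: `load_jobs`, `clean_all`, and the `out_dir` parameter are names invented here, the CSV is assumed to hold the file name in the first column and the keyword in the second, and python-docx is assumed to be installed.

```python
import csv
import os


def load_jobs(csv_path):
    """Return (filename, keyword) pairs from a two-column CSV.

    Rows with a missing or empty file name or keyword are skipped.
    """
    jobs = []
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if len(row) >= 2 and row[0].strip() and row[1].strip():
                jobs.append((row[0].strip(), row[1].strip()))
    return jobs


def clean_all(csv_path, source_dir, out_dir, para_offset=1):
    """Copy the paragraph after each keyword match into <name>_clean.docx."""
    from docx import Document  # python-docx, imported lazily (assumed installed)

    os.makedirs(out_dir, exist_ok=True)
    for filename, keyword in load_jobs(csv_path):
        paras = Document(os.path.join(source_dir, filename)).paragraphs
        out = Document()
        # Stop early so index + para_offset stays inside the list.
        for index in range(len(paras) - para_offset):
            if paras[index].text == keyword:
                out.add_paragraph(paras[index + para_offset].text)
        stem = os.path.splitext(filename)[0]
        out.save(os.path.join(out_dir, stem + "_clean.docx"))
```

Writing the results to a separate `out_dir` keeps the cleaned files from being picked up the next time the source directory is listed.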
