Merging Hard-Wrapped Lines of Text

License: CC 4.0 BY-SA

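This post presents a simple Python script for cleaning up text copied from a PDF document, which typically carries layout artifacts such as hard line breaks and end-of-line hyphens. Merging the lines and rejoining the words that were split across them makes the text easier to use elsewhere, for example in a translation tool.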

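When you copy a passage from a PDF, the page layout leaves a hard line break at the end of every line, so the text comes out as a "block of tofu". That makes it awkward to paste into other tools (Baidu Translate, for example). Here is what the copied text looks like: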

The dataset is recorded using a time-of-flight
Intel Creative Interactive Gesture Camera and has
J = 16 annotated joints. Although the authors pro-
vide different artificially rotated training samples, we
only use the genuine 22k. The depth images have
a high quality with hardly any missing depth val-
ues, and sharp outlines with little noise. However,
the pose variability is limited compared to the NYU
dataset. Also, a relatively large number of samples
both from the training and test sets are incorrectly
annotated: We evaluated the accuracy and about 36%
of the poses from the test set have an annotation error
of at least 10 mm.

I wrote a short Python script to merge the lines:

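def main():
    # 'r+' opens the file for reading and writing. After readlines() the
    # file position is at the end, so write() appends to the original text.
    with open('a.md', 'r+', encoding='utf-8') as obj:
        lines = obj.readlines()
        merged = ''
        for line in lines:
            line = line.rstrip()
            if not line:
                # Skip blank lines.
                continue
            if line.endswith('-'):
                # Word split across two lines: drop the hyphen, add no space.
                merged += line[:-1]
            else:
                merged += line + ' '
        # Start on a new line so the result is separated from the original.
        obj.write('\n' + merged.rstrip() + '\n')
    # No explicit close() needed: the with block closes the file.

if __name__ == '__main__':
    main()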

Here 'a.md' is just an arbitrary file name I picked on Ubuntu. (On Windows you can change it to 'a.txt'.)

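Note: put the .py file and the .md file in the same folder and run the script from that folder, because 'a.md' is a relative path and is looked up in the current working directory.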
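If the script may be started from a different working directory, one option is to build the path from the script's own location instead. The following is only a minimal sketch; `path_next_to` is a hypothetical helper name, not part of the original script.

```python
from pathlib import Path


def path_next_to(script_path: str, name: str) -> Path:
    """Return the path of `name` in the same folder as `script_path`.

    Hypothetical helper for illustration only.
    """
    return Path(script_path).resolve().parent / name


# Inside the script itself one would typically pass __file__, e.g.:
#     target = path_next_to(__file__, 'a.md')
# The literal 'merge.py' below is just a placeholder script name.
print(path_next_to('merge.py', 'a.md'))
```

The returned path can then be passed to `open()` in place of the bare 'a.md'.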

Result after processing:

The dataset is recorded using a time-of-flight Intel Creative Interactive Gesture Camera and has J = 16 annotated joints. Although the authors provide different artificially rotated training samples, we only use the genuine 22k. The depth images have a high quality with hardly any missing depth values, and sharp outlines with little noise. However, the pose variability is limited compared to the NYU dataset. Also, a relatively large number of samples both from the training and test sets are incorrectly annotated: We evaluated the accuracy and about 36% of the poses from the test set have an annotation error of at least 10 mm.
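The same merging logic can also be written as a function that works on a string, which makes it easy to reuse and to test without touching a file. This is a sketch under the same rules as the script above (skip blank lines, drop an end-of-line hyphen, otherwise join with a space); the name `merge_lines` is my own choice.

```python
def merge_lines(text: str) -> str:
    """Join hard-wrapped lines into one paragraph, undoing end-of-line hyphenation."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith('-'):
            # Word split across two lines: drop the hyphen, add no space.
            parts.append(line[:-1])
        else:
            parts.append(line + ' ')
    return ''.join(parts).rstrip()


sample = """Although the authors pro-
vide different artificially rotated training samples, we
only use the genuine 22k."""
print(merge_lines(sample))
```

One limitation to keep in mind: the rule cannot tell a hyphen inserted by line wrapping from a real one. If a compound such as "time-of-flight" happens to be broken right after a hyphen, that hyphen is removed as well, so the output is worth a quick read-through.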