(3) PositionRank Code Walkthrough (Part 1)

Latest recommended article published on 2022-09-13 16:45:02

License: CC 4.0 BY-SA

Tags: #python #deep learning #machine learning

2021SC@SDUSC

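Introduction

Project GitHub repository: corinaflorescu/PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents

Original Python version of the project: 2.7

Directory structure

[Image: project directory structure]

Dependencies and their versions: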

backports.functools-lru-cache==1.5
decorator==4.3.0
futures==3.2.0
networkx==2.2
nltk==3.4
nose==1.3.7
numpy==1.15.4
Pillow==5.3.0
PositionRank==1.0
psutil==5.4.8
pyparsing==2.3.0
pytz==2018.7
scipy==1.1.0
singledispatch==3.4.0.3
six==1.12.0
subprocess32==3.5.3

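Lab requirements

To make experimental results easier to reuse, the senior students in the lab set the following requirements for the experiment environment:

1.	Python and its third-party libraries
(1)	Python 3.7.6
(2)	Third-party libraries such as:
①	NumPy 1.18.1 (supports Python 3.5-3.8)
②	Matplotlib 3.1.3 (supports Python 3.6-3.8)
③	SciPy 1.4.1 (supports Python 3.5-3.8)
④	Keras 2.3.1 (supports Python 3.5-3.8)
⑤	scikit-learn 0.22.1 (supports Python 3.5-3.8)
⑥	scikit-image 0.16.2 (supports Python 3.6-3.8)
2.	Deep learning framework
(1)	PyTorch 1.8.2 is recommended (supports CPU/GPU)
(2)	or TensorFlow 1.14 (supports Python 3.5-3.7 and Python 2.7; supports CPU/GPU)
3.	CUDA (required for the GPU version)
(1)	CUDA 10.1
(2)	cuDNN 7.6.5 for CUDA 10.1 (must match the CUDA version)

The original project targets Python 2.7 and older library versions, so its code has to be modified and adjusted to run in this new environment.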
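To see where the adjustment is needed, the versions pinned by the project can be compared with the lab requirements. The sketch below is illustrative only: the two dictionaries copy a few entries from the lists above, and version_conflicts is a hypothetical helper, not part of the PositionRank repository.

```python
# A few versions pinned by the original project (copied from the dependency list).
project_pins = {
    "numpy": "1.15.4",
    "scipy": "1.1.0",
    "networkx": "2.2",
    "nltk": "3.4",
}

# A few versions required by the lab environment (copied from the lab requirements).
lab_requirements = {
    "numpy": "1.18.1",
    "scipy": "1.4.1",
    "matplotlib": "3.1.3",
}


def version_conflicts(pins, required):
    """Return {package: (pinned, required)} for packages listed in both with different versions."""
    return {
        name: (pins[name], required[name])
        for name in pins
        if name in required and pins[name] != required[name]
    }


conflicts = version_conflicts(project_pins, lab_requirements)
for name, (old, new) in sorted(conflicts.items()):
    print(f"{name}: project pins {old}, lab requires {new}")
```

With these entries, only numpy and scipy appear in both lists with different versions, so those are the libraries whose API changes are most likely to matter when porting.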

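Flow control module (_main_.py)

Initialization

Initialize the evaluation metrics: P (PR) is precision, R (RR) is recall, and F1 is computed from P and R. F1 is the harmonic mean of P and R, that is, 2 divided by the sum of their reciprocals, and it balances the trade-off between the two.

The formulas are as follows: [Image: formulas for P, R and F1]

Note that in Python, applying the "*" operator to a list repeats its contents and concatenates the copies. For example, [0] * args.topK is a list of length topK (a 1 x topK vector) whose elements are all 0.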
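Assuming the standard set-based definitions, P = correct / predicted, R = correct / gold, and F1 = 2PR / (P + R), where "correct" is the number of predicted keyphrases that also appear in the gold list. The sketch below applies these definitions to the top-k predictions of one document. The function name and the sample phrases are made up for illustration, and this is not the evaluation code used in the repository.

```python
def precision_recall_f1(predicted, gold, k):
    """Precision, recall and F1 of the top-k predicted phrases against the gold phrases."""
    top_k = predicted[:k]
    gold_set = set(gold)
    # Number of top-k predictions that are also gold keyphrases.
    correct = sum(1 for phrase in top_k if phrase in gold_set)
    precision = correct / len(top_k) if top_k else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        f1 = 0.0  # avoid dividing by zero when nothing is correct
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical stemmed phrases, ranked by score.
predicted = ["keyphras extract", "graph rank", "neural network", "pagerank"]
gold = ["keyphras extract", "pagerank", "scholarli document"]

p, r, f1 = precision_recall_f1(predicted, gold, k=2)
print(p, r, f1)
```

Here one of the two top-ranked phrases is in the gold list, so P is 1/2, R is 1/3, and F1 is about 0.4.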
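A quick check of the list repetition described above, plus one caveat worth knowing. The variable names are for illustration only. Repeating a list of integers is safe, but repeating a list that contains another list copies references, not the inner lists.

```python
top_k = 5

# Repeating a list of immutable values gives independent slots.
P = [0] * top_k
P[0] += 1
print(P)  # [1, 0, 0, 0, 0]

# Caveat: repeating a nested list repeats references to the same inner list.
shared = [[0] * 2] * 3
shared[0][0] = 1
print(shared)  # [[1, 0], [1, 0], [1, 0]]

# A list comprehension builds a separate inner list for each row.
independent = [[0] * 2 for _ in range(3)]
independent[0][0] = 1
print(independent)  # [[1, 0], [0, 0], [0, 0]]
```

The metric vectors in this project hold plain numbers, so the simple form [0] * args.topK is fine.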

    # initialize the evaluation metrics vectors
    P, R, F1 = [0] * args.topK, [0] * args.topK, [0] * args.topK
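    # accumulators for R-precision and bpref, and the number of evaluated documents
    Rprec = 0.0
    bpref = 0.0
    docs = 0
    # keep only regular files in the input directory (uses os, and isfile/join from os.path)
    files = [f for f in os.listdir(args.input_data) if isfile(join(args.input_data, f))]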
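The last line of the snippet keeps only the regular files in the input directory. The sketch below reproduces that filter on a temporary directory so its behavior is easy to verify. The directory contents are made up, and input_data stands in for args.input_data.

```python
import os
import tempfile
from os.path import isfile, join

with tempfile.TemporaryDirectory() as input_data:
    # Create one regular file and one subdirectory inside the temporary directory.
    open(join(input_data, "doc1.txt"), "w").close()
    os.mkdir(join(input_data, "subdir"))

    # Same filter as in the snippet above: subdirectories are skipped.
    files = [f for f in os.listdir(input_data) if isfile(join(input_data, f))]

    # join() inserts the path separator when it is missing.
    paths = [join(input_data, f) for f in files]

print(files)  # ['doc1.txt']
```

Note that building a path with plain string concatenation (directory + filename) only works when the directory string already ends with a path separator, whereas join() handles both cases.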

Analysis of the run-time control flow

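    for filename in files:
        print(filename)
        # if doc has passed the criteria then we save its text and gold
        # Build each file's path from the file name list generated earlier.
        # text holds the content of the source document; gold holds its annotated keyphrases.
        # If the path exists, read the source file;
        # if the path does not exist, the return value is None.
        text = process_data.read_input_file(args.input_data + filename)
        # If the path exists, read the file and return the list of keyphrases;
        # if the path does not exist, the return value is None.
        gold = process_data.read_gold_file(args.input_gold + filename)

        # Continue only if both the document and its gold keyphrases were read
        if text and gold:
            gold_stemmed = []
            # Lowercase each phrase, stem every word with the Porter stemmer,
            # and join the stems back into a phrase
            for keyphrase in gold:
                keyphrase = [porter_stemmer.stem(w) for w in keyphrase.lower().split()]
                gold_stemmed.append(' '.join(keyphrase))
            # count the document
            docs += 1

            # Initialize the PositionRank algorithm
            system = PositionRank.PositionRank(text, args.window, args.phrase_type)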