NLP: Dependency Tree and Shortest Dependency Path Analysis with spaCy and NetworkX


函右右


License: CC 4.0 BY-SA

Tags: nlp, python

Article link: https://blog.youkuaiyun.com/m0_51732188/article/details/109479891

This article uses spaCy and NetworkX to analyze dependency trees and to count how often Harry appears as a subject or an object in the text of Harry Potter.


1. Analyzing an example sentence with spaCy

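The sentence chosen for dependency analysis in this experiment is:
Jingbo who dresses a green t-shirt was instructed by Chen.
Given this sentence as input, we can analyze the grammatical and dependency relations between its words, print the shortest dependency path between any two words, and draw the dependency tree of the whole sentence.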

Determining the dependency relations of the sentence with spaCy

Here token.text is the current dependent, token.head.text is the head that governs it, and token.dep_ is the dependency relation between the two.

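
The attributes described above can be listed with a short loop. The following is a minimal sketch rather than the original code of this post: the helper names dependency_triples and demo are made up for illustration, and demo assumes that spaCy and the en_core_web_sm model are installed.

```python
def dependency_triples(doc):
    """Return (head text, dependent text, relation label) for every token."""
    return [(token.head.text, token.text, token.dep_) for token in doc]


def demo(sentence="Jingbo who dresses a green t-shirt was instructed by Chen."):
    """Parse the example sentence and print one dependency triple per line."""
    # spaCy is imported here so the helper above can be defined without it.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    for head, dependent, relation in dependency_triples(nlp(sentence)):
        print(f"{head:<12}{dependent:<12}{relation}")
```

Calling demo() prints the head, the dependent, and the relation label of every token. The exact labels depend on the model version.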

Drawing the dependency tree with displacy

The dependency tree drawn for the example sentence "Jingbo who dresses a green t-shirt was instructed by Chen." is served at http://localhost:5000.
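
A tree like this can be produced with spaCy's displacy module. The sketch below is an assumption about how that might look, not the code from the original post: the function names are hypothetical, and both functions need spaCy and the en_core_web_sm model to be installed.

```python
def serve_dependency_tree(sentence, model="en_core_web_sm", port=5000):
    """Parse `sentence` and serve its dependency tree at http://localhost:<port>."""
    # Imported lazily so that defining this helper does not require spaCy.
    import spacy
    from spacy import displacy

    doc = spacy.load(model)(sentence)
    # Blocks until interrupted (Ctrl+C); open the URL in a browser meanwhile.
    displacy.serve(doc, style="dep", port=port)


def dependency_tree_svg(sentence, model="en_core_web_sm"):
    """Return the tree as SVG markup instead of starting a web server."""
    import spacy
    from spacy import displacy

    doc = spacy.load(model)(sentence)
    return displacy.render(doc, style="dep", jupyter=False)
```

serve_dependency_tree is convenient for a quick look in the browser, while dependency_tree_svg may be handier when the figure should be saved to a file.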

Note: different versions of en_core_web_sm may produce different dependency trees. The results shown here were obtained with version 2.4.

Printing the shortest dependency path

Take the shortest path between "Jingbo" and "Chen" as an example.

The shortest dependency path between these two words is Jingbo-instructed-by-Chen, and its length is 3 (three dependency edges).

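
One way to compute such a path is to turn the dependency arcs into an undirected NetworkX graph and ask for the shortest path between the two tokens. This is a sketch under assumptions, not the original code: the function names are hypothetical, and token indices are used as node ids so that repeated words are not merged into one node.

```python
import networkx as nx


def dependency_graph(doc):
    """Build an undirected graph with one edge per (head, child) arc."""
    graph = nx.Graph()
    for token in doc:
        graph.add_node(token.i)
        for child in token.children:
            graph.add_edge(token.i, child.i)
    return graph


def shortest_dependency_path(doc, source, target):
    """Return the words on the shortest path between two words of `doc`."""
    # If a word occurs more than once, its last occurrence is used.
    index = {token.text: token.i for token in doc}
    path = nx.shortest_path(dependency_graph(doc), index[source], index[target])
    return [doc[i].text for i in path]


def demo(sentence="Jingbo who dresses a green t-shirt was instructed by Chen."):
    """Print the path between Jingbo and Chen and its length in edges."""
    import spacy  # assumes spaCy and en_core_web_sm are installed

    doc = spacy.load("en_core_web_sm")(sentence)
    path = shortest_dependency_path(doc, "Jingbo", "Chen")
    print("-".join(path), "length:", len(path) - 1)
```

The path length is the number of edges, which is one less than the number of words on the path.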

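The complete code was shown as an image in the original post.
If nlp = spacy.load("en_core_web_sm") raises an error, see the post "解决NLP任务中安装spacy的问题" (Solving spaCy installation problems in NLP tasks).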
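
One common cause of such an error is that the model package is simply not installed. The sketch below covers only that case and is not taken from the referenced post. The helper name load_model is hypothetical, and it assumes spaCy itself is installed and that the machine can reach the download server.

```python
def load_model(name="en_core_web_sm"):
    """Load a spaCy pipeline, downloading the package first if it is missing."""
    import spacy
    from spacy.cli import download

    try:
        return spacy.load(name)
    except OSError:
        # spaCy raises OSError when no installed package or path has this name.
        download(name)  # same effect as: python -m spacy download en_core_web_sm
        return spacy.load(name)
```

In some environments a freshly installed package only becomes loadable after Python is restarted, so running the script a second time may be necessary.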

2. Subject-verb-object analysis of the Harry Potter text

Preprocessing the data

Download Harry Potter and the Sorcerer's Stone as a txt file.
Save it as Harry_Potter_1.txt.
Open a terminal and run:

grep --color -E "\bHarry\b" Harry_Potter_1.txt

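Notes:
grep: searches for lines that match a pattern
--color: highlights the matches in color
-E: interprets the pattern as an extended regular expression
"\bHarry\b": the regular expression; \b marks a word boundary, so only the whole word Harry matches
Harry_Potter_1.txt: the Harry Potter and the Sorcerer's Stone text downloaded earlier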
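
For readers without grep, roughly the same filtering can be done with Python's re module. This is a small sketch, and the function name grep_lines is made up for illustration.

```python
import re


def grep_lines(lines, pattern=r"\bHarry\b"):
    """Keep only the lines in which `pattern` matches somewhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]
```

Because of the word boundaries, a line containing "Harry's" is kept, while a line containing only "Harrys" is dropped.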

grep --color -E "\bHarry\b" Harry_Potter_1.txt |wc
grep --color -E "\bHarry\b" Harry_Potter_1.txt > Harry_Potter_Sentence.txt

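Notes:
|: pipes the output of one command into the next
wc: (word count) prints three numbers: lines, words, and bytes
>: redirects the output to a new file
Harry_Potter_Sentence.txt: the name of the new file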
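
The two commands above can be approximated in Python as well. The sketch below assumes the text file is UTF-8 encoded, and the function names wc and extract_matching_lines are chosen here for illustration only.

```python
import re
from pathlib import Path


def wc(text):
    """Return (lines, words, bytes) for UTF-8 text, similar to what wc reports."""
    return text.count("\n"), len(text.split()), len(text.encode("utf-8"))


def extract_matching_lines(src, dst, pattern=r"\bHarry\b"):
    """Write the lines of `src` that match `pattern` to `dst`; return their counts."""
    regex = re.compile(pattern)
    with open(src, encoding="utf-8") as fin:
        matched = "".join(line for line in fin if regex.search(line))
    Path(dst).write_text(matched, encoding="utf-8")
    return wc(matched)
```

For example, extract_matching_lines("Harry_Potter_1.txt", "Harry_Potter_Sentence.txt") writes the new file and returns the three counts for the matching lines.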

Analyzing Harry as a subject or an object (in all sentence types)

#sent_token.py
import re

import spacy

nlp = spacy.load('en_core_web_sm')

# Read the lines that grep wrote to Harry_Potter_Sentence.txt
# (assumes the file is UTF-8; adjust the encoding if your copy differs)
with open('Harry_Potter_Sentence.txt', encoding='utf-8') as f:
    doc = nlp(f.read())

pattern1 = re.compile(r'.*subj')  # subject labels such as nsubj, nsubjpass
pattern2 = re.compile(r'.*obj')   # object labels such as dobj, pobj
countm = 0     # number of times Harry is a subject
countn = 0     # number of times Harry is an object
subj = []      # distinct subject labels seen for Harry
obj = []       # distinct object labels seen for Harry
subjword = {}  # head word -> count, when Harry is a subject
objword = {}   # head word -> count, when Harry is an object

for sent in doc.sents:
    for token in sent:  # reuse the existing parse instead of parsing the sentence again
        if token.text != 'Harry':
            continue
        if pattern1.match(token.dep_):
            countm += 1
            print((token.head.text, token.text, token.dep_))
            if token.dep_ not in subj:
                subj.append(token.dep_)
            subjword[token.head.text] = subjword.get(token.head.text, 0) + 1
        elif pattern2.match(token.dep_):
            countn += 1
            print((token.head.text, token.text, token.dep_))
            if token.dep_ not in obj:
                obj.append(token.dep_)
            objword[token.head.text] = objword.get(token.head.text, 0) + 1

print('\n\n')
print('subj count:', countm, ' ; obj count:', countn)  # how often Harry is a subject / an object
print('\n\n')
print('Harry subj type:', subj)  # dependency labels under which Harry is a subject
print('Harry obj type:', obj)    # dependency labels under which Harry is an object
print('\n\n')
subjword2 = sorted(subjword.items(), key=lambda item: item[1], reverse=True)
print('Harry subj word:', subjword2)  # head words of Harry when Harry is a subject, most frequent first
print('\n\n')
objword2 = sorted(objword.items(), key=lambda item: item[1], reverse=True)
print('Harry obj word:', objword2)    # head words of Harry when Harry is an object, most frequent first

python3 sent_token.py

Notes:
nsubj: nominal subject
nsubjpass: passive nominal subject

dobj: direct object
pobj: object of a preposition
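
The counting done in sent_token.py can also be expressed with collections.Counter. The following is an alternative sketch, not the script from this post: the function name tally_roles is hypothetical, and it works on any sequence of objects that expose text, dep_, and head.text the way spaCy tokens do.

```python
from collections import Counter


def tally_roles(tokens, name="Harry"):
    """Count the head words of `name`, separately for subject and object labels."""
    subj_heads, obj_heads = Counter(), Counter()
    for token in tokens:
        if token.text != name:
            continue
        # Same test as the patterns '.*subj' and '.*obj' used with re.match.
        if "subj" in token.dep_:
            subj_heads[token.head.text] += 1
        elif "obj" in token.dep_:
            obj_heads[token.head.text] += 1
    return subj_heads, obj_heads
```

With a parsed document, subj_heads.most_common() then gives the head words sorted from most to least frequent, and sum(subj_heads.values()) gives the total subject count.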

Analyzing Harry as a subject or an object (in sentences with a complete subject-verb-object structure)

#Harry_subj_obj.py
import spacy

nlp = spacy.load('en_core_web_sm')

countm = 0  # Harry appears as a subject
countn = 0  # Harry appears as an object

# Assumes the file is UTF-8; adjust the encoding if your copy differs
with open('Harry_Potter_1.txt', encoding='utf-8') as f:
    for paragraph in f:
        paragraph = paragraph.replace('Harry Potter', 'Harry')
        doc = nlp(paragraph)  # each line of the file is treated as one paragraph
        for sent in doc.sents:
            # Maps dependency label -> word for this sentence; a later word
            # with the same label overwrites an earlier one
            words_pos = {}
            Harry_pos = ''
            for token in sent:  # each token; reuse the existing parse
                words_pos[token.dep_] = token.text
                if token.text == 'Harry':
                    Harry_pos = token.dep_
            if 'Harry' in words_pos.values():
                if Harry_pos == 'nsubj':  # Harry appears as a subject
                    if 'dobj' in words_pos:
                        countm += 1
                        print((words_pos['nsubj'], words_pos['ROOT'], words_pos['dobj']), Harry_pos)
                    # otherwise the sentence is not a complete subject-verb-object structure
                elif Harry_pos == 'dobj':  # Harry appears as an object
                    if 'nsubj' in words_pos:
                        countn += 1
                        print((words_pos['nsubj'], words_pos['ROOT'], words_pos['dobj']), Harry_pos)
                    # otherwise the sentence is not a complete subject-verb-object structure
                # otherwise Harry is neither nsubj nor dobj

print('\nIn the [subject predicate object] structure,')
print('The number of times Harry appears as a subject :', countm, '\nThe number of times Harry appears as an object:', countn)
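
Because the dictionary in this script is keyed by dependency label, the nsubj, ROOT, and dobj it prints are not guaranteed to belong to the same verb. A possible refinement, shown here only as a sketch and not as the method of this post, is to pair subjects and objects that share the same head. The function name harry_svo_triples is hypothetical, and the input can be any sentence whose tokens expose text, dep_, and children like spaCy tokens.

```python
def harry_svo_triples(sent, name="Harry"):
    """Return (subject, verb, object, role) tuples in which `name` takes part.

    Only subjects (nsubj) and objects (dobj) attached to the same verb are paired.
    """
    triples = []
    for verb in sent:
        children = list(verb.children)
        subjects = [c for c in children if c.dep_ == "nsubj"]
        objects = [c for c in children if c.dep_ == "dobj"]
        for s in subjects:
            for o in objects:
                if s.text == name:
                    triples.append((s.text, verb.text, o.text, "nsubj"))
                elif o.text == name:
                    triples.append((s.text, verb.text, o.text, "dobj"))
    return triples
```

Counting the tuples whose last element is "nsubj" or "dobj" gives figures comparable to countm and countn, although the numbers may differ from the script's because the pairing rule is stricter.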