Pynlpir Part-of-Speech Tag Set

This post introduces the part-of-speech tag set used by the Pynlpir library, with links to the relevant reference pages and source code, to help readers understand and use Pynlpir for POS tagging.


Reference pages:

http://pynlpir.readthedocs.io/en/latest/_modules/pynlpir/pos_map.html?highlight=pos

http://pynlpir.readthedocs.io/en/latest/pos_map.html?highlight=pos

https://github.com/tsroten/pynlpir/blob/develop/pynlpir/pos_map.py
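The tag map below is excerpted from pynlpir/pos_map.py (linked above). Each entry maps an NLPIR tag code to its Chinese and English names, with finer-grained sub-tags nested as child dictionaries.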


POS_MAP = {
    'n': ('名词', 'noun', {
        'nr': ('人名', 'personal name', {
            'nr1': ('汉语姓氏', 'Chinese surname'),
            'nr2': ('汉语名字', 'Chinese given name'),
            'nrj': ('日语人名', 'Japanese personal name'),
            'nrf': ('音译人名', 'transcribed personal name')
        }),
        'ns': ('地名', 'toponym', {
            'nsf': ('音译地名', 'transcribed toponym'),
        }),
        'nt': ('机构团体名', 'organization/group name'),
        'nz': ('其它专名', 'other proper noun'),
        'nl': ('名词性惯用语', 'noun phrase'),
        'ng': ('名词性语素', 'noun morpheme'),
    }),
    't': ('时间词', 'time word', {
        'tg': ('时间词性语素', 'time morpheme'),
    }),
    's': ('处所词', 'locative word'),
    # ... (the full map continues; see pos_map.py linked above)
}
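
These names are what pynlpir returns when POS tagging is enabled. A minimal usage sketch (the input sentence is made up for illustration):

import pynlpir

pynlpir.open()
# pos_names='child' returns the most specific name from POS_MAP;
# the default is 'parent'. pos_english=False returns the Chinese names.
tokens = pynlpir.segment('我爱北京天安门', pos_tagging=True,
                         pos_names='child', pos_english=True)
print(tokens)  # e.g. [('我', 'personal pronoun'), ('爱', 'verb'), ...]
pynlpir.close()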
Related question: the script below tokenizes a text with jieba, POS-tags it with pynlpir, and trains a simple RNN tagger, but fails with the error shown at the end.

import jieba
import pynlpir
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Read the text file
with open('1.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Tokenize the text
word_list = list(jieba.cut(text, cut_all=False))

# Open the pynlpir segmenter
pynlpir.open()

# POS-tag the segmented text
pos_list = pynlpir.segment(text, pos_tagging=True)

# Map the vocabulary to integer ids
vocab = set(word_list)
vocab_size = len(vocab)
word_to_int = {word: i for i, word in enumerate(vocab)}
int_to_word = {i: word for i, word in enumerate(vocab)}

# Map the POS tags to integer ids
pos_tags = set(pos for word, pos in pos_list)
num_tags = len(pos_tags)
tag_to_int = {tag: i for i, tag in enumerate(pos_tags)}
int_to_tag = {i: tag for i, tag in enumerate(pos_tags)}

# Convert the text and labels to integer sequences
X = np.array([word_to_int[word] for word in word_list])
y = np.array([tag_to_int[pos] for word, pos in pos_list])

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Model hyperparameters
embedding_size = 128
rnn_size = 256
batch_size = 128
epochs = 10

# Define the RNN model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_size),
    tf.keras.layers.SimpleRNN(rnn_size),
    tf.keras.layers.Dense(num_tags, activation='softmax')
])

# Compile the model
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs,
          validation_data=(X_test, y_test))

# Predict on the test set
y_pred = model.predict(X_test)
y_pred = np.argmax(y_pred, axis=1)

# Compute the accuracy
accuracy = np.mean(y_pred == y_test)
print('Accuracy: {:.2f}%'.format(accuracy * 100))

# Save the model to a file
model.save('model.h5')

Running it produces:

ValueError: Found input variables with inconsistent numbers of samples:
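
The error comes from train_test_split: X is built from jieba's tokenization while y is built from pynlpir's, and the two tokenizers split the text differently, so len(X) != len(y). A minimal fix sketch, deriving both the words and the tags from the same pynlpir.segment() output so the two arrays stay aligned (the file name 1.txt is taken from the question):

import pynlpir
import numpy as np
from sklearn.model_selection import train_test_split

with open('1.txt', 'r', encoding='utf-8') as f:
    text = f.read()

pynlpir.open()
pos_list = pynlpir.segment(text, pos_tagging=True)  # [(word, tag), ...]
pynlpir.close()

# Build the word and tag sequences from the SAME segmentation,
# so they are guaranteed to have equal length.
words = [word for word, tag in pos_list]
tags = [tag for word, tag in pos_list]

word_to_int = {w: i for i, w in enumerate(set(words))}
tag_to_int = {t: i for i, t in enumerate(set(tags))}

X = np.array([word_to_int[w] for w in words])
y = np.array([tag_to_int[t] for t in tags])
assert len(X) == len(y)  # both now come from pos_list

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Note that, as written, the model will likely hit a separate shape error afterwards: SimpleRNN expects 3-D (batch, timesteps, features) input, while an Embedding layer fed one token id per sample yields 2-D output. Grouping token ids into fixed-length windows before training is one way around that, but it is a different issue from the ValueError asked about here.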