Implementing word2vec in TensorFlow: errors and fixes

I started by working through the book, then implemented word2vec with its companion code: https://github.com/PacktPublishing/Natural-Language-Processing-with-TensorFlow/blob/master/ch3/ch3_word2vec.ipynb

Since my goal is recognizing domain-specific terms, I did not experiment with the existing English data; instead I used a small domain corpus I collected myself, segmented it with jieba, and then ran word2vec on the result. This post covers the errors I hit and how I fixed them; I'll walk through the code in detail when I have time.
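For reference, here is a minimal sketch of the segmentation step, assuming the raw corpus lives in a file named corpus.txt (a hypothetical name; the post does not give the actual path):

```python
import jieba

# Hypothetical file names; substitute your own corpus path.
with open('corpus.txt', encoding='utf-8') as f:
    raw_text = f.read()

# jieba.lcut returns a list of tokens; the word2vec notebook
# expects a flat token list to build its dictionary from.
tokens = [tok for tok in jieba.lcut(raw_text) if tok.strip()]

with open('corpus_seg.txt', 'w', encoding='utf-8') as f:
    f.write(' '.join(tokens))
```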

1. In the "Generating Batches of Data for Skip-Gram" stage, this error comes up:

```
  print('    batch:', [reverse_dictionary[bi] for bi in batch])
KeyError: 326960996
```

The cause: batch starts out as an np.ndarray created without initialization, so it holds arbitrary values. The generator fills it num_samples = 2 * window_size entries at a time (one group per center word), so when batch_size is not evenly divisible by 2 * window_size, the trailing entries keep their garbage values (such as the 326960996 in the error above), and looking those up as keys in reverse_dictionary inevitably raises a KeyError. A worked example makes it clear:

```python
# data = [44, 45, 46, 47, 48, 49, 0, 0, 0, 5, 0, 0, 0, 15, 16, ...]
# Example: batch_size=16, window_size=1, buffer (deque) length=3, num_samples=2
# batch = [45, 45, 46, 46, 47, 47, 48, 48, 49, 49, 0, 0, 0, 0, 0, 0]
```
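A simple guard against this, my own addition rather than anything from the notebook, is to fail fast on the divisibility constraint before generating batches:

```python
# Added guard (not in the notebook): fail fast instead of letting
# uninitialized np.ndarray entries leak into reverse_dictionary lookups.
batch_size, window_size = 16, 1
num_samples = 2 * window_size  # context samples emitted per center word
assert batch_size % num_samples == 0, (
    'batch_size must be a multiple of 2 * window_size; otherwise the '
    'last batch_size % num_samples entries of batch stay uninitialized')
```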
2. A tf.placeholder error, caused by the TensorFlow version. The notebook is written against TensorFlow 1.x, but tf.placeholder was removed in TensorFlow 2.0 in favor of tf.compat.v1.placeholder, so on TF 2.x the graph-construction code fails with an AttributeError saying tensorflow has no attribute placeholder. If you are on TensorFlow 2.0 or later, either replace every tf.placeholder with tf.compat.v1.placeholder, or import the v1 compatibility module and disable v2 behavior; if you are on TensorFlow 1.x, the original tf.placeholder calls run as-is.
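A minimal sketch of the compatibility fix, with illustrative placeholder names and an assumed batch_size (the notebook's own names and values may differ):

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()  # run TF1-style graph code under TF 2.x

batch_size = 128  # assumed value; use whatever the notebook sets

# v1-style placeholders for the skip-gram inputs: one center-word id
# and one context-word id per row, matching the batch layout above.
train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
```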