Contents: Advanced Match Algorithm and Online Serving
I. Common Algorithms for Learning Embeddings
1. Matrix Factorization
3. Topic Model
A topic model is a generative model and belongs to unsupervised learning.
Embedding from topic model
LDA adds a Dirichlet prior, i.e. extra prior knowledge: when a word is generated, a topic is first drawn from the Dirichlet-distributed topic mixture and the word is then drawn from that topic.
LDA in music recommendation
• Modeling:
song (doc) - lyrics (words)
user (doc) - songs (words)
• Applications:
similar songs: compute similarity between the topic distributions of the docs (sketched below)
playlist generation: for each topic, take the docs with the highest probability under that topic
LDA is a bag-of-words model and ignores word order; when the corpus is very large or low-frequency words matter, it performs only moderately well.
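A minimal sketch of the song (doc) - lyrics (words) modeling above, using gensim's LdaModel (a library choice made here for illustration, not part of the original material; the toy lyrics and the number of topics are made up):

# Sketch: gensim LDA over "song = doc, lyrics = words"; data is illustrative.
import numpy as np
from gensim import corpora, models

songs = [
    ['love', 'heart', 'night', 'dance'],
    ['love', 'tears', 'rain', 'heart'],
    ['guitar', 'road', 'whiskey', 'night'],
    ['road', 'truck', 'whiskey', 'dust'],
]

dictionary = corpora.Dictionary(songs)
corpus = [dictionary.doc2bow(song) for song in songs]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

def topic_vector(bow, num_topics=2):
    # dense doc-topic distribution, used as the song's embedding
    vec = np.zeros(num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

vecs = np.array([topic_vector(bow) for bow in corpus])

# similar songs: cosine similarity between doc-topic distributions
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
print('sim(song0, song1) =', cosine(vecs[0], vecs[1]))

# playlist generation: the songs with the highest probability under each topic
for t in range(2):
    print('topic %d playlist:' % t, np.argsort(-vecs[:, t])[:2])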
4. word2vec
Background:
Word2Vec model variants:
• Skip-gram predicts the context from the center word: P(context|word)
• CBOW (Continuous Bag of Words) predicts the center word from its context: P(word|context)
word2vec example:
word2vec training:
1. Hierarchical Softmax
Uses a binary Huffman tree whose leaves are the words; more frequent words sit closer to the root, reducing the per-word cost to O(log V).
2. Negative Sampling
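Both training tricks correspond directly to flags of gensim's Word2Vec; a hedged usage sketch, assuming gensim >= 4.0 (the toy sentences are illustrative):

# Sketch: skip-gram trained with hierarchical softmax vs. negative sampling.
from gensim.models import Word2Vec

sentences = [
    ['user', 'watched', 'video_a', 'video_b'],
    ['user', 'watched', 'video_b', 'video_c'],
]  # toy corpus; in practice, tokenized text or behavior sequences

# hierarchical softmax: hs=1, negative=0 -> Huffman tree over the vocabulary, O(log V) per word
w2v_hs = Word2Vec(sentences, vector_size=32, window=2, sg=1, hs=1, negative=0, min_count=1)

# negative sampling: hs=0, negative=k -> k noise words sampled per positive pair
w2v_neg = Word2Vec(sentences, vector_size=32, window=2, sg=1, hs=0, negative=5, min_count=1)

print(w2v_neg.wv.most_similar('video_b', topn=2))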
5. DNN (YouTube)
Training: learn user vectors and video vectors through a classification task:
• each video is one class; videos watched to completion are the positive examples
• the output of the last hidden layer is used as the user vector (user + context)
• video vectors? either (1) pre-trained and fed in, or (2) trained jointly with the model
Serving: given the user vector, retrieve the top-K videos with the highest vector similarity (a minimal sketch follows).
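A minimal sketch of this serving step, assuming user and video vectors are already L2-normalized so the inner product equals cosine similarity (all shapes and data below are illustrative):

# Sketch: top-K retrieval by inner product between a user vector and all video vectors.
import numpy as np

num_videos, dim, K = 100000, 64, 10
video_vectors = np.random.randn(num_videos, dim).astype(np.float32)   # illustrative embeddings
video_vectors /= np.linalg.norm(video_vectors, axis=1, keepdims=True)
user_vector = np.random.randn(dim).astype(np.float32)
user_vector /= np.linalg.norm(user_vector)

scores = video_vectors @ user_vector                    # cosine similarity for normalized vectors
topk = np.argpartition(-scores, K)[:K]                  # unordered top-K candidates
topk = topk[np.argsort(-scores[topk])]                  # sort the K candidates by score
print(topk, scores[topk])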
Lessons YouTube reported from their own production use:
• random negative sampling worked better than hierarchical softmax
• use site-wide watch data, not only data collected from the recommendation surface
• generate a fixed number of training examples per user
• discard the ordering of search queries
• use only historical information as input, i.e. predict forward in time (otherwise the model ends up recommending episode 1 to a user who has just watched episode 2)
How to choose an algorithm?
1. Supervised or unsupervised?
Watch completion? supervised; favorites list? unsupervised
2. Ordered or unordered?
Video plays? ordered; Airbnb listings? unordered
3. Number of items?
LDA becomes slow once the item count is large
4. Latency requirements?
DNN serving is fast
5. Diversity?
Blend multiple models
6. Business scenario?
II. Online Match Serving
- A general retrieval (recall) framework
Left side of the figure: user actions
item1/brand1 (purple): the triggers
Right side: the candidate item list, produced by K-Nearest Neighbors search (LSH, k-d tree, ball tree)
K-V storage:
• NoSQL storage systems: store key-value / document data and are flexible to change;
no JOIN operations, so operations are simple and fast
• a key-value store is one kind of NoSQL storage
• HBase: distributed and persistent (hard to lose data), commonly used for big-data storage
• Redis: in-memory and fast, commonly used as a cache (a usage sketch follows below)
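For instance, the trigger -> candidate-list mapping from the retrieval framework can be precomputed offline and served from Redis; a sketch assuming redis-py and a local Redis instance (the key naming is illustrative):

# Sketch: store and fetch precomputed candidate lists in Redis.
import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# offline job: write the top candidates for each trigger item
r.set('match:item1', json.dumps(['item17', 'item42', 'item99']))

# online serving: one GET per trigger, then merge/rank the candidates
candidates = json.loads(r.get('match:item1'))
print(candidates)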
Match Online Serving
Retrieval methods:
K-Nearest Neighbors search
(1) Sharding (brute-force search parallelized across multiple machines)
(2) Hashing
Locality-sensitive hashing (LSH): similar vectors are hashed into the same bucket
Examples: SimHash, MinHash
Reference: https://www.cnblogs.com/maybe2030/p/5203186.html
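One simple LSH family for cosine similarity is random hyperplane projection; a minimal illustrative sketch (not from the reference above):

# Sketch: random-hyperplane LSH -- similar vectors (by cosine) tend to share a bucket.
import numpy as np

rng = np.random.default_rng(0)
dim, num_bits = 64, 16
planes = rng.standard_normal((num_bits, dim))   # one random hyperplane per hash bit

def lsh_key(v):
    # sign of each projection gives one bit; the bit string is the bucket key
    bits = (planes @ v) > 0
    return ''.join('1' if b else '0' for b in bits)

buckets = {}
vectors = rng.standard_normal((1000, dim))
for i, v in enumerate(vectors):
    buckets.setdefault(lsh_key(v), []).append(i)

query = vectors[0] + 0.01 * rng.standard_normal(dim)    # a near-duplicate of vector 0
print(0 in buckets.get(lsh_key(query), []))             # likely True: bucket gives the candidate set to re-rank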
(3) k-d tree
• k-dimensional tree: a data structure for organizing points in k-dimensional Euclidean space
• each node represents a hyperplane perpendicular to the coordinate axis of the current splitting dimension
• it is a binary tree; each node splits on one dimension d: every point in the left subtree has a d-th coordinate smaller than the split value, and every point in the right subtree has a d-th coordinate greater than or equal to it
• in a balanced k-d tree, every leaf is at approximately the same distance from the root
k-d tree construction:
1> Cycle through the axes with tree depth to pick the splitting dimension.
2> Take the median of the points along that dimension as the splitting hyperplane; points on the left of the median go into the left subtree, points on the right into the right subtree.
3> Recurse on the subtrees until every point has been placed (a leaf node may hold several points).
k-d tree search example:
Search the built k-d tree for the nearest neighbor of (3, 5)
1> Start at the root and recurse downward
2> Reach a leaf node and take it as the "current best"
3> Unwind the recursion and, at each node passed on the way back up:
• if the node is closer to the query point than the current best, make it the new current best
• check whether the other subtree could contain a closer point; if it could, search down from that node
4> The nearest-neighbor search finishes once the root node has been processed (see the SciPy sketch below)
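In practice the tree is usually not hand-rolled; a sketch of the same kind of query with scipy.spatial.cKDTree (the sample points are illustrative):

# Sketch: nearest-neighbor query on a k-d tree using SciPy.
import numpy as np
from scipy.spatial import cKDTree

points = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
tree = cKDTree(points)                 # builds the tree with median splits

dist, idx = tree.query([3, 5], k=1)    # backtracking NN search as described above
print('nearest neighbor:', points[idx], 'distance:', dist)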
(4) Ball tree
Construction:
Search:
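Since the construction and search figures are not reproduced here, a usage sketch with scikit-learn's BallTree may help (parameters and data are illustrative):

# Sketch: k-NN query with scikit-learn's BallTree (its nodes are hyperspheres rather than axis-aligned boxes).
import numpy as np
from sklearn.neighbors import BallTree

X = np.random.RandomState(0).randn(10000, 64)    # illustrative item embeddings
tree = BallTree(X, leaf_size=40)

dist, ind = tree.query(X[:1], k=5)               # 5 nearest neighbors of the first item
print(ind[0], dist[0])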
3. K-Nearest Neighbors search: summary
• offline computation (when latency requirements are loose) + online storage
• brute force # exact
• sharding (parallel brute force) # exact
• hashing # approximate
• space partitioning (ball tree) # approximate
• only compute distances within a subset of clusters # approximate
Evaluating approximate algorithms: recall against the exact result
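A sketch of that evaluation: compare an approximate top-K result against the exact brute-force top-K and report recall@K (the data and the "approximate" method below are synthetic stand-ins):

# Sketch: recall@K of an approximate NN result against the exact brute-force result.
import numpy as np

rng = np.random.default_rng(0)
items = rng.standard_normal((5000, 32))
query = rng.standard_normal(32)
K = 10

# exact ground truth by brute force
exact = set(np.argsort(-(items @ query))[:K])

# stand-in for an approximate method: brute force over a random 50% sample of the items
sample = rng.choice(len(items), size=len(items) // 2, replace=False)
approx = set(sample[np.argsort(-(items[sample] @ query))[:K]])

recall_at_k = len(exact & approx) / K
print('recall@%d = %.2f' % (K, recall_at_k))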
Word Embedding Exercise
Train a word2vec skip-gram model on the text8 data from http://mattmahoney.net/dc/textdata
1. Import libraries
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf  # note: this exercise uses the TensorFlow 1.x API (tf.placeholder, tf.Session)
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE
print('check:libs well prepared')
2. Download and verify the dataset
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
    # download the file only if it is not already on disk
    if not os.path.exists(filename):
        print('download...')
        filename, _ = urlretrieve(url + filename, filename)
    # verify the file size against the expected number of bytes
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print('exception %s' % statinfo.st_size)
    return filename

filename = maybe_download('text8.zip', 31344016)
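The original notes never show how the word list words (consumed by build_dataset in the next step) is read out of text8.zip; a minimal sketch of that missing step:

# Sketch of the omitted step: read text8.zip into a flat list of word tokens.
def read_data(filename):
    # the archive contains a single file of space-separated lowercase words
    with zipfile.ZipFile(filename) as f:
        text = f.read(f.namelist()[0]).decode('utf-8')
    return text.split()

words = read_data(filename)
print('Data size %d' % len(words))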
3. Encode the words and replace low-frequency words with UNK
vocabulary_size = 50000

def build_dataset(words):
    count = [['UNK', -1]]
    # occurrence counts of the (vocabulary_size - 1) most common words
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    # word -> integer id
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # id 0 is reserved for UNK
            unk_count = unk_count + 1
        data.append(index)
    count[0][1] = unk_count
    # integer id -> word
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

# the training data after mapping words to ids
data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('original data', words[:10])
print('training data', data[:10])
Output:
Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
original data ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
training data [5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156]
4. Generate skip-gram training data
def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    # batch holds center words (x), labels holds the corresponding context words (y)
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window context | center word | skip_window context ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        # wrap around so the data can be reused
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # the center word sits in the middle of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            # pick a context position not yet used for this center word
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])

data_index = 0
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=2)
print(' batch:', [reverse_dictionary[bi] for bi in batch])
print(' labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
Output:
data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first']
batch: ['as', 'as', 'a', 'a', 'term', 'term', 'of', 'of']
labels: ['originated', 'anarchism', 'of', 'term', 'as', 'a', 'abuse', 'first']
5. Define the network structure
batch_size = 128
embedding_size = 128  # dimensionality of the embedding vectors
skip_window = 1       # number of context words considered on each side of the center word
num_skips = 2         # number of (center, context) pairs generated per center word
valid_size = 16       # number of words used to inspect nearest neighbors during training
valid_window = 100    # validation words are drawn from the 100 most frequent words
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64      # number of negative classes sampled for the sampled softmax

graph = tf.Graph()
with graph.as_default(), tf.device('/cpu:0'):
    # input data
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    # variables
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
    # embeddings of the center words in the current batch
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    # batch loss: sampled softmax over the vocabulary
    loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
                                   labels=train_labels, num_sampled=num_sampled,
                                   num_classes=vocabulary_size))
    # minimize the loss and update the parameters
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
    # normalize the embeddings to unit length
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    # use the current embeddings to find words similar to the validation words
    valid_embeddings = tf.nn.embedding_lookup(
        normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
6. Run the training loop
num_steps = 100000

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    average_loss = 0
    for step in range(num_steps + 1):
        batch_data, batch_labels = generate_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        # print the average loss every 2000 steps
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            print('Average loss at step %d: %f' % (step, average_loss))
            average_loss = 0
        # print the nearest neighbors of the validation words every 10000 steps
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 5  # the 5 most similar words
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    final_embeddings = normalized_embeddings.eval()
Sample output:
Average loss at step 0: 8.125039
Nearest to their: liliuokalani, kobe, aeolian, judeo, gutman,
Nearest to state: emulates, matching, heritage, coder, rebounding,
Nearest to the: super, represent, whitacre, swine, clothing,
Nearest to system: populace, harshness, bungee, pounds, nist,
Nearest to between: infimum, macedonians, abyss, ziegler, lorica,
Nearest to such: wrath, comecon, ignite, winfield, revolution,
Nearest to up: coexist, breads, applesoft, azores, dogs,
Nearest to this: apart, vorarlberg, par, jardines, syntax,
Nearest to if: knowles, hindi, defeated, biochemical, lonergan,
Nearest to from: usp, martov, hormonal, pd, clouds,
Nearest to s: mediating, bit, challenges, lys, lavos,
Nearest to however: eps, lambdamoo, eternally, zanetti, cavers,
Nearest to over: yardbirds, duct, mayer, breaks, plagues,
......
Nearest to between: with, within, among, through, against,
Nearest to such: well, intelligent, known, certain, these,
Nearest to up: out, off, down, back, them,
Nearest to this: which, it, another, itself, some,
Nearest to if: when, though, where, before, because,
Nearest to from: through, into, protestors, muir, in,
Nearest to s: whose, his, isbn, my, dedicates,
Nearest to however: but, although, that, though, especially,
Nearest to over: overshadowed, around, between, through, within,
Nearest to also: often, still, now, never, sometimes,
Nearest to used: designed, referred, seen, considered, known,
Nearest to on: upon, in, through, under, bathroom,
7. Visualization
num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points + 1, :])

def plot(embeddings, labels):
    assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
    pylab.figure(figsize=(20, 20))  # in inches
    for i, label in enumerate(labels):
        x, y = embeddings[i, :]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                       ha='right', va='bottom')
    pylab.show()

# plot the 400 most frequent words (index 0 is UNK and is skipped)
words = [reverse_dictionary[i] for i in range(1, num_points + 1)]
plot(two_d_embeddings, words)