Contents: Advanced Match Algorithm and Online Serving
I. Common Algorithms for Learning Embeddings
1. Matrix Factorization
3. Topic Model
A topic model is a generative model and belongs to unsupervised learning.
Embedding from topic model
LDA adds a Dirichlet prior, i.e. extra prior knowledge: when a word is generated, a topic is first drawn from the Dirichlet-distributed topic mixture and the word is then drawn from that topic.
LDA in music recommendation
• Modeling:
song (doc) - lyrics (words)
user (doc) - songs (words)
• Applications:
similar songs: compute similarity between the topic distributions of the docs (sketched below)
playlist generation: for each topic, take the docs with the highest probability under that topic
LDA is a bag-of-words model and ignores word order; when the corpus is very large or low-frequency words matter, it performs only moderately well.
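A minimal sketch of the song (doc) - lyrics (words) modeling above, using gensim's LdaModel (a library choice made here for illustration, not part of the original material; the toy lyrics and the number of topics are made up):

# Sketch: gensim LDA over "song = doc, lyrics = words"; data is illustrative.
import numpy as np
from gensim import corpora, models

songs = [
    ['love', 'heart', 'night', 'dance'],
    ['love', 'tears', 'rain', 'heart'],
    ['guitar', 'road', 'whiskey', 'night'],
    ['road', 'truck', 'whiskey', 'dust'],
]

dictionary = corpora.Dictionary(songs)
corpus = [dictionary.doc2bow(song) for song in songs]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

def topic_vector(bow, num_topics=2):
    # dense doc-topic distribution, used as the song's embedding
    vec = np.zeros(num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

vecs = np.array([topic_vector(bow) for bow in corpus])

# similar songs: cosine similarity between doc-topic distributions
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
print('sim(song0, song1) =', cosine(vecs[0], vecs[1]))

# playlist generation: the songs with the highest probability under each topic
for t in range(2):
    print('topic %d playlist:' % t, np.argsort(-vecs[:, t])[:2])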
4. word2vec
Background:
Word2Vec model variants:
• Skip-gram predicts the context from the center word: P(context|word)
• CBOW (Continuous Bag of Words) predicts the center word from its context: P(word|context)
word2vec example:
word2vec training:
1. Hierarchical Softmax
Uses a binary Huffman tree whose leaves are the words; more frequent words sit closer to the root, reducing the per-word cost to O(log V).
2. Negative Sampling
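Both training tricks correspond directly to flags of gensim's Word2Vec; a hedged usage sketch, assuming gensim >= 4.0 (the toy sentences are illustrative):

# Sketch: skip-gram trained with hierarchical softmax vs. negative sampling.
from gensim.models import Word2Vec

sentences = [
    ['user', 'watched', 'video_a', 'video_b'],
    ['user', 'watched', 'video_b', 'video_c'],
]  # toy corpus; in practice, tokenized text or behavior sequences

# hierarchical softmax: hs=1, negative=0 -> Huffman tree over the vocabulary, O(log V) per word
w2v_hs = Word2Vec(sentences, vector_size=32, window=2, sg=1, hs=1, negative=0, min_count=1)

# negative sampling: hs=0, negative=k -> k noise words sampled per positive pair
w2v_neg = Word2Vec(sentences, vector_size=32, window=2, sg=1, hs=0, negative=5, min_count=1)

print(w2v_neg.wv.most_similar('video_b', topn=2))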
5. DNN (YouTube)
Training: learn user vectors and video vectors through a classification task:
• each video is one class; videos watched to completion are the positive examples
• the output of the last hidden layer is used as the user vector (user + context)
• video vectors? either (1) pre-trained and fed in, or (2) trained jointly with the model
Serving: given the user vector, retrieve the top-K videos with the highest vector similarity (a minimal sketch follows).
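A minimal sketch of this serving step, assuming user and video vectors are already L2-normalized so the inner product equals cosine similarity (all shapes and data below are illustrative):

# Sketch: top-K retrieval by inner product between a user vector and all video vectors.
import numpy as np

num_videos, dim, K = 100000, 64, 10
video_vectors = np.random.randn(num_videos, dim).astype(np.float32)   # illustrative embeddings
video_vectors /= np.linalg.norm(video_vectors, axis=1, keepdims=True)
user_vector = np.random.randn(dim).astype(np.float32)
user_vector /= np.linalg.norm(user_vector)

scores = video_vectors @ user_vector                    # cosine similarity for normalized vectors
topk = np.argpartition(-scores, K)[:K]                  # unordered top-K candidates
topk = topk[np.argsort(-scores[topk])]                  # sort the K candidates by score
print(topk, scores[topk])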
Lessons YouTube reported from their own production use:
• random negative sampling worked better than hierarchical softmax
• use site-wide watch data, not only data collected from the recommendation surface
• generate a fixed number of training examples per user
• discard the ordering of search queries
• use only historical information as input, i.e. predict forward in time (otherwise the model ends up recommending episode 1 to a user who has just watched episode 2)
How to choose an algorithm?
1. Supervised or unsupervised?
Watch completion? supervised; favorites list? unsupervised
2. Ordered or unordered?
Video plays? ordered; Airbnb listings? unordered
3. Number of items?
LDA becomes slow once the item count is large
4. Latency requirements?
DNN serving is fast
5. Diversity?
Blend multiple models
6. Business scenario?
II. Online Match Serving
- A general retrieval (recall) framework
Left side of the figure: user actions
item1/brand1 (purple): the triggers
Right side: the candidate item list, produced by K-Nearest Neighbors search (LSH, k-d tree, ball tree)
K-V storage:
• NoSQL storage systems: store key-value / document data and are flexible to change;
no JOIN operations, so operations are simple and fast
• a key-value store is one kind of NoSQL storage
• HBase: distributed and persistent (hard to lose data), commonly used for big-data storage
• Redis: in-memory and fast, commonly used as a cache (a usage sketch follows below)
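For instance, the trigger -> candidate-list mapping from the retrieval framework can be precomputed offline and served from Redis; a sketch assuming redis-py and a local Redis instance (the key naming is illustrative):

# Sketch: store and fetch precomputed candidate lists in Redis.
import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# offline job: write the top candidates for each trigger item
r.set('match:item1', json.dumps(['item17', 'item42', 'item99']))

# online serving: one GET per trigger, then merge/rank the candidates
candidates = json.loads(r.get('match:item1'))
print(candidates)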
Match Online Serving
Retrieval methods:
K-Nearest Neighbors search
(1) Sharding (brute-force search parallelized across multiple machines)
(2) Hashing
Locality-sensitive hashing (LSH): similar vectors are hashed into the same bucket
Examples: SimHash, MinHash
Reference: https://www.cnblogs.com/maybe2030/p/5203186.html
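One simple LSH family for cosine similarity is random hyperplane projection; a minimal illustrative sketch (not from the reference above):

# Sketch: random-hyperplane LSH -- similar vectors (by cosine) tend to share a bucket.
import numpy as np

rng = np.random.default_rng(0)
dim, num_bits = 64, 16
planes = rng.standard_normal((num_bits, dim))   # one random hyperplane per hash bit

def lsh_key(v):
    # sign of each projection gives one bit; the bit string is the bucket key
    bits = (planes @ v) > 0
    return ''.join('1' if b else '0' for b in bits)

buckets = {}
vectors = rng.standard_normal((1000, dim))
for i, v in enumerate(vectors):
    buckets.setdefault(lsh_key(v), []).append(i)

query = vectors[0] + 0.01 * rng.standard_normal(dim)    # a near-duplicate of vector 0
print(0 in buckets.get(lsh_key(query), []))             # likely True: bucket gives the candidate set to re-rank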
(3) k-d tree
• k-dimensional tree: a data structure for organizing points in k-dimensional Euclidean space
• each node represents a hyperplane perpendicular to the coordinate axis of the current splitting dimension
• it is a binary tree; each node splits on one dimension d: every point in the left subtree has a d-th coordinate smaller than the split value, and every point in the right subtree has a d-th coordinate greater than or equal to it
• in a balanced k-d tree, every leaf is at approximately the same distance from the root
k-d tree construction:
1> Cycle through the axes with tree depth to pick the splitting dimension.
2> Take the median of the points along that dimension as the splitting hyperplane; points on the left of the median go into the left subtree, points on the right into the right subtree.
3> Recurse on the subtrees until every point has been placed (a leaf node may hold several points).
k-d tree search example:
Search the built k-d tree for the nearest neighbor of (3, 5)
1> Start at the root and recurse downward
2> Reach a leaf node and take it as the "current best"
3> Unwind the recursion and, at each node passed on the way back up:
• if the node is closer to the query point than the current best, make it the new current best
• check whether the other subtree could contain a closer point; if it could, search down from that node
4> The nearest-neighbor search finishes once the root node has been processed (see the SciPy sketch below)
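In practice the tree is usually not hand-rolled; a sketch of the same kind of query with scipy.spatial.cKDTree (the sample points are illustrative):

# Sketch: nearest-neighbor query on a k-d tree using SciPy.
import numpy as np
from scipy.spatial import cKDTree

points = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
tree = cKDTree(points)                 # builds the tree with median splits

dist, idx = tree.query([3, 5], k=1)    # backtracking NN search as described above
print('nearest neighbor:', points[idx], 'distance:', dist)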
(4) Ball tree
Construction:
Search:
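Since the construction and search figures are not reproduced here, a usage sketch with scikit-learn's BallTree may help (parameters and data are illustrative):

# Sketch: k-NN query with scikit-learn's BallTree (its nodes are hyperspheres rather than axis-aligned boxes).
import numpy as np
from sklearn.neighbors import BallTree

X = np.random.RandomState(0).randn(10000, 64)    # illustrative item embeddings
tree = BallTree(X, leaf_size=40)

dist, ind = tree.query(X[:1], k=5)               # 5 nearest neighbors of the first item
print(ind[0], dist[0])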
3. K-Nearest Neighbors search: summary
• offline computation (when latency requirements are loose) + online storage
• brute force # exact
• sharding (parallel brute force) # exact
• hashing # approximate
• space partitioning (ball tree) # approximate
• only compute distances within a subset of clusters # approximate
Evaluating approximate algorithms: recall against the exact result
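A sketch of that evaluation: compare an approximate top-K result against the exact brute-force top-K and report recall@K (the data and the "approximate" method below are synthetic stand-ins):

# Sketch: recall@K of an approximate NN result against the exact brute-force result.
import numpy as np

rng = np.random.default_rng(0)
items = rng.standard_normal((5000, 32))
query = rng.standard_normal(32)
K = 10

# exact ground truth by brute force
exact = set(np.argsort(-(items @ query))[:K])

# stand-in for an approximate method: brute force over a random 50% sample of the items
sample = rng.choice(len(items), size=len(items) // 2, replace=False)
approx = set(sample[np.argsort(-(items[sample] @ query))[:K]])

recall_at_k = len(exact & approx) / K
print('recall@%d = %.2f' % (K, recall_at_k))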
Word Embedding Exercise
Train a word2vec skip-gram model on the text8 data from http://mattmahoney.net/dc/textdata
1. Import libraries
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf  # note: this exercise uses the TensorFlow 1.x API (tf.placeholder, tf.Session)
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE
print('check:libs well prepared')
2. Download and verify the dataset
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
    # download the file only if it is not already on disk
    if not os.path.exists(filename):
        print('download...')
        filename, _ = urlretrieve(url + filename, filename)
    # verify the file size against the expected number of bytes
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print('exception %s' % statinfo.st_size)
    return filename

filename = maybe_download('text8.zip', 31344016)
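The original notes never show how the word list words (consumed by build_dataset in the next step) is read out of text8.zip; a minimal sketch of that missing step:

# Sketch of the omitted step: read text8.zip into a flat list of word tokens.
def read_data(filename):
    # the archive contains a single file of space-separated lowercase words
    with zipfile.ZipFile(filename) as f:
        text = f.read(f.namelist()[0]).decode('utf-8')
    return text.split()

words = read_data(filename)
print('Data size %d' % len(words))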
3. Encode the words and replace low-frequency words with UNK
vocabulary_size = 50000

def build_dataset(words):
    count = [['UNK', -1]]
    # occurrence counts of the (vocabulary_size - 1) most common words
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    # word -> integer id
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # id 0 is reserved for UNK
            unk_count = unk_count + 1
        data.append(index)
    count[0][1] = unk_count
    # integer id -> word
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

# the training data after mapping words to ids
data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('original data', words[:10])
print('training data', data[:10])
Output:
Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
original data ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
training data [5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156]
4. Generate skip-gram training data
def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    # batch holds center words (x), labels holds the corresponding context words (y)
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window context | center word | skip_window context ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        # wrap around so the data can be reused
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # the center word sits in the middle of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            # pick a context position not yet used for this center word
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])

data_index = 0
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=2)
print(' batch:', [reverse_dictionary[bi] for bi in batch])
print(' labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
Output:
data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first']
batch: ['as', 'as', 'a', 'a', 'term', 'term', 'of', 'of']
labels: ['originated', 'anarchism', 'of', 'term', 'as', 'a', 'abuse', 'first']
5. Define the network structure
batch_size = 128
embedding_size = 128  # dimensionality of the embedding vectors
skip_window = 1       # number of context words considered on each side of the center word
num_skips = 2         # number of (center, context) pairs generated per center word
valid_size = 16       # number of words used to inspect nearest neighbors during training
valid_window = 100    # validation words are drawn from the 100 most frequent words
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64      # number of negative classes sampled for the sampled softmax

graph = tf.Graph()
with graph.as_default(), tf.device('/cpu:0'):
    # input data
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    # variables
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
    # embeddings of the center words in the current batch
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    # batch loss: sampled softmax over the vocabulary
    loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
                                   labels=train_labels, num_sampled=num_sampled,
                                   num_classes=vocabulary_size))
    # minimize the loss and update the parameters
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
    # normalize the embeddings to unit length
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    # use the current embeddings to find words similar to the validation words
    valid_embeddings = tf.nn.embedding_lookup(
        normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
6. Run the training loop
num_steps = 100000

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    average_loss = 0
    for step in range(num_steps + 1):
        batch_data, batch_labels = generate_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        # print the average loss every 2000 steps
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            print('Average loss at step %d: %f' % (step, average_loss))
            average_loss = 0
        # print the nearest neighbors of the validation words every 10000 steps
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 5  # the 5 most similar words
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    final_embeddings = normalized_embeddings.eval()
Sample output:
Average loss at step 0: 8.125039
Nearest to their: liliuokalani, kobe, aeolian, judeo, gutman,
Nearest to state: emulates, matching, heritage, coder, rebounding,
Nearest to the: super, represent, whitacre, swine, clothing,
Nearest to system: populace, harshness, bungee, pounds, nist,
Nearest to between: infimum, macedonians, abyss, ziegler, lorica,
Nearest to such: wrath, comecon, ignite, winfield, revolution,
Nearest to up: coexist, breads, applesoft, azores, dogs,
Nearest to this: apart, vorarlberg, par, jardines, syntax,
Nearest to if: knowles, hindi, defeated, biochemical, lonergan,
Nearest to from: usp, martov, hormonal, pd, clouds,
Nearest to s: mediating, bit, challenges, lys, lavos,
Nearest to however: eps, lambdamoo, eternally, zanetti, cavers,
Nearest to over: yardbirds, duct, mayer, breaks, plagues,
......
Nearest to between: with, within, among, through, against,
Nearest to such: well, intelligent, known, certain, these,
Nearest to up: out, off, down, back, them,
Nearest to this: which, it, another, itself, some,
Nearest to if: when, though, where, before, because,
Nearest to from: through, into, protestors, muir, in,
Nearest to s: whose, his, isbn, my, dedicates,
Nearest to however: but, although, that, though, especially,
Nearest to over: overshadowed, around, between, through, within,
Nearest to also: often, still, now, never, sometimes,
Nearest to used: designed, referred, seen, considered, known,
Nearest to on: upon, in, through, under, bathroom,
7. Visualization
num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points + 1, :])

def plot(embeddings, labels):
    assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
    pylab.figure(figsize=(20, 20))  # in inches
    for i, label in enumerate(labels):
        x, y = embeddings[i, :]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                       ha='right', va='bottom')
    pylab.show()

# plot the 400 most frequent words (index 0 is UNK and is skipped)
words = [reverse_dictionary[i] for i in range(1, num_points + 1)]
plot(two_d_embeddings, words)