ML: Recommender Systems
1. Content-based recommendations
Question: given users' ratings (1–5 stars) of some of the movies, and knowing each movie's content makeup, how can we predict a user's rating for movies they have not seen and make recommendations?
1.1 Problem formulation
First, some notation:
$n_u$ = number of users
$n_m$ = number of movies
$r(i,j)=1$ if user $j$ has rated movie $i$
$y^{(i,j)}$ = rating given by user $j$ to movie $i$, defined only when $r(i,j)=1$
$x^{(i)}$ = feature vector for movie $i$; for example, $x_1$ is the movie's degree of romance content and $x_2$ its degree of action content. $x_0$ is the bias feature, fixed at 1.
$\theta^{(j)}$ = parameter vector for user $j$; for example, $\theta_1$ measures how much the user likes romance movies and $\theta_2$ how much they like action movies. $\theta_0$ is the intercept parameter paired with $x_0$ (it is learned, not fixed).
For user $j$ and movie $i$, the predicted rating is $(\theta^{(j)})^T x^{(i)}$.
In the figure below, $n_u=4$ and $n_m=5$, and the two right-hand columns list each movie's features.
Movies 1, 2, and 3 are romance movies, while movies 4 and 5 are action movies.
User 3 (Carol) rates movies 4 and 5 highly and movies 1 and 3 very low, which suggests $(\theta^{(3)})^T=[0,0,5]$ is a likely estimate of her parameters.
Her rating of movie 2 would then be predicted as $(\theta^{(3)})^T x^{(2)} = [0,0,5]\,([1,1,0.01])^T = 0.05$.
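This prediction can be checked with a quick numerical sketch; the feature ordering [bias, romance, action] is my assumption about the example's layout:

```python
import numpy as np

# User 3's (Carol's) estimated parameters: [theta_0, romance weight, action weight]
theta_3 = np.array([0.0, 0.0, 5.0])
# Movie 2's features: [x_0 = 1 (bias), romance share, action share]
x_2 = np.array([1.0, 1.0, 0.01])

# Predicted rating for (user 3, movie 2) is (theta^(3))^T x^(2)
prediction = theta_3 @ x_2  # ≈ 0.05
```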
1.2 Optimization objective
Learning a single user's parameters $\theta^{(j)}$ is essentially a standard linear-regression problem:
$$\min_{\theta^{(j)}} \; \frac{1}{2m^{(j)}}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2m^{(j)}}\sum_{k=1}^{n}\left(\theta_k^{(j)}\right)^2$$
where $m^{(j)}$ is the number of movies rated by user $j$. Note that $k$ starts at 1: the bias term is not regularized.
In practice, when building a recommender system, we drop the constant $m^{(j)}$ for simplicity, so the objective becomes:
$$\min_{\theta^{(j)}} \; \frac{1}{2}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{k=1}^{n}\left(\theta_k^{(j)}\right)^2$$
More generally, to learn the parameters of all users $\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}$, we simply sum the $n_u$ linear-regression objectives:
$$\min_{\theta^{(1)},\dots,\theta^{(n_u)}} \; \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left(\theta_k^{(j)}\right)^2$$
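As a sketch of how this objective might be computed in practice (the matrix layout and names below are assumptions, not from the source): `Y` holds the ratings, `R` the indicator $r(i,j)$, and the bias column of `Theta` is excluded from the regularizer, matching the $k \geq 1$ sum.

```python
import numpy as np

def cost(Theta, X, Y, R, lam):
    """Regularized squared-error cost over all users.

    X:     (n_m, n+1) movie features (column 0 is the bias x_0 = 1)
    Theta: (n_u, n+1) user parameters
    Y:     (n_m, n_u) ratings, only meaningful where R == 1
    R:     (n_m, n_u) indicator, R[i, j] = 1 if user j rated movie i
    """
    err = (X @ Theta.T - Y) * R            # zero out unrated entries
    # Regularize theta_1..theta_n only; skip the bias column k = 0
    reg = lam / 2 * np.sum(Theta[:, 1:] ** 2)
    return 0.5 * np.sum(err ** 2) + reg
```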
1.3 Optimization algorithm
$$\min_{\theta^{(1)},\dots,\theta^{(n_u)}} \; \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left(\theta_k^{(j)}\right)^2$$
As before, we minimize with gradient descent; each step applies the updates:
$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)x_k^{(i)} \quad (\text{for } k = 0)$$
$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha\left(\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)x_k^{(i)} + \lambda\theta_k^{(j)}\right) \quad (\text{for } k \neq 0)$$
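These per-component updates can be vectorized over all users at once; a minimal sketch under the same matrix conventions (the names are mine):

```python
import numpy as np

def gradient_step(Theta, X, Y, R, alpha, lam):
    """One simultaneous gradient-descent update of every theta^(j).

    Theta: (n_u, n+1), X: (n_m, n+1), Y and R: (n_m, n_u).
    """
    err = (X @ Theta.T - Y) * R   # prediction error, zeroed where r(i,j) = 0
    grad = err.T @ X              # (n_u, n+1): each row sums over {i : r(i,j) = 1}
    reg = lam * Theta             # new array, so Theta itself is not mutated
    reg[:, 0] = 0.0               # k = 0 (bias) is not regularized
    return Theta - alpha * (grad + reg)
```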
2. Collaborative Filtering
2.1 Introduction
Question: given users' ratings (1–5 stars) of some of the movies, and knowing the users' parameters $\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}$, can we go the other way and solve for $x^{(i)}$?
Following the analysis in Part 1, the objective for learning $x^{(1)},x^{(2)},\dots,x^{(n_m)}$ from $\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}$ follows directly:
$$\min_{x^{(1)},\dots,x^{(n_m)}} \; \frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}\left(x_k^{(i)}\right)^2$$
In other words, knowing $\theta$ we can learn $x$, and knowing $x$ we can learn $\theta$.
We can therefore randomly initialize a set of $\theta$, learn $x$ from it, then use the learned features to optimize a better $\theta$, and so on. Repeating the $\theta \to x \to \theta \to x \to \dots$ iteration, the algorithm converges to a reasonable $\theta$ and $x$. This procedure is called collaborative filtering.
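The alternation above can be sketched with plain gradient steps on each side in turn. Everything below is illustrative: the learning rate, step counts, and the choice to drop the bias column entirely are my simplifications.

```python
import numpy as np

def alternate(Y, R, n_features, steps=200, alpha=0.01, lam=0.1, seed=0):
    """The theta -> x -> theta -> ... iteration of collaborative filtering.

    Y, R: (n_m, n_u) ratings and rated-indicator.
    Features carry no bias column here (a simplification on my part).
    """
    rng = np.random.default_rng(seed)
    n_m, n_u = Y.shape
    X = rng.normal(scale=0.1, size=(n_m, n_features))      # random init
    Theta = rng.normal(scale=0.1, size=(n_u, n_features))  # random init
    for _ in range(steps):
        err = (X @ Theta.T - Y) * R
        Theta = Theta - alpha * (err.T @ X + lam * Theta)  # learn theta, X fixed
        err = (X @ Theta.T - Y) * R
        X = X - alpha * (err @ Theta + lam * X)            # learn x, Theta fixed
    return X, Theta
```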
2.2 Collaborative filtering algorithm
Minimize over $\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}$ and $x^{(1)},x^{(2)},\dots,x^{(n_m)}$ simultaneously:
$$J = \frac{1}{2}\sum_{(i,j):r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}\left(x_k^{(i)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left(\theta_k^{(j)}\right)^2$$
$$\min_{\theta^{(1)},\dots,\theta^{(n_u)},\,x^{(1)},\dots,x^{(n_m)}} J\left(\theta^{(1)},\dots,\theta^{(n_u)},x^{(1)},\dots,x^{(n_m)}\right)$$
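The joint objective $J$ can be written as one function of both parameter sets; a sketch under assumed matrix names. Note that since the regularizers sum over $k = 1,\dots,n$ covering every learned feature, this formulation is commonly written without the $x_0 = 1$ bias convention, so every entry of both matrices is regularized.

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Joint collaborative-filtering cost J(x^(1..n_m), theta^(1..n_u)).

    X: (n_m, n) and Theta: (n_u, n), with no bias column:
    all entries of both X and Theta are learned and regularized.
    """
    err = (X @ Theta.T - Y) * R   # only entries with r(i,j) = 1 contribute
    return (0.5 * np.sum(err ** 2)
            + lam / 2 * np.sum(X ** 2)
            + lam / 2 * np.sum(Theta ** 2))
```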