First, convert the artworks (title plus author, since different works can share a title) and the raters into integer IDs:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Build two lookup tables from the Art table in deviantArtThreaded.db:
# one maps each artwork to an integer ID, the other maps each rater to one.
import sqlite3
db = sqlite3.connect('./deviantArtThreaded.db')
cur = db.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS Picture (ID INTEGER PRIMARY KEY AUTOINCREMENT, Name varchar(200))')
cur.execute('CREATE TABLE IF NOT EXISTS Follower (ID INTEGER PRIMARY KEY AUTOINCREMENT, Url varchar(200))')
# SQL string literals take single quotes; double quotes denote identifiers.
cur.execute("INSERT INTO Picture (Name) SELECT DISTINCT Title||' by '||Author FROM Art")
cur.execute('INSERT INTO Follower (Url) SELECT DISTINCT Follower FROM Art')
db.commit()  # without a commit the inserted rows are rolled back on close
db.close()
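This yields 2192 artworks, 58207 raters, and 127433 rating records in total. With artworks as rows and raters as columns, the density of the rating matrix is 127433 / (2192 * 58207) < 0.001.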
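As a quick check of that arithmetic, and of what a dense array of the same shape would cost, here is a minimal sketch. The three counts come from the text above; the float64 element type is an assumption (it is numpy's default).

```python
import numpy as np

# Counts reported above
num_pic, num_follow, num_ratings = 2192, 58207, 127433

cells = num_pic * num_follow            # total entries in the rating matrix
density = num_ratings / cells           # fraction of entries that are non-zero
# Assumption: one float64 (8 bytes) per entry, numpy's default dtype
dense_bytes = cells * np.dtype(np.float64).itemsize

print("cells:", cells)
print("density: %.6f" % density)
print("dense float64 size: %.2f GiB" % (dense_bytes / 2.0 ** 30))
```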
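A dense numpy matrix of this shape would take about 1 GB at float64 (2192 × 58207 × 8 bytes) just to hold mostly zeros, and I could not allocate it at all. The scipy.sparse module can build a sparse matrix instead: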
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import sqlite3
import scipy.sparse as sps
import scipy.io
db = sqlite3.connect('./deviantArtThreaded.db')
cu = db.cursor()
cu.execute('SELECT COUNT(ID) FROM Picture')
num_pic = cu.fetchone()[0]  # 2192
cu.execute('SELECT COUNT(ID) FROM Follower')
num_follow = cu.fetchone()[0]  # 58207
# LIL format supports cheap element-by-element assignment
A = sps.lil_matrix((num_pic, num_follow))
print('Sparse matrix initialized')
cu.execute('SELECT Title, Author, Follower FROM Art')
lst = cu.fetchall()
print('Number of records:', len(lst))
for item in lst:
    picName = item[0] + ' by ' + item[1]
    follower = item[2]
    # Parameterized queries, so quotes inside a title or URL cannot break the SQL
    cu.execute('SELECT ID FROM Picture WHERE Name = ?', (picName,))
    index1 = cu.fetchone()[0] - 1  # table IDs start at 1, matrix indices at 0
    cu.execute('SELECT ID FROM Follower WHERE Url = ?', (follower,))
    index2 = cu.fetchone()[0] - 1
    A[index1, index2] = 1
scipy.io.mmwrite('./A.mtx', A)
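The loop above issues two SELECTs per rating, which adds up over 127433 records. A possible alternative, sketched here rather than taken from the original script, is to load both index tables into dictionaries once and build the matrix from coordinate lists with `scipy.sparse.coo_matrix`. The in-memory database and its four rows are made-up stand-ins for deviantArtThreaded.db so that the snippet runs on its own.

```python
import sqlite3
import numpy as np
import scipy.sparse as sps

# Toy stand-in for deviantArtThreaded.db (made-up rows, for illustration only)
db = sqlite3.connect(':memory:')
cu = db.cursor()
cu.execute('CREATE TABLE Art (Title TEXT, Author TEXT, Follower TEXT)')
cu.executemany('INSERT INTO Art VALUES (?, ?, ?)', [
    ('Sunset', 'alice', 'http://example.com/u1'),
    ('Sunset', 'bob', 'http://example.com/u1'),
    ('Sunset', 'alice', 'http://example.com/u2'),
    ('Forest', 'bob', 'http://example.com/u2'),
])
cu.execute('CREATE TABLE Picture (ID INTEGER PRIMARY KEY AUTOINCREMENT, Name varchar(200))')
cu.execute('CREATE TABLE Follower (ID INTEGER PRIMARY KEY AUTOINCREMENT, Url varchar(200))')
cu.execute("INSERT INTO Picture (Name) SELECT DISTINCT Title||' by '||Author FROM Art")
cu.execute('INSERT INTO Follower (Url) SELECT DISTINCT Follower FROM Art')

# Load both index tables once instead of querying them for every rating
pic_index = {name: i - 1 for i, name in cu.execute('SELECT ID, Name FROM Picture')}
fol_index = {url: i - 1 for i, url in cu.execute('SELECT ID, Url FROM Follower')}

rows, cols = [], []
for title, author, follower in cu.execute('SELECT Title, Author, Follower FROM Art'):
    rows.append(pic_index[title + ' by ' + author])
    cols.append(fol_index[follower])

# Build from (row, col) coordinates, then convert to CSR for later arithmetic
A = sps.coo_matrix((np.ones(len(rows)), (rows, cols)),
                   shape=(len(pic_index), len(fol_index))).tocsr()
# The conversion sums duplicate coordinates; reset to 1 to match A[i, j] = 1
A.data[:] = 1
print(A.shape, A.nnz)
```

The dictionaries cost memory proportional to the number of artworks and raters, which is small next to the per-row query overhead they remove.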
The SVD is then easy to compute:
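import numpy as np
import scipy.sparse.linalg as spsl
import scipy.io
# Reload the rating matrix saved in the previous step
A = scipy.io.mmread('./A.mtx').tocsr()
U, S, Vt = spsl.svds(A)  # k defaults to 6
scipy.io.mmwrite('./U.mtx', U)
scipy.io.mmwrite('./Vt.mtx', Vt)
# svds returns S as a 1-D vector, which mmwrite rejects; store it as a diagonal matrix
scipy.io.mmwrite('./S.mtx', np.diag(S))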
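scipy.sparse.linalg.svds keeps k=6 singular values by default. Note that the S it returns is actually a vector holding the k diagonal entries, so passing it straight to mmwrite raises an error. Convert it back to a diagonal matrix (for example with numpy.diag) before saving.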
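To make the shapes concrete, here is a small sketch on a random sparse matrix (a hypothetical stand-in for the rating matrix, not the real data). It shows that `S` comes back one-dimensional, rebuilds the diagonal matrix with `numpy.diag`, and measures how far the rank-k product is from the original.

```python
import numpy as np
import scipy.sparse as sps
import scipy.sparse.linalg as spsl

# Random stand-in for the rating matrix (hypothetical data)
A = sps.random(50, 200, density=0.05, format='csr', random_state=0)

k = 6
U, S, Vt = spsl.svds(A, k=k)
print(U.shape, S.shape, Vt.shape)  # (50, 6) (6,) (6, 200)

# S is a length-k vector; rebuild the k x k diagonal matrix explicitly
S_diag = np.diag(S)

# Rank-k approximation of A from the three factors
A_k = U.dot(S_diag).dot(Vt)
dense = A.toarray()
err = np.linalg.norm(dense - A_k) / np.linalg.norm(dense)
print("relative Frobenius error of the rank-%d approximation: %.3f" % (k, err))
```

Raising `k` lowers the error at the cost of larger factors; the default of 6 is small for a matrix with thousands of rows.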
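Now load U, S and Vt, making sure S is a diagonal matrix. For any user u with a known preference vector Vu, the predicted rating vector Au can then be computed as follows (for the shapes to match, Vu has to be a 1 × k row in the latent space, for example user u's column of Vt):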
import numpy as np
P = np.dot(U, S)  # shape (num_pic, k)
Au = np.dot(P, Vu.transpose())  # predicted score of every artwork for user u
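The post does not say how Vu is obtained. The sketch below shows two options under stated assumptions: for a user already in the matrix, Vu is taken as that user's column of Vt; for a new user, Vu is computed by the standard "fold-in" projection Vu = aᵀ U S⁻¹, where `a` is the user's rating vector over artworks. The random matrix, the user index `j`, and the liked artworks are all hypothetical.

```python
import numpy as np
import scipy.sparse as sps
import scipy.sparse.linalg as spsl

# Random 0/1 stand-in for the rating matrix (hypothetical data)
A = sps.random(50, 200, density=0.05, format='csr', random_state=0)
A.data[:] = 1.0
U, s, Vt = spsl.svds(A, k=6)
S = np.diag(s)
P = np.dot(U, S)                      # (num_pic, k)

# Existing user j: Vu is column j of Vt, viewed as a 1 x k row
j = 3
Vu = Vt[:, j].reshape(1, -1)
Au = np.dot(P, Vu.transpose())        # (num_pic, 1), column j of the rank-k matrix

# New user: fold the rating vector a into the latent space (assumed approach)
a = np.zeros(A.shape[0])
a[[1, 4, 7]] = 1.0                    # artworks this user liked (made up)
Vu_new = (a.dot(U) / s).reshape(1, -1)   # a^T U S^-1
Au_new = np.dot(P, Vu_new.transpose())

# Recommend the highest-scoring artworks the user has not rated yet
scores = Au_new[:, 0].copy()
scores[a > 0] = -np.inf
print("top 5 artwork indices:", np.argsort(-scores)[:5])
```

With fold-in, the prediction reduces to U Uᵀ a, the projection of the user's ratings onto the k-dimensional artwork space, so its quality depends directly on how well k factors capture the data.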
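The results so far are not very good, probably because the algorithm is crude and the rating matrix is extremely sparse and long-tailed. Planned improvements:
- Combine the approach with a probabilistic model and try filling in the matrix with predicted values
- Solve the updating-SVD problem
- Add handling for cold start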