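Process:
- Scrape user information from Douban Movie (豆瓣电影) with a crawler.
- Define movie rating levels with a multi-class scheme.
- Compute the Pearson correlation between my own ratings and each user's ratings.
- User-based similarity: find like-minded people in order to discover products I am likely to enjoy.
- Item-based similarity: find similar products in order to discover potential customers (for example, Amazon's "Customers who bought this item also bought").
Multi-class movie ratings: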
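
As a rough sketch of what such a multi-class scale could look like, the snippet below maps five rating levels to the integers 1 to 5 that appear in the data further down. The label names, `RATING_LEVELS`, and `rating_to_score` are assumptions made for illustration; the notes do not say how the levels were actually defined.

```python
# Hypothetical five-level rating scale; the labels are illustrative assumptions.
RATING_LEVELS = {
    'very bad': 1,
    'bad': 2,
    'ok': 3,
    'good': 4,
    'excellent': 5,
}


def rating_to_score(label):
    """Return the numeric score for a rating label, or 0 when there is no rating."""
    return RATING_LEVELS.get(label, 0)


print(rating_to_score('good'))       # 4
print(rating_to_score('not rated'))  # 0
```

Returning 0 for a missing rating fits the recommendation code in these notes, which treats a score of 0 the same as "not seen yet".
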
Entering user information:

    # -*- coding: utf-8 -*-
    # Make plain string literals text on Python 2 as well as Python 3
    from __future__ import unicode_literals
    import io
    import json

    user_info = {}

    # Scraped data: each user's ratings (1-5) for five movies
    user_dict = {
        'ns2250225': [4, 3, 4, 5, 4],
        'justin': [3, 4, 3, 4, 2],
        'totox': [2, 3, 5, 1, 4],
        'fabrice': [4, 1, 3, 4, 5],
        'doreen': [3, 4, 2, 5, 3]
    }

    # Load the ratings into user_info as {user: {movie title: rating}}
    def user_data(user_dict):
        for name in user_dict:
            user_info[name] = {u'消失的爱人': user_dict[name][0]}      # Gone Girl
            user_info[name][u'霍比特人3'] = user_dict[name][1]        # The Hobbit 3
            user_info[name][u'神去村'] = user_dict[name][2]           # Wood Job!
            user_info[name][u'泰坦尼克号'] = user_dict[name][3]       # Titanic
            user_info[name][u'这个杀手不太冷'] = user_dict[name][4]   # Leon: The Professional

    user_data(user_dict)

    # Save the user data: one block per user, one movie/rating pair per line
    try:
        # io.open accepts an encoding argument on both Python 2 and Python 3
        with io.open('user_data.txt', 'w', encoding='utf-8') as data:
            for key in user_info:
                data.write(key)
                for key2 in user_info[key]:
                    data.write('\t')
                    data.write(key2)
                    data.write('\t')
                    data.write('\t')
                    data.write('%d' % user_info[key][key2])
                    data.write('\n')
                data.write('\n')
    except IOError as err:
        print('File error: ' + str(err))

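
The tab-separated layout written above is easy to read but awkward to parse back into a dictionary. One possible alternative, sketched here under the assumption of Python 3, is to store the same `{user: {movie: rating}}` structure as JSON with the standard `json` module. The helper names `save_user_info` and `load_user_info` and the file name `user_data.json` are hypothetical and do not come from the notes.

```python
import json
import os
import tempfile


def save_user_info(user_info, path):
    """Write the {user: {movie: rating}} dictionary to a UTF-8 JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps the Chinese movie titles readable in the file
        json.dump(user_info, f, ensure_ascii=False, indent=2)


def load_user_info(path):
    """Read the dictionary back from the JSON file."""
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)


sample = {'justin': {'消失的爱人': 3, '霍比特人3': 4}}
path = os.path.join(tempfile.gettempdir(), 'user_data.json')
save_user_info(sample, path)
restored = load_user_info(path)
print(restored == sample)  # True
```

The round trip works because all keys are strings and all ratings are integers, both of which JSON represents without loss.
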
Compute the Pearson correlation coefficient to find users with similar tastes (after inserting my own data):

    from math import sqrt

    # Pearson correlation: 1 is a perfect positive correlation, -1 a perfect negative one
    def sim_pearson(prefs, p1, p2):
        # Get the list of mutually rated items
        si = {}
        for item in prefs[p1]:
            if item in prefs[p2]:
                si[item] = 1
        # If there are no ratings in common, return 0
        if len(si) == 0:
            return 0
        # Number of shared items, as a float so the divisions below
        # are not truncated to integers on Python 2
        n = float(len(si))
        # Sums of all the preferences
        sum1 = sum([prefs[p1][it] for it in si])
        sum2 = sum([prefs[p2][it] for it in si])
        # Sums of the squares
        sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
        sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
        # Sum of the products
        pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
        # Calculate r (Pearson score)
        num = pSum - (sum1 * sum2 / n)
        den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
        # A zero denominator means one user gave every shared movie the same rating
        if den == 0:
            return 0
        r = num / den
        return r

    # Insert my own data
    user_info['me'] = {u'消失的爱人': 5,
                       u'神去村': 3,
                       u'炸裂鼓手': 5}

    # Users with a Pearson correlation > 0 have movie tastes close to mine
    for user in user_info:
        # Skip comparing me with myself, which always gives 1
        if user == 'me':
            continue
        res = sim_pearson(user_info, 'me', user)
        if res > 0:
            print('the user like %s is : %s' % ('me', user))
            print('result :%f\n' % res)

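
For reference, the quantity that `sim_pearson` computes can be written out as a formula. Here $n$ is the number of movies rated by both users, and $x_i$ and $y_i$ are the two users' ratings of the $i$-th shared movie (these symbols are introduced here for illustration; in the code the sums are `sum1`, `sum2`, `sum1Sq`, `sum2Sq`, and `pSum`):

```latex
r = \frac{\sum_{i=1}^{n} x_i y_i - \dfrac{\sum_{i=1}^{n} x_i \, \sum_{i=1}^{n} y_i}{n}}
         {\sqrt{\left(\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right)
                \left(\sum_{i=1}^{n} y_i^2 - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right)}}
```

One consequence is worth keeping in mind when reading the output: when two users share only two movies ($n = 2$), $r$ can only be $1$, $-1$, or undefined (returned as 0 by the code), because two points always lie on a straight line. In this data set, `me` shares exactly two movies (消失的爱人 and 神去村) with every other user, so the similarity scores are necessarily coarse.
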
Recommend movies to a user (a weighted average of everyone's ratings):

    # Recommend movies to a user, weighting every other user's ratings by similarity
    def getRecommendations(prefs, person, similarity=sim_pearson):
        totals = {}
        simSums = {}
        for other in prefs:
            # don't compare me to myself
            if other == person:
                continue
            sim = similarity(prefs, person, other)
            # ignore scores of zero or lower
            if sim <= 0:
                continue
            for item in prefs[other]:
                # only score movies I haven't seen yet
                if item not in prefs[person] or prefs[person][item] == 0:
                    # Similarity * Score
                    totals.setdefault(item, 0)
                    totals[item] += prefs[other][item] * sim
                    # Sum of similarities
                    simSums.setdefault(item, 0)
                    simSums[item] += sim
        # Create the normalized list
        rankings = [(total / simSums[item], item) for item, total in totals.items()]
        # Return the list sorted from highest to lowest predicted rating
        rankings.sort(reverse=True)
        return rankings

    # Recommend movies to me
    res = getRecommendations(user_info, 'me')
    print('Recommended movies:')
    print(json.dumps(res, ensure_ascii=False))

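
To make the weighted average concrete, here is a small worked sketch using the ratings of the two users who come out as similar to `me` on this data, doreen and fabrice. The similarity values are hard-coded as 1.0 for the sketch, on the assumption that the Pearson score evaluates to 1 for both of them here (each shares only two movies with `me`); `neighbours`, `similarity`, and `predict` are names invented for this example.

```python
# Ratings copied from user_dict for the movies that 'me' has not rated.
neighbours = {
    'fabrice': {'霍比特人3': 1, '泰坦尼克号': 4, '这个杀手不太冷': 5},
    'doreen': {'霍比特人3': 4, '泰坦尼克号': 5, '这个杀手不太冷': 3},
}

# Assumed similarity of each neighbour to 'me' (hard-coded for this sketch).
similarity = {'fabrice': 1.0, 'doreen': 1.0}


def predict(item):
    """Similarity-weighted average of the neighbours' ratings for one movie."""
    raters = [user for user in neighbours if item in neighbours[user]]
    weighted_total = sum(similarity[user] * neighbours[user][item] for user in raters)
    return weighted_total / sum(similarity[user] for user in raters)


for movie in ['泰坦尼克号', '这个杀手不太冷', '霍比特人3']:
    print(movie, predict(movie))
# 泰坦尼克号 4.5
# 这个杀手不太冷 4.0
# 霍比特人3 2.5
```

Dividing by the sum of similarities rather than by the number of raters keeps a movie rated by many weakly similar users from outranking one rated by a few strongly similar users.
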
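Results and analysis:
- Users whose movie tastes are close to mine: doreen and fabrice.
- Movies recommended to me: 泰坦尼克号 (Titanic) and 这个杀手不太冷 (Léon: The Professional).
- User-based analysis finds people with similar tastes and recommends products to them, which surfaces products they are likely to enjoy.
- Item-based analysis instead finds similar products and the people who like them, which surfaces a product's potential customers.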
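
A common way to move from the user-based view to the item-based view is to flip the rating dictionary so that movies become the outer keys. The sketch below shows that step only; `transform_prefs` is a hypothetical helper name, and the two-user sample is a subset of the data in these notes.

```python
def transform_prefs(prefs):
    """Flip {user: {item: rating}} into {item: {user: rating}}."""
    result = {}
    for person, ratings in prefs.items():
        for item, rating in ratings.items():
            result.setdefault(item, {})[person] = rating
    return result


prefs = {
    'fabrice': {'泰坦尼克号': 4, '这个杀手不太冷': 5},
    'doreen': {'泰坦尼克号': 5, '这个杀手不太冷': 3},
}

by_movie = transform_prefs(prefs)
print(by_movie['泰坦尼克号'])  # {'fabrice': 4, 'doreen': 5}
```

With the dictionary flipped, a similarity function that takes `(prefs, a, b)` can be called with two movie titles instead of two user names, so it compares movies by the users who rated them.
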