This case study is about recommender systems: predicting which events a user is likely to be interested in. For more background on the competition, see event_recommendation_competition. Here I walk through the first-place solution. Besides the classic machine-learning pipeline, it folds in traditional recommender-system techniques: user-based collaborative filtering and item-based collaborative filtering (an LFM model could be blended in as well). Because this solution is such a classic, I think it is worth a detailed look. I will post the complete code and explain it in detail.
First, the data provided by the competition: six tables, train, test, users, events, user_friends, and event_attendees. See the link above for the detailed data description.
The train table:
The test table:
The users table:
The events table:
The event_attendees table:
The user_friends table:
- First, this is a recommendation problem.
We have the following kinds of data:
①: users' historical behavior => whether they were interested in / attended an event
②: users' social data => their friend circles
③: event-related data => the events themselves
Some quick observations:
①: we want to bring as many dimensions of information as possible into play;
②: collaborative filtering works on the user-event interaction history;
③: the social data and the event metadata should also feed into the final result;
④: frame it as a classification problem: interested / not interested is the target, and everything that influences the outcome is a feature;
⑤: the features include the recommendation scores produced by collaborative filtering (userCF, itemCF).
The first-place solution, step by step:
The data is large, and reading it all in with pandas may not fit in memory, so we use scipy.sparse's dok_matrix to store it row by row as sparse matrices.
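If dok_matrix is unfamiliar, here is a minimal standalone sketch (not part of the solution): entries live in a dict keyed by (row, col), so memory grows with the number of non-zeros rather than with the full shape.
import scipy.sparse as ss

m = ss.dok_matrix((1000000, 1000000))  # a huge logical shape, but nothing is allocated yet
m[3, 5] = 1.0                          # only explicitly written cells are stored
m[42, 7] = -1.0
print m.nnz                            # 2
print m[3, 5], m[0, 0]                 # 1.0 0.0 -- unset cells read back as zero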
Now let's walk through the code alongside the explanation (note that it is Python 2 code: cPickle, print statements, has_key):
from __future__ import division
import itertools
import cPickle
import datetime
import hashlib
import locale
import numpy as np
import pycountry
import scipy.io as sio
import scipy.sparse as ss
import scipy.spatial.distance as ssd
from collections import defaultdict
from sklearn.preprocessing import normalize
1. The data-cleaning class
This class handles the preprocessing chores: mapping user gender to an integer id, numbering locales, converting string dates into numbers, and so on.
class DataCleaner:
    """
    Common utilities for converting strings to equivalent numbers
    or number buckets.
    """
    def __init__(self):
        # locale string -> small integer id (0 is reserved for "unknown")
        self.localeIdMap = defaultdict(int)
        for i, l in enumerate(locale.locale_alias.keys()):
            self.localeIdMap[l] = i + 1
        # country / subdivision name -> integer id
        self.countryIdMap = defaultdict(int)
        ctryIdx = defaultdict(int)
        for i, c in enumerate(pycountry.countries):
            self.countryIdMap[c.name.lower()] = i + 1
            if c.name.lower() == "usa":
                ctryIdx["US"] = i
            if c.name.lower() == "canada":
                ctryIdx["CA"] = i
        for cc in ctryIdx.keys():
            for s in pycountry.subdivisions.get(country_code=cc):
                self.countryIdMap[s.name.lower()] = ctryIdx[cc] + 1
        self.genderIdMap = defaultdict(int, {"male": 1, "female": 2})

    def getLocaleId(self, locstr):
        return self.localeIdMap[locstr.lower()]

    def getGenderId(self, genderStr):
        return self.genderIdMap[genderStr]

    def getJoinedYearMonth(self, dateString):
        dttm = datetime.datetime.strptime(dateString, "%Y-%m-%dT%H:%M:%S.%fZ")
        # return as an int (e.g. 201210) so it can be stored in a sparse matrix
        return int("".join([str(dttm.year), str(dttm.month)]))

    def getCountryId(self, location):
        # location strings look like "city  country" with a two-space
        # delimiter, hence the "  " and the +2 offset below
        if (isinstance(location, str)
                and len(location.strip()) > 0
                and location.rfind("  ") > -1):
            return self.countryIdMap[location[location.rindex("  ") + 2:].lower()]
        else:
            return 0

    def getBirthYearInt(self, birthYear):
        try:
            return 0 if birthYear == "None" else int(birthYear)
        except:
            return 0

    def getTimezoneInt(self, timezone):
        try:
            return int(timezone)
        except:
            return 0

    def getFeatureHash(self, value):
        # bucket an arbitrary string into a 16-bit integer via sha224
        if len(value.strip()) == 0:
            return -1
        else:
            return int(hashlib.sha224(value).hexdigest()[0:4], 16)

    def getFloatValue(self, value):
        if len(value.strip()) == 0:
            return 0.0
        else:
            return float(value)
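A quick smoke test of the cleaner (illustrative values, not rows from the data):
cleaner = DataCleaner()
print cleaner.getGenderId("male")                              # 1
print cleaner.getJoinedYearMonth("2012-10-02T06:40:55.524Z")   # 201210
print cleaner.getBirthYearInt("None")                          # 0
print cleaner.getFloatValue("")                                # 0.0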
2. Linking users and events
This class gathers the following:
①: uniqueUsers: set(), the distinct users appearing in train and test.
②: uniqueEvents: set(), the distinct events appearing in train and test.
③: eventsForUser: {user: set(events)}, for each user in train/test, the events they interacted with.
④: usersForEvent: {event: set(users)}, for each event in train/test, the users who interacted with it.
⑤: userEventScores: a dok_matrix of shape (len(uniqueUsers), len(uniqueEvents)) holding each user's interest score for each event in train.
⑥: uniqueUserPairs: set(), all pairs of users in train/test interested in the same event.
⑦: uniqueEventPairs: set(), all pairs of events in train/test that the same user was interested in.
⑧: userIndex: dict {user: i}, each user's row index in uniqueUsers.
⑨: eventIndex: dict {event: i}, each event's column index in uniqueEvents.
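The pair sets in ⑥ and ⑦ come from itertools.combinations; a tiny illustration with made-up ids (the class below shows the real usage):
import itertools

users = ["u1", "u2", "u3"]   # users who touched the same event
print list(itertools.combinations(users, 2))
# [('u1', 'u2'), ('u1', 'u3'), ('u2', 'u3')] -- every unordered pair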
class ProgramEntities:
    """
    We only care about users and events that appear in train or test,
    so this class focuses on that slice of the data.
    """
    def __init__(self):
        uniqueUsers = set()
        uniqueEvents = set()
        eventsForUser = defaultdict(set)
        usersForEvent = defaultdict(set)
        for filename in ["train.csv", "test.csv"]:
            f = open(filename, 'rb')
            f.readline()  # skip the header row
            for line in f:
                cols = line.strip().split(",")
                uniqueUsers.add(cols[0])
                uniqueEvents.add(cols[1])
                eventsForUser[cols[0]].add(cols[1])
                usersForEvent[cols[1]].add(cols[0])
            f.close()
        self.userEventScores = ss.dok_matrix((len(uniqueUsers), len(uniqueEvents)))
        self.userIndex = dict()
        self.eventIndex = dict()
        for i, u in enumerate(uniqueUsers):
            self.userIndex[u] = i
        for i, e in enumerate(uniqueEvents):
            self.eventIndex[e] = i
        ftrain = open("train.csv", 'rb')
        ftrain.readline()
        for line in ftrain:
            cols = line.strip().split(",")
            i = self.userIndex[cols[0]]
            j = self.eventIndex[cols[1]]
            # score = interested - not_interested, i.e. +1, 0 or -1
            self.userEventScores[i, j] = int(cols[4]) - int(cols[5])
        ftrain.close()
        sio.mmwrite("PE_userEventScores", self.userEventScores)
        # pairs of users who touched the same event, and pairs of events
        # touched by the same user; similarities will only be computed
        # for these pairs later
        self.uniqueUserPairs = set()
        self.uniqueEventPairs = set()
        for event in uniqueEvents:
            users = usersForEvent[event]
            if len(users) > 2:
                self.uniqueUserPairs.update(itertools.combinations(users, 2))
        for user in uniqueUsers:
            events = eventsForUser[user]
            if len(events) > 2:
                self.uniqueEventPairs.update(itertools.combinations(events, 2))
        cPickle.dump(self.userIndex, open("PE_userIndex.pkl", 'wb'))
        cPickle.dump(self.eventIndex, open("PE_eventIndex.pkl", 'wb'))
3. The user-user similarity matrix
This class builds two matrices:
① userMatrix: a dok_matrix of shape (n_users, n_user_columns - 1) built from the users table: every column except user_id goes into the matrix, i.e. the profile attributes of each user in userIndex.
② userSimMatrix: a dok_matrix of shape (n_users, n_users). Based on the profile attributes in userMatrix, scipy.spatial.distance.correlation (correlation distance, i.e. 1 minus the Pearson correlation, not Euclidean distance) is evaluated for every pair in uniqueUserPairs, i.e. every two users who acted on at least one common event, in the spirit of the similarity computation in collaborative filtering. (Strictly speaking correlation is a distance, so smaller values mean more similar; the solution nevertheless uses these values directly as the pairwise weights.)
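As a quick sanity check on this measure (a standalone snippet, not part of the solution):
import scipy.spatial.distance as ssd

print ssd.correlation([1, 2, 3], [2, 4, 6])   # 0.0 -- perfectly correlated
print ssd.correlation([1, 2, 3], [3, 2, 1])   # 2.0 -- perfectly anti-correlated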
class Users:
    """
    Build the user/user similarity matrix.
    """
    def __init__(self, programEntities, sim=ssd.correlation):
        cleaner = DataCleaner()
        nusers = len(programEntities.userIndex.keys())
        fin = open("users.csv", 'rb')
        colnames = fin.readline().strip().split(",")
        self.userMatrix = ss.dok_matrix((nusers, len(colnames) - 1))
        for line in fin:
            cols = line.strip().split(",")
            # only keep users that appear in train/test
            if programEntities.userIndex.has_key(cols[0]):
                i = programEntities.userIndex[cols[0]]
                self.userMatrix[i, 0] = cleaner.getLocaleId(cols[1])
                self.userMatrix[i, 1] = cleaner.getBirthYearInt(cols[2])
                self.userMatrix[i, 2] = cleaner.getGenderId(cols[3])
                self.userMatrix[i, 3] = cleaner.getJoinedYearMonth(cols[4])
                self.userMatrix[i, 4] = cleaner.getCountryId(cols[5])
                self.userMatrix[i, 5] = cleaner.getTimezoneInt(cols[6])
        fin.close()
        # L1-normalize each column so the features are on comparable scales
        self.userMatrix = normalize(self.userMatrix, norm="l1", axis=0, copy=False)
        sio.mmwrite("US_userMatrix", self.userMatrix)
        # similarity is only computed for user pairs that share an event
        self.userSimMatrix = ss.dok_matrix((nusers, nusers))
        for i in range(0, nusers):
            self.userSimMatrix[i, i] = 1.0
        for u1, u2 in programEntities.uniqueUserPairs:
            i = programEntities.userIndex[u1]
            j = programEntities.userIndex[u2]
            if not self.userSimMatrix.has_key((i, j)):
                usim = sim(self.userMatrix.getrow(i).todense(),
                           self.userMatrix.getrow(j).todense())
                self.userSimMatrix[i, j] = usim
                self.userSimMatrix[j, i] = usim
        sio.mmwrite("US_userSimMatrix", self.userSimMatrix)
4. Mining the social graph
This class analyzes the users' social data and produces:
①: numFriends: an array of length n_users (saved as a 1 x n matrix), the number of friends each user in userIndex has; a large friend count suggests a more outgoing user.
②: userFriends: a dok_matrix of shape (n_users, n_users) recording how active user i's friend j is; if a user's friends are all active, the user is likely active too.
class UserFriends:
    """
    Find a user's friends. The intuition is simple:
    1) if you have many friends, you are probably outgoing and more likely to attend events;
    2) if your friends attend an event, you may tag along.
    """
    def __init__(self, programEntities):
        nusers = len(programEntities.userIndex.keys())
        self.numFriends = np.zeros((nusers))
        self.userFriends = ss.dok_matrix((nusers, nusers))
        fin = open("user_friends.csv", 'rb')
        fin.readline()  # skip header
        ln = 0
        for line in fin:
            if ln % 200 == 0:
                print "Loading line: ", ln
            cols = line.strip().split(",")
            user = cols[0]
            if programEntities.userIndex.has_key(user):
                friends = cols[1].split(" ")
                i = programEntities.userIndex[user]
                self.numFriends[i] = len(friends)
                for friend in friends:
                    if programEntities.userIndex.has_key(friend):
                        j = programEntities.userIndex[friend]
                        # the friend's activity level: their average event score
                        eventsForUser = programEntities.userEventScores.getrow(j).todense()
                        score = eventsForUser.sum() / np.shape(eventsForUser)[1]
                        self.userFriends[i, j] += score
                        self.userFriends[j, i] += score
            ln += 1
        fin.close()
        # normalize the friend counts into a distribution
        sumNumFriends = self.numFriends.sum(axis=0)
        self.numFriends = self.numFriends / sumNumFriends
        sio.mmwrite("UF_numFriends", np.matrix(self.numFriends))
        self.userFriends = normalize(self.userFriends, norm="l1", axis=0, copy=False)
        sio.mmwrite("UF_userFriends", self.userFriends)
5. Event-event similarity data
This builds event-event similarities; note there are two kinds:
① eventPropMatrix: a dok_matrix of shape (n_events, 7). The events table has 110 columns; the first 9 are event metadata (event_id, user_id, start_time, city, state, zip, country, lat, lng). Dropping event_id and user_id, the remaining 7 metadata columns go into eventPropMatrix.
② eventContMatrix: a dok_matrix of shape (n_events, 100) holding the event's content: count_N is the number of times the N-th most common word stem appears in the event's name or description (count_other, the count of all remaining words, is not used here). Columns 10 through 109 of the events table go into this matrix.
③ eventPropSim: a dok_matrix of shape (n_events, n_events): for every pair in uniqueEventPairs (events acted on by at least one common user), scipy.spatial.distance.correlation is computed over the metadata features in eventPropMatrix, similar to the similarity computation in collaborative filtering.
④ eventContSim: a dok_matrix of shape (n_events, n_events): for the same pairs, cosine distance is computed over the content features in eventContMatrix.
class Events:
    """
    Build event-event similarities. There are two kinds:
    1) similarity over the event metadata (property) features;
    2) similarity over the event's own content (word-count) features.
    In both cases similarity is only computed for event pairs that share
    a user, as in collaborative filtering.
    """
    def __init__(self, programEntities, psim=ssd.correlation, csim=ssd.cosine):
        cleaner = DataCleaner()
        fin = open("events.csv", 'rb')
        fin.readline()  # skip header
        nevents = len(programEntities.eventIndex.keys())
        self.eventPropMatrix = ss.dok_matrix((nevents, 7))
        self.eventContMatrix = ss.dok_matrix((nevents, 100))
        ln = 0
        for line in fin.readlines():
            cols = line.strip().split(",")
            eventId = cols[0]
            if programEntities.eventIndex.has_key(eventId):
                i = programEntities.eventIndex[eventId]
                # metadata features: start_time, city, state, zip, country, lat, lng
                self.eventPropMatrix[i, 0] = cleaner.getJoinedYearMonth(cols[2])
                self.eventPropMatrix[i, 1] = cleaner.getFeatureHash(cols[3])
                self.eventPropMatrix[i, 2] = cleaner.getFeatureHash(cols[4])
                self.eventPropMatrix[i, 3] = cleaner.getFeatureHash(cols[5])
                self.eventPropMatrix[i, 4] = cleaner.getFeatureHash(cols[6])
                self.eventPropMatrix[i, 5] = cleaner.getFloatValue(cols[7])
                self.eventPropMatrix[i, 6] = cleaner.getFloatValue(cols[8])
                # content features: count_1 .. count_100
                for j in range(9, 109):
                    self.eventContMatrix[i, j - 9] = float(cols[j])
            ln += 1
        fin.close()
        self.eventPropMatrix = normalize(self.eventPropMatrix,
                                         norm="l1", axis=0, copy=False)
        sio.mmwrite("EV_eventPropMatrix", self.eventPropMatrix)
        self.eventContMatrix = normalize(self.eventContMatrix,
                                         norm="l1", axis=0, copy=False)
        sio.mmwrite("EV_eventContMatrix", self.eventContMatrix)
        self.eventPropSim = ss.dok_matrix((nevents, nevents))
        self.eventContSim = ss.dok_matrix((nevents, nevents))
        for e1, e2 in programEntities.uniqueEventPairs:
            i = programEntities.eventIndex[e1]
            j = programEntities.eventIndex[e2]
            if not self.eventPropSim.has_key((i, j)):
                epsim = psim(self.eventPropMatrix.getrow(i).todense(),
                             self.eventPropMatrix.getrow(j).todense())
                self.eventPropSim[i, j] = epsim
                self.eventPropSim[j, i] = epsim
            if not self.eventContSim.has_key((i, j)):
                ecsim = csim(self.eventContMatrix.getrow(i).todense(),
                             self.eventContMatrix.getrow(j).todense())
                self.eventContSim[i, j] = ecsim
                self.eventContSim[j, i] = ecsim
        sio.mmwrite("EV_eventPropSim", self.eventPropSim)
        sio.mmwrite("EV_eventContSim", self.eventContSim)
6. Event popularity data
eventPopularity: a dok_matrix of shape (n_events, 1); for each event, the difference between the number of users in the yes column and the no column of event_attendees is used as its popularity.
class EventAttendees:
    """
    Count how many people attend vs. skip each event,
    as raw material for an event-popularity signal.
    """
    def __init__(self, programEvents):
        nevents = len(programEvents.eventIndex.keys())
        self.eventPopularity = ss.dok_matrix((nevents, 1))
        f = open("event_attendees.csv", 'rb')
        f.readline()  # skip header
        for line in f:
            cols = line.strip().split(",")
            eventId = cols[0]
            if programEvents.eventIndex.has_key(eventId):
                i = programEvents.eventIndex[eventId]
                # popularity = (# of "yes" users) - (# of "no" users)
                self.eventPopularity[i, 0] = \
                    len(cols[1].split(" ")) - len(cols[4].split(" "))
        f.close()
        self.eventPopularity = normalize(self.eventPopularity, norm="l1",
                                         axis=0, copy=False)
        sio.mmwrite("EA_eventPopularity", self.eventPopularity)
7. Stringing the whole data-preparation pipeline together
def data_prepare():
    """
    Compute and persist all the data above, as matrices or pickles,
    so feature extraction and modeling can pick them up later.
    """
    print "Step 1: gathering user and event statistics..."
    pe = ProgramEntities()
    print "Step 1 done...\n"
    print "Step 2: computing user-user similarities and saving them..."
    Users(pe)
    print "Step 2 done...\n"
    print "Step 3: computing and saving the social-graph data..."
    UserFriends(pe)
    print "Step 3 done...\n"
    print "Step 4: computing event-event similarities and saving them..."
    Events(pe)
    print "Step 4 done...\n"
    print "Step 5: computing event popularity..."
    EventAttendees(pe)
    print "Step 5 done...\n"

data_prepare()
8. Building features (the DataRewriter class)
Load back the statistics saved to disk above.
from __future__ import division
import cPickle
import numpy as np
import scipy.io as sio

class DataRewriter:
    def __init__(self):
        self.userIndex = cPickle.load(open("PE_userIndex.pkl", 'rb'))
        self.eventIndex = cPickle.load(open("PE_eventIndex.pkl", 'rb'))
        self.userEventScores = sio.mmread("PE_userEventScores").todense()
        self.userSimMatrix = sio.mmread("US_userSimMatrix").todense()
        self.eventPropSim = sio.mmread("EV_eventPropSim").todense()
        self.eventContSim = sio.mmread("EV_eventContSim").todense()
        self.numFriends = sio.mmread("UF_numFriends")
        self.userFriends = sio.mmread("UF_userFriends").todense()
        self.eventPopularity = sio.mmread("EA_eventPopularity").todense()
User-based collaborative filtering scores as features: take computing user i's interest in event j as an example. userSimMatrix[i, :] is user i's similarity to every user, and userEventScores[:, j] is every user's interest in event j; their product estimates user i's interest in event j from other users' interest, weighted by similarity. This is exactly user-based collaborative filtering.
def userReco(self, userId, eventId):
    """
    Compute an event recommendation score via user-based collaborative filtering.
    Pseudocode:
      for item i
        for every other user v that has a preference for i
          compute similarity s between u and v
          incorporate v's preference for i weighted by s into a running average
      return top items ranked by weighted average
    """
    i = self.userIndex[userId]
    j = self.eventIndex[eventId]
    vs = self.userEventScores[:, j]    # every user's score for event j (n x 1)
    sims = self.userSimMatrix[i, :]    # user i's similarity to every user (1 x n)
    prod = sims * vs                   # weighted sum, a 1 x 1 matrix
    try:
        # subtract user i's own contribution (self-similarity is 1.0)
        return prod[0, 0] - self.userEventScores[i, j]
    except IndexError:
        return 0
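To make the matrix product concrete, here is a tiny numeric sketch with made-up numbers (not from the data):
import numpy as np

sims = np.matrix([[1.0, 0.5, 0.2]])     # user i's similarity to users 0..2 (user 0 is i itself)
vs = np.matrix([[0.0], [1.0], [-1.0]])  # each user's score for event j
prod = sims * vs                        # 1x1 matrix: 1.0*0 + 0.5*1 + 0.2*(-1) = 0.3
print prod[0, 0] - 0.0                  # minus user i's own score, as in userReco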
Event-based collaborative filtering scores as features: here there are two event similarities, one computed over the event metadata features (eventPropSim) and one over the event content features (eventContSim), so the analogous computation yields two recommendation features.
def eventReco(self, userId, eventId):
    """
    Compute event recommendation scores via item-based collaborative filtering.
    Pseudocode:
      for item i
        for every item j that u has a preference for
          compute similarity s between i and j
          add u's preference for j weighted by s to a running average
      return top items, ranked by weighted average
    """
    i = self.userIndex[userId]
    j = self.eventIndex[eventId]
    js = self.userEventScores[i, :]   # user i's scores for every event (1 x n)
    psim = self.eventPropSim[:, j]    # metadata similarity of every event to j (n x 1)
    csim = self.eventContSim[:, j]    # content similarity of every event to j (n x 1)
    pprod = js * psim
    cprod = js * csim
    pscore = 0
    cscore = 0
    try:
        pscore = pprod[0, 0] - self.userEventScores[i, j]
    except IndexError:
        pass
    try:
        cscore = cprod[0, 0] - self.userEventScores[i, j]
    except IndexError:
        pass
    return pscore, cscore
Return each user's friend count:
def userPop(self, userId):
    """
    Infer how social a user is from their friend count.
    The assumption: users with many friends are more likely to attend events.
    """
    if self.userIndex.has_key(userId):
        i = self.userIndex[userId]
        try:
            return self.numFriends[0, i]
        except IndexError:
            return 0
    else:
        return 0
Friends' influence on the user:
def friendInfluence(self, userId):
    """
    How much a user's friends influence them.
    The idea: if many of a user's friends actively attend events,
    the user is likely to be pulled along.
    userFriends is a (n_users, n_users) matrix whose entry (i, j)
    is the activity level of user i's friend j.
    """
    nusers = np.shape(self.userFriends)[1]
    i = self.userIndex[userId]
    # average activity of user i's friends: sum the row, then divide
    return (self.userFriends[i, :].sum(axis=1) / nusers)[0, 0]
Event popularity:
def eventPop(self, eventId):
    """
    The popularity of the event itself, measured by attendee counts.
    """
    i = self.eventIndex[eventId]
    return self.eventPopularity[i, 0]
Generate the training-set and test-set features.
The generated features are:
①: the invited flag from train/test
②: the user-based CF recommendation score
③: the event-based CF recommendation scores
④: the user's sociability (friend count)
⑤: the influence of the user's friends
⑥: the event's popularity
and, for train only:
⑦: interested
⑧: not_interested
def rewriteData(self, start=1, train=True, header=True):
    """
    Combine the user-based CF score, the item-based CF scores, and the
    popularity/influence signals into feature rows for a classifier.
    """
    fn = "train.csv" if train else "test.csv"
    fin = open(fn, 'rb')
    fout = open("data_" + fn, 'wb')
    if header:
        ocolnames = ["invited", "user_reco", "evt_p_reco",
                     "evt_c_reco", "user_pop", "frnd_infl", "evt_pop"]
        if train:
            ocolnames.append("interested")
            ocolnames.append("not_interested")
        fout.write(",".join(ocolnames) + "\n")
    ln = 0
    for line in fin:
        ln += 1
        if ln < start:
            continue
        cols = line.strip().split(",")
        userId = cols[0]
        eventId = cols[1]
        invited = cols[2]
        if ln % 500 == 0:
            print "%s:%d (userId, eventId)=(%s, %s)" % (fn, ln, userId, eventId)
        user_reco = self.userReco(userId, eventId)
        evt_p_reco, evt_c_reco = self.eventReco(userId, eventId)
        user_pop = self.userPop(userId)
        frnd_infl = self.friendInfluence(userId)
        evt_pop = self.eventPop(eventId)
        ocols = [invited, user_reco, evt_p_reco,
                 evt_c_reco, user_pop, frnd_infl, evt_pop]
        if train:
            ocols.append(cols[4])  # interested
            ocols.append(cols[5])  # not_interested
        fout.write(",".join(map(lambda x: str(x), ocols)) + "\n")
    fin.close()
    fout.close()

def rewriteTrainingSet(self):
    self.rewriteData(train=True)

def rewriteTestSet(self):
    self.rewriteData(train=False)

dr = DataRewriter()
print "Generating training data...\n"
dr.rewriteData(train=True, start=2, header=True)
print "Generating prediction data...\n"
dr.rewriteData(train=False, start=2, header=True)
9. Modeling and prediction
Once the features above are built, there are many ways to train a model and make predictions. Here sklearn's SGDClassifier is used; in practice xgboost tends to do better (our features are mostly dense floats, which suits a GBDT-style model well).
Note that we also run 10-fold cross-validation.
Train the model on the training-set features:
from __future__ import division
import math
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import SGDClassifier

def train():
    """
    Train a classifier on our features; the target is 1 (interested) or 0 (not interested).
    """
    trainDf = pd.read_csv("data_train.csv")
    trainDf.fillna(0, inplace=True)
    X = np.matrix(pd.DataFrame(trainDf, index=None,
                  columns=["invited", "user_reco", "evt_p_reco", "evt_c_reco",
                           "user_pop", "frnd_infl", "evt_pop"]))
    y = np.array(trainDf.interested)
    clf = SGDClassifier(loss="log", penalty="l2")
    clf.fit(X, y)
    return clf
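As noted above, a GBDT-style model tends to do better on these dense float features. Here is a minimal sketch of a drop-in alternative using xgboost's sklearn wrapper; it assumes xgboost is installed, and the hyperparameters are illustrative rather than tuned:
from xgboost import XGBClassifier

def train_xgb():
    trainDf = pd.read_csv("data_train.csv")
    trainDf.fillna(0, inplace=True)
    X = trainDf[["invited", "user_reco", "evt_p_reco", "evt_c_reco",
                 "user_pop", "frnd_infl", "evt_pop"]].values
    y = trainDf.interested.values
    # illustrative settings; tune via cross-validation in practice
    clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    clf.fit(X, y)
    return clf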
def validate():
    """
    10-fold cross-validation; prints the per-fold and average accuracy.
    """
    trainDf = pd.read_csv("data_train.csv")
    trainDf.fillna(0, inplace=True)
    X = np.matrix(pd.DataFrame(trainDf, index=None,
                  columns=["invited", "user_reco", "evt_p_reco", "evt_c_reco",
                           "user_pop", "frnd_infl", "evt_pop"]))
    y = np.array(trainDf.interested)
    kfold = KFold(n_splits=10)
    avgAccuracy = 0
    run = 0
    for train, test in kfold.split(X):
        Xtrain, Xtest, ytrain, ytest = X[train], X[test], y[train], y[test]
        clf = SGDClassifier(loss="log", penalty="l2")
        clf.fit(Xtrain, ytrain)
        accuracy = 0
        ntest = len(ytest)
        for i in range(0, ntest):
            yt = clf.predict(Xtest[i, :])
            if yt == ytest[i]:
                accuracy += 1
        accuracy = accuracy / ntest
        print "accuracy (run %d): %f" % (run, accuracy)
        avgAccuracy += accuracy
        run += 1
    print "Average accuracy", (avgAccuracy / run)

Let's also plot a learning curve to see whether the model is overfitting or underfitting.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1,
                        train_sizes=np.linspace(.05, 1., 20), verbose=0, plot=True):
    """
    Plot the learning curve of a model on the data.
    Parameters
    ----------
    estimator : the classifier to evaluate
    title : title of the plot
    X : input features, numpy array
    y : target vector
    ylim : (ymin, ymax) tuple fixing the y-axis range of the plot
    cv : number of cross-validation folds; one fold is the validation set,
         the remaining n-1 are the training set (default 3)
    n_jobs : number of parallel jobs (default 1)
    """
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    if plot:
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel(u"number of training samples")
        plt.ylabel(u"score")
        plt.gca().invert_yaxis()
        plt.grid()
        plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1, color="b")
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1, color="r")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label=u"training score")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label=u"cross-validation score")
        plt.legend(loc="best")
        plt.draw()
        plt.gca().invert_yaxis()
        plt.show()
    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) +
                (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - \
           (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff

trainDf = pd.read_csv("data_train.csv")
trainDf.fillna(0, inplace=True)
X = np.matrix(pd.DataFrame(trainDf, index=None,
              columns=["invited", "user_reco", "evt_p_reco", "evt_c_reco",
                       "user_pop", "frnd_infl", "evt_pop"]))
y = np.array(trainDf.interested)
clf = train()  # the SGDClassifier fitted above
plot_learning_curve(clf, u"learning curve", X, y, cv=10)

Predict on the test data:
def test(clf):
    """
    Read the test data and predict with the trained classifier.
    """
    origTestDf = pd.read_csv("test.csv")
    users = origTestDf.user
    events = origTestDf.event
    testDf = pd.read_csv("data_test.csv")
    fout = open("result.csv", 'wb')
    fout.write(",".join(["user", "event", "outcome", "dist"]) + "\n")
    nrows = len(testDf)
    Xp = np.matrix(testDf)
    yp = np.zeros((nrows, 2))
    for i in range(0, nrows):
        xp = Xp[i, :]
        yp[i, 0] = clf.predict(xp)            # predicted class (0/1)
        yp[i, 1] = clf.decision_function(xp)  # signed distance to the decision boundary
        fout.write(",".join(map(lambda x: str(x),
                                [users[i], events[i], yp[i, 0], yp[i, 1]])) + "\n")
    fout.close()
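result.csv has one row per (user, event) pair; to turn it into actual recommendations you still need to rank each user's candidate events, for example by the decision_function distance. A minimal sketch of that post-processing (my addition; the exact submission format required by the competition is not reproduced here):
import pandas as pd

df = pd.read_csv("result.csv")
# sort each user's candidate events by the classifier's confidence, highest first
ranked = df.sort_values(["user", "dist"], ascending=[True, False])
recs = ranked.groupby("user")["event"].apply(list)
print recs.head()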