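I. itemCF test
Mahout version: 0.10.0
Mahout provides many algorithms, and item-based collaborative filtering (itemCF) is one of the most commonly used. This post records how to use it.
1. Data preparation. The data here is behavior data I collected myself. There is not much of it, but it is enough to produce test results.
The three columns below are user_id, item_id, and preference.
Store the following data on HDFS. The path I used is /mahout/itemcf/data1/itemdata.data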
0162381440670851711,4,7.0
0162381440670851711,11,4.0
0162381440670851711,32,1.0
0162381440670851711,176,27.0
0162381440670851711,183,11.0
0162381440670851711,184,5.0
0162381440670851711,207,9.0
0162381440670851711,256,3.0
0162381440670851711,258,4.0
0162381440670851711,259,16.0
0162381440670851711,260,8.0
0162381440670851711,261,18.0
0162381440670851711,301,1.0
0162381440670851711,307,1.0
0162381440670851711,477,1.0
0162381440670851711,518,1.0
0162381440670851711,549,3.0
0162381440670851711,570,1.0
0162381440670851711,826,2.0
0357211441096952115,207,1.0
0617721441096186493,184,1.0
0617721441096186493,207,1.0
1205421441071459451,5,1.0
1214361441096861254,207,1.0
1401731441095483081,258,1.0
1401731441095483081,814,4.0
1401731441095483081,826,1.0
1917281441163686119,259,10.0
1917281441163686119,260,1.0
1917281441163686119,261,3.0
1966141441163860798,176,1.0
2294491441095342047,176,1.0
2441031440670827430,4,13.0
2441031440670827430,259,29.0
2441031440670827430,261,14.0
2441031440670827430,460,2.0
2441031440670827430,477,6.0
2441031440670827430,570,1.0
2441031440670827430,577,6.0
2441031440670827430,702,1.0
2441031440670827430,758,2.0
2441031440670827430,809,1.0
2475791441161318569,176,1.0
2987091441068878630,261,1.0
3114261440726814722,549,1.0
3445831441096810087,207,1.0
3846061441096937902,207,1.0
4266911441160164599,176,1.0
4698311441097046150,176,2.0
4698311441097046150,183,2.0
4698311441097046150,184,4.0
4698311441097046150,207,6.0
4946291441097563245,183,1.0
4956331440750398178,159,1.0
4956331440750398178,160,1.0
5307571441160362208,4,1.0
5307571441160362208,176,1.0
5719691441098504387,176,5.0
5719691441098504387,184,1.0
5719691441098504387,207,1.0
5813281441095425044,184,2.0
5813281441095425044,258,1.0
5894601441095265604,184,1.0
5981521441096106535,207,1.0
6292291441096870187,207,1.0
6533651441161410910,176,1.0
6810691441096902907,207,1.0
6836071440729632252,4,3.0
6836071440729632252,49,1.0
6836071440729632252,259,2.0
6836071440729632252,570,1.0
6836071440729632252,577,2.0
6964141441160527746,176,1.0
7495291441096796843,207,1.0
7616681441095305067,183,1.0
7616681441095305067,184,2.0
7616681441095305067,258,2.0
7616681441095305067,261,1.0
7732211441095211112,183,1.0
7732211441095211112,259,2.0
7732211441095211112,260,9.0
7732211441095211112,261,1.0
7732211441095211112,632,6.0
8211761441096060717,176,1.0
8211761441096060717,183,1.0
8305691441168039389,259,3.0
8305691441168039389,260,2.0
8305691441168039389,261,1.0
8375281440837772178,527,1.0
8432311440724457499,290,1.0
8641451441097297246,183,1.0
8641451441097297246,184,1.0
8641451441097297246,207,1.0
8641451441097297246,259,1.0
8641451441097297246,263,1.0
8641451441097297246,838,1.0
8641451441097297246,839,1.0
8641451441097297246,840,1.0
8651081441095283643,176,2.0
8651081441095283643,183,7.0
8753221441095342356,176,1.0
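Before uploading, it can help to check that every record really has the three-column shape described above. The following is a minimal Python sketch, not part of Mahout; the function name `parse_preference_line` is hypothetical. It assumes that IDs are read as signed 64-bit integers, which is consistent with the user IDs in the job output appearing without their leading zero.

```python
MAX_LONG = 2**63 - 1  # largest signed 64-bit value (assumed ID range)


def parse_preference_line(line):
    """Split one 'user_id,item_id,preference' record and validate it."""
    user_str, item_str, pref_str = line.strip().split(",")
    user_id, item_id = int(user_str), int(item_str)
    # Reject IDs that would not fit in a signed 64-bit integer.
    if not (0 <= user_id <= MAX_LONG and 0 <= item_id <= MAX_LONG):
        raise ValueError(f"ID out of 64-bit range: {line!r}")
    return user_id, item_id, float(pref_str)


# First record of the test data; the leading zero is dropped by int().
print(parse_preference_line("0162381440670851711,4,7.0"))
```

Running this over every line of itemdata.data before the upload catches malformed rows early, which is cheaper than finding out from a failed Hadoop job.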
2. Use the algorithm that ships with Mahout to run collaborative filtering:
The command is as follows:
bin/hadoop jar /home/lin/hadoop/mahout-distribution-0.10.0/mahout-examples-0.10.0-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -i /mahout/itemcf/data1 -o /mahout/itemcf/result1 -s SIMILARITY_LOGLIKELIHOOD --tempDir /mahout/itemcf/temp1
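Here -i is followed by the location of the input data, i.e. the test data given above;
-o is followed by the output location for the results. This directory does not need to be created beforehand; Mahout creates it automatically, and the job fails if it already exists;
--tempDir is where temporary output, i.e. Mahout's own intermediate data, is stored. Mahout creates this path automatically as well, and the job fails if it already exists;
-s specifies the similarity measure to use; choose one according to your needs;
The full help output is as follows: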
Job-Specific Options:
  --input (-i) input    Path to job input directory.
  --output (-o) output    The directory pathname for output.
  --similarityClassname (-s) similarityClassname    Name of distributed similarity measures class to instantiate, alternatively use one of the predefined similarities ([SIMILARITY_COOCCURRENCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_CITY_BLOCK, SIMILARITY_COSINE, SIMILARITY_PEARSON_CORRELATION, SIMILARITY_EUCLIDEAN_DISTANCE])
  --maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem    try to cap the number of similar items per item to this number (default: 100)
  --maxPrefs (-mppu) maxPrefs    max number of preferences to consider per user or item, users or items with more preferences will be sampled down (default: 500)
  --minPrefsPerUser (-mp) minPrefsPerUser    ignore users with less preferences than this (default: 1)
  --booleanData (-b) booleanData    Treat input as without pref values
  --threshold (-tr) threshold    discard item pairs with a similarity value below this
  --randomSeed randomSeed    use this seed for sampling
  --help (-h)    Print out help
  --tempDir tempDir    Intermediate output directory
  --startPhase startPhase    First phase to run
  --endPhase endPhase    Last phase to run
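Because the output and temp directories must not exist when the job starts, re-running the command by hand is error-prone. The sketch below shows one way to assemble the same command line in Python. It is only an illustration: the helper names `build_recommender_args` and `build_cleanup_args` are hypothetical, and the cleanup command assumes a Hadoop version whose `fs -rm` accepts `-r` and `-f`. The help text also allows a custom class name for -s; this sketch deliberately accepts only the predefined names.

```python
# Predefined similarity names, copied from the help output above.
PREDEFINED_SIMILARITIES = {
    "SIMILARITY_COOCCURRENCE",
    "SIMILARITY_LOGLIKELIHOOD",
    "SIMILARITY_TANIMOTO_COEFFICIENT",
    "SIMILARITY_CITY_BLOCK",
    "SIMILARITY_COSINE",
    "SIMILARITY_PEARSON_CORRELATION",
    "SIMILARITY_EUCLIDEAN_DISTANCE",
}

JOB_CLASS = "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob"


def build_recommender_args(jar, input_dir, output_dir, temp_dir,
                           similarity="SIMILARITY_LOGLIKELIHOOD"):
    """Return the argument list for the RecommenderJob command used above."""
    if similarity not in PREDEFINED_SIMILARITIES:
        raise ValueError(f"unknown predefined similarity: {similarity}")
    return ["bin/hadoop", "jar", jar, JOB_CLASS,
            "-i", input_dir, "-o", output_dir,
            "-s", similarity, "--tempDir", temp_dir]


def build_cleanup_args(*paths):
    """Return an 'hadoop fs -rm' command that removes old output/temp dirs.

    Assumption: the installed Hadoop supports the -r and -f flags.
    """
    return ["bin/hadoop", "fs", "-rm", "-r", "-f", *paths]


args = build_recommender_args(
    "/home/lin/hadoop/mahout-distribution-0.10.0/mahout-examples-0.10.0-job.jar",
    "/mahout/itemcf/data1", "/mahout/itemcf/result1", "/mahout/itemcf/temp1")
print(" ".join(build_cleanup_args("/mahout/itemcf/result1", "/mahout/itemcf/temp1")))
print(" ".join(args))
```

The lists can be passed to `subprocess.run(args, check=True)` from the Hadoop installation directory. Building a list rather than a single string avoids shell-quoting problems if a path ever contains spaces.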
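3. After running the command above, wait for it to finish. The directory /mahout/itemcf/result1 then contains data like the following, where each line holds a user ID followed by a list of recommended item_id:score pairs: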
162381440670851711 [809:13.535571,702:13.535571,460:13.535571,758:13.535571,632:13.182321,577:12.929438,49:11.368558,307:10.562227,32:10.562227,518:10.562227]
617721441096186493 [839:1.0,259:1.0,518:1.0,826:1.0,11:1.0,260:1.0,4:1.0,32:1.0,176:1.0,840:1.0]
1401731441095483081 [11:1.0,570:1.0,518:1.0,307:1.0,260:1.0,259:1.0,549:1.0,32:1.0,207:1.0,184:1.0]
1917281441163686119 [577:7.365086,702:6.5,809:6.5,758:6.5,460:6.5,184:5.9840446,176:5.981493,4:5.577299,570:5.3220325,477:4.9567957]
2441031440670827430 [632:21.5,176:18.084661,183:15.684914,260:14.2175,207:13.510652,11:12.28147,307:12.28147,32:12.28147,518:12.28147,256:12.28147]
4698311441097046150 [263:3.9337947,839:3.9337947,840:3.9337947,838:3.9337947,11:3.4747553,307:3.4747553,32:3.4747553,518:3.4747553,256:3.4747553,301:3.4747553]
5307571441160362208 [826:1.0,259:1.0,518:1.0,307:1.0,11:1.0,260:1.0,549:1.0,32:1.0,207:1.0,184:1.0]
5719691441098504387 [4:3.6454906,259:3.6147578,260:2.67091,261:2.6694102,183:2.517088,307:2.2876854,11:2.2876854,32:2.2876854,518:2.2876854,256:2.2876854]
5813281441095425044 [207:1.8607497,259:1.6642486,183:1.5539461,301:1.4806436,11:1.4806436,307:1.4806436,32:1.4806436,518:1.4806436,256:1.4806436,549:1.4099455]
6836071440729632252 [207:2.6088793,176:2.3617313,477:1.9966183,460:1.9945599,758:1.9945599,809:1.9945599,702:1.9945599,11:1.9926376,307:1.9926376,32:1.9926376]
7616681441095305067 [826:1.5790755,207:1.5721571,549:1.535743,301:1.50748,307:1.50748,11:1.50748,32:1.50748,518:1.50748,256:1.50748,839:1.5]
7732211441095211112 [826:3.7059078,549:3.7059078,307:3.3461132,256:3.3461132,518:3.3461132,11:3.3461132,301:3.3461132,32:3.3461132,570:3.1800203,477:3.1795032]
8211761441096060717 [826:1.0,259:1.0,518:1.0,307:1.0,11:1.0,260:1.0,549:1.0,32:1.0,207:1.0,184:1.0]
8305691441168039389 [577:2.2471673,4:2.083036,570:2.0549815,809:2.0,460:2.0,11:2.0,826:2.0,32:2.0,307:2.0,549:2.0]
8641451441097297246 [11:1.0,632:1.0,518:1.0,826:1.0,260:1.0,570:1.0,549:1.0,32:1.0,307:1.0,477:1.0]
8651081441095283643 [184:6.597979,258:6.1955295,260:6.1955295,826:5.5266876,549:5.5266876,477:5.5266876,259:4.662548,261:4.662548,11:4.626224,307:4.626224]
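To use these results downstream, each record has to be split back into a user ID and its scored items. Here is a minimal Python sketch for that, assuming every record is a user ID, whitespace, and a bracketed list of item:score pairs as in the output above. The function name `parse_recommendation_line` is hypothetical, and the sample record is shortened to its first two recommendations.

```python
def parse_recommendation_line(line):
    """Parse 'user_id [item:score,item:score,...]' into (user_id, [(item, score), ...])."""
    user_str, items_str = line.strip().split(None, 1)
    recommendations = []
    for pair in items_str.strip("[]").split(","):
        if not pair:  # tolerate an empty list "[]"
            continue
        item_str, score_str = pair.split(":")
        recommendations.append((int(item_str), float(score_str)))
    return int(user_str), recommendations


# Shortened sample record taken from the output above.
user_id, recs = parse_recommendation_line(
    "8651081441095283643 [184:6.597979,258:6.1955295]")
print(user_id, recs)
```

The pairs keep the order in which they appear in the file, so the first element of the list is the first recommendation on the line.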
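Mahout has another frequently used algorithm, item similarity, whose result is the similarity between items. The command below reuses the output and temp paths from the run above; since neither may already exist, delete them first or substitute new paths: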
mahout itemsimilarity -i /mahout/itemcf/data1 -o /mahout/itemcf/result1 -s SIMILARITY_LOGLIKELIHOOD --tempDir /mahout/itemcf/temp1
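For intuition about what SIMILARITY_LOGLIKELIHOOD measures, the log-likelihood ratio between two items can be computed from four counts: users who interacted with both items, with only the first, with only the second, and with neither. The following is an independent Python sketch of that statistic, not Mahout's own code. The function names are hypothetical, and the final mapping of the ratio to a score via 1 - 1/(1 + LLR) is an assumption about how the value is normalized.

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio of a 2x2 table (k11 = both, k22 = neither)."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp rounding noise below zero


def contingency(users_a, users_b, total_users):
    """Build the 2x2 counts from the sets of users who touched item A and item B."""
    both = len(users_a & users_b)
    only_a = len(users_a) - both
    only_b = len(users_b) - both
    return both, only_a, only_b, total_users - both - only_a - only_b


def llr_similarity(k11, k12, k21, k22):
    """Map the ratio into [0, 1); this normalization is an assumption."""
    return 1.0 - 1.0 / (1.0 + llr(k11, k12, k21, k22))


# Hypothetical toy data: two items and the users who interacted with each.
counts = contingency({1, 2, 3}, {2, 3, 4}, total_users=10)
print(counts, llr_similarity(*counts))
```

The ratio is zero when the two items co-occur exactly as often as chance would predict and grows as the co-occurrence becomes more surprising. Note that this sketch uses only whether an interaction occurred, not the preference values themselves.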
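This post described how to use the itemCF algorithm in Mahout 0.10.0 to build recommendations, covering data preparation, the command-line options, and the resulting output.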



