StumbleUpon Evergreen Classification Challenge
------2013/08/16 -- 2013/10/31
I. Background
Build a classifier to categorize webpages as evergreen or non-evergreen
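StumbleUpon is a US-based user-generated content (UGC) site where users share content. The site uses user behavior data to build an interest graph and to personalize recommendations to each user's preferences.
StumbleUpon ran a competition in which the company provides a dataset made up of a labeled training set and a test set to be predicted. Using the historical data StumbleUpon supplies, participants design a classification model that predicts whether a given webpage will stay popular over the long term or only briefly.
The training set contains each webpage's content and a label (whether the page is evergreen, i.e. popular over the long term).
The test set contains the webpage content only.
Prediction target y: 0 or 1 (0: non-evergreen, 1: evergreen)
The dataset format given on the official site is as follows: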
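As a rough illustration of the setup described above, the sketch below separates page text from the binary target in a labeled, tab-separated training set. It is only a minimal sketch under assumptions: the column names `urlid` and `boilerplate` follow the official field list, while the `label` column name, the `title` key inside the boilerplate JSON, the helper names, and every sample value are hypothetical and chosen for illustration.

```python
import csv
import io
import json


def make_sample_tsv():
    """Build a tiny tab-separated sample in memory (all values are made up)."""
    rows = [
        # "label" is an assumed column name: 1 = evergreen, 0 = non-evergreen.
        {"urlid": 1,
         "boilerplate": json.dumps({"title": "Classic bread recipe"}),
         "label": 1},
        {"urlid": 2,
         "boilerplate": json.dumps({"title": "Election results tonight"}),
         "label": 0},
    ]
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["urlid", "boilerplate", "label"], delimiter="\t"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def load_labeled_rows(tsv_text):
    """Return (texts, labels) parsed from labeled tab-separated text."""
    texts, labels = [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # The boilerplate field holds JSON; the "title" key is an assumption.
        page = json.loads(row["boilerplate"])
        texts.append(page.get("title") or "")
        labels.append(int(row["label"]))
    return texts, labels


texts, labels = load_labeled_rows(make_sample_tsv())
print(texts)
print(labels)
```

For the test set the same parsing would apply, except that there is no target column to read; the model's job is to produce y for those rows.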
FieldName | Type | Description |
url | string | Url of the webpage to be classified |
urlid | integer | StumbleUpon's unique identifier for each url |
boilerplate | json | Boilerplate text |
alchemy_category | string | Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com) |
alchemy_category_score | double | Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com) |
avglinksize | double | Average number of words in each link |
commonLinkRatio_1 | double | # of links sharing at least 1 word with 1 other links / # of links |
commonLinkRatio_2 | double | # of links sharing at least 1 word with 2 other links / # of links |
commonLinkRatio_3 | double | # of links sharing at least 1 word with 3 other li |