Using svmrank to Implement Ensemble Learning

This post describes how to implement ensemble learning with svmrank: train a set of base classifiers, combine their scores into svmrank's input format, train an ensemble classifier with ranksvm, and finally pick the class with the highest confidence. The approach has been verified in practice and applies to scenarios where multi-classifier voting would otherwise be used.

In a classification task, when several classifiers perform comparably and you want to combine their strengths, one option is multi-classifier voting (VOTING); another is to use a learning-to-rank method to learn a combination of prediction scores that favors the correct class label. Below is a brief summary of how to do ensemble learning with svmrank.
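For contrast, plain voting can be done directly with scikit-learn. This is only a minimal sketch under my own assumptions: the three estimators and the soft-voting choice are illustrative and not part of the original write-up.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Soft voting simply averages the predicted class probabilities of the
# base classifiers instead of learning a weighting for them.
voter = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("svm", SVC(probability=True))],  # probability=True enables soft voting
    voting="soft",
)
# After the train/dev/test split below: voter.fit(X_train, y_train); voter.predict(X_test)
```

The rest of this post replaces that fixed averaging with weights learned by svmrank.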

First, split the data into a training set, a validation set, and a test set, then run feature extraction and quantization on all three (a minimal split sketch follows the list below):

Training set (training):

Raw data, one feature per column; used to extract the original features and train the multiple base classifiers.

Validation set (developing):

Used to train the ensemble classifier on the per-class scores produced by the base classifiers.

Test set (testing):

Held out for the final evaluation.
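A minimal sketch of such a three-way split, assuming the features have already been extracted and quantized into a matrix; the synthetic data and the 60/20/20 proportions are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real, already-quantized feature matrix.
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)

# 60% training, 20% validation (developing), 20% testing.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
```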

###### ranksvm data format ######

The validation and test sets are converted into svmrank's input format. Example from the validation set:
1 qid:1 1:0.8 2:0.2 3:0.2 4:0.1 5:0.5 #l1
0 qid:1 1:0.1 2:0.7 3:0.2 4:0.4 5:0.3 #l2
0 qid:1 1:0.1 2:0.1 3:0.6 4:0.5 5:0.2 #l3

Step 1: Train the base classifiers on the raw training data, then have every base classifier predict on the validation set. For each validation sample, each base classifier produces a score for each class. In the example above, the five features 1: through 5: in the first row are the five base classifiers' prediction scores for class l1; the second and third rows are read the same way.
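A sketch of this step, continuing the split above; the five base classifiers are arbitrary choices, and any model that produces per-class scores (here via predict_proba) would do.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Five illustrative base classifiers, matching the five features 1: .. 5: above.
base_clfs = [
    LogisticRegression(max_iter=1000),
    GaussianNB(),
    SVC(probability=True),
    RandomForestClassifier(n_estimators=100, random_state=42),
    GradientBoostingClassifier(random_state=42),
]

# Train every base classifier on the training set.
for clf in base_clfs:
    clf.fit(X_train, y_train)

# dev_scores[k][i, c] = score that base classifier k gives class c on dev sample i.
dev_scores = [clf.predict_proba(X_dev) for clf in base_clfs]
```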

Step 2: Build the data in the format shown above. The three rows in the example correspond to the three candidate classes l1 through l3 for one sample whose correct label is l1. The qid field marks all three rows as belonging to the same sample; the row of the correct class gets target 1 and the others get 0, so here the l1 row is labeled 1. This mirrors web search ranking: several results are returned for the same query and each result has its own rank score; here we simply treat exactly one result as relevant and give every other row rank 0.
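A sketch of writing that format, continuing the code above; write_svmrank_file is a hypothetical helper, not part of svmrank itself.

```python
def write_svmrank_file(path, score_list, labels, classes):
    """Write one qid block per sample: one row per candidate class,
    one feature per base classifier, target 1 only for the true class."""
    with open(path, "w") as f:
        for i, true_label in enumerate(labels):
            qid = i + 1  # one query per sample; rows sharing a qid are compared
            for c, cls in enumerate(classes):
                target = 1 if cls == true_label else 0
                feats = " ".join(f"{k + 1}:{scores[i, c]:.4f}"
                                 for k, scores in enumerate(score_list))
                f.write(f"{target} qid:{qid} {feats} # {cls}\n")

# Continuing the sketch above: dev.dat is the training file for the ensemble ranker.
write_svmrank_file("dev.dat", dev_scores, y_dev, base_clfs[0].classes_)
```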

Explanation: at this point the input data for the ensemble classifier is ready. What is the rationale behind this construction? The goal is to have the ranker automatically learn a set of weights and use them to linearly weight the multiple base classifiers' scores; if the same data set were used both for training and…
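To finish the pipeline, here is a minimal sketch, continuing the code above, of training the ensemble ranker with svmrank's command-line tools and then picking the class with the highest confidence on the test set. The binaries svm_rank_learn and svm_rank_classify ship with svmrank; the binary paths, the -c value, and the earlier helpers are my own assumptions.

```python
import subprocess
import numpy as np

# Score the test set with the base classifiers and write it in the same format.
test_scores = [clf.predict_proba(X_test) for clf in base_clfs]
classes = base_clfs[0].classes_
write_svmrank_file("test.dat", test_scores, y_test, classes)

# Train the ensemble ranker on dev.dat, then score test.dat.
# -c is svmrank's trade-off parameter; 20.0 is only an illustrative value.
subprocess.run(["./svm_rank_learn", "-c", "20.0", "dev.dat", "ensemble_model.dat"], check=True)
subprocess.run(["./svm_rank_classify", "test.dat", "ensemble_model.dat", "predictions"], check=True)

# predictions holds one score per row of test.dat, in order; each sample occupies
# len(classes) consecutive rows, so the highest-scoring row gives the predicted class.
scores = np.loadtxt("predictions").reshape(-1, len(classes))
predicted = classes[np.argmax(scores, axis=1)]
print("ensemble accuracy:", (predicted == y_test).mean())
```

The learned model is essentially a weight vector over the base classifiers' scores, which is exactly the linear weighting described above.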

This paper focuses on the problem of Question Routing (QR) in Community Question Answering (CQA), which aims to route newly posted questions to the potential answerers who are most likely to answer them. Traditional methods to solve this problem only consider the text similarity features between the newly posted question and the user profile, while ignoring important statistical features, including the question-specific statistical features and the user-specific statistical features. Moreover, traditional methods are based on unsupervised learning, which makes it hard to introduce rich features into them. This paper proposes a general framework based on learning-to-rank concepts for QR. Training sets consisting of triples (q, asker, answerers) are first collected. Then, by introducing the intrinsic relationships between the asker and the answerers in each CQA session to capture the intrinsic labels/orders of the users with respect to their degree of expertise on the question q, two different methods, an SVM-based and a RankingSVM-based method, are presented to learn the models with different example-creation processes from the training set. Finally, the potential answerers are ranked using the trained models. Extensive experiments conducted on a real-world CQA dataset from Stack Overflow show that our two proposed methods both outperform the traditional query likelihood language model (QLLM) as well as the state-of-the-art Latent Dirichlet Allocation based model (LDA). Specifically, the RankingSVM-based method achieves statistically significant improvements over the SVM-based method and gains the best performance.