self-training and co-training


Widely used semi-supervised learning methods include:

1. EM with generative mixture models
2. self-training
3. co-training
4. transductive support vector machines
5. graph-based methods

self-training:

A classifier is first trained with the small amount of labeled data. The classifier is then used to classify the unlabeled data. Typically, the most confident unlabeled data points, together with their predicted labels, are added to the training set. The classifier is re-trained and the procedure repeated.

When the existing supervised classifier is complicated and hard to modify, self-training is a practical wrapper method. It has been applied to several natural language processing tasks, including word sense disambiguation, parsing and machine translation, as well as to object detection in images.
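The loop above is easy to express as a thin wrapper around any base classifier. Below is a minimal sketch, assuming a scikit-learn-style classifier with `predict_proba`; the function name `self_train` and the parameters `n_iter` and `n_add` are illustrative, not from the reference:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def self_train(clf, X_labeled, y_labeled, X_unlabeled, n_iter=10, n_add=10):
    """Illustrative sketch of wrapper-style self-training around a base classifier."""
    X, y, X_u = X_labeled.copy(), y_labeled.copy(), X_unlabeled.copy()
    for _ in range(n_iter):
        if len(X_u) == 0:
            break
        clf.fit(X, y)                                    # train on the current labeled pool
        proba = clf.predict_proba(X_u)                   # classify the unlabeled data
        idx = np.argsort(-proba.max(axis=1))[:n_add]     # most confident points
        y_new = clf.classes_[proba[idx].argmax(axis=1)]  # their predicted labels
        X = np.vstack((X, X_u[idx]))                     # add them to the training set
        y = np.concatenate((y, y_new))
        X_u = np.delete(X_u, idx, axis=0)                # drop them from the unlabeled pool
    clf.fit(X, y)                                        # final re-training
    return clf


# Example usage with a logistic regression base learner:
# clf = self_train(LogisticRegression(max_iter=1000), X_l, y_l, X_u)
```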

 

co-training:

Co-training assumes that the features can be split into two sets, and that each sub-feature set is sufficient to train a good classifier. The two sets are conditionally independent given the class. Initially, two separate classifiers are trained on the labeled data, one on each of the two sub-feature sets. Each classifier then classifies the unlabeled data and 'teaches' the other classifier with the few unlabeled examples (and their predicted labels) it is most confident about. Each classifier is re-trained with the additional training examples provided by the other classifier, and the process repeats.

When the features naturally split into two sets, co-training may be appropriate. (A Python sketch of the procedure is given at the end of this post.)

 

Reference:

Xiaojin Zhu. Semi-Supervised Learning with Graphs.

Co-training is a semi-supervised learning method that uses unlabeled data to improve model performance. Below is a Python implementation sketch of co-training:

```python
import numpy as np


class CoTrainer:
    """Co-training over two feature views of the same examples."""

    def __init__(self, clf1, clf2, n_iter=10, n_add=1):
        self.clf1 = clf1      # classifier for view 1
        self.clf2 = clf2      # classifier for view 2
        self.n_iter = n_iter  # number of co-training rounds
        self.n_add = n_add    # confident examples each classifier adds per round

    def fit(self, X1, X2, y, X1_u, X2_u):
        # X1, X2: the two views of the labeled data; y: their labels.
        # X1_u, X2_u: the two views of the unlabeled data.
        X1, X2, y = X1.copy(), X2.copy(), y.copy()
        X1_u, X2_u = X1_u.copy(), X2_u.copy()
        for _ in range(self.n_iter):
            if len(X1_u) == 0:
                break
            # Train each classifier on its own view of the current labeled pool.
            self.clf1.fit(X1, y)
            self.clf2.fit(X2, y)

            # clf1 labels the unlabeled data and 'teaches' the other classifier:
            # its most confident examples join the pool with their predicted labels.
            proba1 = self.clf1.predict_proba(X1_u)
            idx1 = np.argsort(-proba1.max(axis=1))[: self.n_add]
            y_new = self.clf1.predict(X1_u[idx1])
            X1 = np.vstack((X1, X1_u[idx1]))
            X2 = np.vstack((X2, X2_u[idx1]))
            y = np.concatenate((y, y_new))
            X1_u = np.delete(X1_u, idx1, axis=0)
            X2_u = np.delete(X2_u, idx1, axis=0)

            if len(X1_u) == 0:
                break
            # clf2 does the same in the other direction.
            proba2 = self.clf2.predict_proba(X2_u)
            idx2 = np.argsort(-proba2.max(axis=1))[: self.n_add]
            y_new = self.clf2.predict(X2_u[idx2])
            X1 = np.vstack((X1, X1_u[idx2]))
            X2 = np.vstack((X2, X2_u[idx2]))
            y = np.concatenate((y, y_new))
            X1_u = np.delete(X1_u, idx2, axis=0)
            X2_u = np.delete(X2_u, idx2, axis=0)

        # Final re-training on the enlarged labeled pool.
        self.clf1.fit(X1, y)
        self.clf2.fit(X2, y)
        return self

    def predict(self, X1, X2):
        # Combine the two views by averaging the predicted class probabilities.
        proba = (self.clf1.predict_proba(X1) + self.clf2.predict_proba(X2)) / 2
        return self.clf1.classes_[proba.argmax(axis=1)]
```

This implementation works with any pair of base classifiers that expose `predict_proba` (for example scikit-learn's `MultinomialNB`). In `fit`, the two classifiers are first trained on their own feature views of the labeled pool; each then labels the unlabeled data, and the examples it is most confident about are added to the shared labeled pool with their predicted labels and removed from the unlabeled pool, so the other classifier sees them in the next round. The process repeats for `n_iter` rounds or until the unlabeled pool is empty, after which both classifiers are re-fit on the enlarged pool; at prediction time the two classifiers' probability estimates are averaged.
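As a usage sketch (the toy data and the split into two views below are made up purely for illustration), two naive Bayes models could be co-trained like this:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy data for illustration only: 20 count features, split into two views
# (first ten / last ten columns).
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 20))
y = (X[:, :10].sum(axis=1) > X[:, 10:].sum(axis=1)).astype(int)

X1, X2 = X[:, :10], X[:, 10:]               # the two feature views
X1_l, X2_l, y_l = X1[:20], X2[:20], y[:20]  # small labeled set
X1_u, X2_u = X1[20:], X2[20:]               # unlabeled pool

ct = CoTrainer(MultinomialNB(), MultinomialNB(), n_iter=20, n_add=5)
ct.fit(X1_l, X2_l, y_l, X1_u, X2_u)
print(ct.predict(X1[:5], X2[:5]))
```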