主动学习之 ALEC

最新推荐文章于 2025-04-28 00:01:04 发布

原创

最新推荐文章于 2025-04-28 00:01:04 发布 · 429 阅读

0 ·

CC 4.0 BY-SA版权

文章标签：

#学习

本文介绍了一种基于密度峰值的主动学习算法(ALEC)，用于数据分类。算法通过构建树结构并计算对象的代表性，选择最具代表性的样本进行查询。如果查询样本具有相同标签，则认为当前数据块纯化，其余样本被打上相同标签；否则，将数据块分裂并递归处理。整个过程旨在减少查询次数，提高学习效率。

1. 主动学习 Active learning

图 1. 主动学习场景

2. 三支主动学习

基于聚类的主动学习
样本处于三种状态: 被查询、被分类、延迟处理

图 2. 三支主动学习

2.1 ALEC 算法

图 3. ALEC 算法运行示例

Step 1. 根据 Density peaks 将数据组织成一棵树, 同时计算每个对象的代表性;
Step 2. 查询当前块代表性最高的若干样本;
Step 3. 如果被查询样本具有同样的标签, 则认为当前块纯了, 将其余样本全部打上同样标签;
Step 4. 否则将当前块分裂为两块, 递归到下一级的 Step 2;

基于聚类主动学习的基本思想如下:
Step 1. 将对象按代表性递减排序.
Step 2. 假设当前数据块有 $N$ 个对象, 选择最具代表性的 $\sqrt{N}$ 个查询其标签 (类别).
Step 3. 如果这 $\sqrt{N}$ 个标签具有相同类别, 就认为该块为纯的, 其它对象均分类为同一类别.结束.
Step 4. 将当前块划分为两个子块, 分别 Goto Step 3.

完整代码：

package machinelearning.activelearning;

import java.io.FileReader;
import java.util.*;
import weka.core.Instances;

/**
 * Active learning through density clustering.
 * http://www.sciencedirect.com/science/article/pii/S095741741730369X
 * 
 * @author Rui Chen 1369097405@qq.com.
 */
public class Alec {
	/**
	 * The whole dataset.
	 */
	Instances dataset;

	/**
	 * The maximal number of queries that can be provided.
	 */
	int maxNumQuery;

	/**
	 * The actual number of queries.
	 */
	int numQuery;

	/**
	 * The radius, also dc in the paper. It is employed for density computation.
	 */
	double radius;

	/**
	 * The densities of instances, also rho in the paper.
	 */
	double[] densities;

	/**
	 * distanceToMaster
	 */
	double[] distanceToMaster;

	/**
	 * Sorted indices, where the first element indicates the instance with the
	 * biggest density.
	 */
	int[] descendantDensities;

	/**
	 * Priority
	 */
	double[] priority;

	/**
	 * The maximal distance between any pair of points.
	 */
	double maximalDistance;

	/**
	 * Who is my master?
	 */
	int[] masters;

	/**
	 * Predicted labels.
	 */
	int[] predictedLabels;

	/**
	 * Instance status. 0 for unprocessed, 1 for queried, 2 for classified.
	 */
	int[] instanceStatusArray;

	/**
	 * The descendant indices to show the representativeness of instances in a
	 * descendant order.
	 */
	int[] descendantRepresentatives;

	/**
	 * Indicate the cluster of each instance. It is only used in
	 * clusterInTwo(int[]);
	 */
	int[] clusterIndices;

	/**
	 * Blocks with size no more than this threshold should not be split further.
	 */
	int smallBlockThreshold = 3;

	/**
	 ********************************** 
	 * The constructor.
	 * 
	 * @param paraFilename
	 *            The data filename.
	 ********************************** 
	 */
	public Alec(String paraFilename) {
		try {
			FileReader tempReader = new FileReader(paraFilename);
			dataset = new Instances(tempReader);
			dataset.setClassIndex(dataset.numAttributes() - 1);
			tempReader.close();
		} catch (Exception ee) {
			System.out.println(ee);
			System.exit(0);
		} // Of fry
		computeMaximalDistance();