Stanford - Algorithms: Design and Analysis, Part 2 - Week 2 Assignment: Clustering

This post works through a concrete clustering problem: using Kruskal's algorithm to compute a max-spacing k-clustering, plus a neat trick for clustering a data set far too large to list every pairwise distance, where the goal is the largest number of clusters that respects a given spacing constraint.


This week's assignment is about clustering.

Here is the first question:

Question 1

In this programming problem and the next you'll code up the clustering algorithm from lecture for computing a max-spacing k-clustering. Download the text file here. This file describes a distance function (equivalently, a complete graph with edge costs). It has the following format:

[number_of_nodes]
[edge 1 node 1] [edge 1 node 2] [edge 1 cost]
[edge 2 node 1] [edge 2 node 2] [edge 2 cost]
...
There is one edge (i,j) for each choice of 1 ≤ i < j ≤ n, where n is the number of nodes. For example, the third line of the file is "1 3 5250", indicating that the distance between nodes 1 and 3 (equivalently, the cost of the edge (1,3)) is 5250. You can assume that distances are positive, but you should NOT assume that they are distinct.

Your task in this problem is to run the clustering algorithm from lecture on this data set, where the target number k of clusters is set to 4. What is the maximum spacing of a 4-clustering?

ADVICE: If you're not getting the correct answer, try debugging your algorithm using some small test cases. And then post them to the discussion forum!

First, some background on clustering:


Roughly speaking, a group of points that lie close to each other forms a cluster.


The first concept is the spacing of a k-clustering: the minimum distance between any two points p and q that lie in different clusters. Any pair of points closer than the spacing must therefore belong to the same cluster.

The second concept is exactly what this problem asks for: given k, we want the k-clustering that maximizes this spacing. Intuitively, if we are going to split the points into k clusters, we want those clusters to be separated as well as possible.


The last figure covers the relationship between clustering and Kruskal's algorithm: run Kruskal's algorithm as usual, but stop it early, at the moment the number of clusters (connected components) is exactly k.
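
To make that connection concrete, here is a compact sketch of the whole procedure (a sketch only, not the submitted solution; the struct mirrors the edge type used in the full program further down, and a tiny pointer-chasing find is inlined as a lambda):

# include <vector>
# include <cstddef>

struct edge { int start; int end; int weight; };	/* same layout as in the full program below */

/* Max-spacing k-clustering = Kruskal's algorithm stopped early: keep merging the
   cheapest cross-cluster edge until exactly k clusters remain, then the weight of
   the next cross-cluster edge is the spacing.
   Assumes edges is already sorted by nondecreasing weight and vertices are 1..n. */
int max_spacing_sketch(const std::vector<edge>& edges, int n, int k) {
	std::vector<int> parent(n + 1);
	for (int v = 1; v <= n; ++v) parent[v] = v;	/* every vertex starts as its own cluster */

	auto root = [&](int v) { while (parent[v] != v) v = parent[v]; return v; };

	int clusters = n;
	std::size_t i = 0;
	for (; i < edges.size() && clusters > k; ++i) {
		int a = root(edges[i].start), b = root(edges[i].end);
		if (a != b) { parent[a] = b; --clusters; }	/* merge two clusters */
	}
	for (; i < edges.size(); ++i)				/* first remaining cross-cluster edge */
		if (root(edges[i].start) != root(edges[i].end))
			return edges[i].weight;			/* the spacing of the k-clustering */
	return -1;						/* only reached if k <= 1 */
}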

Now for the solution to the first question; I'll assume everyone is already fully familiar with Kruskal's algorithm.

The first step in computing the spacing is to run Kruskal's algorithm, remembering to break out as soon as only k clusters remain:

	/* kruskal */
	for (i = 0; i < u_graph.size(); ++i) {
		if (!connected(id, u_graph[i].start, u_graph[i].end))
			union_1(id, u_graph[i].start, u_graph[i].end, count, num_of_ver);
		if (count == k)
			break;
	}
The next step is to return the spacing, which is computed starting from the edge being examined at the moment we broke out with count == k.

Scanning forward from there, the first edge whose endpoints u and v are not connected joins the closest pair of points lying in different clusters, so its weight is exactly the value we need to return. Any edge not yet examined has a weight at least as large, because Kruskal's algorithm processes edges in nondecreasing order of weight.

The code:

	while (i < u_graph.size()) {
		if (connected(id, u_graph[i].start, u_graph[i].end))
			++i;
		else
			return u_graph[i].weight;
	}
Here is the complete code:

# include <iostream>
# include <string>
# include <fstream>
# include <vector>
# include <algorithm>

using namespace std;

struct edge{
	int start;
	int end;
	int weight;
};

vector<edge> u_graph;

/* comparator of the u_graph */
bool comp_function(edge ed1, edge ed2) { return (ed1.weight < ed2.weight); }

int store_file(string);			/* returns the number of vertices */
void print_all(void);			/* print the edges */
int clustering(int, int);
int find(int*, int);			/* helpers for the union-find structure */
bool connected(int*, int, int);
void union_1(int*, int, int, int&, int);

int main(int argc, char** argv) {
	int num_of_ver = store_file("clustering1.txt");
	sort(u_graph.begin(), u_graph.end(), comp_function);
//	print_all();
	int res = clustering(num_of_ver, 4);
	cout << "spacing: " << res << endl;
	return 0;
}

int store_file(string filename) {
	ifstream infile;
	infile.open(filename, ios::in);
	int num_of_ver = 0;
	infile >> num_of_ver;
	int s, e, w;
	while (infile >> s >> e >> w) {
		edge ed1 = {s, e, w};
		u_graph.push_back(ed1);
	}
	infile.close();
	return num_of_ver;
}

void print_all(void) {
	for (int i = 0; i < u_graph.size(); ++i)
		cout << u_graph[i].start << " " << u_graph[i].end << " " << u_graph[i].weight << endl;
}

int clustering(int num_of_ver, int k) {
	int* id = new int[num_of_ver+1];
	/* constructor */
	int i;
	for (i = 1; i <= num_of_ver; ++i)
		id[i] = i;
	int count = num_of_ver;

	/* kruskal */
	for (i = 0; i < u_graph.size(); ++i) {
		if (!connected(id, u_graph[i].start, u_graph[i].end))
			union_1(id, u_graph[i].start, u_graph[i].end, count, num_of_ver);
		if (count == k)
			break;
	}

	while (i < u_graph.size()) {
		if (connected(id, u_graph[i].start, u_graph[i].end))
			++i;
		else
			return u_graph[i].weight;
	}

	return -1;	/* never reached for k > 1: some cross-cluster edge always remains */
}

void union_1(int* id, int u, int v, int& count, int num_of_ver) {
	int u_id = find(id, u);
	int v_id = find(id, v);

	if (u_id == v_id)
		return;

	/* re-point u's component to v's id */
	for (int i = 1; i <= num_of_ver; ++i) {
		if (id[i] == u_id)
			id[i] = v_id;
	}
	--count;
}

int find(int* id, int u) {
	return id[u];
}

bool connected(int* id, int u, int v) {
	return find(id, u) == find(id, v);
}
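
One practical note: the quick-find scheme above (find just returns id[u], and union_1 rescans the whole id array) costs O(n) per merge, which is fine for this assignment's input size. If that ever became a bottleneck, a union-by-rank plus path-compression structure is the usual replacement. A minimal sketch, not part of the submitted solution (the names UnionFind, uf_find and uf_union are mine):

# include <vector>
# include <utility>

/* Weighted union-find with path halving, a possible drop-in
   replacement for the quick-find id[] scheme used above. */
struct UnionFind {
	std::vector<int> parent, rnk;
	int components;

	explicit UnionFind(int n) : parent(n + 1), rnk(n + 1, 0), components(n) {
		for (int v = 1; v <= n; ++v) parent[v] = v;	/* vertices are numbered 1..n */
	}
	int uf_find(int v) {
		while (parent[v] != v) {
			parent[v] = parent[parent[v]];		/* path halving */
			v = parent[v];
		}
		return v;
	}
	bool uf_union(int u, int v) {				/* true if a merge actually happened */
		int a = uf_find(u), b = uf_find(v);
		if (a == b) return false;
		if (rnk[a] < rnk[b]) std::swap(a, b);		/* hang the shallower tree below */
		parent[b] = a;
		if (rnk[a] == rnk[b]) ++rnk[a];
		--components;
		return true;
	}
};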


Now for the second question:

Question 2

In this question your task is again to run the clustering algorithm from lecture, but on a MUCH bigger graph. So big, in fact, that the distances (i.e., edge costs) are only defined implicitly, rather than being provided as an explicit list.

The data set is here. The format is:
[# of nodes] [# of bits for each node's label]
[first bit of node 1] ... [last bit of node 1]
[first bit of node 2] ... [last bit of node 2]
...
For example, the third line of the file "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1" denotes the 24 bits associated with node #2.

The distance between two nodes u and v in this problem is defined as the Hamming distance --- the number of differing bits --- between the two nodes' labels. For example, the Hamming distance between the 24-bit label of node #2 above and the label "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1" is 3 (since they differ in the 3rd, 7th, and 21st bits).

The question is: what is the largest value of k such that there is a k-clustering with spacing at least 3? That is, how many clusters are needed to ensure that no pair of nodes with all but 2 bits in common get split into different clusters?

NOTE: The graph implicitly defined by the data file is so big that you probably can't write it out explicitly, let alone sort the edges by cost. So you will have to be a little creative to complete this part of the question. For example, is there some way you can identify the smallest distances without explicitly looking at every pair of nodes?

The second question uses the Hamming distance as the distance between two vertices; the problem statement spells this out.
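
As a quick aside (not part of the submitted solution): once each bit label is packed into a single integer, as store_file below does, the Hamming distance between two labels is just the number of 1 bits left after XOR-ing them, since XOR leaves a 1 exactly in the positions where the labels differ. A minimal sketch:

/* Hamming distance between two labels packed into ints. */
int hamming_distance(unsigned int a, unsigned int b) {
	unsigned int x = a ^ b;		/* 1 bits mark the positions where a and b differ */
	int count = 0;
	while (x != 0) {
		x &= (x - 1);		/* clear the lowest set bit */
		++count;
	}
	return count;
}

/* e.g. hamming_distance(4, 7) == 2: labels 100 and 111 differ in two bits */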

What is a bit different here is that the spacing (at least 3) is given and we have to find k.

The solution is quite tricky; I didn't come up with it myself and only worked it out after reading the discussion forum.

I'll borrow Naveen Sharma's explanation directly, just restated in my own words:

Start with a simple example:

7 3
1 0 0
0 1 1
1 1 1
1 0 0
1 0 1
0 1 1
1 1 1
That is, there are 7 nodes, each labeled with 3 bits.

Step 1: simplify by removing duplicate points. Duplicate points (Hamming distance 0) necessarily end up in the same cluster, so we can treat them as a single point without affecting the spacing.

4 3
1 0 0
0 1 1
1 1 1
1 0 1
I used a hash map to drop the duplicate points; the code:

		/* eliminate the duplicate vertices */
		if (hash_map.find(val) == hash_map.end())
			hash_map.insert(pair<int, int>(val, hash_map.size()));
Step 2: number these four vertices, for example:

1 0 0 1
0 1 1 2
1 1 1 3
1 0 1 4

This needs no special handling in my code: the hash map assigns each distinct label an index (starting from 0) as it is inserted.

Step 3: XOR has a very useful property: if A xor B = C, then A xor C = B.

Since the problem asks for spacing at least 3, every pair of points at Hamming distance 1 or 2 must end up in the same cluster, i.e. must be merged by a union operation.

If A xor B = 100, the Hamming distance between A and B is 1; if A xor C = 011, the Hamming distance is 2. We can also run this in reverse: starting from A and a target distance, enumerate every mask with that many 1 bits, XOR it with A, and look the result up in the hash map, e.g. A xor 100 = B and A xor 011 = C. (For 24-bit labels there are only 24 one-bit masks and C(24,2) = 276 two-bit masks, so this is roughly 300 lookups per node instead of examining all pairs.) After unioning every pair found this way, the number of remaining clusters is exactly k:

The code:

	/* for one_bit */
	for (auto it = hash_map.begin(); it != hash_map.end(); ++it) {
		for (int k = 0; k < one_bit.size(); ++k) {
			int seek = it->first ^ one_bit[k];
			if (hash_map.find(seek) != hash_map.end()) {
				if (!connected(id, hash_map[seek], it->second))
					union_1(id, hash_map[seek], it->second, count);
				if (count % 100 == 0)
					cout << count << endl;
			}
		}
	}

	/* for two bit */
	for (auto it = hash_map.begin(); it != hash_map.end(); ++it) {
		for (int k = 0; k < two_bit.size(); ++k) {
			int seek = it->first ^ two_bit[k];
			if (hash_map.find(seek) != hash_map.end()) {
				if (!connected(id, hash_map[seek], it->second))
					union_1(id, hash_map[seek], it->second, count);
				if (count % 100 == 0)
					cout << count << endl;
			}
		}
	}
The complete code:

# include <iostream>
# include <fstream>
# include <unordered_map>
# include <sstream>
# include <string>
# include <cmath>
# include <vector>

using namespace std;

unordered_map<int, int> hash_map;		/* key: label bits packed into an int, value: node index */
vector<int> one_bit;
vector<int> two_bit;

void store_file(string);				/* prototype */
void print_all(void);
void generate(int);
bool connected(int*, int, int);
void union_1(int*, int, int, int&);
int find(int*, int);
int clustering(void);

int main(int argc, char** argv) {

	store_file("clustering_big.txt");
//	print_all();
	cout << "cluster: " << clustering() << endl;

	return 0;
}

void store_file(string filename) {
	int line = 1;
	int num, bits;
	ifstream infile;
	infile.open(filename, ios::in);
	string str;
	getline(infile, str);
	istringstream iss(str);
	iss >> num >> bits;
	generate(bits);
	while (line <= num) {
		string str;
		getline(infile, str);
		istringstream iss(str);
		int i = 0;
		int val = 0;
		while (i < bits) {
			int n;
			iss >> n;
			val = val * 2 + n;
			++i;
		}
		/* eliminate the duplicate vertices */
		if (hash_map.find(val) == hash_map.end())
			hash_map.insert(pair<int, int>(val, hash_map.size()));
		++line;
	}
	infile.close();
}

void print_all(void) {
	cout << "hash_map: " << endl;
	for(auto it = hash_map.begin(); it != hash_map.end(); ++it)
		cout << it->first << "  " << it->second << endl;
	cout << "one_bit: " << endl;
	for (auto it = one_bit.begin(); it != one_bit.end(); ++it)
		cout << *it << endl;
	cout << "two_bit: " << endl;
	for (auto it = two_bit.begin(); it != two_bit.end(); ++it)
		cout << *it << endl;
}

void generate(int bits) {
	/* masks with exactly one bit set */
	for (int i = 0; i < bits; ++i) {
		int val = 1 << i;
		one_bit.push_back(val);
	}
	/* masks with exactly two bits set */
	for (int i = 0; i < bits; ++i) {
		int val = 1 << i;
		for (int j = i + 1; j < bits; ++j) {
			int val_1 = val | (1 << j);
			two_bit.push_back(val_1);
		}
	}
}

int clustering(void) {
	int num_of_ver = hash_map.size();
	int* id = new int[num_of_ver];
	/* constructor */
	int i;
	for (i = 0; i < num_of_ver; ++i)
		id[i] = i;
	int count = num_of_ver;

	/* for one_bit */
	for (auto it = hash_map.begin(); it != hash_map.end(); ++it) {
		for (int k = 0; k < one_bit.size(); ++k) {
			int seek = it->first ^ one_bit[k];
			if (hash_map.find(seek) != hash_map.end()) {
				if (!connected(id, hash_map[seek], it->second))
					union_1(id, hash_map[seek], it->second, count);
				if (count % 100 == 0)
					cout << count << endl;
			}
		}
	}

	/* for two bit */
	for (auto it = hash_map.begin(); it != hash_map.end(); ++it) {
		for (int k = 0; k < two_bit.size(); ++k) {
			int seek = it->first ^ two_bit[k];
			if (hash_map.find(seek) != hash_map.end()) {
				if (!connected(id, hash_map[seek], it->second))
					union_1(id, hash_map[seek], it->second, count);
				if (count % 100 == 0)
					cout << count << endl;
			}
		}
	}

	return count;
}

bool connected(int* id, int u, int v) {
	return find(id, u) == find(id, v);
}

void union_1(int* id, int u, int v, int& count) {
	int u_id = find(id, u);
	int v_id = find(id, v);

	if (u_id == v_id)
		return;

	/* re-point u's component to v's id */
	for (int i = 0; i < (int)hash_map.size(); ++i) {
		if (id[i] == u_id)
			id[i] = v_id;
	}
	--count;
}

int find(int* id, int u) {
	return id[u];
}

Note: Kruskal's algorithm relies on the union-find data structure. Personally I think the red Princeton Algorithms book (Sedgewick and Wayne) explains it very well; have a look if anything is unclear. Also, as an algorithms beginner, I find the Princeton book far more approachable than CLRS.
