Stanford - Algorithms: Design and Analysis, Part 2 - Week 2 Assignment: Clustering

This post works through a concrete clustering problem: using Kruskal's algorithm to compute a max-spacing k-clustering, plus a neat trick for clustering a data set far too large to list every pairwise distance, where the goal is the largest number of clusters that respects a given spacing constraint.


This week's assignment is about clustering.

Here is the first question:

Question 1

In this programming problem and the next you'll code up the clustering algorithm from lecture for computing a max-spacing k-clustering. Download the text file here. This file describes a distance function (equivalently, a complete graph with edge costs). It has the following format:

[number_of_nodes]
[edge 1 node 1] [edge 1 node 2] [edge 1 cost]
[edge 2 node 1] [edge 2 node 2] [edge 2 cost]
...
There is one edge (i,j) for each choice of 1 ≤ i < j ≤ n, where n is the number of nodes. For example, the third line of the file is "1 3 5250", indicating that the distance between nodes 1 and 3 (equivalently, the cost of the edge (1,3)) is 5250. You can assume that distances are positive, but you should NOT assume that they are distinct.

Your task in this problem is to run the clustering algorithm from lecture on this data set, where the target number k of clusters is set to 4. What is the maximum spacing of a 4-clustering?

ADVICE: If you're not getting the correct answer, try debugging your algorithm using some small test cases. And then post them to the discussion forum!

First, some background on clustering:


Roughly speaking, a group of points that lie close to each other forms a cluster.


The first concept is the spacing of a k-clustering: the minimum distance between any two points p and q that lie in different clusters. Any pair of points closer than the spacing must therefore belong to the same cluster.

The second concept is exactly what this problem asks for: given k, we want the k-clustering that maximizes this spacing. Intuitively, if we are going to split the points into k clusters, we want those clusters to be separated as well as possible.


The last figure covers the relationship between clustering and Kruskal's algorithm: run Kruskal's algorithm as usual, but stop it early, at the moment the number of clusters (connected components) is exactly k.
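
To make that connection concrete, here is a compact sketch of the whole procedure (a sketch only, not the submitted solution; the struct mirrors the edge type used in the full program further down, and a tiny pointer-chasing find is inlined as a lambda):

# include <vector>
# include <cstddef>

struct edge { int start; int end; int weight; };	/* same layout as in the full program below */

/* Max-spacing k-clustering = Kruskal's algorithm stopped early: keep merging the
   cheapest cross-cluster edge until exactly k clusters remain, then the weight of
   the next cross-cluster edge is the spacing.
   Assumes edges is already sorted by nondecreasing weight and vertices are 1..n. */
int max_spacing_sketch(const std::vector<edge>& edges, int n, int k) {
	std::vector<int> parent(n + 1);
	for (int v = 1; v <= n; ++v) parent[v] = v;	/* every vertex starts as its own cluster */

	auto root = [&](int v) { while (parent[v] != v) v = parent[v]; return v; };

	int clusters = n;
	std::size_t i = 0;
	for (; i < edges.size() && clusters > k; ++i) {
		int a = root(edges[i].start), b = root(edges[i].end);
		if (a != b) { parent[a] = b; --clusters; }	/* merge two clusters */
	}
	for (; i < edges.size(); ++i)				/* first remaining cross-cluster edge */
		if (root(edges[i].start) != root(edges[i].end))
			return edges[i].weight;			/* the spacing of the k-clustering */
	return -1;						/* only reached if k <= 1 */
}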

Now for the solution to the first question; I'll assume everyone is already fully familiar with Kruskal's algorithm.

The first step in computing the spacing is to run Kruskal's algorithm, remembering to break out as soon as only k clusters remain:

	/* kruskal */
	for (i = 0; i < u_graph.size(); ++i) {
		if (!connected(id, u_graph[i].start, u_graph[i].end))
			union_1(id, u_graph[i].start, u_graph[i].end, count, num_of_ver);
		if (count == k)
			break;
	}
The next step is to return the spacing, which is computed starting from the edge being examined at the moment we broke out with count == k.

Scanning forward from there, the first edge whose endpoints u and v are not connected joins the closest pair of points lying in different clusters, so its weight is exactly the value we need to return. Any edge not yet examined has a weight at least as large, because Kruskal's algorithm processes edges in nondecreasing order of weight.

The code:

	while (i < u_graph.size()) {
		if (connected(id, u_graph[i].start, u_graph[i].end))
			++i;
		else
			return u_graph[i].weight;
	}
Here is the complete code:

# include <iostream>
# include <string>
# include <fstream>
# include <vector>
# include <algorithm>

using namespace std;

struct edge{
	int start;
	int end;
	int weight;
};

vector<edge> u_graph;

/* comparator of the u_graph */
bool comp_function(edge ed1, edge ed2) { return (ed1.weight < ed2.weight); }

int store_file(string);			/* returns the number of vertices */
void print_all(void);			/* print the edges */
int clustering(int, int);
int find(int*, int);			/* helpers for the union-find structure */
bool connected(int*, int, int);
void union_1(int*, int, int, int&, int);

int main(int argc, char** argv) {
	int num_of_ver = store_file("clustering1.txt");
	sort(u_graph.begin(), u_graph.end(), comp_function);
//	print_all();
	int res = clustering(num_of_ver, 4);
	cout << "spacing: " << res << endl;
	return 0;
}

int store_file(string filename) {
	ifstream infile;
	infile.open(filename, ios::in);
	int num_of_ver = 0;
	infile >> num_of_ver;
	int s, e, w;
	while (infile >> s >> e >> w) {
		edge ed1 = {s, e, w};
		u_graph.push_back(ed1);
	}
	infile.close();
	return num_of_ver;
}

void print_all(void) {
	for (int i = 0; i < u_graph.size(); ++i)
		cout << u_graph[i].start << " " << u_graph[i].end << " " << u_graph[i].weight << endl;
}

int clustering(int num_of_ver, int k) {
	int* id = new int[num_of_ver+1];
	/* constructor */
	int i;
	for (i = 1; i <= num_of_ver; ++i)
		id[i] = i;
	int count = num_of_ver;

	/* kruskal */
	for (i = 0; i < u_graph.size(); ++i) {
		if (!connected(id, u_graph[i].start, u_graph[i].end))
			union_1(id, u_graph[i].start, u_graph[i].end, count, num_of_ver);
		if (count == k)
			break;
	}

	while (i < u_graph.size()) {
		if (connected(id, u_graph[i].start, u_graph[i].end))
			++i;
		else
			return u_graph[i].weight;
	}

	return -1;	/* never reached for k > 1: some cross-cluster edge always remains */
}

void union_1(int* id, int u, int v, int& count, int num_of_ver) {
	int u_id = find(id, u);
	int v_id = find(id, v);

	if (u_id == v_id)
		return;

	/* re-point u's component to v's id */
	for (int i = 1; i <= num_of_ver; ++i) {
		if (id[i] == u_id)
			id[i] = v_id;
	}
	--count;
}

int find(int* id, int u) {
	return id[u];
}

bool connected(int* id, int u, int v) {
	return find(id, u) == find(id, v);
}
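
One practical note: the quick-find scheme above (find just returns id[u], and union_1 rescans the whole id array) costs O(n) per merge, which is fine for this assignment's input size. If that ever became a bottleneck, a union-by-rank plus path-compression structure is the usual replacement. A minimal sketch, not part of the submitted solution (the names UnionFind, uf_find and uf_union are mine):

# include <vector>
# include <utility>

/* Weighted union-find with path halving, a possible drop-in
   replacement for the quick-find id[] scheme used above. */
struct UnionFind {
	std::vector<int> parent, rnk;
	int components;

	explicit UnionFind(int n) : parent(n + 1), rnk(n + 1, 0), components(n) {
		for (int v = 1; v <= n; ++v) parent[v] = v;	/* vertices are numbered 1..n */
	}
	int uf_find(int v) {
		while (parent[v] != v) {
			parent[v] = parent[parent[v]];		/* path halving */
			v = parent[v];
		}
		return v;
	}
	bool uf_union(int u, int v) {				/* true if a merge actually happened */
		int a = uf_find(u), b = uf_find(v);
		if (a == b) return false;
		if (rnk[a] < rnk[b]) std::swap(a, b);		/* hang the shallower tree below */
		parent[b] = a;
		if (rnk[a] == rnk[b]) ++rnk[a];
		--components;
		return true;
	}
};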


Now for the second question:

Question 2

In this question your task is again to run the clustering algorithm from lecture, but on a MUCH bigger graph. So big, in fact, that the distances (i.e., edge costs) are only defined implicitly, rather than being provided as an explicit list.

The data set is here. The format is:
[# of nodes] [# of bits for each node's label]
[first bit of node 1] ... [last bit of node 1]
[first bit of node 2] ... [last bit of node 2]
...
For example, the third line of the file "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1" denotes the 24 bits associated with node #2.

The distance between two nodes u and v in this problem is defined as the Hamming distance --- the number of differing bits --- between the two nodes' labels. For example, the Hamming distance between the 24-bit label of node #2 above and the label "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1" is 3 (since they differ in the 3rd, 7th, and 21st bits).

The question is: what is the largest value of k such that there is a k-clustering with spacing at least 3? That is, how many clusters are needed to ensure that no pair of nodes with all but 2 bits in common get split into different clusters?

NOTE: The graph implicitly defined by the data file is so big that you probably can't write it out explicitly, let alone sort the edges by cost. So you will have to be a little creative to complete this part of the question. For example, is there some way you can identify the smallest distances without explicitly looking at every pair of nodes?

The second question uses the Hamming distance as the distance between two vertices; the problem statement spells this out.
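
As a quick aside (not part of the submitted solution): once each bit label is packed into a single integer, as store_file below does, the Hamming distance between two labels is just the number of 1 bits left after XOR-ing them, since XOR leaves a 1 exactly in the positions where the labels differ. A minimal sketch:

/* Hamming distance between two labels packed into ints. */
int hamming_distance(unsigned int a, unsigned int b) {
	unsigned int x = a ^ b;		/* 1 bits mark the positions where a and b differ */
	int count = 0;
	while (x != 0) {
		x &= (x - 1);		/* clear the lowest set bit */
		++count;
	}
	return count;
}

/* e.g. hamming_distance(4, 7) == 2: labels 100 and 111 differ in two bits */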

What is a bit different here is that the spacing (at least 3) is given and we have to find k.

The solution is quite tricky; I didn't come up with it myself and only worked it out after reading the discussion forum.

I'll borrow Naveen Sharma's explanation directly, just restated in my own words:

Start with a simple example:

7 3
1 0 0
0 1 1
1 1 1
1 0 0
1 0 1
0 1 1
1 1 1
That is, there are 7 nodes, each labeled with 3 bits.

Step 1: simplify by removing duplicate points. Duplicate points (Hamming distance 0) necessarily end up in the same cluster, so we can treat them as a single point without affecting the spacing.

4 3
1 0 0
0 1 1
1 1 1
1 0 1
I used a hash map to drop the duplicate points; the code:

		/* eliminate the duplicate vertices */
		if (hash_map.find(val) == hash_map.end())
			hash_map.insert(pair<int, int>(val, hash_map.size()));
Step 2: number these four vertices, for example:

1 0 0 1
0 1 1 2
1 1 1 3
1 0 1 4

This needs no special handling in my code: the hash map assigns each distinct label an index (starting from 0) as it is inserted.

Step 3: XOR has a very useful property: if A xor B = C, then A xor C = B.

Since the problem asks for spacing at least 3, every pair of points at Hamming distance 1 or 2 must end up in the same cluster, i.e. must be merged by a union operation.

If A xor B = 100, the Hamming distance between A and B is 1; if A xor C = 011, the Hamming distance is 2. We can also run this in reverse: starting from A and a target distance, enumerate every mask with that many 1 bits, XOR it with A, and look the result up in the hash map, e.g. A xor 100 = B and A xor 011 = C. (For 24-bit labels there are only 24 one-bit masks and C(24,2) = 276 two-bit masks, so this is roughly 300 lookups per node instead of examining all pairs.) After unioning every pair found this way, the number of remaining clusters is exactly k:

The code:

	/* for one_bit */
	for (auto it = hash_map.begin(); it != hash_map.end(); ++it) {
		for (int k = 0; k < one_bit.size(); ++k) {
			int seek = it->first ^ one_bit[k];
			if (hash_map.find(seek) != hash_map.end()) {
				if (!connected(id, hash_map[seek], it->second))
					union_1(id, hash_map[seek], it->second, count);
				if (count % 100 == 0)
					cout << count << endl;
			}
		}
	}

	/* for two bit */
	for (auto it = hash_map.begin(); it != hash_map.end(); ++it) {
		for (int k = 0; k < two_bit.size(); ++k) {
			int seek = it->first ^ two_bit[k];
			if (hash_map.find(seek) != hash_map.end()) {
				if (!connected(id, hash_map[seek], it->second))
					union_1(id, hash_map[seek], it->second, count);
				if (count % 100 == 0)
					cout << count << endl;
			}
		}
	}
The complete code:

# include <iostream>
# include <fstream>
# include <unordered_map>
# include <sstream>
# include <string>
# include <cmath>
# include <vector>

using namespace std;

unordered_map<int, int> hash_map;		/* key: label bits packed into an int, value: node index */
vector<int> one_bit;
vector<int> two_bit;

void store_file(string);				/* prototype */
void print_all(void);
void generate(int);
bool connected(int*, int, int);
void union_1(int*, int, int, int&);
int find(int*, int);
int clustering(void);

int main(int argc, char** argv) {

	store_file("clustering_big.txt");
//	print_all();
	cout << "cluster: " << clustering() << endl;

	return 0;
}

void store_file(string filename) {
	int line = 1;
	int num, bits;
	ifstream infile;
	infile.open(filename, ios::in);
	string str;
	getline(infile, str);
	istringstream iss(str);
	iss >> num >> bits;
	generate(bits);
	while (line <= num) {
		string str;
		getline(infile, str);
		istringstream iss(str);
		int i = 0;
		int val = 0;
		while (i < bits) {
			int n;
			iss >> n;
			val = val * 2 + n;
			++i;
		}
		/* eliminate the duplicate vertices */
		if (hash_map.find(val) == hash_map.end())
			hash_map.insert(pair<int, int>(val, hash_map.size()));
		++line;
	}
	infile.close();
}

void print_all(void) {
	cout << "hash_map: " << endl;
	for(auto it = hash_map.begin(); it != hash_map.end(); ++it)
		cout << it->first << "  " << it->second << endl;
	cout << "one_bit: " << endl;
	for (auto it = one_bit.begin(); it != one_bit.end(); ++it)
		cout << *it << endl;
	cout << "two_bit: " << endl;
	for (auto it = two_bit.begin(); it != two_bit.end(); ++it)
		cout << *it << endl;
}

void generate(int bits) {
	/* masks with exactly one bit set */
	for (int i = 0; i < bits; ++i) {
		int val = 1 << i;
		one_bit.push_back(val);
	}
	/* masks with exactly two bits set */
	for (int i = 0; i < bits; ++i) {
		int val = 1 << i;
		for (int j = i + 1; j < bits; ++j) {
			int val_1 = val | (1 << j);
			two_bit.push_back(val_1);
		}
	}
}

int clustering(void) {
	int num_of_ver = hash_map.size();
	int* id = new int[num_of_ver];
	/* constructor */
	int i;
	for (i = 0; i < num_of_ver; ++i)
		id[i] = i;
	int count = num_of_ver;

	/* for one_bit */
	for (auto it = hash_map.begin(); it != hash_map.end(); ++it) {
		for (int k = 0; k < one_bit.size(); ++k) {
			int seek = it->first ^ one_bit[k];
			if (hash_map.find(seek) != hash_map.end()) {
				if (!connected(id, hash_map[seek], it->second))
					union_1(id, hash_map[seek], it->second, count);
				if (count % 100 == 0)
					cout << count << endl;
			}
		}
	}

	/* for two bit */
	for (auto it = hash_map.begin(); it != hash_map.end(); ++it) {
		for (int k = 0; k < two_bit.size(); ++k) {
			int seek = it->first ^ two_bit[k];
			if (hash_map.find(seek) != hash_map.end()) {
				if (!connected(id, hash_map[seek], it->second))
					union_1(id, hash_map[seek], it->second, count);
				if (count % 100 == 0)
					cout << count << endl;
			}
		}
	}

	return count;
}

bool connected(int* id, int u, int v) {
	return find(id, u) == find(id, v);
}

void union_1(int* id, int u, int v, int& count) {
	int u_id = find(id, u);
	int v_id = find(id, v);

	if (u_id == v_id)
		return;

	/* re-point u's component to v's id */
	for (int i = 0; i < (int)hash_map.size(); ++i) {
		if (id[i] == u_id)
			id[i] = v_id;
	}
	--count;
}

int find(int* id, int u) {
	return id[u];
}

Note: Kruskal's algorithm relies on the union-find data structure. Personally I think the red Princeton Algorithms book (Sedgewick and Wayne) explains it very well; have a look if anything is unclear. Also, as an algorithms beginner, I find the Princeton book far more approachable than CLRS.
