Nowcoder Solution | Implementing the AdaBoost Fit Method

Latest recommended article published on 2025-03-21 09:37:55


License: CC 4.0 BY-SA

Tags:

#algorithms #kmeans #clustering #LeetCode #interview

## Problem

Problem link

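AdaBoost (Adaptive Boosting) is a widely used ensemble learning method. It combines weak classifiers into a weighted sum:
$\sum_{t=1}^{T} \alpha_t h_t(x)$
where $h_t(x)$ is the $t$-th weak classifier, $\alpha_t$ is its weight, and $T$ is the number of iterations. With labels in $\{-1, +1\}$, the predicted label is the sign of this sum.
AdaBoost is one of the commonly used ensemble methods in machine learning.
This problem uses the weighted classification error as the error rate and uses it to compute each classifier's weight:
$\alpha_t = 0.5 \log \left( \frac{1 - \epsilon_t}{\epsilon_t} \right)$
where $\log$ is the natural logarithm and $\epsilon_t$ is the weighted classification error, $\epsilon_t$ = sum(w[y != prediction]), i.e. the total weight of the misclassified samples. Because the sample weights w are normalized to sum to 1, no division by len(y) is needed.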
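As a quick numeric check of the formula above, the following minimal sketch computes $\epsilon_t$ and $\alpha_t$ for a made-up set of five samples. The data and variable names are illustrative assumptions, not part of the problem statement.

```python
import math
import numpy as np

# Hypothetical toy data: 5 samples with uniform weights that sum to 1.
y = np.array([1, 1, -1, -1, 1])
prediction = np.array([1, -1, -1, -1, 1])  # one mistake, at index 1
w = np.full(len(y), 1 / len(y))

# Weighted error: total weight of the misclassified samples.
epsilon = float(np.sum(w[y != prediction]))

# Classifier weight, using the natural logarithm.
alpha = 0.5 * math.log((1 - epsilon) / epsilon)

print(epsilon, alpha)
```

Here one of five equally weighted samples is misclassified, so the error is 0.2 and the weight is 0.5 * ln(4), roughly 0.693. A smaller error gives a larger weight, and an error of exactly 0.5 gives a weight of 0.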

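One subtle point in this problem: if a candidate weak classifier's error is greater than 0.5, its polarity should be flipped and its error replaced by 1 - error before it is compared with the current best. Otherwise the classifier weights will not be updated as intended.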
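The flip can be illustrated with a small sketch. It assumes a single feature, uniform weights, and a flipped classifier that negates every prediction exactly; the data values are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: one feature, four samples, uniform weights.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1, 1, -1, 1])
w = np.full(len(y), 1 / len(y))
threshold = 3.0

# Polarity +1: predict -1 below the threshold and +1 otherwise.
prediction = np.ones(len(y))
prediction[x < threshold] = -1
error = float(np.sum(w[y != prediction]))

# Flipping the polarity negates every prediction, so the error becomes 1 - error.
flipped_error = float(np.sum(w[y != -prediction]))

print(error, flipped_error)
```

With polarity +1 this threshold misclassifies three of the four samples (error 0.75), while the flipped version misclassifies only one (error 0.25), so the flipped version is the one worth keeping.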

The reference code is as follows:

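import math
import numpy as np

def adaboost_fit(X, y, n_clf):
    X = np.asarray(X)
    y = np.asarray(y)
    n_samples, n_features = np.shape(X)
    # Start with uniform sample weights that sum to 1.
    w = np.full(n_samples, (1 / n_samples))
    clfs = []
    
    for _ in range(n_clf):
        clf = {}
        min_error = float('inf')
        
        # Try every feature and every observed value as a candidate threshold.
        for feature_i in range(n_features):
            unique_values = np.unique(X[:, feature_i])
            
            for threshold in unique_values:
                p = 1
                # Polarity +1: predict -1 below the threshold, +1 otherwise.
                prediction = np.ones(np.shape(y))
                prediction[X[:, feature_i] < threshold] = -1
                # Weighted error: total weight of the misclassified samples.
                error = sum(w[y != prediction])
                
                # A classifier that is wrong more than half the time is better flipped.
                if error > 0.5:
                    error = 1 - error
                    p = -1
                
                if error < min_error:
                    clf['polarity'] = p
                    clf['threshold'] = threshold
                    clf['feature_index'] = feature_i
                    min_error = error
        
        # Classifier weight; 1e-10 avoids division by zero when min_error is 0.
        clf['alpha'] = 0.5 * math.log((1.0 - min_error) / (min_error + 1e-10))
        # Recompute the chosen classifier's predictions. Samples equal to the
        # threshold are predicted +1 under both polarities.
        predictions = np.ones(np.shape(y))
        if clf['polarity'] == 1:
            predictions[X[:, clf['feature_index']] < clf['threshold']] = -1
        else:
            predictions[X[:, clf['feature_index']] > clf['threshold']] = -1
        # Raise the weights of misclassified samples, lower the rest, then renormalize.
        w *= np.exp(-clf['alpha'] * y * predictions)
        w /= np.sum(w)
        clfs.append(clf)

    return clfs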
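The problem only asks for the fit step, but the list of dictionaries returned above can be turned into predictions with the weighted sum $\sum_{t=1}^{T} \alpha_t h_t(x)$. The sketch below is a hypothetical helper, `adaboost_predict`, which is not part of the original solution. It assumes the same dictionary keys and the same threshold convention as the weight-update step in the code above, and it is exercised with hand-written classifiers rather than fitted ones.

```python
import numpy as np

def adaboost_predict(X, clfs):
    """Hypothetical helper: sign of the alpha-weighted vote of the weak classifiers."""
    X = np.asarray(X, dtype=float)
    scores = np.zeros(X.shape[0])
    for clf in clfs:
        column = X[:, clf['feature_index']]
        pred = np.ones(X.shape[0])
        # Same convention as the weight-update step of the fit code.
        if clf['polarity'] == 1:
            pred[column < clf['threshold']] = -1
        else:
            pred[column > clf['threshold']] = -1
        scores += clf['alpha'] * pred
    # A score of exactly 0 is mapped to +1; this tie-break is an arbitrary choice.
    return np.where(scores >= 0, 1, -1)

# Hand-written classifiers, for illustration only (not the output of a real fit).
clfs = [
    {'polarity': 1, 'threshold': 2.5, 'feature_index': 0, 'alpha': 0.8},
    {'polarity': -1, 'threshold': 1.5, 'feature_index': 1, 'alpha': 0.3},
]
X = [[1.0, 1.0], [3.0, 2.0], [4.0, 1.0]]
labels = adaboost_predict(X, clfs)
print(labels)  # [-1  1  1]
```

For the first sample the two classifiers vote -1 and +1, and the vote with the larger weight (0.8) wins, giving a score of -0.5 and a label of -1.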

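Problem link

k-Means clustering is a commonly used clustering algorithm that partitions a dataset into $k$ clusters. The steps are:

1. Randomly choose $k$ points as the initial cluster centers (in this problem the initial centers are passed in as `initial_centroids`).
2. Assign each point to its nearest cluster center. This problem uses the Euclidean distance as the metric:
   $\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$
3. Update each cluster center to the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the cluster centers no longer change or the maximum number of iterations is reached.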
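The assignment and update steps can be traced on a tiny example. The sketch below runs a single iteration on four made-up 2-D points with two assumed starting centers. It uses NumPy broadcasting for the distances and, to stay short, does not handle empty clusters.

```python
import numpy as np

# Hypothetical toy data: four 2-D points and two starting centers.
points = np.array([[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# Assignment: distances[i, j] is the Euclidean distance from point i to center j.
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
assignments = np.argmin(distances, axis=1)

# Update: each center moves to the mean of the points assigned to it.
new_centroids = np.array([
    points[assignments == j].mean(axis=0) for j in range(len(centroids))
])

print(assignments)    # [0 0 1 1]
print(new_centroids)  # centers move to (1.0, 1.5) and (8.5, 8.0)
```

Running the same two steps again leaves the assignments and centers unchanged, which is the stopping condition described in the last step.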

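Put simply, it is like dividing n people into k groups: in every round, compute each person's distance to every group center, assign each person to the closest group, and then update the group centers. In the end everyone belongs to the group whose center is nearest to them.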

The reference code is as follows:

import numpy as np

def euclidean_distance(a, b):
    # Distance from every row of a to the single point b.
    return np.sqrt(((a - b) ** 2).sum(axis=1))

def k_means_clustering(points, k, initial_centroids, max_iterations):
    points = np.array(points)
    centroids = np.array(initial_centroids)
    
    for _ in range(max_iterations):
        # Assign points to the nearest centroid; distances has shape (k, n_points)
        distances = np.array([euclidean_distance(points, centroid) for centroid in centroids])
        assignments = np.argmin(distances, axis=0)

        # Move each centroid to the mean of its points; an empty cluster keeps its old centroid
        new_centroids = np.array([
            points[assignments == i].mean(axis=0) if np.any(assignments == i) else centroids[i]
            for i in range(k)
        ])
        
        # Check for convergence
        if np.all(centroids == new_centroids):
            break
        centroids = new_centroids
        # Round to 4 decimal places after every update
        centroids = np.round(centroids, 4)
    return [tuple(centroid) for centroid in centroids]