RDD Usage and Examples (10): Implementing K-Means with RDDs in Spark

水母君98

Category: Big Data Fundamentals
Tags: python, spark, big data, machine learning

Article link: https://blog.youkuaiyun.com/m0_37754282/article/details/108838365

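This article shows how to implement the K-Means clustering algorithm with RDDs in Apache Spark. Using the Python API, it walks through the key steps of data preprocessing, the K-Means iteration, and result evaluation, illustrating how Spark handles machine learning workloads on large datasets.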
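
For readers who want to see the iteration in isolation before the Spark version, here is a minimal plain-Python sketch of the K-Means loop that needs neither Spark nor NumPy. The helper names (`squared_distance`, `closest_index`, `kmeans_step`) and the four sample points are hypothetical and chosen only for illustration. The stopping rule assumes the total squared distance between old and new centers is compared against a threshold of 0.01, which may differ in detail from the article's own loop.

```python
# Plain-Python stand-in for the K-Means iteration (hypothetical helper names).

def squared_distance(p, q):
    # Squared Euclidean distance between two equal-length points
    return sum((a - b) ** 2 for a, b in zip(p, q))

def closest_index(p, centers):
    # Index of the center nearest to p
    return min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))

def kmeans_step(points, centers):
    # Assignment step: accumulate a running (sum, count) per nearest center
    sums = {}
    for p in points:
        i = closest_index(p, centers)
        total, count = sums.get(i, ([0.0] * len(p), 0))
        sums[i] = ([t + x for t, x in zip(total, p)], count + 1)
    # Update step: each center becomes the mean of its assigned points;
    # a center with no assigned points is left where it was
    new_centers = list(centers)
    for i, (total, count) in sums.items():
        new_centers[i] = [t / count for t in total]
    # Total squared movement of the centers, used as the convergence measure
    shift = sum(squared_distance(c, n) for c, n in zip(centers, new_centers))
    return new_centers, shift

# Illustrative data: two obvious clusters and two starting centers
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers = [[0.0, 0.0], [10.0, 10.0]]

shift = float("inf")
while shift > 0.01:  # assumed threshold, same value as convergeDist
    centers, shift = kmeans_step(points, centers)

print(centers)
```

In an RDD-based version, the assignment step typically maps each point to a `(center index, (point, 1))` pair and the per-center sums are combined across partitions, so the running `(sum, count)` dictionary above plays the role of that aggregation.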

import numpy as np

def parseVector(line):
    # Convert one whitespace-separated line of numbers into a NumPy float array
    return np.array([float(x) for x in line.split()])

def closestPoint(p, centers):
    # Return the index of the center nearest to point p,
    # using squared Euclidean distance
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex

# The data file can be downloaded at http://www.cse.ust.hk/msbd5003/data/kmeans_data.txt
# sc is the SparkContext (predefined in the pyspark shell); 5 is the minimum number of partitions
lines = sc.textFile('/Users/huangluyu/data/kmeans_data.txt', 5)

# The data file can be downloaded at http://www.cse.ust.hk/msbd5003/data/kmeans_bigdata.txt
# lines = sc.textFile('../data/kmeans_bigdata.txt', 5)  
# lines is an RDD of strings
K = 3
# Stop iterating once the total distance between the old and new centers falls below this value
convergeDist = 0.01

data = lines.map(parseVector).cache() # data is an RDD of arrays

kCenters = data.t