K-Nearest Neighbors algorithm

This post implements the k-Nearest Neighbors (kNN) classifier in Python and applies it to two practical problems: improving dating-site match suggestions and recognizing handwritten digits.
from numpy import *
import operator
import matplotlib
import matplotlib.pyplot as plt
from os import listdir

def createDataSet():
    group=array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels=['A','A','B','B']
    return group, labels

group, labels = createDataSet()
  
def classify0(inX, dataSet, labels, k):
    """Classify inX by majority vote among its k nearest neighbors in dataSet."""
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every training sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    # Tally the labels of the k closest samples
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndicies[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # dict.iteritems() is Python 2 only; items() works in Python 3
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
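The same classifier can be written more compactly with NumPy broadcasting (which makes `tile` unnecessary) and `collections.Counter` for the vote. A minimal, self-contained sketch on the toy dataset, with the function name `classify_knn` chosen here just to keep it distinct from `classify0`:

```python
import numpy as np
from collections import Counter

def classify_knn(inX, dataSet, labels, k):
    # Broadcasting: (n, d) - (d,) subtracts inX from every row
    distances = np.sqrt(((dataSet - np.asarray(inX)) ** 2).sum(axis=1))
    # Majority vote among the k nearest neighbors
    k_nearest = [labels[i] for i in distances.argsort()[:k]]
    return Counter(k_nearest).most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify_knn([0, 0], group, labels, 3))  # prints 'B'
```

The point [0, 0] sits on top of the two 'B' samples, so two of its three nearest neighbors vote 'B'.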

def file2matrix(filename):
    """Parse a tab-delimited file of three numeric features plus a label per line."""
    # Read the file once instead of opening it twice to count lines
    with open(filename) as fr:
        lines = fr.readlines()
    returnMat = zeros((len(lines), 3))
    classLabelVector = []
    for index, line in enumerate(lines):
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(listFromLine[-1])
    return returnMat, classLabelVector
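`file2matrix` assumes each line holds three tab-separated numeric features followed by a text label, the layout of the book's `datingTestSet.txt`. A self-contained sketch of the same parsing on two made-up rows (the values and labels below are illustrative, not taken from the real dataset):

```python
import numpy as np
import os
import tempfile

# Two fake rows in the datingTestSet.txt layout:
# three tab-separated features, then a text label.
rows = "40920\t8.3\t0.95\tlargeDoses\n14488\t7.1\t1.67\tsmallDoses\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write(rows)
    path = f.name

feats, labels = [], []
with open(path) as fr:
    for line in fr:
        parts = line.strip().split('\t')
        feats.append([float(v) for v in parts[0:3]])  # numeric columns
        labels.append(parts[-1])                      # text label
os.unlink(path)

mat = np.array(feats)
print(mat.shape, labels)  # (2, 3) ['largeDoses', 'smallDoses']
```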
    
datingDataMat, datingLabels = file2matrix('datingTestSet.txt')

def autoNorm(dataSet):
    """Min-max scale each column to [0, 1]: (x - min) / (max - min)."""
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
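Normalization matters because kNN's Euclidean distance is dominated by whichever feature has the largest numeric range (e.g. flier miles vs. liters of ice cream). A quick self-contained check of the min-max formula on a small array, written with broadcasting instead of `tile`:

```python
import numpy as np

def auto_norm(dataSet):
    # Column-wise min-max scaling: (x - min) / (max - min)
    minVals = dataSet.min(0)
    ranges = dataSet.max(0) - minVals
    return (dataSet - minVals) / ranges, ranges, minVals

data = np.array([[1000.0, 0.5],
                 [2000.0, 1.5],
                 [3000.0, 2.5]])
norm, ranges, minVals = auto_norm(data)
print(norm[:, 0].tolist())  # [0.0, 0.5, 1.0]
```

Both columns end up on the same [0, 1] scale, so neither dominates the distance computation.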
    

def datingClassTest():
    hoRatio = 0.10  # hold out 10% of the data for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errCount = 0.0

    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 4)
        print('the classifier came back with: %s, the real answer is: %s' % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errCount += 1.0
    print('the total error rate is: %f' % (errCount / float(numTestVecs)))
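The holdout procedure above needs the dating data file on disk. The same evaluation loop can be sketched self-contained on synthetic data (two well-separated Gaussian clusters standing in for the dating classes; all names below are illustrative):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
# Two well-separated synthetic classes stand in for the dating data.
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2))])
y = ['A'] * 50 + ['B'] * 50

def classify(inX, dataSet, labels, k):
    d = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
    return Counter(labels[i] for i in d.argsort()[:k]).most_common(1)[0][0]

# Hold out the first 10% as a test set, as datingClassTest does.
n_test = len(X) // 10
errors = sum(classify(X[i], X[n_test:], y[n_test:], 3) != y[i]
             for i in range(n_test))
print('error rate: %f' % (errors / n_test))
```

With clusters this far apart every held-out point is classified correctly; on real data the error rate reported by `datingClassTest` is what you tune `k` against.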
    

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percent of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))

    # datingTestSet2.txt encodes the labels as 1-3, which is what the
    # resultList lookup below expects (the text-label file would not index it)
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Feature order must match the dataset columns: miles, game %, ice cream
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)

    print("You will probably like this person:", resultList[int(classifierResult) - 1])
    
def img2vector(filename):
    """Flatten a 32x32 text image of 0/1 characters into a 1x1024 row vector."""
    returnVect = zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
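A quick round-trip check of the flattening, using a synthetic 32x32 "digit" written to a temporary file so it runs without the real `trainingDigits` data:

```python
import numpy as np
import os
import tempfile

def img2vector(filename):
    # Flatten a 32x32 grid of '0'/'1' characters into a 1x1024 vector
    returnVect = np.zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect

# Synthetic digit: first row all ones, the other 31 rows all zeros.
lines = ['1' * 32] + ['0' * 32] * 31
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('\n'.join(lines))
    path = f.name
vec = img2vector(path)
os.unlink(path)
print(vec.shape, int(vec.sum()))  # (1, 1024) 32
```

The 32 ones from the first text row land in positions 0-31 of the flattened vector.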
    
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        # File names look like "9_45.txt": class digit, underscore, sample index
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)

    testFileList = listdir('testDigits')
    errCount = 0.0
    mTest = len(testFileList)

    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print('the classifier came back with: %s, the real answer is: %s' % (classifierResult, classNumStr))
        if int(classifierResult) != int(classNumStr):
            errCount += 1.0

    print('\nthe total number of errors is: %d' % errCount)
    print('\nthe total error rate is: %f' % (errCount / float(mTest)))
