KNN with R

DataCamp: Machine Learning with R

Rather than explaining the mechanics of kNN, these notes focus on how to apply the knn() function from the class package in R to perform classification.
They are my study notes from DataCamp. Link: https://campus.datacamp.com

Chapter 1: k-Nearest Neighbors (kNN)


Recognizing a road sign with kNN

After several trips with a human behind the wheel, it is time for the self-driving car to attempt the test course alone.
As it begins to drive away, its camera captures the following image:
[image: the road sign captured by the camera]
Apply a kNN classifier to help the car recognize this sign.
kNN classification can be performed with the knn() function in the class package.
Code:

# Load the 'class' package
library(class)
# Create a vector of labels
sign_types <- signs$sign_type
# Classify the next sign observed
knn(train = signs[-1], test = next_sign, cl = sign_types)

How did the knn() function correctly classify the stop sign?
-- The sign was in some way similar to another stop sign.
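
In other words, knn() with k = 1 simply returns the label of the training example closest to the new observation in the 48-dimensional color space. As a sanity check, the nearest neighbor can be found by hand; a minimal sketch, assuming signs and next_sign as loaded above (next_sign holding one row with the same 48 color features):

# Euclidean distance from next_sign to every training sign
dists <- apply(signs[-1], 1, function(row) sqrt(sum((row - unlist(next_sign))^2)))

# The label of the closest training sign is what knn() returns when k = 1
signs$sign_type[which.min(dists)]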


Exploring the traffic sign dataset

To better understand how the knn() function was able to classify the stop sign, it may help to examine the training dataset it used.

Each previously observed street sign was divided into a 4x4 grid, and the red, green, and blue levels for each of the 16 center pixels are recorded, as illustrated here.
[image: 4x4 grid of sampled pixel locations on a sign]
The result is a dataset that records the sign_type as well as 16 x 3 = 48 color properties of each sign.

Code:

# Examine the structure of the signs dataset
str(signs)

## 'data.frame': 146 obs. of 49 variables:
 $ sign_type: chr  "pedestrian" "pedestrian" "pedestrian" "pedestrian" ...
 $ r1       : int  155 142 57 22 169 75 136 149 13 123 ...
 $ g1       : int  228 217 54 35 179 67 149 225 34 124 ...
......
 $ r16      : int  22 164 58 19 160 180 188 237 83 43 ...
 $ g16      : int  52 227 60 27 183 107 211 254 125 29 ...
 $ b16      : int  53 237 60 29 187 26 227 53 19 11 ...


# Count the number of signs of each type
table(signs$sign_type)

## pedestrian      speed       stop 
        46         49         51

# Check r10's average red level by sign type
aggregate(r10 ~ sign_type, data = signs, mean)

## sign_type       r10
1 pedestrian 113.71739
2      speed  80.63265
3       stop 132.39216
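
The same comparison can be made across every color feature at once, using the "." shorthand in the formula to mean all remaining columns; a one-line extension of the call above:

# Average level of all 48 color features, grouped by sign type
aggregate(. ~ sign_type, data = signs, mean)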

Classifying a collection of road signs

Now that the autonomous vehicle has successfully stopped on its own, your team feels confident allowing the car to continue the test course.

The test course includes 59 additional road signs divided into three types:
[images: the three sign types (pedestrian, speed, and stop)]
At the conclusion of the trial, you are asked to measure the car’s overall performance at recognizing these signs.

# Use kNN to identify the test road signs
sign_types <- signs$sign_type
signs_pred <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types)

# Create a confusion matrix of the predicted versus actual values
signs_actual <- signs_test$sign_type
table(signs_pred, signs_actual)

## signs_actual
signs_pred   pedestrian speed stop
  pedestrian         19     2    0
  speed               0    17    0
  stop                0     2   19

# Compute the accuracy
mean(signs_pred == signs_actual)
## [1] 0.9322034 -- accuracy rate
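
The single accuracy number can hide differences between sign types; the confusion matrix above shows that every error involves a speed sign. A quick sketch to compute per-class accuracy (recall) from the confusion matrix built above:

# Per-class accuracy: correct predictions divided by the actual count of each type
conf <- table(signs_pred, signs_actual)
diag(conf) / colSums(conf)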

How to choose ‘k’? — Try it

Testing other ‘k’ values

By default, the knn() function in the class package uses only the single nearest neighbor.

Setting the k parameter allows the algorithm to consider additional nearby neighbors, enlarging the collection of neighbors that vote on the predicted class.

Compare k values of 1, 7, and 15 to examine the impact on traffic sign classification accuracy.

# Compute the accuracy of the baseline model (default k = 1)
k_1 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types)
mean(k_1 == signs_actual)
## [1] 0.9322034

# Modify the above to set k = 7
k_7 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = 7)
mean(k_7 == signs_actual)
## [1] 0.9491525

# Set k = 15 and compare to the above
k_15 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = 15)
mean(k_15 == signs_actual)
## [1] 0.8813559

Thus, k = 7 gives the best accuracy of the three values tested.
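
Rather than repeating the call for each candidate value, the same comparison can be written as a single loop; a minimal sketch using the objects defined above:

# Compute accuracy for several values of k in one pass
k_values <- c(1, 7, 15)
accuracies <- sapply(k_values, function(k) {
  pred <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = k)
  mean(pred == signs_actual)
})
setNames(accuracies, k_values)

Extending k_values to a finer grid (e.g., 1 through 15) would give a fuller picture of how accuracy changes with k.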


Seeing how the neighbors voted

When multiple nearest neighbors hold a vote, it can sometimes be useful to examine whether the voters were unanimous or widely separated.

For example, knowing more about the voters’ confidence in the classification could allow an autonomous vehicle to use caution in the case there is any chance at all that a stop sign is ahead.

In this exercise, you will learn how to obtain the voting results from the knn() function.

# Use the prob parameter to get the proportion of votes for the winning class
sign_pred <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, prob = TRUE, k = 7)
sign_pred

# Get the "prob" attribute from the predicted classes
sign_prob <- attr(sign_pred, "prob")
sign_prob

# Examine the first several predictions
head(sign_pred)
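
To scan the results more easily, the winning class and its share of the 7 votes can be placed side by side; a small sketch assuming sign_pred and sign_prob from above:

# Pair each predicted class with the proportion of neighbors that voted for it
head(data.frame(pred = sign_pred, prob = sign_prob))

A prob value of 1.0 means the 7 neighbors were unanimous; smaller values indicate a split vote and a less confident prediction.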

Why normalize data?

Before applying kNN to a classification task, it is common practice to rescale the data using a technique like min-max normalization.
What is the purpose of this step?
-- To ensure all features contribute equal shares to the distance calculation; otherwise, features measured on larger scales would dominate the nearest-neighbor search.

kNN benefits from normalized data.
Code:

# Min-max normalization: rescale a numeric vector to the [0, 1] range
normalization <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
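
To normalize the signs data, the function can be applied to every color column (the labels stay separate); a sketch assuming the normalization function above:

# Rescale all 48 color features to the [0, 1] range
signs_normalized <- as.data.frame(lapply(signs[-1], normalization))
summary(signs_normalized$r1)  # min should now be 0 and max 1

Note that in a proper train/test workflow, the test set should be rescaled using the training set's minima and maxima rather than its own.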

That's all, thank you very much.
