KNN (K-Nearest Neighbors): assigns a label to new data based on the distances between the new point and the existing (training) data.

KNN concept: "birds of a feather flock together." It works for both discrete (categorical) and continuous data. Principle: find the K nearest neighbors → let them vote → decide the class. Compute the distance from each training point to the query point → use K to decide how many neighbors vote (for continuous targets, average the neighbors' values instead of voting).
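A minimal sketch of this voting idea with scikit-learn's KNeighborsClassifier; the iris dataset, the train/test split, and K = 5 are assumptions for illustration only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)  # K = 5 neighbors vote on the class
knn.fit(X_train, y_train)                  # "fitting" essentially just stores the training data
print(knn.score(X_test, y_test))           # accuracy on held-out data

# For continuous targets, KNeighborsRegressor averages the neighbors' values instead of voting.
```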

We want a K value that minimizes error: Error = 1 - Accuracy

Two methods for choosing K:

Elbow method: fit KNN for a range of K values, plot the error rate against K, and pick the K at the "elbow" where the error stops improving noticeably.

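A rough sketch of the elbow plot, again assuming the iris data and a single train/test split; the choice of K is read off the curve where it flattens:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

k_values = list(range(1, 31))
errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors.append(1 - knn.score(X_test, y_test))  # Error = 1 - Accuracy

plt.plot(k_values, errors, marker="o")
plt.xlabel("K")
plt.ylabel("Error rate")
plt.title("Elbow method: pick K where the curve flattens")
plt.show()
```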

Cross-validate a grid search over multiple K values and choose the K that yields the lowest error (or highest accuracy).

Cross-validation selects only the K value with the lowest average error rate across the folds. This can result in a more complex model (a higher value of K); consider the context of the problem to decide whether larger K values are an issue.
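A hedged sketch of this grid search using scikit-learn's GridSearchCV; the iris data, the K range of 1–30, and 5-fold CV are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"n_neighbors": list(range(1, 31))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)     # K with the highest mean cross-validated accuracy
print(1 - search.best_score_)  # corresponding error rate
```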

Cross Validation in Machine Learning - GeeksforGeeks

Distance metrics: Minkowski, Euclidean, Manhattan, Chebyshev.

The Minkowski distance d(x, y) = (Σᵢ |xᵢ − yᵢ|^p)^(1/p) covers the others as special cases: p = 1 gives Manhattan, p = 2 gives Euclidean, and p → ∞ gives Chebyshev (the maximum coordinate difference).
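A small check of these relationships with scipy; the toy vectors are made up for illustration:

```python
from scipy.spatial.distance import chebyshev, cityblock, euclidean, minkowski

x = [1.0, 2.0, 3.0]
y = [4.0, 0.0, 3.0]

print(euclidean(x, y), minkowski(x, y, p=2))  # Euclidean == Minkowski with p = 2
print(cityblock(x, y), minkowski(x, y, p=1))  # Manhattan (cityblock) == Minkowski with p = 1
print(chebyshev(x, y))                        # Chebyshev: largest coordinate difference

# scikit-learn's KNeighborsClassifier uses metric="minkowski" with p=2 by default;
# setting p=1 switches it to Manhattan distance.
```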

Summary: (1) The choice of K is critical, and it is best to avoid even values (to reduce tied votes). (2) Keep splitting the sample repeatedly (cross-validation). (3) Choose an appropriate distance metric.

Advantages: (1) Simple and easy to understand. (2) No restriction on data type. (3) Performs well in multi-class prediction.

Disadvantages

(1) High computational cost. (2) Predictions tend to be inaccurate when the data is imbalanced.

Reference