KNN (K-Nearest Neighbors) assigns a label to a new data point based on its distance to the existing (training) data points.
KNN concept: "birds of a feather flock together" (物以類聚). It works for both discrete (categorical) and continuous data. Principle: find the K nearest neighbours → let them vote → decide the class. Compute each training point's distance to the query point → use K to decide how many neighbours vote (for continuous data, take the average of the neighbours' values instead of voting).
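The steps above can be sketched from scratch; this is a minimal illustration with hypothetical helper names (the notes do not specify an implementation), assuming Euclidean distance:

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    # Pair every training point with its Euclidean distance to the query,
    # then sort so the nearest neighbours come first.
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Majority vote among the k closest neighbours.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X, y, (0.5, 0.5), k=3))  # → "A"
print(knn_predict(X, y, (5.5, 5.5), k=3))  # → "B"
```

For regression (continuous targets), the `Counter` vote would be replaced by the mean of the k neighbours' values.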
We want a K value that minimizes error: Error = 1 - Accuracy
Two methods:
(1) Elbow method: plot error against K and pick the K at the "elbow", where increasing K stops reducing error sharply.
(2) Cross-validated grid search: cross validate a grid of multiple K values and choose the K with the lowest error (or highest accuracy).
Cross validation simply picks the K with the lowest error rate across the folds, which may favour a larger K than necessary (more neighbours to tally per prediction). Consider the context of the problem to decide whether a larger K is an issue.
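A sketch of method (2), using leave-one-out cross validation for simplicity over a small grid of candidate K values (helper names are hypothetical; the notes do not prescribe a fold scheme):

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k):
    # Majority vote among the k nearest neighbours (Euclidean distance)
    dists = sorted((math.dist(x, query), lab) for x, lab in zip(train_X, train_y))
    return Counter(lab for _, lab in dists[:k]).most_common(1)[0][0]

def loo_error(X, y, k):
    # Leave-one-out error rate: predict each point from all the others
    wrong = sum(
        knn_predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i], k) != y[i]
        for i in range(len(X))
    )
    return wrong / len(X)  # Error = 1 - Accuracy

# Toy data: two balanced clusters of four points each
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
y = ["A"] * 4 + ["B"] * 4
errors = {k: loo_error(X, y, k) for k in (1, 3, 5, 7)}
best_k = min(errors, key=errors.get)
```

Note how K = 7 fails completely here: with one point held out, the 7 "neighbours" are the entire rest of the dataset, so the opposing class always outvotes the held-out point's own class.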
Distance metrics: Minkowski (general form), Euclidean (p = 2), Manhattan (p = 1), Chebyshev (p → ∞).
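These metrics are all instances of the Minkowski distance, d(x, y) = (Σ|xᵢ − yᵢ|^p)^(1/p): p = 1 gives Manhattan, p = 2 gives Euclidean, and the limit p → ∞ gives Chebyshev (the largest per-coordinate gap). A small sketch:

```python
def minkowski(x, y, p):
    # Minkowski distance: (sum of |x_i - y_i|^p) ^ (1/p)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def chebyshev(x, y):
    # Limit of Minkowski as p -> infinity: the largest coordinate gap
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))  # Manhattan: 7.0
print(minkowski(x, y, 2))  # Euclidean: 5.0
print(chebyshev(x, y))     # Chebyshev: 4
```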
Summary: (1) The choice of K is critical, and it is best to avoid even values of K (to prevent tied votes). (2) Repeatedly split the sample (cross validation). (3) Choose a suitable distance calculation.
Advantages: (1) Simple and easy to understand. (2) Not restricted by data type. (3) Performs well on multi-class prediction.
Disadvantages: (1) High computational cost. (2) Predictions tend to be inaccurate when the data is imbalanced.
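One common mitigation for the imbalance problem (not covered in the notes above, so this is an illustrative assumption) is distance-weighted voting: each neighbour's vote is weighted by the inverse of its distance, so a few very close minority-class points can outweigh many distant majority-class points.

```python
from collections import defaultdict
import math

def weighted_knn_predict(train_X, train_y, query, k):
    # Sort training points by Euclidean distance to the query.
    dists = sorted((math.dist(x, query), lab) for x, lab in zip(train_X, train_y))
    # Weight each of the k nearest neighbours' votes by 1/distance,
    # so closer points count more than distant ones.
    scores = defaultdict(float)
    for d, lab in dists[:k]:
        scores[lab] += 1.0 / (d + 1e-9)  # epsilon avoids division by zero
    return max(scores, key=scores.get)

# Imbalanced toy data: five "A" points, only two "B" points
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2), (10, 10), (10, 11)]
y = ["A"] * 5 + ["B"] * 2
# A plain majority vote with k=5 would pick "A" (3 A's vs 2 B's among
# the 5 nearest), but the two B's are far closer, so weighting picks "B".
print(weighted_knn_predict(X, y, (10, 10.5), k=5))  # → "B"
```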