K-Nearest Neighbors (KNN) Algorithm: Comprehensive Guide

Overview

K-Nearest Neighbors (KNN) is a non-parametric, supervised learning algorithm used for both classification and regression tasks. It’s often called a “lazy learner” because it doesn’t build an explicit model during training; instead, it stores all training data and makes predictions based on similarity when needed.

The fundamental principle of KNN is simple: similar data points tend to have similar labels or values. When making a prediction for a new data point, KNN finds the ‘k’ most similar (nearest) points in the training dataset and uses their labels to make a prediction.

Core Concepts

How KNN Works

The KNN algorithm operates through four main steps (a minimal from-scratch sketch follows this list):

  1. Choose the value of k – Determine how many neighbors to consider
  2. Calculate distances – Measure similarity between the new point and all training points
  3. Find nearest neighbors – Select the k closest points based on distance
  4. Make prediction:
    • Classification: Use majority voting among the k neighbors
    • Regression: Take the average of the k neighbors’ values
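
To make the steps concrete, here is a minimal from-scratch sketch of a KNN classifier; the function name knn_predict, the toy data, and the choice of Euclidean distance are illustrative assumptions, not part of any particular library.

# Minimal from-scratch KNN classifier illustrating the four steps above
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Step 3: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 4 (classification): majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D data: three points per class
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # expected output: 0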

Key Characteristics

  • Non-parametric: Makes no assumptions about the underlying data distribution
  • Instance-based: Stores all training examples rather than learning parameters
  • Lazy learning: No training phase; computation happens at prediction time
  • Memory-based: Requires storing the entire training dataset

Distance Metrics

KNN relies on distance metrics to determine similarity between data points:

1. Euclidean Distance

The most commonly used distance metric, representing the straight-line distance between two points:

d(x, y) = sqrt(sum((xi – yi)^2) for i=1 to n)

where n is the number of features (dimensions).
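
As a quick illustration, the Euclidean distance between two feature vectors can be computed with NumPy (the vectors a and b below are made-up example values):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 8.0])
euclidean = np.sqrt(np.sum((a - b) ** 2))  # equivalent to np.linalg.norm(a - b)
print(euclidean)  # 7.0710...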

2. Manhattan Distance

Also called “taxicab distance,” it measures distance along grid-like paths:

d(x,y) = sum(|xi – yi|) for i=1 to n
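
Using the same made-up vectors, the Manhattan distance is the sum of absolute coordinate differences:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 8.0])
manhattan = np.sum(np.abs(a - b))  # |1-4| + |2-6| + |3-8|
print(manhattan)  # 12.0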

3. Minkowski Distance

A generalized distance metric that includes both Euclidean and Manhattan as special cases:

d(x,y) = (sum(|xi – yi|^p) for i=1 to n)^(1/p)

When p=2, it becomes Euclidean distance; when p=1, it becomes Manhattan distance.
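
A small NumPy check (again with the made-up vectors) confirms both special cases; SciPy users could instead call scipy.spatial.distance.minkowski:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 8.0])

def minkowski(x, y, p):
    # Generalized Minkowski distance of order p
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

print(minkowski(a, b, p=1))  # 12.0      -> same as Manhattan distance
print(minkowski(a, b, p=2))  # 7.0710... -> same as Euclidean distance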

Choosing the Optimal k Value

Selecting the right k value is crucial for KNN performance:

Guidelines for k Selection

  • Small k values (e.g., k=1):
    • More sensitive to noise and outliers
    • Can lead to overfitting (illustrated in the sketch after this list)
    • Creates complex decision boundaries
  • Large k values:
    • More robust to noise
    • Risk of underfitting
    • Creates smoother decision boundaries
    • May ignore local patterns
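
The trade-off is easy to observe by comparing a very small and a fairly large k on a held-out split. The sketch below uses the Iris dataset purely for illustration, and the specific values k=1 and k=25 are arbitrary choices:

# Illustrative comparison of a small vs. a large k (values chosen arbitrarily)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

for k in (1, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:>2}  train acc={knn.score(X_train, y_train):.3f}  test acc={knn.score(X_test, y_test):.3f}")
# k=1 typically fits the training set perfectly (complex boundary, prone to overfitting),
# while a large k smooths the boundary and can start to underfit.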

Statistical Methods for k Selection

  1. Cross-Validation: Use k-fold cross-validation to test different k values and choose the one with the highest accuracy (a GridSearchCV sketch follows this list)
  2. Elbow Method: Plot error rates for different k values and choose the point where the curve forms an “elbow”
  3. Odd Values: Use odd numbers for k in classification to avoid ties
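
In scikit-learn, the cross-validation approach can be automated with GridSearchCV; here is a minimal sketch, where the k range 1 to 20 and cv=5 are example choices:

# Cross-validated search over k with GridSearchCV (grid and cv are example choices)
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 21))}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)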

Assumptions and Considerations

Key Assumptions

  • Feature relevance: All features contribute meaningfully to similarity
  • Feature scaling: Features should be on similar scales, since KNN is distance-based (see the sketch after this list)
  • Local similarity: Nearby points should have similar labels/values
  • Sufficient data: Adequate training examples in all regions of the feature space
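
The scaling assumption is worth demonstrating: a feature with a large numeric range dominates the distance calculation. In the sketch below, the two hypothetical features ("age" and "income") are made up for illustration:

# Why scaling matters: the large-range feature dominates the Euclidean distance
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [age, annual income]
X = np.array([[25.0, 50000.0],
              [27.0, 52000.0],
              [26.0, 90000.0]])

d_raw = (np.linalg.norm(X[0] - X[1]), np.linalg.norm(X[0] - X[2]))
X_std = StandardScaler().fit_transform(X)
d_std = (np.linalg.norm(X_std[0] - X_std[1]), np.linalg.norm(X_std[0] - X_std[2]))
print("raw distances:   ", d_raw)   # dominated almost entirely by income
print("scaled distances:", d_std)   # both features now contribute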

Advantages

  • Simple to understand and implement
  • No assumptions about data distribution
  • Works well with small datasets
  • Can handle multi-class classification naturally
  • Effective for non-linear decision boundaries

Disadvantages

  • Computationally expensive at prediction time
  • Sensitive to irrelevant features
  • Requires feature scaling
  • Poor performance with high-dimensional data (curse of dimensionality)
  • Sensitive to local structure of data

Python Implementation

Here’s a complete end-to-end example using scikit-learn’s KNeighborsClassifier, covering data loading, preprocessing, k selection via cross-validation, evaluation, and visualization:

# Simple KNN Classification on Iris Dataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load and prepare data
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Find optimal k
k_range = range(1, 21)
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train_scaled, y_train, cv=5).mean() for k in k_range]
optimal_k = k_range[np.argmax(cv_scores)]

# Train and evaluate KNN
knn = KNeighborsClassifier(n_neighbors=optimal_k)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
print(f"Optimal k: {optimal_k}")
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5,4))
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.imshow(cm, cmap='Blues', interpolation='nearest')
plt.colorbar()
plt.xticks(np.arange(len(iris.target_names)), iris.target_names)
plt.yticks(np.arange(len(iris.target_names)), iris.target_names)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        plt.text(j, i, cm[i, j], ha='center', va='center', color='black')
plt.tight_layout()
plt.show()

# k-value optimization curve
plt.figure(figsize=(6,4))
plt.plot(k_range, cv_scores, 'bo-')
plt.axvline(x=optimal_k, color='red', linestyle='--', label=f'Optimal k={optimal_k}')
plt.xlabel('k Value')
plt.ylabel('CV Accuracy')
plt.title('K-Value Optimization')
plt.legend()
plt.grid(True)
plt.show()
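
As a follow-up, the fitted scaler and knn objects from the script above can classify a new measurement; the sample values here are made up for illustration:

# Classify a new, unseen flower measurement (values are illustrative only)
new_sample = np.array([[5.0, 3.4, 1.5, 0.2]])     # sepal/petal measurements in cm
new_sample_scaled = scaler.transform(new_sample)  # reuse the scaler fitted above
pred = knn.predict(new_sample_scaled)[0]
print("Predicted class:", iris.target_names[pred])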

Cheers!