K-Nearest Neighbors (K-NN) Algorithm

Implementation and Analysis

Overview

This project builds a k-Nearest Neighbors (k-NN) classifier from scratch in Python. The k-NN algorithm is a simple machine learning method for classification tasks. Instead of fitting a parametric model, it memorizes the training data and makes predictions by looking at the k closest data points.

How It Works:

  1. The algorithm calculates the distance between a new data point and all training examples.
  2. It selects the k nearest neighbors (most similar examples).
  3. The class (label) with the most votes among these neighbors is assigned to the new data point.
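The three steps above can be sketched directly with NumPy. This is a minimal illustration, assuming Euclidean distance and a simple majority vote (the tiny example data is made up for demonstration):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from x_new to every training example.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among their labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([0.6, 0.4]), k=3)
```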

Implementation

K-NN works by comparing new data points with existing ones and classifying them based on similarity. The main steps are described below.

Data Preparation and Preprocessing

Before using K-NN, the data needs to be prepared. Because K-NN relies on distances, features on different scales would contribute unequally; to make sure all features contribute equally, the StandardScaler from Scikit-learn is used to scale them.
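A short sketch of the scaling step, using made-up data where one feature is a few hundred times larger than the other:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: unscaled, Euclidean distance
# would be dominated by the first column.
X = np.array([[100.0, 0.1],
              [200.0, 0.2],
              [300.0, 0.3]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# After scaling, each column has mean 0 and unit variance,
# so both features contribute comparably to the distance.
```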

Testing

Testing was performed using 10-fold cross-validation: the dataset (X, y) was divided into 10 equal parts (folds), the model was trained on 9 of them and tested on the remaining one, and the process was repeated 10 times so that each fold served as the test set exactly once. To tune the hyperparameter, values of K from 1 to 30 were tested. For each value, the model's accuracy was computed on every fold and the average accuracy was recorded; the best K was the one with the highest average accuracy. The tested K values and their corresponding average accuracies were stored in two lists, which were used to select the most effective K-NN model for the dataset.
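The tuning procedure can be sketched as follows. This uses Scikit-learn's KNeighborsClassifier and the built-in digits dataset as stand-ins, since the report's own classifier and data loading are not shown here:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)  # stand-in dataset

k_values = list(range(1, 31))   # K ranging from 1 to 30
mean_accuracies = []
for k in k_values:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validation
    mean_accuracies.append(scores.mean())

# Best K: the value with the highest average accuracy across folds.
best_k = k_values[int(np.argmax(mean_accuracies))]
```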

Performance Analysis

The results below were obtained by testing the K-Nearest Neighbors (K-NN) algorithm on a modified version of the MNIST dataset containing only images of the digits 5 and 6. The test aimed to evaluate how well K-NN can distinguish these two digits.
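The two-digit setup can be reproduced in outline. As a stand-in for full MNIST, this sketch filters Scikit-learn's smaller 8x8 digits dataset down to the classes 5 and 6 and evaluates a K-NN classifier on a held-out split:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for MNIST: Scikit-learn's 8x8 handwritten digits.
X, y = load_digits(return_X_y=True)

# Keep only the digits 5 and 6, as in the modified dataset.
mask = (y == 5) | (y == 6)
X, y = X[mask], y[mask]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # accuracy on the held-out digits
```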

The accuracy of K-NN depends strongly on the choice of k: very small values are sensitive to noise, while very large values smooth over class boundaries.

Additional Tests

To further evaluate the performance of K-NN, additional tests were conducted on several classification datasets from Scikit-learn.
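Such a multi-dataset check can be sketched as below. The specific datasets (iris, wine, breast cancer) are an illustrative choice, since the report does not name which ones were used:

```python
from sklearn.datasets import load_breast_cancer, load_iris, load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative datasets; swap in any Scikit-learn classification dataset.
datasets = {
    "iris": load_iris(return_X_y=True),
    "wine": load_wine(return_X_y=True),
    "breast_cancer": load_breast_cancer(return_X_y=True),
}

results = {}
for name, (X, y) in datasets.items():
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold CV per dataset
    results[name] = scores.mean()
    print(f"{name}: mean accuracy = {results[name]:.3f}")
```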

Conclusion

The K-NN algorithm is a simple but effective classifier. Choosing the right k, for example via cross-validation as above, is key to its performance.