# Exercise 2: Classification system with KNN - To Loan or Not To Loan

## Imports

Import the libraries used throughout the exercise.

In [5]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.model_selection import train_test_split

## a. Getting started

### Data loading

The original dataset comes from Kaggle's [Loan Prediction](https://www.kaggle.com/ninzaami/loan-predication) problem. The provided dataset has already been preprocessed: some columns and invalid rows have been removed. Pandas is used to read the CSV file.

In [6]:
data = pd.read_csv("loandata.csv")

Display the head of the data.

In [7]:
data.head()

| | Gender | Married | Education | TotalIncome | LoanAmount | CreditHistory | LoanStatus |
|---|---|---|---|---|---|---|---|
| 0 | Male | Yes | Graduate | 6091.0 | 128.0 | 1.0 | N |
| 1 | Male | Yes | Graduate | 3000.0 | 66.0 | 1.0 | Y |
| 2 | Male | Yes | Not Graduate | 4941.0 | 120.0 | 1.0 | Y |
| 3 | Male | No | Graduate | 6000.0 | 141.0 | 1.0 | Y |
| 4 | Male | Yes | Graduate | 9613.0 | 267.0 | 1.0 | Y |


Dataset columns:
* **Gender:** Applicant gender (Male/Female)
* **Married:** Is the applicant married? (Yes/No)
* **Education:** Applicant education (Graduate/Not Graduate)
* **TotalIncome:** Applicant total income (sum of the `ApplicantIncome` and `CoapplicantIncome` columns in the original dataset)
* **LoanAmount:** Loan amount in thousands
* **CreditHistory:** Whether the credit history meets guidelines
* **LoanStatus (target):** Loan approved (Y/N)

### Data preprocessing

Define a list of categorical columns to encode.

In [8]:
categorical_columns = ["Gender", "Married", "Education", "LoanStatus"]

Encode the categorical columns using scikit-learn's [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html).

In [9]:
data[categorical_columns] = OrdinalEncoder().fit_transform(data[categorical_columns])
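
To make the encoding concrete, here is a small sketch on a hand-made frame. The rows are illustrative and not taken from `loandata.csv`; only the column names mirror the dataset. With its default settings, `OrdinalEncoder` sorts the categories of each column and maps them to 0, 1, ..., so `N`/`Y` become 0.0/1.0.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative rows only; the column names mirror the dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes"],
    "Education": ["Graduate", "Not Graduate", "Graduate"],
    "LoanStatus": ["N", "Y", "Y"],
})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy)

# categories_ holds the sorted categories per column; the position is the code.
for column, categories in zip(toy.columns, encoder.categories_):
    print(column, {str(cat): float(code) for code, cat in enumerate(categories)})
```

The printed mappings show which integer stands for which category, which helps when interpreting predictions later (here, 1 means the loan was approved).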

Split into `X` and `y`.

In [10]:
X = data.drop(columns="LoanStatus")
y = data.LoanStatus

Standardize the features using scikit-learn's [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

In [11]:
X[X.columns] = StandardScaler().fit_transform(X[X.columns])
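
As a reminder of what this step computes, standardization replaces each value with `(x - mean) / std`, column by column. The following is a minimal NumPy sketch on made-up numbers; it uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Made-up values standing in for two numeric columns (income, loan amount).
values = np.array([
    [3000.0, 66.0],
    [6000.0, 141.0],
    [9000.0, 267.0],
])

mean = values.mean(axis=0)
std = values.std(axis=0)  # population standard deviation (ddof=0)
standardized = (values - mean) / std

print(standardized.mean(axis=0))  # close to 0 for every column
print(standardized.std(axis=0))   # close to 1 for every column
```

Because KNN distances sum over all features, leaving `TotalIncome` on its original scale would let it dominate small-range features such as `CreditHistory`.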

Convert `y` to type `int`.

In [12]:
y = y.astype(int)

Split dataset into train and test sets.

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
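
The `stratify=y` argument keeps the class proportions roughly equal in the train and test sets. A small sketch on synthetic labels (not the loan data) shows the effect; the names `X_toy` and `y_toy` are placeholders introduced for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 70 positives and 30 negatives.
y_toy = np.array([1] * 70 + [0] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

# The share of positives should be about the same in all three sets.
print("positive share, full :", y_toy.mean())
print("positive share, train:", y_tr.mean())
print("positive share, test :", y_te.mean())
```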

## b. Dummy classifier

Build a dummy classifier that makes its decisions at random.

In [None]:
class DummyClassifier():
    
    def __init__(self):
        """
        Initialize the class.
        """
        self.classes_ = None
    
    def fit(self, X, y):
        """
        Fit the dummy classifier.
        """
        self.classes_ = np.unique(y)
    
    def predict(self, X):
        """
        Predict the class labels for the provided data.
        """
        n_samples = len(X)
        return np.random.choice(self.classes_, size=n_samples)


Implement a function to evaluate the performance of a classification by computing the accuracy ($N_{correct}/N$).

In [15]:
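def accuracy_score(y_true, y_pred):
    # Accuracy = N_correct / N: share of predictions equal to the true labels.
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))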

Compute the performance of the dummy classifier using the provided test set.
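
One possible way to approach this step is sketched below. In the notebook it would amount to fitting the dummy classifier on the training set and scoring its predictions on `X_test`; the block here is self-contained instead, so it uses synthetic labels and two hypothetical helpers, `accuracy` and `random_guess_accuracy`, that are not part of the exercise. Since a single random draw is noisy, the sketch averages over several draws.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Share of positions where prediction and label agree (N_correct / N)."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def random_guess_accuracy(y_train, y_test, n_repeats=100, seed=0):
    """Average accuracy of uniform random guessing over several draws."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y_train)
    scores = [
        accuracy(y_test, rng.choice(classes, size=len(y_test)))
        for _ in range(n_repeats)
    ]
    return float(np.mean(scores))

# Synthetic labels only: 70% ones, 30% zeros.
y_train_toy = np.array([1] * 70 + [0] * 30)
y_test_toy = np.array([1] * 14 + [0] * 6)
print(random_guess_accuracy(y_train_toy, y_test_toy))
```

With two classes, uniform guessing is expected to land near 50% regardless of the class balance, which gives a baseline that the KNN classifier should beat.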

## c. K-Nearest Neighbors classifier

Build a K-Nearest Neighbors classifier that uses the Euclidean distance and a simple majority vote.

In [16]:
class KNNClassifier():
    
    def __init__(self, n_neighbors=3):
        """
        Initialize the class.
        
        Parameters
        ----------
        n_neighbors : int, default=3
            Number of neighbors to use by default.
        """

        self.n_neighbors = n_neighbors
        self.X_train = None
        self.y_train = None
    
    def fit(self, X, y):
        """
        Fit the k-nearest neighbors classifier.
        
        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)
            Training data.
        y : Numpy array or Pandas Series of shape (n_samples,)
            Target values.
        """

        self.X_train = np.array(X)
        self.y_train = np.array(y)
    
    @staticmethod
    def _euclidian_distance(a, b):
        """
        Utility function to compute the Euclidean distance.
        
        Parameters
        ----------
        a : Numpy array or Pandas DataFrame
            First operand.
        b : Numpy array or Pandas DataFrame
            Second operand.
        """

        return np.sqrt(np.sum((a - b) ** 2, axis=1))
    
    @staticmethod
    def _manhattan_distance(a, b):
        """
        Utility function to compute the Manhattan distance.
        
        Parameters
        ----------
        a : Numpy array or Pandas DataFrame
            First operand.
        b : Numpy array or Pandas DataFrame
            Second operand.
        """
        
        return np.sum(np.abs(a - b), axis=1)
    
    def predict(self, X):
        """
        Predict the class labels for the provided data.

        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)
            Test samples.

        Returns
        -------
        y : Numpy array of shape (n_queries,)
            Class labels for each data sample.
        """
        X = np.array(X)
        predictions = []
        for x in X:
            distances = self._euclidian_distance(self.X_train, x)
            neighbor_idxs = np.argsort(distances)[:self.n_neighbors]
            neighbor_labels = self.y_train[neighbor_idxs]
            
            counts = np.bincount(neighbor_labels.astype(int))
            predictions.append(np.argmax(counts))
        return np.array(predictions)
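
To see the prediction steps on concrete numbers, here is a small worked sketch for a single query point. The points and labels are hand-made for illustration; the three steps (distances, `k` nearest, majority vote) mirror those in `predict()`.

```python
import numpy as np

# Hand-made 2-D points: three of class 0 near the origin, two of class 1.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1])
query = np.array([4.0, 4.0])

# Step 1: Euclidean distance from the query to every training point.
distances = np.sqrt(np.sum((X_toy - query) ** 2, axis=1))
# Step 2: indices of the k = 3 closest training points.
nearest = np.argsort(distances)[:3]
# Step 3: count the labels among the neighbors and take the most frequent.
votes = np.bincount(y_toy[nearest])
print(nearest, votes, np.argmax(votes))

# With an even k the vote can tie; np.argmax then returns the lowest label.
tied_votes = np.bincount(np.array([0, 1]))
print(np.argmax(tied_votes))
```

Here the three nearest points are the ones at indices 3, 4 and 2, giving two votes for class 1 and one for class 0. The last lines show how ties are resolved by this voting scheme: the lowest class label wins, which is worth keeping in mind when comparing even and odd values of `k`.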

Compute the performance of the system as a function of $k = 1, \dots, 7$.

In [18]:
knn_accuracies = []
for k in range(1, 8):
    knn = KNNClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    knn_accuracies.append(acc)

print("KNN accuracies for k=1 to 7:", knn_accuracies)

Run the KNN algorithm using only the features `TotalIncome` and `CreditHistory`.

Re-run the KNN algorithm using the features `TotalIncome`, `CreditHistory` and `Married`.

Re-run the KNN algorithm using all features.
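
The three runs above differ only in the columns passed to the classifier, so they can share one helper. The sketch below is self-contained and therefore makes several assumptions: `knn_predict` is a compact stand-in for the classifier built earlier, `accuracy_for_features` is a hypothetical helper, and the data is random noise carrying the dataset's column names, so the printed accuracies are meaningless by themselves.

```python
import numpy as np
import pandas as pd

def knn_predict(X_train, y_train, X_query, k=3):
    """Brute-force KNN with Euclidean distance and majority voting."""
    predictions = []
    for x in X_query:
        distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
        nearest = np.argsort(distances)[:k]
        predictions.append(np.argmax(np.bincount(y_train[nearest])))
    return np.array(predictions)

def accuracy_for_features(train, test, features, target="LoanStatus", k=3):
    """Accuracy of KNN restricted to the given feature columns."""
    y_pred = knn_predict(
        train[features].to_numpy(),
        train[target].to_numpy(),
        test[features].to_numpy(),
        k,
    )
    return float(np.mean(y_pred == test[target].to_numpy()))

# Random stand-in data with the dataset's column names (not the loan data).
rng = np.random.default_rng(0)
columns = ["Gender", "Married", "Education",
           "TotalIncome", "LoanAmount", "CreditHistory"]
toy = pd.DataFrame(rng.normal(size=(200, len(columns))), columns=columns)
toy["LoanStatus"] = rng.integers(0, 2, size=200)
train, test = toy.iloc[:160], toy.iloc[160:]

feature_sets = {
    "income + credit": ["TotalIncome", "CreditHistory"],
    "income + credit + married": ["TotalIncome", "CreditHistory", "Married"],
    "all features": columns,
}
results = {
    name: accuracy_for_features(train, test, features)
    for name, features in feature_sets.items()
}
print(results)
```

In the notebook, the same idea would apply by selecting columns from the existing splits, for example `X_train[features]` and `X_test[features]`, and reusing the classifier and accuracy function defined earlier.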