{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5da8da61",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Exercise 2: Classification system with KNN - To Loan or Not To Loan"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "9669e493",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Imports"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "22bbd869",
|
|
"metadata": {},
|
|
"source": [
|
|
"Import some useful libraries"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "26758936",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import pandas as pd\n",
|
|
"import numpy as np\n",
|
|
"from sklearn.preprocessing import OrdinalEncoder, StandardScaler\n",
|
|
"from sklearn.model_selection import train_test_split"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "abc131ca",
|
|
"metadata": {},
|
|
"source": [
|
|
"## a. Getting started"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "45b518e5",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Data loading"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "1ef061f2",
|
|
"metadata": {},
|
|
"source": [
|
|
"The original dataset comes from Kaggle's [Loan Prediction](https://www.kaggle.com/ninzaami/loan-predication) problem. The provided dataset has already been preprocessed: some columns and invalid rows were removed. The CSV file is read with pandas."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "a23f62b5",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"data = pd.read_csv(\"loandata.csv\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "02ca77c7",
|
|
"metadata": {},
|
|
"source": [
|
|
"Display the head of the data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "f4bec500",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>Gender</th>\n",
|
|
" <th>Married</th>\n",
|
|
" <th>Education</th>\n",
|
|
" <th>TotalIncome</th>\n",
|
|
" <th>LoanAmount</th>\n",
|
|
" <th>CreditHistory</th>\n",
|
|
" <th>LoanStatus</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Male</td>\n",
|
|
" <td>Yes</td>\n",
|
|
" <td>Graduate</td>\n",
|
|
" <td>6091.0</td>\n",
|
|
" <td>128.0</td>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>N</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Male</td>\n",
|
|
" <td>Yes</td>\n",
|
|
" <td>Graduate</td>\n",
|
|
" <td>3000.0</td>\n",
|
|
" <td>66.0</td>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>Y</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Male</td>\n",
|
|
" <td>Yes</td>\n",
|
|
" <td>Not Graduate</td>\n",
|
|
" <td>4941.0</td>\n",
|
|
" <td>120.0</td>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>Y</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Male</td>\n",
|
|
" <td>No</td>\n",
|
|
" <td>Graduate</td>\n",
|
|
" <td>6000.0</td>\n",
|
|
" <td>141.0</td>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>Y</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>Male</td>\n",
|
|
" <td>Yes</td>\n",
|
|
" <td>Graduate</td>\n",
|
|
" <td>9613.0</td>\n",
|
|
" <td>267.0</td>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>Y</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Gender Married Education TotalIncome LoanAmount CreditHistory \\\n",
|
|
"0 Male Yes Graduate 6091.0 128.0 1.0 \n",
|
|
"1 Male Yes Graduate 3000.0 66.0 1.0 \n",
|
|
"2 Male Yes Not Graduate 4941.0 120.0 1.0 \n",
|
|
"3 Male No Graduate 6000.0 141.0 1.0 \n",
|
|
"4 Male Yes Graduate 9613.0 267.0 1.0 \n",
|
|
"\n",
|
|
" LoanStatus \n",
|
|
"0 N \n",
|
|
"1 Y \n",
|
|
"2 Y \n",
|
|
"3 Y \n",
|
|
"4 Y "
|
|
]
|
|
},
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "e271b475",
|
|
"metadata": {},
|
|
"source": [
|
|
"Columns of the dataset:\n",
"* **Gender:** Applicant gender (Male/Female)\n",
"* **Married:** Is the applicant married? (Yes/No)\n",
"* **Education:** Applicant education (Graduate/Not Graduate)\n",
"* **TotalIncome:** Applicant total income (sum of the `ApplicantIncome` and `CoapplicantIncome` columns in the original dataset)\n",
"* **LoanAmount:** Loan amount in thousands\n",
"* **CreditHistory:** Whether the applicant's credit history meets the guidelines\n",
"* **LoanStatus** (Target)**:** Loan approved (Y/N)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "702ce4e6",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Data preprocessing"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7fce724c",
|
|
"metadata": {},
|
|
"source": [
|
|
"Define a list of categorical columns to encode."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"id": "2c56efa5",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"categorical_columns = [\"Gender\", \"Married\", \"Education\", \"LoanStatus\"]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "d8915a68",
|
|
"metadata": {},
|
|
"source": [
|
|
"Encode the categorical columns using scikit-learn's [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"id": "dc5f9cda",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"data[categorical_columns] = OrdinalEncoder().fit_transform(data[categorical_columns])"
|
|
]
|
|
},
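To make the encoding step less of a black box, here is a minimal sketch of what ordinal encoding does to a single column. It is not scikit-learn's implementation: the helper `ordinal_encode` and the sample arrays are hypothetical, and the sketch only assumes the default behaviour of `OrdinalEncoder`, which orders the categories of each column by sorting them and returns floating-point codes.

```python
import numpy as np

# Hypothetical samples of two categorical columns.
loan_status = np.array(["N", "Y", "Y", "Y", "N"])
education = np.array(["Graduate", "Not Graduate", "Graduate"])


def ordinal_encode(values):
    """Hypothetical helper: return (categories, codes).

    categories are the sorted unique values; each code is the position
    of the corresponding value in that sorted list, stored as a float.
    """
    categories, codes = np.unique(values, return_inverse=True)
    return categories.tolist(), codes.astype(float)


status_categories, status_codes = ordinal_encode(loan_status)
education_categories, education_codes = ordinal_encode(education)

print(status_categories, status_codes)
print(education_categories, education_codes)
```

Under these assumptions `N` maps to `0.0` and `Y` to `1.0`, which is why the target is later converted back to `int`.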
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "df9c84b4",
|
|
"metadata": {},
|
|
"source": [
|
|
"Split into `X` and `y`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"id": "83beacfb",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"X = data.drop(columns=\"LoanStatus\")\n",
|
|
"y = data.LoanStatus"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "e25c8f24",
|
|
"metadata": {},
|
|
"source": [
|
|
"Standardize the features (zero mean, unit variance) using scikit-learn's [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"id": "9c567bb7",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"X[X.columns] = StandardScaler().fit_transform(X[X.columns])"
|
|
]
|
|
},
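Scaling matters for KNN because a distance computed on raw values would be dominated by the feature with the largest range (here the income). The sketch below reproduces the standardization formula by hand with NumPy, using the `TotalIncome` and `LoanAmount` values of the five rows displayed earlier; the variable names are illustrative only.

```python
import numpy as np

# TotalIncome and LoanAmount of the five rows shown by data.head().
X_toy = np.array([
    [6091.0, 128.0],
    [3000.0, 66.0],
    [4941.0, 120.0],
    [6000.0, 141.0],
    [9613.0, 267.0],
])

# Per-column mean and population standard deviation (ddof=0),
# which is the convention StandardScaler uses.
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)

# z = (x - mean) / std, applied column by column through broadcasting.
X_scaled = (X_toy - mean) / std

print(X_scaled.mean(axis=0))  # close to 0 for every column
print(X_scaled.std(axis=0))   # close to 1 for every column
```

After this transformation both columns contribute on a comparable scale to a Euclidean distance.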
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7437ea21",
|
|
"metadata": {},
|
|
"source": [
|
|
"Convert `y` to type `int`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"id": "c0db7c1f",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"y = y.astype(int)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "6d1d1f10",
|
|
"metadata": {},
|
|
"source": [
|
|
"Split the dataset into training and test sets."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"id": "b05be2cc",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)"
|
|
]
|
|
},
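The `stratify=y` argument asks for a split in which each class keeps roughly the same proportion in the training and test sets. The following sketch illustrates that idea with a hand-written helper; `stratified_split_indices` is hypothetical and is not how scikit-learn implements the split, and the 70/30 class balance of `y_demo` is made up for the example.

```python
import numpy as np


def stratified_split_indices(y, test_size=0.2, seed=42):
    """Hypothetical helper: shuffle each class separately, then send
    about test_size of every class to the test set."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(test_size * len(idx)))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)


# Made-up imbalanced target: 70 approved loans, 30 rejected.
y_demo = np.array([1] * 70 + [0] * 30)
train_idx, test_idx = stratified_split_indices(y_demo)

# Share of class 1 in each subset.
print(y_demo[train_idx].mean(), y_demo[test_idx].mean())
```

Both subsets keep the original class ratio, so an accuracy measured on the test set is not distorted by an unlucky draw of the minority class.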
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "8f6d3ce6",
|
|
"metadata": {},
|
|
"source": [
|
|
"## b. Dummy classifier"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "80ec4058",
|
|
"metadata": {},
|
|
"source": [
|
|
"Build a dummy classifier that makes decisions randomly."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"id": "30919672",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"class DummyClassifier():\n",
"    \n",
"    def __init__(self):\n",
"        \"\"\"\n",
"        Initialize the class.\n",
"        \"\"\"\n",
"        pass\n",
"    \n",
"    def fit(self, X, y):\n",
"        \"\"\"\n",
"        Fit the dummy classifier.\n",
"        \n",
"        Parameters\n",
"        ----------\n",
"        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)\n",
"            Training data.\n",
"        y : Numpy array or Pandas DataFrame of shape (n_samples,)\n",
"            Target values.\n",
"        \"\"\"\n",
"        pass\n",
"    \n",
"    def predict(self, X):\n",
"        \"\"\"\n",
"        Predict the class labels for the provided data.\n",
"\n",
"        Parameters\n",
"        ----------\n",
"        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)\n",
"            Test samples.\n",
"\n",
"        Returns\n",
"        -------\n",
"        y : Numpy array or Pandas DataFrame of shape (n_queries,)\n",
"            Class labels for each data sample.\n",
"        \"\"\"\n",
"        pass"
|
|
]
|
|
},
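One possible way to fill in such a baseline is sketched below. It is only an illustration, not the expected solution: the class name `RandomGuessClassifier`, the `random_state` parameter and the choice of a uniform draw over the classes seen during `fit` are all assumptions.

```python
import numpy as np


class RandomGuessClassifier:
    """Hypothetical baseline that predicts a uniformly random class."""

    def __init__(self, random_state=None):
        # Seed kept so that the random predictions can be reproduced.
        self.random_state = random_state

    def fit(self, X, y):
        # Nothing is learned from X; only the set of class labels is stored.
        self.classes_ = np.unique(y)
        self._rng = np.random.default_rng(self.random_state)
        return self

    def predict(self, X):
        # One independent random label per query row.
        return self._rng.choice(self.classes_, size=len(X))


X_demo = np.zeros((10, 2))
y_demo = np.array([0, 1, 1, 1, 0, 1, 1, 0, 1, 1])
predictions = RandomGuessClassifier(random_state=0).fit(X_demo, y_demo).predict(X_demo)
print(predictions)
```

With two classes drawn uniformly, the expected accuracy of such a baseline is about 50% whatever the class balance, which gives a floor against which a real classifier can be compared.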
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "1dd67c48",
|
|
"metadata": {},
|
|
"source": [
|
|
"Implement a function that evaluates a classifier by computing its accuracy ($N_{correct}/N$), i.e. the fraction of correctly classified samples."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"id": "184f3905",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def accuracy_score(y_true, y_pred):\n",
|
|
"    pass"
|
|
]
|
|
},
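As a hedged illustration of the formula $N_{correct}/N$, the sketch below computes the accuracy with NumPy. The function is named `accuracy` here (a hypothetical name) so that it does not clash with the stub above, and the shape check is an added assumption about how mismatched inputs should be handled.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    # np.asarray drops any pandas index, so the comparison is positional.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true == y_pred))


print(accuracy([1, 0, 1, 1], [1, 1, 1, 0]))  # 2 correct out of 4 -> 0.5
```

Converting to arrays first is a deliberate choice: after a shuffled split, `y_test` is a pandas Series with a non-contiguous index, and a positional comparison avoids any index alignment surprises.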
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "90dcae17",
|
|
"metadata": {},
|
|
"source": [
|
|
"Compute the performance of the dummy classifier using the provided test set."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "fa666b66",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "9e10cd97",
|
|
"metadata": {},
|
|
"source": [
|
|
"## c. K-Nearest Neighbors classifier"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "70009457",
|
|
"metadata": {},
|
|
"source": [
|
|
"Build a K-Nearest Neighbors classifier that uses the Euclidean distance and a simple majority vote."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"id": "759e924e",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"class KNNClassifier():\n",
"    \n",
"    def __init__(self, n_neighbors=3):\n",
"        \"\"\"\n",
"        Initialize the class.\n",
"        \n",
"        Parameters\n",
"        ----------\n",
"        n_neighbors : int, default=3\n",
"            Number of neighbors to use by default.\n",
"        \"\"\"\n",
"        pass\n",
"    \n",
"    def fit(self, X, y):\n",
"        \"\"\"\n",
"        Fit the k-nearest neighbors classifier.\n",
"        \n",
"        Parameters\n",
"        ----------\n",
"        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)\n",
"            Training data.\n",
"        y : Numpy array or Pandas DataFrame of shape (n_samples,)\n",
"            Target values.\n",
"        \"\"\"\n",
"        pass\n",
"    \n",
"    @staticmethod\n",
"    def _euclidian_distance(a, b):\n",
"        \"\"\"\n",
"        Utility function to compute the Euclidean distance.\n",
"        \n",
"        Parameters\n",
"        ----------\n",
"        a : Numpy array or Pandas DataFrame\n",
"            First operand.\n",
"        b : Numpy array or Pandas DataFrame\n",
"            Second operand.\n",
"        \"\"\"\n",
"        pass\n",
"    \n",
"    def predict(self, X):\n",
"        \"\"\"\n",
"        Predict the class labels for the provided data.\n",
"\n",
"        Parameters\n",
"        ----------\n",
"        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)\n",
"            Test samples.\n",
"\n",
"        Returns\n",
"        -------\n",
"        y : Numpy array or Pandas DataFrame of shape (n_queries,)\n",
"            Class labels for each data sample.\n",
"        \"\"\"\n",
"        pass"
|
|
]
|
|
},
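The skeleton above can be filled in many ways. The sketch below shows one of them under stated assumptions: the class name `SimpleKNN` is hypothetical, the training set is simply memorized in `fit`, and ties in the vote are broken in favour of the smallest label, a rule the exercise does not specify.

```python
import numpy as np


class SimpleKNN:
    """Hypothetical brute-force KNN with Euclidean distance and majority vote."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # KNN is a lazy learner: fitting only stores the training set.
        self._X = np.asarray(X, dtype=float)
        self._y = np.asarray(y)
        return self

    @staticmethod
    def _euclidean_distance(a, b):
        # a: (n_train, n_features), b: (n_features,) -> distances of shape (n_train,)
        return np.sqrt(np.sum((a - b) ** 2, axis=1))

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        predictions = []
        for query in X:
            distances = self._euclidean_distance(self._X, query)
            # Indices of the n_neighbors closest training samples.
            nearest = np.argsort(distances, kind="stable")[: self.n_neighbors]
            labels, counts = np.unique(self._y[nearest], return_counts=True)
            # Majority vote; on a tie argmax keeps the smallest label.
            predictions.append(labels[np.argmax(counts)])
        return np.array(predictions)


# Two well-separated made-up clusters.
X_train_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                         [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train_demo = np.array([0, 0, 0, 1, 1, 1])

model = SimpleKNN(n_neighbors=3).fit(X_train_demo, y_train_demo)
demo_predictions = model.predict(np.array([[0.5, 0.5], [5.5, 5.5]]))
print(demo_predictions)  # [0 1]
```

The loop over queries keeps the sketch readable; a vectorized distance matrix would be faster on larger datasets but is not needed to understand the algorithm.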
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "6c2b4811",
|
|
"metadata": {},
|
|
"source": [
|
|
"Compute the performance of the system as a function of $k$, for $k = 1, \\dots, 7$."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "cf589e66",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
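A sweep over $k$ can be organized as a simple loop that stores one accuracy per value. The sketch below is self-contained and therefore runs on synthetic data (two Gaussian blobs) rather than on the loan dataset; the function `knn_predict`, the data and the 80/20 split are all assumptions made for illustration, and labels are assumed to be non-negative integers.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k):
    """Hypothetical vectorized KNN prediction for integer labels."""
    # Pairwise Euclidean distances of shape (n_query, n_train) via broadcasting.
    diffs = X_query[:, None, :] - X_train[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Labels of the k closest training samples for every query.
    nearest = np.argsort(distances, axis=1, kind="stable")[:, :k]
    votes = y_train[nearest]
    # Majority vote per row; ties go to the smallest label.
    return np.array([np.bincount(row).argmax() for row in votes])


# Synthetic two-class problem, not the loan data.
rng = np.random.default_rng(0)
X_synth = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(2.5, 1.0, (60, 2))])
y_synth = np.array([0] * 60 + [1] * 60)

# Simple shuffled 80/20 split.
order = rng.permutation(len(y_synth))
train_idx, test_idx = order[:96], order[96:]

scores = {}
for k in range(1, 8):
    y_pred = knn_predict(X_synth[train_idx], y_synth[train_idx], X_synth[test_idx], k)
    scores[k] = float(np.mean(y_pred == y_synth[test_idx]))

for k, score in scores.items():
    print(f"k={k}: accuracy={score:.3f}")
```

With two classes, even values of $k$ can produce tied votes, so the tie-breaking rule has a visible effect on the curve; odd values avoid that ambiguity.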
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "71c51f35",
|
|
"metadata": {},
|
|
"source": [
|
|
"Run the KNN algorithm using only the features `TotalIncome` and `CreditHistory`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "2f6f262b",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "e2b1a682",
|
|
"metadata": {},
|
|
"source": [
|
|
"Re-run the KNN algorithm using the features `TotalIncome`, `CreditHistory` and `Married`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "c0bda7ee",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "2724167a",
|
|
"metadata": {},
|
|
"source": [
|
|
"Re-run the KNN algorithm using all features."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "46ec9699",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "648aa52e",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.7.11"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|