MachLePublic/PW-2/ex2-to-loan-or-not-to-loan/to-loan-or-not-to-loan-stud.ipynb
gabriel.marinoja f0e1453d13 Started PW-2
2025-09-23 13:18:25 +02:00

{
"cells": [
{
"cell_type": "markdown",
"id": "5da8da61",
"metadata": {},
"source": [
"# Exercise 2: Classification system with KNN - To Loan or Not To Loan"
]
},
{
"cell_type": "markdown",
"id": "9669e493",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "markdown",
"id": "22bbd869",
"metadata": {},
"source": [
"Import the libraries used in this notebook."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "26758936",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.preprocessing import OrdinalEncoder, StandardScaler\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"id": "abc131ca",
"metadata": {},
"source": [
"## a. Getting started"
]
},
{
"cell_type": "markdown",
"id": "45b518e5",
"metadata": {},
"source": [
"### Data loading"
]
},
{
"cell_type": "markdown",
"id": "1ef061f2",
"metadata": {},
"source": [
"The original dataset comes from Kaggle's [Loan Prediction](https://www.kaggle.com/ninzaami/loan-predication) problem. The provided dataset has already been preprocessed: some columns and invalid rows were removed. Pandas is used to read the CSV file."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a23f62b5",
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv(\"loandata.csv\")"
]
},
{
"cell_type": "markdown",
"id": "02ca77c7",
"metadata": {},
"source": [
"Display the head of the data."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f4bec500",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Gender</th>\n",
" <th>Married</th>\n",
" <th>Education</th>\n",
" <th>TotalIncome</th>\n",
" <th>LoanAmount</th>\n",
" <th>CreditHistory</th>\n",
" <th>LoanStatus</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Male</td>\n",
" <td>Yes</td>\n",
" <td>Graduate</td>\n",
" <td>6091.0</td>\n",
" <td>128.0</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Male</td>\n",
" <td>Yes</td>\n",
" <td>Graduate</td>\n",
" <td>3000.0</td>\n",
" <td>66.0</td>\n",
" <td>1.0</td>\n",
" <td>Y</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Male</td>\n",
" <td>Yes</td>\n",
" <td>Not Graduate</td>\n",
" <td>4941.0</td>\n",
" <td>120.0</td>\n",
" <td>1.0</td>\n",
" <td>Y</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Graduate</td>\n",
" <td>6000.0</td>\n",
" <td>141.0</td>\n",
" <td>1.0</td>\n",
" <td>Y</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Male</td>\n",
" <td>Yes</td>\n",
" <td>Graduate</td>\n",
" <td>9613.0</td>\n",
" <td>267.0</td>\n",
" <td>1.0</td>\n",
" <td>Y</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Gender Married Education TotalIncome LoanAmount CreditHistory \\\n",
"0 Male Yes Graduate 6091.0 128.0 1.0 \n",
"1 Male Yes Graduate 3000.0 66.0 1.0 \n",
"2 Male Yes Not Graduate 4941.0 120.0 1.0 \n",
"3 Male No Graduate 6000.0 141.0 1.0 \n",
"4 Male Yes Graduate 9613.0 267.0 1.0 \n",
"\n",
" LoanStatus \n",
"0 N \n",
"1 Y \n",
"2 Y \n",
"3 Y \n",
"4 Y "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "markdown",
"id": "e271b475",
"metadata": {},
"source": [
"Data's columns:\n",
"* **Gender:** Applicant gender (Male/Female)\n",
"* **Married:** Is the applicant married? (Yes/No)\n",
"* **Education:** Applicant education (Graduate/Not Graduate)\n",
"* **TotalIncome:** Applicant total income (sum of `ApplicantIncome` and `CoapplicantIncome` columns in the original dataset)\n",
"* **LoanAmount:** Loan amount in thousands\n",
"* **CreditHistory:** Credit history meets guidelines\n",
"* **LoanStatus** (Target)**:** Loan approved (Y/N)"
]
},
{
"cell_type": "markdown",
"id": "702ce4e6",
"metadata": {},
"source": [
"### Data preprocessing"
]
},
{
"cell_type": "markdown",
"id": "7fce724c",
"metadata": {},
"source": [
"Define a list of categorical columns to encode."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2c56efa5",
"metadata": {},
"outputs": [],
"source": [
"categorical_columns = [\"Gender\", \"Married\", \"Education\", \"LoanStatus\"]"
]
},
{
"cell_type": "markdown",
"id": "d8915a68",
"metadata": {},
"source": [
"Encode the categorical columns using the [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) from scikit-learn."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "dc5f9cda",
"metadata": {},
"outputs": [],
"source": [
"data[categorical_columns] = OrdinalEncoder().fit_transform(data[categorical_columns])"
]
},
{
"cell_type": "markdown",
"id": "df9c84b4",
"metadata": {},
"source": [
"Split into `X` and `y`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "83beacfb",
"metadata": {},
"outputs": [],
"source": [
"X = data.drop(columns=\"LoanStatus\")\n",
"y = data.LoanStatus"
]
},
{
"cell_type": "markdown",
"id": "e25c8f24",
"metadata": {},
"source": [
"Normalize the data using the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from scikit-learn."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "9c567bb7",
"metadata": {},
"outputs": [],
"source": [
"X[X.columns] = StandardScaler().fit_transform(X[X.columns])"
]
},
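{
"cell_type": "markdown",
"id": "3f1a9c02",
"metadata": {},
"source": [
"As a reminder of what the scaling step above computes (a sketch of the default behaviour; $\\mu_j$ and $\\sigma_j$ are notation introduced here for the mean and standard deviation of feature $j$), `StandardScaler` replaces each value $x_{ij}$ by its z-score:\n",
"\n",
"$$z_{ij} = \\frac{x_{ij} - \\mu_j}{\\sigma_j}$$\n",
"\n",
"On the data it was fitted on, every non-constant feature then has zero mean and unit variance, so no feature dominates a distance computation merely because of its units."
]
},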
{
"cell_type": "markdown",
"id": "7437ea21",
"metadata": {},
"source": [
"Convert `y` to type `int`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c0db7c1f",
"metadata": {},
"outputs": [],
"source": [
"y = y.astype(int)"
]
},
{
"cell_type": "markdown",
"id": "6d1d1f10",
"metadata": {},
"source": [
"Split dataset into train and test sets."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "b05be2cc",
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)"
]
},
{
"cell_type": "markdown",
"id": "8f6d3ce6",
"metadata": {},
"source": [
"## b. Dummy classifier"
]
},
{
"cell_type": "markdown",
"id": "80ec4058",
"metadata": {},
"source": [
"Build a dummy classifier that makes decisions randomly."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "30919672",
"metadata": {},
"outputs": [],
"source": [
"class DummyClassifier():\n",
"    \n",
"    def __init__(self):\n",
"        \"\"\"\n",
"        Initialize the class.\n",
"        \"\"\"\n",
"        pass\n",
"    \n",
"    def fit(self, X, y):\n",
"        \"\"\"\n",
"        Fit the dummy classifier.\n",
"        \n",
"        Parameters\n",
"        ----------\n",
"        X : NumPy array or Pandas DataFrame of shape (n_samples, n_features)\n",
"            Training data.\n",
"        y : NumPy array or Pandas Series of shape (n_samples,)\n",
"            Target values.\n",
"        \"\"\"\n",
"        pass\n",
"    \n",
"    def predict(self, X):\n",
"        \"\"\"\n",
"        Predict the class labels for the provided data.\n",
"\n",
"        Parameters\n",
"        ----------\n",
"        X : NumPy array or Pandas DataFrame of shape (n_queries, n_features)\n",
"            Test samples.\n",
"\n",
"        Returns\n",
"        -------\n",
"        y : NumPy array or Pandas Series of shape (n_queries,)\n",
"            Class labels for each data sample.\n",
"        \"\"\"\n",
"        pass"
]
},
{
"cell_type": "markdown",
"id": "1dd67c48",
"metadata": {},
"source": [
"Implement a function that evaluates the performance of a classifier by computing its accuracy ($N_{correct}/N$)."
]
},
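{
"cell_type": "markdown",
"id": "b7e4d815",
"metadata": {},
"source": [
"A small worked example of this formula, using made-up labels rather than the loan data: if $y_{true} = (1, 0, 1, 1)$ and $y_{pred} = (1, 1, 1, 0)$, the predictions agree at two of the four positions, so\n",
"\n",
"$$\\text{accuracy} = \\frac{N_{correct}}{N} = \\frac{2}{4} = 0.5$$"
]
},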
{
"cell_type": "code",
"execution_count": 11,
"id": "184f3905",
"metadata": {},
"outputs": [],
"source": [
"def accuracy_score(y_true, y_pred):\n",
"    pass"
]
},
{
"cell_type": "markdown",
"id": "90dcae17",
"metadata": {},
"source": [
"Compute the performance of the dummy classifier using the provided test set."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa666b66",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "9e10cd97",
"metadata": {},
"source": [
"## c. K-Nearest Neighbors classifier"
]
},
{
"cell_type": "markdown",
"id": "70009457",
"metadata": {},
"source": [
"Build a K-Nearest Neighbors classifier using a Euclidean distance computation and a simple majority voting criterion."
]
},
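{
"cell_type": "markdown",
"id": "c92d06af",
"metadata": {},
"source": [
"For reference, the two ingredients named above can be written as follows (the symbols $a$, $b$, $n$, $\\mathcal{N}_k(x)$ and $\\hat{y}$ are notation introduced here, not taken from the exercise). The Euclidean distance between two samples $a$ and $b$ with $n$ features is\n",
"\n",
"$$d(a, b) = \\sqrt{\\sum_{i=1}^{n} (a_i - b_i)^2}$$\n",
"\n",
"and simple majority voting assigns to a query $x$ the most frequent label among its $k$ nearest training samples $\\mathcal{N}_k(x)$:\n",
"\n",
"$$\\hat{y}(x) = \\arg\\max_{c} \\sum_{j \\in \\mathcal{N}_k(x)} \\mathbb{1}[y_j = c]$$"
]
},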
{
"cell_type": "code",
"execution_count": 12,
"id": "759e924e",
"metadata": {},
"outputs": [],
"source": [
"class KNNClassifier():\n",
"    \n",
"    def __init__(self, n_neighbors=3):\n",
"        \"\"\"\n",
"        Initialize the class.\n",
"        \n",
"        Parameters\n",
"        ----------\n",
"        n_neighbors : int, default=3\n",
"            Number of neighbors to use by default.\n",
"        \"\"\"\n",
"        pass\n",
"    \n",
"    def fit(self, X, y):\n",
"        \"\"\"\n",
"        Fit the k-nearest neighbors classifier.\n",
"        \n",
"        Parameters\n",
"        ----------\n",
"        X : NumPy array or Pandas DataFrame of shape (n_samples, n_features)\n",
"            Training data.\n",
"        y : NumPy array or Pandas Series of shape (n_samples,)\n",
"            Target values.\n",
"        \"\"\"\n",
"        pass\n",
"    \n",
"    @staticmethod\n",
"    def _euclidian_distance(a, b):\n",
"        \"\"\"\n",
"        Utility function to compute the Euclidean distance.\n",
"        \n",
"        Parameters\n",
"        ----------\n",
"        a : NumPy array or Pandas DataFrame\n",
"            First operand.\n",
"        b : NumPy array or Pandas DataFrame\n",
"            Second operand.\n",
"        \"\"\"\n",
"        pass\n",
"    \n",
"    def predict(self, X):\n",
"        \"\"\"\n",
"        Predict the class labels for the provided data.\n",
"\n",
"        Parameters\n",
"        ----------\n",
"        X : NumPy array or Pandas DataFrame of shape (n_queries, n_features)\n",
"            Test samples.\n",
"\n",
"        Returns\n",
"        -------\n",
"        y : NumPy array or Pandas Series of shape (n_queries,)\n",
"            Class labels for each data sample.\n",
"        \"\"\"\n",
"        pass"
]
},
{
"cell_type": "markdown",
"id": "6c2b4811",
"metadata": {},
"source": [
"Compute the performance of the system as a function of $k = 1, \\dots, 7$."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cf589e66",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "71c51f35",
"metadata": {},
"source": [
"Run the KNN algorithm using only the features `TotalIncome` and `CreditHistory`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f6f262b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "e2b1a682",
"metadata": {},
"source": [
"Re-run the KNN algorithm using the features `TotalIncome`, `CreditHistory` and `Married`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0bda7ee",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "2724167a",
"metadata": {},
"source": [
"Re-run the KNN algorithm using all features."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46ec9699",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "648aa52e",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}