334 lines
7.2 KiB
Plaintext
334 lines
7.2 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"collapsed": true
|
||
},
|
||
"source": [
|
||
"## Exercise 1 - Bayes classification system"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Import some useful libraries\n",
|
||
"\n",
|
||
"import math\n",
|
||
"\n",
|
||
"import matplotlib.pyplot as plt\n",
|
||
"import numpy as np\n",
|
||
"import pandas as pd\n",
|
||
"from sklearn.model_selection import train_test_split\n",
|
||
"from sklearn.preprocessing import OrdinalEncoder, StandardScaler"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 1a. Getting started with Bayes"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"a) Read the training data from file ex1-data-train.csv. The first two columns are x1 and x2. The last column holds the class label y."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"pycharm": {
|
||
"is_executing": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def read_data(file):\n",
|
||
" dataset = pd.read_csv(file, names=['x1','x2','y'])\n",
|
||
" print(dataset.head())\n",
|
||
" return dataset[[\"x1\", \"x2\"]], dataset[\"y\"].values"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"X_train, y_train = read_data(\"ex1-data-train.csv\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Prepare a function to compute accuracy\n",
|
||
"def accuracy_score(y_true, y_pred):\n",
|
||
" return (y_true == y_pred).sum() / y_true.size"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"b) Compute the priors of both classes P(C0) and P(C1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"pycharm": {
|
||
"is_executing": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# TODO: Compute the priors\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"c) Compute histograms of x1 and x2 for each class (total of 4 histograms). Plot these histograms. Advice : use the numpy `histogram(a, bins=\"auto\")` function."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"pycharm": {
|
||
"is_executing": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# TODO: Compute histograms\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"# TODO: plot histograms\n",
|
||
"\n",
|
||
"plt.figure(figsize=(16,6))\n",
|
||
"\n",
|
||
"plt.subplot(1, 2, 1)\n",
|
||
"...\n",
|
||
"plt.xlabel('Likelihood hist - Exam 1')\n",
|
||
"\n",
|
||
"plt.subplot(1, 2, 2)\n",
|
||
"...\n",
|
||
"plt.xlabel('Likelihood hist - Exam 2')\n",
|
||
"\n",
|
||
"plt.show()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"d) Use the histograms to compute the likelihoods p(x1|C0), p(x1|C1), p(x2|C0) and p(x2|C1). For this define a function `likelihood_hist(x, hist_values, edge_values)` that returns the likelihood of x for a given histogram (defined by its values and bin edges as returned by the numpy `histogram()` function)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"pycharm": {
|
||
"is_executing": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def likelihood_hist(x: float, hist_values: np.ndarray, bin_edges: np.ndarray) -> float:\n",
|
||
" # TODO: compute likelihoods from histograms outputs\n",
|
||
"\n",
|
||
" return ..."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"e) Implement the classification decision according to Bayes rule and compute the overall accuracy of the system on the test set ex1-data-test.csv. :\n",
|
||
"- using only feature x1\n",
|
||
"- using only feature x2\n",
|
||
"- using x1 and x2 making the naive Bayes hypothesis of feature independence, i.e. p(X|Ck) = p(x1|Ck) · p(x2|Ck)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"pycharm": {
|
||
"is_executing": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"X_test, y_test = read_data(\"ex1-data-test.csv\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"pycharm": {
|
||
"is_executing": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# TODO: predict on test set in the 3 cases described above\n",
|
||
"\n",
|
||
"y_pred = []\n",
|
||
"\n",
|
||
"...\n",
|
||
"\n",
|
||
"accuracy_score(y_test, y_pred)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Which system is the best ?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"TODO: answer"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 1b. Bayes - Univariate Gaussian distribution"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Do the same as in a) but this time using univariate Gaussian distribution to model the likelihoods p(x1|C0), p(x1|C1), p(x2|C0) and p(x2|C1). You may use the numpy functions `mean()` and `var()` to compute the mean μ and variance σ2 of the distribution. To model the likelihood of both features, you may also do the naive Bayes hypothesis of feature independence, i.e. p(X|Ck) = p(x1|Ck) · p(x2|Ck).\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"pycharm": {
|
||
"is_executing": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def likelihood_univariate_gaussian(x: float, mean: float, var: float) -> float:\n",
|
||
" # TODO: compute likelihoods from histograms outputs\n",
|
||
"\n",
|
||
" return ..."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"pycharm": {
|
||
"is_executing": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# TODO: Compute mean and variance for each classes and each features (8 values)\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# TODO: predict on test set in the 3 cases\n",
|
||
"\n",
|
||
"y_pred = []\n",
|
||
"\n",
|
||
"...\n",
|
||
"\n",
|
||
"accuracy_score(y_test, y_pred)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.8.7"
|
||
},
|
||
"pycharm": {
|
||
"stem_cell": {
|
||
"cell_type": "raw",
|
||
"metadata": {
|
||
"collapsed": false
|
||
},
|
||
"source": []
|
||
}
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 1
|
||
}
|