# Done by Aviolat Charline, Bach Joachim and Marino Gabriel

# Exercise 3 - Review questions

**a) Assuming a univariate input *x*, what is the complexity at inference time of a Bayesian classifier based on histogram computation of the likelihood?**

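At inference time, the histograms (likelihoods) and the priors were already computed during the training phase, so nothing has to be re-estimated. For a given x, we first find the bin it falls into, which is O(1) with uniform bins (or O(log nb_bins) with a binary search otherwise). Then, for each class, we look up the likelihood p(x|Ck) in that class's pre-computed histogram and multiply it by the a priori probability P(Ck): one lookup and one multiplication per class, so O(nb_class). The evidence p(x) is the same for every class and can be skipped when we only need the most probable class. Classifying nb_x inputs therefore costs O(nb_class * nb_x); the constant factor 2 (lookup + multiplication) disappears in big-O notation.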
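As a minimal sketch of where the O(nb_class) cost per input comes from, the snippet below classifies a scalar x with pre-computed histograms. The names `predict_histogram_bayes`, `hist_likelihoods` and `edges` are hypothetical, and the sketch assumes the bin edges are shared by all classes; the toy numbers are made up for illustration.

```python
import numpy as np

def predict_histogram_bayes(x, priors, hist_likelihoods, edges):
    """Classify one scalar x; hist_likelihoods[k][b] is p(bin b | class k)."""
    # Locate the bin of x once (binary search on the shared edges).
    b = np.searchsorted(edges, x, side="right") - 1
    b = int(np.clip(b, 0, len(edges) - 2))
    # One table lookup and one multiplication per class: O(nb_class).
    scores = [priors[k] * hist_likelihoods[k][b] for k in range(len(priors))]
    return int(np.argmax(scores))

# Toy model (made-up values): 2 classes, 3 bins on [0, 3].
edges = np.array([0.0, 1.0, 2.0, 3.0])
priors = np.array([0.6, 0.4])
hist_likelihoods = np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.4, 0.5]])

predictions = [predict_histogram_bayes(x, priors, hist_likelihoods, edges)
               for x in (0.5, 1.5, 2.5)]
print(predictions)
```

The loop over classes is the only part whose cost grows with the model, which is why the total for nb_x inputs is O(nb_class * nb_x).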

**b) Bayesian models are said to be generative as they can be used to generate new samples. Taking the implementation of the exercise 1.a, explain the steps to generate new samples using the system you have put into place.**
 

To generate a new sample, we need to create a y and an x. This can be done by first picking the class Ck randomly according to the a priori probabilities P(Ck). This means that, if there are two classes with P(C1) = 0.6 and P(C2) = 0.4, we pick C1 60% of the time and C2 40% of the time.

Then, we pick a random x according to the likelihood p(x|Ck) of the chosen class. With a histogram, this means picking a bin at random, weighted by the bin probabilities, and returning the x value of that bin. If there are two possible x values with probabilities 0.4 and 0.6, we will take x1 40% of the time and x2 60% of the time.

***Optional*: Provide an implementation in a function generateSample(priors, histValues, edgeValues, n)**

In [9]:
import random
import numpy as np

def generateSample(priors, histValues, edgeValues, n):
    # Generates n samples and returns them as a list of (x, class) pairs.
    # histValues[k] holds the bin counts of class k, edgeValues the bin edges.
    samples = []

    # Cumulative sums of the priors split [0, 1] into intervals. The size of each interval
    # is the probability that the random number generator lands in it.
    cumulative_probs = np.cumsum(priors)

    for _ in range(n):
        # take a random number and see in which interval it falls. The index of this interval is the class we choose
        chosen_class = 0
        r = random.random()
        for i, cp in enumerate(cumulative_probs):
            if r < cp:
                chosen_class = i
                break

        # The same logic is used to find the new x value, weighted by p(x|c).
        # The histogram only holds the count of each x in the class, so instead of normalising it
        # we draw the random number between 0 and total_hist rather than between 0 and 1,
        # which does the same job in the end.
        total_hist = np.sum(histValues[chosen_class])
        cumulative_probs_hist = np.cumsum(histValues[chosen_class])

        # take a random number and see in which interval it falls. The index of this interval is the bin we choose
        chosen_x_index = 0
        r = random.uniform(0, total_hist)
        for i, cp in enumerate(cumulative_probs_hist):
            if r < cp:
                chosen_x_index = i
                break

        # the generated x is the left edge of the chosen bin
        chosen_x = edgeValues[chosen_x_index]
        samples.append((chosen_x, chosen_class))

    return samples
    


**c) What is the minimum overall accuracy of a 2-class system relying only on priors and that is built on a training set that includes 5 times more samples in class A than in class B?**

If we rely only on the priors, the posterior probability depends only on them. The system chooses the class with the highest posterior, which is the class with the highest prior because that is all it has, so it always chooses class A. If the test set has the same class distribution as the training set, always choosing A gives an accuracy of 5/6 (about 83.3%): all the A samples are correct and all the B samples are missed. If the test set is balanced, the accuracy drops to 50%. More generally, the accuracy equals the proportion of class A in the test set, so if the test set is unbalanced the other way, it falls below 50%. The absolute minimum therefore depends on how small the proportion of A can be in the test set.
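The arithmetic can be checked with a few lines of Python. This is only an illustrative sketch: `always_a_accuracy` is a hypothetical helper, and the test-set compositions are made-up examples of the three cases discussed above.

```python
def always_a_accuracy(n_a, n_b):
    """Accuracy of a classifier that always answers A on a test set
    containing n_a samples of class A and n_b samples of class B."""
    return n_a / (n_a + n_b)

same_as_training = always_a_accuracy(5, 1)  # same 5:1 ratio as the training set
balanced = always_a_accuracy(1, 1)          # balanced test set
reversed_ratio = always_a_accuracy(1, 5)    # test set unbalanced the other way

print(same_as_training, balanced, reversed_ratio)
```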

**d) Let's look back at the PW02 exercise 3 of last week. We have built a knn classification system for images of digits on the MNIST database.**

**How would you build a Bayesian classification for the same task ? Comment on the prior probabilities and on the likelihood estimators. More specifically, what kind of likelihood estimator could we use in this case ?**

The a priori probability is simply the proportion of each class in the training set.
The likelihood is the tricky part, because the model has to be multivariate (one feature per pixel, 784 in total), which makes it very complex. We could use the Naive Bayes assumption, which states that the features (pixels) are conditionally independent given the class, and then estimate the likelihood pixel by pixel and multiply the results. However, the pixels ARE NOT independent, because each pixel has a spatial position: if a pixel is white, there is a strong chance that some of its neighbours are white as well. Naive Bayes would technically still work-ish, but on a false assumption.

To model these dependencies properly, we would have to use something like a multivariate Gaussian distribution with a full covariance matrix per class.
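A minimal sketch of such a full-covariance Gaussian likelihood is given below, using NumPy only. The helper names `fit_gaussian` and `gaussian_log_likelihood` and the regularisation term `reg` are assumptions made for this illustration; on MNIST the covariance would be a 784x784 matrix per class and would need this kind of regularisation, since many pixels are constant. The demo data is synthetic, not MNIST.

```python
import numpy as np

def fit_gaussian(X, reg=1e-3):
    """Estimate the mean and a regularised full covariance from the rows of X."""
    mu = X.mean(axis=0)
    # reg * I keeps the covariance invertible (assumed value, to be tuned).
    cov = np.cov(X, rowvar=False) + reg * np.eye(X.shape[1])
    return mu, cov

def gaussian_log_likelihood(x, mu, cov):
    """Log-density of a multivariate Gaussian N(mu, cov) evaluated at x."""
    d = mu.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)         # log-determinant, numerically stable
    maha = diff @ np.linalg.solve(cov, diff)   # squared Mahalanobis distance
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

# Synthetic 3-dimensional data for two classes centred on 0 and on 4.
rng = np.random.default_rng(0)
X_a = rng.normal(loc=0.0, size=(200, 3))
X_b = rng.normal(loc=4.0, size=(200, 3))
mu_a, cov_a = fit_gaussian(X_a)
mu_b, cov_b = fit_gaussian(X_b)

x_new = np.array([4.0, 4.0, 4.0])
ll_a = gaussian_log_likelihood(x_new, mu_a, cov_a)
ll_b = gaussian_log_likelihood(x_new, mu_b, cov_b)
print(ll_a, ll_b)  # the point lies at the centre of class b
```

Adding the log-prior of each class to these log-likelihoods and taking the argmax would give the classification rule.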

***Optional:* implement it and report performance !**

In [10]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os


# This is a method to read the MNIST dataset from a ROOT directory
def load_MNIST(ROOT):
  '''load all of mnist
  training set first'''
  Xtr = []
  train = pd.read_csv(os.path.join(ROOT, 'mnist_train.csv'))
  X = np.array(train.drop('label', axis=1))
  Ytr = np.array(train['label'])
  # With this for-loop we give the data the shape of the actual image (28x28)
  # instead of the shape in file (1x784)
  for row in X:
      Xtr.append(row.reshape(28,28))
  # load test set second
  Xte = []
  test = pd.read_csv(os.path.join(ROOT, 'mnist_test.csv'))
  X = np.array(test.drop('label', axis=1))
  Yte = np.array(test['label'])
  # same reshaping
  for row in X:
      Xte.append(row.reshape(28,28))
  
  return np.array(Xtr), np.array(Ytr), np.array(Xte), np.array(Yte)

# Load the raw MNIST data.
mnist_dir = '' 
X_train, y_train, X_test, y_test = load_MNIST(mnist_dir)

# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
X_train = np.reshape(X_train, (X_train.shape[0], -1)) 
X_test = np.reshape(X_test, (X_test.shape[0], -1))    

print(X_train.shape, X_test.shape)
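def predict_gaussian(X_test, mu, sigma2, priors):
    # Gaussian Naive Bayes prediction: mu[c] and sigma2[c] hold the per-pixel mean and
    # variance of class c, priors[c] its a priori probability.
    n_samples = X_test.shape[0]
    y_pred = np.zeros(n_samples)

    K, n_pixels = mu.shape

    for idx in range(n_samples):
        x = X_test[idx]
        # log-posterior of each class, up to the constant log p(x)
        proba_classes = np.zeros(K)

        for c in range(K):
            # per-pixel Gaussian log-density; working in logs avoids underflow over 784 pixels
            log_likelihood = -0.5 * np.log(2 * np.pi * sigma2[c]) - ((x - mu[c])**2) / (2 * sigma2[c])
            # pixels are assumed independent given the class, so their log-likelihoods add up
            proba_classes[c] = np.log(priors[c]) + np.sum(log_likelihood)

        # keep the class with the highest log-posterior
        y_pred[idx] = np.argmax(proba_classes)

    return y_pred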
classes = np.unique(y_train)
priors = np.array([np.mean(y_train == c) for c in classes])

n_pixels = X_train.shape[1]

mu = np.zeros((len(classes), n_pixels))
sigma2 = np.zeros((len(classes), n_pixels))

for c in classes:
    X_c = X_train[y_train == c]
    mu[c, :] = X_c.mean(axis=0)
    sigma2[c, :] = X_c.var(axis=0) + 1e-5
    
y_pred = predict_gaussian(X_test, mu, sigma2, priors)
accuracy = np.mean(y_pred == y_test)

print("Accuracy score :", accuracy)

Training data shape:  (10000, 28, 28)
Training labels shape:  (10000,)
Test data shape:  (10000, 28, 28)
Test labels shape:  (10000,)
(10000, 784) (10000, 784)
Accuracy score : 0.5711


The 0.57 accuracy observed here suggests that this method is not well suited to this type of problem: when each pixel is a feature, the number of dimensions becomes far too big. It may also be caused by the fact that the pixels are not independent, contrary to what the per-pixel Gaussian model assumes.

**e) Read [europe-border-control-ai-lie-detector](https://theintercept.com/2019/07/26/europe-border-control-ai-lie-detector/). The described system is "a virtual policeman designed to strengthen European borders". It can be seen as a 2-class problem, either you are a suspicious traveler or you are not. If you are declared as suspicious by the system, you are routed to a human border agent who analyses your case in a more careful way.**

1. What kind of errors can the system make ? Explain them in your own words.
2. Is one error more critical than the other ? Explain why.
3. According to the previous points, which metric would you recommend to tune your ML system?

1. The system can make false positives or false negatives: it can flag an innocent traveler as suspicious, or it can let a dangerous person cross the border as if they were safe.
2. Yes, a false negative is the most critical one. With a false positive, the only consequence is lost time, because a human agent has to interrogate the "suspect", maybe resulting in an angry traveler. A false negative, on the other hand, means a real threat has entered the country undetected, which could have far worse consequences than an angry traveler.
3. In this case, we could use the ROC curve and its Area Under the Curve. The curve shows the trade-off for every decision threshold, which lets us tune the threshold so that the system leans towards false positives rather than false negatives, that is, towards a high recall on the suspicious class.
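To illustrate the threshold trade-off described in point 3, here is a small sketch that counts both error types at two thresholds. The labels and scores are made-up values (1 = suspicious), and `confusion_counts` is a hypothetical helper, not part of the system described in the article.

```python
import numpy as np

def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when flagging every score >= threshold as suspicious."""
    y_pred = scores >= threshold
    tp = int(np.sum(y_pred & (y_true == 1)))
    fp = int(np.sum(y_pred & (y_true == 0)))
    fn = int(np.sum(~y_pred & (y_true == 1)))
    tn = int(np.sum(~y_pred & (y_true == 0)))
    return tp, fp, fn, tn

# Made-up ground truth and classifier scores for 8 travelers.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.6, 0.35, 0.7, 0.4, 0.32, 0.2, 0.1])

results = {}
for threshold in (0.5, 0.3):
    tp, fp, fn, tn = confusion_counts(y_true, scores, threshold)
    results[threshold] = {
        "recall": tp / (tp + fn),
        "false_positives": fp,
        "false_negatives": fn,
    }
    print(threshold, results[threshold])
```

Lowering the threshold removes the false negative at the price of more false positives, which is the direction the answer above argues for.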

**f) When a deep learning architecture is trained using an unbalanced training set, we usually observe a problem of bias, i.e. the system favors one class over another one. Using the Bayes equation, explain what is the origin of the problem.**

The Bayes equation: P(Ck|x) = (p(x|Ck) * P(Ck)) / p(x).

The a priori probability P(Ck) is what represents the imbalance in the training set. This value is the probability of observing the class, so it is proportional to the number of samples in it. This means that a class with 3x more data will have a 3x bigger P(Ck), which pushes the posterior, and therefore the decision, in favor of the class with the biggest P(Ck).
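A small numerical sketch makes the effect visible. The likelihood values are made up, and `posterior` is a hypothetical helper that applies the Bayes equation to two classes: the same observation x is classified differently depending on the priors alone.

```python
import numpy as np

def posterior(likelihoods, priors):
    """Apply the Bayes equation: P(Ck|x) = p(x|Ck) * P(Ck) / p(x)."""
    joint = np.asarray(likelihoods) * np.asarray(priors)
    return joint / joint.sum()  # the sum over classes is the evidence p(x)

# Made-up likelihoods p(x|C1), p(x|C2): the observation slightly favours C2.
likelihoods = [0.4, 0.5]

balanced = posterior(likelihoods, [0.5, 0.5])
unbalanced = posterior(likelihoods, [0.75, 0.25])  # C1 has 3x more samples

print(balanced, unbalanced)
```

With balanced priors the decision follows the likelihood and picks C2; with the 3:1 priors the decision flips to C1 even though the observation itself has not changed.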