Lab 12 Distance Based Machine Learning

Posted on 12/07/2019 in Python Data Science

Module 12 Lab - Distance Based Machine Learning

In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import random
import patsy
import sklearn.neighbors as neighbors
import sklearn.linear_model as linear
import models

sns.set(style="whitegrid")

Problem 1. kNN Regression

Use k-Nearest Neighbors regression for the insurance data set. Make sure you do the following:

  1. Pick an appropriate evaluation metric.
  2. Validation curves to find the best value of k.
  3. Learning curves to see if we are high bias or high variance and suggest ways to improve the model.
  4. 10 fold cross validation to estimate the mean metric and its credible interval.
  5. Was this better than the best linear regression model you estimated in Lab 11? Use Bayesian statistical inference to generate and evaluate the posterior distribution of the difference of means.

We start by loading the data.

In [2]:
insurance = pd.read_csv( "insurance.csv", header=0)

Linear Regression Model

We're going to reverse the order a bit and complete the cross-validation of our linear regression model from the last Lab:

In [3]:
insurance = pd.concat([insurance, pd.get_dummies(insurance["sex"])], axis=1)
insurance = pd.concat([insurance, pd.get_dummies(insurance["region"])], axis=1)
insurance = pd.concat([insurance, pd.get_dummies(insurance["smoker"], prefix="smoke")], axis=1)
insurance["age_sq"] = insurance.age**2
insurance["bmi_above_30"] = insurance.bmi.apply(lambda bmi: 1 if bmi > 30.0 else 0)
In [4]:
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + smoke_yes:bmi_above_30 + children + male:children"
result = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result)
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/base.py:485: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
  linalg.lstsq(X, y)
Out[4]:
Linear Regression Results
Coefficients
$\theta$95% BCI
($\beta_0$)1976.96(937.24, 3137.72)
age_sq ($\beta_1$)3.34(3.18, 3.61)
male ($\beta_2$)-423.27(-1099.59, 89.65)
bmi ($\beta_3$)3.32(-32.25, 37.29)
smoke_yes ($\beta_4$)1590.26(-1805.07, 5593.70)
smoke_yes:bmi ($\beta_5$)466.06(310.53, 589.28)
smoke_yes:bmi_above_30 ($\beta_6$)15211.86(13762.55, 16791.31)
children ($\beta_7$)695.83(460.49, 965.33)
male:children ($\beta_8$)-85.88(-468.79, 280.57)
Metrics
$\sigma$4380.03(3977.57, 4693.10)
$R^2$0.87(0.85, 0.90)
In [5]:
def chunk(xs, n):
    k, m = divmod(len(xs), n)
    return [xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def cross_validation(algorithm, formula, data, evaluate, fold_count=10, repetitions=3):
    indices = list(range(len( data)))
    metrics = []
    for _ in range(repetitions):
        random.shuffle(indices)
        folds = chunk(indices, fold_count)
        for fold in folds:
            test_data = data.iloc[fold]
            train_indices = [idx not in fold for idx in indices]
            train_data = data.iloc[train_indices]
            result = algorithm(formula, data=train_data)
            model = result["model"]
            y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
            # y = np.ravel( y) # might need for logistic regression
            results = models.summarize(formula, X, y, model)
            metric = evaluate(results)
            metrics.append(metric)
    return metrics

Let's do our 3 rounds of 10 fold cross validation:

In [6]:
formula = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + smoke_yes:bmi_above_30 + children + male:children"
lr_result = cross_validation(models.linear_regression, formula, insurance, lambda r: (r["sigma"], r["r_squared"]))

Evaluation Metric

MSE can be difficult to interpret although it is definitely something you need to take in account when building regression models. Not all models with the same $R^2$ will have the same MSE. The latter is model dependent. Still we're look at $R^2$. Another thing to remember as we get into validation curves, is that, unlike MSE, bigger $R^2$ is better.

In [7]:
print(r"95% CI for R^2:", stats.mstats.mquantiles([r[1] for r in lr_result], [0.025, 0.975]))
95% CI for R^2: [0.75904357 0.94945157]

k Nearest Neighbors

This is what we want to beat with our kNN model. First, we need to scale all of our numeric variables which are bmi and age. k Nearest Neighbors is very sensitive to differences in magnitude, perhaps even more than linear regression.

In [8]:
from sklearn.preprocessing import StandardScaler
In [9]:
scaler = StandardScaler()
In [10]:
columns = ["bmi", "age"]
insurance[columns] = scaler.fit_transform(insurance[columns])
/usr/local/lib/python3.7/site-packages/sklearn/preprocessing/data.py:617: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
/usr/local/lib/python3.7/site-packages/sklearn/base.py:462: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)

Here is a special function for cross validation for kNN:

In [11]:
def knn_cross_validation(formula, builder, data, fold_count=10, repetitions=3):
    indices = list(range(len( data)))
    metrics = {"train": [], "test": []}
    for _ in range(repetitions):
        random.shuffle(indices)
        folds = chunk(indices, fold_count)
        for fold in folds:
            test_data = data.iloc[fold]
            train_indices = [idx not in fold for idx in indices]
            train_data = data.iloc[train_indices]
            # y, X for training data
            y, X = patsy.dmatrices(formula, train_data, return_type="matrix")
            model = builder.fit(X, y)
            y_hat = model.predict(X)
            training_r_squared = (stats.pearsonr(y, y_hat)[0][0])**2
            metrics["train"].append(training_r_squared)
            # y, X for training data
            y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
            y_hat = model.predict(X)
            test_r_squared = (stats.pearsonr(y, y_hat)[0][0])**2
            metrics["test"].append(test_r_squared)
    return metrics

Finding the best k using Validation Curves

We're going to try a vanilla "all in" model with kNN without any transformations. kNN doesn't use the values of $X$ to estimate $y$...the estimate of $y$ comes directly from the data points. This means that transformations on $X$ aren't quite as important.

Our first goal is to find the best k for our kNN model using validation curves:

In [12]:
formula = "charges ~ 0 + bmi + age + male + smoke_yes + children"
In [13]:
test_curve = []
train_curve = []
for k in range(1, 21):
    builder = neighbors.KNeighborsRegressor(k)
    results = knn_cross_validation(formula, builder, insurance)
    test_curve.append(stats.mstats.mquantiles(results['test'], [0.025, 0.5, 0.975]))
    train_curve.append(stats.mstats.mquantiles(results['train'], [0.025, 0.5, 0.975]))
    
In [14]:
xs = list(range(1, 21))
figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.
axes = figure.add_subplot(1, 1, 1)

test_lower, test_mid, test_upper = zip(*test_curve)
train_lower, train_mid, train_upper = zip(*train_curve)

axes.fill_between(xs, test_lower, test_upper, alpha=0.25, color="firebrick", label="test")
axes.plot(xs, test_mid, color="firebrick")

axes.fill_between(xs, train_lower, train_upper, alpha=0.25, color="cornflowerblue", label="train")
axes.plot(xs, train_mid, color="cornflowerblue")

axes.set_xticks(xs)
axes.set_xticklabels([str(x) for x in xs])

axes.legend()
axes.set_title("Validation Curves for kNN")
axes.set_xlabel("k")
axes.set_ylabel("$R^2$")

plt.show()
plt.close()

It appears that k=1 is the best value for this model (You may have gotten a different result if you use different variables or did not mean scale age and bmi).

[Notice that I changed the title and axis lables on the chart]

Learning Curves

The next step is to plot learning curves to see if and how we can improve the model.

In [15]:
from collections import defaultdict
In [16]:
def data_collection():
    result = dict()
    result[ "train"] = defaultdict( list)
    result[ "test"] = defaultdict( list)
    return result
In [17]:
def knn_learning_curves(builder, formula, data, fold_count=10, repetitions=3, increment=1):
    indices = list(range(len( data)))
    results = data_collection()
    for _ in range(repetitions):
        random.shuffle(indices)
        folds = chunk(indices, fold_count)
        for fold in folds:
            test_data = data.iloc[ fold]
            train_indices = [idx for idx in indices if idx not in fold]
            train_data = data.iloc[train_indices]
            for i in list(range(increment, 100, increment)) + [100]: # ensures 100% is always picked.
                # the indices are already shuffled so we only need to take ever increasing chunks
                train_chunk_size = int( np.ceil((i/100)*len( train_indices)))
                train_data_chunk = data.iloc[train_indices[0:train_chunk_size]]
                # we calculate the model
                y, X = patsy.dmatrices(formula, train_data_chunk, return_type="matrix")
                model = builder.fit(X, y)
                y_hat = model.predict(X)
                training_r_squared = (stats.pearsonr(y, y_hat)[0][0])**2
                results["train"][i].append(training_r_squared)
                
                # y, X for test data
                y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
                y_hat = model.predict(X)
                test_r_squared = (stats.pearsonr(y, y_hat)[0][0])**2
                results["test"][i].append(test_r_squared)
    # process results
    # Rely on the CLT...
    statistics = {}
    for k, v in results["train"].items():
        statistics[ k] = (np.mean(v), np.std(v))
    results["train"] = statistics
    statistics = {}
    for k, v in results["test"].items():
        statistics[ k] = (np.mean(v), np.std(v))
    results["test"] = statistics
    return results
#
In [18]:
result = knn_learning_curves(neighbors.KNeighborsRegressor(1), formula, insurance)
In [19]:
def results_to_curves( curve, results):
    all_statistics = results[ curve]
    keys = list( all_statistics.keys())
    keys.sort()
    mean = []
    upper = []
    lower = []
    for k in keys:
        m, s = all_statistics[ k]
        mean.append( m)
        upper.append( m + 2 * s)
        lower.append( m - 2 * s)
    return keys, lower, mean, upper
In [20]:
def plot_learning_curves( results, metric, zoom=False):
    figure = plt.figure(figsize=(10,6))

    axes = figure.add_subplot(1, 1, 1)

    xs, train_lower, train_mean, train_upper = results_to_curves( "train", results)
    _, test_lower, test_mean, test_upper = results_to_curves( "test", results)

    axes.plot( xs, train_mean, color="steelblue")
    axes.fill_between( xs, train_upper, train_lower, color="steelblue", alpha=0.25, label="train")
    axes.plot( xs, test_mean, color="firebrick")
    axes.fill_between( xs, test_upper, test_lower, color="firebrick", alpha=0.25, label="test")
    axes.legend()
    axes.set_xlabel( "training set (%)")
    axes.set_ylabel( metric)
    axes.set_title("Learning Curves")

    if zoom:
        y_lower = int( 0.9 * np.amin([train_lower[-1], test_lower[-1]]))
        y_upper = int( 1.1 * np.amax([train_upper[-1], test_upper[-1]]))
        axes.set_ylim((y_lower, y_upper))

    plt.show()
    plt.close()
In [21]:
plot_learning_curves(result, "$R^2$")

The unfortunate part of k=1 is that the "train" curve is a bit weird, in fact, it suggests we're overfitting our data. The test curve has converged.

(Again, you may have gotten different results, depending. One thing I have noticed is students saying that the curves have not curved even though their confidence/credible intervals overlap. This is a bit weird...statistically, we cannot say they haven't converged if their confidence/credible intervals overlap]

Cross Validation

In order to get a sense of the generalization error of our final model, we will perform 3 x 10 fold cross validation and look at the credible interval from the posterior distribution of $R^2$:

In [22]:
builder = neighbors.KNeighborsRegressor(1)
knn_results = knn_cross_validation(formula, builder, insurance)
In [23]:
print(r"95% CI for R^2:", stats.mstats.mquantiles(knn_results['test'], [0.025, 0.975]))
95% CI for R^2: [0.83623528 0.99982295]

Our 95% credible interval for the posterior distribution of $R^2$ is 84.3 to 99.9. This is a little off from our estimate using the mean and standard deviation when we built our learning curves.

Model Comparison

We have 30 $R^2$ estimates for both the linear regression and the kNN models for this problem. With that data, we can do separate bootstrap estimates of the posterior distributions of the mean $R^2$ of the models and the differences in the models.

In [24]:
def bootstrap_sample( data, f, n=100):
    result = []
    m = len( data)
    for _ in range( n):
        sample = np.random.choice( data, len(data), replace=True)
        r = f( sample)
        result.append( r)
    return np.array( result)
In [25]:
knn_bootstrap = bootstrap_sample(knn_results['test'], np.mean)
lr_bootstrap = bootstrap_sample([r[1] for r in lr_result], np.mean)
difference = knn_bootstrap - lr_bootstrap
In [26]:
figure = plt.figure(figsize=(20, 6)) # first element is width, second is height.

axes = figure.add_subplot(1, 3, 1)

axes.hist(knn_bootstrap, density=True)
axes.set_ylabel( "Density")
axes.set_xlabel( "$R^2$")
axes.set_title( "Posterior Distribution of kNN $R^2$")

axes = figure.add_subplot(1, 3, 2)

axes.hist(lr_bootstrap, density=True)
axes.set_ylabel( "Density")
axes.set_xlabel( "$R^2$")
axes.set_title( "Posterior Distribution of Linear Regression $R^2$")

axes = figure.add_subplot(1, 3, 3)

axes.hist( difference, density=True)
axes.set_ylabel( "Density")
axes.set_xlabel( "Difference")
axes.set_title( "Posterior Distribution of Difference in $R^2$")

plt.show()
plt.close()
In [27]:
print("P(kNN > LR)", np.mean(difference > 0))
P(kNN > LR) 1.0

This is basically a difference of means test. Here we can see that there is little evidence that that linear regression is better than kNN. Of course, we cannot really rule out that the linear regression is better but the evidence doesn't support that claim very well.

(again, you may get different results depending on your linear regression and kNN results; notice that the chart labels are changed.)

Problem 2. Clustering

Use k-Means Clustering on clustering problems of your own creation in two dimensions ($x_1$ and $x_2$). You should explore the following points:

  1. What if the data has no clusters (there are no hidden categorical variables)?
  2. Now assume that you have some "hidden" categorical variable and the clusters are compact and distinct as well as having the same variance? What does the Elbow Method show for the k you should use?
  3. Now assume that you have some "hidden" categorical variable and the clusters are disperse? Different variances? What does the Elbow Method show for the k you should use?

There were a number of goals here:

  1. Here you see that k Means will find clusters even if there aren't any.
  2. If the clusters have some actual support from the domain and the clusters are very distinct, k Means will find them.
  3. However, if the clusters are not compact or well separated, k Means will have trouble separating points at the edges. Clustering is not classification.

Let's have a look:

No Real Clusters

In [28]:
from sklearn.cluster import KMeans
In [29]:
np.random.seed(87475)

xs = np.random.uniform(0, 100, size=(100, 2))

figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.
axes = figure.add_subplot(1, 1, 1)
axes.scatter(xs[:,0],xs[:,1])

plt.show()
plt.close()
In [30]:
kmeans = KMeans(n_clusters=2)  
kmeans.fit(xs)

figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.
axes = figure.add_subplot(1, 1, 1)
axes.scatter(xs[:,0],xs[:,1], c=kmeans.labels_, cmap='rainbow')

plt.show()
plt.close()

First lesson successfully learned! (The rainbow color map is probably just fine here)

Clearly Distinct Clusters

In [31]:
np.random.seed(776472)

xs1 = np.random.normal(25, 10, size=(50, 2))
xs2 = np.random.normal(75, 10, size=(50, 2))

xs = np.concatenate((xs1, xs2), axis=0)

plt.scatter(xs[:,0],xs[:,1])

plt.show()
plt.close()
In [32]:
figure = plt.figure(figsize=(10, 10)) # first element is width, second is height.

sse = {}
for k in range(1, 7):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(xs)
    sse[k] = kmeans.inertia_ # Inertia: Sum of distances of samples to their closest cluster center
    axes = figure.add_subplot(2, 3, k)
    axes.scatter(xs[:,0],xs[:,1], c=kmeans.labels_, cmap='rainbow')

plt.show()
plt.close()

figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.plot(list(sse.keys()), list(sse.values()))
axes.set_xlabel("Number of clusters")
axes.set_ylabel("SSE")

plt.show()
plt.close()

The "elbow" method seems to have identified the correct number of clusters here. Lesson #2 learned.

Indistinct clusters

In [33]:
np.random.seed(776472)

xs1 = np.random.normal(25, 20, size=(50, 2))
xs2 = np.random.normal(75, 20, size=(50, 2))

xs = np.concatenate((xs1, xs2), axis=0)

plt.scatter(xs[:,0],xs[:,1])

plt.show()
plt.close()
In [34]:
figure = plt.figure(figsize=(10, 10)) # first element is width, second is height.

sse = {}
for k in range(1, 7):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(xs)
    sse[k] = kmeans.inertia_ # Inertia: Sum of distances of samples to their closest cluster center
    axes = figure.add_subplot(2, 3, k)
    axes.scatter(xs[:,0],xs[:,1], c=kmeans.labels_, cmap='rainbow')

plt.show()
plt.close()

figure = plt.figure(figsize=(10, 6))
axes = figure.add_subplot(1, 1, 1)
axes.plot(list(sse.keys()), list(sse.values()))
axes.set_xlabel("Number of clusters")
axes.set_ylabel("SSE")

plt.show()
plt.close()

This seems ok. What's the problem? Well, let's plot the original points but this time, we'll plot each "cluster" in a different color:

In [35]:
np.random.seed(776472)

xs1 = np.random.normal(25, 20, size=(50, 2))
xs2 = np.random.normal(75, 20, size=(50, 2))

xs = np.concatenate((xs1, xs2), axis=0)

plt.scatter(xs1[:,0],xs1[:,1], color="firebrick")
plt.scatter(xs2[:,0],xs2[:,1], color="cornflowerblue")

plt.show()
plt.close()

Clustering doesn't recover the "fuzzy" boundary between the two groups. This may be problematic, depending on what you want to do with such clusters but more importantly. How do you tell this is even happening?

The moral of the story is that clustering is not classification.