Lab 10 Linear Models
Posted on 12/05/2019 in Python Data Science
import warnings
warnings.filterwarnings('ignore')
Lab 10 - Linear Models - Solution
%matplotlib inline
Linear Regression
In a previous module (Lab 5), you performed EDA on the insurance data set. In this lab, you should build a linear regression model that estimates charges.
import numpy as np
import random as py_random
import numpy.random as np_random
import time
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats
sns.set(style="whitegrid")
Answer
(Obviously, this is not the solution but a possible solution.)
Let's review the EDA from Lab 5. What did we learn?
The data contains a number of variables. We have age, sex, BMI, number of children, smoking, region, and the target variable, charges.
- Age - Uniform distribution
- Sex - Bernoulli distribution, 50/50
- BMI - Normal distribution
- Children - Multinomial distribution, possibly geometric.
- Smoking - Bernoulli distribution, 20/80
- Region - Multinomial distribution
- Charges - Exponential distribution-ish (no zero charges).
For the features, there's nothing that particularly stands out as needing our attention. Sex and smoker will each need to be converted into a dummy variable, and region will need to be converted into four dummy variables. Let's load the data:
insurance = pd.read_csv("insurance.csv", header=0)
insurance.info()
We noted in our EDA that charges is not normally distributed. This may cause problems later but let's leave it for now.
figure = plt.figure(figsize=(20,6))
axes = figure.add_subplot(1,1,1)
axes.hist(insurance.charges)
axes.set_title("Charges Histogram")
plt.show()
plt.close()
Let's review our pairwise EDA with the target variable, charges:
- age - definite positive progression but banding suggests an interaction term.
- sex - possibly some higher charges for men.
- BMI - overall positive but banding suggests there are other factors.
- smoking - definitely a relationship.
We didn't look at children, but on the face of it the connection doesn't seem likely to be strong.
Since we've decided to use a linear model, it is much more important to look at the correlation coefficients between each variable and the target (and between the variables themselves), but first we need those dummy variables:
insurance = pd.concat([insurance, pd.get_dummies(insurance["sex"])], axis=1)
insurance = pd.concat([insurance, pd.get_dummies(insurance["region"])], axis=1)
insurance = pd.concat([insurance, pd.get_dummies(insurance["smoker"], prefix="smoke")], axis=1)
insurance.head()
And now we can calculate our correlation coefficients:
def correlations(data, y, xs):
    # Pearson's r and Spearman's rho between the target y and each feature in xs.
    rs = []
    rhos = []
    for x in xs:
        r = stats.pearsonr(data[y], data[x])[0]
        rs.append(r)
        rho = stats.spearmanr(data[y], data[x])[0]
        rhos.append(rho)
    return pd.DataFrame({"feature": xs, "r": rs, "rho": rhos})
correlations(insurance, "charges", ["age", "male", "female",
"bmi", "smoke_yes", "children",
"northwest", "southwest", "southeast"])
There are some interesting results here:
- age - as we saw in the EDA, there is a positive relationship (30%), but the larger Spearman's suggests that the relationship is non-linear (see the short illustration after this list).
- male - not a huge relationship.
- female - exactly the same relationship as male but with the opposite sign. I included it to demonstrate why it's not necessary to include both. I'm keeping the sex dummy variable male because our EDA suggested that there was some relationship there.
- bmi - not a huge relationship, and the Spearman's doesn't suggest that this is because it's non-linear.
- smoking - definite relationship. We already knew this.
- children - not a huge relationship but the larger Spearman's indicates the relationship might be non-linear.
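As a quick aside, here's a minimal illustration of that Pearson/Spearman point with made-up data (not the insurance data): a strictly monotonic but non-linear relationship gets a Spearman's rho of 1 but a Pearson's r well below 1.
# Made-up data: y is a strictly monotonic but very non-linear function of x.
np_random.seed(0)
x = np_random.uniform(0, 10, 500)
y = np.exp(x)
print(stats.pearsonr(x, y)[0])   # linear association: well below 1
print(stats.spearmanr(x, y)[0])  # rank association: exactly 1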
The southeast variable looks at least as promising as male, but we'll hold off on the regional variables for now. There's good reason to believe that any regional differences we find are really the result of how the other effects are distributed across the regions.
Before estimating any linear models, let's look at the Null model (the mean):
insurance.charges.describe()
The lowest charge is \$1,121 and the highest is \$63,770. The mean is \$13,270 with a standard deviation of \$12,110. You can think of \$12,110 as the $\sigma$ of the Null model.
(This is somewhat artificial...modeling like this is likely to follow directly after the EDA)
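To make the Null model concrete: it predicts the mean for everyone, so its root-mean-squared error is (up to the $n$ versus $n-1$ divisor) just the standard deviation above.
# The Null model predicts the mean charge for everyone; its RMSE is
# essentially the standard deviation of charges.
null_prediction = insurance.charges.mean()
null_rmse = np.sqrt(((insurance.charges - null_prediction) ** 2).mean())
print(null_prediction, null_rmse)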
Let's try the "all in" model. We have domain knowledge and some statistics to support the inclusion of all the variables (except the regions):
import models
model = "charges ~ age + male + bmi + smoke_yes + children"
result1 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result1)
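(models is a course-provided helper module. If you don't have it, here's a rough sketch of the idea behind bootstrap_linear_regression — fit an OLS model, then bootstrap the coefficients. This is an assumption about what it does, built on statsmodels; bootstrap_lr_sketch is a hypothetical stand-in, not the module's actual code.)
import statsmodels.formula.api as smf

def bootstrap_lr_sketch(formula, data, n_boot=1000):
    # Fit once on the full data for the point estimates...
    fit = smf.ols(formula, data=data).fit()
    # ...then refit on bootstrap resamples to get coefficient distributions.
    params = [
        smf.ols(formula, data=data.sample(len(data), replace=True)).fit().params
        for _ in range(n_boot)
    ]
    boot = pd.DataFrame(params)
    # 95% percentile intervals around each coefficient.
    return pd.DataFrame({
        "coef": fit.params,
        "2.5%": boot.quantile(0.025),
        "97.5%": boot.quantile(0.975),
    })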
The correlation coefficient is 75%, which is pretty good. The error has been cut in half compared to the Null model, down to \$6,070. But we have already seen that charges is not normally distributed, so the overall error is probably not a good measure to rely on. What about the coefficients?
- The intercept isn't particularly interesting here because there are no childless, non-smoking females with zero age and BMI.
- age is positive and strongly supported by the data.
- male is negative (surprise!) and has mixed support in the data.
- bmi is positive and strongly supported by the data.
- smoke_yes is positive and strongly supported by the data.
- children is positive (surprise!) and strongly supported by the data.
With our first model done, here's what we think we know:
- Charges is not normally distributed so a transformation is probably in order.
- We could improve interpretability if we mean-centered and possibly mean-scaled the data (see the sketch after this list).
- male and children are surprises and it's not clear why for now.
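For the second point, mean-centering is a one-liner in pandas; with centered features, the intercept becomes the predicted charge for a person of average age, BMI, and number of children. A sketch (we don't actually apply it below):
# Mean-center the numeric features; optionally divide by the standard
# deviation as well to mean-scale them.
numeric = ["age", "bmi", "children"]
centered = insurance.copy()
centered[numeric] = centered[numeric] - centered[numeric].mean()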
Let's look at the residuals:
def plot_residuals(result, variables):
    # Plot the model's residuals ordered by each variable in `variables`.
    # Note: reads the global `insurance` dataframe for the variable values.
    figure = plt.figure(figsize=(20, 6))
    plots = len(variables)
    rows = (plots // 3) + 1
    residuals = np.array([r[0] for r in result["residuals"]])
    limits = max(np.abs(residuals.min()), residuals.max())
    n = result["n"]
    for i, variable in enumerate(variables):
        axes = figure.add_subplot(rows, 3, i + 1)
        # Sort the residuals by the value of the current variable.
        keyed_values = sorted(zip(insurance[variable].values, residuals), key=lambda x: x[0])
        ordered_residuals = [x[1] for x in keyed_values]
        axes.plot(list(range(0, n)), ordered_residuals, '.', color="dimgray", alpha=0.75)
        axes.axhline(y=0.0, c="firebrick", alpha=0.5)
        axes.set_ylim((-limits, limits))
        axes.set_ylabel("residuals")
        axes.set_xlabel(variable)
    plt.show()
    plt.close()
    return residuals
residuals1 = plot_residuals(result1, ["age", "bmi", "children"])
This is very interesting.
For age, we see three bands. We saw these bands in our EDA. They basically say, at every age, there is some information you are missing that is causing you to be systematically wrong. But they are also slightly curved in a way that suggests increasing returns just like in our discussion of residuals. Additionally, the errors are not symmetric. For every age, we are more likely to overestimate charges.
We see the same banding for children and the same systematic overestimation.
The most interesting result is for BMI. First, the middle band has an inflection that suggests a non-linearity. Second, there's an abrupt break in the error just past the 600th observation. Additionally, if you connected those two line segments, you'd have the opposite inflection from the middle band.
We mentioned before that charges is not normally distributed and that this might cause some problems. Let's see if we can address that issue.
insurance["log_charges"] = insurance["charges"].apply(np.log)
Let's see a histogram:
figure = plt.figure(figsize=(20,6))
axes = figure.add_subplot(1,1,1)
axes.hist(insurance.log_charges)
axes.set_title("Log(Charges) Histogram")
plt.show()
plt.close()
That looks better than before. It's not exactly normal but is fairly symmetric. Let's try to use this as our response variable:
model = "log_charges ~ age + male + bmi + smoke_yes + children"
result2 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result2)
residuals2 = plot_residuals(result2, ["age", "bmi", "children"])
Where did all the bands go? This doesn't look very symmetric either. I think we need to compare the histograms of charges versus residuals:
figure = plt.figure(figsize=(20,6))
axes = figure.add_subplot(2, 2, 1)
axes.hist(insurance.charges)
axes.set_title("Charges Histogram")
axes = figure.add_subplot(2, 2, 2)
axes.hist(residuals1)
axes.set_title("Residuals Histogram")
axes = figure.add_subplot(2, 2, 3)
axes.hist(insurance.log_charges)
axes.set_title("Log(Charges) Histogram")
axes = figure.add_subplot(2, 2, 4)
axes.hist(residuals2)
axes.set_title("Residuals Histogram")
plt.show()
plt.close()
Interesting indeed. Our errors are much more symmetric with the actual charges variable rather than the transformed charges variable.
This is a key point that is often misunderstood or even misstated: linear regression does not require your variables to be normally distributed. We only expect normally distributed errors (residuals) and then only for gauging the confidence in our estimates. Lacking such normality, we can easily fall back on Bootstrap estimates.
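To drive that point home, here's a tiny simulation with made-up data: a badly skewed predictor with normal errors still yields normal residuals under ordinary least squares.
# Made-up data: a heavily skewed (exponential) predictor, but normal errors.
np_random.seed(1)
x = np_random.exponential(2.0, 1000)
y = 3.0 + 2.0 * x + np_random.normal(0, 1.0, 1000)
slope, intercept = np.polyfit(x, y, 1)   # plain OLS fit
print(stats.normaltest(y - (intercept + slope * x)))  # residuals: consistent with normal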
Let's continue with the untransformed charges variable...
Unlike many modeling situations, there are no variables missing from the data that we could add, so we're left with:
- transforms
- interaction terms.
We have some ideas for transforms so let's start there. Age looks like it should be squared. This basically means that as you age, charges increase and at an increasing amount. Let's see if that's true:
insurance["age_sq"] = insurance.age**2
model = "charges ~ age_sq + male + bmi + smoke_yes + children"
result3 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result3)
That didn't change the $R^2$. What about the residuals?
residuals3 = plot_residuals(result3, ["age_sq", "bmi", "children"])
For comparison,
plot_residuals(result1, ["age", "bmi", "children"]); # the semicolon suppresses printing out the result.
So the curves in age are gone and the error ($\sigma$) went down slightly, but the artifacts in BMI are less dispersed with the new model. Let's see if we can fix that. What if there are decreasing returns to BMI?
insurance["bmi_sqrt"] = insurance.bmi.apply(np.sqrt)
model = "charges ~ age_sq + male + bmi_sqrt + smoke_yes + children"
result4 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result4)
residuals4 = plot_residuals(result4, ["age", "bmi", "children"])
Doesn't look like it...squared?
insurance["bmi_sq"] = insurance.bmi**2
model = "charges ~ age_sq + male + bmi_sq + smoke_yes + children"
result5 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result5)
residuals5 = plot_residuals(result5, ["age", "bmi", "children"])
We don't seem to be making much progress here. There's a non-linearity but we can't identify it. Let's switch for a bit and think of interaction terms.
There are a lot of potential interaction terms: male, age, bmi, smoke_yes; but we only have 1,336 observations, so we can't get overly specific. Let's try the big one: male and bmi.
model = "charges ~ age_sq + male + bmi + male:bmi + smoke_yes + children"
result6 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result6)
Wow, that seemed like a slam dunk but the $R^2$ didn't even move. Smoking and bmi?
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + children"
result7 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result7)
The $R^2$ went up a lot but what about the adjusted $R^2$?
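(For reference, the usual definition is adjusted $R^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$, where $n$ is the number of observations and $k$ the number of predictors, so it only rises when a new term improves the fit by more than chance alone would.)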
print(models.adjusted_r_squared(result1))
print(models.adjusted_r_squared(result7))
residuals7 = plot_residuals(result7, ["age", "bmi", "children"])
So that took care of a whole lot of our error. It shifted the residuals all around but we still have a break in the BMI. Are the residuals correlated with anything?
insurance["residuals"] = residuals7
correlations(insurance, "residuals", ["northeast", "southeast", "southwest"])
Hmm, maybe there are real regional differences in health! Let's add these into the regression:
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + children + northeast + southwest + southeast"
result8 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result8)
residuals8 = plot_residuals(result8, ["age", "bmi", "children"])
The error goes down a little and it appears that those in the Southwest are slightly healthier for some reason. The evidence is pretty strong for the northeast having higher charges and for the southeast having lower charges...which sort of defies stereotypes. There may be something else going on here so let's continue to leave the regional dummy variables out.
Let's look at the actual data. First, we need to figure out the actual BMI values associated with those indices. We can do this by replotting with BMI on the x-axis instead of the observation index.
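Since we're about to make this same plot for several subsets, let's wrap the plotting code we've been using into a small helper:
def plot_bmi_residuals(subset):
    # Plot model residuals against actual BMI values for any subset of the data.
    figure = plt.figure(figsize=(20, 6))
    axes = figure.add_subplot(1, 1, 1)
    keyed_values = sorted(zip(subset.bmi.values, subset.residuals), key=lambda x: x[0])
    bmi_values = [x[0] for x in keyed_values]
    ordered_residuals = [x[1] for x in keyed_values]
    axes.plot(bmi_values, ordered_residuals, '.', color="dimgray", alpha=0.75)
    axes.axhline(y=0.0, c="firebrick", alpha=0.5)
    axes.set_ylim((-30000, 30000))
    axes.set_xlabel("bmi")
    axes.set_ylabel("residuals")
    plt.show()
    plt.close()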
plot_bmi_residuals(insurance)
Notice how we can't see the curvature when plotting against BMI itself because of overstriking. Still, with this plot, we can see that our problem happens almost exactly at BMI=30. Since we put the residuals into the dataframe, we can do a bit of filtering. What happens if we take out the women?
plot_bmi_residuals(insurance[insurance.male == 1])
Is it only men?
plot_bmi_residuals(insurance[insurance.male == 0])
Nope, still there. What about smoking?
plot_bmi_residuals(insurance[insurance.smoke_yes == 1])
Interesting. What about non-smoking?
plot_bmi_residuals(insurance[insurance.smoke_yes == 0])
Well, there are still problems but not this problem. Whatever "this" is, it happens only to smokers. Still, it's an incredibly strange discontinuity: we overestimate and then underestimate charges up to BMI 30, and then exactly at BMI 30 the pattern resets, overestimating and then underestimating again.
Let's look at the data:
pd.options.display.max_rows = 100
by_bmi = insurance.sort_values(by="bmi")
data = by_bmi[(by_bmi.smoke_yes == 1) & (28 <= by_bmi.bmi) & (by_bmi.bmi <= 32)]
data[["age", "male", "bmi", "children", "region", "charges", "residuals"]]
We can see the break exactly at $BMI=30$. It seems incredibly strange that diseases would just magically know that your BMI is over 30 and that you smoke, and give you more of them so that your healthcare charges go up.
Could it be as simple as the insurance company is just plain charging more for smokers with a high BMI? Let's create a dummy variable:
insurance["bmi_above_30"] = insurance.bmi.apply(lambda bmi: 1 if bmi > 30.0 else 0)
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + children + bmi_above_30"
result9 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result9)
residuals9 = plot_residuals(result9, ["age", "bmi", "children"])
This just breaks up both "bands". We did observe that this only happens to smokers, so let's change this to an interaction term with smoke_yes:
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + children + smoke_yes:bmi_above_30"
result10 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result10)
residuals10 = plot_residuals(result10, ["age", "bmi", "children"])
Wow, that seems to be the case. Now we get all kinds of interesting things falling out of the model.
- age is positive; this is expected.
- male is negative; if you correct for BMI and smoking, you're not getting charged more for being male.
- bmi by itself doesn't seem to have strong evidence supporting its coefficient. Maybe charges aren't a good proxy for disease?
- smoke_yes by itself doesn't seem to have a strong effect on charges.
- smoke_yes and bmi together have a positive effect on charges.
- children is positive and has strong support from the data.
- smoking with a BMI above 30 adds a base (intercept) shift in charges.
Thinking just about smoking and BMI: neither seems to have a strong effect on charges alone. Together, however, they set a higher base rate of charges and, as BMI increases (while smoking), charges increase.
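To see this in the fitted equation itself: for a smoker, the BMI-related terms are $\beta_{smoke} + (\beta_{bmi} + \beta_{smoke:bmi}) \cdot bmi + \beta_{smoke:bmi30} \cdot \mathbf{1}[bmi > 30]$ (plus the age, sex, and children terms), while for a non-smoker everything but $\beta_{bmi} \cdot bmi$ drops out. Smoking both steepens the slope on BMI and adds a discrete jump at $BMI = 30$.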
I'm not sure what this says about disease vs. charges. What about children? It seems like having them would affect women more than men. Let's make it an interaction term:
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + male:children + smoke_yes:bmi_above_30"
result11 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result11)
residuals11 = plot_residuals(result11, ["age", "bmi", "children"])
That's strange. The intercept is positive. If you're male, charges go down, but if you're a male with children, charges go back up. We might need both children and the interaction term:
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + smoke_yes:bmi_above_30 + children + male:children"
result12 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result12)
Nope, that doesn't seem to help. It looks like having children affects both men's and women's charges. Our "final" model is:
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + smoke_yes:bmi_above_30 + children"
final = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(final)
We know we can produce estimates using the Bootstrap but do we have to? Looking at the individual residuals, we can see that we do:
residuals_final= plot_residuals(final, ["age", "bmi", "children"])
These sorts of patterns are typical with targets that can't be negative. There's no place for the other side of the tail to be (or it's just all truncated at "no charges"). Although we didn't do it before, we can also just plot a histogram of residuals:
figure = plt.figure(figsize=(20,6))
axes = figure.add_subplot(1,1,1)
axes.hist(residuals_final, bins=50)
axes.set_title("Residuals Histogram")
plt.show()
plt.close()
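If you want a number to go with the histogram, scipy's normaltest (the D'Agostino-Pearson test) provides one; a tiny p-value says the residuals are unlikely to be draws from a normal distribution.
# Normality test on the final residuals; a tiny p-value means we should not
# trust normal-theory confidence intervals here (hence the Bootstrap).
statistic, p_value = stats.normaltest(residuals_final)
print(statistic, p_value)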
There might be some other analysis we could do, but this appears to be the bulk of it. The $R^2$ is pretty high: 87%. All the variables in the model make sense. I'm not sure what our model turned out to be a model of...which goes to validity. You can't use charge data to make a model of disease.