Lab 10 Linear Models

Posted on 12/05/2019 in Python Data Science

In [1]:
import warnings
warnings.filterwarnings('ignore')

Lab 10 - Linear Models - Solution

In [2]:
%matplotlib inline

Linear Regression

In a previous module (Lab 5), you performed EDA on the insurance data set. In this lab, you will build a linear regression model to estimate charges.

In [3]:
import numpy as np
import random as py_random
import numpy.random as np_random
import time
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats

sns.set(style="whitegrid")

Answer

(Obviously, this is not the solution but a possible solution.)

Let's review the EDA from Lab 5. What did we learn?

The data contains a number of variables. We have age, sex, BMI, number of children, smoking, region, and the target variable, charges.

  • Age - Uniform distribution
  • Sex - Bernoulli distribution, 50/50
  • BMI - Normal distribution
  • Children - Multinomial distribution, possibly geometric.
  • Smoking - Bernoulli distribution, 20/80
  • Region - Multinomial distribution
  • Charges - Exponential distribution-ish (no zero charges).

For the features, there's nothing that particularly stands out as needing our attention. Sex and smoker will each need to be converted into a dummy variable, and region will need to be converted into four dummy variables. Let's load the data:

In [4]:
insurance = pd.read_csv("insurance.csv", header=0)
In [5]:
insurance.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
age         1338 non-null int64
sex         1338 non-null object
bmi         1338 non-null float64
children    1338 non-null int64
smoker      1338 non-null object
region      1338 non-null object
charges     1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.2+ KB

We noted in our EDA that charges is not normally distributed. This may cause problems later but let's leave it for now.

In [6]:
figure = plt.figure(figsize=(20,6))
axes = figure.add_subplot(1,1,1)
axes.hist(insurance.charges)
axes.set_title("Charges Histogram")
plt.show()
plt.close()    

Let's review our pairwise EDA with the target variable, charges:

  • age - definite positive progression but banding suggests an interaction term.
  • sex - possibly some higher charges for men.
  • BMI - overall positive but banding suggests there are other factors.
  • smoking - definitely a relationship.

We didn't look at children but the connection doesn't seem likely to be strong on the face of it.

Since we've decided to use a linear model, it is much more important to look at the correlation coefficients between each variable and the target, and between the variables themselves, but first we need those dummy variables:

In [7]:
insurance = pd.concat([insurance, pd.get_dummies(insurance["sex"])], axis=1)
insurance = pd.concat([insurance, pd.get_dummies(insurance["region"])], axis=1)
insurance = pd.concat([insurance, pd.get_dummies(insurance["smoker"], prefix="smoke")], axis=1)
In [8]:
insurance.head()
Out[8]:
age sex bmi children smoker region charges female male northeast northwest southeast southwest smoke_no smoke_yes
0 19 female 27.900 0 yes southwest 16884.92400 1 0 0 0 0 1 0 1
1 18 male 33.770 1 no southeast 1725.55230 0 1 0 0 1 0 1 0
2 28 male 33.000 3 no southeast 4449.46200 0 1 0 0 1 0 1 0
3 33 male 22.705 0 no northwest 21984.47061 0 1 0 1 0 0 1 0
4 32 male 28.880 0 no northwest 3866.85520 0 1 0 1 0 0 1 0

And now we can calculate our correlation coefficients:

In [9]:
def correlations(data, y, xs):
    rs = []
    rhos = []
    for x in xs:
        r = stats.pearsonr(data[y], data[x])[0]
        rs.append(r)
        rho = stats.spearmanr(data[y], data[x])[0]
        rhos.append(rho)
    return pd.DataFrame({"feature": xs, "r": rs, "rho": rhos})
In [10]:
correlations(insurance, "charges", ["age", "male", "female", 
                                    "bmi", "smoke_yes", "children", 
                                    "northwest", "southwest", "southeast"])
Out[10]:
feature r rho
0 age 0.299008 0.534392
1 male 0.057292 0.009490
2 female -0.057292 -0.009490
3 bmi 0.198341 0.119396
4 smoke_yes 0.787251 0.663460
5 children 0.067998 0.133339
6 northwest -0.039905 -0.021634
7 southwest -0.043210 -0.042354
8 southeast 0.073982 0.017275

There are some interesting results here:

  1. age - as we saw in the EDA, there is a positive relationship (30%), and the larger Spearman's suggests that the relationship is non-linear.
  2. male - not a huge relationship.
  3. female - exactly the same relationship as male but with the opposite sign. I included it to demonstrate why we don't need both. I'm keeping the dummy variable for male because our EDA suggested there was some relationship there.
  4. bmi - not a huge relationship, and the smaller Spearman's doesn't suggest the weakness is due to non-linearity.
  5. smoking - definite relationship. We already knew this.
  6. children - not a huge relationship, but the larger Spearman's indicates the relationship might be non-linear.

The southeast variable looks at least as promising as male, but we'll hold off on the regional variables for now. There's good reason to believe that any regional effect we find is really the result of the other features being distributed differently across regions.

Before estimating any linear models, let's look at the Null model (the mean):

In [11]:
insurance.charges.describe()
Out[11]:
count     1338.000000
mean     13270.422265
std      12110.011237
min       1121.873900
25%       4740.287150
50%       9382.033000
75%      16639.912515
max      63770.428010
Name: charges, dtype: float64

The lowest charge is \$1,121 and the highest is \$63,770. The mean is \$13,270 with a standard deviation of \$12,110. You can think of \$12,110 as the $\sigma$ of the Null model.
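To make that concrete: the Null model predicts the mean for every observation, so its root mean squared error is (up to the $n$ versus $n-1$ denominator) just the standard deviation. A quick check:

# RMSE of always predicting the mean -- the error of the Null model.
null_rmse = np.sqrt(np.mean((insurance.charges - insurance.charges.mean()) ** 2))
print(null_rmse)  # about 12,105, essentially the standard deviation above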

(This is somewhat artificial; in practice, modeling like this would follow directly after the EDA.)

Let's try the "all in" model. We have domain knowledge and some statistics to support the inclusion of all the variables (excepting the regional ones):

In [12]:
import models
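The models module is helper code provided with the course and isn't shown here. Conceptually, bootstrap_linear_regression resamples the rows of the data with replacement, refits the regression on each resample, and reports 95% bootstrap confidence intervals (BCIs) for every coefficient. A minimal sketch of the idea, assuming the statsmodels formula API (the actual helper may differ in its details):

import statsmodels.formula.api as smf

def bootstrap_lr_sketch(formula, data, samples=1000):
    # Point estimates come from a single fit on the full data.
    fit = smf.ols(formula, data=data).fit()
    coefficients = []
    for _ in range(samples):
        # Resample the rows with replacement and refit.
        resample = data.sample(n=len(data), replace=True)
        coefficients.append(smf.ols(formula, data=resample).fit().params)
    coefficients = pd.DataFrame(coefficients)
    # The 95% BCI is the 2.5th to 97.5th percentile of each coefficient.
    return pd.DataFrame({"theta": fit.params,
                         "lo": coefficients.quantile(0.025),
                         "hi": coefficients.quantile(0.975)})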
In [13]:
model = "charges ~ age + male + bmi + smoke_yes + children"
result1 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result1)
Out[13]:
Linear Regression Results

Coefficients
                       $\theta$    95% BCI
($\beta_0$)           -12052.46    (-13877.54, -10472.09)
age ($\beta_1$)          257.73    (238.52, 291.71)
male ($\beta_2$)        -128.64    (-851.32, 402.91)
bmi ($\beta_3$)          322.36    (262.35, 379.68)
smoke_yes ($\beta_4$)  23823.39    (22881.75, 24853.05)
children ($\beta_5$)     474.41    (269.26, 725.05)

Metrics
$\sigma$                6069.73    (5700.21, 6302.76)
$R^2$                      0.75    (0.72, 0.78)

The $R^2$ is 75%, which is pretty good. The error has been cut in half compared to the Null model, to \$6,070. But we have already seen that charges is not normally distributed, so the overall error is probably not safe to rely on. What about the coefficients?

  • The intercept isn't particularly interesting here because there are no childless, non-smoking females with zero age and BMI.
  • age is positive and strongly supported by the data.
  • male is negative (surprise!) and has mixed support in the data.
  • bmi is positive and strongly supported by the data.
  • smoke_yes is positive and strongly supported by the data.
  • children is positive (surprise!) and strongly supported by the data.

With our first model done, here's what we think we know:

  1. Charges is not normally distributed so a transformation is probably in order.
  2. We could improve interpretability if we mean centered and possibly mean scaled the data (see the sketch after this list).
  3. male and children are surprises and it's not clear why for now.
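Point 2 is easy to act on, even though we won't pursue it below. A sketch of mean centering the continuous features, so that the intercept refers to an average age and BMI rather than to zero (scaling is the commented-out variation):

# Sketch only -- these columns are not used in the models below.
for column in ["age", "bmi"]:
    insurance[column + "_centered"] = insurance[column] - insurance[column].mean()
    # insurance[column + "_scaled"] = insurance[column + "_centered"] / insurance[column].std()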

Let's look at the residuals:

In [14]:
def plot_residuals(result, variables):
    figure = plt.figure(figsize=(20, 6))

    plots = len(variables)
    rows = (plots // 3) + 1

    residuals = np.array([r[0] for r in result["residuals"]])
    limits = max(np.abs(residuals.min()), residuals.max())

    n = result["n"]
    for i, variable in enumerate(variables):
        axes = figure.add_subplot(rows, 3, i + 1)

        # Sort the residuals by the value of the variable so that any
        # systematic patterns show up left to right.
        keyed_values = sorted(zip(insurance[variable].values, residuals), key=lambda x: x[0])
        ordered_residuals = [x[1] for x in keyed_values]

        axes.plot(list(range(0, n)), ordered_residuals, '.', color="dimgray", alpha=0.75)
        axes.axhline(y=0.0, xmin=0, xmax=n, c="firebrick", alpha=0.5)
        axes.set_ylim((-limits, limits))
        axes.set_ylabel("residuals")
        axes.set_xlabel(variable)

    plt.show()
    plt.close()

    return residuals
In [15]:
residuals1 = plot_residuals(result1, ["age", "bmi", "children"])

This is very interesting.

For age, we see three bands. We saw these bands in our EDA. They basically say, at every age, there is some information you are missing that is causing you to be systematically wrong. But they are also slightly curved in a way that suggests increasing returns just like in our discussion of residuals. Additionally, the errors are not symmetric. For every age, we are more likely to overestimate charges.

We see the same banding for children and the same systematic overestimation.

The most interesting result is for BMI. First, the middle band has an inflection that suggests a non-linearity. Second, there's an abrupt break in the error just past the 600th observation. Additionally, if you connected those two outer segments, the resulting curve would have the opposite inflection to the middle band.

We mentioned before that charges is not normally distributed and that this might cause some problems. Let's see if we can address that issue.

In [16]:
insurance["log_charges"] = insurance["charges"].apply(np.log)

Let's see a histogram:

In [17]:
figure = plt.figure(figsize=(20,6))
axes = figure.add_subplot(1,1,1)
axes.hist(insurance.log_charges)
axes.set_title("Log(Charges) Histogram")
plt.show()
plt.close()

That looks better than before. It's not exactly normal but is fairly symmetric. Let's try to use this as our response variable:

In [18]:
model = "log_charges ~ age + male + bmi + smoke_yes + children"
result2 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result2)
Out[18]:
Linear Regression Results

Coefficients
                       $\theta$    95% BCI
($\beta_0$)                7.01    (6.88, 7.11)
age ($\beta_1$)            0.03    (0.03, 0.04)
male ($\beta_2$)          -0.08    (-0.12, -0.03)
bmi ($\beta_3$)            0.01    (0.01, 0.01)
smoke_yes ($\beta_4$)      1.55    (1.50, 1.65)
children ($\beta_5$)       0.10    (0.09, 0.12)

Metrics
$\sigma$                   0.45    (0.42, 0.48)
$R^2$                      0.76    (0.73, 0.79)
In [19]:
residuals2 = plot_residuals(result2, ["age", "bmi", "children"])

Where did all the bands go? This doesn't look very symmetric either. I think we need to compare the histograms of charges versus residuals:

In [20]:
figure = plt.figure(figsize=(20,6))

axes = figure.add_subplot(2, 2, 1)
axes.hist(insurance.charges)
axes.set_title("Charges Histogram")

axes = figure.add_subplot(2, 2, 2)
axes.hist(residuals1)
axes.set_title("Residuals Histogram")

axes = figure.add_subplot(2, 2, 3)
axes.hist(insurance.log_charges)
axes.set_title("Log(Charges) Histogram")

axes = figure.add_subplot(2, 2, 4)
axes.hist(residuals2)
axes.set_title("Residuals Histogram")

plt.show()
plt.close()    

Interesting indeed. Our errors are much more symmetric with the actual charges variable than with the transformed charges variable.

This is a key point that is often misunderstood or even misstated: linear regression does not require your variables to be normally distributed. We only expect normally distributed errors (residuals) and then only for gauging the confidence in our estimates. Lacking such normality, we can easily fall back on Bootstrap estimates.
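If you want a quick numeric check to go with the histograms, scipy's D'Agostino-Pearson test is one option; a small p-value rejects normality. (This assumes the residuals1 and residuals2 arrays computed above.)

# Test both sets of residuals for normality; a small p means "not normal."
for name, res in [("charges model", residuals1), ("log model", residuals2)]:
    statistic, p = stats.normaltest(res)
    print(name, "p =", p)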

Let's continue with the untransformed charges variable...

Unlike many such situations, there are no additional variables we can add to the data, so we're left with:

  1. transforms
  2. interaction terms.

We have some ideas for transforms so let's start there. Age looks like it should be squared. This basically means that as you age, charges increase at an increasing rate. Let's see if that's true:

In [21]:
insurance["age_sq"] = insurance.age**2
In [22]:
model = "charges ~ age_sq + male + bmi + smoke_yes + children"
result3 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result3)
Out[22]:
Linear Regression Results

Coefficients
                       $\theta$    95% BCI
($\beta_0$)            -7655.22    (-9054.61, -6295.49)
age_sq ($\beta_1$)         3.26    (3.02, 3.56)
male ($\beta_2$)        -133.72    (-756.45, 547.77)
bmi ($\beta_3$)          318.93    (268.62, 374.20)
smoke_yes ($\beta_4$)  23834.29    (22459.90, 24825.84)
children ($\beta_5$)     612.15    (317.76, 864.23)

Metrics
$\sigma$                6036.46    (5644.66, 6396.84)
$R^2$                      0.75    (0.72, 0.78)

That didn't change the $R^2$. What about the residuals?

In [23]:
residuals3 = plot_residuals(result3, ["age_sq", "bmi", "children"])

For comparison,

In [24]:
plot_residuals(result1, ["age", "bmi", "children"]); # the semicolon suppresses printing out the result.

So the curves in age are gone and the error ($\sigma$) went down slightly, but the artifacts in BMI are less dispersed with the new model. Let's see if we can fix that. What if there are decreasing returns to BMI?

In [25]:
insurance["bmi_sqrt"] = insurance.bmi.apply(np.sqrt)
In [26]:
model = "charges ~ age_sq + male + bmi_sqrt + smoke_yes + children"
result4 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result4)
Out[26]:
Linear Regression Results

Coefficients
                       $\theta$    95% BCI
($\beta_0$)           -17455.49    (-21675.24, -14665.81)
age_sq ($\beta_1$)         3.26    (2.96, 3.51)
male ($\beta_2$)        -135.82    (-685.47, 543.67)
bmi_sqrt ($\beta_3$)    3555.99    (3005.42, 4273.00)
smoke_yes ($\beta_4$)  23842.02    (22800.44, 25079.69)
children ($\beta_5$)     611.69    (302.64, 910.06)

Metrics
$\sigma$                6031.23    (5708.24, 6452.22)
$R^2$                      0.75    (0.72, 0.78)
In [27]:
residuals4 = plot_residuals(result4, ["age", "bmi", "children"])

Doesn't look like it...squared?

In [28]:
insurance["bmi_sq"] = insurance.bmi**2
In [29]:
model = "charges ~ age_sq + male + bmi_sq + smoke_yes + children"
result5 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result5)
Out[29]:
Linear Regression Results

Coefficients
                       $\theta$    95% BCI
($\beta_0$)            -2643.19    (-3621.65, -1741.01)
age_sq ($\beta_1$)         3.29    (3.00, 3.62)
male ($\beta_2$)        -124.93    (-820.99, 478.72)
bmi_sq ($\beta_3$)         4.84    (3.95, 5.78)
smoke_yes ($\beta_4$)  23820.25    (22874.36, 25022.69)
children ($\beta_5$)     613.36    (360.73, 859.40)

Metrics
$\sigma$                6057.28    (5668.81, 6371.76)
$R^2$                      0.75    (0.72, 0.78)
In [30]:
residuals5 = plot_residuals(result5, ["age", "bmi", "children"])

We don't seem to be making much progress here. There's a non-linearity but we can't identify it. Let's switch for a bit and think of interaction terms.

There are a lot of potential interaction terms: male, age, bmi, smoke_yes, but we only have 1,338 observations so we can't get overly specific. Let's try the big one: male and bmi.

In [31]:
model = "charges ~ age_sq + male + bmi + male:bmi + smoke_yes + children"
result6 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result6)
Out[31]:
Linear Regression Results

Coefficients
                       $\theta$    95% BCI
($\beta_0$)            -7664.57    (-9805.43, -5343.57)
age_sq ($\beta_1$)         3.26    (3.03, 3.52)
male ($\beta_2$)        -115.44    (-2588.57, 3332.09)
bmi ($\beta_3$)          319.24    (235.79, 400.00)
male:bmi ($\beta_4$)      -0.60    (-114.21, 77.63)
smoke_yes ($\beta_5$)  23834.54    (22670.36, 24831.61)
children ($\beta_6$)     612.13    (381.95, 796.39)

Metrics
$\sigma$                6038.72    (5696.85, 6449.33)
$R^2$                      0.75    (0.72, 0.78)

Wow, that seemed like a slam dunk but the $R^2$ didn't even move. Smoking and bmi?

In [32]:
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + children"
result7 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result7)
Out[32]:
Linear Regression Results

Coefficients
                           $\theta$    95% BCI
($\beta_0$)                 2005.66    (758.39, 3365.79)
age_sq ($\beta_1$)             3.34    (3.15, 3.57)
male ($\beta_2$)            -500.57    (-992.58, 46.85)
bmi ($\beta_3$)                3.53    (-34.83, 40.81)
smoke_yes ($\beta_4$)     -20236.54    (-23518.81, -17378.45)
smoke_yes:bmi ($\beta_5$)   1437.00    (1337.01, 1543.57)
children ($\beta_6$)         653.84    (431.70, 837.16)

Metrics
$\sigma$                    4827.80    (4307.24, 5298.97)
$R^2$                          0.84    (0.81, 0.87)

The $R^2$ went up a lot but what about the adjusted $R^2$?
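Assuming models.adjusted_r_squared implements the usual definition, with $n$ observations and $k$ predictors it is:

$$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}$$

The adjustment penalizes $R^2$ for every added term, so it only rises when a new term earns its keep.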

In [33]:
print(models.adjusted_r_squared(result1))
print(models.adjusted_r_squared(result6))
0.7484052599228702
0.7509684533349236
In [34]:
residuals7 = plot_residuals(result7, ["age", "bmi", "children"])

So that took care of a whole lot of our error. It shifted the residuals all around but we still have a break in the BMI. Are the residuals correlated with anything?

In [35]:
insurance["residuals"] = residuals7
In [36]:
correlations(insurance, "residuals", ["northeast", "southeast", "southwest"])
Out[36]:
feature r rho
0 northeast 0.087398 0.424320
1 southeast -0.048958 -0.279556
2 southwest -0.053899 -0.263691

Hmm, maybe there are real regional differences in health! Let's add these into the regression:

In [37]:
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + children + northeast + southwest + southeast"
result8 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result8)
Out[37]:
Linear Regression Results

Coefficients
                           $\theta$    95% BCI
($\beta_0$)                 1679.25    (72.47, 3126.24)
age_sq ($\beta_1$)             3.33    (3.09, 3.50)
male ($\beta_2$)            -505.26    (-1019.81, 32.25)
bmi ($\beta_3$)               20.32    (-22.42, 68.58)
smoke_yes ($\beta_4$)     -20353.76    (-23588.39, -16884.88)
smoke_yes:bmi ($\beta_5$)   1441.41    (1341.48, 1542.87)
children ($\beta_6$)         657.35    (516.90, 856.61)
northeast ($\beta_7$)        597.70    (-401.09, 1429.05)
southwest ($\beta_8$)       -630.77    (-1373.65, 4.18)
southeast ($\beta_9$)       -608.20    (-1433.58, 118.93)

Metrics
$\sigma$                    4807.98    (4336.76, 5164.95)
$R^2$                          0.84    (0.81, 0.88)
In [38]:
residuals8 = plot_residuals(result8, ["age", "bmi", "children"])

The error goes down a little, and it appears that those in the Southwest are slightly healthier for some reason. The evidence leans toward the northeast having higher charges and the southeast and southwest having lower ones...which sort of defies stereotypes. There may be something else going on here, so let's continue to leave the regional dummy variables out.

Let's look at the actual data. First we need to figure out the actual BMI associated with those indices. We can do this by replotting with BMI on the x-axis instead of the index:

In [39]:
n = result7["n"]

figure = plt.figure(figsize=(20,6))

axes = figure.add_subplot(1, 1, 1)

keyed_values = sorted(zip(insurance.bmi.values, insurance.residuals), key=lambda x: x[ 0])
ordered_residuals = [x[ 1] for x in keyed_values]
bmi_values = [x[0] for x in keyed_values]

axes.plot(bmi_values, ordered_residuals, '.', color="dimgray", alpha=0.75)
axes.axhline(y=0.0, xmin=0, xmax=n, c="firebrick", alpha=0.5)
axes.set_ylim((-30000, 30000))
axes.set_ylabel("residuals")

plt.show()
plt.close()

Notice how we can't see the curvature when plotting against BMI because of overplotting. Still, with this plot, we can see that our problem happens almost exactly at BMI=30. Since we put residuals into the dataframe, we can do a bit of filtering. What happens if we take out women?

In [40]:
subset = insurance[insurance.male == 1]
n = len(subset.index)

figure = plt.figure(figsize=(20,6))

axes = figure.add_subplot(1, 1, 1)

keyed_values = sorted(zip(subset.bmi.values, subset.residuals), key=lambda x: x[ 0])
ordered_residuals = [x[ 1] for x in keyed_values]
bmi_values = [x[0] for x in keyed_values]

axes.plot(bmi_values, ordered_residuals, '.', color="dimgray", alpha=0.75)
axes.axhline(y=0.0, xmin=0, xmax=n, c="firebrick", alpha=0.5)
axes.set_ylim((-30000, 30000))
axes.set_ylabel("residuals")

plt.show()
plt.close()

Is it only men?

In [41]:
subset = insurance[insurance.male == 0]
n = len(subset.index)

figure = plt.figure(figsize=(20,6))

axes = figure.add_subplot(1, 1, 1)

keyed_values = sorted(zip(subset.bmi.values, subset.residuals), key=lambda x: x[ 0])
ordered_residuals = [x[ 1] for x in keyed_values]
bmi_values = [x[0] for x in keyed_values]

axes.plot(bmi_values, ordered_residuals, '.', color="dimgray", alpha=0.75)
axes.axhline(y=0.0, xmin=0, xmax=n, c="firebrick", alpha=0.5)
axes.set_ylim((-30000, 30000))
axes.set_ylabel("residuals")

plt.show()
plt.close()

Nope, still there. What about smoking?

In [42]:
subset = insurance[insurance.smoke_yes == 1]
n = len(subset.index)

figure = plt.figure(figsize=(20,6))

axes = figure.add_subplot(1, 1, 1)

keyed_values = sorted(zip(subset.bmi.values, subset.residuals), key=lambda x: x[ 0])
ordered_residuals = [x[ 1] for x in keyed_values]
bmi_values = [x[0] for x in keyed_values]

axes.plot(bmi_values, ordered_residuals, '.', color="dimgray", alpha=0.75)
axes.axhline(y=0.0, xmin=0, xmax=n, c="firebrick", alpha=0.5)
axes.set_ylim((-30000, 30000))
axes.set_ylabel("residuals")

plt.show()
plt.close()

Interesting. What about non-smoking?

In [43]:
subset = insurance[insurance.smoke_yes == 0]
n = len(subset.index)

figure = plt.figure(figsize=(20,6))

axes = figure.add_subplot(1, 1, 1)

keyed_values = sorted(zip(subset.bmi.values, subset.residuals), key=lambda x: x[ 0])
ordered_residuals = [x[ 1] for x in keyed_values]
bmi_values = [x[0] for x in keyed_values]

axes.plot(bmi_values, ordered_residuals, '.', color="dimgray", alpha=0.75)
axes.axhline(y=0.0, xmin=0, xmax=n, c="firebrick", alpha=0.5)
axes.set_ylim((-30000, 30000))
axes.set_ylabel("residuals")

plt.show()
plt.close()

Well, there are still problems but not this problem. Whatever "this" is, it happens only to smokers. Still, it's an incredibly strange discontinuity: we systematically overestimate charges right up to BMI 30, and exactly at BMI 30 we flip to underestimating them.

Let's look at the data:

In [44]:
pd.options.display.max_rows = 100
In [45]:
by_bmi = insurance.sort_values(by="bmi")
In [46]:
data = by_bmi[(by_bmi.smoke_yes == 1) & (28 <= by_bmi.bmi) & (by_bmi.bmi <= 32)]
data[["age", "male", "bmi", "children", "region", "charges", "residuals"]]
Out[46]:
age male bmi children region charges residuals
52 48 1 28.000 1 southwest 23568.27200 -6394.900885
827 36 1 28.025 1 northeast 20773.62775 -5854.268234
105 20 1 28.025 1 northwest 17560.37975 -6070.813774
1040 35 0 28.025 0 northwest 20234.85475 -6002.302697
280 40 0 28.120 1 northeast 22331.56680 -5950.484142
1307 32 1 28.120 4 northwest 21472.47880 -6344.082100
1007 47 1 28.215 3 northwest 24915.22085 -6347.621165
514 39 1 28.300 1 southwest 21082.16000 -6694.403153
126 19 0 28.300 0 southwest 17081.08000 -6662.547096
375 23 0 28.310 0 northwest 18033.96790 -6285.946253
604 19 0 28.310 0 northwest 17468.98390 -6289.048542
475 61 1 28.310 1 northwest 28868.66390 -6280.279388
641 42 1 28.310 3 northwest 32787.45859 2876.083414
465 30 0 28.380 1 southeast 19521.96820 -6793.447947
1184 23 0 28.490 1 southeast 18328.23810 -6904.814743
476 24 1 28.500 0 northeast 35147.52848 10897.285486
795 27 1 28.500 0 northwest 18310.74200 -6451.214695
144 30 1 28.690 3 northwest 20745.98910 -6823.111860
773 19 0 28.880 0 northwest 17748.50620 -6830.630975
885 32 1 28.930 1 southeast 19719.69470 -7302.171868
886 57 1 28.975 0 northeast 27218.43725 -6655.999057
1337 61 0 29.070 0 northwest 29141.36030 -6949.112676
238 19 1 29.070 0 northwest 17352.68030 -6999.592618
741 27 1 29.150 0 southeast 18246.49550 -7451.808698
949 25 1 29.700 3 southwest 19933.45800 -8170.836077
1053 47 1 29.800 3 southwest 25309.48900 -8236.600387
1179 31 1 29.810 0 southeast 19350.36890 -8074.620034
843 57 0 29.810 0 southeast 27533.91290 -8043.935649
1250 24 1 29.830 0 northeast 18648.42170 -7517.732338
92 59 1 29.830 3 northeast 30184.93670 -7658.615967
103 61 0 29.920 3 southeast 30942.19180 -8334.262977
1278 39 1 29.925 1 northeast 22462.04375 -7655.388160
461 42 1 30.000 0 southwest 22144.03200 -8240.319309
1196 19 0 30.020 0 northwest 33307.55080 7086.204158
1308 25 0 30.200 0 southwest 33900.65300 6537.053152
953 44 1 30.200 2 southwest 38998.54600 6443.143004
587 34 0 30.210 1 northwest 43943.87610 14136.080897
503 19 1 30.250 0 southeast 32548.34050 6496.236731
1300 45 1 30.360 0 southeast 62592.87309 30817.005945
1093 22 0 30.400 0 northwest 33907.54800 6727.420521
146 46 1 30.495 3 northwest 40720.55105 6484.331764
1037 45 0 30.495 1 northwest 39725.51805 6600.770438
1042 20 1 30.685 0 northeast 33475.81715 6666.643995
381 55 1 30.685 0 northeast 42303.69215 6715.117258
828 41 1 30.780 3 northeast 39597.40720 6405.507835
850 37 0 30.780 0 northeast 37270.15120 6582.707980
322 34 1 30.800 0 southwest 35491.64000 5988.337664
956 54 1 30.800 1 southeast 41999.52000 5955.995375
1301 62 1 30.875 3 northwest 46718.16325 6155.186065
1049 49 1 30.900 0 southwest 39727.61400 5916.313378
1021 22 0 31.020 3 southeast 35595.58980 5560.803483
1267 24 1 31.065 0 northeast 34254.05335 6308.839057
689 27 1 31.130 1 southeast 34806.46770 5602.062497
86 57 0 31.160 0 northwest 43578.93940 6056.369114
94 64 0 31.300 2 southwest 47291.05500 5426.304658
677 60 1 31.350 3 northwest 46130.52650 5699.361665
123 44 1 31.350 1 northeast 39556.49450 5998.319149
1120 23 0 31.400 0 southwest 34166.27300 5395.106872
314 27 0 31.400 0 southwest 34838.87300 5398.800073
57 18 1 31.680 2 southeast 34303.16720 5007.161767
1078 28 1 31.680 0 southeast 34672.14720 5145.341045
911 18 1 31.730 0 northeast 33732.68670 5672.339451
738 23 1 31.730 3 northeast 36189.10170 5481.597609
826 56 1 31.790 2 southeast 43813.86610 4954.572264
254 50 1 31.825 0 northeast 41097.16175 5622.257739
23 34 0 31.920 1 northeast 37701.87680 5430.767397
259 19 1 31.920 0 northwest 33750.29180 5292.495216

We can see the break exactly at $BMI=30$. It seems incredibly strange that diseases just magically know that your BMI is over 30 and you smoke and therefore you get more of them and your healthcare charges go up.

Could it be as simple as the insurance company is just plain charging more for smokers with a high BMI? Let's create a dummy variable:

In [47]:
insurance["bmi_above_30"] = insurance.bmi.apply(lambda bmi: 1 if bmi > 30.0 else 0)
In [48]:
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + children + bmi_above_30"
result9 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result9)
Out[48]:
Linear Regression Results

Coefficients
                           $\theta$    95% BCI
($\beta_0$)                 6650.04    (5237.08, 8487.74)
age_sq ($\beta_1$)             3.34    (3.13, 3.55)
male ($\beta_2$)            -526.29    (-1049.67, -92.78)
bmi ($\beta_3$)             -200.40    (-263.41, -141.22)
smoke_yes ($\beta_4$)     -20415.43    (-23515.74, -17456.82)
smoke_yes:bmi ($\beta_5$)   1443.44    (1357.06, 1546.13)
children ($\beta_6$)         652.72    (434.47, 881.32)
bmi_above_30 ($\beta_7$)    3096.90    (2309.95, 3878.18)

Metrics
$\sigma$                    4738.94    (4309.41, 5059.14)
$R^2$                          0.85    (0.82, 0.88)
In [49]:
residuals9 = plot_residuals(result9, ["age", "bmi", "children"])

This just breaks up both "bands". We did observe that this only happens to smokers, so let's change this to an interaction term with smoke_yes:

In [50]:
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + children + smoke_yes:bmi_above_30"
result10 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result10)
Out[50]:
Linear Regression Results

Coefficients
                                    $\theta$    95% BCI
($\beta_0$)                          2014.88    (1091.29, 3254.52)
age_sq ($\beta_1$)                      3.34    (3.17, 3.56)
male ($\beta_2$)                     -516.74    (-834.60, -60.49)
bmi ($\beta_3$)                         3.56    (-34.96, 31.52)
smoke_yes ($\beta_4$)                1632.98    (-2703.61, 5056.56)
smoke_yes:bmi ($\beta_5$)             464.32    (336.14, 625.91)
children ($\beta_6$)                  651.47    (479.84, 834.67)
smoke_yes:bmi_above_30 ($\beta_7$)  15225.57    (13582.96, 16624.85)

Metrics
$\sigma$                             4378.69    (3975.15, 4864.89)
$R^2$                                   0.87    (0.85, 0.89)
In [51]:
residuals10 = plot_residuals(result10, ["age", "bmi", "children"])

Wow, that seems to be the case. Now we get all kinds of interesting things falling out of the model.

  1. age is positive; this is expected.
  2. male is negative; if you correct for BMI and smoking, you're not getting charged more for being male.
  3. bmi by itself doesn't seem to have strong support from the data. Maybe charges aren't a good proxy for disease?
  4. smoke_yes by itself doesn't seem to have a strong effect on charges.
  5. smoke_yes and bmi together have a positive effect on charges.
  6. children is positive and has strong support from the data.
  7. smoking combined with a BMI above 30 has a large base effect on charges.

Thinking just about smoking and BMI: neither seems to have a strong effect on charges alone. Together, however, they set a higher base rate of charges, and as BMI increases (while smoking), charges increase further (see the worked example below).
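To make that concrete, here's a quick back-of-the-envelope using the point estimates above and ignoring the other terms. For a smoker with a BMI of 28, the smoking-related portion of charges is about 1632.98 + 464.32 × 28 ≈ \$14,634. For a smoker with a BMI of 32, it's about 1632.98 + 464.32 × 32 + 15225.57 ≈ \$31,717. Crossing BMI 30 roughly doubles the smoking premium.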

I'm not sure what this says about disease v. charges. What about children? It seems like having children would affect women's charges more than men's. Let's make it an interaction term:

In [52]:
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + male:children + smoke_yes:bmi_above_30"
result11 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result11)
Out[52]:
Linear Regression Results

Coefficients
                                    $\theta$    95% BCI
($\beta_0$)                          2606.58    (1611.41, 3767.55)
age_sq ($\beta_1$)                      3.36    (3.16, 3.56)
male ($\beta_2$)                    -1168.91    (-1661.91, -560.02)
bmi ($\beta_3$)                         6.58    (-33.99, 40.61)
smoke_yes ($\beta_4$)                2018.05    (-1755.40, 5498.80)
smoke_yes:bmi ($\beta_5$)             449.38    (310.69, 586.74)
male:children ($\beta_6$)             611.63    (361.15, 830.50)
smoke_yes:bmi_above_30 ($\beta_7$)  15327.45    (13660.17, 16891.56)

Metrics
$\sigma$                             4417.13    (3986.65, 4751.77)
$R^2$                                   0.87    (0.84, 0.89)
In [53]:
residuals11 = plot_residuals(result11, ["age", "bmi", "children"])

That's strange. The intercept is positive. If you're male, charges go down, but if you're a male with children, charges go back up. We might need both children and the interaction term:

In [54]:
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + smoke_yes:bmi_above_30 + children + male:children"
result12 = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(result12)
Out[54]:
Linear Regression Results

Coefficients
                                    $\theta$    95% BCI
($\beta_0$)                          1976.96    (576.24, 3510.93)
age_sq ($\beta_1$)                      3.34    (3.17, 3.55)
male ($\beta_2$)                     -423.27    (-983.70, 205.68)
bmi ($\beta_3$)                         3.32    (-32.57, 39.75)
smoke_yes ($\beta_4$)                1590.26    (-2300.42, 4826.23)
smoke_yes:bmi ($\beta_5$)             466.06    (346.65, 614.28)
smoke_yes:bmi_above_30 ($\beta_6$)  15211.86    (13687.01, 16663.58)
children ($\beta_7$)                  695.83    (467.69, 1004.26)
male:children ($\beta_8$)             -85.88    (-454.35, 208.44)

Metrics
$\sigma$                             4380.03    (3953.34, 4798.98)
$R^2$                                   0.87    (0.84, 0.89)

Nope, that doesn't seem to help. It looks like having children affects both men's and women's charges. Our "final" model is:

In [55]:
model = "charges ~ age_sq + male + bmi + smoke_yes + smoke_yes:bmi + smoke_yes:bmi_above_30 + children"
final = models.bootstrap_linear_regression(model, data=insurance)
models.describe_bootstrap_lr(final)
Out[55]:
Linear Regression Results

Coefficients
                                    $\theta$    95% BCI
($\beta_0$)                          2014.88    (875.08, 3155.50)
age_sq ($\beta_1$)                      3.34    (3.12, 3.57)
male ($\beta_2$)                     -516.74    (-950.22, -112.38)
bmi ($\beta_3$)                         3.56    (-30.45, 40.46)
smoke_yes ($\beta_4$)                1632.98    (-1421.54, 5179.83)
smoke_yes:bmi ($\beta_5$)             464.32    (321.29, 579.94)
smoke_yes:bmi_above_30 ($\beta_6$)  15225.57    (13698.17, 16942.19)
children ($\beta_7$)                  651.47    (492.24, 831.69)

Metrics
$\sigma$                             4378.69    (3969.76, 4865.85)
$R^2$                                   0.87    (0.84, 0.89)

We know we can produce estimates using the Bootstrap but do we have to? Looking at the individual residuals, we can see that we do:

In [56]:
residuals_final = plot_residuals(final, ["age", "bmi", "children"])

These sorts of patterns are typical with targets that can't be negative. There's no place for the other side of the tail to be (or it's just all truncated at "no charges"). Although we didn't do it before, we can also just plot a histogram of residuals:

In [58]:
figure = plt.figure(figsize=(20,6))
axes = figure.add_subplot(1,1,1)
axes.hist(residuals_final, bins=50)
axes.set_title("Residuals Histogram")
plt.show()
plt.close()  

There might be some other analysis we can do, but this appears to be the bulk of it. The $R^2$ is pretty high...87%. All the variables in the model make sense. I'm not sure what our model turned out to be a model of...which goes to validity. You can't use charge data to make a model of disease.