Boston Housing¶

1. Introduction / Background¶

1.1. Data Selection¶

Boston housing dataset: https://www.kaggle.com/c/boston-housing/overview

1.2. Data description on Boston housing dataset¶

The Boston data frame has 404 rows and 14 columns.

This data frame contains the following columns:

crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per \$10,000.
ptratio: pupil-teacher ratio by town.
black:: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.

Sources

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

1.3. Goal and objectives¶

The goal is to develop a model that best predicts Boston housing price.

2. EDA¶

In [1]:

# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

#Display for notebooks
%matplotlib inline

In [2]:

#Load the Boston housing dataset
data = pd.read_csv ( 'boston_data_1.csv' )
#Note that 'medv' been multiplicatively scaled to account for 35 years of market inflation
prices = data [ 'medv' ]
features = data.drop ( 'medv', axis = 1 )

Boston housing dataset has 404 data points with 14 variables each

In [3]:

#Obtain general info on dataframe
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404 entries, 0 to 403
Data columns (total 14 columns):
crim       404 non-null float64
zn         404 non-null float64
indus      404 non-null float64
chas       404 non-null float64
nox        404 non-null float64
rm         404 non-null float64
age        404 non-null float64
dis        404 non-null float64
rad        404 non-null float64
tax        404 non-null float64
ptratio    404 non-null float64
black      404 non-null float64
lstat      404 non-null float64
medv       404 non-null float64
dtypes: float64(14)
memory usage: 44.3 KB

In [4]:

#Look at first few rows of dataframe
data.head()

Out[4]:

	crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	black	lstat	medv
0	0.15876	0.0	10.81	0.413	5.961	17.5	5.2873	4.0	305.0	19.2	376.94	9.88	21.7
1	0.10328	25.0	5.13	0.453	5.927	47.2	6.9320	8.0	284.0	19.7	396.90	9.22	19.6
2	0.34940	0.0	9.90	0.544	5.972	76.7	3.1025	4.0	304.0	18.4	396.24	9.97	20.3
3	2.73397	0.0	19.58	0.871	5.597	94.9	1.5257	5.0	403.0	14.7	351.85	21.45	15.4
4	0.04337	21.0	5.64	0.439	6.115	63.0	6.8147	4.0	243.0	16.8	393.97	9.43	20.5

In [5]:

#Histographs for each variable
data.hist ( sharex = False, sharey = False, xlabelsize = 1, ylabelsize = 1 )
plt.show()

Interpretation¶

chas, crim, dis, zn, lsat: Appears to be right skewed

age, black, ptratio: Appears to be left skewed

medv, rm appear to have normal distributions

2.1. Variable crim¶

In [6]:

# ********************* *************** #
#   Learn more on the variable "crim"
# ************************************* #
# what are the basic statistics
data["crim"].describe()

Out[6]:

count    404.000000
mean       3.730912
std        8.943922
min        0.006320
25%        0.082382
50%        0.253715
75%        4.053158
max       88.976200
Name: crim, dtype: float64

Interpretation¶

Median value is much smaller than the mean; but the max is much larger the 75%ile, suggesting there are likely some outliers in this variable as well.

Let's plot the histogram and boxplot for this variable.

In [7]:

def eda_numeric_plt(variable, y_max=-99, y_label=''):
    fig = plt.figure(figsize=(12,4))
    ax = fig.add_subplot(1,2,1)
    ax.hist(data[variable],50, color='lightblue', edgecolor='k')
    ax.set_xlabel(variable, fontsize=14)
    ax.set_ylabel("Counts", fontsize=14)
    ax.tick_params(labelsize=14)
    ax.set_title('Distribution of {}'.format(variable), fontsize=14)

    ax = fig.add_subplot(1,2,2)
    ax.boxplot(data[variable], widths=0.3)
    # if no value is provided, we skip next line
    if y_max != -99:
        ax.set_ylim([-1, y_max])
    ax.set_ylabel(y_label, fontsize=14)
    ax.tick_params(labelsize=14)
    ax.set_title('Boxplot', fontsize=16)
    plt.rcParams['axes.linewidth'] = 1
    plt.setp(ax.spines.values(), color='k')


    plt.show()
    plt.close()

In [8]:

eda_numeric_plt(variable='crim', y_max=20, y_label='Per Capita Crime Rate')

Interpretation¶

This variable is very right-skewed. Not sure how relistic to have a 88 in the per capita crime rate. Maybe worth to further to investigate the rows with large per captica values.

In [9]:

ia = np.where(data.crim>20)[0]
data.iloc[ia]

Out[9]:

	crim	indus	nox	rm	age	dis	rad	tax	ptratio	black	lstat	medv
10	25.9406	18.1	0.679	5.304	89.1	1.6475	24.0	666.0	20.2	127.36	26.64	10.4
34	38.3518	18.1	0.693	5.453	100.0	1.4896	24.0	666.0	20.2	396.90	30.59	5.0
50	23.6482	18.1	0.671	6.380	96.2	1.3861	24.0	666.0	20.2	396.90	23.69	13.1
114	67.9208	18.1	0.693	5.683	100.0	1.4254	24.0	666.0	20.2	384.97	22.98	5.0
116	41.5292	18.1	0.693	5.531	85.4	1.6074	24.0	666.0	20.2	329.46	27.38	8.5
143	24.3938	18.1	0.700	4.652	100.0	1.4672	24.0	666.0	20.2	396.90	28.28	10.5
167	20.0849	18.1	0.700	4.368	91.2	1.4395	24.0	666.0	20.2	285.83	30.63	8.8
177	20.7162	18.1	0.659	4.138	100.0	1.1781	24.0	666.0	20.2	370.22	23.34	11.9
273	37.6619	18.1	0.679	6.202	78.7	1.8629	24.0	666.0	20.2	18.82	14.52	10.9
285	73.5341	18.1	0.679	5.957	100.0	1.8026	24.0	666.0	20.2	16.45	20.62	8.8
377	25.0461	18.1	0.693	5.987	100.0	1.5888	24.0	666.0	20.2	396.90	26.77	5.6
380	88.9762	18.1	0.671	6.968	91.9	1.4165	24.0	666.0	20.2	396.90	17.21	10.4
400	51.1358	18.1	0.597	5.757	100.0	1.4130	24.0	666.0	20.2	2.60	10.11	15.0

For the values with extremely high crime rate, the proportion of the residential land over a 25000 sqft is all 0 and the age of the building is old (~80 or older).

2.2 Variable zn¶

In [10]:

# ********************* *************** #
#   "zn": proportion of residential 
# land zoned for lots over 25,000 sq.ft.
# ************************************* #
data["zn"].describe()

Out[10]:

count    404.000000
mean      10.509901
std       22.053733
min        0.000000
25%        0.000000
50%        0.000000
75%       12.500000
max       95.000000
Name: zn, dtype: float64

Interpretation¶

Basic statistics of "zn" seem to show that "zn" is zero for the 50%ile and lower. It is likely to be right skewed.

Let's plot the histogram and boxplot.

In [11]:

eda_numeric_plt(variable='zn',y_max=50, y_label='Proportion of residential')

Interpretation¶

Indeed. This variable is right skewed. Since many values are 0, suggesting there are few houses with lot size > 25000.

2.3. Variable indus¶

In [12]:

# ********************* *************** #
#   "indus": proportion of non-retail 
# . business acres per town
# ************************************* #
data["indus"].describe()

Out[12]:

count    404.000000
mean      11.189901
std        6.814909
min        0.460000
25%        5.190000
50%        9.795000
75%       18.100000
max       27.740000
Name: indus, dtype: float64

In [13]:

eda_numeric_plt(variable='indus',y_label='Proportion of non-retail business acres per town')

2.4. Variable chas¶

In [14]:

# ********************* *************** #
#   "chas": Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
# ************************************* #
# find out how many 0s and 1s
ia=np.where(data['chas']== 1)
print("counts of \'chas==1\' are: ", len(ia[0]))

counts of 'chas==1' are:  28

Interpretation¶

Only 28 housing is close to the charles river in this dataset.

2.5. Variable nox¶

In [15]:

# ********************* *************** #
#   "nox": nitrogen oxides concentration (parts per 10 million).
# ************************************* #
data['nox'].describe()

Out[15]:

count    404.000000
mean       0.556710
std        0.117321
min        0.392000
25%        0.453000
50%        0.538000
75%        0.631000
max        0.871000
Name: nox, dtype: float64

In [16]:

eda_numeric_plt(variable='nox',y_label='nitrogen oxides concentration')

Interpretation¶

"The national ambient air quality standard for NO2 is 50 parts per billion (ppb) (or 0.05 parts per million [ppm]) averaged over 1 year. Maximum 30-minute and 24-hour outdoor values of NO2 can be 0.45 ppm and 0.21 ppm, respectively. In homes with gas stoves, 15 to 25 ppb are added to the usual background level of NO2 .' The normal level for such homes ranges from 25 to 75 ppb." -- https://www.jacionline.org/article/S0091-6749(54)00063-2/pdf

the dataset's value doesn't really match the article I have found, but we know higher value of nox is worse

2.6. Variable rm¶

In [17]:

# ********************* *************** #
#   "rm": average number of rooms per dwelling.
# ************************************* #
data['rm'].describe()

Out[17]:

count    404.00000
mean       6.30145
std        0.67583
min        3.56100
25%        5.90275
50%        6.23050
75%        6.62925
max        8.78000
Name: rm, dtype: float64

In [18]:

eda_numeric_plt(variable='rm',y_label='average number of rooms per dwelling.')

Interpretation¶

most houses of this dataset has around 5~7 rooms

2.7. Variable age¶

In [19]:

# ********************* *************** #
#   "age": proportion of owner-occupied units built prior to 1940.
# ************************************* #
data['age'].describe()

Out[19]:

count    404.000000
mean      68.601733
std       28.066143
min        2.900000
25%       45.800000
50%       76.600000
75%       94.150000
max      100.000000
Name: age, dtype: float64

In [20]:

eda_numeric_plt(variable='age',y_label='proportion of owner-occupied units built prior to 1940')

Interpretation¶

the higher this number is the older it is, looks like a lot of building were built prior to 1940, the data was collected at 1978, so it does make sense

2.8. Variable dis¶

In [21]:

# ********************* *************** #
#   "dis": weighted mean of distances to five Boston employment centres.
# ************************************* # 
data['dis'].describe()

Out[21]:

count    404.000000
mean       3.799666
std        2.109916
min        1.169100
25%        2.087875
50%        3.207450
75%        5.222125
max       12.126500
Name: dis, dtype: float64

In [22]:

eda_numeric_plt(variable='dis',y_label='weighted mean of distances')

Interpretation¶

looks like most of building is pretty close to the employment centres, and it makes sense, because city has a higher density of the population than suburban.

2.9. Variable rad¶

In [23]:

# ********************* *************** #
#   "rad": index of accessibility to radial highways.
# ************************************* # 
data['rad'].describe()

Out[23]:

count    404.000000
mean       9.836634
std        8.834741
min        1.000000
25%        4.000000
50%        5.000000
75%       24.000000
max       24.000000
Name: rad, dtype: float64

In [24]:

eda_numeric_plt(variable='rad',y_label='index of accessibility to radial highways',y_max=35)

Interpretation¶

most of people has easy access to the highway, not much people are living 10-20 away from highway.
but there are also a large group of people are extremly far.
people tends to be either live very close to highway or very far from it

2.10. Variable tax¶

In [25]:

# ********************* *************** #
#   "tax": full-value property-tax rate per $10,000.
# ************************************* # 
data['tax'].describe()

Out[25]:

count    404.000000
mean     411.688119
std      171.073553
min      187.000000
25%      281.000000
50%      330.000000
75%      666.000000
max      711.000000
Name: tax, dtype: float64

In [26]:

eda_numeric_plt(variable='tax',y_label='full-value property-tax rate per $10,000')

Interpretation¶

most of people paying taxt rate around 200-400 per 10000,
not much people paying 500-600,
and a large group people paying around 700 per 10000, seems like they are live in the same community

2.11. Variable ptratio¶

In [27]:

# ********************* *************** #
#   "ptratio": pupil-teacher ratio by town.
# ************************************* # 
data['ptratio'].describe()

Out[27]:

count    404.000000
mean      18.444554
std        2.150295
min       12.600000
25%       17.375000
50%       19.000000
75%       20.200000
max       22.000000
Name: ptratio, dtype: float64

In [28]:

eda_numeric_plt(variable='ptratio',y_label='pupil-teacher ratio by town')

Interpretation¶

looks like the standard is around 1:20

2.12. Variable black¶

In [29]:

# ********************* *************** #
#   "black": 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
# ************************************* # 
data['black'].describe()

Out[29]:

count    404.000000
mean     355.068243
std       94.489572
min        0.320000
25%      374.710000
50%      391.065000
75%      396.007500
max      396.900000
Name: black, dtype: float64

In [30]:

eda_numeric_plt(variable='black',y_label='1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town',y_max=500)

2.13. Variable lstat¶

In [31]:

# ********************* *************** #
#   "lstat": lower status of the population (percent).
# ************************************* # 
data["lstat"].describe()

Out[31]:

count    404.000000
mean      12.598936
std        6.925173
min        1.730000
25%        7.135000
50%       11.265000
75%       16.910000
max       34.370000
Name: lstat, dtype: float64

In [32]:

eda_numeric_plt(variable='lstat',y_label='lower status of the population (percent)')

2.14. Variable medv¶

In [33]:

data["medv"].describe()

Out[33]:

count    404.000000
mean      22.312376
std        8.837019
min        5.000000
25%       17.100000
50%       21.400000
75%       25.000000
max       50.000000
Name: medv, dtype: float64

In [34]:

eda_numeric_plt(variable='medv',y_label='Median Value of owner occupied homes')

Interpretation¶

The distribution of median values of owner occupied homes are slightly right skewed with mean larger than median values.

2.14. Explore correlations between variables¶

In [35]:

def correlations(data, y, xs):
    rs = []
    rhos = []
    for x in xs:
        r = stats.pearsonr(data[y], data[x])[0]
        rs.append(r)
        rho = stats.spearmanr(data[y], data[x])[0]
        rhos.append(rho)
    return pd.DataFrame({"feature": xs, "r": rs, "rho": rhos})

variables = list(data.columns) 

for i, var_i in enumerate(variables):
    if i< len(variables)-1:
        print("correlations with %s" % var_i)
        print(correlations(data, var_i, variables[i+1:]))
    print("\n\n")

correlations with crim
    feature         r       rho
0        zn -0.194770 -0.552142
1     indus  0.404249  0.740333
2      chas -0.060688  0.042912
3       nox  0.405290  0.811619
4        rm -0.172226 -0.269534
5       age  0.349288  0.699253
6       dis -0.376493 -0.740802
7       rad  0.603044  0.740088
8       tax  0.565047  0.721795
9   ptratio  0.292092  0.468879
10    black -0.377436 -0.393906
11    lstat  0.428940  0.636931
12     medv -0.400956 -0.584999



correlations with zn
    feature         r       rho
0     indus -0.532268 -0.637018
1      chas -0.037290 -0.047635
2       nox -0.497114 -0.624603
3        rm  0.302404  0.349302
4       age -0.546587 -0.519148
5       dis  0.659018  0.608352
6       rad -0.311313 -0.268042
7       tax -0.289332 -0.323713
8   ptratio -0.371315 -0.423871
9     black  0.176350  0.185130
10    lstat -0.412075 -0.478700
11     medv  0.355608  0.435495



correlations with indus
    feature         r       rho
0      chas  0.059928  0.091522
1       nox  0.760913  0.791574
2        rm -0.353589 -0.380511
3       age  0.630048  0.663708
4       dis -0.702583 -0.751953
5       rad  0.616375  0.483960
6       tax  0.722828  0.652565
7   ptratio  0.356549  0.400937
8     black -0.378189 -0.347579
9     lstat  0.608309  0.635374
10     medv -0.501698 -0.589243



correlations with chas
   feature         r       rho
0      nox  0.100650  0.072012
1       rm  0.090146  0.055196
2      age  0.084852  0.059560
3      dis -0.103124 -0.087830
4      rad -0.022561  0.013109
5      tax -0.043708 -0.043678
6  ptratio -0.121835 -0.138431
7    black  0.042580 -0.064772
8    lstat -0.041061 -0.033678
9     medv  0.141400  0.114116



correlations with nox
   feature         r       rho
0       rm -0.266114 -0.267417
1      age  0.728882  0.796277
2      dis -0.762469 -0.875651
3      rad  0.612115  0.587807
4      tax  0.660607  0.633700
5  ptratio  0.166715  0.374583
6    black -0.393876 -0.348756
7    lstat  0.587573  0.631116
8     medv -0.439225 -0.570589



correlations with rm
   feature         r       rho
0      age -0.210589 -0.232704
1      dis  0.178700  0.218093
2      rad -0.173222 -0.077877
3      tax -0.241733 -0.213169
4  ptratio -0.323330 -0.279252
5    black  0.128410  0.077338
6    lstat -0.578025 -0.622317
7     medv  0.683541  0.630316



correlations with age
   feature         r       rho
0      dis -0.734915 -0.796290
1      rad  0.467341  0.429168
2      tax  0.503830  0.512626
3  ptratio  0.231820  0.328111
4    black -0.276469 -0.250842
5    lstat  0.602001  0.654449
6     medv -0.390863 -0.552470



correlations with dis
   feature         r       rho
0      rad -0.503471 -0.503126
1      tax -0.526938 -0.554694
2  ptratio -0.206016 -0.303989
3    black  0.306445  0.296868
4    lstat -0.500912 -0.566598
5     medv  0.264876  0.458146



correlations with rad
   feature         r       rho
0      tax  0.917344  0.727690
1  ptratio  0.487733  0.354271
2    black -0.468700 -0.299539
3    lstat  0.499279  0.418529
4     medv -0.423508 -0.392268



correlations with tax
   feature         r       rho
0  ptratio  0.465997  0.449532
1    black -0.462250 -0.356075
2    lstat  0.542732  0.518977
3     medv -0.495792 -0.569819



correlations with ptratio
  feature         r       rho
0   black -0.194891 -0.121272
1   lstat  0.351408  0.447951
2    medv -0.506313 -0.550071



correlations with black
  feature         r       rho
0   lstat -0.388913 -0.258902
1    medv  0.360071  0.239874



correlations with lstat
  feature         r       rho
0    medv -0.742695 -0.844794

Interpretation¶

From correlation analyses, result show that, "crim" (crime rate) is anti-correlated with "zn" (proportion of residential areas in a 25,000, and positively correlated with "indus" (the proportion of the non-retail business acres). In addition, there is a negative correlation between "zn" and "indus".

When we build our predition model, we have to keep the collinearity between variables, so that we don't include variables with strong correlations as our predictors.

In [36]:

#Heat map
plt.figure ( figsize = ( 20, 10 ) )
sns.heatmap ( data.corr().abs(),  annot = True )

Out[36]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fece0a5ab70>

In [37]:

import warnings
warnings.filterwarnings('ignore')
from sklearn import preprocessing
#Let's scale the columns before plotting them against medv
min_max_scaler = preprocessing.MinMaxScaler()
column_sels = [ 'lsat', 'indus', 'nox', 'ptratio', 'rm', 'tax', 'dis', 'age' ]
x = data.loc[:,column_sels]
y = data[ 'medv' ]
x = pd.DataFrame ( data = min_max_scaler.fit_transform(x), columns = column_sels )
fig, axs = plt.subplots ( ncols = 4, nrows = 2, figsize = ( 20, 10 ) )
index = 0
axs = axs.flatten()
for i, k in enumerate ( column_sels ):
    sns.regplot ( y = y, x = x [ k ], ax = axs[ i ] )
plt.tight_layout ( pad = 0.4, w_pad = 0.5, h_pad = 5.0 )

Interpretation¶

Based on the results above, we may try predict medv with 'lsat', 'indus', 'nox', 'ptratio', 'rm', 'tax', 'dis', 'age' features.

Intuitively we can infer the following

An increase in the value of 'rm' might result in an increase in the value of 'medv', since more rooms indicate bigger home size thus can accommodate more people.
An increase in the value of 'lsat', might result in an decrease in the value of 'medv', since low income people tend to have a low purchasing power.
An increase in the value of 'ptratio', might result in an increase in the value of 'medv', because a low 'ptratio' can indicate better education for the children in the town.

From the correlations figures shown above, we can infer the following:

An increase in the value of 'rm' will result in an increase in the value of 'medv', since they have a positive correlation.
An increase in the value of 'lsat', w result in an decrease in the value of 'medv', since they have a negative correlation.

3. Model development and evaluation¶

3 linear models+ 1 decision tree model+ 1 knn model

3.1 Linear Model 1¶

Some initial thoughts and plans¶

From the first round of EDA analysis, we have looked at some basic statistics from individual variables in our datasets. We plotted their histograms and looked at their correlations.

We have found collinearity between many variables in this dataset.

The first model started with building a linear model using variables having strong correlations with the target variable ("medv": median value of owner-occupied homes in $1000s), but avoiding using variables with strong correlations with each other.

Variables with strong correlation with the targe variable (r>0.5) are: indus, nox, rm, rad, tax, lstat. Since indus and nox are strongly correlated (r= 0.76), we only used indus (indus and medv have a stronger correlation). Similary, indus also have strong correlation with tax and lstat. We chose to use lstat since it has the highest correlation with medv. Thus, this model started with "rm, rad, lstat" as predictors of "medv".

From the first several runs of model development and evaluations of residuals, we found that we had to transform some data variables.

In [38]:

# data transformation

# divide rad into two categories < 15 & > 15
data['rad_lt_15'] = data.rad.apply(lambda x: 1 if x<15 else 0)

# convert lstat to a log scale
data['lstat_log'] = np.log(data.lstat)

data.head()

Out[38]:

	crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	black	lstat	medv	rad_lt_15	lstat_log
0	0.15876	0.0	10.81	0.413	5.961	17.5	5.2873	4.0	305.0	19.2	376.94	9.88	21.7	1	2.290513
1	0.10328	25.0	5.13	0.453	5.927	47.2	6.9320	8.0	284.0	19.7	396.90	9.22	19.6	1	2.221375
2	0.34940	0.0	9.90	0.544	5.972	76.7	3.1025	4.0	304.0	18.4	396.24	9.97	20.3	1	2.299581
3	2.73397	0.0	19.58	0.871	5.597	94.9	1.5257	5.0	403.0	14.7	351.85	21.45	15.4	1	3.065725
4	0.04337	21.0	5.64	0.439	6.115	63.0	6.8147	4.0	243.0	16.8	393.97	9.43	20.5	1	2.243896

In [39]:

import models

# build the second linear model
model = " medv ~ rm + rad_lt_15 + lstat_log"
result = models.bootstrap_linear_regression(model, data=data)
models.describe_bootstrap_lr(result)

Out[39]:

Linear Regression Results
Coefficients
$\theta$		95% BCI
($\beta_0$)	15.90	(0.64, 27.94)
rm ($\beta_1$)	3.98	(2.89, 5.62)
rad_lt_15 ($\beta_2$)	2.12	(0.19, 3.48)
lstat_log ($\beta_3$)	-8.52	(-10.51, -6.80)
Metrics
$\sigma$	4.76	(4.17, 5.32)
$R^2$	0.71	(0.66, 0.78)

Interpretation¶

With the this model, results show an R2 of 0.71. Now, we considered using cross validation to build more bootstrap ensembles of R2.

Cross validation¶

In [40]:

import random
import patsy

def chunk(xs, n):
    k, m = divmod(len(xs), n)
    return [xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def cross_validation(algorithm, formula, data, evaluate, fold_count=10, repetitions=3):
    indices = list(range(len( data)))
    metrics = []
    for _ in range(repetitions):
        random.shuffle(indices)
        folds = chunk(indices, fold_count)
        for fold in folds:
            test_data = data.iloc[fold]
            train_indices = [idx not in fold for idx in indices]
            train_data = data.iloc[train_indices]
            result = algorithm(formula, data=train_data)
            model = result["model"]
            y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
            # y = np.ravel( y) # might need for logistic regression
            results = models.summarize(formula, X, y, model)
            metric = evaluate(results)
            metrics.append(metric)
    return metrics


lr_result_1 = cross_validation(models.linear_regression, model, data, lambda r: (r["sigma"], r["r_squared"]))
print(r"95% CI for R^2:", stats.mstats.mquantiles([r[1] for r in lr_result_1], [0.025, 0.975]))

95% CI for R^2: [0.37879442 0.82624228]

Intepretation¶

divided data into 10 chunks. repeat the analysis with 3 runs and came out with 30 independent estimates of R2 for this model.

Result suggest the R2 of this model is between 0.47 and 0.85 at a 95% confidence interval.

3.2 Linear Model 2¶

Linear model 2 was contributed by a different group member compared to the linear model 1. We now considered a few more different variables. Note that we expect a better fit to our predictor "medv" with more variables. But we did not consider calculating an adjusted R2 for this exercise.

In [41]:

# data transformation
data["log_dis"] = data["dis"].apply(np.log)
data["log_crim"] = data["crim"].apply(np.log)
data["sqrt_lstat"] = data["lstat"].apply(np.sqrt)

In [42]:

model2 = "medv ~ rm + sqrt_lstat + ptratio + log_dis + log_crim" 
result2 = models.bootstrap_linear_regression(model2, data=data)
models.describe_bootstrap_lr(result2)

Out[42]:

Linear Regression Results
Coefficients
$\theta$		95% BCI
($\beta_0$)	35.29	(22.25, 45.97)
rm ($\beta_1$)	3.68	(2.53, 4.94)
sqrt_lstat ($\beta_2$)	-5.26	(-6.13, -3.95)
ptratio ($\beta_3$)	-0.79	(-1.00, -0.53)
log_dis ($\beta_4$)	-3.36	(-5.12, -1.76)
log_crim ($\beta_5$)	-0.53	(-0.87, -0.19)
Metrics
$\sigma$	4.56	(4.03, 5.20)
$R^2$	0.74	(0.66, 0.78)

Interpretation¶

Linear model 2 seem to show a higher R2. Now, let's run cross validation.

Cross validation¶

In [43]:

lr_result_2 = cross_validation(models.linear_regression, model2, data, lambda r: (r["sigma"], r["r_squared"]))
print(r"95% CI for R^2:", stats.mstats.mquantiles([r[1] for r in lr_result_2], [0.025, 0.975]))

95% CI for R^2: [0.36708378 0.86339239]

More interpretation¶

From bootstrap ensemble R2 results, they suggest the 95% confidence interval for linear model 2 is between 0.42 and 0.86.

3.3 Linear Model 3¶

Linear model 3 originated from one group member different from models 1 and 2. This model took consideration of a different suite of variables.

In [44]:

model3 = "medv ~ black + tax + dis + lstat_log + ptratio + rm"
results = models.linear_regression(model3, data)
models.simple_describe_lr(results)

Out[44]:

Linear Regression Results
Coefficients
$\theta$
($\beta_0$)	37.14
black ($\beta_1$)	0.01
tax ($\beta_2$)	-0.01
dis ($\beta_3$)	-0.78
lstat_log ($\beta_4$)	-8.85
ptratio ($\beta_5$)	-0.68
rm ($\beta_6$)	3.31
Metrics
$\sigma$	4.29
$R^2$	0.77

Cross validation¶

In [45]:

lr_result_3 = cross_validation(models.linear_regression, model3, data, lambda r: (r["sigma"], r["r_squared"]))
print(r"95% CI for R^2:", stats.mstats.mquantiles([r[1] for r in lr_result_3], [0.025, 0.975]))

95% CI for R^2: [0.5354853  0.90574466]

More interpretation¶

The 95% confidence interval for model 3 is between 0.58 and 0.86.

3.4 Decision Tree Regression¶

In this method, we first use a validation curve to assess how many tree depths we need to consider.

Validation curve¶

In [46]:

# build a cross validation curve for decision tree 
def decisiontree_cross_validation(formula, builder, data, fold_count=10, repetitions=3):
    indices = list(range(len( data)))
    metrics = {"train": [], "test": []}
    for _ in range(repetitions):
        random.shuffle(indices)
        folds = chunk(indices, fold_count)
        for fold in folds:
            test_data = data.iloc[fold]
            train_indices = [idx not in fold for idx in indices]
            train_data = data.iloc[train_indices]
            # y, X for training data
            y, X = patsy.dmatrices(formula, train_data, return_type="matrix")
            model = builder.fit(X, y)
            y_hat = model.predict(X)
            training_r_squared = (stats.pearsonr(y[:,0], y_hat)[0])**2
            metrics["train"].append(training_r_squared)
            # y, X for training data
            y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
            y_hat = model.predict(X)
            test_r_squared = (stats.pearsonr(y[:,0], y_hat)[0])**2
            metrics["test"].append(test_r_squared)
    return metrics

In [47]:

from sklearn.tree import DecisionTreeRegressor

# test a range of tree depths and create a validation curve
test_curve = []
train_curve = []
for k in range(1, 21):
    builder = DecisionTreeRegressor(max_depth=k)
    results = decisiontree_cross_validation(model, builder, data)
    test_curve.append(stats.mstats.mquantiles(results['test'], [0.025, 0.5, 0.975]))
    train_curve.append(stats.mstats.mquantiles(results['train'], [0.025, 0.5, 0.975]))

# visualize the results
xs = list(range(1, 21))
figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.
axes = figure.add_subplot(1, 1, 1)

test_lower, test_mid, test_upper = zip(*test_curve)
train_lower, train_mid, train_upper = zip(*train_curve)

axes.fill_between(xs, test_lower, test_upper, alpha=0.25, color="firebrick", label="test")
axes.plot(xs, test_mid, color="firebrick")

axes.fill_between(xs, train_lower, train_upper, alpha=0.25, color="cornflowerblue", label="train")
axes.plot(xs, train_mid, color="cornflowerblue")

axes.set_xticks(xs)
axes.set_xticklabels([str(x) for x in xs])

axes.legend()
axes.set_title("Validation Curves for Decision Tree")
axes.set_xlabel("Depth of Decision Tree")
axes.set_ylabel("$R^2$")

plt.show()
plt.close()

Decision¶

Cross validation curve suggest depth = 13 show the best performance.

We also need to decide if we are overfitting the data. Now, let's plot a learning curve.

Learning Curve¶

In [48]:

from collections import defaultdict

def data_collection():
    result = dict()
    result[ "train"] = defaultdict( list)
    result[ "test"] = defaultdict( list)
    return result

def decisiontree_learning_curves(builder, formula, data, fold_count=10, repetitions=3, increment=1):
    indices = list(range(len( data)))
    results = data_collection()
    for _ in range(repetitions):
        random.shuffle(indices)
        folds = chunk(indices, fold_count)
        for fold in folds:
            test_data = data.iloc[ fold]
            train_indices = [idx for idx in indices if idx not in fold]
            train_data = data.iloc[train_indices]
            for i in list(range(increment, 100, increment)) + [100]: # ensures 100% is always picked.
                # the indices are already shuffled so we only need to take ever increasing chunks
                train_chunk_size = int( np.ceil((i/100)*len( train_indices)))
                train_data_chunk = data.iloc[train_indices[0:train_chunk_size]]
                # we calculate the model
                y, X = patsy.dmatrices(formula, train_data_chunk, return_type="matrix")
                model = builder.fit(X, y)
                y_hat = model.predict(X)
                training_r_squared = (stats.pearsonr(y[:,0], y_hat)[0])**2
                results["train"][i].append(training_r_squared)
                
                # y, X for test data
                y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
                y_hat = model.predict(X)
                test_r_squared = (stats.pearsonr(y[:,0], y_hat)[0])**2
                results["test"][i].append(test_r_squared)
    # process results
    statistics = {}
    for k, v in results["train"].items():
        statistics[ k] = (np.mean(v), np.std(v))
    results["train"] = statistics
    statistics = {}
    for k, v in results["test"].items():
        statistics[ k] = (np.mean(v), np.std(v))
    results["test"] = statistics
    return results

# plot a learning curve
builder = DecisionTreeRegressor(max_depth=13)
result = decisiontree_learning_curves(builder, model, data)
def results_to_curves( curve, results):
    all_statistics = results[ curve]
    keys = list( all_statistics.keys())
    keys.sort()
    mean = []
    upper = []
    lower = []
    for k in keys:
        m, s = all_statistics[ k]
        mean.append( m)
        upper.append( m + 2 * s)
        lower.append( m - 2 * s)
    return keys, lower, mean, upper

def plot_learning_curves( results, metric, zoom=False):
    figure = plt.figure(figsize=(10,6))

    axes = figure.add_subplot(1, 1, 1)

    xs, train_lower, train_mean, train_upper = results_to_curves( "train", results)
    _, test_lower, test_mean, test_upper = results_to_curves( "test", results)

    axes.plot( xs, train_mean, color="steelblue")
    axes.fill_between( xs, train_upper, train_lower, color="steelblue", alpha=0.25, label="train")
    axes.plot( xs, test_mean, color="firebrick")
    axes.fill_between( xs, test_upper, test_lower, color="firebrick", alpha=0.25, label="test")
    axes.legend()
    axes.set_xlabel( "training set (%)")
    axes.set_ylabel( metric)
    axes.set_title("Learning Curves")

    if zoom:
        y_lower = int( 0.9 * np.amin([train_lower[-1], test_lower[-1]]))
        y_upper = int( 1.1 * np.amax([train_upper[-1], test_upper[-1]]))
        axes.set_ylim((y_lower, y_upper))

    plt.show()
    plt.close()

In [49]:

plot_learning_curves(result, "$R^2$")

Interpretation¶

The learning curve suggests taking > 10% of the data should be sufficient.

Creating ensemble of R2 estimates¶

In [50]:

DT_results = decisiontree_cross_validation(model, builder, data)
print(r"95% CI for R^2:", stats.mstats.mquantiles(DT_results['test'], [0.025, 0.975]))

95% CI for R^2: [0.77645148 0.99648919]

Interpretation¶

Results suggest the 95% confidence interval of R2 in the decision tree method is between 0.88 and 0.99.

3.5 K nearest neighbour¶

In [51]:

from sklearn.preprocessing import StandardScaler
import sklearn.neighbors as neighbors
import sklearn.linear_model as linear
import random
import patsy

In [57]:

columns = ['medv', 'black' , 'tax' , 'dis' , 'lstat_log' , 'ptratio' , 'rm'] 
scaler = StandardScaler()
data[columns] = scaler.fit_transform(data[columns])

In [58]:

def chunk(xs, n):
    k, m = divmod(len(xs), n)
    return [xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def knn_cross_validation(formula, builder, data, fold_count=10, repetitions=3):
    indices = list(range(len( data)))
    metrics = {"train": [], "test": []}
    for _ in range(repetitions):
        random.shuffle(indices)
        folds = chunk(indices, fold_count)
        for fold in folds:
            test_data = data.iloc[fold]
            train_indices = [idx not in fold for idx in indices]
            train_data = data.iloc[train_indices]
            # y, X for training data
            y, X = patsy.dmatrices(formula, train_data, return_type="matrix")
            model = builder.fit(X, y)
            y_hat = model.predict(X)
            training_r_squared = (stats.pearsonr(y, y_hat)[0][0])**2
            metrics["train"].append(training_r_squared)
            # y, X for training data
            y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
            y_hat = model.predict(X)
            test_r_squared = (stats.pearsonr(y, y_hat)[0][0])**2
            metrics["test"].append(test_r_squared)
    return metrics

def data_collection(test_curve, train_curve):
    xs = list(range(1, 21))
    figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.
    axes = figure.add_subplot(1, 1, 1)

    test_lower, test_mid, test_upper = zip(*test_curve)
    train_lower, train_mid, train_upper = zip(*train_curve)

    axes.fill_between(xs, test_lower, test_upper, alpha=0.25, color="firebrick", label="test")
    axes.plot(xs, test_mid, color="firebrick")

    axes.fill_between(xs, train_lower, train_upper, alpha=0.25, color="cornflowerblue", label="train")
    axes.plot(xs, train_mid, color="cornflowerblue")

    axes.set_xticks(xs)
    axes.set_xticklabels([str(x) for x in xs])

    axes.legend()
    axes.set_title("Validation Curves for kNN")
    axes.set_xlabel("k")
    axes.set_ylabel("$R^2$")

    plt.show()
    plt.close()
from collections import defaultdict
def create_data_collection():
    result = dict()
    result["train"] = defaultdict(list)
    result["test"]  = defaultdict(list)
    return result
def knn_learning_curves(builder, formula, data, fold_count=10, repetitions=3, increment=1):
    indices = list(range(len( data)))
    results = create_data_collection()
    for _ in range(repetitions):
        random.shuffle(indices)
        folds = chunk(indices, fold_count)
        for fold in folds:
            test_data = data.iloc[ fold]
            train_indices = [idx for idx in indices if idx not in fold]
            train_data = data.iloc[train_indices]
            for i in list(range(increment, 100, increment)) + [100]: # ensures 100% is always picked.
                # the indices are already shuffled so we only need to take ever increasing chunks
                train_chunk_size = int( np.ceil((i/100)*len( train_indices)))
                train_data_chunk = data.iloc[train_indices[0:train_chunk_size]]
                # we calculate the model
                y, X = patsy.dmatrices(formula, train_data_chunk, return_type="matrix")
                model = builder.fit(X, y)
                y_hat = model.predict(X)
                training_r_squared = (stats.pearsonr(y, y_hat)[0][0])**2
                results["train"][i].append(training_r_squared)
                
                # y, X for test data
                y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
                y_hat = model.predict(X)
                test_r_squared = (stats.pearsonr(y, y_hat)[0][0])**2
                results["test"][i].append(test_r_squared)
    # process results
    # Rely on the CLT...
    statistics = {}
    for k, v in results["train"].items():
        statistics[ k] = (np.mean(v), np.std(v))
    results["train"] = statistics
    statistics = {}
    for k, v in results["test"].items():
        statistics[ k] = (np.mean(v), np.std(v))
    results["test"] = statistics
    return results
#
def results_to_curves( curve, results):
    all_statistics = results[ curve]
    keys = list( all_statistics.keys())
    keys.sort()
    mean = []
    upper = []
    lower = []
    for k in keys:
        m, s = all_statistics[ k]
        mean.append( m)
        upper.append( m + 2 * s)
        lower.append( m - 2 * s)
    return keys, lower, mean, upper
def plot_learning_curves( results, metric, zoom=False):
    figure = plt.figure(figsize=(10,6))

    axes = figure.add_subplot(1, 1, 1)

    xs, train_lower, train_mean, train_upper = results_to_curves( "train", results)
    _, test_lower, test_mean, test_upper = results_to_curves( "test", results)

    axes.plot( xs, train_mean, color="steelblue")
    axes.fill_between( xs, train_upper, train_lower, color="steelblue", alpha=0.25, label="train")
    axes.plot( xs, test_mean, color="firebrick")
    axes.fill_between( xs, test_upper, test_lower, color="firebrick", alpha=0.25, label="test")
    axes.legend()
    axes.set_xlabel( "training set (%)")
    axes.set_ylabel( metric)
    axes.set_title("Learning Curves")

    if zoom:
        y_lower = int( 0.9 * np.amin([train_lower[-1], test_lower[-1]]))
        y_upper = int( 1.1 * np.amax([train_upper[-1], test_upper[-1]]))
        axes.set_ylim((y_lower, y_upper))

    plt.show()
    plt.close()

In [63]:

formula = "medv ~ black + tax + dis + lstat_log + ptratio + rm"

test_curve = []
train_curve = []
for k in range(1, 21):
    builder = neighbors.KNeighborsRegressor(k)
    results = knn_cross_validation(formula, builder, data)
    test_curve.append(stats.mstats.mquantiles(results['test'], [0.025, 0.5, 0.975]))
    train_curve.append(stats.mstats.mquantiles(results['train'], [0.025, 0.5, 0.975]))
data_collection(test_curve, train_curve)

In [64]:

result = knn_learning_curves(neighbors.KNeighborsRegressor(1), formula, data)
plot_learning_curves(result, "$R^2$")

Creating ensemble of R2 estimates

In [66]:

builder = neighbors.KNeighborsRegressor(1)
knn_results = knn_cross_validation(formula, builder, data)
print(r"95% CI for R^2:", stats.mstats.mquantiles(knn_results['test'], [0.025, 0.975]))

95% CI for R^2: [0.66825749 0.9997024 ]

Interpretation

Results suggest the 95% confidence interval of R2 in the decision tree method is between 0.67 and 0.99.

3.6 Model evaluation using Bayesian inference¶

In this section, we will evaluate the overall performance of the four models considered.

Histograms of R2¶

In [68]:

def bootstrap_sample( data, f, n=100):
    result = []
    m = len( data)
    for _ in range( n):
        sample = np.random.choice( data, len(data), replace=True)
        r = f( sample)
        result.append( r)
    return np.array( result)

In [79]:

lr1_r2 = [r[1] for r in lr_result_1]
lr2_r2 = [r[1] for r in lr_result_2]
lr3_r2 = [r[1] for r in lr_result_3]
dt_r2 = DT_results['test']
knn_bootstrap = bootstrap_sample(knn_results['test'], np.mean,30)

sns.set(style="whitegrid")
# visualize result
figure = plt.figure(figsize=(8, 8)) # first element is width, second is height.

axes = figure.add_subplot(2, 3, 1)
axes.hist(lr1_r2, density=True)
axes.set_ylabel( "Density")
axes.set_title( "Linear Model 1")

axes = figure.add_subplot(2,3, 2)
axes.hist(lr2_r2, density=True)
axes.set_ylabel( "Density")
axes.set_title( "Linear Model 2")

axes = figure.add_subplot(2,3, 3)
axes.hist( lr3_r2, density=True)
axes.set_ylabel( "Density")
axes.set_xlabel( "$R^2$")
axes.set_title( "Linear Model 3")

axes = figure.add_subplot(2,3, 4)
axes.hist( dt_r2, density=True)
axes.set_ylabel( "Density")
axes.set_xlabel( "$R^2$")
axes.set_title( "Decision Tree")

axes = figure.add_subplot(2,3, 5)
axes.hist( knn_bootstrap, density=True)
axes.set_ylabel( "Density")
axes.set_xlabel( "$R^2$")
axes.set_title( "k-nearest neighbors")


plt.show()
plt.close()

Interpretation¶

Results suggest R2 from the 1st linear model centered around 0.7, whereas linear model 2 is centered around 0.8. Linear Model 3 seems to have a highr R2 compared to the other 2 linear models. But the R2 in knn regression method show likely the highest R2.

Let's look at this more quantitatively.

Look at the differences¶

(1) Compare linear models 1 and 2¶

In [54]:

diff1 = np.array(lr2_r2) - np.array(lr1_r2)
print("P(LR2 > LR1)", np.mean(diff1 > 0))

P(LR2 > LR1) 0.5

Statistically, the chance that Linear Model 2 performs better than Linear Model 1 is only 57%.

(2) Compare linear model 3 with linear models 1 and 2¶

In [55]:

diff2a = np.array(lr3_r2) - np.array(lr1_r2)
print("P(LR3 > LR1)", np.mean(diff2a > 0))
diff2b = np.array(lr3_r2) - np.array(lr2_r2)
print("P(LR3 > LR2)", np.mean(diff2b > 0))

P(LR3 > LR1) 0.8
P(LR3 > LR2) 0.7

From the results above, they suggest the probability that linear model 3 performs bettter than models 1 and 2 is only 0.53 - 0.63.

(3) Compare decision tree model with linear models¶

In [56]:

diff3a = np.array(dt_r2) - np.array(lr1_r2)
print("P(DT > LR1)", np.mean(diff3a > 0))
diff3b = np.array(dt_r2) - np.array(lr2_r2)
print("P(DT > LR2)", np.mean(diff3b > 0))
diff3c = np.array(dt_r2) - np.array(lr3_r2)
print("P(DT > LR3)", np.mean(diff3c > 0))

P(DT > LR1) 0.9666666666666667
P(DT > LR2) 0.9666666666666667
P(DT > LR3) 0.9666666666666667

(4) Compare K nearest neighbour model with linear models

In [86]:

diff3a = np.array(knn_bootstrap) - np.array(lr1_r2)
print("P(KNN > LR1)", np.mean(diff3a > 0))
diff3b = np.array(knn_bootstrap) - np.array(lr2_r2)
print("P(KNN > LR2)", np.mean(diff3b > 0))
diff3c = np.array(knn_bootstrap) - np.array(lr3_r2)
print("P(KNN > LR3)", np.mean(diff3c > 0))
diff4c = np.array(knn_bootstrap) - np.array(dt_r2)
print("P(KNN > dt_r2)", np.mean(diff4c > 0))

P(DT > LR1) 1.0
P(DT > LR2) 1.0
P(DT > LR3) 1.0
P(DT > dt_r2) 0.5333333333333333

Results suggest the K nearest regression is much better than the performance of the three linear models we considered and comparable with the performance of the decision tree regression.

4. Conclusions¶

In this project, we selected the Boston housing dataset for our analysis and modeling. We conducted detailed EDA for individual variables and correlation analyses among any two variables.

From the EDA analysis, we constructed three linear model, one decision tree model, and one KNN model. We found that there was 80% of chance that the third linear model may perform slightly better than the other two linear models. But, the decision tree and KNN models are much better than any of the three linear models.