Boston Housing Price
Posted on 12/08/2019 in Python Data Science
Boston Housing¶
1. Introduction / Background¶
1.1. Data Selection¶
Boston housing dataset: https://www.kaggle.com/c/boston-housing/overview
1.2. Data description on Boston housing dataset¶
The Boston data frame has 404 rows and 14 columns.
This data frame contains the following columns:
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
- lstat: lower status of the population (percent).
- medv: median value of owner-occupied homes in $1000s.
Sources
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
1.3. Goal and objectives¶
The goal is to develop a model that best predicts Boston housing prices.
2. EDA¶
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
#Display for notebooks
%matplotlib inline
#Load the Boston housing dataset
data = pd.read_csv('boston_data_1.csv')
# Note that 'medv' has been multiplicatively scaled to account for 35 years of market inflation
prices = data['medv']
features = data.drop('medv', axis=1)
#Obtain general info on dataframe
data.info()
#Look at first few rows of dataframe
data.head()
# Histograms for each variable
data.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1)
plt.show()
Interpretation¶
chas, crim, dis, zn, lstat: appear to be right skewed.
age, black, ptratio: appear to be left skewed.
medv, rm: appear to be roughly normally distributed.
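These visual impressions can be checked numerically; a minimal sketch using pandas' built-in skew() (positive values indicate right skew, negative values left skew):
# Sketch: quantify skewness per column (positive = right-skewed, negative = left-skewed)
data.skew(numeric_only=True).sort_values(ascending=False)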
2.1. Variable crim¶
# ********************* *************** #
# Learn more on the variable "crim"
# ************************************* #
# what are the basic statistics
data["crim"].describe()
Interpretation¶
The median is much smaller than the mean, and the max is much larger than the 75th percentile, suggesting there are likely some outliers in this variable as well.
Let's plot the histogram and boxplot for this variable.
def eda_numeric_plt(variable, y_max=-99, y_label=''):
fig = plt.figure(figsize=(12,4))
ax = fig.add_subplot(1,2,1)
ax.hist(data[variable],50, color='lightblue', edgecolor='k')
ax.set_xlabel(variable, fontsize=14)
ax.set_ylabel("Counts", fontsize=14)
ax.tick_params(labelsize=14)
ax.set_title('Distribution of {}'.format(variable), fontsize=14)
ax = fig.add_subplot(1,2,2)
ax.boxplot(data[variable], widths=0.3)
# if no value is provided, we skip next line
if y_max != -99:
ax.set_ylim([-1, y_max])
ax.set_ylabel(y_label, fontsize=14)
ax.tick_params(labelsize=14)
ax.set_title('Boxplot', fontsize=16)
plt.rcParams['axes.linewidth'] = 1
plt.setp(ax.spines.values(), color='k')
plt.show()
plt.close()
eda_numeric_plt(variable='crim', y_max=20, y_label='Per Capita Crime Rate')
Interpretation¶
This variable is very right-skewed. It is not clear how realistic a per capita crime rate of 88 is, so it is worth investigating the rows with large per capita values further.
ia = np.where(data.crim>20)[0]
data.iloc[ia]
For the rows with extremely high crime rates, the proportion of residential land zoned for lots over 25,000 sq.ft. ('zn') is 0 in every case, and the buildings are old ('age' around 80 or higher).
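As a rough check on how many rows the boxplot flags as outliers, a sketch using Tukey's 1.5 x IQR rule on 'crim' (the rule itself is an assumption, not part of the original analysis):
# Sketch: count 'crim' values beyond the upper 1.5*IQR whisker (Tukey's rule)
q1, q3 = data['crim'].quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
print("upper fence:", round(upper_fence, 2), "| rows above it:", (data['crim'] > upper_fence).sum())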
2.2 Variable zn¶
# ********************* *************** #
# "zn": proportion of residential
# land zoned for lots over 25,000 sq.ft.
# ************************************* #
data["zn"].describe()
Interpretation¶
Basic statistics of "zn" show that it is zero at the 50th percentile and below, so it is likely to be right skewed.
Let's plot the histogram and boxplot.
eda_numeric_plt(variable='zn',y_max=50, y_label='Proportion of residential')
Interpretation¶
Indeed, this variable is right skewed. Many values are 0, suggesting there are few tracts with lots larger than 25,000 sq.ft.
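A one-line sketch to quantify how many tracts have no large-lot zoning at all:
# Sketch: share of tracts with zn == 0 (no residential lots zoned over 25,000 sq.ft.)
print("share of zn == 0:", round((data['zn'] == 0).mean(), 3))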
2.3. Variable indus¶
# ********************* *************** #
# "indus": proportion of non-retail
#          business acres per town
# ************************************* #
data["indus"].describe()
eda_numeric_plt(variable='indus',y_label='Proportion of non-retail business acres per town')
2.4. Variable chas¶
# ********************* *************** #
# "chas": Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
# ************************************* #
# find out how many 0s and 1s
ia=np.where(data['chas']== 1)
print("counts of \'chas==1\' are: ", len(ia[0]))
Interpretation¶
Only 28 tracts in this dataset bound the Charles River.
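The same count can be cross-checked directly with value_counts (a minimal sketch):
# Sketch: tabulate the Charles River dummy
data['chas'].value_counts()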
2.5. Variable nox¶
# ********************* *************** #
# "nox": nitrogen oxides concentration (parts per 10 million).
# ************************************* #
data['nox'].describe()
eda_numeric_plt(variable='nox',y_label='nitrogen oxides concentration')
Interpretation¶
"The national ambient air quality standard for NO2 is 50 parts per billion (ppb) (or 0.05 parts per million [ppm]) averaged over 1 year. Maximum 30-minute and 24-hour outdoor values of NO2 can be 0.45 ppm and 0.21 ppm, respectively. In homes with gas stoves, 15 to 25 ppb are added to the usual background level of NO2 .' The normal level for such homes ranges from 25 to 75 ppb." -- https://www.jacionline.org/article/S0091-6749(54)00063-2/pdf
The dataset's values don't really match the article above, but we know that higher values of nox are worse.
2.6. Variable rm¶
# ********************* *************** #
# "rm": average number of rooms per dwelling.
# ************************************* #
data['rm'].describe()
eda_numeric_plt(variable='rm',y_label='average number of rooms per dwelling.')
Interpretation¶
Most houses in this dataset have around 5-7 rooms.
2.7. Variable age¶
# ********************* *************** #
# "age": proportion of owner-occupied units built prior to 1940.
# ************************************* #
data['age'].describe()
eda_numeric_plt(variable='age',y_label='proportion of owner-occupied units built prior to 1940')
Interpretation¶
The higher this number, the older the housing stock. A large share of the buildings were built prior to 1940; since the data were collected in 1978, this makes sense.
2.8. Variable dis¶
# ********************* *************** #
# "dis": weighted mean of distances to five Boston employment centres.
# ************************************* #
data['dis'].describe()
eda_numeric_plt(variable='dis',y_label='weighted mean of distances')
Interpretation¶
Most buildings are fairly close to the employment centres, which makes sense because the city has a higher population density than the suburbs.
2.9. Variable rad¶
# ********************* *************** #
# "rad": index of accessibility to radial highways.
# ************************************* #
data['rad'].describe()
eda_numeric_plt(variable='rad',y_label='index of accessibility to radial highways',y_max=35)
Interpretation¶
Most tracts have low values of the accessibility index, few fall in the 10-20 range, and there is a large cluster at the maximum value. Tracts tend to sit at one extreme of the index or the other.
2.10. Variable tax¶
# ********************* *************** #
# "tax": full-value property-tax rate per $10,000.
# ************************************* #
data['tax'].describe()
eda_numeric_plt(variable='tax',y_label='full-value property-tax rate per $10,000')
Interpretation¶
Most tracts have a tax rate of around 200-400 per $10,000, few fall in the 500-600 range, and a large group sits around 700 per $10,000, suggesting those tracts belong to the same community.
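To quantify the high-tax cluster, a small sketch; the cutoff of 600 is an assumption read off the histogram, not a value from the original analysis:
# Sketch: size of the high-tax cluster (cutoff of 600 is an eyeballed assumption)
print("tracts with tax > 600:", (data['tax'] > 600).sum())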
2.11. Variable ptratio¶
# ********************* *************** #
# "ptratio": pupil-teacher ratio by town.
# ************************************* #
data['ptratio'].describe()
eda_numeric_plt(variable='ptratio',y_label='pupil-teacher ratio by town')
Interpretation¶
It looks like the typical pupil-teacher ratio is around 1:20.
2.12. Variable black¶
# ********************* *************** #
# "black": 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
# ************************************* #
data['black'].describe()
eda_numeric_plt(variable='black',y_label='1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town',y_max=500)
2.13. Variable lstat¶
# ********************* *************** #
# "lstat": lower status of the population (percent).
# ************************************* #
data["lstat"].describe()
eda_numeric_plt(variable='lstat',y_label='lower status of the population (percent)')
2.14. Variable medv¶
data["medv"].describe()
eda_numeric_plt(variable='medv',y_label='Median Value of owner occupied homes')
Interpretation¶
The distribution of the median value of owner-occupied homes is slightly right skewed, with the mean larger than the median.
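A quick numeric check of the mean/median gap behind that statement:
# Sketch: compare mean, median, and skewness of medv (mean > median is consistent with right skew)
print("mean:", round(data['medv'].mean(), 2),
      "| median:", round(data['medv'].median(), 2),
      "| skew:", round(data['medv'].skew(), 2))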
2.15. Explore correlations between variables¶
def correlations(data, y, xs):
rs = []
rhos = []
for x in xs:
r = stats.pearsonr(data[y], data[x])[0]
rs.append(r)
rho = stats.spearmanr(data[y], data[x])[0]
rhos.append(rho)
return pd.DataFrame({"feature": xs, "r": rs, "rho": rhos})
variables = list(data.columns)
for i, var_i in enumerate(variables):
if i< len(variables)-1:
print("correlations with %s" % var_i)
print(correlations(data, var_i, variables[i+1:]))
print("\n\n")
Interpretation¶
The correlation analysis shows that "crim" (crime rate) is anti-correlated with "zn" (proportion of residential land zoned for large lots) and positively correlated with "indus" (proportion of non-retail business acres). In addition, there is a negative correlation between "zn" and "indus".
When we build our prediction models, we have to keep the collinearity between variables in mind, so that we do not include strongly correlated variables together as predictors.
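One common way to screen for such collinearity before modeling is the variance inflation factor (VIF). The sketch below assumes statsmodels is available; it is a cross-check, not part of the original workflow (VIFs above roughly 10 are often treated as a red flag):
# Sketch: variance inflation factors as a collinearity screen
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

X_vif = sm.add_constant(features)  # add an intercept so the VIFs are interpretable
vifs = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vifs.drop('const'))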
#Heat map
plt.figure ( figsize = ( 20, 10 ) )
sns.heatmap ( data.corr().abs(), annot = True )
import warnings
warnings.filterwarnings('ignore')
from sklearn import preprocessing
#Let's scale the columns before plotting them against medv
min_max_scaler = preprocessing.MinMaxScaler()
column_sels = ['lstat', 'indus', 'nox', 'ptratio', 'rm', 'tax', 'dis', 'age']
x = data.loc[:,column_sels]
y = data[ 'medv' ]
x = pd.DataFrame ( data = min_max_scaler.fit_transform(x), columns = column_sels )
fig, axs = plt.subplots ( ncols = 4, nrows = 2, figsize = ( 20, 10 ) )
index = 0
axs = axs.flatten()
for i, k in enumerate ( column_sels ):
sns.regplot ( y = y, x = x [ k ], ax = axs[ i ] )
plt.tight_layout ( pad = 0.4, w_pad = 0.5, h_pad = 5.0 )
Interpretation¶
Based on the results above, we may try to predict 'medv' using the 'lstat', 'indus', 'nox', 'ptratio', 'rm', 'tax', 'dis', and 'age' features.
Intuitively, we can infer the following:
- An increase in 'rm' might result in an increase in 'medv', since more rooms indicate a bigger home that can accommodate more people.
- An increase in 'lstat' might result in a decrease in 'medv', since lower-income residents tend to have lower purchasing power.
- An increase in 'ptratio' might result in a decrease in 'medv', because a low 'ptratio' can indicate better education for the children in the town.
From the correlation figures shown above, we can infer the following (a quick numeric check follows after this list):
- An increase in 'rm' will result in an increase in 'medv', since they have a positive correlation.
- An increase in 'lstat' will result in a decrease in 'medv', since they have a negative correlation.
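Those directions can be checked quickly against the Pearson correlations computed earlier (a minimal sketch):
# Sketch: sign and strength of the correlations behind the statements above
for col in ['rm', 'lstat', 'ptratio']:
    r = stats.pearsonr(data[col], data['medv'])[0]
    print(f"corr({col}, medv) = {r:.2f}")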
3. Model development and evaluation¶
3 linear models + 1 decision tree model + 1 kNN model
3.1 Linear Model 1¶
Some initial thoughts and plans¶
From the first round of EDA analysis, we have looked at some basic statistics from individual variables in our datasets. We plotted their histograms and looked at their correlations.
We have found collinearity between many variables in this dataset.
The first model was built using variables that have strong correlations with the target variable ("medv": median value of owner-occupied homes in $1000s), while avoiding variables that are strongly correlated with each other.
Variables with a strong correlation with the target variable (|r| > 0.5) are: indus, nox, rm, rad, tax, and lstat. Since indus and nox are strongly correlated (r = 0.76), we kept only indus (indus has the stronger correlation with medv). Similarly, indus is also strongly correlated with tax and lstat; we chose lstat since it has the highest correlation with medv. Thus, this model started with "rm, rad, lstat" as predictors of "medv".
From the first several runs of model development and evaluations of residuals, we found that we had to transform some data variables.
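The screening described above can be reproduced roughly as follows; the 0.5 and 0.7 thresholds are assumptions used for illustration:
# Sketch: screen candidate predictors by correlation with medv, then flag collinear pairs
corr = data.corr()
candidates = corr['medv'].drop('medv').abs()
strong = candidates[candidates > 0.5].index.tolist()
print("features with |r| > 0.5 vs medv:", strong)
for i, a in enumerate(strong):
    for b in strong[i + 1:]:
        if abs(corr.loc[a, b]) > 0.7:
            print(f"collinear pair ({a}, {b}): r = {corr.loc[a, b]:.2f}")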
# data transformation
# divide rad into two categories: < 15 and >= 15
data['rad_lt_15'] = data.rad.apply(lambda x: 1 if x<15 else 0)
# convert lstat to a log scale
data['lstat_log'] = np.log(data.lstat)
data.head()
import models
# build the first linear model
model = " medv ~ rm + rad_lt_15 + lstat_log"
result = models.bootstrap_linear_regression(model, data=data)
models.describe_bootstrap_lr(result)
import random
import patsy
def chunk(xs, n):
k, m = divmod(len(xs), n)
return [xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]
def cross_validation(algorithm, formula, data, evaluate, fold_count=10, repetitions=3):
indices = list(range(len( data)))
metrics = []
for _ in range(repetitions):
random.shuffle(indices)
folds = chunk(indices, fold_count)
for fold in folds:
test_data = data.iloc[fold]
train_indices = [idx not in fold for idx in indices]
train_data = data.iloc[train_indices]
result = algorithm(formula, data=train_data)
model = result["model"]
y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
# y = np.ravel( y) # might need for logistic regression
results = models.summarize(formula, X, y, model)
metric = evaluate(results)
metrics.append(metric)
return metrics
lr_result_1 = cross_validation(models.linear_regression, model, data, lambda r: (r["sigma"], r["r_squared"]))
print(r"95% CI for R^2:", stats.mstats.mquantiles([r[1] for r in lr_result_1], [0.025, 0.975]))
Interpretation¶
We divided the data into 10 folds and repeated the analysis 3 times, giving 30 independent estimates of R2 for this model.
Results suggest the R2 of this model is between 0.47 and 0.85 at a 95% confidence interval.
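As an independent sanity check on these numbers, a sketch of the same 10-fold idea using scikit-learn's LinearRegression and cross_val_score on the same predictors (this is a cross-check, not the models module used above):
# Sketch: cross-check the cross-validated R^2 with scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

X_lm1 = data[['rm', 'rad_lt_15', 'lstat_log']]
y_lm1 = data['medv']
scores = cross_val_score(LinearRegression(), X_lm1, y_lm1, scoring='r2',
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("mean R^2:", round(scores.mean(), 3),
      "| min:", round(scores.min(), 3), "| max:", round(scores.max(), 3))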
3.2 Linear Model 2¶
Linear model 2 was contributed by a different group member than linear model 1 and considers a different set of variables. Note that we expect a better fit to the target "medv" with more variables, but we did not calculate an adjusted R2 for this exercise.
# data transformation
data["log_dis"] = data["dis"].apply(np.log)
data["log_crim"] = data["crim"].apply(np.log)
data["sqrt_lstat"] = data["lstat"].apply(np.sqrt)
model2 = "medv ~ rm + sqrt_lstat + ptratio + log_dis + log_crim"
result2 = models.bootstrap_linear_regression(model2, data=data)
models.describe_bootstrap_lr(result2)
lr_result_2 = cross_validation(models.linear_regression, model2, data, lambda r: (r["sigma"], r["r_squared"]))
print(r"95% CI for R^2:", stats.mstats.mquantiles([r[1] for r in lr_result_2], [0.025, 0.975]))
More interpretation¶
The cross-validated R2 estimates suggest the 95% confidence interval for linear model 2 is between 0.42 and 0.86.
3.3 Linear Model 3¶
Linear model 3 was contributed by a third group member, different from the authors of models 1 and 2, and considers yet another suite of variables.
model3 = "medv ~ black + tax + dis + lstat_log + ptratio + rm"
results = models.linear_regression(model3, data)
models.simple_describe_lr(results)
Cross validation¶
lr_result_3 = cross_validation(models.linear_regression, model3, data, lambda r: (r["sigma"], r["r_squared"]))
print(r"95% CI for R^2:", stats.mstats.mquantiles([r[1] for r in lr_result_3], [0.025, 0.975]))
More interpretation¶
The 95% confidence interval for model 3 is between 0.58 and 0.86.
3.4 Decision Tree¶
# build a cross validation curve for decision tree
def decisiontree_cross_validation(formula, builder, data, fold_count=10, repetitions=3):
indices = list(range(len( data)))
metrics = {"train": [], "test": []}
for _ in range(repetitions):
random.shuffle(indices)
folds = chunk(indices, fold_count)
for fold in folds:
test_data = data.iloc[fold]
train_indices = [idx not in fold for idx in indices]
train_data = data.iloc[train_indices]
# y, X for training data
y, X = patsy.dmatrices(formula, train_data, return_type="matrix")
model = builder.fit(X, y)
y_hat = model.predict(X)
training_r_squared = (stats.pearsonr(y[:,0], y_hat)[0])**2
metrics["train"].append(training_r_squared)
# y, X for test data
y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
y_hat = model.predict(X)
test_r_squared = (stats.pearsonr(y[:,0], y_hat)[0])**2
metrics["test"].append(test_r_squared)
return metrics
from sklearn.tree import DecisionTreeRegressor
# test a range of tree depths and create a validation curve
test_curve = []
train_curve = []
for k in range(1, 21):
builder = DecisionTreeRegressor(max_depth=k)
results = decisiontree_cross_validation(model, builder, data)
test_curve.append(stats.mstats.mquantiles(results['test'], [0.025, 0.5, 0.975]))
train_curve.append(stats.mstats.mquantiles(results['train'], [0.025, 0.5, 0.975]))
# visualize the results
xs = list(range(1, 21))
figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.
axes = figure.add_subplot(1, 1, 1)
test_lower, test_mid, test_upper = zip(*test_curve)
train_lower, train_mid, train_upper = zip(*train_curve)
axes.fill_between(xs, test_lower, test_upper, alpha=0.25, color="firebrick", label="test")
axes.plot(xs, test_mid, color="firebrick")
axes.fill_between(xs, train_lower, train_upper, alpha=0.25, color="cornflowerblue", label="train")
axes.plot(xs, train_mid, color="cornflowerblue")
axes.set_xticks(xs)
axes.set_xticklabels([str(x) for x in xs])
axes.legend()
axes.set_title("Validation Curves for Decision Tree")
axes.set_xlabel("Depth of Decision Tree")
axes.set_ylabel("$R^2$")
plt.show()
plt.close()
from collections import defaultdict
def data_collection():
result = dict()
result[ "train"] = defaultdict( list)
result[ "test"] = defaultdict( list)
return result
def decisiontree_learning_curves(builder, formula, data, fold_count=10, repetitions=3, increment=1):
indices = list(range(len( data)))
results = data_collection()
for _ in range(repetitions):
random.shuffle(indices)
folds = chunk(indices, fold_count)
for fold in folds:
test_data = data.iloc[ fold]
train_indices = [idx for idx in indices if idx not in fold]
train_data = data.iloc[train_indices]
for i in list(range(increment, 100, increment)) + [100]: # ensures 100% is always picked.
# the indices are already shuffled so we only need to take ever increasing chunks
train_chunk_size = int( np.ceil((i/100)*len( train_indices)))
train_data_chunk = data.iloc[train_indices[0:train_chunk_size]]
# we calculate the model
y, X = patsy.dmatrices(formula, train_data_chunk, return_type="matrix")
model = builder.fit(X, y)
y_hat = model.predict(X)
training_r_squared = (stats.pearsonr(y[:,0], y_hat)[0])**2
results["train"][i].append(training_r_squared)
# y, X for test data
y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
y_hat = model.predict(X)
test_r_squared = (stats.pearsonr(y[:,0], y_hat)[0])**2
results["test"][i].append(test_r_squared)
# process results
statistics = {}
for k, v in results["train"].items():
statistics[ k] = (np.mean(v), np.std(v))
results["train"] = statistics
statistics = {}
for k, v in results["test"].items():
statistics[ k] = (np.mean(v), np.std(v))
results["test"] = statistics
return results
# plot a learning curve
builder = DecisionTreeRegressor(max_depth=13)
result = decisiontree_learning_curves(builder, model, data)
def results_to_curves( curve, results):
all_statistics = results[ curve]
keys = list( all_statistics.keys())
keys.sort()
mean = []
upper = []
lower = []
for k in keys:
m, s = all_statistics[ k]
mean.append( m)
upper.append( m + 2 * s)
lower.append( m - 2 * s)
return keys, lower, mean, upper
def plot_learning_curves( results, metric, zoom=False):
figure = plt.figure(figsize=(10,6))
axes = figure.add_subplot(1, 1, 1)
xs, train_lower, train_mean, train_upper = results_to_curves( "train", results)
_, test_lower, test_mean, test_upper = results_to_curves( "test", results)
axes.plot( xs, train_mean, color="steelblue")
axes.fill_between( xs, train_upper, train_lower, color="steelblue", alpha=0.25, label="train")
axes.plot( xs, test_mean, color="firebrick")
axes.fill_between( xs, test_upper, test_lower, color="firebrick", alpha=0.25, label="test")
axes.legend()
axes.set_xlabel( "training set (%)")
axes.set_ylabel( metric)
axes.set_title("Learning Curves")
if zoom:
y_lower = int( 0.9 * np.amin([train_lower[-1], test_lower[-1]]))
y_upper = int( 1.1 * np.amax([train_upper[-1], test_upper[-1]]))
axes.set_ylim((y_lower, y_upper))
plt.show()
plt.close()
plot_learning_curves(result, "$R^2$")
DT_results = decisiontree_cross_validation(model, builder, data)
print(r"95% CI for R^2:", stats.mstats.mquantiles(DT_results['test'], [0.025, 0.975]))
Interpretation¶
Results suggest the 95% confidence interval of R2 in the decision tree method is between 0.88 and 0.99.
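A brief cross-check of the decision tree result with scikit-learn's own cross-validation utilities, using the same predictors and the same max_depth of 13 (a sketch, not the procedure above):
# Sketch: cross-check the decision tree cross-validated R^2 with scikit-learn
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, KFold

X_dt = data[['rm', 'rad_lt_15', 'lstat_log']]
y_dt = data['medv']
dt_scores = cross_val_score(DecisionTreeRegressor(max_depth=13, random_state=0), X_dt, y_dt,
                            scoring='r2', cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("mean R^2:", round(dt_scores.mean(), 3))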
3.5 K nearest neighbour¶
from sklearn.preprocessing import StandardScaler
import sklearn.neighbors as neighbors
import sklearn.linear_model as linear
import random
import patsy
columns = ['medv', 'black' , 'tax' , 'dis' , 'lstat_log' , 'ptratio' , 'rm']
scaler = StandardScaler()
data[columns] = scaler.fit_transform(data[columns])
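Standardizing matters here because kNN is distance-based, so features on larger scales would otherwise dominate the neighbor search. Strictly speaking, the scaler could also be fit on the training fold only to avoid information leaking from the test fold; a minimal, purely illustrative sketch of that variant (not the procedure used below):
# Sketch: fit the scaler on a training split only, then apply it to the held-out split
train_df, test_df = train_test_split(data, test_size=0.2, random_state=0)
fold_scaler = StandardScaler().fit(train_df[columns])
train_scaled = fold_scaler.transform(train_df[columns])
test_scaled = fold_scaler.transform(test_df[columns])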
def chunk(xs, n):
k, m = divmod(len(xs), n)
return [xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]
def knn_cross_validation(formula, builder, data, fold_count=10, repetitions=3):
indices = list(range(len( data)))
metrics = {"train": [], "test": []}
for _ in range(repetitions):
random.shuffle(indices)
folds = chunk(indices, fold_count)
for fold in folds:
test_data = data.iloc[fold]
train_indices = [idx not in fold for idx in indices]
train_data = data.iloc[train_indices]
# y, X for training data
y, X = patsy.dmatrices(formula, train_data, return_type="matrix")
model = builder.fit(X, y)
y_hat = model.predict(X)
training_r_squared = (stats.pearsonr(y, y_hat)[0][0])**2
metrics["train"].append(training_r_squared)
# y, X for test data
y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
y_hat = model.predict(X)
test_r_squared = (stats.pearsonr(y, y_hat)[0][0])**2
metrics["test"].append(test_r_squared)
return metrics
def plot_validation_curves(test_curve, train_curve):
xs = list(range(1, 21))
figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.
axes = figure.add_subplot(1, 1, 1)
test_lower, test_mid, test_upper = zip(*test_curve)
train_lower, train_mid, train_upper = zip(*train_curve)
axes.fill_between(xs, test_lower, test_upper, alpha=0.25, color="firebrick", label="test")
axes.plot(xs, test_mid, color="firebrick")
axes.fill_between(xs, train_lower, train_upper, alpha=0.25, color="cornflowerblue", label="train")
axes.plot(xs, train_mid, color="cornflowerblue")
axes.set_xticks(xs)
axes.set_xticklabels([str(x) for x in xs])
axes.legend()
axes.set_title("Validation Curves for kNN")
axes.set_xlabel("k")
axes.set_ylabel("$R^2$")
plt.show()
plt.close()
from collections import defaultdict
def create_data_collection():
result = dict()
result["train"] = defaultdict(list)
result["test"] = defaultdict(list)
return result
def knn_learning_curves(builder, formula, data, fold_count=10, repetitions=3, increment=1):
indices = list(range(len( data)))
results = create_data_collection()
for _ in range(repetitions):
random.shuffle(indices)
folds = chunk(indices, fold_count)
for fold in folds:
test_data = data.iloc[ fold]
train_indices = [idx for idx in indices if idx not in fold]
train_data = data.iloc[train_indices]
for i in list(range(increment, 100, increment)) + [100]: # ensures 100% is always picked.
# the indices are already shuffled so we only need to take ever increasing chunks
train_chunk_size = int( np.ceil((i/100)*len( train_indices)))
train_data_chunk = data.iloc[train_indices[0:train_chunk_size]]
# we calculate the model
y, X = patsy.dmatrices(formula, train_data_chunk, return_type="matrix")
model = builder.fit(X, y)
y_hat = model.predict(X)
training_r_squared = (stats.pearsonr(y, y_hat)[0][0])**2
results["train"][i].append(training_r_squared)
# y, X for test data
y, X = patsy.dmatrices(formula, test_data, return_type="matrix")
y_hat = model.predict(X)
test_r_squared = (stats.pearsonr(y, y_hat)[0][0])**2
results["test"][i].append(test_r_squared)
# process results
# Rely on the CLT...
statistics = {}
for k, v in results["train"].items():
statistics[ k] = (np.mean(v), np.std(v))
results["train"] = statistics
statistics = {}
for k, v in results["test"].items():
statistics[ k] = (np.mean(v), np.std(v))
results["test"] = statistics
return results
#
def results_to_curves( curve, results):
all_statistics = results[ curve]
keys = list( all_statistics.keys())
keys.sort()
mean = []
upper = []
lower = []
for k in keys:
m, s = all_statistics[ k]
mean.append( m)
upper.append( m + 2 * s)
lower.append( m - 2 * s)
return keys, lower, mean, upper
def plot_learning_curves( results, metric, zoom=False):
figure = plt.figure(figsize=(10,6))
axes = figure.add_subplot(1, 1, 1)
xs, train_lower, train_mean, train_upper = results_to_curves( "train", results)
_, test_lower, test_mean, test_upper = results_to_curves( "test", results)
axes.plot( xs, train_mean, color="steelblue")
axes.fill_between( xs, train_upper, train_lower, color="steelblue", alpha=0.25, label="train")
axes.plot( xs, test_mean, color="firebrick")
axes.fill_between( xs, test_upper, test_lower, color="firebrick", alpha=0.25, label="test")
axes.legend()
axes.set_xlabel( "training set (%)")
axes.set_ylabel( metric)
axes.set_title("Learning Curves")
if zoom:
y_lower = int( 0.9 * np.amin([train_lower[-1], test_lower[-1]]))
y_upper = int( 1.1 * np.amax([train_upper[-1], test_upper[-1]]))
axes.set_ylim((y_lower, y_upper))
plt.show()
plt.close()
formula = "medv ~ black + tax + dis + lstat_log + ptratio + rm"
test_curve = []
train_curve = []
for k in range(1, 21):
builder = neighbors.KNeighborsRegressor(k)
results = knn_cross_validation(formula, builder, data)
test_curve.append(stats.mstats.mquantiles(results['test'], [0.025, 0.5, 0.975]))
train_curve.append(stats.mstats.mquantiles(results['train'], [0.025, 0.5, 0.975]))
plot_validation_curves(test_curve, train_curve)
result = knn_learning_curves(neighbors.KNeighborsRegressor(1), formula, data)
plot_learning_curves(result, "$R^2$")
Creating an ensemble of R2 estimates¶
builder = neighbors.KNeighborsRegressor(1)
knn_results = knn_cross_validation(formula, builder, data)
print(r"95% CI for R^2:", stats.mstats.mquantiles(knn_results['test'], [0.025, 0.975]))
Interpretation¶
Results suggest the 95% confidence interval of R2 for the kNN method is between 0.67 and 0.99.
def bootstrap_sample( data, f, n=100):
result = []
m = len( data)
for _ in range( n):
sample = np.random.choice( data, len(data), replace=True)
r = f( sample)
result.append( r)
return np.array( result)
lr1_r2 = [r[1] for r in lr_result_1]
lr2_r2 = [r[1] for r in lr_result_2]
lr3_r2 = [r[1] for r in lr_result_3]
dt_r2 = DT_results['test']
knn_bootstrap = bootstrap_sample(knn_results['test'], np.mean,30)
sns.set(style="whitegrid")
# visualize result
figure = plt.figure(figsize=(8, 8)) # first element is width, second is height.
axes = figure.add_subplot(2, 3, 1)
axes.hist(lr1_r2, density=True)
axes.set_ylabel( "Density")
axes.set_title( "Linear Model 1")
axes = figure.add_subplot(2,3, 2)
axes.hist(lr2_r2, density=True)
axes.set_ylabel( "Density")
axes.set_title( "Linear Model 2")
axes = figure.add_subplot(2,3, 3)
axes.hist( lr3_r2, density=True)
axes.set_ylabel( "Density")
axes.set_xlabel( "$R^2$")
axes.set_title( "Linear Model 3")
axes = figure.add_subplot(2,3, 4)
axes.hist( dt_r2, density=True)
axes.set_ylabel( "Density")
axes.set_xlabel( "$R^2$")
axes.set_title( "Decision Tree")
axes = figure.add_subplot(2,3, 5)
axes.hist( knn_bootstrap, density=True)
axes.set_ylabel( "Density")
axes.set_xlabel( "$R^2$")
axes.set_title( "k-nearest neighbors")
plt.show()
plt.close()
Interpretation¶
Results suggest the R2 of the first linear model is centered around 0.7, whereas linear model 2 is centered around 0.8. Linear model 3 seems to have a higher R2 than the other two linear models, but the kNN regression likely has the highest R2 of all.
Let's look at this more quantitatively.
(1) Compare linear model 2 with linear model 1¶
diff1 = np.array(lr2_r2) - np.array(lr1_r2)
print("P(LR2 > LR1)", np.mean(diff1 > 0))
Statistically, the chance that Linear Model 2 performs better than Linear Model 1 is only 57%.
(2) Compare linear model 3 with linear models 1 and 2¶
diff2a = np.array(lr3_r2) - np.array(lr1_r2)
print("P(LR3 > LR1)", np.mean(diff2a > 0))
diff2b = np.array(lr3_r2) - np.array(lr2_r2)
print("P(LR3 > LR2)", np.mean(diff2b > 0))
The results above suggest the probability that linear model 3 performs better than models 1 and 2 is only 0.53-0.63.
(3) Compare decision tree model with linear models¶
diff3a = np.array(dt_r2) - np.array(lr1_r2)
print("P(DT > LR1)", np.mean(diff3a > 0))
diff3b = np.array(dt_r2) - np.array(lr2_r2)
print("P(DT > LR2)", np.mean(diff3b > 0))
diff3c = np.array(dt_r2) - np.array(lr3_r2)
print("P(DT > LR3)", np.mean(diff3c > 0))
(4) Compare K nearest neighbour model with the linear models and the decision tree model¶
diff3a = np.array(knn_bootstrap) - np.array(lr1_r2)
print("P(KNN > LR1)", np.mean(diff3a > 0))
diff3b = np.array(knn_bootstrap) - np.array(lr2_r2)
print("P(KNN > LR2)", np.mean(diff3b > 0))
diff3c = np.array(knn_bootstrap) - np.array(lr3_r2)
print("P(KNN > LR3)", np.mean(diff3c > 0))
diff4c = np.array(knn_bootstrap) - np.array(dt_r2)
print("P(KNN > dt_r2)", np.mean(diff4c > 0))
Results suggest the k nearest neighbors regression performs much better than the three linear models we considered and is comparable to the decision tree regression.
4. Conclusions¶
In this project, we selected the Boston housing dataset for our analysis and modeling. We conducted detailed EDA for the individual variables and correlation analyses between each pair of variables.
Building on the EDA, we constructed three linear models, one decision tree model, and one kNN model. We found roughly an 80% chance that the third linear model performs slightly better than the other two linear models, while the decision tree and kNN models performed much better than any of the three linear models.