Linear regression: prediction error and more#

Libraries#

## Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
import scipy.stats as ss
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # makes figs nicer!

Goals of this lecture#

  • Extracting model predictions.

  • Basic model evaluation:

    • Visualizing \(\hat{Y}\) vs. \(Y\).

    • \(RSS\): residual sum of squares.

    • \(S_{Y|X}\): standard error of the estimate.

      • Using \(S_{Y|X}\) to calculate standard error for our coefficients.

    • \(R^2\): coefficient of determination.

  • Homoscedasticity.

Models as predictors#

Modeling our data#

A statistical model is a mathematical model representing a “data-generating process”.

This means we can use a model to generate predictions for some value of \(X\).

\(\Large \hat{Y} = f(X, \beta)\)

df_income = pd.read_csv("data/models/income.csv")
df_income.head(3)
   Education   Seniority     Income
0  21.586207  113.103448  99.917173
1  18.275862  119.310345  92.579135
2  12.068966  100.689655  34.678727

Predictions from a linear model#

Predictions from a linear model can be obtained by “plugging in” some value of \(X\) to the linear equation with fit model parameters (\(\beta_0\), \(\beta_1\)).

mod_edu = smf.ols(data = df_income, formula = "Income ~ Education").fit()
mod_edu.params
Intercept   -41.916612
Education     6.387161
dtype: float64

Here, the fitted regression equation would be written:

\(\hat{Y} = -41.92 + 6.39 \cdot X\)

X = 10
-41.92 + 6.39 * X ## predicted income when X = 10
21.979999999999997
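
To avoid rounding error, we can plug in the unrounded coefficients stored in mod_edu.params (a quick sketch using the model fit above):

## Same prediction with the unrounded coefficients
b0, b1 = mod_edu.params['Intercept'], mod_edu.params['Education']
b0 + b1 * 10  ## ~21.955, matching mod_edu.predict() in the next section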

Predictions using statsmodels#

  • Instead of generating predictions by hand, you can use the predict() method of a fit model in statsmodels.

  • By default, this will generate predictions for all of the original values of \(X\).

mod_edu.predict()
array([95.95797131, 74.81426521, 35.16981627, 66.88537542, 85.38611826,
       74.81426521, 85.38611826, 93.31500804, 88.02908152, 21.95499996,
       45.74166932, 77.45722847, 32.52685301, 64.24241216, 21.95499996,
       88.02908152, 48.38463259, 64.24241216, 64.24241216, 88.02908152,
       74.81426521, 51.02759585, 69.52833868, 24.59796322, 95.95797131,
       29.88388975, 85.38611826, 32.52685301, 35.16981627, 66.88537542])

predict() for new data?#

You can also use predict() for new data: it just needs to be represented as a DataFrame with the appropriate columns.

## Create new DataFrame
new_data = pd.DataFrame({'Education': [10, 25, 35]})
mod_edu.predict(new_data)
0     21.955000
1    117.762418
2    181.634030
dtype: float64

Model evaluation#

Once we’ve built a model, we want to evaluate it: how good is this model?

We’ll discuss several ways to evaluate our linear model:

  • Visually: plotting \(\hat{Y}\) or the residuals of the model.

  • Evaluation metrics: \(RSS\), \(MSE\), and \(S_{Y|X}\).

Visual comparisons#

We can use data visualizations to evaluate the success of our model.

There are a few ways to do this:

  1. Plotting \(\hat{Y}\) as the regression line over Income and Education.

  2. Directly plotting \(\hat{Y}\) vs. \(Y\).

  3. Plotting the residuals of our model.

Plotting \(\hat{Y}\) as the regression line#

y_pred = mod_edu.predict()
sns.scatterplot(x = df_income['Education'], y = df_income['Income'])
sns.lineplot(x = df_income['Education'], y = y_pred, alpha = .5, linestyle = "dotted")
plt.xlabel("Education")
plt.ylabel("Income")
[Figure: Income vs. Education scatterplot with the fitted regression line (dotted)]

Comparing \(\hat{Y}\) to \(Y\) directly#

If we had a perfect model, \(\hat{Y}\) would be identical to \(Y\).

y_pred = mod_edu.predict()
sns.scatterplot(x = y_pred, y = df_income['Income'])
sns.lineplot(x = df_income['Income'], y = df_income['Income'], alpha = .5, linestyle = "dotted")
plt.xlabel("Predicted values: $\hat{Y}$")
plt.ylabel("Real values: $Y$")
Text(0, 0.5, 'Real values: $Y$')
../_images/0ea1a9896adf71659c7be77398d4d2f38ae6424b49991692185e3ba25cb4c814.png

Plotting residuals#

The residuals of a model are the differences between each observed value \(Y\) and each predicted value \(\hat{Y}\).

  • We can calculate the residuals directly: \(Y - \hat{Y}\).

  • Or we can access them using mod.resid.

resid = df_income['Income'].values - y_pred ### calculate residuals
sns.scatterplot(x = df_income['Education'], y = resid)
plt.axhline(y = 0, linestyle = "dotted")
plt.ylabel("Residuals")
[Figure: residuals vs. Education, with a dotted horizontal line at 0]
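
As noted above, statsmodels also stores the residuals on the fit model. A quick check that mod.resid matches our hand calculation:

## mod.resid stores Y minus Y-hat for each observation
np.allclose(mod_edu.resid, df_income['Income'] - mod_edu.predict())
True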

Check-in#

Build a linear model predicting Income from Seniority. Then, evaluate this model using one of the visual comparison methods discussed above:

  1. Plotting \(\hat{Y}\) as the regression line over Income and Seniority.

  2. Directly plotting \(\hat{Y}\) vs. \(Y\).

  3. Plotting the residuals of our model.

### Your code here

Solution (1)#

mod_seniority = smf.ols(data = df_income, formula = "Income ~ Seniority").fit()
sns.scatterplot(x = df_income['Seniority'], y = df_income['Income'])
sns.lineplot(x = df_income['Seniority'], y = mod_seniority.predict(), linestyle = "dotted")
<Axes: xlabel='Seniority', ylabel='Income'>
../_images/2162cafd924712bf00ad8fe9ad0915b5131bcba58b7368aee5bdf376db430364.png

Solution (2)#

sns.scatterplot(x = mod_seniority.predict(), y = df_income['Income'])
sns.lineplot(x = df_income['Income'], y = df_income['Income'], alpha = .5, linestyle = "dotted")
plt.xlabel("Predicted values: $\hat{Y}$")
plt.ylabel("Real values: $Y$")
Text(0, 0.5, 'Real values: $Y$')
../_images/f6608d54801a4fdcb11d68c7fec8c94e3beda10fc60d84d6237ddc963ab30e7a.png

Solution (3)#

sns.scatterplot(x = df_income['Seniority'], y = mod_seniority.resid)
plt.axhline(y = 0, linestyle = "dotted")
plt.ylabel("Residuals")
[Figure: residuals vs. Seniority, with a dotted horizontal line at 0]

Calculating \(RSS\)#

The residual sum of squares is the sum of squared deviations between each prediction and the real values of \(Y\).

\(RSS = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\)

Note: A higher value is worse (i.e., more error).

### define a helper function!
def rss(y_pred, y):
    return sum((y_pred - y)**2)
## RSS for "Education" model
rss(mod_edu.predict(), df_income['Income'])
3982.506585472266
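
As a cross-check, a fit model in statsmodels stores this same quantity as the ssr attribute ("sum of squared residuals"):

## Same value as our helper function
mod_edu.ssr
3982.506585472266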

Check-in#

Calculate the \(RSS\) for your Seniority model. Is it better or worse than the \(RSS\) for the Education model? Does this match your intuition from the visual comparisons?

### Your code here

Solution#

## RSS for "Seniority" model
rss(mod_seniority.predict(), df_income['Income'])
15477.269661883032

Visually comparing \(RSS\)#

rss_seniority = rss(mod_seniority.predict(), df_income['Income'])
rss_education = rss(mod_edu.predict(), df_income['Income'])
df_rss = pd.DataFrame({'RSS': [rss_seniority, rss_education],
                      'Predictor': ['Seniority', 'Education']})
sns.barplot(data = df_rss, x = "Predictor", y = "RSS")
<Axes: xlabel='Predictor', ylabel='RSS'>
../_images/873e291b7b1d5a38d04ab6024a8ea4e61a22bb78bbdbfa03c2fc535eebc47de6.png

Calculating \(MSE\)#

The mean squared error is the \(RSS\) divided by the number of data points.

\(MSE = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2\)

Note: A higher value is worse (i.e., more error).

### define a helper function!
def mse(y_pred, y):
    return (sum((y_pred - y)**2)) / (len(y))
## MSE for "Education" model
mse(mod_edu.predict(), df_income['Income'])
132.7502195157422

Check-in#

Calculate the \(MSE\) for your Seniority model. Is it better or worse than the \(MSE\) for the Education model? Does this match your intuition from the visual comparisons?

### Your code here

Solution#

## MSE for "Seniority" model
mse(mod_seniority.predict(), df_income['Income'])
515.9089887294344

Calculating \(S_{Y|X}\)#

The standard error of the estimate is a measure of the expected prediction error, i.e., how much your predictions are “wrong” on average.

\(\Large S_{Y|X} = \sqrt{\frac{RSS}{n-2}}\)

  • How much, on average, do we expect \(\hat{Y}\) to deviate from \(Y\)?

  • As before, a smaller number means better fit.

### Calculating directly
np.sqrt(rss(mod_edu.predict(), df_income['Income']) / (len(df_income) - 2))
11.926121668530007

Calculating using statsmodels#

A fit model in statsmodels also gives us a scale parameter; the square root of this parameter is \(S_{Y|X}\).

mod_edu.scale ## estimated residual variance, RSS / (n - 2)
142.2323780525809
np.sqrt(mod_edu.scale) ## standard error of the estimate
11.926121668530005

Reporting \(S_{Y|X}\)#

We can report the standard error of the estimate by attaching it to our predictions, such as:

For an individual with \(10\) years of education, we predict an income of \(21.96 \pm 11.93\).
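
A minimal sketch of how we might assemble that report in code (the print formatting here is just illustrative):

## Attach the standard error of the estimate to a prediction
pred = mod_edu.predict(pd.DataFrame({'Education': [10]})).iloc[0]
s_yx = np.sqrt(mod_edu.scale)
print(f"Predicted income: {pred:.2f} ± {s_yx:.2f}")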

Check-in#

What is the standard error of the estimate for the Seniority model? Is it higher or lower than the one for the Education model? What does this mean with respect to our prediction accuracy?

### Your code here

Solution#

np.sqrt(mod_seniority.scale)
23.510840707672212

\(R^2\): how much variance is explained?#

The \(R^2\), or coefficient of determination, measures the proportion of variance in \(Y\) explained by the model.

\(\Large R^2 = 1 - \frac{RSS}{SS_{Y}}\)

Check-in: Why is this a sensible formula for calculating the proportion of variance in \(Y\) explained by the model?

Decomposing \(R^2\) (pt. 1)#

First, consider the terms in the fraction:

  • \(RSS\): Leftover variance in \(Y\) after fitting the model.

  • \(SS_Y\): Total variance in \(Y\) (i.e., before fitting the model).

Thus, \(\frac{RSS}{SS_{Y}} = 1\) would mean the model hasn’t explained anything, while \(\frac{RSS}{SS_{Y}} = 0\) would mean there’s \(0\) leftover variance after fitting the model.

Decomposing \(R^2\) (pt. 2)#

So \(\frac{RSS}{SS_{Y}}\) gives us the ratio of leftover variance to total variance.

This means that computing \(1 - \frac{RSS}{SS_{Y}}\) gives us the proportion of variance explained by the model (i.e., the complement of the leftover proportion).

  • \(R^2 = 0\) means the model hasn’t explained anything.

  • \(R^2 = 1\) means there’s \(0\) leftover variance.

Extracting \(R^2\)#

We can extract \(R^2\) directly from the fit model.

### What does this R^2 value mean?
mod_edu.rsquared
0.811806896672258
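
We can verify this value against the formula above by computing \(R^2\) by hand, reusing our rss helper (here, \(SS_Y\) is the sum of squared deviations of Income from its mean):

## R^2 "by hand": 1 - (leftover variance / total variance)
ss_y = sum((df_income['Income'] - df_income['Income'].mean())**2)
1 - rss(mod_edu.predict(), df_income['Income']) / ss_y
0.811806896672258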

Check-in#

Extract the \(R^2\) for your Seniority model. What is it? How does it compare to the \(R^2\) for the Education model?

### Your code here

Solution#

### What does this R^2 value mean?
mod_seniority.rsquared
0.2686225757076368

Comparing three models#

Now, let’s compare \(3\) different models:

  1. An Intercept-only model, i.e., with no predictors.

  2. The model with Seniority.

  3. The model with Education.

mod_intercept = smf.ols(data = df_income, formula = "Income ~ 1").fit()
df_r2 = pd.DataFrame({'R2': [mod_edu.rsquared, mod_seniority.rsquared,
                            mod_intercept.rsquared],
                     'Predictors': ['Education', 'Seniority', 'Intercept']})
sns.barplot(data = df_r2, x = "Predictors", y = "R2")
<Axes: xlabel='Predictors', ylabel='R2'>
../_images/a0f1a99b03eb93a2509e40cf3f5afbab295a907f73c5de78b6d29c5cdc31cd3f.png

The return of heteroscedasticity#

Review: Homoscedasticity#

Homoscedasticity means that the variance of the residuals is constant across values of \(X\), i.e., the residuals are evenly distributed along the regression line.

  • The opposite of homoscedasticity is heteroscedasticity.

### This is an example of heteroscedasticity
np.random.seed(1)
X = np.arange(1, 101)
y = X + np.random.normal(loc = 0, scale = X/2, size = 100)
plt.scatter(X, y, alpha = .6)
<matplotlib.collections.PathCollection at 0x173e758d0>
../_images/ad672721e09229a7a2c5999b0ce0ff040768f55516e0b57d76579370819a1b0d.png

Check-in#

Why might heteroscedasticity pose a problem for our measures of prediction error, e.g., \(S_{Y|X}\)?

### Your answer here

Heteroscedasticity and prediction error#

The standard error of the estimate gives us an estimate of how much prediction error to expect.

But with heteroscedasticity, our error is itself correlated with \(X\).

  • Less error for small values of \(X\).

  • More error for large values of \(X\).

plt.scatter(X, y, alpha = .6)
plt.plot(X, X, linestyle = "dotted", color = "red", alpha = .6)
[<matplotlib.lines.Line2D at 0x173e76410>]
../_images/e285476aa742eb759f0cdb92349c82f6342cf247cdcb00d115cb5d35adec38ce.png
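
We can make this concrete by fitting a model to the simulated data and comparing the spread of the residuals for small vs. large values of \(X\) (a sketch; the split at the median of X is arbitrary):

## Fit a model to the simulated heteroscedastic data
df_sim = pd.DataFrame({'X': X, 'y': y})
mod_sim = smf.ols(data = df_sim, formula = "y ~ X").fit()
## Residual spread grows with X
mod_sim.resid[df_sim['X'] <= 50].std(), mod_sim.resid[df_sim['X'] > 50].std()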

Heteroscedasticity and \(\beta\) SE#

  • Heteroscedasticity can also bias our estimates of the error associated with each coefficient.

  • Typically, this results in an underestimate of the \(SE\), potentially leading to false positives.

  • This is because calculating \(SE\) depends on \(S_{Y|X}\)!
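
One common remedy (beyond the scope of this lecture) is to use heteroscedasticity-robust standard errors, which statsmodels supports via the cov_type argument to fit. A sketch on the simulated data from above:

## Conventional vs. heteroscedasticity-robust ("HC3") standard errors
mod_hc = smf.ols(data = df_sim, formula = "y ~ X").fit(cov_type = "HC3")
print(mod_sim.bse['X'])  ## conventional SE for the slope
print(mod_hc.bse['X'])   ## robust SE; typically larger under heteroscedasticity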

Calculating \(SE(\beta)\)#

The formula for calculating the standard error of our slope coefficient is:

\(SE(\beta_1) = \frac{S_{Y|X}}{\sqrt{SS_X}} = \sqrt{\frac{RSS/(n-2)}{SS_X}}\)

(You won’t be required to know this––but adds important context!)

Conclusion#

  • An important part of statistical modeling is evaluating our models, either by:

    • Visual inspection.

    • Using metrics of predictive success.

  • Metrics like \(S_{Y|X}\) depend on certain assumptions being met, like homoscedasticity.