{ "cells": [ { "cell_type": "markdown", "id": "565fd595", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Linear regression: prediction error and more" ] }, { "cell_type": "markdown", "id": "658ba253", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "##### Libraries" ] }, { "cell_type": "code", "execution_count": 2, "id": "51c0f716", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "## Imports\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf\n", "import seaborn as sns\n", "import scipy.stats as ss" ] }, { "cell_type": "code", "execution_count": 3, "id": "e53866de", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%matplotlib inline\n", "%config InlineBackend.figure_format = 'retina' # makes figs nicer!" ] }, { "cell_type": "markdown", "id": "0e698b4b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Goals of this lecture\n", " \n", "- Extracting model **predictions**.\n", "- Basic model evaluation: \n", " - Visualizing $\\hat{Y}$ vs. $Y$.\n", " - $RSS$: residual sum of squares. \n", " - $S_{Y|X}$: standard error of the estimate. \n", " - Using $S_{Y|X}$ to calculate **standard error** for our coefficients.\n", " - $R^2$: coefficient of determination. \n", "- Homoscedasticity. 
" ] }, { "cell_type": "markdown", "id": "559a9a83", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Models as *predictors*" ] }, { "cell_type": "markdown", "id": "f3f1f507", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Modeling our data\n", "\n", "> A **statistical model** is a mathematical model representing a \"data-generating process\".\n", "\n", "This means we can use a model to **generate predictions** for some value of $X$. \n", "\n", "$\\Large \\hat{Y} = f(X, \\beta)$" ] }, { "cell_type": "code", "execution_count": 3, "id": "67cfbaab", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n",
"|   | Education | Seniority | Income |\n",
"|---|---|---|---|\n",
"| 0 | 21.586207 | 113.103448 | 99.917173 |\n",
"| 1 | 18.275862 | 119.310345 | 92.579135 |\n",
"| 2 | 12.068966 | 100.689655 | 34.678727 |\n", "