Notation Key

  • Vectors: boldface
  • Scalars: normal font
  • X₂ (boldface): second data record, a vector
  • X₂ (normal font): second component of a data record
  • Number of data records: n
  • "Star" for true quantity, e.g., θ*
  • "Hat" for estimates, e.g., θ̂
  • Upper case: random variables, e.g., Y
  • Lower case: numbers and constants, e.g., y for realized value of Y

New Insights

The aim is to find relationships between different variables and thereby uncover new patterns in the data.

Regression Framework

Data: n pairs (X₁, Y₁), …, (Xₙ, Yₙ), where each Xᵢ is a vector of dimension m and each Yᵢ is a scalar.

For a new vector X, the corresponding scalar Y is unknown and must be predicted.

Regressor/predictor: Ŷ = g(X)

Goal: build a function g so that when a new X comes in, we can output a predicted value Ŷ: X → [g] → Ŷ

We need to learn a good g from the data.
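To make the framework concrete, here is a minimal sketch (not from the original notes) of the X → [g] → Ŷ interface; the placeholder predictor and the numbers are purely hypothetical, and any rule that maps an m-dimensional attribute vector to a scalar fits the same interface.

```python
import numpy as np

# Hypothetical placeholder predictor: any function that maps an
# m-dimensional attribute vector X to a scalar prediction fits here.
def g(x: np.ndarray) -> float:
    # Purely illustrative rule: predict the average of the attributes.
    return float(np.mean(x))

x_new = np.array([3.0, 1.5, -2.0])  # a new record with m = 3 attributes
y_hat = g(x_new)                    # X -> [g] -> Y-hat
print(y_hat)                        # 0.8333...
```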

Objective Function

(1/n) Σᵢ₌₁ⁿ (g(Xᵢ) − Yᵢ)²

Where:

  • g(Xᵢ) is the prediction
  • Yᵢ is the actual value
  • g(Xᵢ) - Yᵢ is the error, which we want to be small
  • The whole expression is the mean squared error (MSE); a short numerical sketch follows this list
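As a quick numerical illustration of the objective, here is a short sketch with made-up data that evaluates the mean squared error of a candidate predictor g over the n training pairs; the data and the choice of g are assumptions for the example only.

```python
import numpy as np

# Made-up training data: n = 4 records, m = 1 attribute each.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
Y = np.array([1.2, 1.9, 3.2, 3.9])

def g(x):
    # Candidate predictor: simply return the single attribute.
    return x[0]

# MSE = (1/n) * sum_i (g(X_i) - Y_i)^2
errors = np.array([g(x) - y for x, y in zip(X, Y)])
mse = np.mean(errors ** 2)
print(mse)  # 0.025 -- small, since g is close to the data
```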

Overfitting Warning

If g is an arbitrary curve (say, a line that passes through every data point, so the training error is 0), we cannot believe it. This is called overfitting, which we need to avoid since it leads to nonsensical conclusions.

[Graph showing overfitting: A wiggly line that perfectly passes through all points, versus a simpler linear fit]
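A minimal sketch of this phenomenon, using made-up roughly linear data: a degree-7 polynomial through 8 points drives the training error to essentially zero, while a straight-line fit keeps a small but honest error; the wiggly curve's predictions between the training points cannot be trusted.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)  # roughly linear data

wiggly = np.polyfit(x, y, deg=7)  # passes through all 8 points
line = np.polyfit(x, y, deg=1)    # ordinary straight-line fit

mse_wiggly = np.mean((np.polyval(wiggly, x) - y) ** 2)
mse_line = np.mean((np.polyval(line, x) - y) ** 2)
print(mse_wiggly, mse_line)  # ~0 versus a small positive number

# Between the training points the wiggly curve can stray far from
# the true trend 2*x, which is the overfitting problem.
x_mid = 0.5 * (x[:-1] + x[1:])
print(np.polyval(wiggly, x_mid) - 2.0 * x_mid)
```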

Linear Regression Model

We do not allow g to be arbitrarily general; we restrict it to a limited class of predictors:

In linear regression, we restrict to the class of predictors that are linear in the attributes of the vector X.

When a new record (e.g., a person) comes in with attributes X₁ up to Xₘ, we form a linear combination of these attributes: Ŷ = β₀ + β₁X₁ + … + βₘXₘ.

The choice of predictor is the choice of β: in the 2D case, β determines the location of the line.

The slope of that line is β₁. By playing with β, we can move the line around.
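A small sketch (with made-up coefficients) of the restricted predictor class: the prediction is β₀ plus a linear combination of the attributes, and changing β tilts and shifts the fitted line or plane.

```python
import numpy as np

def linear_predictor(beta, x):
    """Y-hat = beta_0 + beta_1*x_1 + ... + beta_m*x_m."""
    return beta[0] + np.dot(beta[1:], x)

beta = np.array([1.0, 0.5, -2.0])  # made-up: intercept, beta_1, beta_2
x = np.array([4.0, 1.0])           # one record with m = 2 attributes
print(linear_predictor(beta, x))   # 1.0 + 0.5*4.0 - 2.0*1.0 = 1.0
```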

Residuals

A residual is the error between the predicted value and the observed value, g(Xᵢ) − Yᵢ.

We need to find θ so that the sum of squared residuals is as small as possible:

Σᵢ₌₁ⁿ (Yᵢ − β₀ − β₁Xᵢ₁ − … − βₘXᵢₘ)²

Any line gives a certain numerical value for the sum of squared residuals; choosing the line that minimizes this value is called ordinary least squares (OLS).
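A minimal OLS sketch with made-up data: append a column of ones for the intercept and let np.linalg.lstsq return the θ that minimizes the sum of squared residuals.

```python
import numpy as np

# Made-up data: n = 5 records, m = 2 attributes.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
Y = np.array([3.1, 2.9, 7.2, 6.8, 10.1])

# Design matrix with a leading column of ones for the intercept beta_0.
A = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimize sum_i (Y_i - A_i . theta)^2 over theta.
theta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(theta_hat)                          # [beta_0, beta_1, beta_2]
print(np.sum((A @ theta_hat - Y) ** 2))   # minimized sum of squared residuals
```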

Assumptions of Linear Regression

  1. Linearity - The regression model can be expressed linearly
  2. Homoscedasticity - The variance of the error is constant
  3. Independence - Observations are independent from each other
  4. No autocorrelation - The error terms are not correlated across observations (see the residual-check sketch after this list)
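These assumptions are usually checked informally on the residuals after fitting. Below is a small sketch with synthetic data (generated so that the assumptions hold) of two common checks: roughly constant residual spread across fitted values for homoscedasticity, and near-zero correlation between consecutive residuals for no autocorrelation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from a linear model with constant-variance, independent noise.
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=n)

# Fit by OLS and compute residuals.
A = np.column_stack([np.ones(n), x])
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ theta_hat
resid = y - fitted

# Homoscedasticity check: residual spread should look similar
# for low and high fitted values.
low = fitted < np.median(fitted)
print(resid[low].std(), resid[~low].std())   # roughly equal

# No-autocorrelation check: correlation between consecutive residuals
# (in the order the data were generated) should be close to zero.
print(np.corrcoef(resid[:-1], resid[1:])[0, 1])
```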

Example Application

  • n = 209
  • m+1 = 4
  • β₀ is the intercept
  • θ̂ = [2.94, 0.064, 0.19, -0.001]ᵀ
  • Sales = 2.94 + 0.064(TV) + 0.19(Radio) - 0.001(Newspaper)

Linear regression software produces the coefficients that multiply the ad expenditure on the different channels.
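As a quick illustration of how the fitted model would be used, here is a sketch that plugs a hypothetical ad budget (numbers chosen purely for illustration) into the reported coefficients.

```python
import numpy as np

# Reported coefficients: intercept, TV, Radio, Newspaper.
theta_hat = np.array([2.94, 0.064, 0.19, -0.001])

# Hypothetical ad spend for one market (illustrative values only).
x_new = np.array([100.0, 20.0, 30.0])  # TV, Radio, Newspaper

sales_hat = theta_hat[0] + theta_hat[1:] @ x_new
print(sales_hat)  # 2.94 + 0.064*100 + 0.19*20 - 0.001*30 = 13.11
```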

The coefficient -0.001 is unusual, as it suggests that the more you spend on newspaper ads, the lower the sales.

Simple linear regression example: Sales = 12.35 + 0.055(Newspaper)

The coefficient 0.055 contradicts -0.001, so which one should we believe?
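One way such a discrepancy can arise is when the predictors are correlated with each other: a simple regression coefficient absorbs the effect of the channels it leaves out, while the multiple regression coefficient is adjusted for them. The sketch below uses synthetic data (not the advertising dataset) in which "newspaper" has no effect of its own but is correlated with "radio"; the simple-regression slope for newspaper comes out clearly positive, while its multiple-regression coefficient is close to zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Synthetic data: radio truly drives sales; newspaper has no effect
# of its own but is correlated with radio.
radio = rng.uniform(0, 50, size=n)
newspaper = radio + rng.normal(scale=5.0, size=n)
sales = 3.0 + 0.2 * radio + rng.normal(scale=1.0, size=n)

# Simple regression: sales on newspaper only.
A_simple = np.column_stack([np.ones(n), newspaper])
coef_simple, *_ = np.linalg.lstsq(A_simple, sales, rcond=None)
print(coef_simple[1])  # clearly positive: newspaper "inherits" radio's effect

# Multiple regression: sales on radio and newspaper together.
A_multi = np.column_stack([np.ones(n), radio, newspaper])
coef_multi, *_ = np.linalg.lstsq(A_multi, sales, rcond=None)
print(coef_multi[2])   # near zero once we adjust for radio
```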