Topic overview
Regression Analysis
The foundation of predictive analytics. Learn to build, interpret, and test regression models.
Learning objectives
- Understand the assumptions behind the linear regression model.
- Understand the difference between correlation and causality.
- Estimate and interpret a linear regression model.
- Interpret goodness-of-fit measures.
- Conduct tests of significance.
- Address common violations of the OLS assumptions.
- Get started with R Markdown.
Key Learning Summary
The core ideas to take away from this week.
OLS Assumptions Matter
Linear regression relies on assumptions like linearity, independent errors, and constant variance. Violations can make your inference unreliable.
Correlation Is Not Causation
A strong relationship does not prove cause and effect. You need theory, design, or controls to make causal claims.
Interpret Coefficients in Context
Each coefficient is the expected change in Y for a one-unit change in X, holding other variables constant.
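This interpretation can be seen directly in a hand-fit simple regression. The sketch below uses the closed-form least-squares formulas on a tiny made-up dataset (the numbers are illustrative, not from the course):

```python
# Simple OLS by hand: y = b0 + b1*x on hypothetical data
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx
print(b0, b1)  # → 1.0 2.0
```

Here b1 = 2.0 means each one-unit increase in x is associated with a two-unit increase in the predicted y, holding nothing else because there is only one predictor.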
Goodness-of-Fit Is Not Enough
R-squared and adjusted R-squared summarize fit, but you still need significance tests and diagnostics.
Significance Tests Guide Decisions
t-tests, p-values, and confidence intervals tell you which predictors matter and how precise your estimates are.
Document Work with R Markdown
Use R Markdown to combine code, output, and explanation in one reproducible report.
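A minimal R Markdown file has a YAML header followed by prose and R code chunks. This skeleton is illustrative; the title and the `mydata` dataset are placeholders:

````markdown
---
title: "Regression Analysis Notes"
output: html_document
---

## Fit the model

```{r}
model <- lm(y ~ x, data = mydata)  # mydata is a placeholder dataset
summary(model)
```
````

Knitting the file runs the chunk and embeds the regression output next to your explanation, so code, results, and interpretation live in one reproducible report.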
The one sentence to remember: Linear regression is powerful, but only when its assumptions are checked and its results are interpreted responsibly.
Key Vocabulary
Terms you should be able to define and use confidently.
Core Concepts
Statistical Model
A simplified representation of reality built from assumptions.
Dependent Variable (Y)
The outcome we want to explain or predict.
Independent Variable (X)
A predictor used to explain variation in Y.
Correlation vs. Causation
Correlation shows association. Causation requires evidence of a true cause-and-effect link.
Fit & Inference
R-squared
Proportion of variation in Y explained by the model.
Adjusted R-squared
R-squared adjusted for the number of predictors.
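The adjustment penalizes extra predictors. A quick sketch with hypothetical numbers (R² = 0.6 from a fit with n = 5 observations and k = 1 predictor):

```python
# Adjusted R-squared from R-squared (hypothetical numbers)
r2, n, k = 0.6, 5, 1  # fit, sample size, number of predictors
adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj, 3))  # → 0.467
```

Adding a useless predictor raises R² slightly but can lower adjusted R², which is why the latter is preferred when comparing models of different sizes.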
t-statistic
Measures how many standard errors a coefficient is from zero.
p-value
Probability of seeing a result at least as extreme as the one observed if the true effect were zero.
Confidence Interval
A range of plausible values for the true coefficient.
Diagnostics
Residual
The difference between actual and predicted values: e = y − ŷ.
Heteroscedasticity
The spread of residuals changes across fitted values.
Outlier
An extreme observation that can distort the model.
Multicollinearity
Predictors are highly correlated, inflating standard errors.
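A crude diagnostic for heteroscedasticity is to compare the residual spread in the lower and upper halves of the fitted values; formal tests exist, but this sketch (with hypothetical residuals) shows the idea:

```python
# Crude heteroscedasticity check: compare residual spread across fitted values
# (fitted values and residuals below are hypothetical)
fitted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
residuals = [0.1, -0.2, 0.2, -0.9, 1.1, -1.3]

pairs = sorted(zip(fitted, residuals))  # order observations by fitted value
half = len(pairs) // 2

def spread(rs):
    """Mean squared residual as a simple measure of spread."""
    return sum(r * r for r in rs) / len(rs)

low = spread([r for _, r in pairs[:half]])
high = spread([r for _, r in pairs[half:]])
print(low, high)  # a much larger 'high' hints that variance grows with fitted values
```

If the two spreads differ sharply, the constant-variance assumption is suspect and standard errors may be unreliable.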
Key Formulas
The essentials you should recognize and use.
Core Regression
Regression Equation
ŷ = b₀ + b₁x₁ + … + bₖxₖ: the model’s predicted value for Y given the inputs.
Residual
e = y − ŷ: the model’s error for each observation.
Fit & Inference
R-squared
R² = 1 − SSE/SST (residual sum of squares over total sum of squares): the fraction of total variation in Y explained by the model.
t-statistic
t = b̂ / se(b̂): used to test whether a coefficient is statistically different from zero.
Confidence Interval
b̂ ± t* · se(b̂): a range of plausible values for the true coefficient.
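These fit-and-inference quantities can be computed by hand on a toy dataset. The numbers below are made up for illustration, and the critical value t* is taken from a t-table:

```python
import math

# Hypothetical dataset for a simple regression
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = my - b1 * mx

yhat = [b0 + b1 * xi for xi in x]
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # residual sum of squares
sst = sum((yi - my) ** 2 for yi in y)                 # total sum of squares

r2 = 1 - sse / sst                            # R-squared
se_b1 = math.sqrt(sse / (n - 2) / sxx)        # standard error of the slope
t = b1 / se_b1                                # t-statistic for H0: slope = 0
tcrit = 3.182                                 # t* for 95% CI with n-2 = 3 df (t-table)
ci = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)  # 95% confidence interval

print(round(r2, 3), round(t, 3), [round(c, 3) for c in ci])  # → 0.6 2.121 [-0.3, 1.5]
```

Because the interval covers zero, the slope is not statistically different from zero at the 5% level here, even though R² = 0.6 looks respectable, which illustrates why fit measures alone are not enough.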
Diagnostics
Homoscedasticity
The spread of residuals should be roughly constant across fitted values.