Lecture 6. Regression Diagnostics
I. Assumptions
•The validity of multiple regression depends
on whether certain assumptions are satisfied.
•OLS assumptions are specific conditions
under which multiple regression works well.
II. Standards of Performance
•Bias: An estimation method is unbiased if
there is no systematic tendency to produce
estimates that are too high or too low
•Efficiency: how much the estimates vary
around the true value. Efficient estimation
methods have standard errors that are
as small as possible
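Both standards can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings (a true slope of 2.0, 50 observations per sample, 2000 repeated samples, and the hypothetical helper name simulate_slopes); none of these values come from the lecture. The average gap between the estimates and the true slope reflects bias, and the spread of the estimates reflects efficiency.

```python
import numpy as np

def simulate_slopes(n_obs=50, n_reps=2000, true_slope=2.0, seed=0):
    """Return the OLS slope estimate from each of n_reps simulated samples."""
    rng = np.random.default_rng(seed)
    slopes = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.normal(size=n_obs)
        u = rng.normal(size=n_obs)            # disturbance with mean zero
        y = 1.0 + true_slope * x + u
        X = np.column_stack([np.ones(n_obs), x])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        slopes[r] = coef[1]
    return slopes

slopes = simulate_slopes()
bias = slopes.mean() - 2.0       # near zero when the method is unbiased
spread = slopes.std(ddof=1)      # empirical standard error of the slope
print(f"average bias: {bias:.4f}, spread of estimates: {spread:.4f}")
```

A more efficient method would show a smaller spread for the same data, while a biased method would show an average gap that does not shrink as the number of repetitions grows.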
III. Probability Sampling
•Every individual has a known, nonzero probability
of being chosen (in simple random sampling, an equal probability)
IV. Critical questions about
estimation
•Does it tell us about the causal
relationship among the variables?
V. Standard Linear Model
Assumptions
•Linearity: the dependent variable y is a
linear function of the x’s plus a random disturbance
U (random noise or unexplained variation)
•Mean independence: The mean of U does not
depend on the x’s. The mean of U is always
zero.
•Homoscedasticity: The variance of U does not
depend on the x’s. It is always constant.
•Uncorrelated disturbances: The value of U for
any observation is uncorrelated with the value
of U for any other observation
•Normal disturbance: U has a normal
distribution
•Linearity and Mean independence guarantee
unbiased estimates
•Homoscedasticity and Uncorrelated
disturbances guarantee efficiency
•Normality implies that we can use t tables
to calculate p values and confidence intervals (CI)
Based on the assumptions, OLS
estimation is the:
•(1) Best
•(2) Linear
•(3) Unbiased
estimation method
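The five assumptions of the standard linear model can be written compactly in symbols. The notation below is an assumption added for illustration and does not appear in the slides: i indexes observations, k is the number of independent variables, the betas are the coefficients, and sigma squared is the constant variance of U.

```latex
\begin{align*}
y_i &= \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + U_i && \text{(linearity)} \\
E(U_i \mid x_{i1}, \dots, x_{ik}) &= 0 && \text{(mean independence)} \\
\operatorname{Var}(U_i \mid x_{i1}, \dots, x_{ik}) &= \sigma^2 && \text{(homoscedasticity)} \\
\operatorname{Cov}(U_i, U_j) &= 0 \quad \text{for } i \neq j && \text{(uncorrelated disturbances)} \\
U_i &\sim N(0, \sigma^2) && \text{(normal disturbance)}
\end{align*}
```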
VI. Disturbance term U
•It is treated as a random variable: U has a
probability distribution
•For every value of U there is a certain
probability that that value will occur
•There is a different U for each individual
in the data set
VII. Mean Independence
•The independent variables are unrelated to
the random disturbance U
•The mean of U must be zero to get unbiased
estimates of the intercept
•This is the most critical assumption
•Why violations of this assumption matter:
–They can produce severe bias in estimates
–There are often reasons to expect violations
–There is no way to test for violations
without additional data
•Conditions that lead to violations
–Omitted x variables. If any omitted
variable that affects y is correlated with the measured x’s, that will
produce a correlation between U and the x’s
–Reverse causation. If y has a causal
effect on any of the x’s, then U will indirectly affect
the x’s.
–Measurement error in the x’s.
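The omitted-variable case can be made concrete with simulated data. This is a minimal sketch under assumed values: the variable z, the coefficients 2.0 and 3.0, and the helper name ols are all hypothetical and chosen only for illustration. Because z affects y and is correlated with x, leaving it out pushes its effect into U and biases the slope on x.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical data: z is the omitted variable, correlated with x.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
u = rng.normal(size=n)
y = 1.0 + 2.0 * x + 3.0 * z + u        # true slope on x is 2.0

def ols(columns, y):
    """OLS coefficients for an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

short = ols([x], y)        # z omitted: its effect is absorbed into U
full = ols([x, z], y)      # z included

print(f"slope on x, z omitted:  {short[1]:.3f}")
print(f"slope on x, z included: {full[1]:.3f}")
```

With these assumed values the slope from the short regression should settle near 2 + 3(0.8)/1.64, roughly 3.46, rather than 2.0, and collecting more observations does not remove the gap.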
VIII. Homoscedasticity
•Homoscedasticity vs. Heteroscedasticity
•Homoscedasticity: the degree of random
noise is always the same, regardless of the values
of the x variables
•This assumption can be checked with the
data, using a scatter plot of observed y against predicted y
•Violation of this assumption does not
produce biased estimators, but it does produce inefficient
estimators and biased standard errors
–Inefficiency: OLS gives equal weight to all observations,
but observations with smaller disturbance variance
contain more information.
Solution: Weighted Least Squares
–Biased standard errors: these lead to
biased test statistics and confidence intervals, and
therefore produce incorrect conclusions
•Solution: use robust standard errors
•This procedure does not solve the problem
of inefficiency but it will give you accurate test
statistics
•Another solution is to transform the
dependent variable (e.g. instead of using income, use
the logarithm of income). This procedure is
called a variance-stabilizing transformation. Such
transformations change the nature of the
relationship between y and the x’s.
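Two of the remedies above, robust standard errors and Weighted Least Squares, can be sketched in a few lines. The example is an illustration under stated assumptions: the data are simulated, the noise standard deviation is assumed to grow with x, and the robust errors use the White (HC0) formula, which is one common choice and not necessarily the one intended in the lecture. The WLS weights use the true variance, which is known here only because the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Hypothetical heteroscedastic data: noise standard deviation grows with x.
x = rng.uniform(1.0, 5.0, size=n)
sd = 0.5 * x**2
y = 1.0 + 2.0 * x + rng.normal(size=n) * sd

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Conventional standard errors assume one common disturbance variance.
sigma2 = resid @ resid / (n - X.shape[1])
se_conventional = np.sqrt(np.diag(sigma2 * XtX_inv))

# White (HC0) robust standard errors: each observation contributes its own
# squared residual instead of the pooled variance.
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

# Weighted least squares with weights 1 / variance (known only by simulation).
w = 1.0 / sd**2
Xw = X * np.sqrt(w)[:, None]
yw = y * np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(Xw, yw, rcond=None)

print("OLS slope:", beta[1], "WLS slope:", beta_wls[1])
print("conventional SE:", se_conventional[1], "robust SE:", se_robust[1])
```

The robust standard errors leave the OLS coefficients unchanged and only correct the uncertainty estimates, while WLS changes the coefficients themselves by giving more weight to the less noisy observations.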
IX. Uncorrelated Disturbances
•Ways in which this assumption might be
violated:
–If unmeasured variables are common to
two or more observations, their U terms will be
correlated
–If the behavior of one person affects
the behavior of another person in the sample
–In time series data and clustered samples
•Consequences of violating this assumption:
–Although the coefficients remain
unbiased they will be inefficient
–The estimated standard errors will be
biased downward and the test statistics will be
biased upward. Therefore, there will be a
tendency to conclude that relationships exist
when they really don’t.
•Diagnosis:
–Analysis of residual correlation for
‘pairs’ of individuals
–For clustering: estimate intraclass correlation
coefficient
–Durbin-Watson d test for residual
correlation (time series analysis)
•Solution:
–Generalized least squares will produce
optimal estimates and good estimates of standard
errors
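The Durbin-Watson d statistic mentioned under Diagnosis is simple to compute from the residuals: it is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below assumes a simulated time series whose disturbances follow an AR(1) process with correlation 0.7; the data, the trend, and the function name durbin_watson are illustrative assumptions.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(3)
n = 500
t = np.arange(n, dtype=float)

# Hypothetical time series with AR(1) disturbances (correlation 0.7).
u = np.zeros(n)
shocks = rng.normal(size=n)
for i in range(1, n):
    u[i] = 0.7 * u[i - 1] + shocks[i]
y = 5.0 + 0.1 * t + u

X = np.column_stack([np.ones(n), t])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

d = durbin_watson(resid)
print(f"Durbin-Watson d = {d:.2f}")
```

Values of d near 2 are consistent with uncorrelated disturbances, and values well below 2 point to positive correlation between successive disturbances. Judging significance still requires the Durbin-Watson tables, which this sketch does not reproduce.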
X. Normality
•The ONLY variable that is assumed to have a
normal distribution is the disturbance term U
•Non-normality of U could be problematic when the sample is
small (fewer than 200 cases)
•If the sample is small, insist on smaller p
values
XI. Other assumptions
•Independent variables are fixed
•Multivariate normal model
•There is no perfect multicollinearity among
the independent variables
XII. Multicollinearity
•Extreme:
–At least two independent variables are
perfectly related by a linear function
–Consequence: it is impossible to get
separate coefficient estimates for each of the collinear independent variables
•Near-extreme:
–There is a strong linear relationship
among the independent variables
•It only affects the coefficient estimates
for those variables that are collinear
•It has nothing to do with the dependent
variable
A. Diagnosis
•Basic method: examine a matrix of
two-variable correlations among all independent
variables
•Regress each independent variable on all
other independent variables and look for
high R-squares in any of the regressions
(you should consider multicollinearity
problematic when you get an R-square > 0.6)
•Tolerance: 1 - (R-square) from that regression, for each
independent variable. Watch out for low tolerances (T < 0.1)
•VIF (Variance Inflation Factor), which is
the reciprocal of the tolerance. Tolerances < 0.1
correspond to VIF > 10
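The tolerance and VIF calculations follow directly from the auxiliary regressions described above. The sketch below is a minimal illustration: the function name tolerance_and_vif and the simulated variables x1, x2, and x3 are hypothetical, with x2 built as a near copy of x1 so that both show low tolerance.

```python
import numpy as np

def tolerance_and_vif(X):
    """Tolerance and VIF for each column of X (independent variables, no intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    tol = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        tol[j] = 1.0 - r2
    return tol, 1.0 / tol

rng = np.random.default_rng(4)
n = 1_000
x1 = rng.normal(size=n)
x2 = x1 + 0.2 * rng.normal(size=n)    # nearly a copy of x1
x3 = rng.normal(size=n)               # unrelated to the others

tol, vif = tolerance_and_vif(np.column_stack([x1, x2, x3]))
for name, t_j, v_j in zip(["x1", "x2", "x3"], tol, vif):
    print(f"{name}: tolerance = {t_j:.3f}, VIF = {v_j:.1f}")
```

Note that the dependent variable never enters the calculation, which matches the point that multicollinearity concerns only the independent variables.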
B. Consequences
•Near-extreme multicollinearity is not a
violation of the OLS assumptions
•However, if an independent variable is
highly collinear with other variables, the standard
error of the coefficient will be large. So, it
is harder to find statistically significant
coefficients
C. When is multicollinearity
more likely to occur?
•Time-series data
•Panel data
•Aggregated, group-level or ecological data.
The units are not individuals but groups
of individuals and the variables are summary
measures for those groups
D. Solutions
•When collinear variables are alternative
measures of the same conceptual variable:
–Delete one or more variables from the
model
–Combine the collinear variables into an
index
–Estimate a latent variable model
–Perform a joint hypothesis test (H0: none
of the collinear variables has a coefficient that
differs from zero)
•The only real solution is to get better
data!
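The joint hypothesis test listed among the solutions can be sketched as an F test that compares the full model with a restricted model that drops the collinear variables. Everything in the example is an assumption made for illustration: the simulated variables (x1 and x2 as two noisy measures of one concept), the coefficient values, and the helper name ssr. The F statistic is the drop in the sum of squared residuals per restriction, divided by the residual variance of the full model.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return resid @ resid

rng = np.random.default_rng(5)
n = 200

# Hypothetical data: x1 and x2 are two noisy measures of the same concept.
concept = rng.normal(size=n)
x1 = concept + 0.1 * rng.normal(size=n)
x2 = concept + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.5 * x2 + 1.0 * x3 + rng.normal(size=n)

ones = np.ones(n)
X_full = np.column_stack([ones, x1, x2, x3])    # unrestricted model
X_restricted = np.column_stack([ones, x3])      # H0: both collinear coefficients are zero

q = 2                                # number of restrictions under H0
df_resid = n - X_full.shape[1]       # residual degrees of freedom, full model
F = ((ssr(X_restricted, y) - ssr(X_full, y)) / q) / (ssr(X_full, y) / df_resid)
print(f"F({q}, {df_resid}) = {F:.1f}")
```

The statistic is compared with an F table at (q, df_resid) degrees of freedom; at the 5% level the critical value is about 3.0 for these degrees of freedom. A large F can reject the joint hypothesis even when neither collinear coefficient is individually significant.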