Lecture 8. Regression Diagnostics: Review and Practice
I. Review of basic assumptions about the error term
1. The mean of the error term is zero at every level of x
2. The variance of the error term is constant across all levels of x
3. The errors are uncorrelated. In other words, the error associated with one value of y has no effect on the errors associated with other y values
4. The probability distribution of the error term is normal
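The four assumptions can be written compactly as follows. This is only a notational sketch: the symbols ($\beta_j$, $\varepsilon_i$, $\sigma^2$, and $k$ predictors indexed by observation $i$) are assumed here for illustration and do not appear in the lecture itself.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i

\text{(1)}\quad E(\varepsilon_i \mid x_{i1}, \dots, x_{ik}) = 0
\text{(2)}\quad \operatorname{Var}(\varepsilon_i \mid x_{i1}, \dots, x_{ik}) = \sigma^2
\text{(3)}\quad \operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \text{for } i \neq j
\text{(4)}\quad \varepsilon_i \sim N(0, \sigma^2)
```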
A. Implications:
•Mean independence (1) ensures that the regression coefficients are unbiased
•Homoscedasticity (2) and uncorrelated disturbances (3) ensure that the estimated standard errors are unbiased and that the coefficient estimates have the lowest possible variance (efficient)
•Normality (4) ensures the validity of confidence intervals and p-values
B. Remember that:
•The linear model does not assume that the observed values of any variable are normally distributed
•The assumptions concern the error term, which we examine through the residuals
•Multiple regression expresses the joint multivariate association of the x’s with y
II. Regression Diagnostics
•Regression diagnostics are procedures to:
–detect violations of the linear model assumptions
–gauge the severity of the violations
–take appropriate remedial action
A. Model Specification
•The remaining regression diagnostics are meaningful only once the model is correctly specified, so specification is checked first
•If the functional form of the model is misspecified, then the coefficients are biased
B. STATA commands to test the model functional specification
•linktest: tests whether the model is properly specified by regressing y on the fitted values (_hat) and their square (_hatsq)
•In this procedure “_hatsq” must test insignificant: we want to fail to reject the null hypothesis that the model is specified correctly
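To see what linktest is doing, the same check can be run by hand. The following is a minimal sketch, assuming wage.dta (with wage, educ, exper and tenure) is in the working directory; the variable names yhat and yhatsq are illustrative and are not created by linktest itself.

```stata
* Sketch only: assumes wage.dta is in the current working directory
use wage, clear
regress wage educ exper tenure

* Fitted values and their square (yhat and yhatsq are illustrative names)
predict double yhat, xb
generate double yhatsq = yhat^2

* Re-estimate with the fitted value and its square as the only predictors
regress wage yhat yhatsq

* The test on yhatsq plays the role of the _hatsq test reported by linktest
test yhatsq
```

If the original model is well specified, the squared fitted value should add nothing, so its coefficient should test insignificant.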
C. Example:
•Use wage.dta
•Run the following commands:
.use wage, clear
.reg wage educ exper tenure
.linktest
•Look at the _hatsq p-value. It is near 0, so you reject the null hypothesis. This means that there is a misspecification in the model. Let’s try to transform the dependent variable.
•Run the following commands:
.generate lwage=log(wage)
.qreg lwage educ exper tenure
.linktest
•qreg is the command for quantile regression (median regression by default), a procedure that is less sensitive to y-outliers than OLS.
See: www2.sas.com/proceedings/sugi30/213-30.pdf
•The p-value of _hatsq is higher than 0.05, therefore it is insignificant and now we have a correctly specified model
•Or, if you want to use OLS regression, you should explore the specification of the independent variables.
•We have discussed that age and education are likely to have non-linear relationships with income. What would happen to the p-value of _hatsq if we transform the variable educ? Let’s try.
•Run the following commands:
.generate educ2=educ*educ
.reg lwage educ educ2 exper tenure
.linktest
•Look at the p-value of _hatsq: it is very high, that is, insignificant. Therefore, we can conclude that the present model passes the specification test. This does not make it the best model (you need to keep trying!). It only means that the model has passed a minimal statistical threshold of data fitting.
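It may help to write out what the squared term does to the interpretation of educ. The notation below is a sketch: the $\beta$ symbols and the error term $\varepsilon$ are assumed for illustration, and no estimated values are implied.

```latex
\log(\text{wage}) = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{educ}^2
                  + \beta_3\,\text{exper} + \beta_4\,\text{tenure} + \varepsilon

\frac{\partial \log(\text{wage})}{\partial\,\text{educ}} = \beta_1 + 2\beta_2\,\text{educ}
```

With the squared term included, the effect of one more year of education is no longer a single number: it depends on the level of educ at which it is evaluated.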
D. Caution:
•The omission of relevant independent variables causes biased coefficients
•It means that we are not controlling for important effects of x on y
•Thus we might incorrectly detect or fail to detect y/x relationships
•We need to consider not only what variables are included in a model but also what variables are not included
III. Regression Specification Test
•STATA command: estat ovtest
•We use this command to indicate whether there are important
omitted variables or not.
•This procedure adds powers of the model’s fitted values to the regression and tests whether they are jointly significant
•We want it to test insignificant, so that we fail to reject
the null hypothesis that there are no important omitted variables
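The logic of this test can be sketched by hand: add powers of the fitted values to the regression and test them jointly. The sketch below assumes wage.dta is available and uses the second, third and fourth powers, which is my understanding of what estat ovtest uses by default; the names fit, fit2, fit3 and fit4 are illustrative.

```stata
* Sketch only: assumes wage.dta is in the current working directory
use wage, clear
generate lwage = log(wage)

* Built-in test, for comparison with the manual version below
regress lwage educ exper tenure
estat ovtest

* Manual version: powers of the fitted values (illustrative variable names)
predict double fit, xb
generate double fit2 = fit^2
generate double fit3 = fit^3
generate double fit4 = fit^4

* Add the powers to the original model and test them jointly
regress lwage educ exper tenure fit2 fit3 fit4
test fit2 fit3 fit4
```

The joint F test should be close to the one reported by estat ovtest. If Stata drops one of the powers for collinearity, the two results will differ.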
A. Example:
•Run the following commands:
.reg lwage educ exper tenure
.estat ovtest
•And
.reg lwage educ educ2 exper tenure
.estat ovtest
•The p-value for the second model is higher and insignificant. Therefore, the second model shows less evidence of omitted variables and is preferred.
•Recall that passing any diagnostic test by no means guarantees that we have specified the best possible model.
B. Graphic Perspective
•At this point in the model specification procedure it is helpful to graph the residuals against the fitted values to obtain a graphical perspective on the model’s fit and on other problems related to the residuals
•Run the following commands:
.rvfplot, yline(0)
.rvfplot, yline(0) ms(i) ml(id)
.rvfplot, yline(0) ms(i) ml(female)
•Look at the graphs: do you notice any patterns in the residuals?
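The residual-versus-fitted plot can also be built by hand, which makes clear what rvfplot is drawing. This is a minimal sketch assuming wage.dta is available; the names fitted and resid are illustrative.

```stata
* Sketch only: assumes wage.dta is in the current working directory
use wage, clear
generate lwage = log(wage)
generate educ2 = educ*educ
regress lwage educ educ2 exper tenure

* Fitted values and residuals from the model just estimated
predict double fitted, xb
predict double resid, residuals

* Residuals on the vertical axis, fitted values on the horizontal axis
scatter resid fitted, yline(0)
```

A well-behaved model gives a shapeless band of points around the zero line. Curvature suggests a functional form problem, and a changing spread suggests non-constant variance.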
IV. Heteroscedasticity or Non-Constant Variance
•The graphs display fan-shaped residuals, which indicate problems of non-constant variance of the residuals
•These patterns suggest that the spread of the residuals is not random but rather changes with the values of x
A. Implications
•In the presence of non-constant variance:
–The OLS estimates are no longer efficient. Weighted least squares would give better estimates
–The standard errors are biased, making statistical significance either too hard or too easy to detect
B. STATA commands
•In STATA we test for non-constant variance by means of:
–Tests with estat: hettest, szroeter and imtest, and
–Graphs: rvfplot and rvpplot
•We want the tests to turn out insignificant so that we fail to reject the null hypothesis that there is no heteroscedasticity
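One way to see what a test of this kind does is to regress the squared residuals on the fitted values and check whether they are related. The sketch below reproduces, under that reading, the N*R-squared form of the Breusch-Pagan test, which I believe corresponds to estat hettest with the iid option. It assumes wage.dta is available; e, e2, yhat and lm are illustrative names.

```stata
* Sketch only: assumes wage.dta is in the current working directory
use wage, clear
generate lwage = log(wage)
generate educ2 = educ*educ
regress lwage educ educ2 exper tenure

* Built-in version, for comparison with the manual calculation
estat hettest, iid

* Manual version: squared residuals regressed on the fitted values
predict double e, residuals
predict double yhat, xb
generate double e2 = e^2
regress e2 yhat

* Test statistic N*R-squared, chi-squared with 1 degree of freedom
scalar lm = e(N)*e(r2)
display "N*R2 = " lm
display "p-value = " chi2tail(1, lm)
```

A small p-value means the squared residuals move with the fitted values, which is evidence against constant variance.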
C. Example:
•Run the following commands:
.reg lwage educ educ2 exper tenure
.estat hettest, rhs mt(sidak)
.estat szroeter, rhs mt(sidak)
.estat imtest
•Take a look at the output and try to interpret your results
•Summary of results for tests of non-constant variance
–hettest indicates problems with tenure
–szroeter also indicates problems with tenure
–imtest is only barely insignificant, so it does not rule out non-constant variance in the error term
D. Graphic Perspective
•Run the following commands:
.rvfplot, yline(0) ms(i) ml(id)
.rvpplot tenure, yline(0) ms(i) ml(id)
.rvpplot exper, yline(0) ms(i) ml(id)
•Examine the graphs
•Do you notice any differences in the patterns?
E. Example:
•The graph for tenure looks bad. The variance of the error
term increases with tenure
•The graph for exper is passable
F. Solutions:
•Add omitted variables
•Include interactions or transformations of the independent
variables
•Use weighted least squares regression
•If nothing else works, use robust standard errors
•At the end, you must redo all diagnostics and compare the
new model’s coefficients to the original model
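As an illustration of the last two points, the sketch below re-estimates the model with robust standard errors and places the two sets of results side by side. It assumes wage.dta is available; the stored-estimate names ols and robust are illustrative.

```stata
* Sketch only: assumes wage.dta is in the current working directory
use wage, clear
generate lwage = log(wage)
generate educ2 = educ*educ

* Conventional OLS standard errors
regress lwage educ educ2 exper tenure
estimates store ols

* Same model with heteroscedasticity-robust standard errors
regress lwage educ educ2 exper tenure, vce(robust)
estimates store robust

* Coefficients and standard errors side by side
estimates table ols robust, b(%9.4f) se(%9.4f)
```

The coefficients are identical in the two columns because robust standard errors change only the estimated precision, not the estimates themselves. Large differences in the standard errors are a further sign that non-constant variance matters for inference.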