Lecture 8.  Regression Diagnostics: Review and Practice

 

I.  Review of basic assumptions about the error term

 

1.The mean of the error term is always zero; it does not vary with the levels of x

2.The variance of the error term is constant for all levels of x

3.The errors are uncorrelated. In other words, the errors associated with one value of y have no effect on the errors associated with other y values

4.The probability distribution of the error term is normal

 

A. Implications:

 

•Mean independence (1) ensures that the regression coefficients are unbiased

•Homoscedasticity (2) and uncorrelated disturbances (3) ensure that the standard errors are unbiased and are the lowest possible (efficient)

•Normality (4) ensures the validity of confidence intervals and p-values

 

B. Remember that:

 

•The linear model does not assume that the observed values of any variable are normally distributed

•The assumptions involve the error term, which we examine through the residuals

•Multiple regression expresses the joint multivariate association of x’s with y

 

II. Regression Diagnostics

 

•Are procedures to:

–detect violations of the linear model assumptions

–gauge the severity of the violations

–take appropriate remedial action

 

A. Model Specification

 

•Regression diagnosis is applied once the model is correctly specified.

•If the model is functionally misspecified, then the coefficients are biased

 

B. STATA commands to test the model functional specification

 

linktest: it tests whether the model is properly specified by regressing y on the model's prediction (_hat) and the squared prediction (_hatsq)

•In this procedure _hatsq must test insignificant: we want to fail to reject the null hypothesis that the model is specified correctly
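
•To see what linktest does, here is a minimal sketch of the same check done by hand, assuming the wage.dta data used in the example that follows. The variable names xbhat and xbhatsq are made up for illustration; linktest itself reports them as _hat and _hatsq.

.* Fit the model and save its linear prediction

.use wage, clear

.reg wage educ exper tenure

.predict xbhat, xb

.* Square the prediction and regress y on both terms

.generate xbhatsq = xbhat^2

.reg wage xbhat xbhatsq

•The t-test on xbhatsq in the last regression plays the role of the _hatsq test: if the model is well specified, the squared prediction should add nothing once the prediction itself is included.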

 

C. Example:

 

•Use wage.dta

•Run the following commands:

 

.use wage, clear

.reg wage educ exper tenure

.linktest

 

•Look at the _hatsq p-value. It is near 0, so you reject the null hypothesis. This means that there is a misspecification in the model. Let’s try to transform the dependent variable

•Run the following commands:

 

.generate lwage=log(wage)

.qreg lwage educ exper tenure

.linktest

 

qreg is the command for quantile regression (by default, median regression), a procedure less sensitive to y-outliers than OLS.

See: www2.sas.com/proceedings/sugi30/213-30.pdf

•The p-value of _hatsq is higher than 0.05; therefore it is insignificant, and now we have a correctly specified model

•Or if you want to use OLS regression, you should explore the specification of the independent variables.

•We have discussed that age and education are likely to have non-linear relationships with income. What would happen to the p-value of _hatsq if we transform the variable educ? Let’s try. . .

•Run the following commands:

 

.generate educ2=educ*educ

.reg lwage educ educ2 exper tenure

.linktest

 

•Look at the p-value of _hatsq: it is very high, that is, insignificant. Therefore, we can conclude that the present model is a correctly specified model (but not necessarily the best model; you need to keep trying!). It only means that the model has passed a minimal statistical threshold of data fitting.

 

D. Caution:

 

•The omission of relevant independent variables causes biased coefficients

•It means that we are not controlling for important effects of x on y

•Thus we might incorrectly detect or fail to detect y/x relationships

•We need to consider not only what variables are included in a model but also what variables are not included

 

III. Regression Specification Test

 

•STATA command: estat ovtest

•We use this command to indicate whether there are important omitted variables or not.

•This procedure adds powers of the model’s fitted values to the regression and tests whether they are jointly significant

•We want it to test insignificant, so that we fail to reject the null hypothesis that there are no important omitted variables
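
•A rough sketch of the same idea done by hand, assuming lwage has already been generated as in the linktest example. The names fit, fit2, fit3 and fit4 are hypothetical. Stata may rescale the fitted values internally, so the F statistic from estat ovtest can differ slightly from this one.

.* Fit the model and save the fitted values

.reg lwage educ exper tenure

.predict fit, xb

.* Build the second, third and fourth powers of the fitted values

.generate fit2 = fit^2

.generate fit3 = fit^3

.generate fit4 = fit^4

.* Add the powers to the model and test them jointly

.reg lwage educ exper tenure fit2 fit3 fit4

.test fit2 fit3 fit4

•A significant joint F-test means the powers of the fitted values pick up something the model misses, which is the symptom estat ovtest looks for.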

 

A. Example: Run the following commands:

 

.reg lwage  educ exper tenure

.estat ovtest

•And

.reg lwage educ educ2 exper tenure

.estat ovtest

 

•The p-value for the second model is higher and insignificant. Therefore, the second model is better.

•Recall that passing any diagnostic test by no means guarantees that we have specified the best possible model.

 

B. Graphic Perspective

 

•At this point in the model specification procedure it is helpful to graph the fitted values vs. the residuals to obtain a graphical perspective on the model’s fit and other problems related to the residuals

•Run the following commands:

 

.rvfplot, yline(0)

.rvfplot, yline(0) ms(i) ml(id)

.rvfplot, yline(0) ms(i) ml(female)

 

Look at the graphs:

Do you notice any patterns in the residuals?

 

IV. Heteroscedasticity or Non-Constant variance

 

•The graphs display fan-shaped residuals, which indicate problems of non-constant variance of residuals

•These patterns suggest that the spread of the residuals is not constant but rather changes with the values of x

 

A. Implications

 

•In the presence of non-constant variance:

–The OLS estimates are no longer efficient. Weighted least squares would give more precise estimates

–The standard errors are biased, making statistical significance too hard or too easy to detect

 

B. STATA commands

 

•In STATA we test for non-constant variance by means of:

 

–Tests with estat: hettest, szroeter and imtest, and

–Graphs: rvfplot and rvpplot

 

•We want the tests to turn out insignificant so that we fail to reject the null hypothesis that there is no heteroscedasticity
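
•The idea behind these tests can be sketched with an auxiliary regression, assuming lwage and educ2 have been generated as in the earlier examples. The names res and res2 are hypothetical, and this is an illustration in the spirit of the Breusch-Pagan test, not the exact computation behind estat hettest.

.* Fit the model and save the residuals

.reg lwage educ educ2 exper tenure

.predict res, residuals

.* Regress the squared residuals on the x's

.generate res2 = res^2

.reg res2 educ educ2 exper tenure

•If the overall F-test of the last regression is significant, the squared residuals vary systematically with the x's, which is evidence against constant variance.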

 

C. Example:

 

•Run the following commands:

 

.reg lwage educ educ2 exper tenure

.estat hettest, rhs mtest(sidak)

.estat szroeter, rhs mtest(sidak)

.estat imtest

 

Take a look at the output

Try to interpret your results

 

 

•Summary of results for tests of non-constant variance

hettest indicates problems with tenure

szroeter  also indicates problems with tenure

imtest is only barely insignificant, which also hints at problems of non-constant variance in the error term

 

D. Graphic Perspective

 

•Run the following commands:

 

.rvfplot, yline(0) ms(i) ml(id)

.rvpplot tenure, yline(0) ms(i) ml(id)

.rvpplot exper, yline(0) ms(i) ml(id)

 

Examine the graphs

•Do you notice any differences in the patterns?

 

E. Example:

 

•The graph for tenure looks bad. The variance of the error term increases with tenure

•The graph for exper is passable

 

F. Solutions:

 

•Add omitted variables

•Include interactions or transformations of the independent variables

•Use weighted least squares regression

•If nothing else works, use robust standard errors

•At the end, you must redo all diagnostics and compare the new model’s coefficients to the original model
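
•A minimal sketch of the last two points, assuming lwage and educ2 have been generated as in the earlier examples. The stored names ols and rob are arbitrary labels chosen for illustration.

.* Original OLS fit with conventional standard errors

.reg lwage educ educ2 exper tenure

.estimates store ols

.* Same model with heteroscedasticity-robust standard errors

.reg lwage educ educ2 exper tenure, vce(robust)

.estimates store rob

.* Show coefficients and standard errors side by side

.estimates table ols rob, b se

•The coefficients are the same in both columns; only the standard errors change. When a remedy changes the model itself (new variables, transformations, weights), the same table can be used to compare its coefficients with the original ones.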