Lecture 6.  Regression Diagnostics


I. Assumptions

   •The validity of multiple regression depends on whether certain assumptions are satisfied.

   •OLS assumptions are specific conditions under which multiple regression works well.


 

II. Standards of Performance

   •Bias: An estimation method is unbiased if there is no systematic tendency to produce

        estimates that are too high or too low

   •Efficiency: How much variation there is around the true value. Efficient estimation

         methods have standard errors that are as small as possible
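
A small simulation can make these two standards concrete. The sketch below is an illustration only, not part of the lecture: it assumes a made-up normal population with mean 5 and compares two estimators of that mean, the sample mean and the sample median. The function name `compare_estimators` and all parameter values are hypothetical.

```python
import numpy as np

def compare_estimators(true_mean=5.0, sd=2.0, n=50, reps=5000, seed=0):
    """Draw `reps` samples of size `n` and summarize two estimators of the mean."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(true_mean, sd, size=(reps, n))
    means = samples.mean(axis=1)            # estimator 1: sample mean
    medians = np.median(samples, axis=1)    # estimator 2: sample median
    return {
        # Bias: average estimate minus the true value (near 0 if unbiased).
        "mean_bias": means.mean() - true_mean,
        "median_bias": medians.mean() - true_mean,
        # Efficiency: spread of the estimates around the true value.
        "mean_sd": means.std(ddof=1),
        "median_sd": medians.std(ddof=1),
    }

print(compare_estimators())
```

Under these assumed conditions both estimators should show a bias close to zero, but the median's estimates should be more spread out, which is what it means for the sample mean to be the more efficient of the two.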

 

III. Probability Sampling

   •Every individual has an equal probability of being chosen

 

IV. Critical questions about estimation

    •Does it tell us about the causal relationship among the variables?


 

V. Standard Linear Model Assumptions

   •Linearity: the dependent variable y is a linear function of the x’s plus a random disturbance

    U (random noise or unexplained variation)

   •Mean independence: The mean of U does not depend on the x’s. The mean of U is always

    zero.


   •Homoscedasticity. The variance of U cannot depend on the x’s. It is always constant.

   •Uncorrelated disturbances. Value of U for any observation is uncorrelated with the value

    of U for any other observation


   •Normal disturbance. U has a normal distribution


   •Linearity and Mean independence guarantee unbiased estimates

   •Homoscedasticity and Uncorrelated disturbances guarantee efficiency

   •Normality means that we can use t tables to calculate p values and confidence intervals (CIs)
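
The five assumptions can also be written compactly. The notation below is an assumption made for illustration (the lecture does not fix symbols): i indexes observations, k is the number of x variables, the β's are the coefficients, and σ² is the disturbance variance.

```latex
\begin{align}
y_i &= \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + U_i
  && \text{(linearity)} \\
E(U_i \mid x_{i1}, \dots, x_{ik}) &= 0
  && \text{(mean independence)} \\
\operatorname{Var}(U_i \mid x_{i1}, \dots, x_{ik}) &= \sigma^2
  && \text{(homoscedasticity)} \\
\operatorname{Cov}(U_i, U_j) &= 0 \quad \text{for } i \neq j
  && \text{(uncorrelated disturbances)} \\
U_i &\sim N(0, \sigma^2)
  && \text{(normal disturbance)}
\end{align}
```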


 

Based on these assumptions, OLS estimation is:

(1) Best (2) Linear (3) Unbiased Estimation Method (BLUE)


 

VI. Disturbance term U

  •It is treated as a random variable: U has a probability distribution

  •For every value of U there is a certain probability that that value will occur

  •There is a different U for each individual in the data set


 

VII. Mean Independence

  •The independent variables are unrelated to the random disturbance U

  •The mean of U must be zero to get an unbiased estimate of the intercept

  •This is the most critical assumption


  •Violations of this assumption:

      –Can produce severe bias in the estimates

      –Are often to be expected

      –Cannot be tested for without additional data


   •Conditions that lead to violations

       –Omitted x variables. An omitted variable that affects y becomes part of U. If it is also

         correlated with the measured x’s, U will be correlated with the x’s

       –Reverse causation. If y has a causal effect on any of the x’s, then U will indirectly affect

         the x’s.

      –Measurement error in the x’s.
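
The omitted-variable case can be illustrated with simulated data. This is a minimal sketch under made-up assumptions: the true coefficient on x is 2.0, the omitted variable z has coefficient 3.0, and x is built to be correlated with z. The helper names `ols` and `omitted_variable_demo` are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients of y on X (X must already include a constant column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def omitted_variable_demo(n=20000, seed=1):
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)               # the variable that will be omitted
    x = 0.8 * z + rng.normal(size=n)     # measured x, correlated with z
    u = rng.normal(size=n)               # well-behaved disturbance
    y = 1.0 + 2.0 * x + 3.0 * z + u      # true effect of x is 2.0
    const = np.ones(n)
    full = ols(np.column_stack([const, x, z]), y)   # z included
    short = ols(np.column_stack([const, x]), y)     # z omitted: it moves into U
    return full[1], short[1]

full_slope, short_slope = omitted_variable_demo()
print("slope on x with z included:", full_slope)
print("slope on x with z omitted: ", short_slope)
```

With z included, the slope on x should land near the true 2.0. With z omitted, z becomes part of the disturbance, the disturbance is correlated with x, and the slope should be pushed well above 2.0.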


 

VIII. Homoscedasticity

   •Homoscedasticity vs. Heteroscedasticity

   •Homoscedasticity: the degree of random noise is always the same, regardless of the values

    of the x variables

   •This assumption can be checked with the data: scatter plot of observed y and predicted y


   •Violation of this assumption does not produce biased estimators, but it does produce inefficient

     estimators and biased standard errors

      –Inefficiency: OLS gives equal weight to all observations, but observations with

        smaller disturbance variance contain more information. Solution: Weighted Least Squares

      –Biased standard errors: these lead to biased test statistics and confidence intervals, and

        therefore to incorrect conclusions


   •Solution: use robust standard errors

   •This procedure does not solve the problem of inefficiency but it will give you accurate test

     statistics

   •Another solution is to transform the dependent variable (e.g., instead of using income, use

    the logarithm of income). This procedure is called a variance-stabilizing transformation. Such

    transformations change the nature of the relationship between y and the x’s.
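
To show what robust standard errors do, the sketch below computes conventional and robust standard errors side by side with NumPy. It assumes the simplest robust variant, the HC0 (White) sandwich estimator; the lecture does not say which variant it has in mind, and statistical packages often default to a small-sample adjustment of it. The data are simulated so that the noise grows with x, and the function names are hypothetical.

```python
import numpy as np

def ols_with_robust_se(X, y):
    """OLS coefficients with conventional and HC0 (White) standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Conventional: s^2 (X'X)^-1, valid only under homoscedasticity.
    s2 = resid @ resid / (n - k)
    se_conventional = np.sqrt(np.diag(s2 * XtX_inv))
    # HC0 sandwich: (X'X)^-1 X' diag(e^2) X (X'X)^-1.
    meat = X.T @ (X * resid[:, None] ** 2)
    se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_conventional, se_robust

def heteroscedastic_demo(n=20000, seed=2):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 10.0, size=n)
    u = rng.normal(size=n) * x          # noise standard deviation grows with x
    y = 1.0 + 0.5 * x + u
    X = np.column_stack([np.ones(n), x])
    return ols_with_robust_se(X, y)

beta, se_conv, se_rob = heteroscedastic_demo()
print("slope:", beta[1], "conventional SE:", se_conv[1], "robust SE:", se_rob[1])
```

The coefficient estimates are the same either way; only the standard errors change. In this particular setup the conventional standard error of the slope should come out smaller than the robust one, that is, it understates the uncertainty.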


 

IX. Uncorrelated Disturbances

   •Ways in which this assumption might be violated:

      –If unmeasured variables are common to two or more observations, their U terms will be

        correlated

      –If the behavior of one person affects the behavior of another person in the sample

      –Time series, clustered samples


   •Consequences of violating this assumption:

      –Although the coefficients remain unbiased they will be inefficient

      –The estimated standard errors will be biased downward and the test statistics will be

        biased upward. Therefore, there will be a tendency to conclude that relationships exist

        when they really don’t.


   •Diagnosis:

      –Analysis of residual correlation for ‘pairs’ of individuals

      –For clustering: estimate intraclass correlation coefficient

      –Durbin-Watson d test for residual correlation (time series analysis)

   •Solution:

      –Generalized least squares will produce optimal estimates and good estimates of standard

        errors
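
The Durbin-Watson statistic mentioned under Diagnosis is simple to compute from time-ordered residuals: d = Σ(e_t − e_{t−1})² / Σ e_t². The sketch below is illustrative only. Instead of residuals from a fitted regression, it applies the formula to simulated disturbance series, and the helper `ar1_series` is a hypothetical data generator.

```python
import numpy as np

def durbin_watson(residuals):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2), for residuals in time order."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def ar1_series(rho, n=2000, seed=3):
    """Simulate U_t = rho * U_{t-1} + noise (made-up example data)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=n)
    u = np.empty(n)
    u[0] = noise[0]
    for t in range(1, n):
        u[t] = rho * u[t - 1] + noise[t]
    return u

print("uncorrelated series:", durbin_watson(ar1_series(0.0)))
print("positively correlated series:", durbin_watson(ar1_series(0.7)))
```

Values of d near 2 are consistent with uncorrelated disturbances. Values well below 2 point to positive correlation between neighboring observations, and values well above 2 point to negative correlation.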


 

X. Normality

   •The ONLY variable that is assumed to have a normal distribution is the disturbance term U

   •Non-normality could be problematic when the sample is small (fewer than 200 cases)

   •If the sample is small, insist on smaller p values


 

XI. Other assumptions

   •Independent variables are fixed

   •Multivariate normal model

   •There is no perfect multicollinearity among the independent variables


 

 

 

 

 

XII. Multicollinearity

   •Extreme:

      –At least two independent variables are perfectly related by a linear function

      –Consequence: it is impossible to get separate coefficient estimates for those variables

   •Near-extreme:

      –There is a strong linear relationship among the independent variables


   •It only affects the coefficient estimates for those variables that are collinear

   •It has nothing to do with the dependent variable
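
The extreme case can be checked numerically: when one independent variable is an exact linear function of others, the design matrix loses rank and OLS has no unique solution. The sketch below uses made-up data in which x3 = 2·x1 − x2. The names `design_rank` and `build_designs` are hypothetical.

```python
import numpy as np

def design_rank(X):
    """Return (rank, number of columns) of a design matrix."""
    X = np.asarray(X, dtype=float)
    return int(np.linalg.matrix_rank(X)), X.shape[1]

def build_designs(n=100, seed=5):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    x3 = 2.0 * x1 - x2                   # exact linear function of x1 and x2
    const = np.ones(n)
    with_x3 = np.column_stack([const, x1, x2, x3])
    without_x3 = np.column_stack([const, x1, x2])
    return with_x3, without_x3

X_perfect, X_ok = build_designs()
print("with x3:   ", design_rank(X_perfect))   # rank is below the column count
print("without x3:", design_rank(X_ok))        # full column rank
```

A rank smaller than the number of columns means the data cannot tell the collinear coefficients apart. Note that the dependent variable never enters this check.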


 

A. Diagnosis

   •Basic method: examine a matrix of two-variable correlations among all independent

    variables

   •Regress each independent variable on all other independent variables and look for

    high R-squares in any of the regressions (you should consider multicollinearity

    problematic when you get an R-square > 0.6)


   •Tolerance: 1 - (R-square) for each independent variable. Watch out for low tolerances (T < 0.1)

   •VIF (Variance Inflation Factor), which is the reciprocal of the tolerance. Tolerances < 0.1

    correspond to VIF > 10
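
The auxiliary regressions, tolerances, and VIFs described above can be sketched directly with NumPy. The example is illustrative: the data are simulated so that x2 is nearly a copy of x1 while x3 is unrelated to both, and the function names are hypothetical.

```python
import numpy as np

def tolerance_and_vif(X):
    """For each column of X (no constant column), regress it on the other columns.

    Returns a list of (R-square, tolerance, VIF) tuples, one per column.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    results = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        tol = 1.0 - r2                   # tolerance = 1 - R-square
        results.append((r2, tol, 1.0 / tol))   # VIF = 1 / tolerance
    return results

def collinear_demo(n=1000, seed=4):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1
    x3 = rng.normal(size=n)              # unrelated to the others
    return tolerance_and_vif(np.column_stack([x1, x2, x3]))

for name, (r2, tol, vif) in zip(["x1", "x2", "x3"], collinear_demo()):
    print(name, "R-square:", r2, "tolerance:", tol, "VIF:", vif)
```

In this setup x1 and x2 should show tolerances far below 0.1 (VIF far above 10), while x3 should have a VIF close to 1.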


 

B. Consequences

   •Near-extreme multicollinearity is not a violation of the OLS assumptions

   •However, if an independent variable is highly collinear with other variables, the standard

     error of the coefficient will be large. So, it is harder to find statistically significant

     coefficients
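
The link between collinearity and large standard errors can be seen in the standard expression for the sampling variance of an OLS coefficient. The notation is assumed for illustration: b_j is the estimated coefficient on x_j, σ² is the disturbance variance, and R_j² is the R-square from regressing x_j on all the other independent variables.

```latex
\operatorname{Var}(b_j)
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}
    \cdot \frac{1}{1 - R_j^2}
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}
    \cdot \mathrm{VIF}_j
```

As R_j² approaches 1, the tolerance 1 − R_j² approaches 0 and the variance grows without bound. For example, a VIF of 10 multiplies the variance by 10 and the standard error by about 3.2.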


 

C. When is multicollinearity more likely to occur?

   •Time-series data

   •Panel data

   •Aggregated, group-level or ecological data. The units are not individuals but groups of

     individuals and the variables are summary measures for those groups


 

D. Solutions

   •When collinear variables are alternative measures of the same conceptual variable:

      –Delete one or more variables from the model

      –Combine the collinear variables into an index

      –Estimate a latent variable model

      –Perform a joint hypothesis test (H0: none of the collinear variables has a coefficient that

        differs from zero)

   •The only real solution is to get better data!
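
The joint hypothesis test listed among the solutions is usually carried out as an F test that compares two fitted models. The notation below is assumed for illustration: RSS_U is the residual sum of squares of the unrestricted model (all k independent variables included), RSS_R is that of the restricted model (the q collinear variables dropped), and n is the sample size.

```latex
F = \frac{(\mathrm{RSS}_R - \mathrm{RSS}_U) / q}{\mathrm{RSS}_U / (n - k - 1)}
\qquad
\text{with } F \sim F_{q,\; n-k-1} \text{ under } H_0
```

A large F leads to rejecting H0, which indicates that the collinear variables matter as a group even when none of their individual coefficients is statistically significant.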