Lecture 8, Part II.

V. Univariate analysis of tenure

 

•Run the following commands:

 

.su tenure, detail

.hist tenure, norm plotr(c(black))

.qnorm tenure, grid plotr(c(red))

 

•The distribution of the variable tenure is:

–Extremely skewed

–Non-monotonic

•Apparently a transformation of the variable will not be useful

•Let’s try two new Stata commands: gladder and ladder

 

A. Distributions of tenure

 

•Run the following commands:

                        .gladder tenure

                        .ladder tenure
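
As a rough sketch of what these two commands do: ladder applies each power transformation on the ladder of powers (cubic, square, identity, square root, log, and the reciprocal versions) and reports a skewness-kurtosis normality test for each, while gladder draws the corresponding histograms. The example below builds three of those rungs by hand. It assumes Stata's bundled nlsw88 data as a stand-in for the lecture's wage data; the names sqrt_tenure, ln_tenure and inv_tenure are made up for illustration.

```stata
* Stand-in data (assumption): nlsw88 ships with Stata and has a variable called tenure
sysuse nlsw88, clear

* Three rungs of the ladder, built by hand (hypothetical variable names)
generate double sqrt_tenure = sqrt(tenure)
generate double ln_tenure   = ln(tenure)    // missing where tenure == 0
generate double inv_tenure  = 1/tenure      // missing where tenure == 0

* Skewness-kurtosis normality test for the raw variable and each transform
sktest tenure sqrt_tenure ln_tenure inv_tenure
```

If none of the transformed variables comes close to passing the test, no rung of the ladder is likely to help.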

 

B. Interpretation

 

•Nothing looks promising. This happens when the distribution of the variable is non-monotonic

•In cases like this, we should turn a quantitative variable into a categorical variable

 

VI. Creating a Qualitative variable

 

•Run the following commands:

 

.xtile tenurecats=tenure, nq(10)

.bys tenurecats: su tenure

.tab tenurecats, su(tenure)

.tab tenurecats, plot

.label define tenc 1 "Cat1" 4 "Cat2" 5 "Cat3" 6 "Cat4" 7 "Cat5" 8 "Cat6" 9 "Cat7" 10 "Cat8"

.label value tenurecats tenc

.label variable tenurecats "tenure categories"
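
xtile with nq(10) assigns each observation to a decile, numbered 1 to 10. When many observations share the same value, several cutpoints coincide and some category numbers are never used, which is presumably why the label definition above skips 2 and 3. A minimal sketch for checking this, assuming the bundled nlsw88 data as a stand-in and a hypothetical variable name tencat:

```stata
* Stand-in data (assumption): nlsw88 ships with Stata
sysuse nlsw88, clear

* Decile categories of tenure (tencat is a hypothetical name)
xtile tencat = tenure, nq(10)

* Which category numbers actually occur, with the tenure range in each
tabstat tenure, by(tencat) statistics(n min max)

* The nine cutpoints themselves; identical cutpoints signal heavy ties
_pctile tenure, nquantiles(10)
return list
```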

 

A. Regression model and diagnostics

 

•Run the following commands:

 

.xi:reg lwage educ educ2 exper i.tenurecats

.testparm _Itenurecat_2-_Itenurecat_8

.linktest

.estat ovtest

.rvfplot, yline(0)

.estat imtest

.estat hettest, rhs mt(sidak)

.estat szroeter, rhs mt(sidak)

•Assess your results
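
The xi: prefix is needed only in older versions of Stata. From Stata 11 on, factor-variable notation fits the same kind of model without generated _I* dummies, and testparm can refer to the whole set of categories at once. The sketch below rests on assumptions: it uses the bundled nlsw88 data as a stand-in, with grade in place of educ and ttl_exp in place of exper, and the names lwage, grade2 and tencat are hypothetical.

```stata
* Stand-in data (assumption): nlsw88 ships with Stata
sysuse nlsw88, clear
generate lwage  = ln(wage)
generate grade2 = grade^2
xtile tencat = tenure, nq(10)

* Same kind of specification, written with factor-variable notation
regress lwage grade grade2 ttl_exp i.tencat

* Joint test that all tenure-category coefficients are zero
testparm i.tencat
```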

 

VII. Robust Standard Errors

 

•It is recommended to use robust standard errors routinely, as long as the sample size isn’t small

•If we use robust standard errors, lots of diagnostic procedures won’t work because their statistical premises do not hold.

•A reasonable strategy is to estimate the model both with and without robust standard errors and compare the results. You should also run the diagnostics on the model without robust standard errors

•Run the following commands:

 

.xi:reg lwage educ educ2 exper i.tenurecats, robust

.testparm _Itenurecat_2-_Itenurecat_8

.xi:reg lwage educ educ2 exper i.tenurecats

 

Answer the following question:

•Are standard errors & p-values notably different with versus without robust standard errors?
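
One way to make this comparison easier is to store both sets of estimates and print them side by side. This is only a sketch under assumptions: it uses the bundled nlsw88 data as a stand-in, a simplified specification, and the hypothetical names lwage, ols and robust.

```stata
* Stand-in data (assumption): nlsw88 ships with Stata
sysuse nlsw88, clear
generate lwage = ln(wage)

* Conventional OLS standard errors
regress lwage grade ttl_exp tenure
estimates store ols

* Robust standard errors; the coefficients themselves are unchanged
regress lwage grade ttl_exp tenure, vce(robust)
estimates store robust

* Coefficients, standard errors and p-values side by side
estimates table ols robust, b(%9.4f) se(%9.4f) p(%9.4f)
```

Large gaps between the two columns of standard errors are a hint that the constant-variance assumption deserves a closer look.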

 

VIII. Correlated Errors

•In general there’s no straightforward way to check for correlated errors

•Wage.dat is a data set based on a sample that is not clustered, panel, or time-series data

•If we suspect correlated errors, we compensate in one or more of the following ways:

•By using robust standard errors

•If it is cluster sampling, by using the cluster option in Stata

•If it is time-series data, by using the bgodfrey command (estat bgodfrey) in Stata to test for serial correlation
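
For the cluster case, a minimal sketch looks like this. It assumes the bundled nlsw88 data as a stand-in and treats industry as the cluster identifier purely for illustration; the lecture's wage data has no such clustering.

```stata
* Stand-in data (assumption): nlsw88 ships with Stata
sysuse nlsw88, clear
generate lwage = ln(wage)

* Standard errors that allow errors to be correlated within each industry
* (industry as the cluster variable is a hypothetical choice)
regress lwage grade ttl_exp tenure, vce(cluster industry)
```

The coefficients match those from the unclustered regression; only the standard errors change.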

 

IX. Influential Outliers

•Particularly in small samples, OLS coefficient estimates can be strongly influenced by particular observations

•An observation’s influence on the coefficients depends on its discrepancy and leverage

 

A. Discrepancy and Leverage

   •Discrepancy: how far the observation’s y-value falls from the value the regression predicts for it

   •Leverage: how far the observation’s x-values fall from the means of x

   Discrepancy × Leverage = Influence (an observation needs both to be influential)

 

B. Studentized Residuals

   •Discrepancy is measured by studentized residuals

   •Studentized residuals of -3 or less or +3 or more usually represent outliers with potential influence

   •Outliers with low leverage can affect the equation’s constant, reduce its fit, and increase its standard errors, but they have little influence on the slope coefficients
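
For reference, the studentized residual that Stata's predict, rstudent returns can be written as below. The notation is an assumption of this note, not the lecture's: e_i is the ordinary residual, h_i is the hat value of observation i, and s_(i) is the root mean squared error from the regression re-estimated without observation i.

```latex
t_i = \frac{e_i}{s_{(i)}\,\sqrt{1 - h_i}}
```

If the model is correct, t_i follows a t distribution with n - k - 1 degrees of freedom (k counting all estimated coefficients, constant included), which is why values beyond about ±3 are rare.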

 

C. Hat Value:

   •Leverage is measured by the hat value, a non-negative statistic that summarizes how far an observation’s explanatory variables fall from their means: the greater the hat value, the farther the observation is from the x-means

   •Hat values are likely to be greater in small or moderate samples; values of 3k/n or more (k = number of estimated coefficients, n = sample size) are relatively large and indicate potential influence
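
The cutoff is easier to read next to the definition. In the notation assumed here, X is the n-by-k matrix of explanatory variables including the constant column, and x_i is its i-th row written as a column vector.

```latex
h_i = x_i^{\top} (X^{\top} X)^{-1} x_i, \qquad \sum_{i=1}^{n} h_i = k, \qquad \bar{h} = \frac{k}{n}

% special case: one explanatory variable plus a constant (k = 2)
h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}
```

Because the hat values average k/n, the 3k/n rule flags observations with three times the average leverage. The special case shows directly that leverage grows with the squared distance from the x-mean.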

 

D. Cook’s Distance and DFITS

   •These two indicators measure the actual influence of an observation on the overall fit of a model

   •Cook’s distance and DFITS values of 1 or more, or of 4/n or more in large samples, suggest substantial influence on the model’s overall fit
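
Both statistics combine discrepancy and leverage in one number. The standard textbook forms are given below; the notation is an assumption of this note: e_i is the residual, h_i the hat value, k the number of estimated coefficients, s the root mean squared error, and s_(i) the same quantity with observation i deleted.

```latex
D_i = \frac{r_i^2}{k} \cdot \frac{h_i}{1 - h_i}, \qquad r_i = \frac{e_i}{s\,\sqrt{1 - h_i}}

\mathrm{DFITS}_i = t_i \sqrt{\frac{h_i}{1 - h_i}}, \qquad t_i = \frac{e_i}{s_{(i)}\,\sqrt{1 - h_i}}
```

In both formulas a large residual term is multiplied by a leverage term, so an observation scores high only when it has both.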

 

E. DFBETAs:

   •DFBETAs measure the actual influence of each observation on a particular slope coefficient

   •They provide the most direct measure of an observation’s influence on the slope coefficients

   •A DFBETA of 1 means that the observation shifts the corresponding slope coefficient by 1 standard error

   •DFBETAs of 1 or more in absolute value, or of at least 2 divided by the square root of n (in large samples), represent influential outliers
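
Written out, a DFBETA is the change in coefficient j when observation i is deleted, expressed in standard-error units. The notation is an assumption of this note: b_j is the coefficient from the full sample, b_j(i) is the coefficient with observation i deleted, and s_(i) is the root mean squared error of that reduced regression.

```latex
\mathrm{DFBETA}_{ij} = \frac{b_j - b_{j(i)}}{s_{(i)} \sqrt{\left[(X^{\top} X)^{-1}\right]_{jj}}}
```

As a worked example of the usual size-adjusted cutoff 2/\sqrt{n}: with n = 100 it equals 0.2, and with n = 2500 it equals 0.04, so the larger the sample, the smaller the shift that counts as notable.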

 

F. Graphic indicators:

   •Let’s examine some graphic indicators:

   •Run the following commands:

 

.lvr2plot, ms(i) ml(id)

.avplots

.avplot educ, ml(id)

.avplot exper

 

•Based on the preceding explanations, interpret your results

•Answer the following question: Are there any influential observations?

 

G. Numeric analysis of influential outliers

   •Run the following commands:

 

.predict rstu if e(sample), rstu

.predict h if e(sample), hat

.predict d if e(sample), cooksd

.dfbeta

.su rstu-DF_lxtenure_8

 

   •Although rstu has some clear outliers, the other diagnostics look good
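
Rather than reading the summary table by eye, the cutoffs from the sections above can be applied directly. The sketch below is an illustration under assumptions: it uses the bundled nlsw88 data as a stand-in, a simplified specification, and the hypothetical names lwage and dfb_grade.

```stata
* Stand-in data (assumption): nlsw88 ships with Stata
sysuse nlsw88, clear
generate lwage = ln(wage)
regress lwage grade ttl_exp tenure

predict rstu      if e(sample), rstudent
predict h         if e(sample), hat
predict d         if e(sample), cooksd
predict dfb_grade if e(sample), dfbeta(grade)

local k = e(df_m) + 1    // estimated coefficients, including the constant
local n = e(N)

* Missing values count as very large in Stata, so exclude them explicitly
count if abs(rstu) >= 3 & !missing(rstu)
count if h >= 3*`k'/`n' & !missing(h)
count if d >= 4/`n' & !missing(d)
count if abs(dfb_grade) >= 2/sqrt(`n') & !missing(dfb_grade)
```

Replacing count with list shows which observations are flagged, which helps when the same observation turns up under several criteria.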

 

H. Notes on Outliers

 

•Outliers are most likely to cause problems in small samples

•In a normal distribution we expect about 5% of the observations to lie more than 2 standard deviations from the mean, so a few apparent outliers are to be expected

•Do not over-fit a model to a sample. Remember that there is sample to sample variation

 

X. Multicollinearity

 

•Multicollinearity means high correlation among the explanatory variables

•Multicollinearity does not violate any linear model assumptions: if our objective were merely to predict values of y, there would be no need to worry about it

•But, like a small sample or subsample, it does inflate standard errors

•Signs of multicollinearity:

   –High bivariate correlations (.8+) between the explanatory variables. But, because multiple

     regression expresses joint linear effects, such correlations are not reliable indicators

   –Highly inflated slope coefficients and standard errors

   –Unstable slope coefficients. In other words slope coefficients that change markedly when

      other explanatory variables are removed or added.

   –Tolerance statistics less than .10 or VIF (variance inflation factor) greater than 10

•Run the following commands:

 

.xi:reg lwage educ educ2 exper i.tenurecats

.vif

 

•educ and educ2 have high multicollinearity as a result of the quadratic transformation of the variable. This is expected and causes no problem. The other variables look fine
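
The VIF for an explanatory variable equals 1/(1 - R2), where R2 comes from regressing that variable on all the other explanatory variables; tolerance is its reciprocal, 1 - R2. A minimal sketch that reproduces one VIF by hand, assuming the bundled nlsw88 data as a stand-in, a simplified specification, and the hypothetical name lwage:

```stata
* Stand-in data (assumption): nlsw88 ships with Stata
sysuse nlsw88, clear
generate lwage = ln(wage)

regress lwage grade ttl_exp tenure
vif

* VIF for ttl_exp by hand: regress it on the other explanatory variables,
* using the same observations as the main regression
regress ttl_exp grade tenure if e(sample)
display "VIF for ttl_exp = " (1/(1 - e(r2)))
display "Tolerance       = " (1 - e(r2))
```

The hand-computed value should match the ttl_exp row of the vif output.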

 

A. Multicollinearity: Solutions

 

•What would we do if there were such a problem?

–Eliminate variables

–Collect additional data

–Group relevant variables into sets of variables. Use principal components analysis or factor analysis

–Use “ridge regression” (see the Mendenhall and Sincich text)