Lecture 8, Part II.
V. Univariate analysis of tenure
•Run the following commands:
.su tenure, detail
.hist tenure, norm plotr(c(black))
.qnorm tenure, grid plotr(c(red))
•The distribution of the variable tenure is:
–Extremely skewed
–Non-monotonic
•Apparently a transformation of the variable will not be
useful
•Let’s try two new Stata commands: gladder and ladder
A. Distributions of tenure
•Run the following commands:
.gladder tenure
.ladder tenure
B. Interpretation
•Nothing looks promising. This happens when the distribution
of the variable is non-monotonic
•In cases like this, we should turn a quantitative variable
into a categorical variable
VI. Creating a Qualitative variable
•Run the following commands:
.xtile tenurecats=tenure, nq(10)
.bys tenurecats: su tenure
.tab tenurecats, su(tenure)
.tab tenurecats, plot
.label define tenc 1 "Cat1" 4 "Cat2" 5 "Cat3" 6 "Cat4" 7 "Cat5" 8 "Cat6" 9 "Cat7" 10 "Cat8"
.label values tenurecats tenc
.label variable tenurecats "tenure categories"
A. Regression model and diagnostics
•Run the following commands:
.xi:reg lwage educ educ2 exper i.tenurecats
.testparm _Itenurecat_2-_Itenurecat_8
.linktest
.estat ovtest
.rvfplot, yline(0)
.estat imtest
.estat hettest, rhs mt(sidak)
.estat szroeter, rhs mt(sidak)
•Assess your results
VII. Robust Standard Errors
•It is recommended to use robust standard errors routinely,
as long as the sample size isn’t small
•If we use robust standard errors, many diagnostic
procedures won’t work because their statistical premises do not hold.
•A reasonable strategy is to estimate the model with robust standard
errors, re-estimate it without them, and compare the two. You
should also run the diagnostics on the model without robust standard errors
•Run the following commands:
.xi:reg lwage educ educ2 exper i.tenurecats, robust
.testparm _Itenurecat_2-_Itenurecat_8
.xi:reg lwage educ educ2 exper i.tenurecats
•Answer the following question: Are standard errors and p-values
notably different with versus without robust standard errors?
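One way to line up the two sets of results is to store each fit and tabulate them side by side. The sketch below is only an illustration under stated assumptions: it uses nlsw88, a data set shipped with Stata, as a stand-in for Wage.dat, builds stand-in versions of lwage, educ, educ2, exper and tenurecats from it, and uses factor-variable notation (Stata 11 or later) in place of xi. The stored names m_ols and m_rob are arbitrary.

```stata
* Stand-in data (assumption): nlsw88 ships with Stata; it is not Wage.dat
sysuse nlsw88, clear
generate lwage = ln(wage)
generate educ = grade
generate educ2 = grade^2
generate exper = ttl_exp
xtile tenurecats = tenure, nq(10)

* Conventional standard errors
regress lwage educ educ2 exper i.tenurecats
estimates store m_ols

* Robust standard errors
regress lwage educ educ2 exper i.tenurecats, vce(robust)
estimates store m_rob

* Coefficients are identical across the two columns;
* only the standard errors and p-values can differ
estimates table m_ols m_rob, b(%9.4f) se(%9.4f) p(%9.4f)
```

If the two columns of standard errors are close, heteroskedasticity is probably not distorting the conventional results much.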
VIII. Correlated Errors
•In general there’s no straightforward way to check for
correlated errors
•Wage.dat is a data set based on a sample that is neither a
cluster sample nor panel or time-series data
•If we suspect correlated errors, we compensate in one or
more of the following ways:
–By using robust standard errors
–If it is cluster sampling, by using the cluster option in Stata
–If it is time series data, by testing for serial correlation
with estat bgodfrey in Stata
IX. Influential Outliers
•Particularly in small samples, OLS coefficient estimates can
be strongly influenced by particular observations
•An observation’s influence on the coefficients depends on
its discrepancy and leverage
A. Discrepancy and Leverage
•Discrepancy: how far the observation’s y value falls from the
value the model predicts for it
•Leverage: how far the observation on the x-axis falls from
the mean for x
•Discrepancy × Leverage = Influence
B. Studentized Residuals
•Discrepancy is measured by studentized residuals
•Studentized residuals of -3 or less or +3 or more usually
represent outliers with potential influence
•Outliers with large studentized residuals but little leverage
can affect the equation’s constant, reduce its fit, and increase its
standard errors, but they have little influence on the slope coefficients.
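For reference, a common textbook definition of the studentized residual is sketched below. The lecture does not give the formula, so the notation here is an assumption: e_i is the OLS residual, h_i the hat value, s_(-i) the residual standard deviation re-estimated with observation i left out, and k the number of estimated coefficients including the constant.

```latex
t_i = \frac{e_i}{s_{(-i)}\,\sqrt{1 - h_i}},
\qquad
t_i \sim t_{\,n-k-1} \ \text{under the model assumptions}
```

Because t_i follows a t distribution when the model holds, values beyond 3 in absolute value should be rare, which is the rationale for the cutoff above.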
C. Hat Value:
•Leverage is measured by the hat value, a non-negative statistic
that summarizes how far an observation’s explanatory variables fall
from their means: greater hat values are farther from the x-mean
•Hat values are likely to be greater in small or moderate samples;
values of 3*k/n or more are relatively large and indicate potential
influence
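A sketch of the algebra behind the hat value and the 3*k/n cutoff, assuming k counts all estimated coefficients including the constant (the lecture does not define k). X is the n-by-k matrix of explanatory variables and x_i is its i-th row.

```latex
h_i = \mathbf{x}_i^{\top}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{x}_i,
\qquad
\sum_{i=1}^{n} h_i = k
\;\;\Rightarrow\;\;
\bar{h} = \frac{k}{n},
\qquad
\text{cutoff: } h_i \ge \frac{3k}{n} = 3\,\bar{h}
```

So the rule flags observations whose leverage is at least three times the average. With hypothetical values n = 500 and k = 12, the average hat value is 0.024 and the cutoff is 0.072.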
D. Cook’s Distance and DFITS
•These 2 indicators measure the actual influence of an observation
on the overall fit of a model
•Cook’s distance or absolute DFITS values of 1 or more, or in large
samples Cook’s distance of 4/n or more or absolute DFITS of 2*sqrt(k/n)
or more, suggest substantial influence on the model’s overall fit
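The usual textbook formulas show how both statistics combine a discrepancy term with a leverage term. The notation is an assumption, since the lecture gives no formulas: r_i is the standardized residual, t_i the studentized residual, h_i the hat value, and k the number of estimated coefficients including the constant.

```latex
D_i = \frac{r_i^{2}}{k}\cdot\frac{h_i}{1 - h_i},
\qquad
\mathrm{DFITS}_i = t_i\,\sqrt{\frac{h_i}{1 - h_i}}
```

In each case a large residual only produces a large value when the hat value is not close to zero, and a high hat value only matters when the residual is not close to zero.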
E. DFBETAs:
•DFBETAs measure the actual influence of each observation on
particular slope coefficients
•They provide the most direct measure of an observation’s influence
on each explanatory variable’s slope coefficient
•A DFBETA of 1 means the observation shifts the corresponding
slope coef. by 1 standard error
•DFBETAs of 1 or more in absolute value, or of at least 2 divided by
the square root of n (in large samples), represent influential outliers
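As a sketch of what the statistic measures, a DFBETA can be written as the change in a coefficient when one observation is dropped, scaled by an estimate of that coefficient's standard error. The notation is an assumption: b_j is the coefficient estimated on all n observations and b_j(-i) is the same coefficient with observation i left out.

```latex
\mathrm{DFBETA}_{ij} = \frac{b_j - b_{j(-i)}}{\widehat{\mathrm{SE}}(b_j)},
\qquad
\text{large-sample cutoff: } \left|\mathrm{DFBETA}_{ij}\right| \ge \frac{2}{\sqrt{n}}
```

The large-sample cutoff shrinks as n grows: with a hypothetical n = 100 it is 0.2, and with n = 400 it is 0.1.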
F. Graphic indicators:
•Let’s examine some graphic indicators:
•Run the following commands:
.lvr2plot, ms(i) ml(id)
.avplots
.avplot educ, ml(id)
.avplot exper
•Based on the preceding explanations, interpret your results
•Answer the following question: Are there any influential
observations?
G. Numeric analysis of influential outliers
•Run the following commands:
.predict rstu if e(sample), rstu
.predict h if e(sample), hat
.predict d if e(sample), cooksd
.dfbeta
.su rstu-DF_lxtenure_8
•Although rstu has some clear outliers, the other diagnostics look
good
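Beyond summarizing the diagnostics, it can help to flag the observations that cross the rule-of-thumb cutoffs from this section. The sketch below is an illustration, not the lecture's own procedure: it uses nlsw88 (shipped with Stata) as stand-in data with a simplified stand-in model, takes k to be the number of estimated coefficients including the constant, and uses hypothetical names cut_h, cut_d and flag.

```stata
* Stand-in data and model (assumption): nlsw88 is not Wage.dat
sysuse nlsw88, clear
generate lwage = ln(wage)
regress lwage grade ttl_exp tenure

* Same diagnostics as in the commands above
predict rstu if e(sample), rstudent
predict h if e(sample), hat
predict d if e(sample), cooksd

* Cutoffs: 3*k/n for hat values, 4/n for Cook's distance
* k is taken as model df + 1 (an assumption about what k counts)
scalar cut_h = 3*(e(df_m) + 1)/e(N)
scalar cut_d = 4/e(N)

* Flag estimation-sample observations that cross any cutoff
generate byte flag = (abs(rstu) >= 3 | h >= scalar(cut_h) | d >= scalar(cut_d)) if e(sample)
count if flag == 1
list idcode rstu h d if flag == 1, clean noobs
```

Restricting every step to e(sample) matters here, because Stata treats missing values as larger than any number and would otherwise flag observations that were not in the regression.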
H.
Notes on Outliers
•Outliers are most likely to cause problems in small samples
•In a normal distribution we expect about 5% of the observations to
fall more than 2 standard deviations from the mean, so some apparent
outliers are expected by chance
•Do not over-fit a model to a sample. Remember that there is
sample to sample variation
X. Multicollinearity
•Multicollinearity refers to high
correlation among the explanatory variables
•Multicollinearity does not violate
any linear model assumptions: if our objective were merely to predict values of
y, there would be no need to worry about it.
•But, like a small sample or subsample size, it does inflate
standard errors
•Signs of multicollinearity:
–High bivariate correlations (.8+) between the explanatory
variables. But, because multiple
regression expresses joint linear effects,
such correlations are not reliable indicators
–Highly inflated slope coefficients and standard errors
–Unstable slope coefficients. In other words, slope coefficients
that change markedly when other explanatory variables are removed or added.
–Tolerance statistics less than .10 or VIF (variance inflation
factor) greater than 10
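The two cutoffs in the last item are the same rule stated two ways. As a sketch in assumed notation, R_j^2 is the R-squared from regressing explanatory variable j on all the other explanatory variables:

```latex
\mathrm{Tolerance}_j = 1 - R_j^{2},
\qquad
\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}} = \frac{1}{\mathrm{Tolerance}_j}
```

A tolerance of .10 therefore corresponds to R_j^2 = .90 and a VIF of 10. Since the VIF multiplies the coefficient's sampling variance, its standard error is inflated by roughly the square root of 10, about 3.2 times what it would be if variable j were uncorrelated with the other explanatory variables.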
•Run the following commands:
.xi:reg lwage educ educ2 exper i.tenurecats
.vif
•educ and educ2 have high multicollinearity as a result of the quadratic
transformation of the variable. This is expected and causes no problem. The
other variables look fine
A. Multicollinearity: Solutions
•What would we do if there were such a problem?
–Eliminate variables
–Collect additional data
–Group relevant variables into sets of variables. Use
principal components analysis or factor analysis
–Use “ridge regression” (see the Mendenhall and Sincich text).