Research Methods II: Lecture 5
I. Variable Screening Procedures
• These techniques are used to objectively determine which independent variables are the most important predictors of the dependent variable.
• Methods:
– Stepwise regression
– All-possible-regressions-selection
Note: theory should be used for variable selection, not a mindless computer!
II. Stepwise Regression
• 1. Identify Y and relevant Xs
• 2. STATA command: sw
– The program fits all possible bivariate regressions and selects the independent variable with the best (largest absolute) t-value
– The program then fits all possible two-independent-variable regressions that contain the first selected variable and selects the variable with the second-best t-test
– The first selected variable is then re-evaluated in the presence of the second. If its t-value is no longer significant, the program looks for an alternative variable with the highest t-value in the presence of the second selected variable
– The program then fits all possible three-independent-variable regressions (the best two-variable model plus the best third independent variable)
– The process continues until no further independent variables can be found that yield significant t-values in the presence of the variables already in the model
• The process results in a model containing only those terms with t-values that are significant at a specified α level.
• Stepwise regression is a non-theoretical variable screening procedure
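The steps above can be sketched in a few lines of Python. This is only an illustration of the logic, not what STATA does internally. Assumptions: the names `invert`, `t_values`, and `stepwise` are made up for this example, a fixed cutoff of |t| > 2.0 stands in for "significant at the α level" (roughly α = .05), a looser cutoff of 1.7 is used for removal, and the data are simulated so that only x1 and x2 truly affect y.

```python
import math
import random

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]

def t_values(columns, y):
    """Regress y on an intercept plus `columns`; return the slope t-values."""
    n = len(y)
    rows = [[1.0] + [c[i] for c in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    sse = sum((yi - sum(v * b for v, b in zip(r, beta))) ** 2
              for r, yi in zip(rows, y))
    mse = sse / (n - k)
    return [beta[a] / math.sqrt(mse * inv[a][a]) for a in range(1, k)]

def stepwise(data, y, t_enter=2.0, t_remove=1.7):
    """Add the candidate with the largest |t|, then re-check earlier choices."""
    selected = []
    for _ in range(2 * len(data)):  # hard cap so the sketch cannot cycle
        best, best_t = None, t_enter
        for name in data:
            if name in selected:
                continue
            # t-value of the candidate in the presence of the selected variables
            t = t_values([data[v] for v in selected + [name]], y)[-1]
            if abs(t) > best_t:
                best, best_t = name, abs(t)
        if best is None:  # no remaining variable is significant: stop
            break
        selected.append(best)
        # Drop earlier variables that are no longer significant.
        t_now = t_values([data[v] for v in selected], y)
        selected = [v for v, t in zip(selected, t_now)
                    if v == best or abs(t) >= t_remove]
    return selected

# Simulated (made-up) data: y depends on x1 and x2 only; x3 is pure noise.
rng = random.Random(0)
n = 40
data = {name: [rng.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [3 + 2 * a - 1.5 * b + rng.gauss(0, 0.5)
     for a, b in zip(data["x1"], data["x2"])]
chosen = stepwise(data, y)
print("selected:", chosen)
```

Note that nothing in the code knows what the variables mean: it keeps whatever passes the t cutoff, which is exactly why the procedure is non-theoretical.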
III. Caution in using stepwise regression (regression fishing)
• There is a high probability that one or more errors have been made in selecting the variables, creating nonsensical results!
– The computer cannot distinguish spurious correlations or make judgments regarding multicollinearity
• Often high-order terms are omitted. You should include not only the main variables, but their transformed forms and interactions.
• Stepwise regression should almost never be used, except in a completely non-theoretical approach to prediction
• Other stepwise regression techniques:
– Forward selection
– Backward elimination
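To see why regression fishing produces selection errors, the sketch below screens predictors that are pure noise. Assumptions: all data are simulated, the function names are made up for this example, and 2.048 is the two-sided .05 critical t-value for 28 degrees of freedom (n = 30 observations, one predictor). Since y is unrelated to every x, any predictor that passes the t-test is a false finding, and about 5% of them should pass by chance alone.

```python
import math
import random

def slope_t(x, y):
    """t statistic for the slope in a simple regression of y on x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)          # sample correlation
    return r * math.sqrt((n - 2) / (1 - r * r))

def count_false_hits(n_obs=30, n_predictors=200, t_crit=2.048, seed=1):
    """Count noise predictors that look 'significant' in a bivariate screen."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_predictors):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]  # unrelated to y
        if abs(slope_t(x, y)) > t_crit:
            hits += 1
    return hits

hits = count_false_hits()
print(f"{hits} of 200 pure-noise predictors pass the t-test")
```

The more candidate variables are screened, the more of these chance "discoveries" a stepwise program will have available to put into the model.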
IV. STATA commands
• From UCLA-STATA website:
• The sw command is used for stepwise regression.
• The pr option is the significance level (p-value) for removing a variable from the model. Used alone, it requests backward elimination.
• The pe option is the significance level (p-value) for entering a variable into the model. Used alone, it requests forward selection.
• sw regress y x1 x2 x3 x4, pr(.05)
• sw regress y x1 x2 x3 x4, pe(.05)
• sw regress y x1 x2 x3 x4, pe(.05) pr(.1)
V. All-possible-regressions selection
• A procedure that considers all possible regression models given the set of potentially important predictors
• R-squared criterion: find a subset model such that adding more variables yields only small increases in R-squared
• Adjusted R-squared or MSE criterion: searches for the model with the minimum MSE (equivalently, the maximum adjusted R-squared)
• Cp criterion: selects as the best model the subset model with a small total mean squared error and a value of Cp near p + 1 (the number of parameters in a model with p predictors), which is an indicator of no bias in the subset regression model
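• PRESS (prediction sum of squares) criterion: the candidate model is fit to the sample data n times, each time omitting one of the data points and obtaining the predicted value for that data point. PRESS is the sum of the squared differences between the observed and predicted values; models with a small PRESS are preferred.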
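The four criteria can be computed side by side for every subset. The sketch below is an illustration under stated assumptions, not a STATA routine: the function names are made up, the data are simulated (y depends on x1 and x2 only), and Cp is computed as SSE_p / MSE_full + 2(p + 1) − n, where MSE_full is the MSE of the model containing all candidate predictors. PRESS is computed exactly as described above, by refitting n times.

```python
import itertools
import random

def solve(a, b):
    """Solve a x = b for a small square system by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit(rows, y):
    """Least-squares coefficients; each row already starts with 1.0."""
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    return solve(xtx, xty)

def predict(row, beta):
    return sum(v * b for v, b in zip(row, beta))

def sse(rows, y):
    beta = fit(rows, y)
    return sum((yi - predict(r, beta)) ** 2 for r, yi in zip(rows, y))

def press(rows, y):
    """Refit n times, each time leaving one point out and predicting it."""
    total = 0.0
    for i in range(len(y)):
        beta = fit(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - predict(rows[i], beta)) ** 2
    return total

def all_subsets(data, y):
    """Return R-squared, adjusted R-squared, MSE, Cp and PRESS per subset."""
    n, names = len(y), list(data)

    def design(subset):
        return [[1.0] + [data[v][i] for v in subset] for i in range(n)]

    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    mse_full = sse(design(names), y) / (n - len(names) - 1)
    results = []
    for size in range(1, len(names) + 1):
        for subset in itertools.combinations(names, size):
            rows, p = design(subset), len(subset)
            s = sse(rows, y)
            results.append({
                "vars": subset,
                "r2": 1 - s / sst,
                "adj_r2": 1 - (s / (n - p - 1)) / (sst / (n - 1)),
                "mse": s / (n - p - 1),
                "cp": s / mse_full + 2 * (p + 1) - n,
                "press": press(rows, y),
            })
    return results

# Simulated (made-up) data: y depends on x1 and x2 only; x3 is pure noise.
rng = random.Random(0)
n = 40
data = {name: [rng.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [3 + 2 * a - 1.5 * b + rng.gauss(0, 0.5)
     for a, b in zip(data["x1"], data["x2"])]
results = all_subsets(data, y)
for r in sorted(results, key=lambda r: r["press"]):
    print(r["vars"], round(r["adj_r2"], 3), round(r["cp"], 2), round(r["press"], 2))
```

Two things are worth checking in the output: R-squared never decreases as variables are added, so it cannot pick a model by itself, and the full model always has Cp equal to its number of parameters, so Cp is only informative for the smaller subsets.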