Research Methods II Lecture 2
Multiple Regression
I. What is Multiple Regression?
•Multiple regression is a statistical method for studying the relationship
between a single dependent variable and two or more independent variables
•A multiple regression model expresses the joint, linear effects of a set of
(two or more) explanatory variables on an outcome variable.
II. What Are the Two Objectives of Multiple Regression?
•1. Prediction
–Future trends of a given variable based on its past performance
–Likelihood that an event will happen, given certain individual traits
–Multiple regression allows us to combine many variables to produce optimal
predictions of the dependent variable
•2. Causal Analysis
–Independent variables have effects on the dependent variable. We assess
whether the effects are real and how large they are
–Multiple regression separates the effect of each independent variable on the
dependent variable from the effects of the others
III. What is Ordinary Least Squares Multiple Linear Regression?
•Ordinary: the simplest least squares method
•Least Squares: the estimated coefficients minimize the sum of the squared
error terms, so the sum of the squared prediction errors is as small as possible
•Multiple regression: has two or more independent variables
•Linear: the relationship between each independent variable and the dependent
variable is linear, i.e. the effect of each independent variable is constant
across its range
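A minimal sketch of the least squares idea in Python with NumPy (the lecture itself uses STATA; Python is used here only for illustration). The data are synthetic, and the coefficient values 1, 2 and -3 are arbitrary assumptions. The fitted coefficients should give a smaller sum of squared errors than a nearby alternative set of coefficients.

```python
import numpy as np

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
e = rng.normal(scale=0.5, size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + e  # arbitrary "true" coefficients

# Design matrix: a column of ones for the intercept, then the predictors.
X = np.column_stack([np.ones(n), x1, x2])

# Least squares: coefficients that minimize the sum of squared errors.
coef = np.linalg.lstsq(X, y, rcond=None)[0]


def sse(b):
    """Sum of squared prediction errors for coefficient vector b."""
    resid = y - X @ b
    return float(resid @ resid)


best = sse(coef)
# Shifting the intercept away from the least squares solution raises the SSE.
worse = sse(coef + np.array([0.1, 0.0, 0.0]))
print(coef, best, worse)
```

Any other perturbation of the coefficients would raise the sum of squared errors in the same way, which is what "least squares" means.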
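IV. General Equation for Multiple Regression:
y = a + b1(x1) + b2(x2) + ... + bk(xk) + e
where y is the dependent variable, a is the intercept, x1 to xk are the
independent variables, b1 to bk are their coefficients, and e is the error term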
V. What to do when the relationship is not linear?
•An example is income vs. age: income increases with age at younger ages, but
the relationship changes once an age threshold has been reached. After a
certain age, income decreases with age.
–Curvilinear modeling of age: Income = a + b(education) + c(age) + d(age^2) + e
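The curvilinear model can be sketched by adding a squared age term as one more column in the regression. This is an illustrative Python/NumPy example on synthetic data; the coefficient values and the peak at age 50 are assumptions chosen for the demonstration, not results from real income data.

```python
import numpy as np

# Synthetic data: income rises with age, then falls (peak set at age 50).
rng = np.random.default_rng(1)
n = 500
age = rng.uniform(20, 70, size=n)
education = rng.uniform(8, 20, size=n)
noise = rng.normal(scale=2.0, size=n)
income = 5.0 + 2.0 * education + 3.0 * age - 0.03 * age**2 + noise

# Income = a + b(education) + c(age) + d(age squared) + e
X = np.column_stack([np.ones(n), education, age, age**2])
a, b, c, d = np.linalg.lstsq(X, income, rcond=None)[0]

# With d < 0 the fitted curve is an inverted U; it peaks where c + 2*d*age = 0.
turning_point = -c / (2 * d)
print(round(turning_point, 1))
```

A negative coefficient on the squared term is what captures the decline in income after the threshold age.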
VI. What are the Data Requirements for Multiple Regression?
•Define the units of analysis, e.g. households, individuals, organizations,
countries
•Check that variables are properly coded and that no major anomalies are present
•Required N: at least as many cases as variables plus one for the regression to
be calculated. For inferential purposes you need many more cases.
•Data need to come from censuses or from probability samples of well-defined
populations (for inferential purposes)
•Quantitative variables must be measured on a well-defined interval scale: an
increase of a specific amount means the same in all ranges of the variable
•Variables measured on ordinal or nominal scales cannot be entered directly into
a linear regression. They need to be coded as indicator variables to be
included in the model. (For a variable with k categories, k-1 indicators need
to be created and included in the regression)
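The k-1 indicator coding can be sketched in a few lines of plain Python. The function name `indicator_code` and the `region` variable are hypothetical, introduced only for this illustration; the omitted category serves as the reference group.

```python
def indicator_code(values, reference=None):
    """Return (names, rows): k-1 indicator columns for a categorical variable.

    The reference category gets no column; its cases are all zeros.
    """
    categories = sorted(set(values))
    if reference is None:
        reference = categories[0]
    kept = [c for c in categories if c != reference]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


# A nominal variable with k = 3 categories yields k - 1 = 2 indicators.
region = ["north", "south", "west", "south", "north"]
names, rows = indicator_code(region, reference="north")
print(names)
print(rows)
```

Each indicator coefficient is then read as the difference between that category and the reference category, holding the other variables constant.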
VII. How good is our model?
•Least squares regression produces the BEST linear predictions for a given data
set (that is, the least variance or greatest precision) if all assumptions are
met
•Still, we need to know whether the model is correct, or whether including
other variables or transforming a variable would result in a better model
VIII. What is the Coefficient of Determination or R^2?
•R^2 is 1 minus the ratio of:
–The sum of squared errors produced by the least squares equation, to
–The sum of squared errors for a least squares equation with no independent
variables (just the intercept, which is equivalent to the mean of the dependent
variable)
•What is the interpretation of R^2?
–Using x and w to predict y yields a reduction of 100 x R^2 percent in the sum
of squared prediction errors, compared to prediction based only on the mean of y
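The definition of the coefficient of determination can be computed directly as 1 minus the ratio of the two sums of squared errors. This is a sketch in Python/NumPy on synthetic data; the variables x, w and y and their coefficients are assumptions made for the example.

```python
import numpy as np

# Synthetic data: y depends on x and w plus random error.
rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
w = rng.normal(size=n)
y = 0.5 + 1.5 * x - 1.0 * w + rng.normal(size=n)

X = np.column_stack([np.ones(n), x, w])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ coef

# Squared errors from the least squares equation.
sse_model = np.sum((y - fitted) ** 2)
# Squared errors from the intercept-only model, i.e. predicting the mean of y.
sse_mean = np.sum((y - y.mean()) ** 2)

r2 = 1 - sse_model / sse_mean
print(round(r2, 3))
```

For a model with an intercept, this value also equals the squared correlation between the fitted values and y.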

IX. What are Three Possible Sources of Error?
•1. Measurement error: The accuracy of the
data is not perfect
•2. Sampling error: The data will never be exactly like the
population
•3. Uncontrolled variation: Other variables not in the analysis may
disturb the relationship between each independent variable and the dependent
variable
X. What Assumptions are made about the error term?
•We assume that errors occur in a random and unsystematic fashion
XI. How do We Assess Our Inferences?
•With Confidence Intervals: they give us a range of plausible values for the
population coefficient, based on sample data
–We are not certain that the population value falls in the range, but we are
confident that intervals built this way will contain it a given number of times
(95) out of 100.
•With Hypothesis Tests: they are used to answer the question of whether or not
the population coefficient is zero
–Does this particular variable really affect the dependent variable?
XII. How are Standard Errors used in Assessing Inferences?
•STATA will estimate a standard error (SE) for each regression coefficient
•Using the SE we construct confidence intervals and perform hypothesis tests
•T-statistic = coefficient / SE
•The p-value needs to be less than the alpha level for a given regression
coefficient to be considered statistically significantly different from zero
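The steps above can be sketched in Python with NumPy (used here for illustration in place of STATA). The data are synthetic. As a simplifying assumption, the confidence intervals and p-values use the large-sample normal approximation (critical value 1.96) rather than the exact t distribution that statistical packages normally report.

```python
import math
import numpy as np

# Synthetic data: x1 has a true effect, x2 has none.
rng = np.random.default_rng(3)
n = 400
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.8 * x1 + 0.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
coef = np.linalg.lstsq(X, y, rcond=None)[0]

# Estimated error variance: SSE divided by the residual degrees of freedom.
resid = y - X @ coef
dof = n - X.shape[1]
sigma2 = (resid @ resid) / dof

# Standard errors: square roots of the diagonal of sigma2 * (X'X)^-1.
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# T-statistic = coefficient / SE
t_stat = coef / se

# Approximate 95% confidence intervals (normal approximation).
ci_low = coef - 1.96 * se
ci_high = coef + 1.96 * se

# Approximate two-sided p-values (normal approximation).
p_approx = [math.erfc(abs(t) / math.sqrt(2)) for t in t_stat]

for name, b, s, t, p in zip(["const", "x1", "x2"], coef, se, t_stat, p_approx):
    print(f"{name}: coef={b:.3f} SE={s:.3f} t={t:.2f} p~{p:.4f}")
```

A coefficient is called statistically significant at alpha = 0.05 when its p-value falls below 0.05, which here corresponds to a t-statistic larger than about 1.96 in absolute value.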
XIII. Is Multiple Regression Effective in Controlling for Variables?
•Conservative view: only a randomized experiment can really control for
extraneous variables
•In experiments:
–The conditions are equal for all individuals (in the experimental group and in
the control group)
–People are randomly assigned to two groups, so there are no biases or notable
differences between the groups. The two groups will, on average, have the same
characteristics
•Observational studies do not have an experimental group and a control group.
Instead, we have statistical, simultaneous control of many variables.
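A small simulation can illustrate what statistical control does and where it stops. In this Python/NumPy sketch the data are synthetic: z is a hypothetical extraneous variable that affects both x and y, and the true effect of x on y is set to 1.0 by assumption. The helper `ols` is introduced only for this example.

```python
import numpy as np

# Synthetic data: z influences both x and y (a confounder).
rng = np.random.default_rng(4)
n = 2000
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 1.0 * x + 2.0 * z + rng.normal(size=n)  # true effect of x is 1.0


def ols(columns, y):
    """Least squares coefficients, with an intercept added as the first column."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Without z in the model, the coefficient on x also absorbs part of z's effect.
b_simple = ols([x], y)[1]
# With z included, the coefficient on x is close to the true value.
b_controlled = ols([x, z], y)[1]

print(round(b_simple, 2), round(b_controlled, 2))
```

The control works here only because z was measured and included in the model. An extraneous variable that is left out stays uncontrolled, which is the concern behind the conservative view.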