Research Methods II  Lecture 2

Multiple Regression

 

I. What is Multiple Regression?

•Multiple regression is a statistical method for studying the relationship between a single dependent variable and two or more independent variables.

 

•A multiple regression model expresses the joint, linear effects of a set of (two or more) explanatory variables on an outcome variable.

 

II. What Are the Two Objectives of Multiple Regression?

•1.  Prediction

–Future trends of a given variable based on its past performance

–Likelihood that an event would happen according to certain individual traits

–Multiple regression allows us to combine many variables to produce optimal predictions of the dependent variable

 

•2.  Causal Analysis

–Independent variables have effects on the dependent variable. We assess whether those effects are real and how large they are

–Multiple regression separates the effects of each independent variable on the dependent variable

 

III. What is Ordinary Least Squares Multiple Linear Regression?

 

Ordinary: the simplest form of least squares estimation

Least Squares: the estimated coefficients are chosen to make the sum of the squared prediction errors as small as possible

Multiple regression: has two or more independent variables

Linear: the effect of each independent variable on the dependent variable is constant across the range of that variable, so the relationship is a straight line
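The least squares idea can be illustrated with a minimal sketch in Python with NumPy. The data values and the names x1, x2 and y are made up for illustration and do not come from the lecture. The sketch only shows that the fitted coefficients give a smaller sum of squared errors than another set of coefficients.

```python
import numpy as np

# Hypothetical data: 6 cases, two independent variables (made up for illustration).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([8.1, 8.9, 18.2, 18.8, 28.1, 28.9])

# Design matrix: a column of ones for the intercept, then x1 and x2.
X = np.column_stack([np.ones_like(x1), x1, x2])

def sum_squared_errors(coefs):
    """Sum of squared prediction errors for a given coefficient vector."""
    residuals = y - X @ coefs
    return float(residuals @ residuals)

# Least squares estimates of (intercept, b1, b2).
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# An arbitrary alternative set of coefficients, for comparison.
b_other = b_hat + np.array([0.5, -0.2, 0.1])

print("OLS coefficients:", b_hat)
print("SSE at OLS estimates:", sum_squared_errors(b_hat))
print("SSE at other coefficients:", sum_squared_errors(b_other))
```

Trying other values for b_other should never produce a smaller sum of squared errors than the one at b_hat.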

 


 

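IV. General Equation for Multiple Regression:

y = a + b1(x1) + b2(x2) + ... + bk(xk) + e, where y is the dependent variable, x1 to xk are the independent variables, a is the intercept, b1 to bk are the regression coefficients, and e is the error term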

 

V. What to do when the relationship is not linear?

•An example is income vs. age: income increases with age at younger ages, but the relationship changes once an age threshold has been reached. After a certain age, income decreases with age.

–Curvilinear modeling of age:  Income = a + b(education) + c(age) + d(age²) + e
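A hedged sketch of the curvilinear model in Python with NumPy follows. The data are simulated, and the coefficient values used to generate them are arbitrary assumptions chosen for illustration. Entering age squared as an extra column keeps the model linear in its coefficients, so ordinary least squares still applies.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated (hypothetical) data; the generating coefficients are arbitrary.
age = rng.uniform(20, 70, size=n)
education = rng.uniform(8, 20, size=n)
noise = rng.normal(0, 2, size=n)
income = 5 + 1.5 * education + 2.0 * age - 0.02 * age**2 + noise

# Columns: intercept, education, age, age squared.
X = np.column_stack([np.ones(n), education, age, age**2])
(a, b, c, d), *_ = np.linalg.lstsq(X, income, rcond=None)

# With d < 0 the fitted curve rises, peaks, and then declines.
peak_age = -c / (2 * d)
print(f"a={a:.2f}, b={b:.2f}, c={c:.2f}, d={d:.3f}, peak age={peak_age:.1f}")
```

A negative estimate for d is what produces the threshold: income is predicted to rise with age up to peak_age and to fall afterwards.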

 

VI. What are the Data Requirements for Multiple Regression?

•Define units of analysis, e.g. households, individuals, organizations, countries. . .

•Check that variables are properly coded and that no major anomalies are present

•Required N: The regression can be calculated only if there are at least as many cases as independent variables plus one (one case per coefficient, including the intercept). For inferential purposes you need many more cases.

•Data need to come from censuses or from probability samples from well-defined populations (for inferential purposes)

•Quantitative variables must be measured on a well-defined interval scale. An increase of a given amount means the same thing across the whole range of the variable

•Variables measured on ordinal or nominal scales cannot be entered directly into a linear regression. They need to be coded as indicator variables to be included in the model. (For a variable with k categories, k-1 indicators need to be created and included in the regression.)
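As a minimal sketch of the k-1 indicator coding described above, the following Python function turns one nominal variable into indicator columns. The function name make_indicators, the example variable region, and the choice of the first sorted category as the reference are all assumptions for illustration.

```python
def make_indicators(values, reference=None):
    """Code a nominal variable with k categories as k-1 indicator (0/1) columns.

    The reference category is omitted; cases in it are 0 on every indicator.
    """
    categories = sorted(set(values))
    if reference is None:
        reference = categories[0]  # assumption: first category is the baseline
    kept = [c for c in categories if c != reference]
    return {c: [1 if v == c else 0 for v in values] for c in kept}

# Hypothetical nominal variable with k = 3 categories.
region = ["north", "south", "west", "south", "north"]
indicators = make_indicators(region)

for name, column in indicators.items():
    print(name, column)
```

Only two columns are produced for the three categories. The coefficient on each indicator is then read as the difference from the omitted reference category.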

 

VII. How good is our model?

 

•Least squares regression produces the BEST linear predictions for a given data set, that is, the predictions with the least variance or greatest precision, if all assumptions are met

•Still, we need to know whether the model is correct, or whether including other variables or transforming a variable would result in a better model

 

VIII. What is the Coefficient of Determination or R²?

 

•R² is 1 minus the ratio of the first quantity below to the second:

 

–The sum of squared errors produced by the least squares equation

–The sum of squared errors for a least squares equation with no independent variables (just the intercept, which is equivalent to the mean of the dependent variable)

 

•What is the Interpretation of R²?

–Using x and w to predict y reduces the sum of squared prediction errors by (R² × 100)%, compared to prediction based only on the mean of y
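The definition can be followed step by step in a short Python sketch with NumPy. The data for x, w and y are made up for illustration. The two sums of squared errors are computed separately and then combined.

```python
import numpy as np

# Hypothetical data: predict y from x and w (values made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
w = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
y = np.array([5.2, 4.1, 8.3, 6.0, 10.4, 14.8, 9.1, 13.5])

X = np.column_stack([np.ones_like(x), x, w])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sum of squared errors from the least squares equation.
sse_model = float(np.sum((y - X @ coefs) ** 2))

# Sum of squared errors when every case is predicted with the mean of y
# (the intercept-only model).
sse_mean = float(np.sum((y - y.mean()) ** 2))

r_squared = 1 - sse_model / sse_mean
print(f"R^2 = {r_squared:.3f}")
```

Because the model includes an intercept, sse_model can never exceed sse_mean, so the result stays between 0 and 1.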

 

 

IX. What are Three Possible Sources of Error?

•1.  Measurement error: The accuracy of the data is not perfect

•2.  Sampling error: The data will never be exactly like the population

•3.  Uncontrolled variation: Other variables not in the analysis may disturb the relationship between each independent variable and the dependent variable

 

X. What Assumptions are made about the error term?

•We assume that errors occur in random and unsystematic fashion

 

XI.  How do We Assess Our Inferences?

•With Confidence Intervals: they give us a range of plausible values for the population coefficient, based on sample data

–We cannot be certain that the population value falls within the range, but intervals constructed this way will contain it about 95 times out of 100 (for a 95% confidence level).

•With Hypothesis Tests: They are used to answer the question of whether or not the population coefficient is zero

–Does this particular variable really affect the dependent variable?

 

XII.  How are Standard Errors used in Assessing Inferences?

•STATA will estimate a standard error for each regression coefficient

•Using the SE we construct confidence intervals and perform hypothesis tests

•T-statistic = coefficient / SE

•The p-value needs to be less than the alpha level for a given regression coefficient to be considered statistically significantly different from zero
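Stata reports these quantities automatically. The Python sketch below, using NumPy on simulated data, only illustrates how a standard error, a t-statistic and an approximate 95% confidence interval relate to each other. The data, the variable names, and the use of 1.96 as the critical value (a large-sample approximation, not the exact t value) are assumptions for illustration. No p-value is computed; comparing |t| with the critical value is the equivalent check.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated (hypothetical) data: x1 affects y, x2 does not.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

# Estimated error variance: SSE divided by (cases minus estimated coefficients).
residuals = y - X @ coefs
df = n - X.shape[1]
sigma2 = float(residuals @ residuals) / df

# Standard errors: square roots of the diagonal of sigma^2 * (X'X)^-1.
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# T-statistic = coefficient / SE.
t_stats = coefs / se

# Approximate 95% interval using 1.96 (normal critical value; reasonable
# here only because df is large).
ci_low = coefs - 1.96 * se
ci_high = coefs + 1.96 * se

for name, b, s, t, lo, hi in zip(["const", "x1", "x2"], coefs, se, t_stats, ci_low, ci_high):
    print(f"{name}: coef={b:.3f} se={s:.3f} t={t:.2f} 95% CI=({lo:.3f}, {hi:.3f})")
```

A coefficient whose interval excludes zero has |t| above 1.96, so the two checks agree.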

 

XIII.  Is Multiple Regression Effective in Controlling for Variables?

•Conservative view: only a randomized experiment can really control for extraneous variables

•In experiments: The conditions are equal for all individuals (in the experimental group and in the control group)

–People are randomly assigned to the two groups, so there are no biases or notable differences between the groups. The two groups will, on average, have the same characteristics

•In observational studies: There is no experimental group and no control group.  Instead, we rely on statistical, simultaneous control of many variables.
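Statistical control can be illustrated with a small simulation in Python with NumPy. Everything here is hypothetical: the variable z stands for an extraneous variable that influences both x and y, and the true effect of x on y is set to 2 by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated (hypothetical) observational data: z influences both x and y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 2.0 * x + 3.0 * z + rng.normal(size=n)  # true effect of x on y is 2

def ols(columns, outcome):
    """Least squares coefficients for an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(outcome))] + list(columns))
    coefs, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coefs

b_without_control = ols([x], y)[1]   # z omitted from the regression
b_with_control = ols([x, z], y)[1]   # z included in the regression

print(f"effect of x, z omitted:  {b_without_control:.2f}")
print(f"effect of x, z included: {b_with_control:.2f}")
```

The estimate is close to the true effect only when z is included. A variable that was never measured cannot be controlled this way, which is the concern behind the conservative view.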