RESEARCH METHODS II
LECTURE 1
Correlation and Regression:
Review
•What are the basic differences between correlation
& regression?
•What vulnerabilities do correlation & regression
share in common?
•What are the conceptual challenges regarding
causality?
Simple linear regression
•It is a statistical method for examining how an
outcome variable y depends on a single explanatory variable x.
General Form of Probabilistic Model in Regression
$y = E(y) + \text{error}$
$E(y) = a + b_1 X_1$
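A minimal sketch of this decomposition, assuming made-up values for the intercept a and slope b1 (they are illustrative, not estimates from any data in this lecture). The observed y splits into the deterministic part E(y) and whatever is left over as error.

```python
# Hypothetical intercept and slope (illustrative values only)
a, b1 = 2.0, 0.5

def expected_y(x1):
    """Deterministic part of the model: E(y) = a + b1*x1."""
    return a + b1 * x1

x1, y_observed = 10.0, 8.0      # one made-up observation
e_y = expected_y(x1)            # deterministic component
error = y_observed - e_y        # what the line does not account for
print(e_y, error)
```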
Multiple linear regression
•Linear regression with more than one explanatory
variable makes it possible to:
–Combine many explanatory variables for optimal
understanding &/or prediction;
–Examine the unique contribution of each explanatory
variable, holding the levels of the other variables constant.
•Hence, multiple regression enables us to perform, in a
setting of observational research, a rough approximation to experimental
analysis.
General Form of Probabilistic Model in Regression
$y = E(y) + \text{error}$
$E(y) = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \dots + b_k X_k$
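What "holding the levels of the other variables constant" means can be sketched with the general form above. The coefficients and x values here are hypothetical, chosen only for illustration: raising X1 by one unit while X2 and X3 stay fixed changes E(y) by exactly b1.

```python
def expected_y(a, b, x):
    """E(y) = a + b1*x1 + ... + bk*xk for coefficient list b and values x."""
    return a + sum(bj * xj for bj, xj in zip(b, x))

# Hypothetical intercept and coefficients (illustrative only)
a = 1.0
b = [2.0, -0.5, 3.0]

base = expected_y(a, b, [4.0, 2.0, 1.0])
bumped = expected_y(a, b, [5.0, 2.0, 1.0])   # x1 raised by one unit; x2, x3 held constant
print(bumped - base)                          # the difference equals b1
```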
Example
•So, we are analyzing the relationship of the per
capita earnings of households to their numbers of members & their members’ ages, years of education, race-ethnicity, gender
& employment statuses:
•What is the independent effect of years of education
on per capita household earnings, holding the other variables constant?
Caution!: Nonlinearity and
Causality
•Such a statistical finding raises questions: e.g., is
a year of college equivalent to a year of graduate school with regard to
household earnings?
•We’ll see that multiple regression
can accommodate nonlinear as well as linear y/x relationships.
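One common way to let a regression capture a nonlinear y/x relationship is to add a squared term. That choice is an assumption of this sketch, as are the coefficients, which are invented for illustration. The model stays linear in its coefficients, yet the effect of one more year of education now depends on how much education a person already has.

```python
def expected_earnings(educ, a=10.0, b1=1.0, b2=0.25):
    """E(y) = a + b1*educ + b2*educ**2, with hypothetical coefficients."""
    return a + b1 * educ + b2 * educ ** 2

# Effect of one more year at two different starting levels of education
gain_at_12 = expected_earnings(13) - expected_earnings(12)
gain_at_16 = expected_earnings(17) - expected_earnings(16)
print(gain_at_12, gain_at_16)
```

In this made-up model the extra year is worth more at 16 years of education than at 12, which a single slope could not express.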
Basic Statistics Review
•What is a variable?
•What are the basic kinds of variables?
•How do we describe them in, first, univariate terms, & second, bivariate
terms?
•Why do we need to describe them both graphically
& numerically?
Equivalent terms for variables
•Y: dependent, outcome, response, criterion, left-hand
side
•X: independent, explanatory, predictor, regressor, control, right-hand side
Basic Statistics Review
•What’s the fundamental problem with the mean as a
measure of central tendency & standard deviation as a measure of spread?
•When should we use them? Can we use any other
alternatives?
•Despite their problems, why are the mean &
standard deviation used so commonly?
•What’s a normal distribution? What statistics
describe a normal distribution? Why is it important?
•What does it mean to standardize a variable, &
how is it done?
•Are all symmetric distributions normal?
•What’s a population? A sample?
•What’s a parameter? A statistic?
•What are the basic probability problems of samples,
& how most basically do we try to mitigate them?
•Why is a sample mean typically used to estimate a
parameter? What’s an expected value?
•What’s sampling variability? A sampling distribution?
A population distribution?
•What’s the sampling distribution of a sample mean?
The law of large numbers? The central limit theorem?
•Why’s the central limit theorem crucial to
inferential statistics?
•What’s the difference between a standard deviation
& a standard error? How do their formulas differ?
•What’s the difference between the z- &
t-distributions? Why do we typically use the latter?
•What’s a hypothesis test? What’s its purpose? Its
premise & general formula? How is it stated? What’s its interpretation?
•What are the typical standards for judging
statistical significance? To what extent are they defensible or not?
•What’s the difference between statistical &
practical significance?
•What are the possible reasons for a finding of
statistical insignificance?
•What are Type I & Type II errors?
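Two of the review questions above (how to standardize a variable, and how a standard deviation differs from a standard error) can be sketched with Python's standard library. The sample is made up for illustration. The standard deviation describes the spread of the observations, while the standard error of the mean divides that spread by the square root of n.

```python
import math
import statistics

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up data
n = len(sample)

mean = statistics.mean(sample)
sd = statistics.stdev(sample)        # sample standard deviation (divides by n - 1)
se = sd / math.sqrt(n)               # standard error of the sample mean

# Standardizing: subtract the mean, divide by the standard deviation
z_scores = [(v - mean) / sd for v in sample]

print(round(sd, 3), round(se, 3))
```

After standardizing, the z-scores have mean 0 and standard deviation 1, whatever the original units were.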
Principles concerning
statistics & social/policy research
•(1) Anecdotal versus systematic evidence (including the
importance of theories in guiding research).
•(2) Social construction of reality.
•(3) Experimental versus observational evidence.
•(4) Beware of lurking variables.
•(5) Variability is everywhere.
•(6) All conclusions are uncertain.
Linear Regression
•I want to predict all the values of y based on a
random sample of size n. I can use the following equation:
$y = E(y) + \text{error}$
•But since I do not know the value of the error term
for each observation, my best prediction, absent any explanatory variables, is the mean of y.
•However, a more accurate model—& thus more precise
predictions—can be obtained by using explanatory variables to estimate each
value of y.
•Here we see a major advantage of regression versus
correlation: regression permits y/x directionality (including multiple
explanatory variables).
•In addition, regression coefficients are expressed in
the units in which the variables are measured.
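The gain from using an explanatory variable can be sketched numerically. The data and the line E(y) = 2.2 + 0.6x below are made up for illustration and do not come from the lecture. The sketch compares the sum of squared prediction errors when every y is predicted by the mean of y with the same sum when x is used.

```python
# Made-up data for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

y_bar = sum(y) / len(y)                     # best prediction with no x: the mean of y

# Hypothetical regression line assumed for this sketch: E(y) = 2.2 + 0.6*x
a, b1 = 2.2, 0.6
y_hat = [a + b1 * xi for xi in x]

sse_mean = sum((yi - y_bar) ** 2 for yi in y)                # squared errors around the mean
sse_line = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # squared errors around the line

print(round(sse_mean, 2), round(sse_line, 2))
```

Note that the slope 0.6 is read in the units of the variables: 0.6 units of y per one unit of x.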
Objective: To build useful models
Definition
•“A model is a simplification of, and
approximation to, some aspect of the world. Models are never literally ‘true’ or ‘false,’ although good
models abstract only the ‘right’ features of the
reality they represent” (King et al.).
MODELING
•We’ll focus, then, on modeling: trying to
describe how sets of explanatory variables (the x’s) are
related to outcome variable y.
•Integral to this focus will be an emphasis on the
interconnections of theory & empirical research (including questions of
causality).
•Use graphs to check distributions & outliers
•The univariate
distributions of the variables for regression analysis need not be normal!
•But the usual caveats concerning extreme outliers
must be heeded.
•It’s not the univariate
graphs but the y/x bivariate scatterplots
that provide the key evidence on these concerns.
Multiple Linear Regression
•The characteristics of bivariate
scatterplots & correlations do not necessarily
predict whether explanatory variables will be significant or not in a multiple
regression model.
•Moreover, bivariate
relationships don’t necessarily indicate whether a Y/X relationship will be
positive or negative within a multivariate framework.
•This is because multiple regression
expresses the joint, linear effects of a set of explanatory variables on
an outcome variable.
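A sign reversal of this kind can be sketched with constructed data. Everything here is made up for illustration: y is built as exactly 2*x1 - 3*x2, and x1 and x2 move together. Regressing y on x1 alone gives a negative slope, while the two-variable fit recovers the positive coefficient on x1. The two-variable fit uses the standard normal equations on mean-centered data.

```python
def centered(v):
    """Deviations of each value from the mean of the list."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Constructed data: x2 tracks x1 closely, and y = 2*x1 - 3*x2 exactly
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.0, 3.0, 3.0, 5.0, 5.0, 7.0]
y = [2 * u - 3 * v for u, v in zip(x1, x2)]

d1, d2, dy = centered(x1), centered(x2), centered(y)

# Bivariate slope of y on x1 alone
slope_bivariate = dot(d1, dy) / dot(d1, d1)

# Two-variable least squares slopes from the centered normal equations
s11, s22, s12 = dot(d1, d1), dot(d2, d2), dot(d1, d2)
s1y, s2y = dot(d1, dy), dot(d2, dy)
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

print(round(slope_bivariate, 3), round(b1, 3), round(b2, 3))
```

The bivariate slope is negative only because x1 carries the effect of the omitted x2 along with it.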
Simple Regression Model
•Could you describe the components of the equation and
their meaning?
$y = a + b_1 X_1 + \text{error}$

STATA output
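. reg math read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  154.70
       Model |  7660.75905     1  7660.75905           Prob > F      =  0.0000
    Residual |  9805.03595   198  49.5203836           R-squared     =  0.4386
-------------+------------------------------           Adj R-squared =  0.4358
       Total |   17465.795   199  87.7678141           Root MSE      =  7.0371

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .6051473   .0486538    12.44   0.000      .509201    .7010935
       _cons |   21.03816    2.58945     8.12   0.000     15.93172     26.1446
------------------------------------------------------------------------------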
Interpretation? Causality?
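As an aid to interpretation, the pieces of the output above fit together arithmetically. The sketch below copies the reported numbers and recomputes R-squared, Root MSE, F and the t statistic for read from them. The reading score of 50 used for the prediction is a hypothetical value chosen for illustration.

```python
import math

# Values copied from the Stata output
ss_model, ss_resid, ss_total = 7660.75905, 9805.03595, 17465.795
df_model, df_resid = 1, 198
b_read, se_read = 0.6051473, 0.0486538
b_cons = 21.03816

r_squared = ss_model / ss_total             # share of the variation in math accounted for by read
ms_resid = ss_resid / df_resid              # residual mean square
root_mse = math.sqrt(ms_resid)              # typical size of a residual, in math-score points
f_stat = (ss_model / df_model) / ms_resid   # overall F statistic
t_read = b_read / se_read                   # t statistic for the slope

# Predicted math score for a hypothetical reading score of 50
predicted_math = b_cons + b_read * 50

print(round(r_squared, 4), round(root_mse, 4), round(f_stat, 2), round(t_read, 2))
```

Read descriptively, each additional point on read is associated with a predicted math score about 0.61 points higher, and read accounts for roughly 44% of the variation in math. Whether that association is causal is a separate question that the output alone cannot answer.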
Why is a regression model probabilistic rather
than deterministic?
•Because the model is estimated from sample data &
thus will include some variation due to random phenomena that can’t be
modeled or explained.
•That is, the random error component represents
all unexplained variation in outcome variable y caused by important but
omitted variables or by unexplainable random phenomena.
Random error
•The model’s random error component consists of
deviations between the observed & predicted values of y. These are
the residuals (which, to repeat, are estimates of the model’s error
component for each value of y).
•Each residual is the observed value of y minus the predicted value
of y:
$\text{residual}_i = y_i - \hat{y}_i$

Ordinary Least Squares Line
•The least squares line, or regression line,
has two properties:
•1) The sum of the errors (i.e. deviations or residuals)
is zero: SE = 0.
•2) The sum of the squared errors, SSE, is smaller than
for any other straight line model with SE = 0.
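Both properties can be checked numerically. This sketch uses made-up data and the standard closed-form least squares estimates for one explanatory variable. The helper names `ols_fit`, `residuals` and `sse` are introduced here for illustration. The competing lines pass through the point of means, so they also have SE = 0, yet their SSE is larger.

```python
def ols_fit(x, y):
    """Closed-form least squares intercept and slope for one explanatory variable."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    return y_bar - b1 * x_bar, b1

def residuals(a, b1, x, y):
    """Observed y minus predicted y for each observation."""
    return [yi - (a + b1 * xi) for xi, yi in zip(x, y)]

def sse(a, b1, x, y):
    """Sum of squared errors for the line a + b1*x."""
    return sum(e ** 2 for e in residuals(a, b1, x, y))

# Made-up data for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b1 = ols_fit(x, y)
se_ols = sum(residuals(a, b1, x, y))     # property 1: should be (numerically) zero
sse_ols = sse(a, b1, x, y)               # property 2: should be the smallest SSE

# Competing lines through (x_bar, y_bar) with other slopes; these also have SE = 0
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
sse_alternatives = [sse(y_bar - s * x_bar, s, x, y) for s in (0.5, 0.7)]

print(round(se_ols, 10), round(sse_ols, 4), [round(v, 4) for v in sse_alternatives])
```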
OLS
•The regression line is called the least squares
line because it minimizes the squared distance between the equation’s y-predictions
& the data’s y observations (i.e. it minimizes the sum of squared
errors, SSE).
•The better the model fits the data, the smaller the
distance between the y-predictions & the y-observations.
Coefficients and SSE
•Here are the values of the regression model’s
estimated beta (i.e. slope or regression) coefficient & y-intercept
(i.e. constant) that minimize SSE:
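For simple linear regression, the standard closed-form least squares estimates, written in this lecture's a and b1 notation (the labels SS_xy and SS_xx are introduced here for convenience), are:

$$
b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
a = \bar{y} - b_1 \bar{x}
$$

As a worked example with made-up data x = 1, 2, 3, 4, 5 and y = 2, 4, 5, 4, 5: the means are 3 and 4, SS_xy = 6 and SS_xx = 10, so b1 = 6/10 = 0.6 and a = 4 - 0.6(3) = 2.2.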
