RESEARCH METHODS II

LECTURE 1

 

Correlation and Regression: Review

 

•What are the basic differences between correlation & regression?

•What vulnerabilities do correlation & regression share in common?

•What are the conceptual challenges regarding causality?


 

Simple linear regression

 

•It is a statistical method for examining how an outcome variable y depends on a single explanatory variable x.


General Form of Probabilistic Model in Regression

 

y = E(y) + error

 

E(y) = a + b1X1

 

Multiple linear regression

 

•Linear regression with more than one explanatory variable makes it possible to:

 

–Combine many explanatory variables for optimal understanding &/or prediction;

–Examine the unique contribution of each explanatory variable, holding the levels of the other variables constant.

 

 •Hence, multiple regression enables us to perform, in a setting of observational research, a rough approximation to experimental analysis.

 

General Form of Probabilistic Model in Regression

 

y = E(y) + error

 

E(y) = a + b1X1 + b2X2 + b3X3 + … + bkXk
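The "holding the other variables constant" reading of each coefficient can be sketched in a few lines. The coefficients and x values below are hypothetical, chosen only to make the arithmetic easy to follow; they are not estimates from any data set.

```python
# Hypothetical coefficients, for illustration only (not estimated from data).
a = 5.0                   # intercept
b = [2.0, -1.5, 0.5]      # b1, b2, b3


def expected_y(x):
    """E(y) = a + b1*X1 + b2*X2 + ... + bk*Xk for one observation."""
    return a + sum(bj * xj for bj, xj in zip(b, x))


base = expected_y([10.0, 4.0, 8.0])
bumped = expected_y([11.0, 4.0, 8.0])   # X1 raised by one unit; X2 and X3 unchanged

# The change in E(y) equals b1, the coefficient on X1.
print(base, bumped, bumped - base)
```

Raising X1 by one unit while leaving X2 and X3 fixed moves E(y) by exactly b1, which is what "unique contribution" means in the model.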


 

Example

 

•Suppose we are analyzing the relationship of households’ per capita earnings to their numbers of members & their members’ ages, years of education, race-ethnicity, gender & employment statuses.

 

•What is the independent effect of years of education on per capita household earnings, holding the other variables constant?


 

Caution!: Nonlinearity and Causality

 

•Such a statistical finding raises questions: e.g., is a year of college equivalent to a year of graduate school with regard to household earnings?

 

•We’ll see that multiple regression can accommodate nonlinear as well as linear y/x relationships.

Basic Statistics Review

 

•What is a variable?

•What are the basic kinds of variables?

•How do we describe them in, first, univariate terms, & second, bivariate terms?

•Why do we need to describe them both graphically & numerically?


 

Equivalent terms for variables

 

•Y: dependent, outcome, response, criterion, left-hand side

•X: independent, explanatory, predictor, regressor, control, right-hand side


 

Basic Statistics Review

 

•What’s the fundamental problem with the mean as a measure of central tendency & standard deviation as a measure of spread?

•When should we use them? Can we use any other alternatives?

•Despite their problems, why are the mean & standard deviation used so commonly?


•What’s a normal distribution? What statistics describe a normal distribution? Why is it important?

•What does it mean to standardize a variable, & how is it done?

•Are all symmetric distributions normal?
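As a reminder of how standardizing works, here is a minimal sketch using made-up scores: each value is re-expressed as its distance from the mean in standard deviation units, z = (x - mean) / sd. The sample (n - 1) standard deviation is assumed.

```python
import statistics

scores = [52.0, 60.0, 47.0, 71.0, 65.0]   # made-up values

mean = statistics.mean(scores)
sd = statistics.stdev(scores)             # sample standard deviation (n - 1 denominator)

# Standardize: subtract the mean, then divide by the standard deviation.
z_scores = [(s - mean) / sd for s in scores]

print([round(z, 2) for z in z_scores])
```

The standardized values have mean 0 and standard deviation 1, but the shape of the distribution is unchanged: standardizing does not make a variable normal.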


•What’s a population? A sample?

•What’s a parameter? A statistic?

•What are the basic probability problems of samples, & how most basically do we try to mitigate them?

•Why is a sample mean typically used to estimate a parameter? What’s an expected value?


•What’s sampling variability? A sampling distribution? A population distribution?

•What’s the sampling distribution of a sample mean? The law of large numbers? The central limit theorem?

•Why’s the central limit theorem crucial to inferential statistics?


•What’s the difference between a standard deviation & a standard error? How do their formulas differ?

•What’s the difference between the z- & t-distributions? Why do we typically use the latter?
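The difference between the two formulas can be sketched with a made-up sample: the standard deviation describes the spread of individual observations, while the standard error of the mean divides that spread by the square root of n.

```python
import math
import statistics

sample = [4.0, 8.0, 6.0, 5.0, 7.0, 9.0, 3.0, 6.0, 6.0]   # made-up sample, n = 9

n = len(sample)
sd = statistics.stdev(sample)   # spread of the individual observations
se = sd / math.sqrt(n)          # estimated spread of the sample mean across repeated samples

print(round(sd, 3), round(se, 3))
```

Because n sits in the denominator, the standard error shrinks as the sample grows, whereas the standard deviation does not systematically shrink.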


•What’s a hypothesis test? What’s its purpose? Its premise & general formula? How is it stated? What’s its interpretation?


•What are the typical standards for judging statistical significance? To what extent are they defensible or not?

•What’s the difference between statistical & practical significance?

•What are the possible reasons for a finding of statistical insignificance?


•What are Type I & Type II errors?

Principles concerning statistics & social/policy research

 

•(1) Anecdotal versus systematic evidence (including the importance of theories in guiding research).

•(2) Social construction of reality.

•(3) Experimental versus observational evidence.

•(4) Beware of lurking variables.

•(5) Variability is everywhere.

•(6) All conclusions are uncertain.


 

Linear Regression

 

•I want to predict all the values of y based on a random sample of size n. I can use the following equation:

 

y = E(y) + error

 

•But since I do not know the value of the error term for each observation, my best prediction is the mean of y.


•However, a more accurate model—& thus more precise predictions—can be obtained by using explanatory variables to estimate each value of y.


•Here we see a major advantage of regression versus correlation: regression permits y/x directionality (including multiple explanatory variables).

•In addition, regression coefficients are expressed in the units in which the variables are measured.
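The gain from using an explanatory variable can be sketched by comparing two predictors on made-up data: the mean of y alone, and a least squares line in x. The data and names are hypothetical.

```python
x = [0.0, 1.0, 2.0, 3.0, 4.0]   # made-up explanatory variable
y = [1.0, 3.0, 2.0, 5.0, 4.0]   # made-up outcome variable

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Baseline: predict the mean of y for every observation.
sse_mean = sum((yi - y_bar) ** 2 for yi in y)

# Least squares line: slope = SSxy / SSxx, intercept = y_bar - slope * x_bar.
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar
sse_line = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# The slope is in units of y per unit of x; the squared errors fall when x is used.
print(round(slope, 3), round(sse_mean, 3), round(sse_line, 3))
```

In this toy example the sum of squared prediction errors drops from 10 to 3.6 once x is used, and the slope of 0.8 reads directly as "0.8 units of y per unit of x".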


 

Objective: To build useful models

 

Definition

 

•“A model is a simplification of, and approximation to, some aspect of the world. Models are never literally ‘true’ or ‘false,’ although good models abstract only the ‘right’ features of the reality they represent” (King et al.).


 

MODELING

 

•We’ll focus, then, on modeling: trying to describe how sets of explanatory variables (x’s) are related to the outcome variable y.

 

•Integral to this focus will be an emphasis on the interconnections of theory & empirical research (including questions of causality).

•Use graphs to check distributions & outliers

•The univariate distributions of the variables for regression analysis need not be normal!

•But the usual caveats concerning extreme outliers must be heeded.

•It’s not the univariate graphs but the y/x bivariate scatterplots that provide the key evidence on these concerns.


Multiple Linear Regression

 

•The characteristics of bivariate scatterplots & correlations do not necessarily predict whether explanatory variables will be significant or not in a multiple regression model.

•Moreover, bivariate relationships don’t necessarily indicate whether a Y/X relationship will be positive or negative within a multivariate framework.


•This is because multiple regression expresses the joint, linear effects of a set of explanatory variables on an outcome variable.
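A sign reversal of this kind can be sketched with a small made-up data set. The sketch assumes the standard partialling-out result: the multiple regression coefficient on x equals the slope obtained after removing the linear effect of the other variable (here z) from both x and y. All names and values are hypothetical.

```python
def slope(x, y):
    """Least squares slope of y on a single x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    return ss_xy / ss_xx


def residualize(v, z):
    """Residuals of v after a simple regression of v on z."""
    n = len(v)
    b = slope(z, v)
    a = sum(v) / n - b * sum(z) / n
    return [vi - (a + b * zi) for vi, zi in zip(v, z)]


# Made-up data: z marks two groups; within each group y falls as x rises.
z = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y = [-1.0, -2.0, -3.0, 4.0, 3.0, 2.0]   # constructed as y = -x + 10*z

bivariate_slope = slope(x, y)                                  # ignores z
partial_slope = slope(residualize(x, z), residualize(y, z))    # holds z constant

print(round(bivariate_slope, 3), round(partial_slope, 3))
```

The bivariate slope is positive because the group with higher x also has higher y, yet the slope holding z constant is negative, matching how the data were constructed.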


Simple Regression Model

 

•Could you describe the components of the equation and their meaning?


 

 

STATA output

 

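. reg math read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =  154.70
       Model |  7660.75905     1  7660.75905           Prob > F      =  0.0000
    Residual |  9805.03595   198  49.5203836           R-squared     =  0.4386
-------------+------------------------------           Adj R-squared =  0.4358
       Total |   17465.795   199  87.7678141           Root MSE      =  7.0371

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .6051473   .0486538    12.44   0.000      .509201    .7010935
       _cons |   21.03816    2.58945     8.12   0.000     15.93172     26.1446
------------------------------------------------------------------------------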
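Several of the summary statistics in this output can be recomputed from the sums of squares, degrees of freedom, coefficient and standard error it reports. The sketch below copies those inputs from the output and applies the usual textbook definitions; the variable names are illustrative.

```python
import math

# Inputs copied from the Stata output.
ss_model, df_model = 7660.75905, 1
ss_resid, df_resid = 9805.03595, 198
coef_read, se_read = 0.6051473, 0.0486538

ss_total = ss_model + ss_resid
ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid

r_squared = ss_model / ss_total                                  # share of variation explained
adj_r_squared = 1 - (1 - r_squared) * (df_model + df_resid) / df_resid
f_stat = ms_model / ms_resid                                     # model MS over residual MS
root_mse = math.sqrt(ms_resid)                                   # typical size of a residual
t_read = coef_read / se_read                                     # coefficient over its standard error

print(round(r_squared, 4), round(adj_r_squared, 4), round(f_stat, 2),
      round(root_mse, 4), round(t_read, 2))
```

The results agree with the reported R-squared, Adj R-squared, F, Root MSE and t to the precision shown. With a single explanatory variable, the square of the t statistic on read is also approximately the F statistic.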

 

Interpretation? Causality?


 

Why is a regression model probabilistic rather than deterministic?

 

•Because the model is estimated from sample data & thus will include some variation due to random phenomena that can’t be modeled or explained.


•That is, the random error component represents all unexplained variation in outcome variable y caused by important but omitted variables or by unexplainable random phenomena.


 

Random error

 

•The model’s random error component consists of deviations between the observed & predicted values of y. These are the residuals (which, to repeat, are estimates of the model’s error component for each value of y).

•That is, each residual equals the observed value of y minus the predicted value of y.
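A single residual can be worked through with the fitted equation from the Stata output in this lecture (math on read). The student below is hypothetical, not a case from the data set.

```python
# Coefficients copied from the Stata output in this lecture.
intercept = 21.03816
slope_read = 0.6051473


def predict_math(read):
    """Predicted math score for a given reading score."""
    return intercept + slope_read * read


# Hypothetical student, not taken from the data set.
read_score = 50.0
observed_math = 55.0

predicted_math = predict_math(read_score)     # about 51.30
residual = observed_math - predicted_math     # about 3.70 (observed minus predicted)

print(round(predicted_math, 2), round(residual, 2))
```

A positive residual means the student scored higher in math than the model predicts from the reading score alone.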


 

Ordinary Least Squares Line

 

•The least squares line, or regression line, follows two properties:

 

•1) The sum of the errors (i.e. deviations or residuals) equals zero: SE = 0.

•2) The sum of the squared errors, SSE, is smaller than for any other straight-line model with SE = 0.
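Both properties can be checked numerically on made-up data. The sketch below fits the least squares line, confirms its errors sum to zero, and compares its SSE with a few rival lines that pass through the point of means (so their errors also sum to zero). The data and the rival slopes are arbitrary choices for illustration.

```python
def fit_ols(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = ss_xy / ss_xx
    return y_bar - slope * x_bar, slope


def errors(x, y, intercept, slope):
    """Observed y minus predicted y for each observation."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


x = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_ols(x, y)
ols_errors = errors(x, y, a, b)
ols_sse = sum(e ** 2 for e in ols_errors)

# Rival lines through (x_bar, y_bar): their errors sum to zero as well.
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
rival_sses = []
for rival_slope in (0.4, 0.5, 0.7, 0.8):
    rival_errors = errors(x, y, y_bar - rival_slope * x_bar, rival_slope)
    rival_sses.append(sum(e ** 2 for e in rival_errors))

print(round(sum(ols_errors), 10), round(ols_sse, 3), [round(s, 3) for s in rival_sses])
```

Every rival line has a larger SSE than the least squares line, even though all of them satisfy the first property.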


 

OLS

 

•The regression line is called the least squares line because it minimizes the squared distance between the equation’s y-predictions & the data’s y observations (i.e. it minimizes the sum of squared errors, SSE).

•The better the model fits the data, the smaller the distance between the y-predictions & the y-observations.


 

Coefficients and SSE

 

•Here are the values of the regression model’s estimated beta (i.e. slope or regression) coefficient & y-intercept (i.e. constant) that minimize SSE:
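A minimal sketch, assuming the standard least squares formulas for simple regression are the ones meant here: slope b1 = SSxy / SSxx and intercept a = (mean of y) - b1 * (mean of x), where SSxy is the sum of cross-products of deviations and SSxx the sum of squared deviations of x. The data are made up. The sketch also shows the equivalent form b1 = r * (sd of y / sd of x), which ties the slope back to the correlation coefficient.

```python
import math

x = [2.0, 4.0, 6.0, 8.0]     # made-up data
y = [3.0, 7.0, 8.0, 12.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_yy = sum((yi - y_bar) ** 2 for yi in y)

b1 = ss_xy / ss_xx            # slope
a = y_bar - b1 * x_bar        # y-intercept (constant)

# Same slope via the correlation coefficient and the two standard deviations.
r = ss_xy / math.sqrt(ss_xx * ss_yy)
sd_x = math.sqrt(ss_xx / (n - 1))
sd_y = math.sqrt(ss_yy / (n - 1))
b1_from_r = r * sd_y / sd_x

print(round(b1, 3), round(a, 3), round(r, 4), round(b1_from_r, 3))
```

The second form makes the unit dependence visible: r is unit-free, while the slope rescales it by the ratio of the two standard deviations.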