Lecture 4. Nonlinear Models and Problems
I. How can multiple regression handle nonlinear relationships?
• Transformation of variables
– Independent variables
– Dependent variables
• Dummy variables
• Interactions between two or more variables
A. Nonlinear relationships
• Linear regression requires only that the model remain linear in the parameters
• Transforming the independent variables therefore does not change the estimation process
• However, the interpretation of the coefficients changes, because the effect of a variable is no longer constant across all of its values.
B. Transformation of the dependent variable
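• Logarithm of y (log-linear models): $\log(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \dots + \beta_k x_k + \varepsilon$
$y = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \dots + \beta_k x_k + \varepsilon}$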
– The predicted values of y will always be positive
– Transformation of the coefficients into percentage changes: $100(e^{b} - 1)$
– Interpretation: a one-unit increase in x produces a $100(e^{b} - 1)$ percent change in y (approximately 100b percent when b is small)
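• Logit models: applied when y is a proportion (a number between 0 and 1, not a percentage); the dependent variable becomes $\log(y / (1 - y))$
– This transformation preserves the lower and upper bounds of y: predicted proportions always stay between 0 and 1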
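The percentage-change formula for a log-linear coefficient and the logit transformation of a proportion can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the example coefficients are hypothetical and not part of the lecture.

```python
import math

def percent_change(b):
    """Percentage change in y for a one-unit increase in x in a log-linear model."""
    return 100.0 * (math.exp(b) - 1.0)

def logit(p):
    """Map a proportion strictly between 0 and 1 onto the whole real line."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Map any real number back to a proportion between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients: the shortcut "100*b percent" is close only for small b.
for b in (0.05, 0.5):
    print(b, round(percent_change(b), 2))

# However extreme the linear prediction, it maps back inside the bounds of y.
for z in (-5.0, 0.0, 5.0):
    print(z, round(inverse_logit(z), 4))
```

Comparing the two coefficients shows why the exact formula matters: the gap between 100b and the exact percentage change grows as b moves away from zero.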
C. Models with power transformations and one quantitative independent variable
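• First order model: $y = \beta_0 + \beta_1 x_1 + \varepsilon$
• Second order model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon$
• Third order model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_1^3 + \varepsilon$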
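A second order model is still linear in the parameters, so ordinary least squares applies once the squared term is added as an extra column. The sketch below assumes hypothetical noise-free data generated from y = 1 + 2x + 3x^2 and uses a small hand-written normal-equations solver (the helper names `solve` and `ols` are illustrative, not from the lecture); in practice a statistics package would do this step.

```python
def solve(a, b):
    """Solve the square linear system a * coef = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(columns, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)

# Hypothetical data with a known quadratic relationship and no noise.
x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [1.0 + 2.0 * v + 3.0 * v ** 2 for v in x]

# Design columns: constant, x, and the transformed variable x squared.
columns = [[1.0] * len(x), x, [v ** 2 for v in x]]
b0, b1, b2 = ols(columns, y)

def marginal_effect(v):
    """Slope of the fitted curve at x = v; it changes with v."""
    return b1 + 2.0 * b2 * v

for v in (0.0, 1.0, 2.0):
    print(v, round(marginal_effect(v), 6))
```

The estimation step is identical to a first order model; only the interpretation changes, because the effect of x now depends on where it is evaluated.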
D. First order model with k quantitative independent variables:
$\hat{y} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \dots + \beta_k x_k$
• Interpretation:
– $\beta_i$: change in E(y) for a one-unit increase in $x_i$ when all other x’s are held constant.
E. Second order model with k quantitative independent variables
$\hat{y} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
• The interaction term implies that the marginal effect of one independent variable depends on the level of the other independent variable
• Interpretation when one independent variable is held fixed
– $\beta_1 + \beta_3 x_2$: change in E(y) for a one-unit increase in $x_1$ when $x_2$ is held constant
– $\beta_2 + \beta_3 x_1$: change in E(y) for a one-unit increase in $x_2$ when $x_1$ is held constant
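The interpretation of the interaction model can be checked numerically. The sketch below uses made-up coefficient values (all numbers are hypothetical) and compares the formula for the marginal effect of x1 with the actual difference between two predictions one unit apart.

```python
def predict(a, b1, b2, b3, x1, x2):
    """Predicted y from the interaction model."""
    return a + b1 * x1 + b2 * x2 + b3 * x1 * x2

def marginal_effect_x1(b1, b3, x2):
    """Change in predicted y for a one-unit increase in x1 at a given x2."""
    return b1 + b3 * x2

# Hypothetical coefficients, chosen only for illustration.
a, b1, b2, b3 = 10.0, 2.0, 1.0, 0.5

for x2 in (0.0, 2.0, 4.0):
    # Difference between predictions at x1 = 1 and x1 = 0, holding x2 fixed.
    difference = predict(a, b1, b2, b3, 1.0, x2) - predict(a, b1, b2, b3, 0.0, x2)
    print(x2, difference, marginal_effect_x1(b1, b3, x2))
```

The two printed columns agree at every level of x2, and both grow with x2, which is what the interaction term implies.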
F. Models with one qualitative independent variable
$\hat{y} = \alpha + \beta_1 z_1 + \beta_2 z_2$
• where $z_1$ and $z_2$ are the dummy variables for a qualitative variable with 3 categories
• The qualitative independent variables are coded as follows:
– Create a new variable for each of the k categories of the qualitative independent variable
– Code it 1 if the trait is present and 0 if it is absent
– Include (k-1) dummy variables in the regression model; the omitted category is the reference category
• Interpretation:
– $\alpha$: the mean value of y for the reference category (or category 0)
– $\alpha + \beta_1$: the mean value of y for category 1
– $\alpha + \beta_2$: the mean value of y for category 2
– $\beta$’s: the distance between the mean of a given category and the mean of the reference category
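The coding rule and the interpretation of the coefficients can be illustrated with a small sketch. The data, the category labels, and the helper `make_dummies` are hypothetical. The sketch relies on the fact that, when the dummies are the only regressors, least squares reproduces the group means, so the coefficients are computed here from those means rather than by running a regression.

```python
from statistics import mean

def make_dummies(values, reference):
    """Return a 0/1 indicator list for every category except the reference."""
    categories = sorted(set(values) - {reference})
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical data: a qualitative variable with k = 3 categories and an outcome y.
group = ["a", "a", "b", "b", "c", "c"]
y = [10.0, 12.0, 15.0, 17.0, 20.0, 24.0]

# k - 1 = 2 dummy variables; category "a" is the reference.
dummies = make_dummies(group, reference="a")

# Group means of y, one per category.
group_means = {
    g: mean([value for value, label in zip(y, group) if label == g])
    for g in set(group)
}

alpha = group_means["a"]                              # mean of the reference category
betas = {g: group_means[g] - alpha for g in dummies}  # distance from the reference

print(dummies)
print(alpha, betas)
```

Changing the reference category changes alpha and every beta, but the implied group means stay the same.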
G. Models with quantitative and qualitative independent variables
• Write a complete model that depicts the relationships between these variables:
• Dependent variable: income
• Independent variables:
– Years of education
– Age
– Sex
– Race (White, African American, Hispanic, Other)
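One possible way to write such a model, offered as a sketch rather than the intended answer: it assumes a first order specification with no interactions, one dummy for sex (named female here), and three dummies for race with White as the reference category. The variable names and the choice of reference categories are assumptions.

```latex
\text{income} = \alpha
  + \beta_1\,\text{education}
  + \beta_2\,\text{age}
  + \beta_3\,\text{female}
  + \beta_4\,\text{AfricanAmerican}
  + \beta_5\,\text{Hispanic}
  + \beta_6\,\text{Other}
  + \varepsilon
```

Sex has two categories and race has four, so the (k-1) rule gives one and three dummy variables respectively.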
II. What can go wrong with multiple regression?
• Are important variables left out of the model?
• Does the dependent variable affect any of the independent variables?
• How well are the independent variables measured?
• Is the sample large enough to detect important effects?
• Is the sample so large that trivial effects are statistically significant?
• Do some variables mediate the effects of other variables?
• Are some independent variables highly correlated?
• Is the sample biased?
• Are there any other problems to watch for?
A. Are important variables left out of the model?
• Two reasons for including a variable in a regression model:
– To measure the effect of the independent variable on the dependent variable
– To control for the variable
• A control variable is important when both of the following hold:
– The variable has a causal effect on the dependent variable
– The variable is correlated with the key independent variables
• If the variable has a strong effect on the dependent variable but is unrelated to the independent variables in the model, there is NO need to include it.
• If an important control variable is omitted, the regression coefficients will be biased and your conclusions may be spurious.
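The two conditions for omitted-variable bias can be seen in a small constructed example. The sketch below assumes hypothetical noise-free data in which y = 1 + 2x + 3z, then regresses y on x alone, once with a z that is uncorrelated with x and once with a z that is correlated with x. All variable names and values are illustrative.

```python
def slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def outcome(x, z):
    """Hypothetical true model: the effect of x is 2 and the effect of z is 3."""
    return [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x, z)]

x = [-1.0, 1.0, -1.0, 1.0]
z_unrelated = [-1.0, -1.0, 1.0, 1.0]   # zero covariance with x
z_related = [-1.0, 1.0, 0.0, 2.0]      # positive covariance with x

# z is left out of both regressions.
slope_unrelated = slope(x, outcome(x, z_unrelated))
slope_related = slope(x, outcome(x, z_related))

print(slope_unrelated)  # omitted z unrelated to x: the true effect of x is recovered
print(slope_related)    # omitted z related to x: the slope absorbs part of z's effect
```

In the second case the estimated slope equals the true effect of x plus the effect of z times the slope of z on x, which is why the omitted variable matters only when both conditions hold.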
B. Does the dependent variable affect any of the independent variables?
• This is called “reverse causation”
• If reverse causation is present:
– Every coefficient in the regression model may be biased
– It is hard to correct
• How you could avoid it:
– Use data from randomized experiments
– Use the time ordering of the variables (three points in time)
C. How well are the independent variables measured?
• If an independent variable is afflicted with measurement error, the coefficient for that variable will be biased
• Reliability: reliability methods only measure the stability of the variable
• Validity: are we measuring what we want to measure?
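A minimal sketch of how measurement error biases a coefficient, under the assumption of classical measurement error (random error that is unrelated to the true value and to y). The lecture does not specify the kind of error, so that assumption, the data, and the variable names are all hypothetical.

```python
def slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def variance(v):
    """Population variance (divides by n)."""
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

x_true = [-1.0, 1.0, -1.0, 1.0]
error = [-1.0, -1.0, 1.0, 1.0]            # uncorrelated with x_true by construction
y = [1.0 + 2.0 * v for v in x_true]       # true effect of x is 2

# What the researcher actually records: true value plus error.
x_observed = [a + b for a, b in zip(x_true, error)]

slope_true = slope(x_true, y)
slope_observed = slope(x_observed, y)

# Share of the observed variance that is due to the true variable.
reliability = variance(x_true) / variance(x_observed)

print(slope_true, slope_observed, reliability)
```

Under this assumption the observed slope equals the true slope multiplied by the reliability, so the bias is toward zero and grows as the measure becomes less reliable.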
D. Is the sample large enough to detect important effects?
• Sample size is crucial for significance tests
– E.g. with a sample of 60, you need a correlation of at least 0.25 for it to be significantly different from zero
– With a sample of 10,000, almost any correlation will be significant
• In a small sample, statistically significant coefficients should be taken seriously, but a nonsignificant coefficient is extremely weak evidence for the absence of an effect
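The sample-size figures above can be approximated with a short calculation. The sketch below assumes a two-tailed test at the 0.05 level and uses the normal critical value in place of the exact t critical value, so the results are approximations; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def critical_correlation(n, alpha=0.05):
    """Approximate smallest |r| that is significantly different from zero.

    Inverts t = r * sqrt(n - 2) / sqrt(1 - r^2), using the normal critical
    value as a stand-in for the t critical value (an approximation).
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z / sqrt(z * z + n - 2)

for n in (60, 1000, 10000):
    print(n, round(critical_correlation(n), 3))
```

The threshold shrinks roughly with the square root of the sample size, which is why very large samples flag correlations that are far too small to matter in practice.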
E. Is the sample so large that trivial effects are statistically significant?
• Large samples can lead to misleading conclusions
• Almost any variable you put in a regression model is likely to show up as statistically significant
• You need to determine whether the coefficient is substantively significant.
– Is the coefficient large enough to have theoretical or practical importance?
– Look at the standardized coefficients
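A standardized coefficient rescales the slope into standard-deviation units, which removes the dependence on the units of measurement. The sketch below uses hypothetical data in which the same predictor is recorded once in thousands and once in single units; the helper names are illustrative.

```python
from statistics import stdev

def slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def standardized_slope(x, y):
    """Change in y, in standard deviations, per one standard deviation of x."""
    return slope(x, y) * stdev(x) / stdev(y)

# Hypothetical data: the same predictor in two different units.
x_thousands = [1.0, 2.0, 3.0, 4.0, 5.0]
x_units = [v * 1000.0 for v in x_thousands]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

print(slope(x_thousands, y), slope(x_units, y))                            # raw slopes differ
print(standardized_slope(x_thousands, y), standardized_slope(x_units, y))  # standardized agree
```

The raw slope looks large or tiny depending on the unit chosen, while the standardized slope is the same in both cases, which makes it easier to judge substantive importance.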
F. Do some variables mediate the effects of other variables?
• If you include in the model other variables that mediate the effect of an independent variable on the dependent variable, then the real effect of the independent variable on the dependent variable might disappear.
• If you insert intervening variables in the regression, then the effect of the main variable might not be observed
G. How to tell if a variable has important indirect effects?
• Have a clear idea of the causal ordering of the variables in the regression model
• Get more information about the relationship by fitting two models:
– Dependent = Main independent: gives the total effect (TE)
– Dependent = Main independent + Intervening: gives the direct effect (DE)
– Indirect effect = Total effect - Direct effect
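The total, direct, and indirect effects can be computed by fitting the two models described above. The sketch below assumes hypothetical noise-free data in which x affects y both directly and through an intervening variable m, and uses closed-form least-squares formulas for one and two predictors; all names and values are illustrative.

```python
def centered(v):
    """Subtract the mean from every value."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def simple_slope(x, y):
    """Slope of y on x alone."""
    xc, yc = centered(x), centered(y)
    return dot(xc, yc) / dot(xc, xc)

def two_predictor_slopes(x, m, y):
    """Slopes of y on x and m together, from the 2x2 normal equations."""
    xc, mc, yc = centered(x), centered(m), centered(y)
    sxx, smm, sxm = dot(xc, xc), dot(mc, mc), dot(xc, mc)
    sxy, smy = dot(xc, yc), dot(mc, yc)
    det = sxx * smm - sxm ** 2
    return (smm * sxy - sxm * smy) / det, (sxx * smy - sxm * sxy) / det

# Hypothetical data: m depends on x, and y = 1 + 2x + 3m.
x = [-1.0, 1.0, -1.0, 1.0]
m = [-1.0, 1.0, 0.0, 2.0]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x, m)]

total = simple_slope(x, y)                              # model without the intervening variable
direct, mediator_effect = two_predictor_slopes(x, m, y) # model with the intervening variable
indirect = total - direct

print(total, direct, indirect)
```

In this constructed example the indirect effect also equals the slope of m on x multiplied by the effect of m on y, which is the usual way the indirect path is described.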
H. Are some independent variables highly correlated?
• Multiple regression is designed for separating the effects of two or more independent variables on a dependent variable when the independent variables are correlated with one another
• However, when two independent variables are highly correlated, a problem of multicollinearity arises
• If multicollinearity is a problem, the standard errors are very large, so it can happen that neither of the highly correlated variables is statistically significant
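How quickly the standard errors grow can be sketched with the variance inflation factor. The lecture does not name a diagnostic; the factor 1 / (1 - r^2) for a model with two predictors correlated at r is a standard result, brought in here only as an illustration. The function name and the example correlations are hypothetical.

```python
from math import sqrt

def variance_inflation(r):
    """Factor by which the coefficient variance grows when two predictors correlate at r."""
    return 1.0 / (1.0 - r * r)

for r in (0.0, 0.5, 0.9, 0.99):
    vif = variance_inflation(r)
    # The standard error grows by the square root of the variance inflation.
    print(r, round(vif, 2), round(sqrt(vif), 2))
```

Moderate correlations cost little precision, but as r approaches 1 the standard errors grow without bound, which is when neither variable may reach significance.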
I. Is the sample biased?
• External validity: Can the results from the sample be generalized to other groups?
– Depends on the sample design and the definition of the initial population
• Internal validity: Can the sample selection process produce erroneous results for the sample itself?
J. Are there any other problems to watch for?
• Non-linearity
• Variables measured on an ordinal scale
• Outliers or influential cases
• Incomplete information in some cases (missing data)