Lecture 4.  Nonlinear Models and Problems

I. How can multiple regression handle nonlinear relationships?

•Transformation of variables

– Independent variables

– Dependent variables

•Dummy variables

•Interactions between two or more variables

A. Nonlinear relationships

•Linear regression requires the model to remain linear in the parameters

•Transforming the independent variables does not change the estimation process

•However, the interpretation of the coefficients changes, because the effect of a variable is no longer constant across its values.

 

B. Transformation of the dependent variable

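Logarithm of y (log-linear models): $\log(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \dots + \beta_k x_k + \varepsilon$

         $y = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \dots + \beta_k x_k + \varepsilon}$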

    – The predicted values of y will always be positive

    – Transformation of the coefficients into percentage changes: $100(e^{b} - 1)$
    – Interpretation: a one-unit increase in x produces a $100(e^{b} - 1)$ percent change in y (roughly $100b$ percent when b is small)
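The conversion from a log-linear coefficient to a percentage change can be checked numerically. The sketch below is only an illustration: the function name and the two coefficient values are assumptions, not part of the lecture.

```python
import math

def percent_change(b):
    """Percent change in y for a one-unit increase in x when log(y) is the outcome."""
    return 100.0 * (math.exp(b) - 1.0)

# Illustrative coefficient values (hypothetical, chosen to contrast small and large b).
for b in (0.05, 0.50):
    print(f"b = {b:.2f}: exact {percent_change(b):.1f}%, approximation {100 * b:.0f}%")
```

For a small coefficient the exact value and the 100b shortcut nearly coincide; for a large coefficient they diverge, so the exact formula is the safer choice.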

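Logit models: applied when y is a proportion (a number between 0 and 1, not a percentage)

     The dependent variable becomes $\log\left(\frac{y}{1-y}\right)$. This transformation preserves the lower and upper bounds of y: predicted proportions stay between 0 and 1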
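A minimal sketch of the logit transformation, log(y / (1 - y)), and its inverse. The function names are assumptions made for illustration. It shows why the bounds are preserved: whatever value the linear predictor takes, the back-transformed proportion lies strictly between 0 and 1.

```python
import math

def logit(p):
    """Map a proportion in (0, 1) to the whole real line."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Map any real number back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values of the linear predictor, from very negative to very positive.
for z in (-5.0, 0.0, 5.0):
    print(z, round(inv_logit(z), 4))
```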

C. Models with power transformations and one quantitative independent variable

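•First order model:     $y = \beta_0 + \beta_1 x_1 + \varepsilon$

•Second order model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon$

•Third order model:    $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_1^3 + \varepsilon$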
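A second order model is still linear in the parameters, so it can be fitted by ordinary least squares once the squared term is added as a column. The sketch below assumes NumPy and uses made-up, noise-free data; the helper `slope_at` is hypothetical.

```python
import numpy as np

# Made-up data generated from beta0 = 1, beta1 = 2, beta2 = -0.5 (no noise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x - 0.5 * x**2

# Design matrix with columns 1, x, x^2: nonlinear in x, linear in the betas.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def slope_at(x0):
    """Effect of a small increase in x at x0: beta1 + 2 * beta2 * x0."""
    return beta[1] + 2.0 * beta[2] * x0

print(beta)
print(slope_at(0.0), slope_at(2.0), slope_at(4.0))
```

The slope changes sign as x grows, which is why a single coefficient no longer summarizes the effect of x.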

D. First order model with k quantitative independent variables:

 $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \dots + \beta_k x_k + \varepsilon$

 

  Interpretation: $\beta_i$ is the change in E(y) for a one-unit increase in $x_i$ when all other x's are held constant.

E. Second order model with k quantitative independent variables

$\hat{y} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$

 

  The interaction term implies that the marginal effect of one independent variable will depend on the level of the other independent variables

  Interpretation when one independent variable is held fixed

   – $\beta_1 + \beta_3 x_2$: change in E(y) for a one-unit increase in $x_1$ when $x_2$ is held constant
   – $\beta_2 + \beta_3 x_1$: change in E(y) for a one-unit increase in $x_2$ when $x_1$ is held constant
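To see how the interaction term makes one variable's effect depend on the other, the sketch below evaluates a model with an x1·x2 term at hypothetical coefficient values. All numbers and function names are assumptions chosen for illustration.

```python
# Hypothetical coefficients for y-hat = alpha + b1*x1 + b2*x2 + b3*x1*x2.
alpha, b1, b2, b3 = 10.0, 2.0, -1.0, 0.5

def predict(x1, x2):
    """Predicted value from the model with an interaction term."""
    return alpha + b1 * x1 + b2 * x2 + b3 * x1 * x2

def effect_of_x1(x2):
    """Change in the prediction for a one-unit increase in x1, holding x2 fixed."""
    return b1 + b3 * x2

# The same one-unit increase in x1 has a different effect at different levels of x2.
for x2 in (0.0, 2.0, 4.0):
    print(x2, effect_of_x1(x2), predict(4.0, x2) - predict(3.0, x2))
```

The last two printed columns agree, which confirms that the formula for the marginal effect matches the difference in predictions.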

F. Models with one qualitative independent variable

$\hat{y} = \alpha + \beta_1 z_1 + \beta_2 z_2$

 

•where $z_1$ and $z_2$ are the dummy variables for a qualitative variable with 3 categories

•The qualitative independent variables are coded as follows:

   –Create a new variable for each of the k categories of the qualitative independent variable

   –Code it 1 if the trait is present and 0 if it is absent

   –Include (k-1) dummy variables in the regression model; the omitted category is the reference category

•Interpretation:

$\alpha$: the mean value of y for the reference category (or category 0)
$\alpha + \beta_1$: the mean value of y for category 1
$\alpha + \beta_2$: the mean value of y for category 2
The $\beta$'s indicate the distance between the mean of a given category and the mean of the reference category
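The coding rules and the interpretation of the coefficients can be verified on a tiny example. The sketch below assumes NumPy and uses made-up data with three categories (0 is the reference); the variable names are hypothetical.

```python
import numpy as np

# Made-up data: two observations in each of three categories.
group = np.array([0, 0, 1, 1, 2, 2])
y = np.array([10.0, 12.0, 15.0, 17.0, 7.0, 9.0])

# k = 3 categories, so k - 1 = 2 dummies; category 0 is left out as the reference.
z1 = (group == 1).astype(float)
z2 = (group == 2).astype(float)
X = np.column_stack([np.ones(len(y)), z1, z2])

alpha, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

print("reference mean:", alpha)
print("category 1 mean:", alpha + b1)
print("category 2 mean:", alpha + b2)
```

The intercept reproduces the mean of the reference category, and each dummy coefficient is the gap between its category mean and the reference mean.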

G. Models with quantitative and qualitative independent variables

•Write a complete model that depicts the relationships between these variables:

•Dependent variable: income

•Independent variables:

     –Years of Education
     –Age
     –Sex
     –Race (White, African American, Hispanic, Other)
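One possible first order specification for this exercise is sketched below. It rests on assumptions the notes do not state: Sex enters as a single dummy (Female, with male as the reference), Race enters as three dummies (with White as the reference), and all variable names are hypothetical. Squared terms or interactions could be added if the relationships are expected to be nonlinear.

```latex
\text{Income} = \alpha + \beta_1\,\text{Educ} + \beta_2\,\text{Age} + \beta_3\,\text{Female}
  + \beta_4\,\text{AfrAm} + \beta_5\,\text{Hispanic} + \beta_6\,\text{Other} + \varepsilon
```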

 


II.  What can go wrong with multiple regression?

•Are important variables left out of the model?

•Does the dependent variable affect any of the independent variables?

•How well are the independent variables measured?

•Is the sample large enough to detect important effects?

•Is the sample so large that trivial effects are statistically significant?

•Do some variables mediate the effects of other variables?

•Are some independent variables highly correlated?

•Is the sample biased?

•Are there any other problems to watch for?

 

A. Are important variables left out of the model?

•Two reasons for including a variable in a regression model:

– Measure the effect of the independent variable on the dependent variable

–Control for the variable

•A control variable is important when both conditions hold:

– The variable has a causal effect on the dependent variable
– The variable is correlated with the key independent variables

•If the variable has a strong effect on the dependent variable but is unrelated to the independent variables in the model, there is NO need to include it.

•If an important control variable is omitted, the regression coefficients will be biased and your conclusions may be spurious.
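Omitted variable bias is easy to reproduce in a simulation. The sketch below assumes NumPy; the data-generating values (a true effect of 1.0 for x, a control z that affects y and is correlated with x) and the `ols` helper are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

z = rng.normal(size=n)                       # control variable
x = 0.8 * z + rng.normal(size=n)             # key variable, correlated with z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true effect of x is 1.0

def ols(y, *cols):
    """Least-squares coefficients; the first entry is the intercept."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_full = ols(y, x, z)[1]   # z controlled
b_omit = ols(y, x)[1]      # z left out

print("coefficient of x with z in the model:", round(b_full, 3))
print("coefficient of x with z omitted:     ", round(b_omit, 3))
```

With z in the model the estimate stays close to the true value; with z omitted, x also picks up part of the effect of z and its coefficient is inflated.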

 

B. Does the dependent variable affect any of the independent variables?

•This is called “reverse causation”

•If reverse causation is present:

– Every coefficient in the regression model may be biased

–Hard to correct

•How to avoid it:

–Using data from randomized experiments

–Using the time ordering of the variables (three points in time)

 

C. How well are the independent variables measured?

•If an independent variable is measured with error, its coefficient will be biased (typically toward zero)

•Reliability: reliability methods only assess how stable (consistent) the measurement of the variable is

•Validity: are we measuring what we want to measure?
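The bias produced by measurement error in an independent variable can be illustrated with simulated data. The sketch below assumes NumPy and random measurement error that is unrelated to the true value; the true slope of 2.0 and the error variance are made-up values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

x_true = rng.normal(size=n)                # the variable we would like to measure
y = 2.0 * x_true + rng.normal(size=n)      # true slope is 2.0
x_obs = x_true + rng.normal(size=n)        # observed with error as large as the true variance

def slope(x, y):
    """Slope of the simple regression of y on x."""
    return np.polyfit(x, y, 1)[0]

print("slope using the true variable:    ", round(slope(x_true, y), 3))
print("slope using the mismeasured value:", round(slope(x_obs, y), 3))
```

In this setup the slope estimated from the mismeasured variable shrinks toward zero, to roughly half of the true slope, because half of the observed variance is noise.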

D. Is the sample large enough to detect important effects?

•Sample size is crucial for significance tests

– E.g. with a sample of 60, you need at least a 0.25 correlation for it to be significantly different from zero

– With a sample of 10,000, almost any correlation will be significant (anything above about 0.02)

•In a small sample statistically significant coefficients should be taken seriously, but a nonsignificant coefficient is extremely weak evidence for the absence of an effect
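The link between sample size and the smallest detectable correlation can be computed directly. The sketch below uses only the Python standard library and makes two assumptions: a two-sided test at the 0.05 level, and a normal approximation to the t distribution (so the result for small samples is slightly optimistic). The function name is hypothetical.

```python
import math
from statistics import NormalDist

def min_significant_r(n, alpha=0.05):
    """Smallest |r| that is significant in a two-sided test (normal approximation).

    Solves r * sqrt(n - 2) / sqrt(1 - r^2) = z for r.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z / math.sqrt(n - 2 + z * z)

for n in (60, 1000, 10000):
    print(n, round(min_significant_r(n), 3))
```

The threshold falls with the square root of the sample size, which is why very large samples flag even tiny correlations as significant.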

 

E. Is the sample so large that trivial effects are statistically significant?

Large samples can lead to misleading conclusions

•Almost any variable you put in a regression model is likely to show up as statistically significant

•You need to determine whether the coefficient is substantively significant.

   Is the coefficient large enough to have theoretical or practical importance?

   Look at the standardized coefficients
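The difference between statistical and substantive significance shows up clearly in a large simulated sample. The sketch below assumes NumPy; the sample size, the tiny true slope, and the variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

x = rng.normal(loc=50, scale=10, size=n)
y = 0.02 * x + rng.normal(scale=5, size=n)   # a very weak true effect

b, a = np.polyfit(x, y, 1)                   # slope and intercept
resid = y - (a + b * x)
se_b = np.sqrt(resid.var(ddof=2) / ((x - x.mean()) ** 2).sum())

t_stat = b / se_b
beta_std = b * x.std(ddof=1) / y.std(ddof=1)  # standardized coefficient

print("t statistic:", round(t_stat, 1))
print("standardized coefficient:", round(beta_std, 3))
```

The t statistic is far beyond conventional cutoffs, yet the standardized coefficient says that a one standard deviation change in x moves y by only a few hundredths of a standard deviation.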

 

F. Do some variables mediate the effects of other variables?

•If you include in the model other variables that mediate the effect of an independent variable on the dependent variable, then the real effect of the independent variable on the dependent variable might disappear.

•If you insert intervening variables in the regression, then the effect of the main variable might not be observed

 

G. How to tell if a variable has important indirect effects?

•Having a clear idea of the causal ordering of the variables in the regression model

•Having more information about the relationship:

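– Dependent = Main independent: the coefficient is the total effect (TE)

– Dependent = Main independent + Intervening: the coefficient of the main variable is the direct effect (DE)

– Indirect effect = Total effect − Direct effect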
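The two-regression procedure for separating total, direct, and indirect effects can be sketched with simulated data. The example assumes NumPy and a known causal ordering (x affects m, and both affect y); the effect sizes and the `ols` helper are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

x = rng.normal(size=n)                       # main independent variable
m = 0.5 * x + rng.normal(size=n)             # intervening variable, caused by x
y = 1.0 * x + 2.0 * m + rng.normal(size=n)   # direct effect 1.0, indirect 0.5 * 2.0

def ols(y, *cols):
    """Least-squares coefficients; the first entry is the intercept."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

total = ols(y, x)[1]        # regression without the intervening variable
direct = ols(y, x, m)[1]    # regression with the intervening variable
indirect = total - direct

print("total:", round(total, 2), "direct:", round(direct, 2), "indirect:", round(indirect, 2))
```

Adding the intervening variable roughly halves the coefficient of x here, even though the overall effect of x on y is unchanged.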

 

H. Are some independent variables highly correlated?

Multiple regression is designed for separating the effects of two or more independent variables on a dependent variable when the independent variables are correlated with one another

•However, when two independent variables are highly correlated, a problem of multicollinearity arises

•When multicollinearity is a problem, the standard errors are very large, so neither of the highly correlated variables may be statistically significant
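The inflation of standard errors under multicollinearity can be seen by fitting the same model twice, once with an unrelated second variable and once with a nearly duplicate one. The sketch below assumes NumPy; the data and helper names are made up for illustration.

```python
import numpy as np

def coef_std_errors(X, y):
    """OLS standard errors for a design matrix X that already includes a constant."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    return np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)
e = rng.normal(size=n)

x2_low = noise                 # unrelated to x1
x2_high = x1 + 0.05 * noise    # almost a copy of x1

def se_of_x1(x2):
    """Standard error of the x1 coefficient when x2 is the second regressor."""
    y = 1.0 + x1 + x2 + e
    X = np.column_stack([np.ones(n), x1, x2])
    return coef_std_errors(X, y)[1]

se_low = se_of_x1(x2_low)
se_high = se_of_x1(x2_high)
print("SE of x1, low correlation: ", round(se_low, 3))
print("SE of x1, high correlation:", round(se_high, 3))
```

The error variance and the sample size are identical in both fits; only the correlation between the two regressors differs, and that alone multiplies the standard error many times over.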

 

 

I. Is the sample biased?

•External validity: Can the results from the sample be generalized to other groups?

     – Depends on the sample design and the definition of the initial population

•Internal validity: Can the sample selection process produce erroneous results for the sample itself?

 

J. Are there any other problems to watch for?

• Non-linearity

• Variables measured on an ordinal scale

• Outliers or influential cases

• Incomplete information for some cases (missing data)