Distribution of the error term

 

•We’ll eventually see that the probability distribution of e determines how well the model describes the population relationship between outcome variable y & explanatory variable x.

 

•In this context, there are four basic assumptions about the probability distribution of e that:

–minimize bias &

–make confidence intervals & hypothesis tests valid.


 

Error term assumptions

 

1.The expected value of e over all possible samples is 0. That is, the mean of e does not vary with the levels of x.

2.The variance of the probability distribution of e is constant for all levels of x. That is, the variance of e does not vary with the levels of x.


3.The correlation between the errors associated with any two different y observations is 0. That is, the errors are uncorrelated: the error associated with one value of y has no effect on the errors associated with other y values.

4.The probability distribution of e is normal.


 

IID

 

•These assumptions of the regression model can be summarized as: I.I.D.

Independently & identically distributed errors.


 

Unbiased estimators

 

•As we’ll come to understand, the assumptions make the estimated least squares line an unbiased estimator of the population values of the y-intercept & the slope coefficient; that is, of the population mean value of y at each level of x.

•Plus the assumptions make the standard errors of the estimated least squares line as small as possible & unbiased, so that confidence intervals & hypothesis tests are valid.


 

Variability of the error term

 

•How do we estimate the variability of the random error e (that is, the variability of the observed values of outcome variable y around the regression line)?

•We do so by estimating the variance of e (i.e. the variance of the observed y values around their predicted values).
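A minimal sketch of that estimate, assuming a simple linear regression fitted by least squares: the error variance is commonly estimated as s^2 = SSE / (n - 2), the residual sum of squares divided by its degrees of freedom. The five data points and the helper name fit_line are made up for illustration.

```python
import math

# Illustrative data (hypothetical, not from the lecture)
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]


def fit_line(x, y):
    """Return the least squares intercept b0 and slope b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(x, y)

# Residual = observed y minus predicted y (yhat)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

# Two parameters (intercept & slope) are estimated, so df = n - 2
s_squared = sse / (len(x) - 2)
s = math.sqrt(s_squared)

print(f"s^2 = {s_squared:.3f}, s = {s:.3f}")
```

For these made-up numbers the estimated error variance is about 0.367 and s is about 0.61.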


 


Variance of the error term

 

•Why must we be concerned with the variance of e?

•Because the greater the variance of e, the greater will be the errors in the estimates of the y-intercept & the slope coefficient.

•Thus the greater the variance of e, the more inaccurate will be the predicted value of y for any given value of x.

Interpretation of the standard error of the error term

 

•Interpretation of s (yhat’s standard error, which is the square root of the error variance):

•We expect about 95% of the observed y values to fall within roughly +/-1.96*s of their predicted values (yhat).
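The +/-1.96*s rule of thumb can be checked on simulated data. This is a sketch under stated assumptions: the population line y = 3 + 2x, the error standard deviation of 2 and the sample size are all invented, and the errors are drawn to satisfy the four assumptions (mean 0, constant variance, independent, normal).

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 1000

# Hypothetical population: y = 3 + 2x + e, with e ~ Normal(0, sd = 2)
x = [rng.uniform(0, 10) for _ in range(n)]
y = [3 + 2 * xi + rng.gauss(0, 2) for xi in x]

# Least squares estimates of slope & intercept
x_bar = sum(x) / n
y_bar = sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# s = square root of the estimated error variance
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Share of observed y values within +/-1.96*s of yhat
within = sum(1 for e in residuals if abs(e) <= 1.96 * s)
share = within / n

print(f"s = {s:.2f}; share within +/-1.96*s = {share:.1%}")
```

When the assumptions hold, the share should land close to 95%; a share far from that in real data is a hint that one of the assumptions deserves a closer look.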


 

Assessing the usefulness of the regression model: making inferences about slope

 

•Depending on the selected alpha (i.e. test criterion) & on the test’s p-value, either reject or fail to reject Ho.
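A minimal sketch of the decision rule, assuming the usual null hypothesis Ho: slope = 0 against a two-sided alternative (the hypotheses are not spelled out above, so this is an assumption). Because the Python standard library has no t distribution, the sketch compares the t statistic with a critical value copied from a t table rather than computing a p-value; the two approaches lead to the same decision. The data are made up.

```python
import math

# Illustrative data (hypothetical, not from the lecture)
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = ss_xy / ss_xx          # estimated slope
b0 = y_bar - b1 * x_bar     # estimated intercept

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))        # estimated sd of the error term
se_b1 = s / math.sqrt(ss_xx)        # standard error of the slope

t_stat = b1 / se_b1                 # test statistic under Ho: slope = 0

# Two-sided critical value for alpha = .05 with n - 2 = 3 df (from a t table)
T_CRIT = 3.182
reject_h0 = abs(t_stat) > T_CRIT

print(f"t = {t_stat:.2f}; reject Ho at alpha = .05: {reject_h0}")
```

Here t is about 3.66, which exceeds 3.182, so Ho is rejected at alpha = .05 for these illustrative numbers.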


 

Hypothesis test’s assumptions

•Random sample

•The previously discussed four assumptions about e.


 

Conclusion for confidence interval

 

•We can say with 90% or 95% or 99% confidence that every one-unit increase/decrease in x increases/decreases y by +/-……units, on average.

•But remember: there are non-sampling sources of error, too.
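The interval behind that statement can be sketched as slope +/- (t critical value) * (standard error of the slope). The data are invented, and the critical value 3.182 (95% confidence, n - 2 = 3 degrees of freedom) is taken from a t table, since the standard library does not provide one.

```python
import math

# Illustrative data (hypothetical, not from the lecture)
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))
se_b1 = s / math.sqrt(ss_xx)

# t critical value for 95% confidence with 3 df (from a t table)
T_CRIT_95 = 3.182
margin = T_CRIT_95 * se_b1
lower, upper = b1 - margin, b1 + margin

print(f"95% CI for the slope: ({lower:.2f}, {upper:.2f})")
```

For these numbers the interval runs from about 0.09 to about 1.31: with 95% confidence, each one-unit increase in x raises y by between 0.09 and 1.31 units, on average.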


 

 

Correlation

 

•Correlation: a linear relationship between two quantitative variables.

•Beware of outliers & non-linearity: graph a bivariate scatterplot in order to conclude whether conducting a correlation test makes sense or not (& thus whether an alternative measure should be used).


•Correlation assesses the degree of bivariate cluster along a straight line: the strength of a linear relationship.

•Regression examines the degree of y/x slope of a straight line: the extent to which y varies in response to changes in x.


•Regarding correlation, remember that association does not necessarily imply causation.

•And beware of lurking variables.

 

Steps to estimate the correlation coefficient

 

•Standardize each x observation & each y observation.

•Cross-multiply each pair of x & y observations.

•Divide the sum of the cross-products by n - 1.
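The three steps above can be sketched in a few lines. The data are made up, and standardizing here means subtracting the sample mean and dividing by the sample standard deviation (the n - 1 version), which is what makes the final division by n - 1 come out right.

```python
import statistics

# Illustrative data (hypothetical, not from the lecture)
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

# Step 1: standardize each x observation & each y observation
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)  # sample sd (n - 1 divisor)
z_x = [(xi - x_bar) / s_x for xi in x]
z_y = [(yi - y_bar) / s_y for yi in y]

# Step 2: cross-multiply each pair of standardized x & y observations
cross_products = [zx * zy for zx, zy in zip(z_x, z_y)]

# Step 3: divide the sum of the cross-products by n - 1
r = sum(cross_products) / (n - 1)

print(f"r = {r:.3f}")
```

For these numbers r is about 0.904, a strong positive linear relationship.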



 

 

Hypothesis Test for correlation coefficient


 

 

Before estimating a correlation coefficient

 

•STATA command: pwcorr y x, sig star(.05)


 

Coefficient of Determination

 

•r^2 (in simple but not multiple regression, the square of the correlation coefficient r) represents the proportion of the sum of squared deviations of the y values about their mean that can be attributed to a linear relationship between y & x.

 

•Interpretation: about 100(r^2)% of the sample variation in y can be attributed to using x to predict y in the straight-line model.

 

•Higher r^2 signifies better fit: greater cluster along the y/x straight line.
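A small sketch, on made-up data, of two ways to arrive at the coefficient of determination in simple regression: squaring the correlation coefficient r, and computing 1 - SSE/SST from the fitted line. The two should agree.

```python
import math

# Illustrative data (hypothetical, not from the lecture)
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_yy = sum((yi - y_bar) ** 2 for yi in y)   # same as SST
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Route 1: square the correlation coefficient
r = ss_xy / math.sqrt(ss_xx * ss_yy)
r_squared = r ** 2

# Route 2: share of total variation not left in the residuals
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
r_squared_from_fit = 1 - sse / ss_yy

print(f"r^2 = {r_squared:.3f}: about {100 * r_squared:.0f}% of the "
      f"sample variation in y is explained by x")
```

Here r^2 is about 0.817, so roughly 82% of the sample variation in y is attributed to the straight-line relationship with x.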


 

 



 

Explained vs. Unexplained Variation

 

DATA = FIT + RESIDUAL

 

SST=SSM+SSE

 

Explained vs. Unexplained Variation

 

•DATA: total variation in outcome variable y; measured by the total sum of squares

•FIT: variation in outcome variable y attributed to the explanatory variable x (i.e. the model); measured by the model sum of squares.

•RESIDUAL: variation in outcome variable y attributed to the estimated errors; measured by the residual (or error) sum of squares.


 

Explained vs. Unexplained Variation

 

•Sum of Squares Total (SST): each observed y minus the mean of y; square the values; sum the squared values.

•Sum of Squares for Model (SSM): each predicted y (yhat) minus the mean of y; square the values; sum the squared values.

•Sum of Squares for Errors (SSE): each observed y minus its predicted y (yhat); square the values; sum the squared values.


•Next step: compute the variance for each component by dividing its sum of squares by its degrees of freedom; the result is its mean square:

 

 

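Mean Square for Total = SST / (n - 1)

Mean Square for Model = SSM / 1

Mean Square for Errors (Residuals) = SSE / (n - 2)

•Note: in simple regression the sums of squares add up (SST = SSM + SSE) & so do the degrees of freedom (n - 1 = 1 + (n - 2)), but the mean squares themselves do not add up.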

 

•s: Root Mean Square Error, i.e. the square root of the Mean Square for Errors (se of yhat)
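The decomposition can be traced step by step on a small example. This sketch uses invented data and the degrees of freedom for simple regression (1 for the model, n - 2 for errors, n - 1 for the total); the variable names are illustrative.

```python
import math

# Illustrative data (hypothetical, not from the lecture)
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# DATA = FIT + RESIDUAL, measured by sums of squares
sst = sum((yi - y_bar) ** 2 for yi in y)                 # observed y vs. mean of y
ssm = sum((yh - y_bar) ** 2 for yh in y_hat)             # predicted y vs. mean of y
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # observed y vs. predicted y

# Mean square = sum of squares / degrees of freedom
ms_model = ssm / 1
ms_error = sse / (n - 2)
ms_total = sst / (n - 1)

# s: square root of the Mean Square for Errors
root_mse = math.sqrt(ms_error)

print(f"SST = {sst:.2f}, SSM = {ssm:.2f}, SSE = {sse:.2f}")
print(f"MSM = {ms_model:.3f}, MSE = {ms_error:.3f}, s = {root_mse:.3f}")
```

For these numbers SST = 6.0 splits into SSM = 4.9 and SSE = 1.1, and s is about 0.61.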


 

ANOVA and F statistic

 

•The ANOVA table is the part of the regression output that displays the sums of squares & mean squares for model, residual (error) & total.

 

•How do we compute F & r^2 from the ANOVA table?

 

•F = Mean Square Model / Mean Square Residual

•r^2 = Sum of Squares Model / Sum of Squares Total
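As a rough sketch, both quantities can be computed from just three entries of an ANOVA table: the model sum of squares, the residual sum of squares and the sample size. The function name anova_summary and the input values are hypothetical, and the degrees of freedom assume simple regression with one explanatory variable.

```python
def anova_summary(ss_model, ss_residual, n):
    """Return (F, r_squared) for a simple regression ANOVA table.

    Assumes one explanatory variable: model df = 1, residual df = n - 2.
    """
    ss_total = ss_model + ss_residual
    ms_model = ss_model / 1
    ms_residual = ss_residual / (n - 2)
    f_stat = ms_model / ms_residual
    r_squared = ss_model / ss_total
    return f_stat, r_squared


# Hypothetical ANOVA table entries for a sample of n = 5
f_stat, r_squared = anova_summary(ss_model=4.9, ss_residual=1.1, n=5)

print(f"F = {f_stat:.2f}, r^2 = {r_squared:.3f}")
```

With these inputs F is about 13.36 and r^2 is about 0.817. In simple regression F equals the square of the t statistic for the slope, so the F test and the two-sided t test for the slope agree.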


 

Using the regression model for estimation & prediction:

 

•Fundamental point: never make predictions beyond the range of the sampled (i.e. observed) x values.

 

•That is, while the model may provide good fit for the sampled range of values, it could give a poor fit outside the sampled x-value range.


 

•Another point in making predictions: the standard error for the estimated mean of y will be less than that for an estimated individual y observation.

 

•That is, there’s more uncertainty in predicting individual y values than mean y values.
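The difference between the two standard errors can be illustrated with the standard formulas for simple regression: the standard error for the estimated mean of y at a given x is s*sqrt(1/n + (x - xbar)^2/SSxx), while the one for an individual y adds 1 under the square root. The data, the function name prediction_standard_errors and the chosen x value are assumptions for illustration.

```python
import math


def prediction_standard_errors(x, y, x_new):
    """Return (se_mean, se_individual) at x_new for a simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar

    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    # Uncertainty about where the regression line is at x_new
    leverage = 1 / n + (x_new - x_bar) ** 2 / ss_xx
    se_mean = s * math.sqrt(leverage)
    # An individual y also carries its own random error, hence the extra 1
    se_individual = s * math.sqrt(1 + leverage)
    return se_mean, se_individual


# Illustrative data (hypothetical); x_new = 4 lies inside the sampled x range
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
se_mean, se_individual = prediction_standard_errors(x, y, x_new=4)

print(f"se for mean y: {se_mean:.3f}; se for individual y: {se_individual:.3f}")
```

Here the standard error is about 0.33 for the mean of y and about 0.69 for an individual y. Both grow as x_new moves away from the mean of x, which is one more reason to stay within the sampled x range.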