Distribution of the error term
•We’ll eventually see that the probability
distribution of e determines how well the model describes the population
relationship between outcome variable y & explanatory variable x.
•In this context, there are four basic assumptions
about the probability distribution of e that:
–minimize bias &
–make confidence intervals & hypothesis tests
valid.
Error term assumptions
1.The expected value of e over all possible samples
is 0. That is, the mean of e does not vary with the levels of x.
2.The variance of the probability distribution of e is
constant for all levels of x. That is, the variance of e does
not vary with the levels of x.
3.The correlation between the errors associated with any two
different y observations is 0. That is, the errors are uncorrelated:
the error associated with one value of y has no effect on the errors
associated with other y values.
4.The probability distribution of e is normal.
IID
•These assumptions of the regression model can be
summarized as: I.I.D.
•Independently & identically distributed errors.
Unbiased estimators
•As we’ll come to understand, the assumptions make the
estimated least squares line an unbiased estimator of the population
values of the y-intercept & the slope coefficient, and hence of the population mean of y at each value of x.
•Plus the assumptions make the standard errors of the
estimated least squares line as small as possible & unbiased, so
that confidence intervals & hypothesis tests are valid.
Variability of the error term
•How do we estimate the variability of the random
error e (which means variability of the observed values of outcome
variable y about the regression line)?
•We do so by estimating the variance of e (i.e. the
variance of the observed y values about their predicted values).
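A minimal Python sketch of this estimate, assuming the usual simple-regression estimator s² = SSE/(n - 2), where SSE is the sum of squared residuals. The data values and the function names `fit_line` and `error_variance` are hypothetical, chosen only for illustration.

```python
import math

def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def error_variance(x, y):
    """Estimated variance of e: sum of squared residuals over n - 2."""
    b0, b1 = fit_line(x, y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return sse / (len(x) - 2)

# Made-up illustrative data (an assumption, not from the course material)
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

s2 = error_variance(x, y)   # estimated error variance
s = math.sqrt(s2)           # its square root, s
print(round(s2, 4), round(s, 4))
```

The divisor is n - 2 rather than n because two parameters (intercept & slope) are estimated from the same data.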

Variance of the error term
•Why must we be concerned with the variance of e?
•Because the greater the variance of e, the
greater will be the errors in the estimates of the y-intercept & slope
coefficient.
•Thus the greater the variance
of e, the more inaccurate will be the predicted value of y for
any given value of x.
Interpretation of the standard error of the error term
•Interpretation of s (yhat’s standard
error, which is the square root of the estimated error variance):
•We expect roughly 95% of the observed y values
to fall within +/-1.96*s of their predicted values (yhat).
Assessing the usefulness of the regression model: making inferences about slope
•Depending on the selected alpha (i.e. test criterion)
& on the test’s p-value, either reject or fail to reject Ho (that the population slope is 0).

Hypothesis test’s assumptions
•Random sample
•The previously discussed four assumptions about e.
Conclusion for confidence interval
•We can say with 90% or 95% or 99% confidence that
every one-unit increase/decrease in x increases/decreases y by
+/-……units, on average.
•But remember: there are non-sampling sources of
error, too.
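The slope test & the confidence interval for the slope can be sketched in Python as below. This is a sketch under stated assumptions: the null hypothesis is taken to be that the population slope is 0, the data are made up, the function name `slope_inference` is hypothetical, and the critical value is supplied by hand from a t table (3.182 for 3 degrees of freedom at a two-sided alpha of .05) because the Python standard library has no t distribution.

```python
import math

def slope_inference(x, y, t_crit):
    """t statistic and confidence interval for the slope of a simple regression.

    t_crit is the two-sided critical t value for n - 2 degrees of freedom,
    looked up by the user for the chosen alpha.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))     # root mean square error
    se_b1 = s / math.sqrt(sxx)       # standard error of the slope
    t_stat = b1 / se_b1              # test statistic under Ho: slope = 0
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return {"b1": b1, "se_b1": se_b1, "t_stat": t_stat,
            "ci": ci, "reject_h0": abs(t_stat) > t_crit}

# Made-up illustrative data; 3.182 is the tabled t value for df = 3, alpha = .05
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
result = slope_inference(x, y, t_crit=3.182)
print(result)
```

Comparing the absolute t statistic with the critical value is equivalent to comparing the p-value with alpha; statistical software reports the p-value directly.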
Correlation
•Correlation: a linear relationship between two
quantitative variables.
•Beware of outliers & non-linearity: graph a bivariate scatterplot in order to
conclude whether conducting a correlation test makes sense or not (& thus
whether an alternative measure should be used).
•Correlation assesses the degree of bivariate cluster along a straight line: the strength of a
linear relationship.
•Regression examines the y/x slope of a
straight line: the extent to which y varies in response to changes in x.
•Regarding correlation, remember that association does
not necessarily imply causation.
•And beware of lurking variables.
Steps to estimate the correlation coefficient
•Standardize each x observation & each y
observation.
•Multiply each pair of standardized x & y observations.
•Sum the cross-products & divide the sum by
n - 1.
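The steps above can be sketched in Python as follows. The data values and the function name `correlation` are hypothetical; the standardization is assumed to use the sample standard deviation (divisor n - 1), which is what makes the final division by n - 1 give the usual correlation coefficient.

```python
import statistics

def correlation(x, y):
    """Correlation coefficient via standardized cross-products."""
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)  # sample SDs (n - 1)

    # Step 1: standardize each x & each y observation
    zx = [(xi - x_bar) / sx for xi in x]
    zy = [(yi - y_bar) / sy for yi in y]

    # Step 2: multiply each pair of standardized observations
    cross_products = [a * b for a, b in zip(zx, zy)]

    # Step 3: divide the sum of the cross-products by n - 1
    return sum(cross_products) / (n - 1)

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
r = correlation(x, y)
print(round(r, 4))
```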

Hypothesis Test for correlation coefficient

Before estimating a correlation coefficient
•Stata command: pwcorr y x, sig star(.05)
Coefficient of Determination
•r² (in simple but not multiple regression, the square of the correlation coefficient r)
•represents the proportion of the sum of squared
deviations of the y values about their mean that can be attributed to a
linear relationship between y & x.
•Interpretation: about 100(r²)%
of the sample variation in y can be attributed to the use of x to predict
y in the straight-line model.
•Higher r² signifies better fit: greater cluster along
the y/x straight line.
Explained vs. Unexplained Variation
DATA = FIT + RESIDUAL
SST=SSM+SSE
•DATA: total variation in outcome variable y;
measured by the total sum of squares
•FIT: variation in outcome variable y attributed
to the explanatory variable x (i.e. the model); measured by the model sum of squares.
•RESIDUAL: variation in outcome variable y attributed
to the estimated errors; measured by the residual (or error) sum of squares.
•Sum of Squares Total (SST): each observed y minus
the mean of y; square the values; sum the
squared values.
•Sum of Squares for Model (SSM): each predicted y minus
the mean of y; square the values; sum the
squared values.
•Sum of Squares for Errors (SSE): each observed y minus
its predicted y; square the values; sum the squared values.
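•Next step: compute the variance for each component by
dividing its sum of squares by its degrees of freedom, which gives its Mean Square:
Mean Square for Total = SST/(n - 1)
Mean Square for Model = SSM/1
Mean Square for Errors (Residuals) = SSE/(n - 2)
•Note that the sums of squares & the degrees of freedom add up
(SST = SSM + SSE; n - 1 = 1 + (n - 2)), but the mean squares do not.
•s: Root Mean Square Error (se of yhat), the square root of the Mean Square for Errors.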
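A short Python sketch of the decomposition for a simple (one-predictor) regression, assuming degrees of freedom of n - 1 for total, 1 for model & n - 2 for errors. The data and the function name `anova_components` are hypothetical. It shows that the sums of squares add up (SST = SSM + SSE) and that each mean square is its sum of squares divided by its degrees of freedom.

```python
import math

def anova_components(x, y):
    """Sums of squares, degrees of freedom & mean squares for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]

    sst = sum((yi - y_bar) ** 2 for yi in y)               # observed y vs mean of y
    ssm = sum((fi - y_bar) ** 2 for fi in y_hat)           # predicted y vs mean of y
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # observed y vs predicted y

    df_total, df_model, df_error = n - 1, 1, n - 2
    return {
        "sst": sst, "ssm": ssm, "sse": sse,
        "ms_total": sst / df_total,
        "ms_model": ssm / df_model,
        "ms_error": sse / df_error,
        "root_mse": math.sqrt(sse / df_error),  # s
    }

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
table = anova_components(x, y)
print(table)
```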
ANOVA and F statistic
•The regression output displays the sums of squares
& mean squares for model, residual (error) & total.
•How do we compute F & r² from the ANOVA
table?
•F = Mean Square Model/Mean Square Residual
•r² = Sum of Squares Model/Sum of Squares Total
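As a small illustration, the two formulas can be applied to ANOVA table entries in Python. The sums of squares & degrees of freedom below are hypothetical values, not output from any real regression, and the function name `f_and_r2` is an assumption.

```python
def f_and_r2(ss_model, ss_resid, df_model, df_resid):
    """Compute F & r-squared from the entries of an ANOVA table."""
    ms_model = ss_model / df_model      # Mean Square Model
    ms_resid = ss_resid / df_resid      # Mean Square Residual
    ss_total = ss_model + ss_resid      # Sum of Squares Total
    f_stat = ms_model / ms_resid
    r2 = ss_model / ss_total
    return f_stat, r2

# Hypothetical ANOVA entries: SS model = 4.9 (1 df), SS residual = 1.1 (3 df)
f_stat, r2 = f_and_r2(4.9, 1.1, 1, 3)
print(round(f_stat, 2), round(r2, 4))
```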
Using the regression model for estimation & prediction
•Fundamental point: never make predictions beyond
the range of the sampled (i.e. observed) x values.
•That is, while the model may provide good fit for the
sampled range of values, it could give a poor fit outside the sampled x-value
range.
•Another point in making predictions: the standard
error for the estimated mean of y will be less than that for an
estimated individual y observation.
•That is, there’s more uncertainty in predicting
individual y values than mean y values.
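Both points can be sketched in Python, assuming the standard simple-regression formulas: the standard error of the estimated mean of y at x0 is s*sqrt(1/n + (x0 - xbar)²/Sxx), and the standard error for an individual y at x0 is s*sqrt(1 + 1/n + (x0 - xbar)²/Sxx). The data and the function name `prediction_standard_errors` are hypothetical, and refusing to predict outside the sampled x range is one possible way to enforce the first point, not the only one.

```python
import math

def prediction_standard_errors(x, y, x0):
    """Predicted y at x0 with standard errors for the mean & an individual y."""
    if not min(x) <= x0 <= max(x):
        raise ValueError("x0 lies outside the sampled range of x values")

    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    y_hat = b0 + b1 * x0
    core = 1 / n + (x0 - x_bar) ** 2 / sxx
    se_mean = s * math.sqrt(core)            # estimated mean of y at x0
    se_individual = s * math.sqrt(1 + core)  # a single new y observation at x0
    return y_hat, se_mean, se_individual

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
y_hat, se_mean, se_individual = prediction_standard_errors(x, y, 4)
print(round(y_hat, 2), round(se_mean, 4), round(se_individual, 4))
```

The extra 1 under the square root is the variance of an individual error e, which is why the interval for an individual y is always wider than the interval for the mean of y.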