Mistakes to Avoid and Reporting OLS
1- Make sure that you "clean up" the data
In SPSS, you should convert all the "don't know," "not sure," or the like into missing values. The easiest way to do that is to go to the "variable view" and click on the "Values" of each variable to identify the ones that should be coded as missing.
Then click on "Missing" for each variable and type in the numbers that should be coded as missing.
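Outside SPSS, the same recoding step can be sketched in a few lines of Python (the codes 8 and 9 standing for "don't know"/"refused" are made-up examples, not universal conventions):

```python
# Survey responses where 8 = "don't know" and 9 = "refused" (made-up codes).
MISSING_CODES = {8, 9}
responses = [1, 3, 9, 2, 8, 4]

# Recode the flagged values to missing, as one would in SPSS's Missing column.
cleaned = [None if r in MISSING_CODES else r for r in responses]
print(cleaned)   # [1, 3, None, 2, None, 4]
```

Whatever the tool, the point is the same: the special codes must be flagged as missing before estimation, or they will be treated as real numeric values.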
An example of the impact of "not" cleaning up the data.
2- What is OLS regression?
The estimated regression equation is:
Y = ß0 + ß1X1 + ß2X2 + ß3D + ê
where the ßs are the OLS estimates of the Bs. OLS chooses the ßs that minimize the sum of the squared residuals:
minimize SUM ê2
The residual, ê, is the difference between the actual Y and the predicted Y and has a zero mean. In other words, OLS calculates the slope coefficients so that the sum of the squared differences between the predicted Ys and the actual Ys is as small as possible. (The residuals are squared so that negative and positive errors do not cancel each other out.)
The OLS estimates of the ßs:
Statistical computing packages such as SPSS routinely print out the estimated ßs when estimating a regression equation (i.e. ols1.txt).
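As a rough sketch of what such a package computes, here is a minimal OLS fit in Python with NumPy; all of the data values are made up for illustration:

```python
import numpy as np

# Made-up data for Y, two continuous variables, and a dummy D.
Y  = np.array([10.0, 12.0, 15.0, 18.0, 22.0, 25.0])
X1 = np.array([ 1.0,  2.0,  3.0,  4.0,  5.0,  6.0])
X2 = np.array([ 5.0,  4.0,  6.0,  3.0,  7.0,  2.0])
D  = np.array([ 0.0,  1.0,  0.0,  1.0,  0.0,  1.0])

# Design matrix with a column of ones for the intercept ß0.
X = np.column_stack([np.ones_like(Y), X1, X2, D])

# lstsq returns the coefficients that minimize SUM ê2.
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ b                 # the residuals ê
print(b)                          # estimated ß0, ß1, ß2, ß3
print(resid.mean())               # zero (up to rounding), as the text notes
```

Because the model includes an intercept, the residuals sum to zero by construction, which is the "zero mean" property mentioned above.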
3- What to do with OLS Regression?
Multiple regression has three major uses
1. A description or model of reality
Instead of an abstract model
EXPEND = f(INCOME, AGE)
where EXPEND (vacation expenditures) increases with INCOME (income in thousands) and decreases with AGE (the age of the tourist), we get a more descriptive picture of reality, such as:
EXPEND = 100 + 30 × INCOME - 10 × AGE
where we now know that for every unit that INCOME increases, EXPEND increases by $30 and for every unit that AGE increases, EXPEND decreases by $10.
2. The testing of hypotheses about theory
Given test statistics on the numbers above, we can determine if these are "statistically significant." Statistical significance indicates the confidence we can place in the quantitative regression results. For example, it is important to know whether there is a 5% or a 50% chance that the true effect of INCOME on EXPEND is zero.
3. Predictions about the future
Suppose we want to predict what will happen to EXPEND if INCOME increases by 10%. If average income is $30 (thousand), a 10% increase is 3 units, so plug the change into the model:
change in EXPEND = 30 × 3 = 90
and predict that EXPEND will increase by $90 if INCOME increases by $3,000, holding the age of the tourist constant.
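The arithmetic can be checked directly; the coefficients are the hypothetical ones from the text, and AGE = 40 is an arbitrary value held constant:

```python
# Hypothetical model from the text: EXPEND = 100 + 30*INCOME - 10*AGE,
# with INCOME measured in thousands of dollars.
def expend(income, age):
    return 100 + 30 * income - 10 * age

age = 40                       # any fixed age; it is held constant
base = expend(30, age)         # average income of $30 (thousand)
new  = expend(33, age)         # after a 10% increase: 3 more units
print(new - base)              # predicted change in EXPEND: 90
```

Note that AGE drops out of the difference entirely, which is exactly what "holding age constant" means.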
These uses do not differ from those of simple regression, but the results are less likely to be misleading because confounding variables are held constant.
4- How to evaluate the OLS model?
Evaluating the overall performance of the model
We hope that our regression model will explain the variation in the dependent variable fairly accurately. If it does, we say that "the model fits the data well." Evaluating the overall fit of the model also helps us to compare models that differ in data set, composition and number of independent variables, etc.
There are three primary statistics for evaluating overall fit:
1. R2
The coefficient of determination, R2, is the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS):
R2 = ESS/TSS = SUM([Y - ê] - µY)2 / SUM(Y - µY)2
where ESS is the summation of the squared differences between the predicted Ys (Y - ê) and the mean of Y (µY, a naive estimate of Y) and TSS is the summation of the squared differences between the actual Ys and the mean of Y.
The R2 ranges from 0 to 1 and can be interpreted as the percentage of the variance in the dependent variable that is explained by the independent variables.
2. Adjusted R2
Adding a variable to a multiple regression equation virtually guarantees that the R2 will increase (even if the variable is not very meaningful). The adjusted R2 statistic is the same as the R2 except that it takes into account the number of independent variables (k). The adjusted R2 will increase, decrease or stay the same when a variable is added to an equation, depending on whether the improvement in fit (ESS) outweighs the loss of degrees of freedom (n-k-1):
adjusted R2 = 1 - (1 - R2) × [(n - 1)/(n - k - 1)]
The adjusted R2 is most useful when comparing regression models with different numbers of independent variables.
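Both statistics can be sketched in a few lines of Python; the actual and predicted values below are made up, and k is the number of independent variables:

```python
import numpy as np

def r2_stats(y, yhat, k):
    """Return R2 and adjusted R2 for predictions yhat with k independent variables."""
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
    rss = np.sum((y - yhat) ** 2)        # residual sum of squares
    r2 = 1 - rss / tss                   # equals ESS/TSS when the model has an intercept
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

y    = np.array([3.0, 5.0, 7.0, 10.0, 11.0])   # made-up actual values
yhat = np.array([3.5, 4.5, 7.5,  9.5, 11.0])   # made-up predicted values
r2, adj = r2_stats(y, yhat, k=1)
print(r2, adj)
```

As the formula implies, the adjusted R2 is always at or below the R2, and the gap widens as more independent variables are added for a given sample size.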
3. F statistic
The F statistic is the ratio of the explained to the unexplained portions of the total sum of squares (RSS = SUM ê2), adjusted for the number of independent variables (k) and the degrees of freedom (n-k-1):
F = [ESS/k] / [RSS/(n - k - 1)]
The F statistic allows the researcher to test whether the model as a whole is statistically significantly different from zero.
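A sketch of the same calculation on made-up data (the actual and predicted values are purely illustrative):

```python
import numpy as np

y    = np.array([3.0, 5.0, 7.0, 10.0, 11.0])   # made-up actual values
yhat = np.array([3.5, 4.5, 7.5,  9.5, 11.0])   # made-up predicted values
k = 1                                           # one independent variable

n = len(y)
ess = np.sum((yhat - y.mean()) ** 2)   # explained sum of squares
rss = np.sum((y - yhat) ** 2)          # residual (unexplained) sum of squares
F = (ess / k) / (rss / (n - k - 1))
print(F)
```

A large F relative to the critical value from an F table (with k and n-k-1 degrees of freedom) leads us to reject the hypothesis that all the slope coefficients are zero.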
Statistical computing packages such as SPSS routinely print out these statistics (i.e. ols2.txt).
What is a 'good' overall fit? It depends. Cross-sectional data will often produce R2s that seem quite low; R2=.07 might be good for some types of data while for others it might be very, very bad. The adjusted R2, F-stat, and hypothesis tests of independent variables are all important determinants of model fit.
5- Conservative Criteria for Hypothesis Testing
Because most data consists of samples from the population, we worry whether our ßs actually matter when explaining variation in the dependent variable.
The null hypothesis states that X is not associated with Y, therefore the ß is equal to zero; the alternative hypothesis states that X is associated with Y, therefore the ß is not equal to zero.
The t-statistic is equal to the ß divided by the standard error of ß (s.e., a measure of the dispersion of the ß)
t = ß/s.e.
A (very) rough guide to testing hypotheses might be: "t-statistics above 2 are good." Also check your t-tables and significance (confidence) levels.
Statistical computing packages such as SPSS routinely print out the standard errors, t-stats, and significance levels (the probability of obtaining a ß this far from zero if the true ß were zero) when estimating a regression equation (i.e. ols3.txt).
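A minimal Python sketch of where those numbers come from, using the standard OLS formula for the standard errors (the data are made up):

```python
import numpy as np

# Made-up data: Y is roughly 2*X1 plus a little noise.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
Y  = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])
X  = np.column_stack([np.ones_like(X1), X1])   # constant + X1

b, *_ = np.linalg.lstsq(X, Y, rcond=None)
n, p = X.shape                       # p = number of estimated coefficients
resid = Y - X @ b
s2 = resid @ resid / (n - p)         # estimate of the error variance
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
t = b / se                           # t = ß / s.e.
print(t)
```

With data this clean the slope's t-statistic is far above 2, so by the rough guide above we would comfortably reject the null that the slope is zero.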
6- Some problems and solutions
How do you choose which variables to include in your model?
Specification searches are sometimes called "data mining" (see specbias.txt for an example).
7- Violation of Assumptions
There are several assumptions which must be met for the OLS estimates to be unbiased and have minimum variance. Some of these assumptions are easily violated with certain types of data; others are much less likely to be violated, so we'll leave their discussion to the extensions section.
8- Extensions of the Model
Interaction terms are combinations of independent variables. For example, if you think that men earn more per year of work experience than women do, then include an interaction term [multiply the male dummy (D) by the independent variable] along with the experience variable and male dummy.
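A small made-up example showing how the interaction column is built, and that OLS then recovers a separate return to experience for men:

```python
import numpy as np

# Made-up data: EXP = years of experience, D = male dummy.
EXP = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
D   = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
# Invented "true" wages: men earn an extra 3 per year of experience.
WAGE = 10 + 2 * EXP + 1 * D + 3 * D * EXP

# The interaction term is simply the product of the dummy and the variable.
X = np.column_stack([np.ones_like(EXP), EXP, D, D * EXP])
b, *_ = np.linalg.lstsq(X, WAGE, rcond=None)
print(np.round(b, 6))   # recovers [10, 2, 1, 3]
```

The coefficient on EXP is the return to experience for women; the return for men is the EXP coefficient plus the interaction coefficient (2 + 3 here).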
There are many important research topics for which the dependent variable is qualitative. Researchers often want to predict whether something will happen or not, such as referendum votes, business failure, disease--anything that can be expressed as Event/Nonevent or Yes/No. Logistic regression is a type of regression analysis where the dependent variable is dichotomous and coded 0, 1.
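Logistic regression is fitted by maximum likelihood rather than OLS; the following is a tiny Newton-Raphson sketch on made-up 0/1 data, not SPSS's implementation:

```python
import numpy as np

# Made-up 0/1 outcomes that become likelier as x grows.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

b = np.zeros(2)
for _ in range(25):                       # Newton-Raphson iterations
    p = 1 / (1 + np.exp(-X @ b))          # predicted probability of the event
    W = p * (1 - p)
    grad = X.T @ (y - p)                  # gradient of the log-likelihood
    H = X.T @ (X * W[:, None])            # X'WX, the information matrix
    b = b + np.linalg.solve(H, grad)
print(b)   # positive slope: higher x raises the odds of the event
```

Unlike OLS, the fitted values here are probabilities between 0 and 1, which is what makes the method suitable for Event/Nonevent outcomes.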
Two-stage least squares is one solution to endogeneity bias, which arises when an independent variable is correlated with the error term (e.g., in a model of the number of years of education chosen, the number of children might be chosen at the same time). Two-stage least squares first predicts the number of children (based on other independent variables) and then uses the prediction as the independent variable in the education model.
If you have yearly, quarterly, or monthly data then the ordering of the observations matters (with cross-section data it doesn't matter if Sally comes before or after Jane). For example, regression models of monthly car sales might include monthly and lagged monthly advertisements. Some standard time-series models should be used to account for the correlation of the lagged advertisements.
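Constructing a lagged advertising variable amounts to shifting the series by one period; a made-up sketch:

```python
# Made-up monthly advertising series; the lag shifts it by one month.
ads = [12, 15, 11, 18, 20, 17]
ads_lag1 = [None] + ads[:-1]           # last month's ads, row-aligned
rows = list(zip(ads, ads_lag1))
print(rows)   # the first month has no lag and would be dropped before fitting
```

This alignment is exactly why the ordering of observations matters with time-series data: shuffling the rows would destroy the lag relationship.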