Methodology

Introduction

Regression models assume that a linear relationship exists between one variable, called the dependent variable, and one or more other variables, called independent variables. The dependent variable is the variable to be forecast and the independent variables are the explanatory variables. The first model we will examine assumes that there is only one explanatory, or independent, variable. This type of model will be referred to as a simple regression model. We will then examine regression models with more than one independent variable. Such models are called multiple regression models.

Simple Regression Model

As stated above, a simple regression model consists of one dependent and one independent variable which are assumed to be linearly related. A linear, or straight-line, relationship has the equation form:

Y = a0 + a1X

In our terminology, Y represents the dependent variable, or variable to be forecast, and X is the independent variable, or explanatory variable. Parameters a0 and a1 are used to define the specific relationship between X and Y.

In order to test the hypothesis of a linear relationship between the independent and dependent variables, a scatterplot can be constructed. If the points, representing pairs of independent and dependent values (X, Y), do not fall approximately on a straight line, the linear hypothesis is not appropriate. However, certain steps can be taken to transform a non-linear relationship into a linear one.
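As an illustration, such a scatterplot can be produced in Python with matplotlib; the (X, Y) values below are hypothetical and stand in for observed data. This is a minimal sketch of the visual check, not a prescribed procedure.

# Sketch: visual check of the linear hypothesis with a scatterplot.
# The data are hypothetical; any paired observations could be substituted.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

plt.scatter(x, y)
plt.xlabel("X (independent variable)")
plt.ylabel("Y (dependent variable)")
plt.title("Scatterplot for checking the linear hypothesis")
plt.show()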

Once the validity of the linear relationship has been confirmed, the next step is to identify the precise form of the relationship. This is done by specifying values for parameters a0 and a1, which in turn identify a specific straight line. It should be obvious that any number of straight lines can be hypothetically selected to describe the relationship between X and Y. We need to establish a means by which one straight line can be judged better than all others. Although many criteria exist, the most widely used is to select the straight line that minimises the MSE (mean squared error). An error, in this case, is defined as the distance between an observed value of the dependent variable and the value forecast by the straight line. We wish to choose parameters a0 and a1 for our straight line so as to minimise the average squared error. Based on this criterion, it is not difficult to see why this method is known as the method of least squares. Mathematically, we wish to minimise:

Σ(Yi - a0 - a1Xi)²

with respect to a0 and a1. This can be done by taking derivatives and setting them equal to zero, which results in the so-called normal equations:

ΣYi = n·a0 + a1·ΣXi

ΣXiYi = a0·ΣXi + a1·ΣXi²

which can be solved simultaneously to obtain:


a1 = (ΣXiYi – n·X̄·Ȳ) / (ΣXi² – n·X̄²) = COV(X, Y) / VAR(X)

a0 = Ȳ – a1·X̄

where Ȳ and X̄ are the mean values of Y and X, respectively.
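The following sketch applies these formulas directly to hypothetical data: a1 is computed as COV(X, Y)/VAR(X) and a0 from the two means.

# Sketch: least-squares estimates for the simple regression Y = a0 + a1*X.
# Hypothetical data; any paired observations could be substituted.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

x_bar, y_bar = x.mean(), y.mean()
n = len(x)

# a1 = (sum(Xi*Yi) - n*Xbar*Ybar) / (sum(Xi^2) - n*Xbar^2) = COV(X, Y) / VAR(X)
a1 = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x**2) - n * x_bar**2)
a0 = y_bar - a1 * x_bar

print(f"a0 = {a0:.3f}, a1 = {a1:.3f}")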

Multiple Regression Model

Previously, we have examined situations in which the dependent variable was assumed to be predictable in terms of a single independent variable. Frequently, one independent variable is not enough, and several independent variables are found to influence the dependent variable. This section will extend our analysis to include regression models with more than one independent variable, or multiple regression models.

If Y is the variable to be forecast and X1, X2, X3,…Xn are the independent or explanatory variables, the multiple regression forecasting model assumes the relationship between these variables is of the general form:

Y = a0 + a1X1 + a2X2 + a3X3 + … + anXn

where a0, a1, a2, a3,…an are the model parameters.

When there are two independent variables, the form is Y = a0 + a1X1 + a2X2. Three parameters have to be specified: a0, a1 and a2. The method of least squares again yields a set of normal equations whose simultaneous solution gives the optimal (minimum-MSE) parameters. The equations to be solved, in terms of our model, are:

ΣY = n·a0 + a1·ΣX1 + a2·ΣX2

ΣX1Y = a0·ΣX1 + a1·ΣX1² + a2·ΣX1X2

ΣX2Y = a0·ΣX2 + a1·ΣX1X2 + a2·ΣX2²
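As a numerical sketch, these three equations can be written in matrix form and solved simultaneously; the X1, X2 and Y values below are hypothetical.

# Sketch: solving the normal equations for Y = a0 + a1*X1 + a2*X2.
# X1, X2 and Y are hypothetical observations.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([5.1, 5.9, 10.2, 10.8, 15.9, 16.1])
n = len(y)

# Left-hand-side coefficients and right-hand-side sums of the three normal equations.
A = np.array([
    [n,         x1.sum(),       x2.sum()],
    [x1.sum(),  (x1**2).sum(),  (x1 * x2).sum()],
    [x2.sum(),  (x1 * x2).sum(), (x2**2).sum()],
])
b = np.array([y.sum(), (x1 * y).sum(), (x2 * y).sum()])

a0, a1, a2 = np.linalg.solve(A, b)
print(f"a0 = {a0:.3f}, a1 = {a1:.3f}, a2 = {a2:.3f}")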

Strength of Relationship

Coefficient of Determination R². The coefficient of determination represents the ratio of explained variation to total variation of the dependent variable, or the proportion of total variation that has been explained by the model. Mathematically:

R² = Σ(Yi* - Ȳ)² / Σ(Yi - Ȳ)²

where Yi* is the forecast value of the dependent variable predicted by the linear regression equation, Yi is the observed value of the dependent variable corresponding to the independent variable Xi, and Ȳ is the observed mean of the dependent variable.

R² assumes values between 0 and 1. For example, if R² = 0.90, this indicates that the linear regression model has explained 90% of the variation in the data, leaving 10% unexplained.
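A minimal sketch of the calculation, assuming a straight line fitted to the hypothetical data used earlier, is:

# Sketch: coefficient of determination R^2 = explained variation / total variation.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

# Fit the straight line (np.polyfit returns [a1, a0] for degree 1).
a1, a0 = np.polyfit(x, y, 1)
y_star = a0 + a1 * x          # forecast values Yi*

r_squared = np.sum((y_star - y.mean())**2) / np.sum((y - y.mean())**2)
print(f"R^2 = {r_squared:.4f}")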

Adjusted Coefficient of Determination Ra². Adjusted R² is an adjustment for the fact that, when one has a large number of independent variables, it is possible for R² to become artificially high simply because some independent variables’ chance variations “explain” small parts of the variance of the dependent variable. At the extreme, when there are as many independent variables as observations, R² will always be 1. The adjustment lowers R² as the number of independent variables increases. Some authors conceive of adjusted R² as the percentage of variance “explained in a replication, after subtracting out the contribution of chance”. When there are few independent variables, R² and adjusted R² will be close. By contrast, when there are many independent variables, adjusted R² may be noticeably lower. Mathematically:

Ra² = 1 – (1 - R²)(n – 1)/(n - k - 1)

where n is the number of observations and k is the number of independent variables.
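A small sketch of the adjustment, with purely illustrative values for R², n and k:

# Sketch: adjusted R^2 for n observations and k independent variables.
def adjusted_r_squared(r_squared, n, k):
    """Ra^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(adjusted_r_squared(r_squared=0.90, n=25, k=3))  # hypothetical values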

Correlation Coefficient R. The correlation coefficient is the square root of the coefficient of determination. The value of this coefficient must always fall between -1 and +1. If the two variables are perfectly correlated, the correlation coefficient will be either -1 or +1, the sign determined by whether they move in the same direction (+1) or in opposite directions (-1). Values of R near zero indicate that there is little or no correlation between the variables.

Statistical Significance of Correlation

Regression analysis is a statistical approach in which a sample is used to estimate the true relationship among variables. The smaller the sample size, the greater the likelihood that the measured correlation is due to sampling error rather than a true relationship.

The F-test can be used to distinguish between statistically significant correlations and those due to sampling error. It consists of calculating an F statistic based on the coefficient of determination and the sample size, and comparing this value with a critical value. The F statistic is calculated from:

F = [R²/(k - 1)] / [(1 - R²)/(n - k)]

where R² is the coefficient of determination, k is the number of variables, and n is the number of observations or sample points.
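As a sketch, the F statistic and its critical value (here at the 0.05 significance level, using scipy; the figures are illustrative) can be computed as follows:

# Sketch: F-test for the overall significance of the regression.
from scipy.stats import f

r_squared = 0.90   # hypothetical coefficient of determination
k = 3              # number of variables (dependent plus independent)
n = 25             # number of observations

F = (r_squared / (k - 1)) / ((1 - r_squared) / (n - k))
F_critical = f.ppf(0.95, dfn=k - 1, dfd=n - k)   # critical value at the 0.05 level

print(f"F = {F:.2f}, critical value = {F_critical:.2f}")
print("significant" if F > F_critical else "not significant")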

t-Tests

t-tests are used to assess the significance of individual ai coefficients; specifically, they test the null hypothesis that a regression coefficient is zero. A common rule of thumb is to drop from the equation all variables not significant at the 0.05 level or better.
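A sketch using statsmodels, which reports a t statistic and p-value for each coefficient, is shown below; the data are the same hypothetical X1, X2 and Y used earlier.

# Sketch: t-tests on individual regression coefficients with statsmodels.
import numpy as np
import statsmodels.api as sm

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([5.1, 5.9, 10.2, 10.8, 15.9, 16.1])

X = sm.add_constant(np.column_stack([x1, x2]))   # column of ones for the intercept a0
results = sm.OLS(y, X).fit()

print(results.params)    # a0, a1, a2
print(results.pvalues)   # drop variables whose p-value exceeds 0.05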

Linear Regression Assumptions and Limitations

Linearity. We emphasised earlier the importance of checking for a linear relationship between dependent and independent variables by plotting the data. An attempt to fit a linear model to a relationship which is non-linear or does not exist will yield a low and insignificant coefficient of determination. An observed non-linear relationship can be handled by transforming one or more of the variables, such as using the logarithms or square roots of the observations.

Autocorrelation. Current values should not be correlated with previous values in a data series. This is often a problem with time series data, where many variables tend to increase over time such that knowing the value of the current observation helps one estimate the value of the previous observation. Spatial autocorrelation can also be a problem when the units of analysis are geographic units and knowing the value for a given area helps one estimate the value of an adjacent area. That is, each observation should be independent of every other observation if the error terms are not to be correlated, which would in turn lead to biased estimates of standard deviations and significance.

The Durbin-Watson coefficient, DW, tests for autocorrelation. DW is given by the formula:

DW = Σ(ei – ei-1)² / Σei²

The value of DW ranges from 0 to 4. Values close to 0 indicate extreme positive autocorrelation, values close to 4 indicate extreme negative autocorrelation, and values close to 2 indicate no serial autocorrelation. As a rule of thumb, DW should be between 1.5 and 2.5 to indicate independence of observations.

For a given number of observations and a given level of confidence, the Durbin-Watson table gives two values, DL and DU:

- if DW < DL, there is positive autocorrelation;

- if DW > 4 - DL, there is negative autocorrelation;

- if DU < DW < 4 - DU, there is no autocorrelation.
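A sketch of the calculation from a hypothetical series of residuals:

# Sketch: Durbin-Watson statistic DW = sum((e_i - e_{i-1})^2) / sum(e_i^2).
import numpy as np

residuals = np.array([0.3, 0.1, -0.2, 0.2, -0.3, -0.1, 0.4, -0.2])  # hypothetical

dw = np.sum(np.diff(residuals)**2) / np.sum(residuals**2)
print(f"DW = {dw:.2f}")   # compare with the 1.5-2.5 rule of thumb and the DL/DU bounds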

For a graphical test of serial independence, a plot of the residuals on the Y axis against the sequence of cases on the X axis should show no pattern, indicating independence of errors. When patterns appear in the errors, the model being used has not taken full advantage of the explanatory potential in the data. This can be corrected by adding other variables to the model or by transforming one or more existing ones.

Normally Distributed Residuals. Regression assumes that the residuals, or errors, are normally distributed with a mean of zero. Although this assumption is not normally one to worry about, it is important because it underlies the various tests of significance and the determination of confidence limits. As a rough guide, if the number of observations is 30 or more, we can assume that the residuals are normally distributed. A spot check of the residuals can also be made to ensure that they cluster near zero, with large values occurring only infrequently. The best way to correct for non-normality is to increase the size of the sample. The existence of a non-zero mean for the residuals indicates bias and suggests that the forecasting model or its parameters need to be re-examined.
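A rough sketch of such a spot check is given below; the Shapiro-Wilk test from scipy is used here as one possible choice, not one prescribed above, and the residuals are hypothetical.

# Sketch: spot check that residuals cluster near zero and look roughly normal.
import numpy as np
from scipy.stats import shapiro

residuals = np.array([0.3, 0.1, -0.2, 0.2, -0.3, -0.1, 0.4, -0.2])  # hypothetical

print(f"mean of residuals = {residuals.mean():.3f}")   # should be close to zero
stat, p_value = shapiro(residuals)                     # Shapiro-Wilk normality test
print(f"Shapiro-Wilk p-value = {p_value:.3f}")         # a small p-value suggests non-normality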

Multicollinearity. Multicollinearity exists when two or more independent variables are themselves highly correlated. When this occurs, the apparent significance and accuracy of the results can be affected. The simple correlation coefficients should be examined to determine whether any independent variables are substantially correlated with each other. If they are, one or more of these should be eliminated and the model re-examined. As a general rule, one should eliminate one of a pair of independent variables whose simple correlation is 0.7 or greater.
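A sketch of this check, using three hypothetical independent variables of which the third nearly duplicates the first:

# Sketch: pairwise correlations between independent variables.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([5.0, 2.0, 6.0, 1.0, 4.0, 3.0])
x3 = np.array([1.1, 2.2, 2.9, 4.1, 5.2, 5.9])   # nearly a copy of x1

corr = np.corrcoef([x1, x2, x3])
print(np.round(corr, 2))   # off-diagonal values of 0.7 or more flag multicollinearity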

Homoscedasticity. When the forecast errors have a constant variance, this condition is referred to as homoscedasticity. When this is not true, we say the data are heteroscedastic. If heteroscedasticity exists, it is highly likely that we will obtain a low and insignificant coefficient of determination. Non-constant error variance can be observed by examining a plot of the data. A homoscedastic model will display a random cloud of dots, whereas a lack of homoscedasticity will be characterised by a pattern such as a funnel shape, indicating greater error as the dependent variable increases. Heteroscedasticity can sometimes be eliminated by introducing other independent variables which are assumed to be the cause of the heteroscedasticity.
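As a sketch, a plot of residuals against forecast values makes such a funnel shape visible; the values below are invented purely to exhibit it.

# Sketch: visual check for constant error variance (homoscedasticity).
import numpy as np
import matplotlib.pyplot as plt

fitted = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0])      # hypothetical forecasts
residuals = np.array([0.1, -0.2, 0.3, -0.5, 0.7, -0.9, 1.2, -1.5])   # funnel-shaped errors

plt.scatter(fitted, residuals)
plt.axhline(0.0)
plt.xlabel("Forecast value")
plt.ylabel("Residual")
plt.title("Residuals vs forecasts: a funnel shape suggests heteroscedasticity")
plt.show()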