# Regression Analysis: Basic Concepts


Allin Cottrell∗

**1 The simple linear model**

This model represents the dependent variable, *y_i*, as a linear function of one independent variable, *x_i*, subject to a random 'disturbance' or 'error', *u_i*:

    y_i = β0 + β1 x_i + u_i

The error term *u_i* is assumed to have a mean value of zero, a constant variance, and to be uncorrelated with itself across observations (E(u_i u_j) = 0 for i ≠ j). We may summarize these conditions by saying that *u_i* is 'white noise'.

The task of estimation is to determine regression coefficients β̂0 and β̂1, estimates of the unknown parameters β0 and β1 respectively. The estimated equation will have the form

    ŷ_i = β̂0 + β̂1 x_i

We define the estimated error or *residual* associated with each pair of data values as

    û_i = y_i − ŷ_i = y_i − (β̂0 + β̂1 x_i)

In a scatter diagram of *y* against *x*, this is the vertical distance between the observed *y_i* value and the 'fitted value', ŷ_i, as shown in Figure 1.

[Figure 1: Regression residual. The fitted line ŷ = β̂0 + β̂1 x is drawn through the data; the residual û_i is the vertical gap between the observed y_i and the fitted ŷ_i at x_i.]

Note that we are using a different symbol for this *estimated* error (û_i) as opposed to the 'true' disturbance or error term defined above (u_i). These two will coincide only if β̂0 and β̂1 happen to be exact estimates of the regression parameters β0 and β1.

The basic technique for determining the coefficients β̂0 and β̂1 is Ordinary Least Squares (OLS): values for β̂0 and β̂1 are chosen so as to minimize the sum of the squared residuals (SSR). The SSR may be written as

∗Last revised 2003/02/03.


    SSR = Σ û_i² = Σ (y_i − ŷ_i)² = Σ (y_i − β̂0 − β̂1 x_i)²

(It should be understood throughout that Σ denotes the summation Σ_{i=1}^{n}, where *n* denotes the number of observations in the sample.) The minimization of SSR is a calculus exercise: we need to find the partial derivatives of SSR with respect to both β̂0 and β̂1 and set them equal to zero. This generates two equations (the 'normal equations' of least squares) in the two unknowns, β̂0 and β̂1. These equations are then solved jointly to yield the estimated coefficients.

We start out from:

    ∂SSR/∂β̂0 = −2 Σ (y_i − β̂0 − β̂1 x_i) = 0        (1)

    ∂SSR/∂β̂1 = −2 Σ x_i (y_i − β̂0 − β̂1 x_i) = 0     (2)

Equation (1) implies that

    Σ y_i − n β̂0 − β̂1 Σ x_i = 0   ⇒   β̂0 = ȳ − β̂1 x̄        (3)

while equation (2) implies that

    Σ x_i y_i − β̂0 Σ x_i − β̂1 Σ x_i² = 0        (4)

We can now substitute for β̂0 in equation (4), using (3). This yields

    Σ x_i y_i − (ȳ − β̂1 x̄) Σ x_i − β̂1 Σ x_i² = 0

    ⇒   Σ x_i y_i − ȳ Σ x_i − β̂1 (Σ x_i² − x̄ Σ x_i) = 0

    ⇒   β̂1 = (Σ x_i y_i − ȳ Σ x_i) / (Σ x_i² − x̄ Σ x_i)        (5)

Equations (3) and (5) can now be used to generate the regression coefficients: first use (5) to find β̂1, then use (3) to find β̂0.
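The closed-form solutions (3) and (5) are straightforward to compute directly. As an illustrative sketch (my addition, not part of the original notes), here is plain Python that applies (5) and then (3) to the data appearing in Table 1 below; it should recover, up to rounding, the coefficients β̂0 = 52.3509 and β̂1 = 0.1388 reported there.

```python
# OLS slope and intercept via the normal-equation solutions (5) and (3).
# Data as in Table 1 of these notes.
x = [1065, 1254, 1300, 1577, 1600, 1750, 1800, 1870,
     1935, 1948, 2254, 2600, 2800, 3000]
y = [199.9, 228.0, 235.0, 285.0, 239.0, 293.0, 285.0, 365.0,
     295.0, 290.0, 385.0, 505.0, 425.0, 415.0]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Equation (5): slope estimate
b1 = (sum(xi * yi for xi, yi in zip(x, y)) - ybar * sum(x)) / \
     (sum(xi ** 2 for xi in x) - xbar * sum(x))
# Equation (3): intercept estimate
b0 = ybar - b1 * xbar

print(round(b0, 4), round(b1, 4))
```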

**2 Goodness of fit**

The OLS technique ensures that we find the values of β̂0 and β̂1 which 'fit the sample data best', in the specific sense of minimizing the sum of squared residuals. There is no guarantee, however, that β̂0 and β̂1 correspond exactly with the unknown parameters β0 and β1. Neither, in fact, is there any guarantee that the 'best fitting' line fits the data well: maybe the data do not even approximately lie along a straight line relationship. So how do we assess the adequacy of the 'fitted' equation?

• First step: find the residuals. For each *x*-value in the sample, compute the fitted value or predicted value of *y*, using ŷ_i = β̂0 + β̂1 x_i.

• Then subtract each fitted value from the corresponding actual, observed, value of *y_i*. Squaring and summing these differences gives the SSR, as shown in Table 1.

Now obviously, the magnitude of the SSR will depend in part on the number of data points in the sample (other things equal, the more data points, the bigger the sum of squared residuals). To allow for this we can divide through by the 'degrees of freedom', which is the number of data points minus the number of parameters to be estimated (2 in the case of a simple regression with an intercept term). Let *n* denote the number of data points (or 'sample size'); then the degrees of freedom, d.f. = *n* − 2. The square root of the resulting expression is called the estimated *standard error* of the regression (σ̂):

    σ̂ = √( SSR / (n − 2) )


Table 1: Example of finding residuals

Given β̂0 = 52.3509; β̂1 = 0.1388

| data (x_i) | data (y_i) | fitted (ŷ_i) | û_i = y_i − ŷ_i | û_i² |
| --- | --- | --- | --- | --- |
| 1065 | 199.9 | 200.1 | −0.2 | 0.04 |
| 1254 | 228.0 | 226.3 | 1.7 | 2.89 |
| 1300 | 235.0 | 232.7 | 2.3 | 5.29 |
| 1577 | 285.0 | 271.2 | 13.8 | 190.44 |
| 1600 | 239.0 | 274.4 | −35.4 | 1253.16 |
| 1750 | 293.0 | 295.2 | −2.2 | 4.84 |
| 1800 | 285.0 | 302.1 | −17.1 | 292.41 |
| 1870 | 365.0 | 311.8 | 53.2 | 2830.24 |
| 1935 | 295.0 | 320.8 | −25.8 | 665.64 |
| 1948 | 290.0 | 322.6 | −32.6 | 1062.76 |
| 2254 | 385.0 | 365.1 | 19.9 | 396.01 |
| 2600 | 505.0 | 413.1 | 91.9 | 8445.61 |
| 2800 | 425.0 | 440.9 | −15.9 | 252.81 |
| 3000 | 415.0 | 468.6 | −53.6 | 2872.96 |
|  |  |  | Σ = 0 | Σ = 18273.6 = SSR |
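As a check on the arithmetic in Table 1, here is a short Python sketch (my addition, not part of the original notes) that recomputes the OLS coefficients from the data, then the fitted values, residuals, SSR and the regression standard error σ̂.

```python
import math

# Data from Table 1
x = [1065, 1254, 1300, 1577, 1600, 1750, 1800, 1870,
     1935, 1948, 2254, 2600, 2800, 3000]
y = [199.9, 228.0, 235.0, 285.0, 239.0, 293.0, 285.0, 365.0,
     295.0, 290.0, 385.0, 505.0, 425.0, 415.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# OLS coefficients via equations (5) and (3)
b1 = (sum(xi * yi for xi, yi in zip(x, y)) - ybar * sum(x)) / \
     (sum(xi ** 2 for xi in x) - xbar * sum(x))
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

ssr = sum(u ** 2 for u in resid)       # sum of squared residuals
sigma_hat = math.sqrt(ssr / (n - 2))   # regression standard error, d.f. = n - 2

print(round(ssr, 1), round(sigma_hat, 2))
```

Note that the residuals sum to zero, as the Σ = 0 line in Table 1 indicates; this is a consequence of normal equation (1).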

The standard error gives us a first handle on how well the fitted equation fits the sample data. But what is a 'big' σ̂ and what is a 'small' one depends on the context. The standard error is sensitive to the units of measurement of the dependent variable.

A more standardized statistic, which also gives a measure of the 'goodness of fit' of the estimated equation, is *R*². This statistic is calculated as follows:

    R² = 1 − SSR / Σ (y_i − ȳ)²  ≡  1 − SSR / SST

Note that SSR can be thought of as the 'unexplained' variation in the dependent variable: the variation 'left over' once the predictions of the regression equation are taken into account. The expression Σ (y_i − ȳ)², on the other hand, represents the *total variation* (total sum of squares or SST) of the dependent variable around its mean value. So *R*² can be written as 1 minus the proportion of the variation in *y_i* that is 'unexplained'; or in other words it shows *the proportion of the variation in y_i that is accounted for by the estimated equation*. As such, it must be bounded by 0 and 1:

    0 ≤ R² ≤ 1

*R*² = 1 is a 'perfect score', obtained only if the data points happen to lie exactly along a straight line; *R*² = 0 is a perfectly lousy score, indicating that *x_i* is absolutely useless as a predictor for *y_i*.

When you add an additional variable to a regression equation, there is no way it can raise the SSR, and in fact it's likely to lower the SSR somewhat even if the added variable is not very relevant. And lowering the SSR means raising the *R*² value. One might therefore be tempted to add too many extraneous variables to a regression if one were focussed on achieving the maximum *R*². An alternative calculation, the adjusted R-squared or R̄², attaches a small penalty to adding more variables: thus if adding an additional variable raises the R̄² for a regression, that's a better indication that it has 'improved' the model than if it merely raises the plain, unadjusted *R*². The formula is:

    R̄² = 1 − [SSR / (n − k − 1)] / [SST / (n − 1)] = 1 − (1 − R²) (n − 1)/(n − k − 1)

where *k* + 1 represents the number of parameters being estimated (2 in a simple regression).

*To summarize so far*: alongside the estimated regression coefficients β̂0 and β̂1, we should also examine the sum of squared residuals (SSR), the regression standard error (σ̂) and the *R*² value (adjusted or unadjusted), in order to judge whether the best-fitting line does in fact fit the data to an adequate degree.


**3 Confidence intervals for regression coefficients**

As stated above, even if the OLS math is performed correctly there is no guarantee that the coefficients β̂0 and β̂1 thus obtained correspond exactly with the underlying parameters β0 and β1. Actually, such an exact correspondence is highly unlikely. The statistical issue here is a very general one: *estimation is inevitably subject to sampling error*.

As we have seen, a *confidence interval* provides a means of quantifying the uncertainty produced by sampling error. Instead of simply stating 'I found a sample mean income of $39,000 and that is my best guess at the population mean, although I know it is probably wrong', we can make a statement like: 'I found a sample mean of $39,000, and there is a 95 percent probability that my estimate is off the true parameter value by no more than $1200.'

Confidence intervals for regression coefficients can be constructed in a similar manner. Suppose we come up with a slope estimate of β̂1 = .90, using the OLS technique, and we want to quantify our uncertainty over the true slope parameter, β1, by drawing up a 95 percent confidence interval for this parameter.

Provided our sample size is reasonably large, the rule of thumb is the same as before; the 95 percent confidence interval for β1 is given by:

    β̂1 ± 2 standard errors

Our single best guess at β1 ('point estimate') is simply β̂1, since the OLS technique yields unbiased estimates of the parameters (actually, this is not *always* true, but we'll postpone consideration of tricky cases where OLS estimates are biased). And on exactly the same grounds as before, there is a 95 percent chance that our estimate β̂1 will lie within 2 standard errors of its mean value, β1. But how do we find the standard error of β̂1? I shall not derive this rigorously, but give the formula along with an intuitive explanation. The standard error of β̂1 (written as se(β̂1), and not to be confused with the standard error of the regression, σ̂) is given by the formula:

    se(β̂1) = √( σ̂² / Σ (x_i − x̄)² )

i.e., it is the square root of [the square of the regression standard error divided by the total variation of the independent variable, *x_i*, around its mean].

What are the various components of the calculation doing? First, note the general point that the larger is se(β̂1), the wider will be the confidence interval for any specified confidence level. Now, according to the formula, the larger is σ̂, the larger will be se(β̂1), and hence the wider the confidence interval for the true slope. This makes sense: σ̂ provides a measure of the 'degree of fit' of the estimated equation, as discussed above. If the equation fits the data badly ('large' σ̂), it stands to reason that we should have a relatively high degree of uncertainty over the true slope parameter.

Secondly, the formula tells us that, other things equal, a high degree of variation of *x_i* makes for a smaller se(β̂1), and so a tighter confidence interval. Why should this be? The more *x_i* has varied in our data sample, the better the chance we have of picking up any relationship that exists between *x* and *y*. Take an extreme case and this is rather obvious: suppose that *x* happens not to have varied at all in our sample (i.e., Σ (x_i − x̄)² = 0). In that case we have no chance of detecting any influence of *x* on *y*. And the more the independent variable has moved, the more any influence it may have on the dependent variable should stand out against the background 'noise', *u_i*.

**4 Example of confidence interval for the slope parameter**

One example. Suppose we're interested in whether a positive linear relationship exists between *x_i* and *y_i*. We've obtained β̂1 = .90 and se(β̂1) = .12. The approximate 95 percent confidence interval for β1 is then

    .90 ± 2(.12) = .90 ± .24 = .66 to 1.14

This tells us that we can state, with at least 95 percent confidence, that β1 > 0, and there is a positive relationship.

On the other hand, if we had obtained se(β̂1) = .61, our interval would have been

    .90 ± 2(.61) = .90 ± 1.22 = −.32 to 2.12

In this case the interval straddles zero, and we cannot be confident (at the 95 percent level) that there exists a positive relationship.
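The arithmetic of the two cases above can be sketched in a few lines of Python (an illustration, not part of the original notes):

```python
def ci95(b1, se_b1):
    """Rule-of-thumb 95 percent confidence interval: estimate plus/minus 2 standard errors."""
    return (b1 - 2 * se_b1, b1 + 2 * se_b1)

# The two cases discussed above
tight = ci95(0.90, 0.12)   # approximately (0.66, 1.14): entirely above zero
wide = ci95(0.90, 0.61)    # approximately (-0.32, 2.12): straddles zero

print(tight, wide)
print("confident beta1 > 0:", tight[0] > 0, wide[0] > 0)
```

Only in the first case does the whole interval lie above zero, which is what licenses the claim of a positive relationship.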
