Exploratory and Graphical Methods of Data Analysis
Michael Friendly

Part 3: Regression Diagnostics

3.1 Regression model

The General Linear Model assumes that a dependent (response) variable, $y$, is to be explained as a linear function of one or more independent (explanatory) variables, $x_1, x_2, \ldots, x_p$, according to

$$ y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p + e \qquad (6) $$

In matrix form, for a sample of $n$ observations, the model is

$$ \mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} \qquad (7) $$

In addition to the assumption that model (6) correctly specifies the relation between $y$ and the $x_i$, the usual statistical inference requires the following assumptions about the residual errors:

Normality
The residuals have a normal distribution (with mean zero).
Homogeneity of variance
The variance of $e_i$ is a constant value, $\sigma^2$.
Independence
The residuals associated with distinct observations are uncorrelated.

The usual least squares estimators can be expressed in matrix notation as

$$ \hat{\mathbf{b}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} $$

with fitted values $\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{b}} = \mathbf{H}\mathbf{y}$, where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the Hat matrix that appears again in Section 3.3.

3.2 Residual Plots

The fitted values in regression contain the portion of each observation which is "explained" by the model; the residuals contain the "unexplained" portion.

Plots of the residuals can help to answer questions such as: Are the residuals normally distributed? Is the residual variance constant over the range of fitted values?

Normal Q-Q plot of residuals

The standard way to test the assumption of normality of errors is to make a normal probability plot. Our ability to detect deviations from normality is enhanced by plotting confidence intervals around the expected normal 45° line and by plotting a detrended version of the Q-Q plot.
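As a minimal sketch (assuming the duncan data set and model used in the example of Section 3.3; the confidence intervals and detrended version described above require specialized macros), the residuals can be saved with PROC REG and plotted with the QQPLOT statement of PROC UNIVARIATE:

proc reg data=duncan noprint;
   model prestige = income educ;
   output out=resids p=yhat r=residual;   /* save fitted values and residuals */
run;

proc univariate data=resids noprint;
   qqplot residual / normal(mu=est sigma=est);  /* Q-Q plot with fitted normal line */
run;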

Non-constant residual variance

The usual way to look for evidence that residual variance is not constant is to make a scatter plot of the residuals against fitted values. In situations where residual variance increases systematically with x or y , you will see a fan-shaped pattern in the plot.

A better way to plot residuals to assess changes in spread is to plot the absolute value of the residual, $|e_i|$, against $\hat{y}_i$ or $x$. This has the effect of folding the plot along a horizontal line at zero. Adding a smoothed (e.g., lowess) curve to the configuration of points helps considerably.
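A sketch of such a plot, assuming the resids data set created in the Q-Q plot example above (a spline smooth stands in for the lowess curve here):

data absres;
   set resids;
   absres = abs(residual);            /* fold the residuals at zero */
run;

symbol1 value=dot interpol=sm60s;     /* dots plus a smooth curve, sorted by x */
proc gplot data=absres;
   plot absres * yhat;
run; quit;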

Transformation to equalize residual variance

Non-constant residual variance violates the assumptions of the linear regression model. When the pattern is one of systematic increase or decrease in spread with fitted value, a transformation of the response variable can often cure this defect.

One idea is to make a diagnostic plot of $\log_{10} |e_i^*|$ against $\log_{10}(x_i)$. If the points in this plot are approximately linear with slope $b$, it can be shown that a power transformation, $y \rightarrow y^{(1-b)}$, where $y^0$ is interpreted to mean $\log(y)$, will make the residual variance more nearly equal.
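The slope $b$ of this diagnostic plot can also be estimated directly. A hedged sketch, using the duncan data of Section 3.3 with income as an illustrative $x$:

proc reg data=duncan noprint;
   model prestige = income educ;
   output out=sl rstudent=rstudent;       /* studentized residuals e*_i */
run;

data spread;
   set sl;
   if abs(rstudent) > 0 and income > 0;   /* logs are undefined otherwise */
   logres = log10(abs(rstudent));
   logx   = log10(income);
run;

proc reg data=spread;
   model logres = logx;    /* fitted slope b suggests the power 1 - b */
run;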

3.3 Leverage and Influence diagnostics

Some observations can exert a great deal of influence on the fitted regression. The influence of an observation depends on two factors:
leverage
The "horizontal" distance of the x -value from the mean, x bar . The further from the mean, the more leverage an observation has.
y-discrepancy
The vertical distance between $y_i$ and a regression line fitted without observation $i$.
A conceptual formula is:
Influence = Leverage × y-Discrepancy
so an observation must have both high leverage and a large residual to be influential.

Hat values measure leverage

The diagonal entries, $h_i$, of the Hat matrix $\mathbf{H}$ are measures of the (squared) distance of observation $\mathbf{x}_i$ from the vector of means, $\bar{\mathbf{x}}$. In simple linear regression,

$$ h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_k (x_k - \bar{x})^2} $$

In general, it can be shown that

$$ 0 \le h_i \le 1 \qquad \text{and} \qquad \sum_{i=1}^{n} h_i = p + 1 $$

so the average value of $h_i$ is $(p+1)/n$. In practice, observations are typically considered high leverage if $h_i > 2(p+1)/n$.
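For example, in the Duncan data analyzed below, $n = 45$ and $p = 2$, so the average hat value is $(2+1)/45 = .067$ and the high-leverage cutoff is $2(2+1)/45 = .133$, the value used in the influence plot below.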

Studentized residuals measure discrepancy

The ordinary residuals, $e_i = y_i - \hat{y}_i$, are not very useful here, because (a) they do not have equal variances, since $\mathrm{var}(e_i) = \sigma^2(1 - h_i)$, so high-leverage observations tend to have small residuals, and (b) a highly discrepant observation inflates the overall error estimate $s$, thereby deflating its own standardized residual. To correct these difficulties, we commonly use what is called the studentized (deleted) residual, RSTUDENT, which is the residual for $y_i$ estimated without using that observation. Let $s_{(-i)}^2$ be the error mean square estimate of $\sigma^2$ omitting observation $i$. Then

$$ \mathrm{RSTUDENT} \equiv e_i^* = \frac{e_i}{s_{(-i)} \sqrt{1 - h_i}} \qquad (9) $$
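Under the model assumptions, $e_i^*$ follows a $t$ distribution with $n - p - 2$ degrees of freedom, so studentized residuals can be compared with $t$ critical values; this is the basis of the $|\mathrm{RSTUDENT}| > 2.02$ criterion (the $\alpha = .05$ critical value of $t_{(41)}$) used in the example below.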

Influence measures

Two influence measures based on these ideas are commonly used. Both are functions of the product of leverage and y -discrepancy.

DFFITS
a scaled measure of the change in the predicted value, $\hat{y}_i$, when observation $i$ is omitted,

$$ \mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{(-i)}}{s_{(-i)} \sqrt{h_i}} = e_i^* \sqrt{\frac{h_i}{1 - h_i}} \qquad (10) $$

Belsley et al. (1980) suggest using $|\mathrm{DFFITS}_i| > 2\sqrt{(p+1)/n}$ as a criterion for declaring an influential outlier.
COOK'S D
a measure of the change in the parameter estimates, $\hat{\mathbf{b}}$, when observation $i$ is deleted. Let $\mathbf{d}_i = \hat{\mathbf{b}} - \hat{\mathbf{b}}_{(-i)}$ be the difference in the least squares estimates with and without observation $i$. Then Cook's D is

$$ \mathrm{COOKD}_i = \frac{\mathbf{d}_i^T (\mathbf{X}^T \mathbf{X})\, \mathbf{d}_i}{(p+1)\, s^2} = \frac{e_i^2}{(p+1)\, s^2} \times \frac{h_i}{(1 - h_i)^2} \qquad (11) $$

Cook's D is approximately $\mathrm{DFFITS}_i^2 / (p+1)$, so a cutoff of $4/n$ might be used to detect influential observations.
The cutoff values for declaring influential observations are simply rules-of-thumb, not precise significance tests, and should be used merely as indicators of potential problems.
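A hedged sketch of applying these cutoffs to the Duncan data of the example below ($n = 45$, $p = 2$), using the diagnostic output keywords of PROC REG shown in Section 3.4:

proc reg data=duncan noprint;
   model prestige = income educ;
   output out=diag rstudent=rstudent h=hatvalue
          cookd=cookd dffits=dffits;
run;

data flagged;                             /* n = 45 occupations, p = 2 predictors */
   set diag;
   if hatvalue > 2*(2+1)/45               /* high leverage               */
      or abs(rstudent) > 2                /* large studentized residual  */
      or abs(dffits) > 2*sqrt((2+1)/45)   /* influential by DFFITS       */
      or cookd > 4/45;                    /* influential by Cook's D     */
run;

proc print data=flagged;
   var job rstudent hatvalue dffits cookd;
run;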

Example: Duncan Occupational Prestige Data

Occupational prestige is an important construct in sociology, but it is difficult to measure. Duncan (1961), in a classic study, attempted to estimate a substitute measure by predicting occupational prestige from census measures of income and education, which were available for many more occupations. The variables were defined as:
INCOME
The proportion of males in a given occupational category reporting income of $3,500 or more in the 1950 US Census.
EDUC
The proportion of males in each occupation with at least a high school education in that census.
PRESTIGE
A survey of almost 3000 people by the National Opinion Research Center obtained ratings on a 5-point scale of the "general standing" of someone engaged in each occupation. Duncan's prestige measure was the percent of people giving a rating of "good" or "excellent".

The results of a regression analysis predicting prestige from income and education are shown below. From these results, it would appear that income and education are both highly significant predictors of prestige, with roughly equal weights.


                       Duncan Occupational Prestige Data
DEP VARIABLE: PRESTIGE Prestige

   ROOT MSE     13.36903     R-SQUARE       0.8282
   DEP MEAN     47.68889     ADJ R-SQ       0.8200

                              PARAMETER ESTIMATES

                 PARAMETER      STANDARD    T FOR H0:
VARIABLE  DF      ESTIMATE         ERROR   PARAMETER=0  PROB > |T|

INTERCEP   1   -6.06466292    4.27194117        -1.420      0.1631
INCOME     1    0.59873282    0.11966735         5.003      0.0001
EDUC       1    0.54583391    0.09825264         5.555      0.0001

However, plots of the influence measures for each case indicate that several observations exert a great deal of influence on this conclusion. One way to understand the separate effects of discrepancy and leverage is to make a plot of RSTUDENT * HATVALUE.

The plot below identifies individual observations which are either high leverage ($h_i > 2(p+1)/n = .133$) or have large studentized residuals ($|\mathrm{RSTUDENT}| > 2.02$, the critical value of $t_{(41)}$ at $\alpha = .05$).


       -+----+----+----+----+----+----+----+----+----+----+----+----+----+-
S  4.00+                                                                  +
T      |                                                                  |
U      |                                                                  |
D      |                                                                  |
E      |                                      * Minister                  |
N  3.00+                                                                  +
T      |                                                                  |
I      |                                                                  |
Z      |                                                                  |
E      |                                                                  |
D  2.00+      * Contractor                                                +
       |    * 28                                                          |
R      |      * 18                                                        |
E      |                                                                  |
S      |                                                                  |
I  1.00+      * 25     * 32                                               +
D      |  26 *  * 5* 7                                    RR Engineer *   |
U      |       * 29  * 20                                                 |
A      |    361**** 10* 13                                                |
L      |  * 37 ** 40                                                      |
   0.00+-*-44---343**-3-*-15----------------------------------------------+
       |    11 *  * 35                                                    |
       |        334*** * * 8                                              |
       |    42 * 31                                                       |
       |         * 41                                                     |
  -1.00+       * 38    * 21                                               +
       |          * 33                                                    |
       | * 22                                                             |
       |                                                                  |
       |   * 24                                     * RR Conductor        |
  -2.00+   * 23                                                           +
       |                                                                  |
       |         * Reporter                                               |
       |                                                                  |
       |                                                                  |
  -3.00+                                                                  +
       -+----+----+----+----+----+----+----+----+----+----+----+----+----+-
       .02  .04  .06  .08  .10  .12  .14  .16  .18  .20  .22  .24  .26  .28

       H LEVERAGE
large residual, low leverage
Reporter, Contractor
small residual, high leverage
RR Engineer
large residual, high leverage
Minister, RR Conductor
A high-resolution version of the influence plot can be drawn showing the combined effect of residual and leverage (= influence) using a bubble symbol whose size is proportional to Cook's D or DFFITS, together with the warning values for residuals and leverage. The plot below is drawn by the INFLPLOT macro, as follows:
%include inflplot;
%inflplot(data=duncan,
          y=Prestige, x=Income Educ, id=Job,
          bsize=14, color=red);

3.4 Partial residual plots

In multiple regression, the correlations among predictors often make it difficult to see the unique effect of one predictor, controlling for the others. Partial regression residual plots (Larsen & McCleary, 1972) are designed to show the relationship between $y$ and each $x_k$, after the effects of all other predictors have been removed.

The partial regression plot for variable $x_k$ is defined as follows. Let $\mathbf{X}_{[k]}$ be the matrix of predictors omitting variable $k$,

$$ \mathbf{X}_{[k]} = [\, \mathbf{1}, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{k-1}, \mathbf{x}_{k+1}, \ldots, \mathbf{x}_p \,] $$

and let $\mathbf{y}_{k*}$ be the part of $\mathbf{y}$ which cannot be predicted from $\mathbf{X}_{[k]}$, that is, the residuals from the regression of $\mathbf{y}$ on $\mathbf{X}_{[k]}$:

$$ \mathbf{y}_{k*} = \mathbf{y} - \hat{\mathbf{y}}_{\mathbf{X}_{[k]}} $$

Similarly, let $\mathbf{x}_{k*}$ be the residuals from the regression of $\mathbf{x}_k$ on $\mathbf{X}_{[k]}$. Then the partial regression residual plot for $x_k$ is the scatter plot of $\mathbf{y}_{k*}$ vs. $\mathbf{x}_{k*}$.
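This definition can be applied directly, as in the hedged sketch below for the partial plot of INCOME in the Duncan data (this is the definition, not the computation PROC REG performs internally):

proc reg data=duncan noprint;
   model prestige = educ;       /* y on X[k]: all predictors except income */
   output out=part1 r=ystar;
run;

proc reg data=part1 noprint;
   model income = educ;         /* x_k on X[k] */
   output out=part2 r=xstar;
run;

symbol1 value=dot interpol=rl;  /* points plus the least squares line */
proc gplot data=part2;
   plot ystar * xstar;
run; quit;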

Properties of partial residual plots

The plot has two useful properties: (a) the slope of the least squares line through the points is exactly $b_k$, the partial slope for $x_k$ in the full model; and (b) the residuals from that line are identical to the residuals from the full model. The plot of $\mathbf{y}_{k*}$ and $\mathbf{x}_{k*}$ therefore serves the same purposes that an ordinary scatter plot does in simple linear regression.

Example: Duncan data

The partial regression residual plots for the Duncan occupational prestige data show a particularly dramatic effect of the influential cases (Minister, RR Conductor, RR Engineer) on the partial relationship between Prestige and Income. The net effect of these cases is a large decrease in the estimated effect of Income.

From these plots, it was decided to eliminate these three cases and repeat the regression analysis. The regression results are shown below.


                       Duncan Occupational Prestige Data
DEP VARIABLE: PRESTIGE Prestige

     ROOT MSE     11.49227     R-SQUARE       0.8762
     DEP MEAN     46.52381     ADJ R-SQ       0.8699

                              PARAMETER ESTIMATES

                 PARAMETER      STANDARD    T FOR H0:
VARIABLE  DF      ESTIMATE         ERROR   PARAMETER=0  PROB > |T|

INTERCEP   1   -6.31736312    3.67962246        -1.717      0.0939
INCOME     1    0.93066285    0.15374847         6.053      0.0001
EDUC       1    0.28464102    0.12136228         2.345      0.0242

The prediction of occupational prestige has improved somewhat ($R^2 = .88$), and the relative regression weights for income and education have changed considerably.

Partial regression plots in SAS

The PARTIAL option of PROC REG produces partial regression plots, one for each predictor given in the MODEL statement (plus one for the intercept). In these plots, points are identified by the value of the (character or numeric) variable given in the ID statement.

For the Duncan occupational prestige data, the partial regression plots were produced by the statements,

proc reg data=duncan;
   model prestige=income educ / influence partial r;
   output out=diagnose
      p=yhat r=residual rstudent=rstudent h=hatvalue
      cookd=cookd dffits=dffits;
   id case;
run;
The plot for INCOME in Figure 5 shows that three observations (6, 16 and 27) have great influence on the regression coefficient for income. Without these three observations, the partial effect (slope) of income on prestige would be a good deal greater.


                          PARTIAL REGRESSION RESIDUAL PLOTS
  PRESTIGE

P       80 +
r          |
e          |
s          |
t          |
i       60 +
g          |
e          |
           |
           |
        40 +                                                         27
           |                                      17
           |                                      18
           |                                 28
           |
        20 +
           |                                     2*20
           |          6                 25 5   10
           |                          32 7    1*   30            16
           |                             **  19
         0 +                            44 35 1*
           |                         **           33
           |                      1534
           |                      43**42
           |                    14  31      22
       -20 +                   45 4138
           |                               23
           |                  21    24      9
           |
           |
       -40 +
           |
           |
           |

           |
       -60 +
           |
           |
           |
           |
       -80 +
            +---------+---------+---------+---------+---------+---------+
          -60       -40       -20         0        20        40        60

            Income                                                 INCOME

Figure 5: Partial regression plot for income

PARTIAL macro

The PARTIAL macro produces high-resolution partial regression plots which show the regression line with slope $b_k$ explicitly and which highlight potentially influential observations. The macro program takes the following parameters:
%macro PARTIAL(
        data = _LAST_,  /* name of input data set               */
        yvar =,         /* name of dependent variable           */
        xvar =,         /* list of independent variables        */
        id =,           /* ID variable                          */
        label=INFL,     /* label ALL, NONE, or INFLuential obs  */
        out =,          /* output data set: partial residuals   */
        htext=1.5,      /* height of text labels in the plots   */
        plots=&xvar,    /* which partial plots to produce       */
        gout=gseg,      /* name of graphic catalog              */
        name=PARTIAL);  /* name of graphic catalog entries      */
An example is shown in Figure 6. This plot, with all points labelled with CASE number, is produced by the statements,
%include partial ;
%partial(data=duncan, yvar=prestige,
         xvar=educ income, id=case, label=ALL);

3.5 Box-Cox Transformations

One way to select an "optimal" transformation of $y$ (or the $x$s) in regression is to add parameter(s) for the transformation to the model. Consider the family of transformations

$$ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases} \qquad (12) $$

with the general linear model $y^{(\lambda)} = \mathbf{X}\mathbf{b} + \mathbf{e}$, where $\lambda$, the power of $y$, is an additional parameter to be estimated.

Box and Cox (1964) proposed a maximum likelihood procedure to estimate the power ($\hat{\lambda}$), the regression coefficients ($\hat{\mathbf{b}}$), and the error variance simultaneously by maximizing the likelihood

$$ L(\lambda, \mathbf{b}, \sigma_e^2) = \Pr(\, \text{data} \mid \text{model} \,) $$

This provides a statistical test of whether a transformation is necessary, as a likelihood ratio test of $H_0\colon \lambda = 1$. The test statistic is the likelihood ratio $G^2$,

$$ G^2 = -2 \log \left[ \frac{L(\mathbf{b}, \sigma_e^2 \mid \lambda = 1)}{L(\mathbf{b}, \sigma_e^2 \mid \lambda = \hat{\lambda})} \right] \sim \chi^2_{(1)} $$

It turns out that $\log L(\lambda)$ is a decreasing function of the MSE in the regression using $y^{(\lambda)}$ (scaled by the geometric mean of $y$) as the criterion, so maximizing $\log L$ is equivalent to finding the value of $\lambda$ which minimizes the MSE.

Hence, a simple (but computationally intensive) method is to fit the model $y^{(\lambda)} = \mathbf{X}\mathbf{b}$ for a range of $\lambda$ values (say, $-2$ to $+2$) and find the value $\hat{\lambda}$ which minimizes the MSE. The maximum likelihood method also provides a 95% confidence interval for $\lambda$ as those values for which

$$ \log L(\lambda) > \log L(\hat{\lambda}) - \tfrac{1}{2}(3.84) $$

where 3.84 is the .95 quantile of the chi-square distribution with 1 df. As an added benefit, one can plot the values of the partial $t$ or $F$ statistic for each regressor vs. $\lambda$ to see how the contribution of each predictor to the model varies with the power.
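A rough sketch of this grid search follows; the BOXCOX macro described below is the practical tool, and this illustrative %BOXGRID macro is hypothetical. It assumes the auto data of the next example and scales $y^{(\lambda)}$ by the geometric mean of $y$ so that MSE values are comparable across powers:

%macro boxgrid(data=auto, y=mpg, x=weight displa gratio,
               lo=-2, hi=2, by=0.5);
   /* geometric mean of y = exp(mean of log y), used to scale the
      transformed response so MSEs are comparable across powers */
   data _ly;
      set &data;
      _ly = log(&y);
   run;
   proc means data=_ly noprint;
      var _ly;
      output out=_gm mean=_mlog;
   run;

   proc datasets nolist nowarn;          /* start with a fresh results set */
      delete _grid;
   run; quit;

   %let steps = %sysevalf((&hi - &lo) / &by, integer);
   %do i = 0 %to &steps;
      %let lam = %sysevalf(&lo + &i * &by);
      data _t;
         if _n_ = 1 then set _gm(keep=_mlog);
         set &data;
         _gmean = exp(_mlog);
         %if %sysevalf(&lam = 0) %then %do;
            z = _gmean * log(&y);        /* limiting case, lambda = 0 */
         %end;
         %else %do;
            z = ((&y)**(&lam) - 1) / (&lam * _gmean**(&lam - 1));
         %end;
      run;
      proc reg data=_t outest=_est noprint;
         model z = &x;
      run;
      data _est;
         set _est;
         lambda = &lam;
         keep lambda _rmse_;
      run;
      proc append base=_grid data=_est; run;
   %end;

   proc print data=_grid;     /* pick the lambda with the smallest _RMSE_ */
   run;
%mend;

%boxgrid()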

The BOXCOX and BOXGLM macros

The BOXCOX macro finds power transformations of the response variable in a regression model (fitted with PROC REG) by the Box-Cox method, with graphic display of the maximum likelihood solution for $\lambda$, $t$-values for model effects, and the influence of observations on the choice of power. An analogous program, BOXGLM, carries out a similar analysis for ANOVA or regression models fitted with PROC GLM.

For the auto data, the previous plots were constructed with the following statements:

%include data(auto);
%include boxcox;       *-- unnecessary in PsychLab;
%boxcox(data=auto,
        resp=MPG,
        model=Weight Displa Gratio,
        id=model,
        out=coxout,
        gplot=RMSE EFFECT INFL,
        pplot=RMSE EFFECT INFL,
        lopower=-2.5, hipower=2.5, conf=.99);
The OUT= parameter specifies the name of the dataset to contain the transformed response.
© 1995 Michael Friendly <Friendly@yorku.ca>