Exploratory and Graphical Methods of Data Analysis
Michael Friendly

Scatterplots can be enhanced by adding information to help you interpret the display or understand aspects of the data that are not shown directly.

One idea is to add an elliptical confidence region (a data ellipse) around the mean to highlight the joint relationship. With data for several groups, this helps show the extent to which the relationship is the same for all groups. (This works better than drawing individual regression lines.)

The 50% ellipse is analogous to the central box in the boxplot.
The correlation for the total sample may be very different from the within-group correlations.
The size of each ellipse shows within-group variances on each variable.

For bivariate normal observations, $\mathbf{x}_i = (x_i, y_i)$, the elliptical region containing a proportion $1 - \alpha$ of the data is given by the values $\mathbf{x}$ satisfying:

$$ (\mathbf{x} - \bar{\mathbf{x}})^T S^{-1} (\mathbf{x} - \bar{\mathbf{x}}) \le \chi^2_2 (1 - \alpha) \qquad (3) $$

where $\bar{\mathbf{x}} = (\bar{x}, \bar{y})$ are the sample means, $S$ ($2 \times 2$) is the covariance matrix of $(x, y)$, and $\chi^2_2 (1 - \alpha)$ is the $(1 - \alpha)$ percentage point of the $\chi^2$ distribution with two degrees of freedom.
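As an illustration of the inequality above, the boundary of a data ellipse can be traced by mapping a circle of radius equal to the square root of the chi-square cutoff through a Cholesky factor of S. The following is a minimal Python sketch, not the PROC IML code of the CONTOUR macro; the function name `data_ellipse` and its arguments are hypothetical. It relies on the fact that the chi-square distribution with two degrees of freedom has the closed-form quantile -2 ln(1 - coverage).

```python
import math

def data_ellipse(xs, ys, coverage=0.5, n_points=60):
    """Points on the boundary of the data ellipse containing `coverage` of the data.

    `coverage` plays the role of (1 - alpha); all names here are illustrative.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Sample covariance matrix S (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Chi-square quantile with 2 df: closed form -2 ln(1 - coverage).
    c = -2.0 * math.log(1.0 - coverage)
    # Cholesky factor of S: S = L L^T with L = [[l11, 0], [l21, l22]].
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 ** 2)
    radius = math.sqrt(c)
    points = []
    for k in range(n_points):
        t = 2.0 * math.pi * k / n_points
        u, v = radius * math.cos(t), radius * math.sin(t)
        # Map the circle point (u, v) through L and shift to the mean.
        points.append((mx + l11 * u, my + l21 * u + l22 * v))
    return points, (mx, my), (sxx, sxy, syy), c

# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
boundary, mean, cov, cutoff = data_ellipse(xs, ys, coverage=0.5)
```

For grouped data, the same function would be applied to each group separately, giving one ellipse per group.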

CONTOUR macro

The macro CONTOUR uses PROC IML to calculate points on the boundary of the ellipse for each group in the input data set.

%macro CONTOUR(
       data=_LAST_,              /* input data set    */
       x=,                       /* X variable        */
       y=,                       /* Y variable        */
       group=,                   /* Group variable    */
       pvalue= .5,               /* probability value */
       out=CONTOUR,              /* output data set   */
       colors=RED GREEN BLUE BLACK);

For example, a set of 50% data ellipses for Weight vs. Price of automobiles grouped by region of origin is produced by these statements:

%include contour;
%contour(data=auto, x=price, y=weight, group=origin);

2.2 Smoothing scatterplots

Our eyes are often quite effective in detecting patterns in scatterplots that could not be captured by any quantitative measures. Sometimes, however, the pattern may be too weak, or the distributions of the variables too uneven, to be able to clearly see the relation between Y and X. In these cases, it is useful to plot some smooth curve(s) or other display which shows how Y changes with X.

Trace lines and boxplots

One set of ideas involves dividing the data into roughly equal-sized strips according to their ordered x-values. For each strip, we can plot:

the median and hinge trace lines, e.g., a plot of median ( y | x ) vs. x . (Example: Draft lottery data)
parallel boxplots, i.e., boxplot of all y values in the strip positioned at the middle x -value. (Example: Baseball salaries vs. batting average)
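The strip summaries described above can be sketched in a few lines of Python. This is an illustrative sketch only: the names `hinges` and `trace_lines` are hypothetical, and the hinge rule used (medians of the lower and upper halves, sharing the middle value when the count is odd) is one common convention that may differ slightly from the one used for the plots in these notes.

```python
from statistics import median

def hinges(values):
    """Lower and upper hinges: medians of each half of the sorted values
    (the middle value belongs to both halves when the count is odd)."""
    v = sorted(values)
    half = (len(v) + 1) // 2
    return median(v[:half]), median(v[-half:])

def trace_lines(xs, ys, n_strips):
    """Split the points into roughly equal strips by ordered x and return,
    for each strip, (median x, lower hinge of y, median y, upper hinge of y)."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    summary = []
    for k in range(n_strips):
        strip = pairs[k * n // n_strips:(k + 1) * n // n_strips]
        if not strip:
            continue
        strip_x = [p[0] for p in strip]
        strip_y = [p[1] for p in strip]
        low_hinge, high_hinge = hinges(strip_y)
        summary.append((median(strip_x), low_hinge, median(strip_y), high_hinge))
    return summary

# Made-up data: 12 points in 3 strips of 4.
xs = list(range(1, 13))
ys = [5, 1, 3, 7, 2, 8, 6, 4, 9, 12, 10, 11]
for row in trace_lines(xs, ys, 3):
    print(row)
```

Plotting the three y summaries against the strip's median x gives the L, M, and H traces; drawing a box from the lower to the upper hinge at each median x gives the parallel boxplots.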

The plot below shows the lottery numbers assigned to each birthday (Day of Year) in the 1970 US Draft Lottery carried out by the Selective Service to determine the order in which eligible men would be drafted into the US Army. It was supposed to be completely random, and looks that way from the scatter plot.


   SCATTER DRAFT

    366  -        **    *      *  2 * **               *
 L       -    ***     * *  **2       ** 2       2*      *   *
 O       - ***  *    * 2    * ** *  *    *2    *  *              *
 T       -***        *    * *     2*        *  **      *     *  *  2*
 T       -     *  2*2**       * * **     *  **  **       **
 E  275  o   *      2   ****  *  *   2  **   *          *    **  *
 R       -*           *2* * *2   * *      *  *     * * *  **
 Y       -  **     *   *   *          *          * *******2     *   *
         -***  ****  *  **2       **   *   *      *      *  *
 N       - 2    *  *   *      **    *   *    * *     2* *  2   *
 U  184     *  * **          * **   *   * 2       ***    **   **
 M       -*    *      2 *                   * *  ** * *  *       2** 2
 B       -   * * 2   *    **   *           *  **2     ** *     3
 E       -  *        **    *    2    * *  *            *      3 **  2 *
 R       -    *     *            * ** ***    *2* *  * *   *    *  *
     92  + *  ****          *       *   *  **           *  * *   *  *2*
         -    *  *       2   * * *   **     *      **    *  *2*     **
         -   *2    *        *   *     ** *    2   2  *       *  * *  *
         -          * *  2    *** *       *  * * *          2 *   *2
         -  *     *        *       **  * **** *   *   * *   *     **  *
      1  -       *           *                     2*     **    * *   *
          -------------------+------------------+-------------------
          1                 123                 244                 366
                                                           DAY OF YEAR

The next slide shows the trace line plot of the medians (M), upper (H), and lower (L) hinges of the lottery numbers against day of the year. It is clear, even with fairly coarse resolution, that the lottery numbers decrease towards the end of the year, indicating a defect in the method used to select these "random numbers".



     342  +                    H     H
 L        -         H          HH   HHH H
 O        -        HH    HHH H HHH HH H  H
 T        -H     H H HHHHHHHHHH  HHH   HHHHH    HH
 T        -H   HHHHH  HHHH   HH    H    HHHHH  HH   H        H
 E   268  + HHHHH H                       H    HH H HH   HHH H
 R        - HHHH    M                       HHH H HHHHHH HHHH H
 Y        -  HH    MMMMMMMM    MM             HH  H   HHHH    HH
          -      MMM  MMMMMMMM MM  MMM        H        H       H  H  H
 N        -M    MMM         MMMM MMMMMM                H     M HHHHHHHHH
 U   194  +MM  M    L        MM   MM   MMM               MMMMM     HH  H
 M        - MMMM   LL                   MMMMM   MM  MMMMMM   MM     H  H
 B        -   M      L   L                M MMMMM MMMM M      MM
 E        -        L LLLL                     MM  MM       L   MMM
 R        -L    LLLL  LLLLL     L                        LLLLL   MMMMMMM
     121  + L  LLLL      L L L LL   LL                   LL LL     MM  M
          - L  L            LLLLL  LLLL L       LL  LL LLL  LL
          - LLLL           LLLLL LLLLL LLL LL  LLL  LLLLLL    LL
          -  LL              LL  LLL   LLLLLL  LL L LL LL     LLLLL  LLL
          -   L                           L  LLL  LLL  L         LLLLL L
      47  +                             L         LL               LL  L
           +-------------------+-------------------+-------------------+
          21                 129                 237                 346
                                                            DAY OF YEAR

Lowess smoothing

Another generally useful technique is robust, locally weighted regression smoothing, often called lowess. The procedure finds a smoothed fitted value, $\hat{y}_i$, for each $x_i$ by fitting a weighted regression to the points in the neighborhood of $x_i$. The points closest to $x_i$ receive the greatest weight. This is the "locally weighted regression" part of the procedure, and the weights are called neighborhood weights.

The robust part works as follows: once the fitted values, $\hat{y}_i$, have been found, the residuals, $e_i = y_i - \hat{y}_i$, are used to determine a new set of weights (robustness weights) so that points which have large residuals are down-weighted, and the locally weighted regression is repeated.

In practice, lowess depends on the choice of a smoothing parameter, $f$, $0 < f \le 1$, the fraction of the data points to be considered in the calculation of $\hat{y}_i$. Choosing $f = .5$ means that only the $r = [.5 n]$ points closest to $x_i$ have non-zero weights. Increasing the value of $f$ makes the fitted curve smoother; decreasing $f$ lets the curve follow the data more closely.

Step 1: Choose window around x
Select the $r$ observations closest to $x_i$. Call these $x_{i1}, x_{i2}, \ldots, x_{ir}$. The corresponding y values are $y_{i1}, y_{i2}, \ldots, y_{ir}$. The window half-width for $x_i$ is the distance to the furthest of these observations:
$$ h_i = \max_j | x_i - x_{ij} | $$
Step 2: Find weights
For each $x_{ij}$ in the window of $x_i$, find the weight,
$$ w_{ij} = W \left[ \frac{x_{ij} - x_i}{h_i} \right] $$
where $W(\cdot)$ is the tricube weight function,
$$ W(z) = \begin{cases} 0 & \text{for } |z| \ge 1 \\ (1 - |z|^3)^3 & \text{for } |z| < 1 \end{cases} $$
Step 3: Fit weighted linear regression
Use weighted least squares to find coefficients, $a_i$, $b_i$, to minimize $\sum_{j=1}^{r} w_{ij} e_{ij}^2$ in
$$ y_{ij} = a_i + b_i x_{ij} + e_{ij} $$
Step 4: Find fitted value, $\hat{y}_i$
$\hat{y}_i = a_i + b_i x_i$, using the slope and intercept found by weighted least squares. All this gives one smoothed value, $\hat{y}_i$, for $x_i$!
Step 5: Down-weight outliers
Calculate residuals, $e_i = y_i - \hat{y}_i$. From these, calculate the robustness weights that discount observations with large residuals:
$$ \delta_i = B \left[ \frac{e_i}{6 \, \mathrm{Mdn}(|e|)} \right] $$
where $B(\cdot)$ is the bisquare weight function,
$$ B(z) = \begin{cases} 0 & \text{for } |z| \ge 1 \\ (1 - z^2)^2 & \text{for } |z| < 1 \end{cases} $$
Step 6: Repeat
Repeat steps 1 to 4, but use the compound weights, $\delta_j w_{ij}$, in the individual regressions, finding new fitted values, $\hat{y}_i$. The effect is that outliers receive small weights, even when they are quite near $x_i$. Mostly, one robustness step is sufficient; once or twice, however, I've seen cases where three iterations were better than two.
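The six steps above can be sketched compactly in Python. This is a minimal illustration under stated assumptions, not the PROC IML code of the LOWESS macro: the function names are hypothetical, $[fn]$ is taken as the integer part of $fn$, only the linear fit is implemented, and degenerate cases (a zero window width, a zero median residual, or a window whose weights are all zero) are handled in the simplest way that keeps the code running.

```python
from statistics import median

def tricube(z):
    """Neighborhood weight function W(z)."""
    z = abs(z)
    return (1.0 - z ** 3) ** 3 if z < 1.0 else 0.0

def bisquare(z):
    """Robustness weight function B(z)."""
    return (1.0 - z * z) ** 2 if abs(z) < 1.0 else 0.0

def lowess(xs, ys, f=0.5, robust_steps=1):
    """Robust locally weighted linear smoothing; returns fitted values in input order."""
    n = len(xs)
    r = max(2, int(f * n))       # assumption: [f n] taken as the integer part
    delta = [1.0] * n            # robustness weights, all 1 on the first pass
    fitted = list(ys)
    for step in range(robust_steps + 1):
        for i in range(n):
            # Step 1: the r nearest neighbours of x_i and the half-width h_i.
            window = sorted(range(n), key=lambda j: abs(xs[j] - xs[i]))[:r]
            h = max(abs(xs[j] - xs[i]) for j in window)
            # Step 2 (and step 6): tricube weights times robustness weights.
            w = [(tricube((xs[j] - xs[i]) / h) if h > 0 else 1.0) * delta[j]
                 for j in window]
            sw = sum(w)
            if sw <= 0:
                continue         # no usable neighbours: keep the previous value
            # Step 3: weighted least squares line through the window.
            mx = sum(wj * xs[j] for wj, j in zip(w, window)) / sw
            my = sum(wj * ys[j] for wj, j in zip(w, window)) / sw
            sxx = sum(wj * (xs[j] - mx) ** 2 for wj, j in zip(w, window))
            sxy = sum(wj * (xs[j] - mx) * (ys[j] - my) for wj, j in zip(w, window))
            b = sxy / sxx if sxx > 0 else 0.0
            # Step 4: fitted value at x_i.
            fitted[i] = my + b * (xs[i] - mx)
        if step == robust_steps:
            break
        # Step 5: robustness weights from the residuals.
        resid = [y - yhat for y, yhat in zip(ys, fitted)]
        s = median(abs(e) for e in resid)
        if s == 0:
            break                # perfect fit for at least half the points
        delta = [bisquare(e / (6.0 * s)) for e in resid]
    return fitted

# Made-up data: a line with alternating +/-1 noise and one gross outlier at x = 10.
xs = [float(i) for i in range(21)]
ys = [x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(xs)]
ys[10] = 40.0
print(lowess(xs, ys, f=0.5, robust_steps=0)[10])   # pulled up by the outlier
print(lowess(xs, ys, f=0.5, robust_steps=2)[10])   # close to the underlying line
```

Re-sorting the neighbours for every point is wasteful for large n, but it keeps the correspondence with steps 1 to 6 easy to follow.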

Choosing f

Larger values of f give a smoother curve, but may result in oversmoothing. f = .5 is often a good choice, but values in the range .33 to .67 may be tried.

The figure below shows the lowess smoothed fit for the draft lottery data. The steady decline in Lottery Number over the last two-thirds of the year is now quite clear. (The plot is over-dramatic, and perhaps misleading, since the vertical scale covers a smaller range than the plots of the raw data and trace lines.)


       SMOOTH ~ DRAFT LOWESS }366
       SCATTER SMOOTH

    213  -             26666663
 L       o    *666666674      366
 O       -46665                  66
 T       -                        *65
 T       -                          *66
 E  189                               66*
 R       -                               563
 Y       -                                 3664
         -                                    366663
 N       -                                         3666*
 U  165  +                                             56*
 M       -                                               55
 B       -                                                26*
 E       -                                                  54
 R       -                                                   26*
    141  +                                                     54
         -                                                      26
         -                                                        62
         -                                                         44
         -                                                          26
    117  -                                                            4
          -------------------+------------------+-------------------
          1                 123                 244                 366
                                                           DAY OF YEAR

LOWESS macro

The lowess procedure involves a series of weighted regressions, one for each data point. The process can be carried out easily with the matrix operations of PROC IML.

The LOWESS macro reads the input data set into PROC IML, calculates the smoothed $\hat{y}$ values, and creates an output data set containing the original and smoothed data. The macro takes the following parameters:

 /*---------------------------------------------------------------*
  * LOWESS SAS Locally weighted robust scatterplot smoothing      *
  *---------------------------------------------------------------*/
%macro LOWESS(
   data=_LAST_,    /* name of input data set            */
   out=SMOOTH,     /* name of output data set           */
   x = X,          /* name of independent variable      */
   y = Y,          /* name of Y variable to be smoothed */
   id=,            /* optional row ID variable          */
   f = .50,        /* lowess window width               */
   p = 1,          /* 1=linear fit, 2=quadratic         */
   iter=2,         /* total number of iterations        */
   plot=NO,        /* draw the plot?                    */
   gplot=NO,       /* draw the plot?                    */
   pplot=NO,       /* draw a printer plot?              */
   symbol=circle,
   htext=1.5,
   hsym=1.5,
   name=LOWESS);   /* name for graphic catalog entry    */

e.g.,

%include lowess;
%lowess( data=auto, x=weight, y=mpg, f=.3, gplot=YES );

2.3 Transformations for linearity

If we think of y as a response (or "dependent") variable, and x as a factor (or "independent") variable, we might like to fit y as a function of x:

y = f ( x ) + residual = fit + residual

All other things equal, we prefer a "simple" f ( x ) like a linear function, to a more complex one. If a scatterplot of y against x appears substantially non-linear, there are two choices:

Bend the model: Try fitting a quadratic, cubic, or other polynomial in x instead of a linear model.
Unbend the data: Transform either y --> y', or x --> x' (or both), so that the relation between the transformed variables is more nearly linear,
y' = a + b x' + residual

Once again, the ladder of powers provides a scheme for understanding the effect of various power transformations. Visual examination of the scatterplot of the raw data, possibly enhanced with a lowess smoothed curve, can suggest the direction to move on the ladder for transforming x, or y, or both.

As a preliminary example, consider the simple case where

$$ y = \tfrac{1}{5} x^2 $$

for x = 0, 1, 2, 3, 4, 5. Here, the relation can be straightened by transforming x:

$$ x' = x^2 \;\longrightarrow\; y = \tfrac{1}{5} x' $$

or y:

$$ y' = \sqrt{y} \;\longrightarrow\; y' = \sqrt{\tfrac{1}{5}} \, x $$

The data and the transformed values are:

X      Y      X'     Y'
0     0        0     0
1     0.2      1     0.4472
2     0.8      4     0.8944
3     1.8      9     1.342
4     3.2     16     1.789
5     5       25     2.236
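The straightening effect in this table can be checked numerically by comparing the slopes between successive points before and after each transformation. The short Python sketch below is only an illustration; the helper name `successive_slopes` is hypothetical.

```python
import math

xs = [0, 1, 2, 3, 4, 5]
ys = [x ** 2 / 5 for x in xs]             # y = (1/5) x^2

x_prime = [x ** 2 for x in xs]            # re-express x as x' = x^2
y_prime = [math.sqrt(y) for y in ys]      # or re-express y as y' = sqrt(y)

def successive_slopes(u, v):
    """Slopes of the line segments joining consecutive (u, v) points."""
    return [(v[k + 1] - v[k]) / (u[k + 1] - u[k]) for k in range(len(u) - 1)]

print(successive_slopes(xs, ys))          # slopes keep increasing: curved
print(successive_slopes(x_prime, ys))     # all equal to 1/5, up to rounding
print(successive_slopes(xs, y_prime))     # all equal to sqrt(1/5), up to rounding
```

Equal successive slopes are exactly what a straight-line relation means, which is the idea behind the ratio of slopes used later in this section.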

These values are plotted below:


       SCAT X, Y                         SCAT XP, Y
 5-                        f       5-                        f
  -                                 -
  -                                 -
  -                                 -
  -                   f             -               f
  -                                 -
  -              f                  -        f
  -                                 -
  -         f                       -   f
  -                                 -
 0f----f--------------------       0ff------------------------
  0           X            5         0           X'         25

       SCAT X , YP
 2.5-
    -                        f
    -
    -                   f
    -
    -              f
    -         f
    -
    -    f
    -
 0  f-------------------------
    0           X            5

Tukey's arrow rule

More generally, a power of x or of y can be selected by examining the curvature in the scatter plot of y vs. x and drawing an arrow from the inside to the outside of the "bulge".

The arrow points in the direction to move along the ladder of powers for x or for y. That is, if the arrow points toward smaller values of x (or y), move down the ladder of powers, toward $\sqrt{x}$ or $\log(x)$ (or $\sqrt{y}$, $\log(y)$). The greater the curvature, the further from 1 (raw data) the power should be.

Summary points

The arrow rule indicates the direction to move but does not indicate, except roughly, what the power should be. To see what effect a given transformation,

$$ x \to x^p, \qquad y \to y^q $$

would have, you would have to transform the data and examine the scatterplot of the re-expressed values.

Since the power transformations are order-preserving, we can try different power transformations on a few well chosen points, to see what effect they would have on all the data.

Tukey (1977) suggests finding three summary points which are the median ( x , y ) values in three equal-sized strips based on the ordered x -values. Use subscripts L , M , H to denote the ( x , y ) values for the Low, Middle, and High strip.

If the relation is linear, these three points will fall on a line, or equivalently, the slope of the line from $(x, y)_L$ to $(x, y)_M$ will be the same as the slope of the line from $(x, y)_M$ to $(x, y)_H$.



 Y-        -       -               Y-        -       -    H
  -        -       -                -        -       -
  -                    H            -
  -                                 -
  -                                 -
  -            M                    -
  -                                 -            M
  -                                 -
  -    L   -       -                -    L   -       -
  -        -       -                -        -       -
 0+-------------------------       0+-------------------------
  0           X                      0           X

On the other hand, if the relation is not linear, the lines connecting the pairs of points will have different slopes.

Finding summary points

Sort the ( x , y ) pairs in increasing order by x -value.
Divide the points into thirds, based on ordered x -values. If n / 3 is not an integer, put the [ n / 3 ] points with the smallest x -values in the Low group, the [ n / 3 ] points with the largest x -values in the High group, and the remaining points in the Middle group.
This division is modified when equal x -values straddle the dividing line: they are all kept in the same third. Also, neither end portion should cover more than half the range of the x -values, which can happen if the distribution of x is highly skewed.
Then $x_L$ is the median of the x values in the L group; $y_L$ is the median of the y values in that group. The summary points $(x_M, y_M)$ and $(x_H, y_H)$ are defined similarly.

This section uses data on literacy rates and gross national product for 22 nations to show the details of the calculations. (The plot indicates the relation is highly nonlinear. The arrow rule indicates we should transform x to lower powers or y to higher powers.)


 NEPAL            45  5
 BURMA            57 47.5
 UGANDA           64 27.5
 S. VIETNAM       76 17.5            SCAT GNP
 THAILAND         96 68        100-         f f           f          f
 HAITI           105 10.5         -                f
 INDONESIA       131 17.5         -        f
 S. KOREA        144 77       L   -  f
 GHANA           172 22.5     I   -   f
 PERU            179 47.5     T   - f   f
 EL SALVADOR     219 39.4     E   -    f
 BR.GUIANA       235 74       R   -      f
 HONG KONG       272 57.5     A   -f f   f
 PANAMA          329 65.7     C   -   f
 LEBANON         362 47.5     Y   -
 SINGAPORE       400 50           -f
 ARGENTINA       490 86.4         -fff
 ICELAND         572 98.5         - f
 CZECHOSLOVAK    680 97.5         -f
 FRANCE          943 96.4        0-------------------------------------
 NEW ZEALAND    1310 98.5            0           GNP               2000
 CANADA         1947 97.5

Since the distribution of GNP is so skewed, the range rule suggests allocating only two points to the upper third. The summary points are found to be:


         2 RSTAB GNP
   SUMPTS: SPLIT IS 7 13 2
   SUMMARY POINTS
   L     76 17.5
   M    329 65.7
   H   1629 98
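The summary points shown above can be reproduced with a short Python sketch. The function name `summary_points` and its `split` argument are hypothetical; the sketch implements only the basic thirds rule, so the 7 / 13 / 2 split suggested by the range rule is passed in explicitly, and the adjustment for tied x-values is not implemented.

```python
from statistics import median

def summary_points(xs, ys, split=None):
    """Median (x, y) in the Low, Middle, and High strips of the ordered x-values.

    `split` gives the three group sizes; by default the Low and High groups
    each get the integer part of n / 3 and the Middle group gets the rest.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    if split is None:
        k = n // 3
        split = (k, n - 2 * k, k)
    n_low, n_mid, _ = split
    groups = [pairs[:n_low], pairs[n_low:n_low + n_mid], pairs[n_low + n_mid:]]
    return [(median(x for x, _ in g), median(y for _, y in g)) for g in groups]

# GNP and literacy for the 22 nations listed above.
gnp = [45, 57, 64, 76, 96, 105, 131, 144, 172, 179, 219,
       235, 272, 329, 362, 400, 490, 572, 680, 943, 1310, 1947]
literacy = [5, 47.5, 27.5, 17.5, 68, 10.5, 17.5, 77, 22.5, 47.5, 39.4,
            74, 57.5, 65.7, 47.5, 50, 86.4, 98.5, 97.5, 96.4, 98.5, 97.5]

for label, point in zip("LMH", summary_points(gnp, literacy, split=(7, 13, 2))):
    print(label, point)
```

With only two points in the High group, the median GNP is the average of 1310 and 1947, that is 1628.5, which the output above displays rounded to 1629.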

Ratio of slopes

The curvature of the data can be measured by the ratio of slopes :

$$ r = \frac{m_{HM}}{m_{ML}} = \frac{(y_H - y_M) / (x_H - x_M)}{(y_M - y_L) / (x_M - x_L)} \qquad (4) $$

As shown earlier, a linear relation implies $r \approx 1$ (or $\log r \approx 0$).

For the GNP data, this ratio is quite far from 1:


      1 1 SLOPES S
SUMMARY    HALF    SLOPE
POINTS     SLOPES  RATIO
[X*1] [Y*1]
1629 98
           0.02485
 329 65.7          0.1304
           0.1905
  76 17.5

Since we are using medians, and the ladder of powers is order-preserving (as long as all values are positive), we can transform the summary points to various powers, and the ratio of slopes for the transformed summary points will tell what effect that transformation will have on the whole data set.

If we transform $x \to x^p$ and $y \to y^q$, then the ratio of slopes for that pair of transformations is

$$ r^{(p,q)} = \frac{m_{HM}^{(p,q)}}{m_{ML}^{(p,q)}} = \frac{(y_H^q - y_M^q) / (x_H^p - x_M^p)}{(y_M^q - y_L^q) / (x_M^p - x_L^p)} \qquad (5) $$

Since we know we must go down the scale of powers for x, we try $\log x$, then $-1/\sqrt{x}$. The first is not far enough, but the second is too far. $p = -1/3$ is an in-between power, and seems to work quite well.


      0 1 SLOPES S
SUMMARY     HALF   SLOPE
POINTS      SLOPES RATIO
[X*0] [Y*1]
3.212 98
            46.49
2.517 65.7         0.6138
            75.74
1.881 17.5

      -.5 1 SLOPES S
SUMMARY        HALF   SLOPE
POINTS         SLOPES RATIO
[X*-0.5] [Y*1]
-0.02478 98
               1064
-0.05513 65.7         1.315
                809
-0.1147  17.5

      -.33 1 SLOPES S
SUMMARY        HALF   SLOPE
POINTS         SLOPES RATIO
[X*-0.33] [Y*1]
-0.08711 98
               533.3
-0.1477  65.7         1.016
               524.9
-0.2395  17.5
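The slope ratios in these outputs follow directly from equations (4) and (5) applied to the three summary points. The Python sketch below is an illustration, not the APL function used to produce the output; the names `power` and `slope_ratio` are hypothetical. It assumes the usual ladder-of-powers convention: base-10 log for p = 0, and a sign change for negative powers so that the order of the values is preserved (the sign and the base of the log do not affect the ratio).

```python
import math

def power(v, p):
    """Ladder-of-powers re-expression of a positive value v."""
    if p == 0:
        return math.log10(v)
    return v ** p if p > 0 else -(v ** p)

def slope_ratio(low, mid, high, p=1, q=1):
    """Ratio of the High-Middle slope to the Middle-Low slope after x -> x^p, y -> y^q."""
    (xl, yl), (xm, ym), (xh, yh) = [(power(x, p), power(y, q))
                                    for x, y in (low, mid, high)]
    m_hm = (yh - ym) / (xh - xm)
    m_ml = (ym - yl) / (xm - xl)
    return m_hm / m_ml

# Summary points for the GNP data, as displayed above.
L, M, H = (76, 17.5), (329, 65.7), (1629, 98)

for p in (1, 0, -0.5, -0.33):
    print(p, round(slope_ratio(L, M, H, p=p, q=1), 4))
```

A ratio below 1 means the curve still flattens out to the right, so x must move further down the ladder; a ratio above 1 means the transformation has gone too far.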

Ratio of slopes table

This process can be automated, and the results organized in a ratio of slopes table which shows the slope ratio for all combinations of some set of powers of x and of y. Note how roughly equal values tend to run along diagonals.


   RATIOS OF SLOPES  :
      Y:  [-2  ] [-1  ] [ -.5] [LOG ] [  .5] [RAW ] [ 2  ]
    X:
   [-2  ]   .778  2.213  3.575  5.590  8.459 12.394 24.385
   [-1  ]   .175   .499   .806  1.261  1.908  2.796  5.500
   [ -.5]   .083   .235   .379   .593   .898  1.315  2.588
   [LOG ]   .039   .110   .177   .277   .419   .614  1.208
   [  .5]   .018   .051   .082   .128   .194   .284   .559
   [RAW ]   .008   .023   .038   .059   .089   .130   .257
   [ 2  ]   .002   .005   .008   .012   .018   .027   .053
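A table of this kind can be generated by looping the slope-ratio calculation over a grid of powers. The sketch below is a hedged Python illustration with hypothetical names, using the same conventions as the table above (rows are powers of x, columns are powers of y, log for power 0, sign change for negative powers); it is not the program that produced the output shown.

```python
import math

POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2]     # 0 stands for log, 1 for raw data

def power(v, p):
    """Order-preserving power re-expression of a positive value."""
    if p == 0:
        return math.log10(v)
    return v ** p if p > 0 else -(v ** p)

def slope_ratio(points, p, q):
    """Equation (5) for summary points given in (Low, Middle, High) order."""
    (xl, yl), (xm, ym), (xh, yh) = [(power(x, p), power(y, q)) for x, y in points]
    return ((yh - ym) / (xh - xm)) / ((ym - yl) / (xm - xl))

def ratio_table(points, powers=POWERS):
    """Dictionary keyed by (x power, y power) holding the slope ratios."""
    return {(p, q): slope_ratio(points, p, q) for p in powers for q in powers}

summary = [(76, 17.5), (329, 65.7), (1629, 98)]   # GNP summary points
table = ratio_table(summary)

print("x\\y   " + "".join(f"{q:>8}" for q in POWERS))
for p in POWERS:
    print(f"{p:>5} " + "".join(f"{table[(p, q)]:8.3f}" for q in POWERS))
```

Entries closest to 1 identify the candidate pairs of powers; the final choice among them can then rest on interpretability.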

The figure below shows the data and scatterplot when GNP is transformed to the reciprocal cube root.

         NEG. CUBE ROOT OF X IS CLOSEST TO BEING STRAIGHT
         GNP3 ~ GNP
         GNP3[;1]~GNP[;1] POWER -.333

         GNP3  TRANSFORMED X
   -0.2815   5
   -0.2602  47.5
   -0.2503  27.5               SCAT GNP3
   -0.2364  17.5         100                         ff  f f   -
   -0.2187  68                                          f      -
   -0.2123  10.5                                    f          -
   -0.1972  17.5         L                 f                   -
   -0.1911  77           I                     f               -
   -0.1801  22.5         T             f          f            -
   -0.1777  47.5         E                      f              -
   -0.1662  39.4         R                         f           -
   -0.1623  74           A        f          f    f            -
   -0.1546  57.5         C                     f               -
   -0.1451  65.7         Y                                     -
   -0.1406  47.5                   f                           -
   -0.136   50                       f    f  f                 -
   -0.1271  86.4                        f                      -
   -0.1207  98.5               f                               -
   -0.114   97.5           0------------------------------------
   -0.1022  96.4            -0.3         GNP^(-1/3)        -0.05
   -0.09161 98.5
   -0.08029 97.5

The plot shows no signs of nonlinearity. However, some find "reciprocal cube root" of GNP hard to interpret, and might prefer a slightly curved relationship with log GNP .

2.4 Plotting discrete data

When the x or y values are highly discrete, or when there is a lot of data, it may be difficult to see patterns in scatterplots because of overplotting: multiple points occur at the same plot locations. Standard printer plots often use the characters A, B, C, etc. to represent 1, 2, 3, ... observations at the same plot location.

The plot below shows data generated from a mixture of three bivariate normal distributions. It is hard to see that there are three regions of high density.


       PLOT OF Y*X    LEGEND: A = 1 OBS, B = 2 OBS, ETC.

 Y |                                                  AA  A
   |                                               A    BA
50 +                                          A A   A        A
   |                                      B   A ABACD ABB CAB
   |                                       A   AACBCBAA BA
   |                                    AA B  BAAACACA B
   |                                 A AA B ABAABA A A  A
40 +                                AA  A B   D
   |                            A        AA B
   |                                    A CA  AA
   |                       A   B    BBC BBA  AB A
   |                       A  B BAB E AAAA  AA   A
30 +                    A A    ADAAC  A A   A A
   |                   A  A AB   A BA BAAAA
   |                      A   A AB A A A A A
   |                  A A AAA AA
   |                 A     A  A    AAA
20 +                A AABAAA  A   A  A
   |             AA A     BACBB A
   |        A ABA ABB   DAA BAA B A
   |       AAA  AAB CACAAB  A
   |    B      AA  AF AAB B A
10 +    AA      AB      B
   |          A      B
   |                B
   |
   |
 0 +
   --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+--
    -3   0   3   6   9  12  15  18  21  24  27  30  33  36  39

                                X

Sunflower plots

One idea for such data is a sunflower plot. Here, the data are binned into (x, y) cells, and the frequency of points in each cell is represented with a sunflower symbol. The sunflower symbol depicts the number of observations by the number of radial "petals" from a central point.
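The binning step behind a sunflower plot can be sketched as follows. This Python fragment is illustrative only: the function name `bin_counts` and the choice of fixed-width cells anchored at zero are assumptions, and drawing the petals themselves is left to whatever plotting tool is at hand.

```python
import math
from collections import Counter

def bin_counts(xs, ys, x_width, y_width):
    """Count the points falling in each (x, y) cell of the given widths.

    Cells are indexed by integer pairs; cell (i, j) covers
    [i * x_width, (i + 1) * x_width) by [j * y_width, (j + 1) * y_width).
    """
    counts = Counter()
    for x, y in zip(xs, ys):
        counts[(math.floor(x / x_width), math.floor(y / y_width))] += 1
    return counts

# Made-up data, for illustration only.
xs = [0.1, 0.2, 1.5, 1.7, 1.9]
ys = [0.1, 0.3, 0.2, 0.4, 1.2]
for cell, petals in sorted(bin_counts(xs, ys, 1.0, 1.0).items()):
    print(cell, petals)
```

Each cell would then be drawn at its center with as many petals as its count.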

Jittering

Another idea is to add a small amount of random noise to each point to reduce overplotting. This technique, called jittering, is particularly useful when the data are discrete.

To jitter, add a uniformly distributed random quantity, scaled so that it breaks up the overlap, but does not corrupt the data. Let $u_i \sim U[-1, 1]$ be uniformly distributed from -1 to 1. Then, to jitter x, calculate

$$ x'_i = x_i + s \, u_i $$

where $s$ is the scale factor, which might be some small fraction (e.g., .02) of the range of the data, or 1/2 the rounding interval if the data have been rounded. If the y variable is also discrete, the same process can be applied to jitter y --> y'.

Then, in the scatter plot of y (or y') vs. x' it should be easier to see the density of points in different regions.
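A jittering step of this kind takes only a few lines. The Python sketch below is a minimal illustration; the function name `jitter`, its `fraction` argument, and the use of a seeded generator for reproducibility are assumptions made for the example.

```python
import random

def jitter(values, fraction=0.02, rng=None):
    """Add uniform noise on [-s, s] to each value, with s = fraction * range."""
    rng = rng or random.Random()
    s = fraction * (max(values) - min(values))
    return [v + s * rng.uniform(-1.0, 1.0) for v in values]

# Made-up discrete data with heavy overplotting.
x = [1, 1, 1, 2, 2, 3, 3, 3, 3, 5]
x_jittered = jitter(x, fraction=0.02, rng=random.Random(1))
print(x_jittered)
```

For data rounded to a known interval, `s` would instead be set to half that interval, as described above.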


Part 2: Examining Relationships

2.1 Enhanced Scatterplots