Exploratory and Graphical Methods of Data Analysis
Michael Friendly

Part 4: Multivariate Data Displays

4.1 Scatterplot matrix

With more than two variables, some ingenuity is required to display their relationships on a flat piece of paper. One idea is to display the scatterplot for all pairs of variables in an organized display, called a scatterplot matrix .

The next slide shows a scatterplot matrix of four measurements on three species of Iris flowers, the length and width of the sepal, and length and width of the petal. Iris species is reflected in the color and shape of the plotting symbol (which unfortunately do not remain visually distinct when photoreduced).

SCATMAT macro

The SCATMAT macro constructs a scatterplot matrix for any number of variables. A class or grouping variable in the data set can be used to determine the color and shape of the plotting symbol.

%macro SCATMAT(
    data =_LAST_,          /* data set to be plotted           */
    var  =_NUMERIC_,       /* variables to be plotted - can be */
                           /* a list or X1-X4 or VARA--VARB    */
    group=,                /* grouping variable (plot symbol)  */
    symbols=%str(- + : $ = X _ Y),
    colors=BLACK RED GREEN BLUE BROWN YELLOW ORANGE PURPLE,
    gcat=GSEG);            /* graphic catalog for plot matrix  */

The scatterplot matrix for the Iris data is produced by these statements:

%include scatmat;
%scatmat(data=iris,
         var=SEPALLEN SEPALWID PETALLEN PETALWID,
         group=spec_no);

4.2 Glyphs, Stars & Profiles

Another class of ideas for plotting multivariate data is to choose two variables for a scatterplot, and represent additional variables in the plotting symbol used for each observation.

Glyph plots

The next slide shows a glyph plot of the Iris data. The foreground variables are the lengths of two parts of the iris flower, petal and sepal. Two other background variables--the width of these two parts-- are shown by the length and angle of a line that emanates from the plotted point.

Star plots

A star plot can be used to present many variables at once. The idea is to plot each observation as a star-shaped figure with one ray for each variable. For a given observation, the length of each ray is made proportional to the size of that variable.

Star plots differ from glyph plots in that all variables are used to construct the plotted star figure; there is no separation into foreground and background variables. Instead, the star-shaped figures are usually arranged in a rectangular array on the page.

The main use of star plots would be in detecting which groups of observations seem to cluster together. They would appear as stars with similar shape.

Profiles

A profile plot displays a set of variables as a set of line segments connecting points, one for each variable. There is one profile for each observation.

If the variables are measured on different scales, there two ways the data are typically displayed, neither very satisfactory:

Plot the raw data profiles. Variables with a large range tend to dominate the display.
Plot the standard score profiles, i.e., change each variable to z scores, z _ij = ( x _ij - x bar sub j ) / s _j . Unless the observations have relatively flat standard score profiles, the result will be hard to see.

Linear profiles display

A far more effective display of profiles was suggested by Hartigan (1975). The idea behind this technique is to make the profiles as smooth as possible, by choosing the position of the variables along the horizontal scale in an optimal way. A sensible definition of smoothness requires that each profile be as nearly linear as possible.

If the data values are denoted y _ij , we first allow each variable to be transformed to standard scores,

y _ij star = ( y _ij - y bar _j ) / s _j

where y bar _j is the mean of variable j , and s _j is the standard deviation. This allows us to equate variables with very different ranges, such as the crime variables.

Say the variables are positioned on the horizontal scale at positions x _j , the values we wish to find. Then if each case has a linear profile, that means that

y _ij star = a _i + b _i x _j

with slope b _i and intercept a _i . Hartigan (1975) shows that the solution can be expressed in terms of the first two eigenvectors of the correlation matrix C .

The linear profiles display of the crimes data is shown in the next slide. Note that property crimes are positioned on the left, crimes of violence on the right. Cities with a positive slope (Dallas, Chicago) have a greater prevalence of violent crime than property crime; for cities with a negative slope (Los Angeles, Denver, Honolulu) the reverse is true.

Overall crime rate for each city is represented by the intercept of each line: cities at the top (Los Angeles, New York, New Orleans) having the most crimes and cities at the bottom having the least.

4.3 Biplot

Glyphs, stars, and profile plots focus all attention on the observations and portray the relations among the variables only implicitly. The biplot, proposed by Gabriel (1971; 1980), displays the observations and variables in the same plot, in a way that depicts their joint relationships.

The biplot is based on the idea that any data matrix, Y ( n × p ) , can be represented approximately as the product of a two-column matrix, A ( n × 2 ) , and a two-row matrix, B^T(2 × p) ,

Y = A B^T (13)

so that any score y _ij of Y is approximately the inner product of row i of A , and of column j of B^T,

y _ij = a _i^Tb _j = a _i1 b _1j + a _i2 b _2j (14)

Thus, the rows of A represent the observations in a two-dimensional space, and the columns of B^T represent the variables in the same space.

The approximation used in the biplot is like that in principal components analysis : the two biplot dimensions account for the greatest possible variance of the original data matrix. In the biplot display,

the observations are plotted as points. The configuration of points is essentially the same as scores on the first two principal components.
the variables are plotted as vectors from the origin. The angles between the vectors represent the correlations among the variables.

Example: Demographics of Some American Indian Tribes

The table below shows median years of schooling, percentage below the poverty line, and a derived "economic index" (.6 schooling + .1 poverty) for 5 American Indian tribes.


       OBS    TRIBE       SCHOOL    POVERTY    ECONOMIC

        1     SHOSHONE     10.3      29.0        9.08
        2     APACHE        8.9      46.8       10.02
        3     SIOUX        10.2      46.3       10.75
        4     NAVAJOS       5.4      60.2        9.26
        5     HOPIS        11.3      44.7       11.25

For the biplot, the data is usually first expressed in terms of deviations from the mean of each variable.


                    Deviations from Means
            Y        SCHOOL   POVERTY  ECONOMIC

            SHOSHONE   1.08   -16.4     -0.992
            APACHE    -0.32     1.4     -0.052
            SIOUX      0.98     0.9      0.678
            NAVAJOS   -3.82    14.8     -0.812
            HOPIS      2.08    -0.7      1.178

The row and column dimensions are found from the singular value decomposition of this matrix. (In this case, the fit is perfect, since there are only two linearly independent variables.)


                 Biplot coordinates
                           Dim1     Dim2

          OBS SHOSHONE  -3.4579  -0.9040
          OBS APACHE     0.3024  -0.0619
          OBS SIOUX      0.1567   0.6842
          OBS NAVAJOS    3.2110  -0.9216
          OBS HOPIS     -0.2122   1.2034

          VAR SCHOOL    -0.7305   1.5996
          VAR POVERTY    4.6791   0.2435
          VAR ECONOMIC   0.0296   0.9841

Algebraically, this expresses each score as the product of the observation row and the variable row:

Shoshone-school = -3.46*-0.73 + (-.904)*1.6
                = 1.08

Geometrically, this corresponds to dropping a perpendicular line from the Shoshone point to the School vector in the biplot.


 D      2 +
 i        |
 m        |
 e        |
 n        |              school
 s    1.5 +
 i        |
 o        |
 n        |                 HOPIS
          |
 2      1 +                 economic
          |
          |
          |                   SIOUX
          |
      0.5 +
          |
          |
          |                                        poverty
          |
        0 +                    +
          |                    APACHE
          |
          |
          |
     -0.5 +
          |
          |
          |
          |SHOSHONE                         NAVAJOS
       -1 +
          |
          |
          |
          |
     -1.5 +
           +---------+---------+---------+---------+---------+
          -4        -2         0         2         4         6
           Dimension 1

Example: Crime rates in the US

As a substantive example of the biplot technique, we consider data on the rates of various crimes (per 100,000 population) in the U.S. states.

The horizontal dimension in the plot is interpreted as overall crime rate. The states at the right (Nevada, California, NY) are high in crime, those at the left (North Dakota, South Dakota) are low. The vertical dimension is a contrast between property crime vs. violent crime. The angles between the variable vectors represent the correlations of the variables.

BIPLOT macro

A simple biplot can be constructed from the output data sets of PROC PRINCOMP. A more general version was implemented as a SAS macro in PROC IML to allow different scalings (GH, JK, and symmetric factorizations) of the variable and observation points. The BIPLOT macro takes the following parameters:

%macro BIPLOT(
   data=_LAST_,     /* Data set for biplot     */
   var =_NUMERIC_,  /* Variables for biplot    */
   id  =ID,         /* Observation ID variable */
   dim =2,          /* Number of dimensions    */
   factype=SYM,     /* factor type: GH|SYM|JK  */
   scale=1,         /* Scale factor for vars   */
   out =BIPLOT,     /* Biplot coordinates      */
   anno=BIANNO,     /* Annotate labels         */
   std=MEAN,        /* Standardize: NO|MEAN|STD*/
   pplot=YES);      /* Produce printer plot?   */

The number of biplot dimensions is specified by the DIM= parameter and the type of biplot factorization by the FACTYPE= value. The BIPLOT macro constructs two output data sets, identified by the parameters OUT= and ANNO=. The macro will produce a printed plot (if PPLOT=YES), but leaves it to the user to construct a PROC GPLOT plot, since the scaling of the axes should be specified to achieve an appropriate geometry in the plot.

4.4 Assessing multivariate normality

A graphical technique for assessing multivariate normality is an application the ideas of Q-Q plots described earlier. For each multivariate observation, x _i^T= ( x _i1 , x sub i2 , ... , x _ip ) , we compute the generalized (Mahalanobis) squared distance from the mean vector x bar , defined as:

d _i ² = ( x _i - x bar )^TS ^-1 ( x _i - x bar ) (15)

where S is the p x p sample variance covariance matrix. This is the multivariate analog of the square of the standard score for a single variable,

z _i ² = ( x _i - x bar ) ² over s ²

With p variables, d _i ² is distributed approximately as chi² with p degrees of freedom for large samples from the multivariate normal distribution. Therefore, a Q-Q plot of the ordered distance values, d _(i) ² , against the corresponding quantiles of the chi² ( p ) distribution should yield a straight line through the origin for multivariate normal data.

4.5 Detecting multivariate outliers

For single variables, outliers are cases with extreme standard scores, say | z _i | > 4 . Analogously, we could define a multivariate outlier as a case with a large Mahalanobis distance. In the chi² probability plot, these would appear as points in the upper right which are substantially above the line for the expected chi² quantiles.

Unfortunately, like all classical (e.g., least squares) techniques, the chi² plot for multivariate normality is not resistant to the effects of outliers. A few discrepant observations not only affect the mean vector, but also inflates the variance covariance matrix. Thus, the effect of the few wild observations is spread through all the d ² values. Moreover, this tends to decrease the range of the d ² values, making it harder to detect extreme ones.

One reasonably general solution is to use a multivariate trimming procedure (Gnanadesikan & Kettenring, 1972; Gnanadesikan, 1977), to calculate squared distances which are not affected by potential outliers.

This is an iterative process where, on each iteration, some proportion of the observations with the largest d ² values are temporarily set aside, and the trimmed mean, x bar sub (-) and trimmed variance covariance matrix, S _(-) are computed from the remaining observations. Then new d sup 2 values are computed from

d _i ² = ( x _i - x bar _(-) )^TS _(-) ^-1 ( x _i - x bar _(-) ) (16)

The effect of trimming is that observations with large distances do not contribute to the calculations for the remaining observations.

Example: AUTO data

The original chi² plot for the AUTO data, which did not use trimming, showed only a slight tendency for points to drift away from the reference line. However, the trimming procedure indicates that six observations are suspiciously deviant from the rest. Note that the DSQ values increase, and probability values decrease, from the first to the second iteration as the six observations are excluded from the calculations of mean and covariance matrix.


         Observations trimmed in calculating Mahalanobis distance

     OBS    MODEL               PASS    CASE      DSQ        PROB

       1    AMC PACER            1        2     22.7827    0.0189645
       2    CAD. SEVILLE         1       14     23.8780    0.0132575
       3    CHEV. CHEVETTE       1       15     23.5344    0.0148453
       4    VW RABBIT DIESEL     1       66     25.3503    0.0080987
       5    VW DASHER            1       68     36.3782    0.0001464
       6    AMC PACER            2        2     34.4366    0.0003068
       7    CAD. SEVILLE         2       14     42.1712    0.0000151
       8    CHEV. CHEVETTE       2       15     36.7623    0.0001263
       9    PLYM. CHAMP          2       52     20.9623    0.0337638
      10    VW RABBIT DIESEL     2       66     44.2961    0.0000064
      11    VW DASHER            2       68     78.5944    0.0000000

OUTLIER macro

This scheme for outlier detection has been implemented in a general SAS macro, OUTLIER. The arguments to the macro are shown below. PVALUE is the probability, such that an observation is trimmed when its D ² has a probability less than PVALUE. The macro produces an output data set (the OUT= parameter) containing the variables DSQ and EXPECTED (the chi² quantile) in addition to the input variables.

%macro OUTLIER(
         data=_LAST_,      /* Data set to analyze            */
         var=_NUMERIC_,    /* input variables                */
         id=,              /* ID variable for observations   */
         out=CHIPLOT,      /* Output dataset for plotting    */
         pvalue=.1,        /* Prob < pvalue -> weight=0      */
         passes=2,         /* Number of passes               */
         print=YES);       /* Print OUT= data set?           */

By default, two passes (PASSES= ) are made through the iterative procedure. A printer plot of DSQ * EXPECTED is produced automatically, and you can use PROC GPLOT afterward with the OUT= data set to give a high-resolution plot with labelled outliers, as shown in the example below.

The outlier procedure is run on the auto data with these statements:

%include OUTLIER;       * get the macro;
%include AUTO;          * get the data;
data AUTO;
   set AUTO;
   repair = mean(of REP77 REP78);
   if repair=. then delete;
%OUTLIER(data=AUTO,
  var=PRICE MPG REPAIR HROOM RSEAT TRUNK WEIGHT LENGTH TURN DISPLA GRATIO,
  pvalue=.05, id=MODEL, out=CHIPLOT, print=NO);

The plot is produced from the OUT= data set as shown below:

data labels;
   set chiplot;
   if prob < .05;            * only label outliers;
   xsys='2'; ysys='2';
   y = dsq;
   function = 'LABEL';
   text = model;
   size = 1.4;

proc gplot data=chiplot ;
   plot dsq      * expected = 1
        expected * expected = 2
        / overlay anno=labels vaxis=axis1 haxis=axis2;
   symbol1 f=special v=K h=1.5 i=none c=black;
   symbol2 v=none i=join c=red;
   label dsq     ='Squared Distance'
         expected='Chi-square quantile';
   axis1 label=(a=90 r=0 h=1.5 f=duplex) value=(h=1.3);
   axis2 order=(0 to 30 by 10) label=(h=1.5 f=duplex) value=(h=1.3);
   title h=1.5 'Outlier plot for Auto data';

Canonical discriminant plots and MANOVA

Canonical discriminant analysis is a dimension-reduction technique for analyzing differences in means of groups on multiple measures in terms of a small number of dimensions. The mathematics of canonical discriminant analysis and a one-way multivariate analysis of variance (MANOVA) are the same. In MANOVA, the emphasis is on testing the hypothesis that the mean vector is the same for all groups, taking the pattern of intercorrelations into account. Canonical discriminant analysis is concerned with finding the smallest number of linear combinations of the variables which maximally discriminate among the group mean vectors, in the sense of giving the largest possible univariate F statistic.

For p variables, x^T= ( x ₁ , ... , x _p ) , sets of weights, c = ( c ₁ , ... , c sub p ) , are found so as to maximize the F statistic for the derived canonical variable, CAN1 = S c _i x sub i . That is, the canonical weights indicate the linear combination of the variables which best discriminates among groups, in the sense of having the greatest univariate F . When there are more than two groups, it is possible to find additional linear combinations, CAN2, CAN3, ... CANs (s is the smaller of p and g - 1 ) which

are all uncorrelated
each have the maximum possible univariate F statistic

Plotting canonical discriminant scores

A canonical discriminant plot shows the scores of all individuals on the first 2 (or 3, plotted pairwise) canonical dimensions. This plot has the following properties:

Groups widely separated on a given dimension are discriminated well by that linear combination of the variables.
The contribution of each variable to each canonical dimension can be shown by plotting the variable weights as vectors in the plot.
A variable aligned with one of the canonical dimensions contributes highly to discrimination among the groups on that canonical variable.
The length of each variable vector is proportional to its contribution to discrimination through the two canonical variables.
Angles between the vectors represent the correlations between the variables, in a way similar to the biplot.
The mean for each group on the canonical variables can also be shown, together with confidence circles.

Example: Diabetes Data

Reaven and Miller (1979) examined the relationship between measures of blood plasma glucose and insulin in 145 non-obese adult patients at the Stanford Clinical Research Center in order to examine ways of classifying people as "normal", "overt diabetic", or "chemical diabetic". Each patient underwent a glucose tolerance test and the following variables were measured: Relative weight, Fasting Plasma Glucose (GLUFAST), Test Plasma Glucose (GLUTEST, a measure of intolerance to insulin), and Steady State Plasma Glucose (SSPG, a measure of insulin resistance).

Canonical discriminant analysis is performed with PROC CANDISC. Since there are 3 groups s = g - 1 = 2 dimensions are sufficient to display all possible discrimination among groups, and the printed output indicates that both dimensions are highly significant.

              Canonical Discriminant Analysis

                       Adjusted       Approx       Squared
        Canonical      Canonical     Standard     Canonical
       Correlation    Correlation     Error      Correlation

  1      0.909389       0.906426     0.014418      0.826988
  2      0.625998       0.618252     0.050677      0.391874

            Test of H0: The canonical correlations in the
              current row and all that follow are zero

      Likelihood
         Ratio      Approx F      Num DF      Den DF    Pr > F

 1    0.10521315     57.4891          10         276    0.0001
 2    0.60812612     22.3928           4         139    0.0001

The canonical discriminant plot below shows that the three groups are ordered on CAN1 from Normal to Overt Diabetic with Chemical Diabetics in the middle. The two glucose measures and the measure SSPG of insulin resistance are most influential in separating the groups along the first canonical dimension. The Chemical Diabetics are apparently higher than the other groups on these three variables. Among the remaining variables, INSTEST is most influential, and contributes mainly to the second canonical dimension. The wide separation of the 99% confidence circles portrays the strength of group differences on the canonical dimensions and the variable vectors help to interpret the meaning of these dimensions.

CANPLOT macro

Canonical discriminant plots are constructed by the CANPLOT macro. The plot for the diabetes data is produced by these statements:

title 'Canonical Discriminant analysis plot: Diabetes data';
%include canplot;         *-- unnecessary in PsychLab;
%include diabetes;        *-- %include data(diabetes) in PsychLab;
%canplot(
     data=diabetes,
     class=group,
     var=relwt glufast glutest instest sspg,
     scale=3.5);

The SCALE parameter sets the scaling of the variable vectors relative to the points in the plot.

Previous section