data trunc stem | leaf 400.0 --> 400 40 | 0 86.3 --> 86 8 | 6 59.6 --> 59 5 | 9Using the tens' digit for the leaf:
data trunc stem | leaf 400.0 --> 40 4 | 0 86.3 --> 8 0 | 8 59.6 --> 5 0 | 5
    0 |
    1 |
    2 |
    3 |
    4 |
    5 |
    6 |
0-0111111111111111112222222222222333344444455555555666666677777888 1-000001222222333445555666778889 2-0015 3-0 4-0 5- 6-5
UNIT: 10 LPS: 1 64) 0-0111111111111111112222222222222333344444455555555666666677777888 37 1-000001222222333445555666778889 7 2-0015 3 3-0 2 4-0 1 5- 1 6-5
UNIT: 10 LPS: 5 18 0-011111111111111111 35 0-22222222222223333 49 0-44444455555555 12) 0-666666677777 40 0-888 37 1-000001 31 1-222222333 22 1-445555 16 1-66677 11 1-8889 7 2-001 4 2- 4 2-5 HIGH: 300 400 650The display above uses five lines for each stem (LPS=5), so each line corresponds to a class interval of 20 in a histogram. Note that leaves of 0 & 1 go on one line, 2 & 3 on the next, etc.
Two lines per stem is another possibility:
UNIT: 10 LPS: 2 41 0-01111111111111111122222222222223333444444 23) 0-55555555666666677777888 37 1-00000122222233344 20 1-5555666778889 7 2-001 4 2-5 3 3-0 2 3- 2 4-0 1 4- 1 5- 1 5- 1 6- 1 6-5
PROC UNIVARIATE in SAS constructs stem and leaf displays for each variable when you use the PLOT option. If the infant mortality rates are the variable IMR in the data set NATION, these lines give (among other things) a stem and leaf display:
proc univariate data=nations plot; var imr;Figure 1 shows the stem and leaf portion of the PROC UNIVARIATE output for the infant mortality data.
 VARIABLE=IMR            Infant Mortality Rate
 STEM LEAF                                       #                       BOXPLOT
    6 5                                          1                          *
    6
    5
    5
    4
    4 0                                          1                          0
    3
    3 0                                          1                          0
    2 6                                          1                          |
    2 002                                        3                          |
    1 555566666778899                           15                          |
    1 000011222233344                           15                       +-----+
    0 55555556666666667778888889                26                       *--+--*
    0 11111111112222222222222233333333344444    38                       +-----+
      ----+----+----+----+----+----+----+---
   MULTIPLY STEM.LEAF BY 10**+02
 Figure 1:  PROC UNIVARIATE Stem and leaf
display for infant mortality data
depth of median = d ( M ) = n + 1 / 2where n is the number of observations. If n is even, so d ( M ) is a half integer (e.g., 61.5) we average the two adjacent values (at depth 61 and 62) to get the median, M .
The median is a letter value , marked "M", defined as the middle observation in a distribution. Other letter values are the hinges , marked "H", which are located at the middle of each of the upper and lower halves of the ordered values. (They are essentially the same as the quartiles , but the definitions differ slightly.)
The hinges are also defined in terms of their depth:
depth of hinges = d ( H ) = [ d ( M ) ] + 1 / 2where [ ] means to drop the fractional part, if any.
Other letter values, called eighths (E), sixteenths (D), etc. are defined similarly as the observation half way, in number of observations, between the previous letter value and the nearer extreme.
d ( M ) = n + 1 / 2
d ( H ) = [ d ( M ) ] + 1 / 2
d ( E ) = [ d ( H ) ] + 1 / 2
d ( D ) = [ d ( E ) ] + 1 / 2This process can be continued until we obtain the extremes at depth 1. The calculations can be done directly from the stem and leaf display, but you should take the original data values (before truncation) as the letter values.
UNIT: 10 LPS: 5 18 0|011111111111111111 E= 16.9 35 0|22222222222223333 H=20 -> 25.7 49 0|44444455555555 12) 0|666666677777 D(M)= 101+1/2 = 51 M=60 -> 60.6 40 0|888 37 1|000001 31 1|222222333 D(H)= [51]+1/2= 26 H=120->129.4 22 1|445555 16 1|66677 D(E)= [26]+1/2= 13.5 E=(162.5+170)/2 11 1|8889 = 166.25 7 2|001 D(D)= [13.5]+1/2=7 D=200 4 2| 4 2|5 HIGH: 300 400 650
The batch can be summarized by a letter-value display :
       LETRVALS IMR
  DEPTH   LOWER   UPPER    SPREAD   MIDPT
 M 51  -   60.6    60.6 -            60.6
 H 26  -   26.2   129.4 -   103.2    77.8
 E 13H -   16.9   166.3 -   149.4    91.6
 D  7  -   12.8   200   -   187.2   106.4
 1  1  -    9.6   650   -   640.4   329.8
For each pair of letter values, the average of the two, called a midsummary , contributes information about the central tendency of the distribution. The difference, called a spread , indicates the variability of the data.
Note that:
   letter    normal spread
   ------    -------------
     H          1.349
     E          2.301
     D          3.068
     C          3.726
H L - 1.5 x H-spr
H U + 1.5 x H-spr
For the infant mortality data, the calculation of fences is (from the letter value display)
inner fences:  25.7 - 1.5 * 103.7 = -129.85
              129.4 + 1.5 * 103.7 =  284.95  (outside: 300, 400)
outer fences:  25.7 -  3  * 103.7 = -285.4
              129.4 +  3  * 103.7 =  440.5   (far out: 650-Saudi Arabia)
adjacent:      259
Here is the boxplot, showing two outside, and one far-out values:
      BOXPLOT IMR
    9.60              223.07              436.53              650.00
     +-------------------+-------------------+-------------------+
       +--------+
     <-|  |     |----------->  o          o                     X
       +--------+
                             LIBYA      AFGHAN.              S. ARABIA
Note that the skewness of the data is apparent in (a) the location
of the median relative to the hinges, (b) the relative lengths of
the whisker lines, and (c) the outside values on the high end.
The PLOT option of PROC UNIVARIATE also gives a small boxplot. When you have a grouping variable, you can produce full-page, side-by-side boxplots for each group on the printer with PROC UNIVARIATE.
The countries in the NATIONS data set are classified by REGION. Adding the statement BY REGION to the previous example gives side-by-side boxplots. (The data must first be sorted by REGION.)
proc univariate data=nations plot; var imr; by region;
                             Univariate Procedure
                               Schematic Plots
 Variable=IMR           Infant Mortality Rate
             |
         700 +
             |
             |                                                *
             |
         600 +
             |
             |
             |
         500 +
             |
             |
             |
         400 +                                                0
             |
             |
             |
         300 +                        0
             |
             |                        |
             |                        |
         200 +                        |                       |
             |            *        +-----+                    |
             |                     *--+--*                    |
             |                     |     |                 +-----+
         100 +                     +-----+                 |  +  |
             |         +-----+        |                    |     |
             |         *--+--*        |           0        *-----*
             |            |                    *--+--*     +-----+
           0 +                                 +-----+        |
              ------------+-----------+-----------+-----------+-----------
      REGION          Americas      Africa      Europe    Asia/Oce
 Figure 2:  Boxplots for IMR by region.
The side-by-side display allows visually comparing the groups in
central value, spread, shape, and unusual values.
Figure 3 shows boxplots of Infant mortality rate by region for the NATIONS data. This plot is drawn by the statements below. The option I=BOXT draws horizontal lines at the tips of the whisker lines.
title h=1.5 'Boxplot with I=BOX';
symbol i=boxt mode=include             /* V6.06 */
       v=+ color=black h=1.2;
proc gplot data=nations ;
   plot imr * region
        / frame vaxis=axis1 haxis=axis2
          vminor=1 hminor=0;
   axis1 label=(h=1.5 a=90 r=0) value=(h=1.3);
   axis2 label=(h=1.5)          value=(h=1.3) offset=(8 pct);
 Figure 3:  Simple boxplot with I=BOX
option
The notches to give approximate 95% comparison intervals are calculated as:
Median ± 1.58 ( H-spr / sqrt n )
                               /* Description of Parameters:         */
%macro BOXPLOT(                /* -------------------------          */
       data=_LAST_,            /* Input dataset                      */
       class=,                 /* Grouping variable(s)               */
       var=,                   /* Ordinate variable                  */
       id=,                    /* Observation ID variable            */
       width=.5,               /* Box width as proportion of maximum */
       notch=0,                /* =0|1, 1=draw notched boxes         */
       connect=0,              /* =0 or line style to connect medians*/
       f=0.5,                  /* Notch depth, fraction of halfwidth */
       fn=1,                   /* Box width proportional to &FN      */
       varfmt=,                /* Format for ordinate variable       */
       classfmt=,              /* Format for class variable(s)       */
       varlab=,                /* Label for ordinate variable        */
       classlab=,              /* Label for class variable(s)        */
       yorder=,                /* Tick marks, range for ordinate     */
       anno=,                  /* Addition to ANNOTATE set           */
       out=boxstat,            /* Output data set: quartiles, etc.   */
       name=BOXPLOT            /* Name for graphic catalog entry     */
       );
The figure below shows notched boxplots for the IMR data produced
using the BOXPLOT macro.  The width of each box is made
proportional to  sqrt n .
%include boxplot;
title1 h=1.5 'Infant Mortality Rate by Region';
%boxplot(data=nations, class=region, var=imr,id=nation,
         notch=1, fn=sqrt(n), classfmt=region.,
         varlab=Infant Mortality Rate, classlab=Region);
A particularly useful family of transformations is what Tukey calls the ladder of powers , where a variable x is transformed to t p ( x ) by a power p according to
t p ( x ) = left { lpile x p -1 over p , above log 10 x , lpile p != 0 above p = 0 right "" . (1)Although the family of transformations given by (1) is defined for all powers p , it is common to use only the simple integer and half-integer powers listed in Table 1. (In some situations, we may add the cube root, p = 1 / 3 to this list.) Table 1: Simple transformations in the ladder of powers
       +---------------------------------------------+
       | Power     Transformation      Re-expression |
       |---------------------------------------------|
       |   3       Cube                 x 3      |
       |                                             |
       |   2       Square               x 2      |
       |                                             |
       |   1       NONE (Raw)           x            |
       |                                             |
       |  1/2      Square root          sqrt x       |
       |                                             |
       |   0       Log                  log10 x      |
       |                                             |
       | -1/2      Reciprocal root     -1/sqrt x     |
       |                                             |
       |  -1       Reciprocal          -1 / x        |
       +---------------------------------------------+
The important properties of this family of transformations are:
     log X        X                X**2
       0           1                1
 1  {                }  9             }     99
       1          10              100
 1  {                } 90             }   9900
       2         100            10000
 1  {                }900             } 990000
       3        1000          1000000
Since the power transformations preserve the order of the data values, we can find a suitable transformation by looking at the letter values and their midsummaries transformed to various powers. In a symmetric distribution, the midsummaries do not systematically change as we go out toward the extremes.
       LETRVALS IMR * .5
  DEPTH   LOWER   UPPER    SPREAD   MIDPT
 M 51  -    7.8     7.8 -             7.8
 H 26  -    5.1    11.4 -     6.3     8.2
 E 13H -    4.1    12.9 -     8.8     8.5
 D  7  -    3.6    14.1 -    10.6     8.9
 1  1  -    3.1    25.5 -    22.4    14.3
       LETRVALS 10zIMR
  DEPTH   LOWER   UPPER    SPREAD   MIDPT
 M 51  -    1.8     1.8 -             1.8
 H 26  -    1.4     2.1 -      .7     1.8
 E 13H -    1.2     2.2 -     1       1.7
 D  7  -    1.1     2.3 -     1.2     1.7
 1  1  -    1       2.8 -     1.8     1.9
The boxplots for  sqrt IMR  and  log  IMR  show the
effect of these transformations on the shape of the data. The
distribution becomes more symmetric as the power decreases.
      BOXPLOT IMR * .5
    3.10               10.56               18.03               25.50
     +-------------------+-------------------+-------------------+
          +----------------+
     ----|       |        |----------------------              
          +----------------+
      BOXPLOT 10zIMR
     .98                1.59                2.20                2.81
     +-------------------+-------------------+-------------------+
                   +----------------------+
     -------------|           |          |----------------------
                   +----------------------+
Graphical methods can also be used to find a power transformation for symmetry. The simplest, useful plots are based on the idea that in a symmetric distribution, the distances of points at the lower end to the median should match the distances of corresponding points in the upper end to the median. In general, if the distribution were perfectly symmetric the observations at depth i should satisfy,
Lower distance to median = Upper distance to median
fwd 1.25i M - x (i) = x (n+1-i) - M
mid identical ( x (n+1-i) + x (i) ) / 2 vs. x (n+1-i) - x (i) identical spread .This plot is based on the idea that in a symmetric distribution, each mid value should equal the median,
( x (n+1-i) + x (i) ) / 2 = M .
Since the spreads increase as we go from the center out to the tails, this plot depicts how the pattern of mid values changes from the middle of the distribution to the extremes. Because the plot is untilted (slope = 0) when the distribution is symmetric, expansion of the vertical scale allows us to see systematic departures from flatness far more clearly.
x (i) + x (n+1-i) over 2 - Mas the vertical coordinate, against a squared measure of spread,
Lower 2 + Upper 2 over 4 M = ( M - x (i) ) 2 + ( x (n+1-i) - M ) 2 over 4 Mas the vertical coordinate. If this graph is approximately linear, with a slope, b , then p = 1 - b is the indicated power for a transformation to approximate symmetry.
proc sort data=analyze out=sortup;
   by X;
proc sort data=analyze out=sortdn;
   by descending X;
data symplot;
   merge sortup(rename=(X=frombot))      /* frombot = x(i)     */
         sortdn(rename=(X=fromtop));     /* fromtop = x(n+1-i) */
   mid = (fromtop + frombot) / 2;
   spread = fromtop - frombot;
These plots are all implemented in the SYMPLOT macro, which takes
the following parameters:
 %macro SYMPLOT(
       data=_LAST_,     /* data to be analyzed                   */
       var=,            /* variable to be plotted                */
       plot=MIDSPR,     /* Type of plot(s): NONE, or any of      */
                        /* UPLO, MIDSPR, MIDZSQ, or POWER        */
       trim=0,          /* # or % of extreme obs. to be trimmed  */
       symbol=dot,      /* plotting symbol                       */
       out=symplot,     /* output data set                       */
       name=SYMPLOT);   /* name for graphic catalog entry        */
For example, the symmetry power plot of the IMR data is produced by
these statements:
%include symplot; title h=1.6 'IMR data: Symmetry transformation plot'; %symplot(data=nations,var=IMR,plot=POWER,trim=5 pct);Trimming 5% of the observations in each tail ensures that extreme observations do not influence the choice of the power transformation.
The dot chart , developed by Cleveland (1984) is quite effective for this purpose, as long as the number of observations is not too large. If the observations are sorted by the response variable, the dot chart portrays the empirical cumulative distribution of the sample.
Simple dotplots can be produced on a printer with the TIMEPLOT procedure. The observations are arranged on the vertical axis in their order in the data set, so the data set should be sorted in descending order of the response variable. For example, the statements below produce the printer plot shown in Figure 4. The options REF=0 and JOINREF produce the dashed line connecting the plotting symbol to the vertical axis.
%include data(duncan); proc sort data=duncan; by descending prestige; proc timeplot data=duncan; plot prestige = '*' / ref=0 joinref; id job;
                       Duncan Occupational Prestige Data
JOB                 Prestige      min                                        max
                                  0                                           97
                                 *---------------------------------------------*
Physician                 97     ||-------------------------------------------*|
Professor                 93     ||------------------------------------------* |
Banker                    92     ||------------------------------------------* |
Architect                 90     ||-----------------------------------------*  |
Chemist                   90     ||-----------------------------------------*  |
Dentist                   90     ||-----------------------------------------*  |
Lawyer                    89     ||----------------------------------------*   |
Civil Eng.                88     ||----------------------------------------*   |
Minister                  87     ||---------------------------------------*    |
Pilot                     83     ||--------------------------------------*     |
Accountant                82     ||-------------------------------------*      |
Factory Owner             81     ||-------------------------------------*      |
Author                    76     ||----------------------------------*         |
Contractor                76     ||----------------------------------*         |
PS Teacher                73     ||---------------------------------*          |
RR Engineer               67     ||------------------------------*             |
Welfare Wrkr.             59     ||--------------------------*                 |
Undertaker                57     ||-------------------------*                  |
Machinist                 57     ||-------------------------*                  |
Electrician               53     ||------------------------*                   |
Reporter                  52     ||-----------------------*                    |
Store Manager             45     ||--------------------*                       |
Insur. Agent              41     ||------------------*                         |
Policeman                 41     ||------------------*                         |
Bookkeeper                39     ||-----------------*                          |
RR Conductor              38     ||-----------------*                          |
Mail carrier              34     ||---------------*                            |
Carpenter                 33     ||--------------*                             |
Plumber                   29     ||------------*                               |
Auto repair               26     ||-----------*                                |
Machine opr.              24     ||----------*                                 |
Barber                    20     ||--------*                                   |
Motorman                  19     ||--------*                                   |
Store clerk               16     ||------*                                     |
Cook                      16     ||------*                                     |
Coal miner                15     ||------*                                     |
Truck driver              13     ||-----*                                      |
Watchman                  11     ||----*                                       |
Gas stn attn              10     ||----*                                       |
Taxi driver               10     ||----*                                       |
Waiter                    10     ||----*                                       |
Janitor                    8     ||---*                                        |
Bartender                  7     ||--*                                         |
Soda clerk                 6     ||--*                                         |
Shoe-shiner                3     ||*                                           |
                                 *---------------------------------------------*
 Figure 4:  Dot chart of Occupational
Prestige from PROC TIMEPLOT
%macro dotplot(
       data=_LAST_,          /* input data set                         */
       xvar=,                /* horizontal variable (response)         */
       xorder=,              /* plotting range of response             */
       xref=,                /* reference lines for response variable  */
       yvar=,                /* vertical variable (observation label)  */
       ysortby=&xvar,        /* how to sort observations               */
       ylabel=,              /* label for y variable                   */
       group=,               /* vertical grouping variable             */
       gpfmt=,               /* format for printing group variable     */
                             /* value (include the . at the end)       */
       connect=DOT,          /* draw lines to ZERO, DOT, AXIS, or NONE */
       dline=2,              /* style of horizontal lines              */
       dcolor=BLACK,         /* color of horizontal lines              */
       errbar=,              /* variable giving length of error bar    */
                             /* for each observation                   */
       name=DOTPLOT);        /* Name for graphic catalog entry         */
The high-resolution dot chart of occupational prestige is produced
by these statements:
%include DUNCAN ;   * get the data;
%include DOTPLOT;   * get the macro;
%dotplot( data=duncan,
          yvar=job, xvar=prestige,
          connect=DOT);
Quantile comparison plots or Q - Q plots compare the empirical cumulative distribution of the data
G HAT ( x ) = number ( X <= x ) over n = estimated Prob ( X <= x )with the cumulative probability distribution (CDF) of the reference distribution,
F ( z ) = Prob ( Z <= z )
A general, graphical approach is to plot the ordered data values, x (i) , against the expected values of those observations if they followed the particular reference distribution. When the data follows the reference distribution, the points in such a plot will follow a straight line. Departures from the line shows how the data differ from the assumed distribution.
p i = i - 1/2 over n(Other definitions are used also; this one splits each observation, assigning half of it above, and half below x (i) )
z i = F -1 ( p i )
expected = Median ( x ) + ( H-spr / 1.349 ) z i
One way to do this is to calculate an estimated standard error, s Hat ( z i ) , of the ordinate z i and plot curves showing the interval z i ± 2 s Hat ( z sub i ) to give approximate 95% confidence intervals. Chambers et al. (1983) give formulas for estimating the standard errors for most of the common comparison distributions. For the normal Q-Q plot, the standard error is estimated by
s Hat ( z i ) = s hat over f ( z i ) sqrt p i ( 1 - p i ) over n (2)where f ( z i ) is the normal density function, and s hat estimates the slope of the comparison line, the standard deviation of x . Chambers et al. (1983) stress using a robust estimate of s hat , such as the measure based on the hinge-spread, so that the standard errors are not distorted by outliers or straggling tails.
To produce your own plot with SAS/GRAPH software, you can sort the data and calculate the normal quantiles in a DATA step with the PROBIT function as shown below. This is the most general method, since you can easily adjust the calculation of p i , use a different F -1 function for Q-Q plots based on other distributions, and add calculations for standard errors, as described above.
**- Normal Q-Q plot using PROBIT --;
proc sort data=nations;
   by imr;
proc univariate noprint data=nations;
   var imr;
   output out=stats n=nobs median=median qrange=hspr
                           mean=mean std=std;
data quantile;
   set nations;
   if _n_=1 then set stats;
   if imr ^= . ;                       *-- remove missing data!;
   i + 1;
   p = (i - .5) / nobs;
   z = probit(p);
   normal = mean   + z * std;
   robust = median + z * hspr/1.349;
proc gplot data=quantile;
   plot imr   * z = 1
        normal* z = 2
        robust* z = 3 / overlay frame;
   symbol1 i=none h=1.1 v=- color=black;
   symbol2 i=join l=20 v=none color=blue;
   symbol3 i=join l=1   v=none color=red;
%macro nqplot (
        data=_LAST_,    /* input data set                      */
        var=x,          /* variable to be plotted              */
        out=nqplot,     /* output data set                     */
        mu=MEDIAN,      /* est of mean of normal distribution: */
                        /*  MEAN, MEDIAN or literal value      */
        s=HSPR,     /* est of std deviation of normal:     */
                        /*  STD, HSPR, or literal value        */
        stderr=YES,     /* plot std errors around curves?      */
        detrend=YES,    /* plot detrended version?             */
        lh=1.5,         /* height for axis labels              */
        anno=,          /* name of input annotate data set     */
        name=NQPLOT,    /* name of graphic catalog entries     */
        gout=);         /* name of graphic catalog             */
Only the name of the data set and the variable to be plotted need
be supplied.  The normal Q-Q plot of the IMR data would be produced
with this statement:
%nqplot(data=nations,var=imr);
![[Previous]](../../icons/larrow.gif) Previous section
 Previous section
![[Next]](../../icons/rarrow.gif) Next section.
 Next section.