Categorical Data Analysis with Graphics
Michael Friendly

Part 3: Plots for two-way frequency tables

Several schemes for representing contingency tables graphically are based on the fact that when the row and column variables are independent, the estimated expected frequencies, $e_{ij}$, are products of the row and column totals divided by the grand total: $e_{ij} = f_{i+} f_{+j} / f_{++}$. Each cell can then be represented by a rectangle whose area shows the cell frequency, $f_{ij}$, or its deviation from independence.

Sieve diagrams

Table 1 shows data on the relation between hair color and eye color among 592 subjects (students in a statistics course) collected by Snee (1974). The Pearson chi² for these data is 138.3 with 9 degrees of freedom, indicating substantial departure from independence. The question is how to understand the nature of the association between hair and eye color.

Table 1: Hair-color eye-color data

                      Hair Color
Eye
Color     BLACK    BROWN      RED    BLOND  | Total
                                            |
Brown        68      119       26        7  |   220
Blue         20       84       17       94  |   215
Hazel        15       54       14       10  |    93
Green         5       29       14       16  |    64
--------------------------------------------+------
Total       108      286       71      127  |   592

For any two-way table, the expected frequencies under independence can be represented by rectangles whose widths are proportional to the total frequency in each column, $f_{+j}$, and whose heights are proportional to the total frequency in each row, $f_{i+}$; the area of each rectangle is then proportional to $e_{ij}$. Figure 7 shows the expected frequencies for the hair and eye color data.

Figure 7: Expected frequencies under independence.
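As a rough check on the numbers quoted above, the expected frequencies and the Pearson chi-square can be recomputed from Table 1. This is a minimal sketch in Python (used here instead of SAS purely for compactness); the variable names are illustrative and not part of the original notes.

```python
# Hair-color / eye-color counts from Table 1
# (rows: eye color, columns: hair color BLACK, BROWN, RED, BLOND).
observed = [
    [68, 119, 26, 7],    # Brown eyes
    [20, 84, 17, 94],    # Blue eyes
    [15, 54, 14, 10],    # Hazel eyes
    [5, 29, 14, 16],     # Green eyes
]

row_totals = [sum(row) for row in observed]          # f_i+
col_totals = [sum(col) for col in zip(*observed)]    # f_+j
grand_total = sum(row_totals)                        # f_++

# Expected frequency under independence: e_ij = f_i+ * f_+j / f_++
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-square: sum over cells of (f_ij - e_ij)^2 / e_ij
chi_square = sum(
    (f - e) ** 2 / e
    for f_row, e_row in zip(observed, expected)
    for f, e in zip(f_row, e_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(f"chi-square = {chi_square:.1f} on {df} df")
```

The rectangles in Figure 7 have widths proportional to `col_totals` and heights proportional to `row_totals`, so their areas are proportional to the entries of `expected`.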

Riedwyl and Schüpbach (1983, 1994) proposed a sieve diagram (later called a parquet diagram) based on this principle. In this display the area of each rectangle is proportional to the expected frequency, and the observed frequency is shown by the number of squares in each rectangle. Hence, the difference between observed and expected frequency appears as the density of shading, with color indicating whether the deviation from independence is positive or negative. (In monochrome versions, positive deviations are shown by solid lines, negative by broken lines.) The sieve diagram for hair color and eye color is shown in Figure 8.

Figure 8: Sieve diagram for hair-eye data.

Figure 9 shows data on visual acuity in a large sample of women (n = 7477). The diagonal cells show the obvious: people tend to have the same visual acuity in both eyes, and there is a strong lack of independence. The off-diagonal cells show a more subtle pattern that suggests symmetry and a diagonals model.

Figure 10 shows the frequencies with which draft-age men with birthdays in various months were assigned priority values for induction into the US Army in the 1972 draft lottery. The assignment was supposed to be random, but the figure shows a greater tendency for those born in the latter months of the year to be assigned smaller priority values.

Figure 9: Vision classification data for 7477 women

Figure 10: Data from the US Draft Lottery

Association plot for two-way tables

In the sieve diagram the foreground (rectangles) shows expected frequencies; deviations from independence are shown by color and density of shading. The association plot (Cohen, 1980; Friendly, 1991) puts deviations from independence in the foreground: the area of each box is made proportional to observed - expected frequency.

For a two-way contingency table, the signed contribution to Pearson chi² for cell (i, j) is the standardized residual

$$ d_{ij} = \frac{f_{ij} - e_{ij}}{\sqrt{e_{ij}}}, \qquad \chi^2 = \sum_i \sum_j d_{ij}^2 $$

In the association plot, each cell is shown by a rectangle:

(signed) height ~ $d_{ij}$
width = $\sqrt{e_{ij}}$

    width = sqrt(e_ij)
  +---------------------+
  |                     |
  | area = f_ij - e_ij  |
  |                     |  height = d_ij = (f_ij - e_ij) / sqrt(e_ij)
  |                     |
  +---------------------+

The rectangles for each row in the table are positioned relative to a baseline representing independence ($d_{ij} = 0$), shown by a dotted line. Cells with observed > expected frequency rise above the line (and are colored black); cells that contain less than the expected frequency fall below it (and are shaded red).

Figure 11: Association plot for hair-color, eye-color
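The geometry of each box can be sketched numerically for the hair-eye data. The following Python fragment is illustrative only (the function name `box` is an assumption, not part of the original notes); it computes the width and signed height of every box and confirms that their product is the difference between observed and expected frequency.

```python
import math

# Hair-color / eye-color counts from Table 1 (rows: eye color, columns: hair color).
observed = [
    [68, 119, 26, 7],    # Brown eyes
    [20, 84, 17, 94],    # Blue eyes
    [15, 54, 14, 10],    # Hazel eyes
    [5, 29, 14, 16],     # Green eyes
]
eye = ["Brown", "Blue", "Hazel", "Green"]
hair = ["Black", "Brown", "Red", "Blond"]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)


def box(i, j):
    """Return (width, signed height) of the association-plot box for cell (i, j)."""
    e = row_totals[i] * col_totals[j] / n      # expected frequency e_ij
    width = math.sqrt(e)                       # width = sqrt(e_ij)
    height = (observed[i][j] - e) / width      # height = d_ij
    return width, height


for i, eye_color in enumerate(eye):
    for j, hair_color in enumerate(hair):
        width, height = box(i, j)
        print(f"{eye_color:>5} / {hair_color:<5} width={width:6.2f} "
              f"height={height:7.2f} area={width * height:7.2f}")

# The squared heights add up to the Pearson chi-square.
chi_square = sum(box(i, j)[1] ** 2 for i in range(4) for j in range(4))
```

Boxes with positive height (for example blue eyes with blond hair) rise above the baseline; those with negative height (brown eyes with blond hair) fall below it.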

Portraying agreement: Observer Agreement chart

Inter-observer agreement is often used as a method of assessing the reliability of a subjective classification or assessment procedure. For example, two (or more) clinical psychologists might classify patients on a scale with categories: normal, mildly impaired, severely impaired.

Measuring agreement

Strength of agreement vs. strength of association: Observers' ratings can be strongly associated without strong agreement.
Marginal homogeneity: If observers tend to use the categories with different frequency, this will affect measures of agreement.

Cohen's Kappa

A commonly used measure of agreement, Cohen's kappa ($\kappa$), compares the observed agreement, $P_o = \sum_i p_{ii}$, to the agreement expected by chance if the two observers' ratings were independent, $P_c = \sum_i p_{i+} p_{+i}$:

$$ \kappa = \frac{P_o - P_c}{1 - P_c} \qquad (4) $$

For perfect agreement, $\kappa = 1$.
The minimum value of $\kappa$ is negative, and its lower bound depends on the marginal totals.
Unweighted kappa only counts strict agreement (same category assigned by both observers). A weighted version of kappa is used when one wishes to allow for partial agreement. For example, exact agreements might be given full weight, one-category difference given weight 1/2. (This makes sense only when the categories are ordered, as in severity of diagnosis.)

Example

The table below (from Agresti, 1990) summarizes responses of 91 married couples to a questionnaire item,

Sex is fun for me and my partner (a) Never or occasionally, (b) fairly often, (c) very often, (d) almost always.

              |-------- Wife's Rating -------|
Husband's     Never   Fairly     Very   Almost
Rating          fun    often    Often   always       SUM

Never fun         7        7        2        3        19
Fairly often      2        8        3        7        20
Very often        1        5        4        9        19
Almost always     2        8        9       14        33

SUM              12       28       18       33        91

Unweighted kappa gives the following results

 Observed and Expected Agreement (under independence)

Observed agreement           0.3626
Expected agreement           0.2680

Cohen's Kappa (Std. Error)   0.1293  (0.1343)
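The observed agreement, expected agreement, and kappa above can be reproduced directly from the table, using the usual definition kappa = (P_o - P_c) / (1 - P_c). The following is a minimal Python sketch with illustrative variable names; it does not attempt the standard error.

```python
# Husband (rows) by wife (columns) ratings for the 91 couples.
table = [
    [7, 7, 2, 3],     # Never fun
    [2, 8, 3, 7],     # Fairly often
    [1, 5, 4, 9],     # Very often
    [2, 8, 9, 14],    # Almost always
]

k = len(table)
n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Observed agreement: proportion of couples on the main diagonal.
p_observed = sum(table[i][i] for i in range(k)) / n

# Chance agreement if the two ratings were independent.
p_chance = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2

kappa = (p_observed - p_chance) / (1 - p_chance)

print(f"observed = {p_observed:.4f}, expected = {p_chance:.4f}, kappa = {kappa:.4f}")
```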

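Two commonly used patterns of weights are those based on integer spacing of the category scale, $w_{ij} = 1 - |i - j| / (k - 1)$, and Fleiss-Cohen weights, $w_{ij} = 1 - (i - j)^2 / (k - 1)^2$, where k is the number of categories.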

       Integer Weights                 Fleiss-Cohen Weights
   1     2/3     1/3       0          1     8/9     5/9      0
 2/3       1     2/3     1/3        8/9       1     8/9    5/9
 1/3     2/3       1     2/3        5/9     8/9       1    8/9
   0     1/3     2/3       1          0     5/9     8/9      1

These weights give a somewhat higher assessment of agreement (perhaps too high).

                    Obs    Exp              Std     Lower     Upper
                    Agree  Agree   Kappa    Error    95%       95%

 Unweighted         0.363  0.268   0.1293   0.134   -0.1339   0.3926
 Integer Weights    0.635  0.560   0.1701   0.065    0.0423   0.2978
 Fleiss-Cohen Wts   0.814  0.722   0.3320   0.125    0.0861   0.5780
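Both weight patterns, and the weighted kappa they lead to, can be sketched in a few lines of Python. The function names below are illustrative assumptions, and the weighted kappa is computed as (P_o - P_c) / (1 - P_c) with weighted observed and chance agreement. For the Fleiss-Cohen weights this should give about 0.332, as tabulated; for the integer weights it gives about 0.237, which is the Weighted Kappa value PROC FREQ reports for these data rather than the 0.1701 shown in the table.

```python
from fractions import Fraction

# Husband (rows) by wife (columns) ratings for the 91 couples.
table = [
    [7, 7, 2, 3],
    [2, 8, 3, 7],
    [1, 5, 4, 9],
    [2, 8, 9, 14],
]


def integer_weights(k):
    """Weights based on integer spacing: 1 - |i - j| / (k - 1)."""
    return [[1 - Fraction(abs(i - j), k - 1) for j in range(k)] for i in range(k)]


def fleiss_cohen_weights(k):
    """Fleiss-Cohen weights: 1 - (i - j)^2 / (k - 1)^2."""
    return [[1 - Fraction((i - j) ** 2, (k - 1) ** 2) for j in range(k)]
            for i in range(k)]


def weighted_kappa(table, weights):
    """Weighted kappa from a square table of counts and a matrix of weights."""
    k = len(table)
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    p_obs = sum(weights[i][j] * table[i][j]
                for i in range(k) for j in range(k)) / n
    p_exp = sum(weights[i][j] * rows[i] * cols[j]
                for i in range(k) for j in range(k)) / n ** 2
    return float((p_obs - p_exp) / (1 - p_exp))


kappa_integer = weighted_kappa(table, integer_weights(4))
kappa_fleiss_cohen = weighted_kappa(table, fleiss_cohen_weights(4))
print(f"integer weights: {kappa_integer:.4f}")
print(f"Fleiss-Cohen weights: {kappa_fleiss_cohen:.4f}")
```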

Computing Kappa with SAS

In Version 6.10, PROC FREQ provides the kappa statistic with the AGREE option, as shown in the following example.

title 'Kappa for Agreement';
data fun;
   label husband = 'Husband rating'
         wife    = 'Wife Rating';
   /* read the 4 x 4 table row by row: one count per (husband, wife) cell */
   do husband = 1 to 4;
      do wife = 1 to 4;
         input count @@;
         output;
      end;
   end;
cards;
 7     7     2      3
 2     8     3      7
 1     5     4      9
 2     8     9     14
;
proc freq data=fun;
   weight count;                            /* cell frequencies */
   tables husband * wife / noprint agree;   /* AGREE requests kappa */
run;

This produces the following output:

+-------------------------------------------------------------------+
|                                                                   |
|                       Kappa for Agreement                         |
|             STATISTICS FOR TABLE OF HUSBAND BY WIFE               |
|                                                                   |
|                         Test of Symmetry                          |
|                         ----------------                          |
|      Statistic = 3.878        DF = 6        Prob = 0.693          |
|                                                                   |
|                        Kappa Coefficients                         |
|      Statistic        Value     ASE   95% Confidence Bounds       |
|      ------------------------------------------------------       |
|      Simple Kappa     0.129   0.069      -0.005    0.264          |
|      Weighted Kappa   0.237   0.078       0.084    0.391          |
|                                                                   |
|      Sample Size = 91                                             |
|                                                                   |
+-------------------------------------------------------------------+

Bangdiwala's Observer Agreement Chart

The observer agreement chart (Bangdiwala, 1987) provides a simple graphic representation of the strength of agreement in a contingency table, and a measure of strength of agreement with an intuitive interpretation.

The agreement chart is constructed as an n x n square, where n is the total sample size. Black squares, each of size $n_{ii} \times n_{ii}$, show observed agreement. These are positioned within larger rectangles, each of size $n_{i+} \times n_{+i}$. The large rectangles show the maximum possible agreement, given the marginal totals. Thus, a visual impression of the strength of agreement is

$$ B_N = \frac{\text{area of dark squares}}{\text{area of rectangles}} = \frac{\sum_{i=1}^{k} n_{ii}^2}{\sum_{i=1}^{k} n_{i+} n_{+i}} \qquad (5) $$

Figure 12: Agreement chart for husbands' and wives' ratings of sexual fun. The $B_N$ measure is the ratio of the areas of the dark squares to their enclosing rectangles, counting only exact agreement. $B_N = 0.146$ for these data.
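The value of B_N quoted in the caption can be checked with a short calculation, taking B_N as the total area of the dark squares (n_ii squared) divided by the total area of the enclosing rectangles (n_i+ times n_+i). This is a minimal Python sketch with illustrative names, covering exact agreement only.

```python
# Husband (rows) by wife (columns) ratings for the 91 couples.
table = [
    [7, 7, 2, 3],
    [2, 8, 3, 7],
    [1, 5, 4, 9],
    [2, 8, 9, 14],
]

k = len(table)
row_totals = [sum(row) for row in table]          # n_i+
col_totals = [sum(col) for col in zip(*table)]    # n_+i

# Area of the dark squares: n_ii x n_ii for each category.
square_area = sum(table[i][i] ** 2 for i in range(k))

# Area of the enclosing rectangles: n_i+ x n_+i for each category.
rectangle_area = sum(row_totals[i] * col_totals[i] for i in range(k))

b_n = square_area / rectangle_area
print(f"B_N = {b_n:.3f}")
```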

Partial agreement

Partial agreement is allowed by including a weighted contribution from off-diagonal cells, b steps from the main diagonal.

$$
\begin{array}{ccccc}
 & & n_{i-b,i} & & \\
 & & \vdots & & \\
n_{i,i-b} & \cdots & n_{ii} & \cdots & n_{i,i+b} \\
 & & \vdots & & \\
 & & n_{i+b,i} & &
\end{array}
\qquad
\begin{array}{ccccc}
 & & w_2 & & \\
 & & w_1 & & \\
w_2 & w_1 & 1 & w_1 & w_2 \\
 & & w_1 & & \\
 & & w_2 & &
\end{array}
$$

This is incorporated in the agreement chart by successively lighter shaded rectangles whose size is proportional to the sum of the cell frequencies, denoted $A_{bi}$, shown schematically above. $A_{1i}$ allows 1-step disagreements, $A_{2i}$ includes 2-step disagreements, etc. From this, one can define a weighted measure of agreement, analogous to weighted kappa.

$$ B_N^w = \frac{\text{weighted sum of areas of agreement}}{\text{area of rectangles}} = 1 - \frac{\sum_{i=1}^{k} \left[ n_{i+} n_{+i} - n_{ii}^2 - \sum_{b=1}^{q} w_b A_{bi} \right]}{\sum_{i=1}^{k} n_{i+} n_{+i}} $$

where $w_b$ is the weight for $A_{bi}$, the shaded area b steps away from the main diagonal, and q is the furthest level of partial disagreement to be considered.

Figure 13: Weighted agreement chart. The $B_N^w$ measure is the ratio of the areas of the dark squares to their enclosing rectangles, weighting cells one step removed from exact agreement with $w_1 = 8/9 = .889$. $B_N^w = 0.628$ for these data.

Observer bias

With an ordered scale, it may happen that one observer consistently tends to classify the objects into higher or lower categories than the other. This produces differences in the marginal totals, $n_{i+}$ and $n_{+i}$. While special tests exist for marginal homogeneity, the observer agreement chart shows this directly by the relation of the dark squares to the diagonal line: when the marginal totals are the same, the squares fall along the diagonal.

Example

The table below shows the classification of 69 New Orleans patients regarding multiple sclerosis diagnosis by neurologists in New Orleans and Winnipeg. The agreement chart shows the two intermediate categories lie largely above the line, indicating that the Winnipeg neurologist tends to classify patients into more severe diagnostic categories.

New Orleans    |------- Winnipeg Neurologist ------|
Neurologist    Certain  Probable  Possible  Doubtful       SUM

 Certain  MS         5         3         0         0         8
 Probable MS         3        11         4         0        18
 Possible MS         2        13         3         4        22
 Doubtful MS         1         2         4        14        21

 SUM                11        29        11        18        69

Figure 14: Weighted agreement chart.

Testing marginal homogeneity

We can test the hypothesis that the marginal totals in the four diagnostic categories are equal for both neurologists using the CATMOD procedure. The following statements read the frequencies, creating a data set ms with variables win_diag and no_diag for the diagnostic categories assigned by the Winnipeg and New Orleans neurologists, respectively. Note that zero frequencies are changed to a small number so that CATMOD will not treat them as structural zeros and eliminate these cells from the table.

title "Classification of Multiple Sclerosis: Marginal Homogeneity";
proc format;
   value diagnos 1='Certain ' 2='Probable'  3='Possible'  4='Doubtful';

data ms;
 format win_diag no_diag diagnos.;
   do win_diag = 1 to 4;
   do no_diag  = 1 to 4;
      input count @@;
      if count=0 then count=1e-10;
      output;
      end; end;
cards;
   5     3     0      0
   3    11     4      0
   2    13     3      4
   1     2     4     14
;

In this analysis the diagnostic categories for the two neurologists are repeated measures, since each patient is rated twice. To test whether the marginal frequencies of ratings are the same, we specify response marginals. (The oneway option displays the marginal frequencies, not shown here.)

title "Classification of Multiple Sclerosis: Marginal Homogeneity";
proc catmod data=ms;
   weight count;
   response marginals;
   model win_diag * no_diag = _response_ / oneway;
   repeated neuro 2 / _response_= neuro;
run;

The test of marginal homogeneity is the test of NEURO in this model:

+-------------------------------------------------------------------+
|                                                                   |
|                   ANALYSIS-OF-VARIANCE TABLE                      |
|                                                                   |
|       Source                   DF   Chi-Square      Prob          |
|       --------------------------------------------------          |
|       INTERCEPT                 3       222.62    0.0000          |
|       NEURO                     3        10.54    0.0145          |
|                                                                   |
|       RESIDUAL                  0          .       .              |
|                                                                   |
+-------------------------------------------------------------------+
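The NEURO chi-square can be approximated by hand as a Wald statistic on the differences between the two sets of marginal totals. The Python sketch below assumes that the CATMOD test amounts to this statistic for the first three categories (the fourth is redundant), leaves the zero cells as exact zeros rather than 1e-10, and uses illustrative names throughout; under those assumptions it gives a value of about 10.54 on 3 df.

```python
# Multiple sclerosis diagnoses: the 4 x 4 table of counts as entered above.
table = [
    [5, 3, 0, 0],
    [3, 11, 4, 0],
    [2, 13, 3, 4],
    [1, 2, 4, 14],
]

k = len(table)
n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Differences in marginal counts for the first k - 1 categories.
d = [row_totals[i] - col_totals[i] for i in range(k - 1)]

# Estimated covariance matrix of d under multinomial sampling (count scale).
cov = [[0.0] * (k - 1) for _ in range(k - 1)]
for i in range(k - 1):
    for j in range(k - 1):
        if i == j:
            cov[i][j] = row_totals[i] + col_totals[i] - 2 * table[i][i]
        else:
            cov[i][j] = -(table[i][j] + table[j][i])
        cov[i][j] -= d[i] * d[j] / n


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    size = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (m[r][size] - tail) / m[r][r]
    return x


# Wald statistic d' cov^{-1} d, with k - 1 degrees of freedom.
wald = sum(di * xi for di, xi in zip(d, solve(cov, d)))
print(f"chi-square = {wald:.2f} on {k - 1} df")
```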

Because the diagnostic categories are ordered, we can actually obtain a more powerful test by assigning scores to the diagnostic category and testing if the mean scores are the same for both neurologists. To do this, we specify response means.

title2 'Testing means';
proc catmod data=ms order=data;
   weight count;
   response means;
   model win_diag * no_diag = _response_;
   repeated neuro 2 / _response_= neuro;
run;

+-------------------------------------------------------------------+
|                                                                   |
|                   ANALYSIS-OF-VARIANCE TABLE                      |
|                                                                   |
|       Source                   DF   Chi-Square      Prob          |
|       --------------------------------------------------          |
|       INTERCEPT                 1       570.61    0.0000          |
|       NEURO                     1         7.97    0.0048          |
|                                                                   |
|       RESIDUAL                  0          .       .              |
|                                                                   |
+-------------------------------------------------------------------+
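The test of mean scores has a simple closed form, because it reduces to a one-sample Wald test on the difference between the two scores assigned to each patient. The Python sketch below assumes integer scores 1 to 4 for the four categories and the multinomial (divisor n) variance estimate; names are illustrative. Under those assumptions it gives about 7.97 on 1 df.

```python
# Multiple sclerosis diagnoses: the 4 x 4 table of counts as entered above.
table = [
    [5, 3, 0, 0],
    [3, 11, 4, 0],
    [2, 13, 3, 4],
    [1, 2, 4, 14],
]
scores = [1, 2, 3, 4]   # assumed scores: Certain, Probable, Possible, Doubtful

k = len(table)
n = sum(sum(row) for row in table)

# Sum and sum of squares of the score difference (row score - column score).
sum_diff = sum(table[i][j] * (scores[i] - scores[j])
               for i in range(k) for j in range(k))
sum_diff_sq = sum(table[i][j] * (scores[i] - scores[j]) ** 2
                  for i in range(k) for j in range(k))

mean_diff = sum_diff / n                                  # difference in mean scores
var_mean_diff = (sum_diff_sq / n - mean_diff ** 2) / n    # its estimated variance

wald = mean_diff ** 2 / var_mean_diff
print(f"mean difference = {mean_diff:.3f}, chi-square = {wald:.2f} on 1 df")
```

Because the test uses a single degree of freedom aimed at a shift in the ordered scale, it can detect a consistent bias that the 3 df test of marginal homogeneity spreads across all categories.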

Four-fold display for 2 x 2 tables

For a 2 x 2 table, the departure from independence can be measured by the sample odds ratio, $\theta = (f_{11} / f_{12}) / (f_{21} / f_{22})$. The four-fold display shows the frequencies in a 2 x 2 table in a way that depicts the odds ratio. In this display the frequency in each cell is shown by a quarter circle whose radius is proportional to $\sqrt{f_{ij}}$, so again area is proportional to count. An association between the variables (odds ratio $\ne 1$) is shown by the tendency of diagonally opposite cells in one direction to differ in size from those in the opposite direction, and we use color and shading to show this direction. If the marginal proportions in the table differ markedly, the table may first be standardized (using iterative proportional fitting) to a table with equal margins but the same odds ratio.
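The standardization step can be sketched as follows. This is an illustrative Python version of iterative proportional fitting, not the program used to draw the figures; the function names are assumptions, and the example counts are the Department A figures quoted later in this section (512 and 313 for men, 89 and 19 for women, taken here as admitted and rejected).

```python
import math


def odds_ratio(t):
    """Odds ratio (f11 / f12) / (f21 / f22) of a 2 x 2 table."""
    return (t[0][0] / t[0][1]) / (t[1][0] / t[1][1])


def standardize(t, target=1.0, tol=1e-10, max_iter=1000):
    """Rescale rows and columns in turn until every margin equals `target`."""
    t = [[float(x) for x in row] for row in t]
    for _ in range(max_iter):
        for row in t:                          # fit the row margins
            total = sum(row)
            row[:] = [x * target / total for x in row]
        for j in range(2):                     # fit the column margins
            total = t[0][j] + t[1][j]
            for i in range(2):
                t[i][j] *= target / total
        if all(abs(sum(row) - target) < tol for row in t):
            break
    return t


dept_a = [[512, 313],    # men: admitted, rejected (assumed layout)
          [89, 19]]      # women: admitted, rejected
standardized = standardize(dept_a)

# Radii of the four quarter circles are proportional to sqrt(frequency).
radii = [[math.sqrt(x) for x in row] for row in standardized]

print("odds ratio before:", round(odds_ratio(dept_a), 4))
print("odds ratio after: ", round(odds_ratio(standardized), 4))
```

Scaling a whole row or column multiplies numerator and denominator of the odds ratio by the same factor, which is why the standardized table keeps the original odds ratio while its margins become equal.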

Figure 15 shows aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973, classified by admission and gender. At issue is whether the data show evidence of sex bias in admission practices (Bickel et al., 1975). The figure shows the cell frequencies numerically, but margins for both sex and admission are equated in the display. For these data the sample odds ratio, Odds(Admit | Male) / Odds(Admit | Female), is 1.84, indicating that the odds of admission for males are almost twice those for females in this sample. The four-fold display shows this imbalance clearly.

Figure 15: Four-fold display for Berkeley admissions. The area of each shaded quadrant shows the frequency, standardized to equate the margins for sex and admission. Circular arcs show the limits of a 99% confidence interval for the odds ratio.

Confidence rings for the odds ratio

The fourfold display is constructed so that the four quadrants will align vertically and horizontally when the odds ratio is 1. Confidence rings for the observed odds ratio provide a visual test of the hypothesis of no association ($H_0 : \theta = 1$). They have the property that rings for adjacent quadrants overlap if and only if the observed counts are consistent with this null hypothesis.

The 99% confidence rings in Figure 15 do not overlap, indicating a significant association between sex and admission. The width of the confidence rings gives a visual indication of the precision of the data.

2 x 2 x k tables

In a 2 x 2 x k table, the last dimension often corresponds to "strata" or populations, and it is typically of interest to see if the association between the first two variables is homogeneous across strata. For such tables, simply make one fourfold panel for each stratum. The standardization of marginal frequencies is designed to allow easy visual comparison of the pattern of association across two or more populations.

The admissions data shown in Figure 15 were obtained from six departments, so to determine the source of the apparent sex bias in favor of males, we make a new plot, Figure 16, stratified by department.

Surprisingly, Figure 16 shows that, for five of the six departments, the odds of admission are approximately the same for men and women applicants. Department A appears to differ from the others, with the odds of admission for women approximately 2.86 (= (313/19) / (512/89)) times those for men. This appearance is confirmed by the confidence rings, which in Figure 16 are joint 99% intervals for $\theta_c$, $c = 1, \ldots, k$.

Figure 16: Fourfold display of Berkeley admissions, by department. In each panel the confidence rings for adjacent quadrants overlap if the odds ratio for admission and sex does not differ significantly from 1. The data in each panel have been standardized as in Figure 15.
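The Department A odds ratio, and a rough interval for it, can be computed as below. This Python sketch uses the ordinary large-sample (Wald) interval on the log odds ratio for a single table, which is an assumption made here for illustration; it is not the joint interval construction used for the rings in Figure 16. The counts are those quoted in the text, read as 89 women admitted, 19 rejected, 512 men admitted, and 313 rejected.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.99):
    """Odds ratio (a / b) / (c / d) with a Wald interval on the log scale."""
    theta = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # std. error of log(theta)
    z = NormalDist().inv_cdf(0.5 + level / 2)           # two-sided normal quantile
    lower = math.exp(math.log(theta) - z * se_log)
    upper = math.exp(math.log(theta) + z * se_log)
    return theta, lower, upper


# Department A: women admitted, women rejected, men admitted, men rejected.
theta, lower, upper = odds_ratio_ci(89, 19, 512, 313)
print(f"odds ratio = {theta:.2f}, 99% interval = ({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 corresponds to confidence rings that fail to overlap in the display; a joint interval for all six departments would be somewhat wider than this single-table one.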

(This result, which contradicts the display for the aggregate data in Figure 15, is a classic example of Simpson's paradox. The resolution of this contradiction can be found in the large differences in admission rates among departments as we shall see later.)
