Mosaics: Input Data

The same dataset format is used for data entered in a form, for uploaded files, and for the sample datasets.

Dataset Format

The input data for Mosaics is a multiway frequency table, structured as follows:

If there are K factors (classification variables), with n₁, n₂, ..., n_K levels respectively, the input dataset must contain n₁ * n₂ * ... * n_K observations, one for each cell of the table. There can be no missing cells, but there may be cells with 0 frequency.
The frequency in each cell is contained in a variable named COUNT.
The levels (cell values) of the K factor variables may be character (up to 8 characters) or numeric. Character values cannot contain embedded spaces.
Each data line contains K+1 fields, which may be separated by commas and/or blanks or tabs.
The names of the K+1 variables (which must include COUNT) should appear before the data lines on a line starting with VAR=.
An optional title can appear before the data lines on a line starting with TITLE:
Any blank lines or lines starting with either # or * are ignored.
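
The rules above can be sketched as a small reader. This is only an illustration of the format, not the code Mosaics itself uses: the function name parse_mosaic_input is hypothetical, and the sketch assumes that counts are integers and that optional spaces around the = sign on the VAR line are tolerated.

```python
import re
from math import prod

def parse_mosaic_input(text):
    """Parse Mosaics-style input text into a title, factor names, and cells."""
    title, names, cells = None, None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#*":
            continue  # blank lines and lines starting with # or * are ignored
        if line.startswith("TITLE:"):
            title = line[len("TITLE:"):].strip()
            continue
        var_line = re.match(r"VAR\s*=\s*(.*)", line)
        if var_line:
            names = re.split(r"[,\s]+", var_line.group(1).strip())
            continue
        if names is None:
            raise ValueError("data line found before the VAR= line")
        # Fields may be separated by commas and/or blanks or tabs.
        fields = re.split(r"[,\s]+", line)
        if len(fields) != len(names):
            raise ValueError(f"expected {len(names)} fields, got {len(fields)}: {line!r}")
        cells.append(dict(zip(names, fields)))

    if names is None or "COUNT" not in names:
        raise ValueError("the VAR= line must name a COUNT variable")
    factors = [n for n in names if n != "COUNT"]
    for cell in cells:
        cell["COUNT"] = int(cell["COUNT"])  # assumption: integer frequencies

    # Levels of each factor, in order of first appearance.
    levels = {f: list(dict.fromkeys(c[f] for c in cells)) for f in factors}
    # One row per cell: the row count must equal the product of the level counts,
    # and no combination of levels may be repeated or missing.
    expected = prod(len(v) for v in levels.values())
    distinct = {tuple(c[f] for f in factors) for c in cells}
    if len(cells) != expected or len(distinct) != expected:
        raise ValueError(f"expected {expected} distinct cells, got {len(cells)} rows")
    return {"title": title, "factors": factors, "levels": levels, "cells": cells}
```

Calling parse_mosaic_input on the text of a dataset returns the title, the factor names in column order, the levels of each factor, and one dictionary per cell. A table with a missing cell raises ValueError.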

Example

Here is an example for a two-way (4 x 4) table of frequencies of people classified by hair color and eye color:

# Data from Snee 1974
TITLE: HairEye Data
VAR= HAIR  EYE     COUNT

Black    Brown      68
Brown    Brown     119
Red      Brown      26
Blond    Brown       7
Black    Blue       20
Brown    Blue       84
Red      Blue       17
Blond    Blue       94
Black    Hazel      15
Brown    Hazel      54
Red      Hazel      14
Blond    Hazel      10
Black    Green       5
Brown    Green      29
Red      Green      14
Blond    Green      16

The order of the columns is irrelevant, but the order of the rows defines the variable ordering, as explained below.

Variable ordering

The factor variables in the data table are ordered by the order of the rows (cells) in the table, not by the order of the columns (variables). This is the order in which the variables are entered into the mosaic display. You can reorder the variables using the Variable Order option in the Analysis Options panel.

The factor variables are ordered so that:

The factor which varies most rapidly is the first variable.
The factor which varies least rapidly is the last variable.

In the hair-color, eye-color data, therefore, HAIR is the first factor and EYE is the second factor. The result would be the same if the first two columns of the data table were interchanged. However, sorting the rows of the table so that EYE color varies most rapidly would order the variables EYE, then HAIR.
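
One way to see this rule in action is to count how often each factor changes value from one row to the next: the factor that changes most often varies most rapidly. The sketch below is a heuristic written for illustration, not a description of how Mosaics determines the order internally. The function name variable_order is hypothetical. The levels are those of the hair-color, eye-color example.

```python
def variable_order(rows, factors):
    """Rank factors from fastest-varying to slowest-varying across the rows."""
    changes = {f: 0 for f in factors}
    # Count value changes between each pair of consecutive rows.
    for prev, cur in zip(rows, rows[1:]):
        for f in factors:
            if prev[f] != cur[f]:
                changes[f] += 1
    return sorted(factors, key=lambda f: -changes[f])

hair = ["Black", "Brown", "Red", "Blond"]
eye = ["Brown", "Blue", "Hazel", "Green"]

# Rows in the order of the example: HAIR cycles within each EYE level.
rows = [{"HAIR": h, "EYE": e} for e in eye for h in hair]
print(variable_order(rows, ["EYE", "HAIR"]))  # ['HAIR', 'EYE']
```

The factors are passed as EYE, HAIR to show that the column order does not matter: only the row order decides the result.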

The variable ordering is closely tied to the sequence of models fit by Mosaics. When one variable is a response and the other variables are considered explanatory, it usually makes sense for the response variable to be last in the variable ordering.
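
If you prefer to fix the ordering in the data file itself rather than through the Variable Order option, the rows can be sorted before upload so that the intended last variable (for example, the response) varies least rapidly. The following is a minimal sketch under that assumption. The function name reorder_rows is hypothetical, and levels keep the order in which they first appear.

```python
def reorder_rows(rows, order):
    """Sort rows so that order[0] varies most rapidly and order[-1] least rapidly."""
    # Rank each level by first appearance so the original level order is kept.
    rank = {f: {} for f in order}
    for row in rows:
        for f in order:
            rank[f].setdefault(row[f], len(rank[f]))
    # The slowest-varying factor is the most significant sort key.
    return sorted(rows, key=lambda r: tuple(rank[f][r[f]] for f in reversed(order)))

hair = ["Black", "Brown", "Red", "Blond"]
eye = ["Brown", "Blue", "Hazel", "Green"]
rows = [{"HAIR": h, "EYE": e} for e in eye for h in hair]  # HAIR varies fastest

# Make EYE the first variable and HAIR the last.
reordered = reorder_rows(rows, ["EYE", "HAIR"])
print([(r["EYE"], r["HAIR"]) for r in reordered[:4]])
# [('Brown', 'Black'), ('Blue', 'Black'), ('Hazel', 'Black'), ('Green', 'Black')]
```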

Other Sample data sets

Hair-color, Eye-color and Gender: A three-way (4 x 4 x 2) table. The variables are ordered: HAIR, EYE, SEX.
Berkeley Admissions Data: A three-way (2 x 2 x 6) table. The variables are ordered: DEPT, GENDER, ADMIT, so Admission is regarded as the response, with Dept and Gender as explanatory variables.
Gender, Occupation and Heart Disease: A three-way (2 x 3 x 2) table. The variables are ordered: Gender, Occup, Heart.
Divorce Data: Effects of pre- and extramarital sexual activity on divorce [Agresti, Table 7.3]. A four-way (2 x 2 x 2 x 2) table. The variables are ordered: Gender, PreSex, ExtraSex, and Marital status.
Abortion Opinion Data: A four-way (2 x 2 x 6 x 2) table. The variables are ordered: Race, Sex, AgeGp, and Opinion.
