The same dataset format is used for data entered in a form, for
uploaded files, and for the sample datasets.
The input data for Mosaics is a multiway frequency table, structured as follows:
- If there are K factors (classification variables), with
n1, n2, ..., nK levels, the input data set must contain
n1 * n2 * ... * nK
observations, one for each cell of the table. There can be no missing
cells, but there may be cells with 0 frequency.
- The frequency in each cell is contained in a variable named COUNT.
- The levels (cell values) of the K factor variables may be
character (up to 8 characters)
or numeric. Character values cannot contain embedded spaces.
- Each data line contains K+1 fields, which may be separated by
commas and/or blanks or tabs.
- The names of the K+1 variables (which must include COUNT)
should appear before the data lines on a line starting with
VAR =.
- An optional title can appear before the data lines on a line starting with
TITLE:
- Any blank lines or lines starting with either # or * are ignored.
Here is an example for a two-way (4 x 4) table of frequencies of people classified
by hair color and eye color:
# Data from Snee 1974
TITLE: HairEye Data
VAR= HAIR EYE COUNT
Black Brown 68
Brown Brown 119
Red Brown 26
Blond Brown 7
Black Blue 20
Brown Blue 84
Red Blue 17
Blond Blue 94
Black Hazel 15
Brown Hazel 54
Red Hazel 14
Blond Hazel 10
Black Green 5
Brown Green 29
Red Green 14
Blond Green 16
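The format rules above can be sketched as a small reader. This is only an illustration, not the code Mosaics itself uses: the function names parse_mosaic_data and check_complete are hypothetical, the keywords are assumed to be matched exactly as written (upper case), and the sample is a 2 x 2 subset of the rows shown above.

```python
import re

def parse_mosaic_data(text):
    """Parse the frequency-table format; return (title, names, rows)."""
    title, names, rows = None, None, []
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines starting with # or * are ignored.
        if not line or line[0] in "#*":
            continue
        if line.startswith("TITLE:"):
            title = line[len("TITLE:"):].strip()
            continue
        m = re.match(r"VAR\s*=\s*(.*)", line)
        if m:
            names = [n for n in re.split(r"[,\s]+", m.group(1)) if n]
            if "COUNT" not in names:
                raise ValueError("the VAR = line must include COUNT")
            continue
        if names is None:
            raise ValueError("data line found before the VAR = line")
        # Fields may be separated by commas and/or blanks or tabs.
        fields = [f for f in re.split(r"[,\s]+", line) if f]
        if len(fields) != len(names):
            raise ValueError(f"expected {len(names)} fields: {line!r}")
        row = dict(zip(names, fields))
        row["COUNT"] = float(row["COUNT"])
        rows.append(row)
    return title, names, rows

def check_complete(names, rows):
    """True if there is exactly one row for every cell of the table."""
    factors = [n for n in names if n != "COUNT"]
    expected = 1
    for f in factors:
        expected *= len({row[f] for row in rows})
    cells = {tuple(row[f] for f in factors) for row in rows}
    return len(rows) == expected and len(cells) == expected

sample = """\
# Data from Snee 1974
TITLE: HairEye Data
VAR= HAIR EYE COUNT
Black Brown 68
Brown Brown 119
Black Blue 20
Brown Blue 84
"""
title, names, rows = parse_mosaic_data(sample)
print(title, names, len(rows))
print(check_complete(names, rows))
```

The completeness check compares the number of rows with the product of the numbers of levels, which is the "no missing cells" requirement stated in the first rule.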
The order of the columns is irrelevant. The factor variables are considered
ordered by the order of the rows (cells) in the table rather than by the order of
the columns (variables). This is the order in which the variables are entered
into the mosaic display.
You can reorder the variables using the
Variable Order option in the Analysis Options panel.
The factor variables are ordered so that:
- The factor which varies most rapidly is the first variable.
- The factor which varies least rapidly is the last variable.
In the Hair-color, eye-color data, therefore, HAIR is the first factor and
EYE is the second factor. The result would be the same if the first two columns
of the data table were interchanged.
However, sorting the rows of the table so that EYE color varied most rapidly
would make the variables ordered EYE then HAIR.
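One way to see which ordering a given set of rows implies is to count how often each factor changes value from one row to the next. The sketch below is an assumption about how such a check could be written, not a description of how Mosaics determines the order; variable_order is a hypothetical helper.

```python
def variable_order(factors, rows):
    """Order factors from most to least rapidly varying across the rows."""
    changes = {f: 0 for f in factors}
    for prev, cur in zip(rows, rows[1:]):
        for f in factors:
            if prev[f] != cur[f]:
                changes[f] += 1
    # The factor that changes most often between rows comes first.
    return sorted(factors, key=lambda f: changes[f], reverse=True)

hair = ["Black", "Brown", "Red", "Blond"]
eye = ["Brown", "Blue", "Hazel", "Green"]

# Rows as in the example above: HAIR cycles within each EYE color.
rows = [{"HAIR": h, "EYE": e} for e in eye for h in hair]
print(variable_order(["EYE", "HAIR"], rows))      # ['HAIR', 'EYE']

# Rows sorted so that EYE cycles within each HAIR color.
rows_by_eye = [{"HAIR": h, "EYE": e} for h in hair for e in eye]
print(variable_order(["HAIR", "EYE"], rows_by_eye))  # ['EYE', 'HAIR']
```

The column order passed in does not affect the result in either case, which matches the statement that interchanging the columns changes nothing.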
The variable ordering is closely tied to the sequence of models fit by
Mosaics. When one variable is a response
and the other variables are considered
explanatory, it usually makes sense
for the response variable to be last in the variable ordering.
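If the rows of a table are not already in the desired order, they can be re-sorted before the data is submitted. The following is a minimal sketch under stated assumptions: sort_rows is a hypothetical helper, the factor names GROUP and OUTCOME and their levels are invented for illustration, and the COUNT values are omitted because only the row order matters here.

```python
def sort_rows(rows, order):
    """Re-sort rows so order[0] varies most rapidly and order[-1] least."""
    # Levels keep their order of first appearance in the data.
    levels = {f: [] for f in order}
    for row in rows:
        for f in order:
            if row[f] not in levels[f]:
                levels[f].append(row[f])

    # Sort key: slowest-varying factor first, fastest-varying factor last.
    def key(row):
        return tuple(levels[f].index(row[f]) for f in reversed(order))

    return sorted(rows, key=key)

# Hypothetical table in which the response OUTCOME varies most rapidly.
rows = [
    {"OUTCOME": "yes", "GROUP": "a"},
    {"OUTCOME": "no", "GROUP": "a"},
    {"OUTCOME": "yes", "GROUP": "b"},
    {"OUTCOME": "no", "GROUP": "b"},
]

# Put the response last in the variable ordering: GROUP, then OUTCOME.
resorted = sort_rows(rows, ["GROUP", "OUTCOME"])
for row in resorted:
    print(row["GROUP"], row["OUTCOME"])
```

After sorting, GROUP cycles within each level of OUTCOME, so OUTCOME is the least rapidly varying factor and therefore the last variable.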
The sample datasets use the following variable orderings:
- Hair-color, Eye-color and Gender
A three-way (4 x 4 x 2) table. The variables are ordered: HAIR, EYE, SEX.
- Berkeley Admissions Data
A three-way (2 x 2 x 6) table. The variables are ordered: DEPT, GENDER, ADMIT,
so Admission is regarded as the response, with Dept and Gender as
explanatory variables.
- Gender, Occupation and Heart Disease
A three-way (2 x 3 x 2) table. The variables are ordered: Gender, Occup, Heart.
- Divorce Data
Effects of pre- and extramarital sexual activity on divorce [Agresti, Table 7.3].
A four-way (2 x 2 x 2 x 2) table. The variables are ordered: Gender, PreSex,
ExtraSex, and Marital status.
- Abortion Opinion Data
A four-way (2 x 2 x 6 x 2) table. The variables are ordered: Race, Sex, AgeGp,
and Opinion.