SAS EDA Tools
The main links to the SAS macro programs on the SCS Web Site are:
A subset of the most useful of these for EDA is available in a zip
file, edatools.zip (19K).
Most of the macro programs have external documentation and examples
available on the web. A few, marked '[internal documentation only]',
have this information in the form of descriptive comments at the
beginning of the program.
Univariate
-
boxplot macro
- Produces standard and notched boxplots
for a single response variable with one or more grouping variables.
-
datachk macro
- The datachk macro performs basic data screening/checking
on numeric variables in a dataset, and is designed to give a compact
overview of many variables.
-
nqplot macro
- Produces theoretical normal quantile-quantile
(Q-Q) plots for a single variable.
Options provide a classical (mu, sigma) or
robust (median, IQR) comparison line, standard error envelope,
and a detrended plot.
-
splot macro
- Draws low-res (printer) schematic plots (boxplots) for one or more variables.
-
symbox macro
- Displays side-by-side boxplots of a single variable raised to various powers, as
an aid to finding a power transformation to symmetry.
-
symplot macro
- Produces a variety of diagnostic
plots for assessing symmetry of a data distribution and
finding a power transformation to make the data more symmetric.
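The idea behind the symbox and symplot entries, trying a ladder of powers and checking which one leaves the data most symmetric, can be sketched numerically. The Python below is an illustrative sketch only, not the macros' implementation: the function names, the default set of powers, and the choice of quartile (Bowley) skewness as the symmetry measure are all assumptions made for this example.

```python
import math
import statistics


def power_transform(x, p):
    """Power family (x**p - 1) / p, with log(x) at p == 0 (assumed form)."""
    if x <= 0:
        raise ValueError("this sketch requires strictly positive data")
    return math.log(x) if p == 0 else (x ** p - 1.0) / p


def quartile_skewness(values):
    """Bowley skewness: 0 for a symmetric batch, > 0 for a long right tail."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)


def most_symmetric_power(values, powers=(-1, -0.5, 0, 0.5, 1)):
    """Return the candidate power whose transform has the smallest |skewness|."""
    def asymmetry(p):
        return abs(quartile_skewness([power_transform(v, p) for v in values]))
    return min(powers, key=asymmetry)
```

A graphical display such as side-by-side boxplots shows much more than a single skewness number (outliers and tail behavior in particular), so this is best read as a numeric companion to the plots rather than a substitute for them.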
Bivariate
-
contour macro
- Plots a bivariate scatterplot with a bivariate
data ellipse for one or more groups with one or more confidence
coefficients.
-
lowess macro
- Performs robust, locally weighted scatterplot smoothing
(Cleveland, 1979).
-
resline macro
- Fits a resistant line to X-Y data and determines transformations to make the relation linear.
-
sunplot macro
- Sunflower plot for X-Y data.
The sunflower plot displays a bivariate dataset using "sunflower
symbols" to show the number of observations in the neighborhood
of each XY point.
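For readers unfamiliar with resistant lines (the resline entry above), here is a minimal Python sketch of the classic three-group median line. It is a simplified, non-iterated version written for illustration and is not taken from the macro: the function name, the equal-thirds split, and the median-of-residuals intercept are assumptions.

```python
import statistics


def resistant_line(xs, ys):
    """Fit y = intercept + slope * x using medians of the outer thirds of the data."""
    pairs = sorted(zip(xs, ys))           # order the points by x
    k = len(pairs) // 3                   # size of the left and right groups
    if k < 1:
        raise ValueError("need at least three points")
    left, right = pairs[:k], pairs[-k:]

    x_left = statistics.median([x for x, _ in left])
    y_left = statistics.median([y for _, y in left])
    x_right = statistics.median([x for x, _ in right])
    y_right = statistics.median([y for _, y in right])
    if x_right == x_left:
        raise ValueError("x values do not spread enough to define a slope")

    slope = (y_right - y_left) / (x_right - x_left)
    # Intercept: median of the residuals from a line through the origin.
    intercept = statistics.median([y - slope * x for x, y in pairs])
    return intercept, slope
```

Because every summary is a median, a single wild point has little effect on the fitted line, which is what makes the fit resistant.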
Multivariate
-
coplot macro
- Constructs a conditioning plot: a set of plots of Y * X | Z,
showing how the relationship between X and Y depends on Z.
-
corrgram macro
- Draws a corrgram -- a schematic plot of a correlation matrix.
Variables are permuted so that
"similar" variables are positioned adjacently, and the cells of the
matrix are shaded or filled to show the correlation value.
-
cqplot macro
- The cqplot macro produces quantile-quantile comparison plots for
multivariate normal data (based on squared Mahalanobis distances
from the centroid) or for other data that
should follow a chi-square distribution, together with
estimated confidence bands.
-
outlier macro
- Detects multivariate outliers.
The OUTLIER macro calculates
robust Mahalanobis distances by iterative multivariate trimming
(Gnanadesikan & Kettenring, 1972; Gnanadesikan, 1977),
and produces a chi-square Q-Q plot.
-
scatmat macro
- Draws a scatterplot matrix for all pairs of
variables.
A classification variable may be used to assign the plotting symbol
and/or color of each point.
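The robust Mahalanobis distances mentioned under the outlier entry can be illustrated with a small Python sketch of trimming. This is a simplified illustration, not the OUTLIER macro's algorithm: it handles only two variables (so the chi-square cutoff with 2 degrees of freedom has the closed form -2 ln(alpha)), and the function names, the `alpha` and `passes` parameters, and the hard keep/drop rule are assumptions.

```python
import math


def mean_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of bivariate points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point, center, cov):
    """Squared Mahalanobis distance of one point, using the 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def trimmed_distances(points, alpha=0.05, passes=2):
    """Squared distances of all points from a center and covariance
    re-estimated after dropping points beyond the chi-square(2) cutoff."""
    cutoff = -2.0 * math.log(alpha)       # upper-alpha quantile, 2 df
    keep = list(points)
    for _ in range(passes):
        center, cov = mean_cov(keep)
        keep = [p for p in points if mahalanobis_sq(p, center, cov) <= cutoff]
    center, cov = mean_cov(keep)
    return [mahalanobis_sq(p, center, cov) for p in points]
```

With `passes=0` the function returns the classical distances. Comparing the two shows why trimming matters: an outlier inflates the covariance estimate and so shrinks its own classical distance.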
Missing data
-
miss macro
[internal documentation only]
- The MISS macro carries out maximum likelihood estimation of
the mean and covariance matrix of the multivariate normal
distribution for incomplete data using the EM algorithm.
It also carries out data augmentation to produce multiple data sets with
randomly imputed values for the missing data.
-
misscomb macro
[internal documentation only]
- The MISSCOMB macro combines information from two or more analyses of multiply
imputed data sets to produce a single set of estimates and associated
statistics.
-
missing macro
[internal documentation only]
- The MISSING macro screens a data set for variables with a large
percentage of missing observations,
and optionally drops variables meeting some criterion of
missingness.
-
missrc macro
- The missrc macro estimates cell probabilities in an n-way table with
ignorable missing data (missing completely at random [MCAR], or missing at
random [MAR]) on the table variables. The results are equivalent to the
use of the EM algorithm.
GLMs
-
boxcox macro
- Finds power transformations of the response
variable in a regression model (PROC REG) by the Box-Cox method,
with graphic display of the maximum likelihood
solution, t-values for model effects, and the
influence of observations on choice of power.
-
boxglm macro
- Finds power transformations of the response
variable in a general linear model (PROC GLM) by the Box-Cox method.
-
boxtid macro
- Finds power transformations of predictor
variables in a general linear model (PROC GLM) by the Box-Tidwell method.
-
inflogis macro
- Produces an influence plot for a logistic
regression model. The plot shows a measure of
badness of fit for a given case (DIFDEV or DIFCHISQ)
vs. the fitted probability (PRED) or leverage (HAT),
using an influence measure (C or CBAR) as the size of
a bubble.
-
inflplot macro
- Produces an influence plot for a regression model
-- a plot of studentized residuals vs. leverage
(hat-value), using Cook's D or DFFITS as the size of
a bubble symbol.
-
meanplot macro
- The meanplot macro produces 1-way, 2-way, or 3-way plots of means for
a factorial design with any number of factor variables.
-
partial macro
- Produces partial regression residual plots.
Observations with high leverage and/or large studentized
residuals can be individually labeled.
-
robust macro
- Robust fitting for linear models (PROC REG, PROC GLM, PROC LOGISTIC)
via iterative re-weighting.
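The Box-Cox method named in the boxcox and boxglm entries chooses the power of the response that maximizes a profile log-likelihood. The Python sketch below shows that calculation for the simplest case, one predictor and a grid of candidate powers. It is an illustration under stated assumptions, not the macros' code: the function names are hypothetical, and the macros' extra displays (t-values for model effects, influence on the choice of power) are not reproduced.

```python
import math


def ols_rss(xs, ys):
    """Residual sum of squares from a simple least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))


def boxcox_loglik(xs, ys, lam):
    """Profile log-likelihood (up to a constant) for power lam; ys must be > 0."""
    n = len(ys)
    z = [math.log(y) if lam == 0 else (y ** lam - 1.0) / lam for y in ys]
    rss = ols_rss(xs, z)
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * math.log(rss / n) + (lam - 1.0) * sum(math.log(y) for y in ys)


def best_lambda(xs, ys, lambdas):
    """Candidate power with the largest profile log-likelihood."""
    return max(lambdas, key=lambda lam: boxcox_loglik(xs, ys, lam))
```

In practice the log-likelihood is usually plotted against the power over a fine grid, with a confidence interval read off the curve, rather than reduced to a single maximizing value.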