SAS EDA Tools

The main links to the SAS macro programs on the SCS Web Site are:

A subset of those most useful for EDA is available as a zip file, edatools.zip (19K).

Most of the macro programs have external documentation and examples available on the web. A few, marked '[internal documentation only]', carry this information only as descriptive comments at the beginning of the macro source.

Univariate

boxplot macro
Produces standard and notched boxplots for a single response variable with one or more grouping variables.
datachk macro
The datachk macro performs basic data screening/checking on numeric variables in a dataset, and is designed to give a compact overview of many variables.
nqplot macro
Produces theoretical normal quantile-quantile (Q-Q) plots for a single variable. Options provide a classical (mu, sigma) or robust (median, IQR) comparison line, a standard error envelope, and a detrended plot.
splot macro
Draws low-res (printer) schematic plots (boxplots) for one or more variables.
symbox macro
Displays side-by-side boxplots of a single variable raised to various powers, as an aid to finding a power transformation to symmetry.
symplot macro
Produces a variety of diagnostic plots for assessing symmetry of a data distribution and finding a power transformation to make the data more symmetric.
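
The idea behind symbox and symplot, trying several powers and keeping the one that leaves the data most symmetric, can be sketched numerically. The following is an illustrative Python sketch, not the macros' own code: the function names, the grid of powers, and the use of quartile skewness as the symmetry measure are all assumptions made for the example.

```python
import math
import statistics

def power_transform(x, p):
    """Ladder-of-powers transform of a positive value; p = 0 means log."""
    if p == 0:
        return math.log(x)
    # Dividing by p keeps the transform increasing for negative powers too.
    return (x ** p - 1) / p

def quartile_skewness(values):
    """Quartile skewness: 0 when the quartiles are symmetric about the median."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

def most_symmetric_power(values, powers=(-1, -0.5, 0, 0.5, 1, 2)):
    """Return the power whose transformed data has skewness closest to 0."""
    skew = {p: quartile_skewness([power_transform(v, p) for v in values])
            for p in powers}
    best = min(powers, key=lambda p: abs(skew[p]))
    return best, skew

# Made-up right-skewed sample: evenly spaced on the log scale.
sample = [math.exp(k / 2) for k in range(1, 20)]
best, skew = most_symmetric_power(sample)
print(best, skew)
```

For this sample the log transform (power 0) is selected, while the raw data (power 1) shows positive skewness and the reciprocal (power -1) overcorrects to negative skewness.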

Bivariate

contour macro
Plots a bivariate scatterplot with a bivariate data ellipse for one or more groups with one or more confidence coefficients.
lowess macro
Performs robust, locally weighted scatterplot smoothing (Cleveland, 1979).
resline macro
Fits a resistant line to X-Y data and determines transformations to make the relation linear.
sunplot macro
Sunflower plot for X-Y data. The sunflower plot displays a bivariate dataset using "sunflower symbols" to show the number of observations in the neighborhood of each XY point.
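
As an illustration of what "resistant" means for the resline entry, here is a minimal Python sketch of a three-group median line. It is a single-pass simplification under stated assumptions, not the macro's algorithm: the function name is hypothetical, there is no iterative slope refinement, and it assumes at least three points whose outer thirds have different median x values.

```python
import statistics

def resistant_line(xs, ys):
    """One-pass three-group resistant line; returns (slope, intercept)."""
    pairs = sorted(zip(xs, ys))
    k = len(pairs) // 3  # size of each outer third
    left, right = pairs[:k], pairs[-k:]
    x_left = statistics.median([x for x, _ in left])
    y_left = statistics.median([y for _, y in left])
    x_right = statistics.median([x for x, _ in right])
    y_right = statistics.median([y for _, y in right])
    # Slope from the medians of the outer thirds, so a few wild
    # points in either third cannot move it.
    slope = (y_right - y_left) / (x_right - x_left)
    # Intercept as the median residual from a zero-intercept line.
    intercept = statistics.median([y - slope * x for x, y in pairs])
    return slope, intercept

# Made-up data on the line y = 2x + 1, with one gross outlier.
xs = list(range(1, 10))
ys = [2 * x + 1 for x in xs]
ys[-1] = 100
slope, intercept = resistant_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

The outlier at the last point leaves both the slope and the intercept unchanged, whereas an ordinary least-squares fit would be pulled toward it.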

Multivariate

coplot macro
Constructs a conditioning plot - plots of Y * X | Z, showing how the relationship between X and Y depends on Z.
corrgram macro
Draws a corrgram -- a schematic plot of a correlation matrix. Variables are permuted so that "similar" variables are positioned adjacently, and cells of a matrix are shaded or filled to show the correlation value.
cqplot macro
The cqplot macro produces quantile-quantile comparison plots for multivariate normal data (based on squared Mahalanobis distances from the centroid) or for other data which should follow a Chi-square distribution, together with estimated confidence bands.
outlier macro
Detects multivariate outliers. The OUTLIER macro calculates robust Mahalanobis distances by iterative multivariate trimming (Gnanadesikan & Kettenring, 1972; Gnanadesikan, 1977), and produces a chi-square Q-Q plot.
scatmat macro
Draws a scatterplot matrix for all pairs of variables. A classification variable may be used to assign the plotting symbol and/or color of each point.
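
The multivariate trimming idea mentioned for the outlier macro can be sketched for the bivariate case. This Python sketch is a simplification under stated assumptions and does not reproduce the macro: the function names and the pvalue and passes defaults are hypothetical, only two variables are handled, and trimming is all-or-nothing. With two variables the chi-square cutoff has the closed form -2 ln(p).

```python
import math

def mean_and_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(point, center, cov):
    """Squared Mahalanobis distance using the explicit 2x2 inverse."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def trimmed_distances(points, pvalue=0.001, passes=2):
    """Recompute center and covariance after dropping far-out points."""
    cutoff = -2 * math.log(pvalue)  # chi-square upper quantile, 2 df
    kept = list(points)
    for _ in range(passes):
        center, cov = mean_and_cov(kept)
        d2 = [mahalanobis_sq(p, center, cov) for p in points]
        kept = [p for p, d in zip(points, d2) if d <= cutoff]
    flagged = [i for i, d in enumerate(d2) if d > cutoff]
    return d2, flagged

# Made-up data: a 5 x 4 grid of points plus one distant observation.
points = [(i % 5, i // 5) for i in range(20)] + [(30, -30)]
d2, flagged = trimmed_distances(points)
print(flagged)
```

On the first pass the distant point inflates the covariance matrix and partly masks itself. Once it is trimmed, its distance from the remaining points becomes very large, which is the reason for iterating.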

Missing data

miss macro [internal documentation only]
The MISS macro carries out maximum likelihood estimation of the mean and covariance matrix of the multivariate normal distribution for incomplete data using the EM algorithm. It also carries out data augmentation to produce multiple data sets with randomly imputed values for the missing data.
misscomb macro [internal documentation only]
The MISSCOMB macro combines information from two or more analyses of multiply imputed data sets to produce a single set of estimates and associated statistics.
missing macro [internal documentation only]
The MISSING macro screens a data set for variables with a large percentage of missing observations, and optionally drops variables meeting some criterion of missingness.
missrc macro
The missrc macro estimates cell probabilities in an n-way table with ignorable missing data (missing completely at random [MCAR], or missing at random [MAR]) on the table variables. The results are equivalent to the use of the EM algorithm.
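
Combining results from multiply imputed data sets, as described for MISSCOMB, is commonly done with Rubin's rules. The Python sketch below shows those rules for a single parameter. Whether MISSCOMB uses exactly these formulas is an assumption, and the function name and the input numbers are made up for illustration.

```python
import statistics

def pool_estimates(estimates, variances):
    """Pool one parameter across m imputed-data analyses (Rubin's rules).

    estimates: the m point estimates
    variances: the m squared standard errors
    Returns (pooled estimate, total variance, pooled standard error).
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total, total ** 0.5

# Hypothetical results from three imputed data sets.
pooled, total, se = pool_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total, se)
```

The total variance exceeds the average within-imputation variance whenever the estimates disagree across imputations, which is how the extra uncertainty due to missing data enters the pooled standard error.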

GLMs

boxcox macro
Finds power transformations of the response variable in a regression model (PROC REG) by the Box-Cox method, with graphic display of the maximum likelihood solution, t-values for model effects, and the influence of observations on choice of power.
boxglm macro
Finds power transformations of the response variable in a general linear model (PROC GLM) by the Box-Cox method.
boxtid macro
Finds power transformations of predictor variables in a general linear model (PROC GLM) by the Box-Tidwell method.
inflogis macro
Produces an influence plot for a logistic regression model. The plot shows a measure of badness of fit for a given case (DIFDEV or DIFCHISQ) vs. the fitted probability (PRED) or leverage (HAT), using an influence measure (C or CBAR) as the size of a bubble.
inflplot macro
Produces an influence plot for a regression model -- a plot of studentized residuals vs. leverage (hat-value), using Cook's D or DFFITS as the size of a bubble symbol.
meanplot macro
The meanplot macro produces 1-way, 2-way, or 3-way plots of means for a factorial design with any number of factor variables.
partial macro
Produces partial regression residual plots. Observations with high leverage and/or large studentized residuals can be individually labeled.
robust macro
Robust fitting for linear models (PROC REG, PROC GLM, PROC LOGISTIC) via iterative re-weighting.
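
The Box-Cox method used by boxcox and boxglm chooses the power that maximizes a profile log-likelihood. The Python sketch below illustrates this for a regression on a single predictor. It is a simplified illustration, not the macros' code: the function names, the grid of powers, and the data are assumptions, and the graphic displays and influence measures are omitted.

```python
import math

def residual_ss(xs, zs):
    """Residual sum of squares from the least-squares line of zs on xs."""
    n = len(xs)
    mx, mz = sum(xs) / n, sum(zs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    slope = sxz / sxx
    return sum((z - mz - slope * (x - mx)) ** 2 for x, z in zip(xs, zs))

def boxcox_loglik(xs, ys, lam):
    """Profile log-likelihood (up to a constant) for the power lam."""
    n = len(ys)
    if lam == 0:
        zs = [math.log(y) for y in ys]
    else:
        zs = [(y ** lam - 1) / lam for y in ys]
    # Jacobian term makes likelihoods comparable across powers.
    jacobian = (lam - 1) * sum(math.log(y) for y in ys)
    return -n / 2 * math.log(residual_ss(xs, zs) / n) + jacobian

def best_power(xs, ys, grid=(-1, -0.5, 0, 0.5, 1)):
    """Grid search for the power with the largest profile log-likelihood."""
    return max(grid, key=lambda lam: boxcox_loglik(xs, ys, lam))

# Made-up positive response that is linear in x on the log scale,
# with a small deterministic perturbation standing in for noise.
xs = list(range(1, 21))
ys = [math.exp(1 + 0.3 * x + 0.05 * (-1) ** x) for x in xs]
lam = best_power(xs, ys)
print(lam)
```

For these data the search selects power 0, the log transform, which matches how the response was constructed.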