This is a Boxplot side by side window. If it is empty, drag one or more quantitative variables into the window. Data desk will make a vertical Boxplot for each variable, all on the same vertical scale.

### Working with Boxplots side by side

Boxplots and Dotplots work similarly. Switch between these two views with the Add Boxes/Remove Boxes command in the global HyperView menu. Generally, boxplots give a better overview of the relationship of the variables and dotplots show more detail and facilitate identifying and working with (e.g. assigning colors or symbols to) individual points. Boxplots show the InterQuartile Range (IQR) of each variable as the size of the box and the median of each variable as a horizontal line across the box. They nominate possible outliers in terms of the IQR and show them as individual points. Because this outlier nomination is local to each variable, it is an effective way to identify points that might be unusual for that variable even if they are not unusual when viewed overall. You can identify any point with the query (?) tool.
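The nomination rule is stated in terms of the IQR; the conventional fence rule (1.5 × IQR beyond the quartiles) can be sketched as below. The quartile convention used here is an assumption — Data Desk's exact rule may differ slightly.

```python
def nominate_outliers(values):
    """Nominate possible outliers with the common 1.5 x IQR fence rule."""
    xs = sorted(values)
    n = len(xs)

    def quartile(q):
        # Linear-interpolation quantile; Data Desk's exact quartile
        # convention may differ slightly from this one.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Points beyond either fence are nominated, not condemned:
    # they deserve individual examination, e.g. with the ? tool.
    return [x for x in values if x < lo_fence or x > hi_fence]
```

Because the fences are computed from each variable's own quartiles, a point can be nominated for one variable even if it looks ordinary overall.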

### Boxplot side by side HyperViews

In addition to adding or removing boxes, the global HyperView offers a ZScore Transformation. This is useful when plotting variables that have very different scales. For each variable, Data Desk subtracts its mean and divides by its standard deviation.
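The ZScore Transformation amounts to the following computation. This sketch uses the sample (n−1) standard deviation, which is an assumption about the exact divisor:

```python
from math import sqrt

def zscores(values):
    """Standardize a variable: subtract its mean and divide by its
    (sample) standard deviation, so that variables with very
    different scales can share one vertical axis."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / sd for x in values]
```

After the transformation every variable has mean 0 and standard deviation 1, so their boxplots are directly comparable.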

### Learning from Boxplots side by side

Boxplots offer a convenient comparison of the distributions of quantitative variables. Look for trends in their variability (comparing the IQRs) and identify and examine the nominated possible outliers. The Modify > Lines > Show Lines/Hide Lines command alternately draws or hides lines that connect the points for each case (even though the individual points are not displayed in the boxplots).

This is a Boxplot y by x window. If it is empty, drag a
quantitative variable onto the y-axis and a categorical
variable that names groups onto the x-axis. Data desk will
make a Boxplot for each group.
### Working with Boxplots

Boxplots and Dot plots work similarly. Switch
between these two views with the Add Boxes/Remove Boxes
command in the global HyperView menu. Generally, boxplots
give a better overview of the relationship of the groups
and dot plots show more detail and facilitate identifying
and working with (e.g. assigning colors or symbols to)
individual points. Boxplots show the InterQuartile Range
(IQR) of each group as the size of the box and the median
of each group as a horizontal line across the box. They
nominate possible outliers in terms of the IQR and show
them as individual points. Because this outlier nomination
is local to each group, it is an effective way to identify
points that might be unusual for that group even if they
are not unusual when viewed overall.
### Boxplot HyperViews

In addition to adding or removing boxes, the global HyperView
offers to Drop scales. This generates hotresult variables
with the max and min for each group. These can be useful in
other calculations. The HyperView attached to the name of
the categorical (x-axis) variable offers bar charts and pie
charts. The HyperView attached to the quantitative (y-axis)
variable offers a histogram and normal probability plot.
You can identify any outlying point with the query (?) tool.
### Learning from Boxplots

Boxplots offer a convenient comparison of the
distribution of a quantitative variable across categories
of a categorical variable. Look for clusters and outliers.
The vertical style of Boxplots overprints multiple points
if they appear at the same place. The ? tool will indicate
the number of overprints and will identify them
successively on multiple clicks. Consider making a
histogram of the y-axis variable (from its HyperView) and
then selecting all the points in one of the categories with
the rectangle selector, lasso selector or knife tool to
see the sub-histogram for that group against the overall
distribution.

This is a Cluster Analysis window. If it is empty, just
drag quantitative variables into it singly or in groups. A
cluster tree graph will appear. Calc > Calculation
Options > Cluster Analysis Options lets you choose the
clustering method.
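As a sketch of what one clustering method does, here is single-linkage agglomerative clustering on 1-D data. Single linkage is one of several methods such an options dialog typically offers; Data Desk's implementation details may differ:

```python
def single_linkage(points):
    """Agglomerative clustering of 1-D points with single linkage
    (the distance between clusters is the nearest pair of points).
    Returns the merge history as (distance, cluster_a, cluster_b),
    which is the information a cluster tree displays."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, list(clusters[i]), list(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return merges
```

The successive merge distances are what the Save Distances command records; a large jump in distance between merges suggests a natural number of clusters.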
### Working with Cluster Analysis

Click on any tree node to select all cases
"below" that node. Cases are selected in all graphics and
editing windows. Plot colors and symbols operate on the
points at the bottom (left) of the cluster tree, so it is
easy to select a cluster and assign a color or symbol to
it.
### Cluster Analysis HyperViews

The Save Distances command records the
distances of the successive nodes in the cluster tree.
### Learning from Cluster Analysis

Clustering is an exploratory method best used
along with other methods. Select clusters and examine them
in other analyses. Consider making an indicator variable
for the selected cases with the Modify > Selection >
Record Hot Set command.

This is a Contingency Table window. If there is no content,
drag variables into the window, dropping them in the title
area to specify which will define the rows of the table and
which will define the columns. You can drag other variables
in and drop them in these locations to replace these at any
time. Contingency tables treat the variables as categorical
even if they contain numerals. The table will have a row
for each individual category of the row variable, a column
for each category of the column variable, and counts of the
number of cases falling into the combination of categories
in the body of the table. HyperView commands include
opening Contingency Table Options (also available from the
Calc menu under Calculation Options.) These include
specifying whether marginal values should be displayed and
whether the displayed values should include counts,
proportions, or both. Options also include a chi square
statistic to summarize the degree of association between
the variables and the expected values and standardized
residuals that go into the chi square calculation. (Chi
square = the sum of the squared standardized residuals, so
these can reveal just where the table shows an association
between the variables.)
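The parenthetical relationship — chi square is the sum of the squared standardized residuals — can be checked directly. A minimal sketch for a two-way table of observed counts:

```python
from math import sqrt

def chi_square(table):
    """Expected counts come from the margins (row total x column
    total / grand total); each standardized residual is
    (observed - expected) / sqrt(expected); chi square is the sum
    of the squared standardized residuals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    resids = [[(obs - r * c / grand) / sqrt(r * c / grand)
               for obs, c in zip(row, col_totals)]
              for row, r in zip(table, row_totals)]
    chi2 = sum(z * z for row in resids for z in row)
    return resids, chi2
```

Scanning the residuals shows which cells drive a large chi square, which is exactly why the options offer to display them.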
### Working with Contingency Tables

Click on any cell or margin value to drop down a
HyperView menu that offers to select the cases represented
by that value, to create an indicator (or dummy) variable
that is one for the cases in that cell and zero for the
others, or to record a 0/1 variable and make it the
Selector for subsequent commands. Choose Compute Counts
from the global HyperView menu to record hotresult
variables that name the row categories and column
categories, and the counts.
### Learning from Contingency Tables

A chi square statistic with a low p-value can be
interpreted as evidence that the two variables are not
statistically independent. Depending on your interpretation
of the variables, the test may indicate whether the
distributions of counts among several categories are alike
or different. When a test is statistically significant, it
is almost always worthwhile to examine the standardized
residuals to understand which cells contributed to the
larger chi square value. If one or two cells stand out with
larger residuals, you can click on them to select those
cases for further examination in other displays. (All case
selections always appear in all displays and editing
windows.)
### Special Features of Contingency Tables

If the number of categories in a variable
exceeds the categories limit (default 50), Datadesk will
warn you before proceeding to make the table. This could
happen if you select a quantitative variable rather than a
categorical variable. You can adjust the categories limit
at Data Desk > Preferences > Categories Limit…

This is a Correlation Table window. If it shows no values,
drag variables into the window and drop them anywhere. You
can drag additional variables into the window at any time to
add them to the correlation table. The drop-down menu at
the top lets you choose among Pearson product-moment
correlation, Kendall's tau, Spearman's rho, and covariances.
The table shows pairwise correlations computed for all
cases with numeric values on both variables. Correlation
tables can take you directly to related analyses. The
HyperView found by clicking the triangle in the upper left
of the window offers features and related commands.
Correlation tables are versatile with many special
features.
### Working with Correlation Tables

Click on the row or column name of any variable
to Select or Locate its icon, to make a Histogram or Normal
Probability Plot of the variable or to remove it from the
table. Click on any correlation to make a Scatterplot of
the two variables. Because correlations are symmetric,
Datadesk offers to plot either variable on the y-axis. If
any value in one of the variables in the table is modified,
the table will offer to recalculate by displaying a red
exclamation mark where the global HyperView menu usually
appears in the upper left corner of the window.
### Correlation HyperViews

The global HyperView offers to make a Plot Matrix of all the
variables. This is the natural visualization of a
correlation table. Turn on Automatic Update to have the
table immediately recomputed if any data value is changed.
### Learning from Correlations

Correlations summarize the association
between pairs of variables. Pearson correlation measures
linear association. Kendall's tau measures monotonicity.
Spearman's rho is the Pearson correlation of the ranks, and
is less sensitive to possible outliers. Covariances, unlike
the other association measures available in this table,
measure association using the original measurement units of
the variables rather than re-scaling them. It is a good
idea to examine the scatterplot corresponding to any
correlation that is of importance or interest. Pearson
correlation is sensitive to outliers and is only
appropriate when the association is linear. You can check
both with the scatterplot. Kendall's tau and Spearman's rho
do not require a linear relationship and are less sensitive
to outliers.
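The relationship among these measures can be illustrated directly: Spearman's rho really is just the Pearson correlation applied to the ranks of the data. A sketch, using average ranks for ties:

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation: a measure of linear
    association, sensitive to outliers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank the values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks, and so
    less sensitive to possible outliers."""
    return pearson(ranks(x), ranks(y))
```

Note that a perfectly monotone but curved relationship gives a Spearman's rho of 1 while its Pearson correlation is less than 1 — one reason to examine the scatterplot.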
### Special Features of Correlation Tables

Correlations are computed for all cases that
have numeric data in both variables. Thus the correlations
in the table may be for different numbers of cases and for
different cases if data are missing for different cases in
different variables. Pearson correlation is closely
associated with linear regression. A convenient way to get
to a regression is to make the offered scatterplot and then
choose Regression from the scatterplot's HyperView menu.
This has the advantage of offering a check for nonlinearity
and outliers along the way. When developing a multiple
regression model, it can be helpful to save the residuals
and drag them into a correlation table of the available
predictors. This will show which remaining predictors are
correlated with the residuals and are thus good candidates
to include in your model. When a variable is added to the
model, the residuals will update and the correlation table
will offer to update to reflect the change.

This is a Dotplot side by side window. If it is empty, drag
one or more quantitative variables into the window. Data
desk will make a vertical dotplot for each variable, all on
the same vertical scale.
### Working with Dotplots side by side

Dotplots and Boxplots work similarly. Switch
between these two views with the Add Boxes/Remove Boxes
command in the global HyperView menu. Generally, boxplots
give a better overview of the relationship of the variables
and dotplots show more detail and facilitate identifying
and working with (e.g. assigning colors or symbols to)
individual points.
### Dotplot HyperViews

In addition to adding or removing boxes, the global HyperView
offers a ZScore Transformation. This is useful when
plotting variables that have very different scales. For
each variable, Data Desk subtracts its mean and divides by
its standard deviation. This makes it easier to compare
distributions within the variables. Because these are
different variables, each case is represented in each of
the dotplots, so when a point is selected, it highlights in
each dotplot. You can identify any point with the query (?)
tool.
### Learning from Dotplots

Dotplots offer a convenient comparison of the
distributions of quantitative variables. Look for trends
in the variability of the variables and for possible
outliers. The vertical style of dotplots overprints
multiple points if they appear at the same place. The ?
tool will indicate the number of overprints and will
identify them successively on multiple clicks. The Modify
> Lines > Show Lines/Hide Lines command alternately
draws lines that connect the points for each case or hides
them. With lines shown, this is a kind of parallel
coordinate plot.

This is a Frequency Breakdown window. If it is empty, drag
a categorical variable into it. Frequency breakdowns show,
for each category, the count of cases falling in that
category, the percentage of cases in the category, and a
cumulative percentage of cases. Drag a new variable over
the variable name to replace it.
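The three columns of a frequency breakdown can be computed directly. A minimal sketch, taking categories in order of first appearance (Data Desk's ordering may differ):

```python
def frequency_breakdown(variable):
    """For each category: count of cases, percentage of cases, and
    cumulative percentage, as rows of (category, count, pct, cum)."""
    counts = {}
    for v in variable:
        counts[v] = counts.get(v, 0) + 1  # dict preserves first-seen order
    total = len(variable)
    rows, cum = [], 0.0
    for cat, n in counts.items():
        pct = 100.0 * n / total
        cum += pct
        rows.append((cat, n, pct, cum))
    return rows
```

The cumulative column necessarily ends at 100%, which is a quick sanity check on the table.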
### Working with Frequency Breakdowns

Click on the column heads to save and locate
a hotresult variable holding the values of the column. As
hotresults, these will update if the variable is changed or
replaced. The HyperView attached to the variable name
offers to make a bar chart or pie chart of the variable.
HyperViews attached to any of the values in the table offer
to select that category in all other displays and editing
windows.
### Frequency Breakdown HyperViews

The Frequency Options command in the global
HyperView offers a variety of alternative calculations for
a frequency window. These include the expected values and
standardized residuals for a chi square test of the
hypothesis of equal cell counts.

This is a histogram window. If it is blank, then drag a variable's icon into the window and drop it anywhere. The variable must contain at least one numeric case. To change the displayed variable, drag a new variable icon onto the axis label at the bottom of the plot, and drop it there.

This is a Linear Model window. If there is no analysis, drag your response variable into the indicated place at the top of the table. Drag your factor or predictor variables into the indicated place.
At any time, you can drag additional variables into the window at these locations for either the factors or response. Click on the name of a variable to remove it from the model.
There can be more than one response variable, in which case, the analysis is a multivariate linear model, and the Type of analysis will indicate that.
The response variable can be a binary categorical variable, in which case you should click on the type of analysis and choose Logistic.
If all factors are categorical (discrete) and the response variable(s) are quantitative, the analysis is a multivariate ANOVA (MANOVA).
If all factors are quantitative (continuous) and the response variable(s) are quantitative, the analysis is a multivariate regression.
If the factors are a mix of quantitative and categorical and the response variable(s) are quantitative, the analysis is a multivariate analysis of covariance (ANOCOV).
The multivariate linear model is an extraordinarily general analysis with many special versions. You may want to consult the Data Desk documentation for further information.
### Working with Linear Models

In the Factors panel, specify for each factor whether it is a fixed or random effect and whether it is continuous (quantitative) or discrete (categorical).
Indicate nesting by dragging a line between a factor and the parentheses next to the factor in which it is nested.
Select either Type I (sequential) or Type III (partial) sum of squares.
The Design Help button shows how to specify common designs.
The Interactions Sub-panel allows you to call for all interactions up to a specified level, to select two terms and add their interaction, or to select and remove an interaction. It also computes and displays the maximum df available for each interaction (fewer df may ultimately be available as the model is computed due to collinearities and empty cells), the basis for the expected mean squares, and the denominator to be used for the appropriate F-test.
The Up and Down buttons allow re-ordering of factors and interactions for sequential sums of squares.
The modifications panel accommodates a selector variable to analyze a subset of the data and a variance variable to make the analysis a weighted analysis.
The Results panel opens to reveal output tables appropriate to the type of analysis.
A sub-panel opens to offer coefficients, expected cell means, and post-hoc tests.
The results for a multivariate analysis include a selection among the most common multivariate tests. For a multivariate analysis, the analysis for a specific response is itself an ANOVA or ANOCOV.
### Linear Model HyperViews

The global HyperView menu offers a variety of diagnostic displays and calculations similar to those offered for regression.
HyperViews on each variable offer histograms and normal probability plots.
Each panel has a window icon. The HyperView attached to that offers to pull the panel out into a separate window (so the analysis can fit more easily on a small screen) and to make a static copy of it (for comparison with an alternative model).
ANOVA/ANOCOV panels behave in the same way as ANOVA windows.
The tables of results for multivariate tests offer HyperViews with appropriate supplementary information such as eigenvalues associated with the selected factor.

This is a Pie Chart window. If it is empty, drag the icon of a variable into the window and drop it there. Drag a new variable into the window to plot it instead.
A pie chart treats the displayed variable as naming categories even if its contents are numbers. It divides a circle into segments that correspond in size to the relative frequency of each category named in the variable. The segments are colored and a color key is provided to the right of the circle.
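The segment sizes follow directly from the relative frequencies. A sketch of the angle computation:

```python
def pie_angles(variable):
    """Angle (in degrees) of each pie segment, proportional to the
    relative frequency of each category named in the variable."""
    counts = {}
    for v in variable:
        counts[v] = counts.get(v, 0) + 1
    total = len(variable)
    # Each category's share of 360 degrees is its share of the cases.
    return {cat: 360.0 * n / total for cat, n in counts.items()}
```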
### Working with Pie Charts

Select the cases in any sector of the pie chart by clicking on it or on its color square in the key. There is no need to choose a plot tool. Cases selected in a pie chart highlight in all other displays and editing windows.
### Pie Chart HyperViews

The natural alternative display for the same variable is a Bar chart, which is offered in the HyperView.
The natural associated table is a Frequency Table, also offered in the HyperView.
The option of using patterns instead of colors is available, and useful for publication when colors will not be available.
### Learning from Pie Charts

Pie charts have been maligned because the human eye finds it harder to perceive the relative size of angles than the relative heights of bars in a bar chart. However, they are compact and easily read. They are most useful when displaying a variable that divides some "whole" into segments.
### Special Features of Pie Charts

The colors assigned to the pie chart slices are selected to be as different from each other as possible. They are the same colors that Datadesk will choose if the displayed variable is used to color points in another display with the Modify>Colors>Add Group command. Thus, a pie chart of the variable can serve as a quick key to the assigned colors.

This is a Principal Components window. If it is empty then drag some variables into the window individually or in groups.
### Working with Principal Components

The principal components results report the eigenvalues, the eigenvectors, and an unrotated factor matrix.
### Principal Components HyperViews

The global HyperView offers Principal Components Options. These include a choice of basing the analysis on correlations or on covariances. Generally, it is best to use correlations unless the variables are measured on comparable scales. The Options also offer a choice of results to be saved. Results are saved in a new folder that is placed in the Data folder found in the File icon at the upper right of the Data Desk window.
### Learning from Principal Components

Locate the PC's folder in the Data folder of the File icon. Two folders hold the columns of the U and V' matrices of the Singular Value Decomposition (SVD) of the matrix made up of the columns of data; X = UDV' where D is a diagonal matrix of the singular values.
A rotating plot of the columns of U is the same as a rotating plot of the columns of X (the original data) except for the orientation of the axes. For more than 3 variables, the rotating plot of the columns of U may show a more "interesting" orientation of the data.
### Special Features of Principal Components

The U and V columns are themselves derived variables, so you can open them to see the linear combinations of the argument variables.

This is a Linear Regression window. If it is empty, drag your response variable into the top line of the table where indicated. Drag predictor variables–either individually or as a group–into the bottom row of the table where indicated. When both parts of the analysis are specified, the regression is computed and the rest of the table will fill in.
You can drag a new variable into the response variable row to replace the response variable. You can drag additional variables into the predictor area to add them to the model. Alternatively, you can drag a new variable directly over the name of an existing predictor to replace it in the regression model.
Linear Regression fits a linear model to predict or describe the response variable in terms of the predictor variables.
The values in the regression table are:
The number of cases in the model. A case that is missing a numeric value in any of the variables in the model will be excluded from the calculation.
R-squared gives the percentage of the variance of the response variable accounted for by the regression model.
R squared (adjusted) is a value adjusted for the number of cases and variables, suitable for comparing regression models with different numbers of predictors.
F-ratio is a global indicator of whether the response can be modeled by the predictors.
The final sub-table names each predictor and gives:

• its coefficient in the model,

• the estimated standard error of that coefficient,

• the t-ratio for testing the standard null hypothesis that the true value of the coefficient in this model is zero,

• the p-value of that t-test.
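For a one-predictor model, these table values can be computed by hand. A sketch using the standard least-squares formulas (this is an illustration, not Data Desk's code):

```python
from math import sqrt

def simple_regression(x, y):
    """Slope, intercept, R-squared, adjusted R-squared, and the
    slope's t-ratio for a one-predictor least-squares fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sst = sum((b - my) ** 2 for b in y)                       # total SS
    sse = sum((b - (intercept + slope * a)) ** 2              # residual SS
              for a, b in zip(x, y))
    r2 = 1 - sse / sst
    # Adjusted R-squared penalizes for the number of predictors (1 here),
    # making models with different numbers of predictors comparable.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    se_slope = sqrt(sse / (n - 2) / sxx)
    t = slope / se_slope          # t-ratio for H0: true slope = 0
    return {"slope": slope, "intercept": intercept,
            "r2": r2, "adj_r2": adj_r2, "t": t}
```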

### Working with Regressions

Regression is one of the most versatile statistical models and is widely used. Datadesk offers great flexibility for building, diagnosing, and understanding regressions.
Drag potential predictor variables into your model to add them to the model or replace existing predictors. Remove predictors by clicking on them and choosing the Remove Predictor command.
HyperView menus are attached to various parts of the regression table.
The menu attached to variable names offers to locate or select the variables, display them with Histograms or Normal probability plots, or make a Scatterplot of the response variable against that predictor.
The HyperView attached to each coefficient offers to make the partial regression plot that corresponds to that coefficient or to "drop coefficients" into hotresult variables that can be used in other calculations.
The HyperView attached to the standard errors offers to drop the coefficients.
The HyperView attached to each t-ratio offers a plot of studentized residuals or of residuals vs that variable, and also offers to drop the t-ratio in a hotresult.
The HyperView attached to each p-value offers to drop the p-value as a hotresult.
The HyperViews attached to the Sum of Squares values and the df values drop hotresults containing those values. They are provided primarily to be available for other calculations.
The HyperView attached to the adjusted R-squared drops that value as a hotresult. This is provided primarily for use in automatically optimizing the regression model by maximizing the adjusted R-squared value.
### Regression HyperViews

The global HyperView offers a variety of displays and diagnostic statistics:
Scatterplot residuals vs predicted values.
Scatterplot studentized residuals vs predicted values. This command computes the externally studentized residuals and plots them against the predicted values. Studentized residuals are adjusted to all have the same standard errors, so this plot may be more appropriate for assessing whether the regression assumption of constant variance around the model is satisfied.
Potential-Residual plot. This is a diagnostic plot that can help identify influential cases. It is a good idea to identify and understand any cases that stand apart from the rest of the data in this display.
Assign Variance Variable. To compute a weighted regression, select a variable that holds the estimated variances of the cases and then choose this command. The reciprocal of the variances will be the weights in the model.
Turn On Automatic Update. As in other Data Desk windows, this causes the regression to update immediately upon any change to a variable in the model. See the discussion of special features for some ideas on using this capability.
Compute> This is a submenu offering to compute a variety of diagnostic and related statistics. The computed statistics are saved as hotresult variables, so all will update if the regression model is updated. Computed statistics include:
Predicted values
Residuals
Leverages
Externally studentized residuals
DFFITS
Cook's D
Hadi's Influence

Likelihood
Mahalanobis distances (based on the predictors)
Prob Plot> This is a submenu offering to make a Normal Probability Plot of any of a variety of related diagnostic statistics. The available plots include:
Residuals
Leverages
Externally studentized residuals
Cook's D
Hadi's influence
Likelihood
For many of these measures the best indication of an extraordinary case is that it stands away from the other values. A Normal probability plot offers a good way to look for that and one in which individual cases can be easily identified with the ? tool.

### Learning from Regression

Regressions are found for a variety of reasons and can be part of other analyses.
If you are particularly interested in a coefficient in a multiple regression, you should make and examine the partial regression plot available by clicking on the coefficient. That plot displays the relationship between the response variable and the predictor in question after removing the linear effects of the other predictors in the model. You can interpret it as you would a simple scatterplot. You should identify and understand any cases that stand away from the body of the data and you should be concerned if the relationship looks nonlinear.
No regression is complete without an examination of the residuals, so either a plot of residuals vs predicted or of studentized residuals vs predicted is highly recommended.
### Special Features of Regression

Many of Data desk's abilities are particularly valuable when you build and interpret a linear regression.
Consider some of the following options:
If you have found (or suspect) a curved relationship between the response variable and the predictors–for example, if the scatterplot of residuals is curved–select the response variable. (You can do that from the HyperView attached to its name in the regression table.) Then choose
Manip > Transform > Dynamic > Box-Cox
Datadesk will make a new derived variable and a slider.
Drag the derived variable into the regression table to replace the original response variable.
If you haven't already, make a scatterplot of the residuals and/or a Normal probability plot of the residuals.
Set those plots to Automatic Update using the commands in their HyperView menus.
Now, sliding the control on the slider will re-express the response variable. A slider value of 1 is the raw data. A value of ½ takes a square root. A value of 0 specifies a (natural) logarithm. A value of -½ specifies a negative reciprocal root, and -1 specifies a negative reciprocal. As you slide, the regression model is continuously re-computed along with the residuals, predicted values, and any other statistics you have computed or plotted. Plots of those values set to automatic update will change smoothly to reflect the change. You may set the regression table itself to Automatic Update, but that isn't necessary.
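The ladder of powers the slider moves through can be sketched as a function. Negating for negative powers keeps the data in the same order; the slider's exact Box-Cox scaling may differ from this basic family:

```python
from math import log

def reexpress(x, power):
    """Ladder-of-powers re-expression of a positive value:
    1 = raw data, 1/2 = square root, 0 = natural log,
    -1/2 = negative reciprocal root, -1 = negative reciprocal."""
    if power == 0:
        return log(x)          # the log fills the "zero power" slot
    if power < 0:
        return -(x ** power)   # negate so larger x stays larger
    return x ** power
```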

With this trick, it is easy to see and understand the effects on your regression model of re-expressing the response variable and to find an optimal re-expression function.

A similar trick can help you to choose between two potential predictors. Choose both variables and select Manip > Transform> Dynamic > Mix X and Y.

Datadesk makes a derived variable and slider. The slider is bounded at 0 and 1. At 0 the derived variable is equal to the X variable. At 1 it is equal to the Y variable. Between those values it is a weighted combination of scaled versions of these variables. With this variable you can "slide" from one variable to the other and watch the consequences in the plots you have set to automatically update. This can be remarkably informative. You may, for example, see a cluster of points that move together in one of the plots, helping to identify those cases as related to each other.
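The blend the slider computes can be sketched as follows. The z-score scaling here is an assumption about what "scaled versions" means; the sliding idea is the same either way:

```python
def mix(x, y, t):
    """Weighted combination of standardized copies of two candidate
    predictors: t = 0 gives (scaled) x, t = 1 gives (scaled) y,
    and values in between slide from one to the other."""
    def scale(v):
        # z-score scaling (an assumption about the exact scaling used)
        n = len(v)
        m = sum(v) / n
        sd = (sum((a - m) ** 2 for a in v) / (n - 1)) ** 0.5
        return [(a - m) / sd for a in v]
    sx, sy = scale(x), scale(y)
    return [(1 - t) * a + t * b for a, b in zip(sx, sy)]
```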

You may discover a case that you conclude should be omitted from the analysis. (Perhaps, for example, you identified the case with the ? tool and then used the Special > Web Search Query function to learn more about the case, concluding that it was in some important way different from the other cases.) One convenient way to set the case aside without losing it is to create a special indicator variable that is 1 for that case and 0 for all the others.

Select the case in any display.

Open a variable that names the cases (if you have one).

Choose Modify > Selection > Record as Indicators

Data Desk will make an indicator variable for each selected case, naming the variable with the name of the case.

Now drag those indicators into the regression model to add them as predictors.

The effects of these cases will be removed from the analysis.

The p-value associated with the t-test on each coefficient is a statistical test of whether the case is in fact an outlier from the regression model.
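The indicator itself is trivial to construct; what matters is the effect of adding it as a predictor:

```python
def indicator(n_cases, selected_index):
    """A 0/1 variable that is 1 only for the selected case. Added as
    a predictor, it absorbs that case's effect from the fit, and the
    t-test on its coefficient tests whether the case is an outlier
    from the model fit to the remaining cases."""
    return [1 if i == selected_index else 0 for i in range(n_cases)]
```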
