This is a Linear Regression window. If it is empty, drag your response variable into the top line of the table where indicated. Drag predictor variables, either individually or as a group, into the bottom row of the table where indicated. When both parts of the analysis are specified, the regression is computed and the rest of the table fills in.
You can drag a new variable into the response variable row to replace the response variable. You can drag additional variables into the predictor area to add them to the model. Alternatively, you can drag a new variable directly over the name of an existing predictor to replace it in the regression model.
Linear Regression fits a linear model to predict or describe the response variable in terms of the predictor variables.
The values in the regression table are:
The number of cases in the model. A case that is missing a numeric value in any of the variables in the model will be excluded from the calculation.
R-squared gives the percentage of the variance of the response variable accounted for by the regression model.
R squared (adjusted) is a value adjusted for the number of cases and variables, suitable for comparing regression models with different numbers of predictors.
F-ratio is a global indicator of whether the response can be modeled by the predictors.
The final sub-table names each predictor and gives:
• its coefficient in the model,
• the estimated standard error of that coefficient,
• the t-ratio for testing the standard null hypothesis that the true value of the coefficient in this model is zero, and
• the p-value of that t-test.
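The quantities in the table come from ordinary least squares. As an illustration only (a sketch of the standard OLS formulas, not Data Desk's internal code), they can be computed in Python with NumPy:

```python
import numpy as np

def regression_table(y, X):
    """Compute the quantities shown in the regression table via ordinary
    least squares. X holds the predictors (one column each); an
    intercept column is added automatically."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    n, p = X.shape                        # p counts the intercept too
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # coefficients
    resid = y - X @ beta
    sse = resid @ resid                   # residual sum of squares
    sst = ((y - y.mean()) ** 2).sum()     # total sum of squares
    r2 = 1 - sse / sst                    # R-squared
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p)      # adjusted R-squared
    f_ratio = ((sst - sse) / (p - 1)) / (sse / (n - p))
    mse = sse / (n - p)
    se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))  # std errors
    t = beta / se                         # t-ratios for H0: coef = 0
    return beta, se, t, r2, r2_adj, f_ratio
```

The p-value for each coefficient is then the two-tailed tail probability of its t-ratio under a t distribution with n − p degrees of freedom.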
Working with Regressions
Regression is one of the most versatile statistical models and is widely used. Data Desk offers great flexibility for building, diagnosing, and understanding regressions.
Drag potential predictor variables into your model to add them to the model or replace existing predictors. Remove predictors by clicking on them and choosing the Remove Predictor command.
HyperView menus are attached to various parts of the regression table.
The menu attached to each variable name offers to locate or select the variable, display it with a histogram or Normal probability plot, or make a scatterplot of the response variable against that predictor.
The HyperView attached to each coefficient offers to make the partial regression plot that corresponds to that coefficient or to "drop coefficients" into hotresult variables that can be used in other calculations.
The HyperView attached to the standard errors offers to drop the coefficients.
The HyperView attached to each t-ratio offers a plot of studentized residuals or of residuals vs that variable, and also offers to drop the t-ratio into a hotresult.
The HyperView attached to each p-value offers to drop the p-value as a hotresult.
The HyperViews attached to the Sum of Squares values and the df values drop hotresults containing those values. They are provided primarily to be available for other calculations.
The HyperView attached to the adjusted R-squared drops that value as a hotresult. This is provided primarily for use in automatically optimizing the regression model by maximizing the adjusted R-squared value.
The global HyperView offers a variety of displays and diagnostic statistics:
Scatterplot residuals vs predicted values.
Scatterplot studentized residuals vs predicted values. This command computes the externally studentized residuals and plots them against the predicted values. Studentized residuals are adjusted so that they all have the same standard error, so this plot may be more appropriate for assessing whether the regression assumption of constant variance around the model is satisfied.
Potential-Residual plot. This is a diagnostic plot that can help identify influential cases. It is a good idea to identify and understand any cases that stand apart from the rest of the data in this display.
Assign Variance Variable. To compute a weighted regression, select a variable that holds the estimated variances of the cases and then choose this command. The reciprocal of the variances will be the weights in the model.
Turn On Automatic Update. As in other Data Desk windows, this causes the regression to update immediately upon any change to a variable in the model. See the discussion of special features for some ideas on using this capability.
Compute> This is a submenu offering to compute a variety of diagnostic and related statistics. The computed statistics are saved as hotresult variables, so all will update if the regression model is updated. Computed statistics include:
Externally studentized residuals
Mahalanobis distances (based on the predictors)
Prob Plot> This is a submenu offering to make a Normal Probability Plot of any of a variety of related diagnostic statistics. The available plots include:
Externally studentized residuals
For many of these measures the best indication of an extraordinary case is that it stands away from the other values. A Normal probability plot offers a good way to look for that, and individual cases in it can be easily identified with the ? tool.
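Externally studentized residuals, which appear in several of the commands above, scale each residual by an error-variance estimate computed with that case left out, so every case is judged against a fit it did not influence. A sketch of the standard formula (illustrative Python, not Data Desk's implementation):

```python
import numpy as np

def externally_studentized(y, X):
    """Externally studentized residuals. X should already include an
    intercept column. Each residual is divided by a leave-one-out
    estimate of its standard error, so all are on a common scale."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
    h = np.diag(H)                            # leverages
    e = y - H @ y                             # ordinary residuals
    sse = e @ e
    # error-variance estimate with case i deleted, for every i at once
    s2_i = (sse - e**2 / (1 - h)) / (n - p - 1)
    return e / np.sqrt(s2_i * (1 - h))
```

A case that stands apart from the rest of the data will usually show the largest studentized residual in magnitude.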
Learning from Regression
Regressions are found for a variety of reasons and can be part of other analyses.
If you are particularly interested in a coefficient in a multiple regression, you should make and examine the partial regression plot available by clicking on the coefficient. That plot displays the relationship between the response variable and the predictor in question after removing the linear effects of the other predictors in the model. You can interpret it as you would a simple scatterplot. You should identify and understand any cases that stand away from the body of the data, and you should be concerned if the relationship looks nonlinear.
No regression is complete without an examination of the residuals, so either a plot of residuals vs predicted or of studentized residuals vs predicted is highly recommended.
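The partial regression plot described above pairs the residuals of the response on the other predictors with the residuals of the predictor in question on those same predictors; the least-squares slope through those points equals that predictor's multiple-regression coefficient (the Frisch-Waugh result). A sketch in Python (illustrative names, not Data Desk code):

```python
import numpy as np

def partial_regression_points(y, X, j):
    """Coordinates for the partial regression (added-variable) plot of
    predictor column j. X includes an intercept column. Returns the
    horizontal values (residuals of X[:, j] on the other predictors)
    and vertical values (residuals of y on the other predictors)."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    others = np.delete(X, j, axis=1)          # all predictors except j

    def resid(v):
        b, *_ = np.linalg.lstsq(others, v, rcond=None)
        return v - others @ b                 # part not explained by the others

    return resid(X[:, j]), resid(y)
```

Plotting the second return value against the first reproduces the partial regression plot for predictor j.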
Special Features of Regression
Many of Data Desk's abilities are particularly valuable when you build and interpret a linear regression.
Consider some of the following options:
If you have found (or suspect) a curved relationship between the response variable and the predictors (for example, if the scatterplot of residuals is curved), select the response variable. (You can do that from the HyperView attached to its name in the regression table.) Then choose
Manip > Transform > Dynamic > Box-Cox
Data Desk will make a new derived variable and a slider.
Drag the derived variable into the regression table to replace the original response variable.
If you haven't already, make a scatterplot of the residuals and/or a Normal probability plot of the residuals.
Set those plots to Automatic Update using the commands in their HyperView menus.
Now, sliding the control on the slider re-expresses the response variable. A slider value of 1 is the raw data. A value of ½ takes a square root. A value of 0 specifies a (natural) logarithm. A value of -½ specifies a negative reciprocal root, and -1 specifies a negative reciprocal. As you slide, the regression model is continuously re-computed along with the residuals, predicted values, and any other statistics you have computed or plotted. Plots of those values set to automatic update will change smoothly to reflect the change. You may set the regression table itself to Automatic Update, but that isn't necessary.
With this trick, it is easy to see and understand the effects on your regression model of re-expressing the response variable and to find an optimal re-expression function.
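The slider values correspond to the Box-Cox family of re-expressions. A minimal sketch, assuming the standard (y^λ − 1)/λ form; Data Desk may scale or shift the result differently:

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox re-expression of positive data y at power lam.
    lam = 1 leaves the data alone (up to a shift), lam = 1/2 is a
    square root, lam = 0 is the natural log, and negative lam gives
    negative reciprocal powers. Dividing by lam keeps the family
    monotone increasing in y for every lam."""
    y = np.asarray(y, float)          # requires y > 0
    if lam == 0:
        return np.log(y)              # the lam -> 0 limit of the family
    return (y**lam - 1) / lam
```

Sliding the control amounts to re-evaluating this function at a new λ and refitting the regression with the re-expressed response.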
A similar trick can help you choose between two potential predictors. Select both variables and choose Manip > Transform > Dynamic > Mix X and Y.
Data Desk makes a derived variable and a slider. The slider is bounded at 0 and 1. At 0 the derived variable is equal to the X variable; at 1 it is equal to the Y variable. Between those values it is a weighted combination of scaled versions of the two variables. With this variable you can "slide" from one variable to the other and watch the consequences in the plots you have set to automatically update. This can be remarkably informative. You may, for example, see a cluster of points that move together in one of the plots, helping to identify those cases as related to each other.
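The mixing can be sketched as a weighted blend of standardized versions of the two variables; the exact scaling Data Desk applies is an assumption here:

```python
import numpy as np

def mix_xy(x, y, t):
    """Sketch of a 'Mix X and Y' derived variable. Both variables are
    standardized so they are on a common scale, then blended: t = 0
    gives the (scaled) X variable, t = 1 the (scaled) Y variable, and
    intermediate t a weighted combination of the two."""
    def z(v):
        v = np.asarray(v, float)
        return (v - v.mean()) / v.std()   # put both on a common scale
    return (1 - t) * z(x) + t * z(y)
```

Refitting the regression as t moves from 0 to 1 shows how the model changes as one predictor is traded for the other.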
You may discover a case that you conclude should be omitted from the analysis. (Perhaps, for example, you identified the case with the ? tool and then used the Special > Web Search Query function to learn more about the case, concluding that it was in some important way different from the other cases.) One convenient way to set the case aside without losing it is to create a special indicator variable that is 1 for that case and 0 for all the others.
Select the case in any display.
Open a variable that names the cases (if you have one).
Choose Modify > Selection > Record as Indicators
Data Desk will make an indicator variable for each selected case, naming the variable with the name of the case.
Now drag those indicators into the regression model to add them as predictors.
The effects of these cases will be removed from the analysis.
The p-value associated with the t-test on each coefficient is a statistical test of whether the case is in fact an outlier from the regression model.
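Adding a 0/1 indicator for a case gives that case its own parameter, so the remaining coefficients come out exactly as they would if the case were deleted. A sketch of the construction (illustrative Python, not Data Desk code):

```python
import numpy as np

def add_case_indicator(X, i):
    """Append a dummy predictor that is 1 for case i and 0 for all
    other cases. In the regression, the dummy absorbs case i entirely,
    so the other coefficients equal those from a fit with case i
    removed, and the dummy's t-test is a test of whether case i is an
    outlier from the model."""
    X = np.asarray(X, float)
    d = np.zeros(X.shape[0])
    d[i] = 1.0
    return np.column_stack([X, d])
```

The indicator's own coefficient estimates how far the set-aside case falls from the model fit to the remaining cases.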