- Hypothesis Testing in Multiple Regression
- Overall F-test for all regression coefficients
- Extra Sum of Squares F-tests for tests involving 1 or more parameters
- t-tests for individual parameters

- Interpretation of estimates
- Multiple R-Squared vs adjusted R-squared
- Estimated mean, confidence intervals and prediction intervals

- Conceptual Exercises 1-8 (Be prepared to discuss in lab)
- Exercise 15
- Exercise 16: parts a, c-e (skip b); also turn in a plot showing
  - a scatter plot of the data
  - the fitted regression model (add the equation)
  - 95% confidence intervals for the mean
  - 95% prediction intervals for future observations

- Exercise 17 (explain why R-squared is not useful for selecting among these models)
- For the energetic student: Exercises 20-22. Do not turn in.

- Download the Old Faithful data,
Ex0724.asc again, unless you have saved the dataframe
previously.
- Open the **Commands Window** (select Command Window from the Windows menu).
- To have S-Plus create all of our dummy variables automatically, we need to change one option for how S-Plus handles factors. In the command window, enter `options(contrasts="contr.treatment")`. For a categorical (factor) variable with k levels, this tells S-Plus to create k-1 dummy (indicator) variables: the first dummy variable is an indicator for the 2nd level, the second is an indicator for the 3rd level, and so on, with the last dummy variable an indicator for the last level. The first level corresponds to all dummy variables equal to 0.
- Now tell S-Plus to treat date as a factor rather than as a continuous variable. Go to the **Data** menu and select **Change Data Type**. Select the column for "date", set Column Type to "factor", and click OK.
- Fit the regression model of interval on duration and date. Go to the Statistics menu, select Regression, then Linear. In the dialog box, create the model formula as `interval ~ duration + date`. It is important to put date last so that it appears last in the sequential sums of squares in the ANOVA table. In the box for **Save Model Object as**, enter **Full.lm** (i.e., this is the full model). Under the Results page/tab, check the box for ANOVA and save the fitted values in the dataframe. Click OK to run the regression. (Verify that the residuals are OK, then discard the plots.) The Report Window will contain the ANOVA table for the full model. Verify that the df for date are 7, not 1: because date is treated as a factor with 8 levels, it should have 1 df for each dummy variable, or 8 - 1 = 7. If date were treated as a continuous variable its df would be 1, which likely means something went wrong with the options() command earlier.

- Now fit the reduced model. Go to the linear regression menu and create the formula as `interval ~ duration`. Save the Model Object as **Reduced.lm**. Under the Results tab, check the ANOVA box but not the fitted values box. Click OK. The ANOVA table for the reduced model will be in the Report Window.
- Now to answer the question in the text! To test whether there is any difference in mean intervals due to date, construct the Extra Sum of Squares F-test using the Residual SS from these two tables. You can do the calculations in the command window to get the F-statistic, **but in your write-up you should show the steps involved in the calculation, i.e., show how you calculate the F-stat from the full and reduced model ANOVA tables.**
- To get a p-value, use the pf() function in the command window. For example, for an F-stat of 5 with 4 and 10 df, the p-value is `1 - pf(5, 4, 10)`.
- Are your p-value and F-stat different from the F-stat and p-value in the row corresponding to date in the sequential ANOVA table for the full model? They should agree if you added date LAST in the model formula, since that row gives the Extra Sum of Squares for adding date after taking duration into account. Note: you can always add up the sequential Sums of Squares to get the Extra Sum of Squares, provided they are in the right order. For the F-test in class, for instance, we could have added the Sums of Squares from I and I:log.duration to get the Extra Sum of Squares, as long as the model was written in the order log.duration + I + I:log.duration.
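The by-hand steps above can be sketched in Python. The RSS and df values below are hypothetical, not the Old Faithful results, and the second function is a crude standard-library stand-in for what `1 - pf(x, d1, d2)` returns in S-Plus (it integrates the F density numerically):

```python
import math

def extra_ss_fstat(rss_reduced, rss_full, df_reduced, df_full):
    """Extra Sum of Squares F-test:
    F = [(RSS_reduced - RSS_full) / (df_reduced - df_full)] / (RSS_full / df_full)."""
    extra_ss = rss_reduced - rss_full
    num_df = df_reduced - df_full          # parameters set to 0 under Ho
    mse_full = rss_full / df_full
    return (extra_ss / num_df) / mse_full

def f_upper_tail(x, d1, d2, steps=100000, upper=500.0):
    """P(F > x) for an F(d1, d2) variable, by trapezoid integration of the
    F density from x to a large cutoff (stdlib-only substitute for 1 - pf)."""
    c = (math.gamma((d1 + d2) / 2) / (math.gamma(d1 / 2) * math.gamma(d2 / 2))
         * (d1 / d2) ** (d1 / 2))
    def dens(t):
        return c * t ** (d1 / 2 - 1) * (1 + d1 * t / d2) ** (-(d1 + d2) / 2)
    h = (upper - x) / steps
    total = 0.5 * (dens(x) + dens(upper))
    for i in range(1, steps):
        total += dens(x + i * h)
    return total * h

# Hypothetical ANOVA results: reduced model RSS = 200 on 14 df,
# full model RSS = 100 on 10 df
f = extra_ss_fstat(rss_reduced=200.0, rss_full=100.0, df_reduced=14, df_full=10)
print(f)                         # (100/4) / (100/10) = 2.5
print(f_upper_tail(5.0, 4, 10))  # should match 1 - pf(5, 4, 10) in S-Plus
```

In your write-up, show the subtraction of the Residual SS and df explicitly rather than just the final number.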
- S-Plus can automatically compute the F-stat and p-value for the Extra Sum of Squares test through the Model Comparison option. Go to the **Statistics** menu and select **Compare Models**. Select the two models, Full.lm and Reduced.lm (use a ctrl-click if they are not in order), and click OK. The F-stat and p-value should agree with what you calculated by hand. The variable listed under Test specifies which variables are being dropped from the model under Ho; the df are the change in df between the two models (the number of parameters set to 0 under Ho). Note: this df is sometimes negative because of the order, so use its absolute value as the numerator df for the F-test. Likewise, the absolute value of the Sum of Squares is the Extra Sum of Squares due to the added variables. The F-stat is {Extra SS / number of parameters equal to 0 under Ho} / MSE from the full model.
- Do the results make sense visually? Make a scatter plot of the simple linear regression model (y = interval, x = duration) using the graph menu. Now add the regression lines for each date: go to the **Insert** menu, select **Plot**, and choose **Line Plot**. For the x-axis select duration, but for the y-axis select the column of fitted values, fit. Under the **Subset Rows with** option, enter date==1 (this plots only the cases where date equals 1). In the Sort/Smooth tab, select Sort X,Y by X so the lines are connected in the right order. Click OK; you should see the fitted regression line for that date added to the plot. Repeat for the other 7 dates, i.e., date==2, date==3, ..., date==8. From the Insert menu, add a legend, title, or any other information (such as the equation for the simple linear regression model).
- Provide an interpretation and a confidence interval for each coefficient in the full model and the reduced model. Are the confidence intervals consistent with the results of the Extra Sum of Squares test? Explain.
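Each confidence interval asked for above has the form estimate ± t × SE. A minimal Python sketch with made-up numbers (the estimate, SE, and residual df here are hypothetical; in S-Plus the t quantile comes from qt(0.975, df)):

```python
def coef_ci(estimate, se, t_crit):
    """95% CI for a regression coefficient: estimate +/- t * SE, where t is
    the 0.975 quantile of a t distribution on the model's residual df."""
    return (estimate - t_crit * se, estimate + t_crit * se)

# Hypothetical output: slope estimate 0.8 with SE 0.2, residual df = 10,
# so t_{0.975, 10} = 2.228 from a t table (qt(0.975, 10) in S-Plus)
lo, hi = coef_ci(0.8, 0.2, 2.228)
print(round(lo, 4), round(hi, 4))  # 0.3544 1.2456
```

If an interval for a date coefficient excludes 0, that is consistent with the Extra SS test finding a date effect.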

- Download the Galileo data,
Case1001.asc for Case Study 10.1. Read the background material on
the problem before lab.
- Fit the quadratic regression model to answer parts a, c-e. For c-e
we need predictions when height = 500 punti. A computational trick is to
subtract off the value 500 from height. This way, the estimated mean at
500 punti is the intercept,
and the standard error of the intercept is the standard error of the
mean. To do this calculation by hand is otherwise pretty painful :-)
For part a you may turn in results based on fitting the model using
(height - 500).
From the Statistics menu, bring up the regression dialog. For the model formula enter `distance ~ I(height-500) + I((height-500)^2)`.

*The I() function serves to "protect" the meaning of the expressions inside the parentheses. Some symbols have different meanings in a model formula than they do normally; e.g., `X1*X2` means fit the model with `X1 + X2 + X1:X2`, i.e., the main effects plus the interaction. In a model formula, `height - 1` would mean to fit the model with height and no intercept, while `I(height - 1)` would fit the model with an intercept and subtract one from each height value. So if you want to do transformations on the fly, rather than creating a transformed variable in the dataframe, use I() to be safe. Note that this I() is a function and is not the same as the indicator variable we created for the class example.*

Under the Predict tab, check the boxes for Predictions, Confidence Intervals, and Standard Errors (standard errors for the means at the observed height values). Specify Case1001 for the New Data and the dataframe for saving results. Under the Results tab, check the box for Correlation Matrix of Estimates. Click OK.

- You should be able to identify the estimates, their standard errors, and the estimate of sigma^2. The table (matrix) of variances and covariances has to be assembled with a little work: the estimated variances of the coefficient estimates are the squares of their standard errors, and the estimated covariance between two coefficient estimates is the correlation between them times the product of their standard errors, i.e., correlation(beta1-hat, beta2-hat) * SE(beta1-hat) * SE(beta2-hat). The other parts of c-e can be obtained from the output (review pages 264-268).
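The assembly just described can be sketched in Python. The standard errors and correlation matrix below are hypothetical placeholders for the values S-Plus reports, not the Galileo estimates:

```python
def assemble_cov(se, corr):
    """Build the variance-covariance matrix of the coefficient estimates
    from their standard errors and the correlation matrix of estimates:
    var_i = SE_i^2 on the diagonal, cov_ij = corr_ij * SE_i * SE_j off it."""
    k = len(se)
    return [[corr[i][j] * se[i] * se[j] for j in range(k)] for i in range(k)]

# Hypothetical S-Plus output: SEs of the three estimates and their correlations
se = [2.0, 0.5, 0.01]
corr = [[1.0, -0.6, 0.3],
        [-0.6, 1.0, -0.8],
        [0.3, -0.8, 1.0]]
V = assemble_cov(se, corr)
print(V[0][0])  # variance of first estimate: 2.0^2 = 4.0
print(V[0][1])  # covariance: -0.6 * 2.0 * 0.5 = -0.6
```

The diagonal recovers the variances; the matrix is symmetric because the correlation matrix is.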
- Write out the least squares regression equation for the model distance ~ I(height-500) + I((height-500)^2), and rewrite it so that the mean is a function of height and height^2 (i.e., not of (height - 500)). Verify that the fitted equation is the same as what is reported on page 265: the coefficients multiplying height and height^2 should match those on page 265, so that even though we are using a different "parameterization" of the model, we get the same fitted values.
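Expanding the centered form by hand gives, from b0 + b1(h - 500) + b2(h - 500)^2, the uncentered coefficients a0 = b0 - 500 b1 + 500^2 b2, a1 = b1 - 2(500) b2, and a2 = b2. A quick Python check with hypothetical estimates (these are not the page-265 values) that the two parameterizations agree:

```python
def expand_centered(b0, b1, b2, c=500):
    """Rewrite b0 + b1*(h - c) + b2*(h - c)^2 as a0 + a1*h + a2*h^2."""
    a0 = b0 - b1 * c + b2 * c ** 2
    a1 = b1 - 2 * b2 * c
    a2 = b2
    return a0, a1, a2

# Hypothetical estimates from the (height - 500) parameterization
b0, b1, b2 = 350.0, 0.6, -0.0003
a0, a1, a2 = expand_centered(b0, b1, b2)

h = 700.0
centered = b0 + b1 * (h - 500) + b2 * (h - 500) ** 2
expanded = a0 + a1 * h + a2 * h ** 2
print(abs(centered - expanded) < 1e-9)  # True: same fitted values
```

Do the same algebra with your actual estimates and compare against page 265.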
- Use the Transform menu to calculate se.pred, the standard error of
predicted values using the se.fit column. Also use the transform option
to create the upper and lower prediction intervals. The formula on page
185 still applies, although SE.fit has changed.
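That calculation can be sketched in Python: the page-185 form combines sigma-hat and se.fit as se.pred = sqrt(sigma_hat^2 + se.fit^2), and the interval is fit ± t × se.pred. All numbers below are hypothetical, including the t quantile (use qt(0.975, df) in S-Plus for the real one):

```python
import math

def prediction_interval(fit, se_fit, sigma_hat, t_crit):
    """95% prediction interval for a future observation:
    se.pred = sqrt(sigma_hat^2 + se.fit^2); interval is fit +/- t * se.pred."""
    se_pred = math.sqrt(sigma_hat ** 2 + se_fit ** 2)
    return fit - t_crit * se_pred, fit + t_crit * se_pred

# Hypothetical values: fitted mean 450, se.fit 5, sigma-hat 12,
# t quantile taken as 2.0 purely for illustration
lo, hi = prediction_interval(450.0, 5.0, 12.0, 2.0)
print(round(lo, 2), round(hi, 2))  # 424.0 476.0
```

Note the prediction interval is wider than the confidence interval for the mean because sigma-hat^2 enters se.pred.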
- Create a scatter plot of the data. Use the Insert Plot feature as
described earlier to add
- fitted mean (from the quadratic regression)
- 95% confidence intervals for the mean (the values created by S-Plus)
- 95% prediction intervals for future observations (what you just calculated above)

- For the energetic student (do not turn in): for height = 100 punti, verify by hand, using the estimated variance-covariance matrix, that the standard error of the mean is the same as the one S-Plus calculated and added to the dataframe (see Section 10.4.3 for the expression).
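The Section 10.4.3 expression is a quadratic form: with covariate row x = (1, h - 500, (h - 500)^2) and variance-covariance matrix V, SE(mean) = sqrt(x' V x). A Python sketch (the V entries below are hypothetical, not the Galileo output):

```python
import math

def se_mean(x, V):
    """SE of an estimated mean: sqrt(x' V x), where x is the row of covariate
    values and V the variance-covariance matrix of the coefficient estimates."""
    k = len(x)
    quad = sum(x[i] * V[i][j] * x[j] for i in range(k) for j in range(k))
    return math.sqrt(quad)

# For height = 100 punti in the (height - 500) parameterization,
# the covariate row is (1, -400, 160000); this V is made up for illustration.
x = [1.0, -400.0, 160000.0]
V = [[4.0, -0.006, 2e-6],
     [-0.006, 4e-4, -1e-7],
     [2e-6, -1e-7, 1e-10]]
print(se_mean(x, V))
```

Plug in your assembled V and compare against the se.fit column S-Plus saved.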

- Fit the simple linear regression on height, and save the fitted values. Rename the column, say to fit1. (Double-click on the space for the column name to enter a name.)
- You do not need to refit the quadratic model if you still have its R-squared and residual mean square. Rename the fitted values from Exercise 16 to fit2.
- To fit the cubic regression, use the model formula `distance ~ height + I(height^2) + I(height^3)`. Save the fitted values and rename them as fit3.

- For the next model add I(height^4) to the formula above. Save the
fitted values as fit4. Repeat until you have the output for all models.
- The multiple R-squared in the Report Window is what the text calls R-squared: the regression sum of squares divided by the total sum of squares. Using the expression on page 276, calculate the adjusted R-squared for each model. Note: the total mean square is just the sample variance, which can be obtained from the Statistics > Data Summaries > Summary Statistics menu, or from the sequential ANOVA table by adding up all of the Sums of Squares (for each variable plus the residual) and dividing by the total df = n - 1. Create a table of R-squared and adjusted R-squared values.
- Create a scatter plot of the data. Add the fitted values for each model using Insert Plot > Line Plot. Does increasing the number of terms improve the fit? Can you tell the difference between the fits after a certain point? How is this reflected in the R-squared values? In the adjusted R-squared values?
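The adjusted R-squared calculation can be sketched in Python using the mean-square form (adjusted R-squared = 1 - residual mean square / total mean square, where the total mean square is the sample variance). The ANOVA numbers below are hypothetical, not the Galileo results:

```python
def adjusted_r2(ss_resid, df_resid, ss_total, n):
    """Adjusted R-squared as 1 - (residual mean square / total mean square);
    the total mean square is total SS / (n - 1), i.e., the sample variance."""
    ms_resid = ss_resid / df_resid
    ms_total = ss_total / (n - 1)
    return 1 - ms_resid / ms_total

# Hypothetical ANOVA numbers for one model: n = 17 observations,
# residual SS = 180 on 14 df, total SS = 1000
r2 = 1 - 180 / 1000
adj = adjusted_r2(180.0, 14, 1000.0, 17)
print(round(r2, 4), round(adj, 4))  # 0.82 0.7943
```

Unlike R-squared, the adjusted version penalizes extra terms through the residual df, which is why it can stop improving (or even drop) for the higher-order polynomials.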