# Statistics 242 -- Applied Regression Analysis

## Topics

1. Hypothesis Testing in Multiple Regression
• Overall F-test for all regression coefficients
• Extra Sum of Squares F-tests for hypotheses involving one or more parameters
• t-tests for individual parameters
2. Interpretation of estimates
3. Multiple R-Squared vs adjusted R-squared
4. Estimated mean, confidence intervals and prediction intervals

## Assignment

1. Conceptual Exercises 1-8 (Be prepared to discuss in lab)
2. Exercise 15
3. Exercise 16: parts a, c-e (skip b), and also turn in a plot with:
• scatter plot of the data
• fitted regression model (add the equation)
• 95% confidence intervals for the mean
• 95% prediction intervals for future observations
4. Exercise 17 (explain why R-squared is not useful for selecting among these models)
5. For the energetic student: Exercises 20-22. Do not turn in.

For each problem, answer the questions asked in the text and any listed below. Write at most one brief paragraph summarizing what you have found for each problem.

## Commands for Specific Problems

### Exercise 15

In this exercise we will conduct an Extra Sum of Squares F-test to see whether the mean intervals, adjusted for duration, also depend on the date of observation. There are three ways to get the F-statistic: 1) by hand from the ANOVA tables (you should know this for exams :-) and for use with other packages that do not calculate the test statistic directly), 2) from an ANOVA table with sequential Sums of Squares, and 3) using the Model Comparison option (easy to use in S-Plus, but not available in all statistics packages). Make sure that you understand 1). We will also show how to add separate regression lines to the plot.
1. Download the Old Faithful data, Ex0724.asc again, unless you have saved the dataframe previously.

2. Open the Commands Window (Select Command Window from the Windows menu)

3. To have S-Plus create all of our dummy variables automatically for us, we need to change one option on how S-Plus handles factors. In the command window, enter

options(contrasts="contr.treatment")

If we have a categorical or factor variable with k levels, this tells S-Plus to create k-1 dummy or indicator variables, where the first dummy variable is an indicator for the 2nd level, the 2nd dummy variable is an indicator for the 3rd level, and the (k-1)th dummy variable is an indicator for the kth (last) level. The first level corresponds to having all dummy variables equal to 0.
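This treatment coding scheme can be sketched in Python with pandas (an illustration of the coding itself, not of S-Plus; the column names are pandas' own defaults):

```python
import pandas as pd

# A factor with k = 3 levels: treatment coding yields k - 1 = 2 dummies.
date = pd.Series(["d1", "d2", "d3", "d2"], dtype="category")

# drop_first=True mirrors contr.treatment: the first level is the
# baseline, coded as all dummy variables equal to 0.
dummies = pd.get_dummies(date, prefix="date", drop_first=True)
print(dummies)
# Rows for level "d1" are (0, 0); "d2" is (1, 0); "d3" is (0, 1).
```

For the Old Faithful data, date has 8 levels, so treatment coding produces 7 dummy variables in the same way.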

4. Now let's tell S-Plus to treat date as a factor rather than as a continuous variable. Go to the Data menu and select Change Data Type. Select the column for "date", and then for Column Type select "factor". Click on OK.

5. Fit the regression model of interval on duration and date. Go to the Statistics menu, select Regression, and then Linear. In the dialog box, create the model formula as

interval ~ duration + date

It is important that you put date last so that it appears last in the sequential Sums of Squares in the ANOVA table. In the box for Save Model Object As, enter Full.lm (i.e., this is the full model). Under the Results page/tab, check the box for ANOVA and save the fitted values in the dataframe. Click OK to run the regression. (Verify that the residual plots look OK, then discard them.)

In the Report Window, you will have the ANOVA table for the Full Model. Verify that the df for date are 7, not 1. Because we are treating date as a factor with 8 levels, it should have 8 - 1 = 7 df, one for each dummy variable. If the df are 1, date is being treated as a continuous variable, and the options() command earlier likely did not take effect.

6. Now fit the reduced model. Go to the linear regression menu and create the formula as

interval ~ duration

Save the Model Object as Reduced.lm. Under the Results tab, check the ANOVA box but not the fitted values box. Click OK. The ANOVA table for the reduced model will be in the Report Window.

7. Now to answer the question in the text! To test whether there is any difference in mean intervals due to date, construct the Extra Sum of Squares F-test using the Residual SS from these two tables. You can do the calculations in the command window to get the F-statistic, but in your write-up, you should show the steps involved in the calculation, i.e., show how the F-stat is computed from the full and reduced model ANOVA tables.

8. To get a p-value, use the pf() function in the command window; e.g., for an F-stat of 5 with 4 and 10 df, the p-value is
1 - pf(5,4,10)
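Steps 7 and 8 amount to a short computation that works the same in any package. A Python sketch (the RSS and df values below are hypothetical placeholders, chosen so the F-stat matches the pf() example above; scipy's f.sf plays the role of 1 - pf):

```python
from scipy.stats import f

def extra_ss_f(rss_reduced, df_reduced, rss_full, df_full):
    """Extra Sum of Squares F-test comparing nested models.

    rss_* are residual Sums of Squares, df_* are residual df."""
    extra_ss = rss_reduced - rss_full          # SS explained by the extra terms
    extra_df = df_reduced - df_full            # parameters set to 0 under Ho
    f_stat = (extra_ss / extra_df) / (rss_full / df_full)  # numerator MS / MSE(full)
    p_value = f.sf(f_stat, extra_df, df_full)  # upper tail, same as 1 - pf(...)
    return f_stat, p_value

# Hypothetical numbers reproducing the step-8 example: F = 5 on 4 and 10 df.
f_stat, p_value = extra_ss_f(rss_reduced=30.0, df_reduced=14,
                             rss_full=10.0, df_full=10)
print(f_stat, round(p_value, 4))  # 5.0 0.0178
```

In your write-up, substitute the Residual SS and df from your own two ANOVA tables.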

9. Are your F-stat and p-value different from those in the row corresponding to date in the sequential ANOVA table for the Full Model? They should agree if you added date LAST in the model formula: that row gives the Extra Sum of Squares for adding date after taking into account duration. Note: you can always add up sequential Sums of Squares, provided they are in the right order, to get an Extra Sum of Squares. For example, for the F-test in class we could have added the Sums of Squares for I and I:log.duration to get the Extra Sum of Squares, as long as the model was written in the order log.duration + I + I:log.duration.

10. S-Plus can automatically get the F-stat and p-value for the Extra Sum of Squares test by using the Model Comparison option. Go to the Statistics menu and select Compare Models. Select the two models, Full.lm and Reduced.lm (use a ctrl-click if they are not in order), and click OK. The F-stat and p-value should agree with what you calculated by hand. The variable listed under Test specifies which variables are dropped from the model under Ho; the df represent the change in df between the two models (the number of parameters set to 0 under Ho). Note: the df is sometimes negative because of the order of the models; use its absolute value as the numerator df for the F-test. The absolute value of the Sum of Squares is the Extra Sum of Squares due to the added variables. The F-stat is {Extra SS / number of parameters that equal 0 under Ho} / MSE from the Full model.

11. Do the results make sense visually? Make a scatter plot with the simple linear regression model (y = interval, x = duration) using the graph menu. Now, add the regression lines for each date: Go to the Insert menu and select Plot and choose Line Plot. For the x-axis select duration, but for the y-axis select the column for your fitted values, fit. Under the Subset Rows with option, enter date==1 (this will only plot the cases where date is equal to 1). In the Sort/Smooth tab, select Sort X,Y by X to have the data sorted so that the lines are connected in the right order. Click OK; you should see the fitted regression line for that date added to the plot. Repeat for the other 7 dates, i.e. use date==2, date==3,... date==8. From the Insert Menu, add a Legend, Title, or any other information (such as the equation for the simple linear regression model).
12. Provide an interpretation and confidence interval for each coefficient in the Full model and the Reduced model. Are the confidence intervals consistent with the results of the Extra Sum of Squares test? Explain.

### Exercise 16

1. Download the Galileo data, Case1001.asc for Case Study 10.1. Read the background material on the problem before lab.

2. Fit the quadratic regression model to answer parts a, c-e. For c-e we need predictions when height = 500 punti. A computational trick is to subtract off the value 500 from height. This way, the estimated mean at 500 punti is the intercept, and the standard error of the intercept is the standard error of the mean. To do this calculation by hand is otherwise pretty painful :-) For part a you may turn in results based on fitting the model using (height - 500).

From the statistics menu, bring up the regression dialog. For the model formula enter:

distance ~ I(height-500) + I((height-500)^2)

The I() function serves to "protect" the meaning of the expression inside the parentheses. Some symbols have different meanings in a model formula than they do normally; e.g., X1*X2 means fit the model with X1 + X2 + X1:X2, the main effects plus their interaction. In a model formula, "height - 1" means fit the model with height and no intercept, whereas I(height - 1) fits the model with an intercept and subtracts one from each height value. So if you want to do transformations on the fly, rather than creating a transformed variable in the dataframe, use I() to be safe. Note: this I() function is not the same as the indicator variable we created for the class example. Under the Predict tab, check the boxes for Predictions, Confidence Intervals, and Standard Errors (standard errors for the means at the observed height values). Specify Case1001 for the New Data and the dataframe for saving results. Under the Results tab, check the box for Correlation Matrix of Estimates. Click OK.
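The centering trick in step 2 can be checked numerically. A Python sketch with numpy (the data here are synthetic, for illustration only; np.polyfit fits the quadratic by least squares):

```python
import numpy as np

# Synthetic (height, distance) data, for illustration only.
rng = np.random.default_rng(0)
height = np.array([100., 200., 300., 450., 600., 800., 1000.])
distance = 200 + 0.7 * height - 0.0003 * height**2 + rng.normal(0, 5, 7)

# Fit the quadratic in the original and in the centered parameterization.
b2, b1, b0 = np.polyfit(height, distance, deg=2)
c2, c1, c0 = np.polyfit(height - 500, distance, deg=2)

# The centered intercept c0 is the fitted mean at height = 500 punti,
# and both parameterizations give identical fitted values.
mean_at_500 = b0 + b1 * 500 + b2 * 500**2
print(np.isclose(c0, mean_at_500))  # True
```

This is why the intercept's standard error in the centered fit is exactly the standard error of the estimated mean at 500 punti.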

3. You should be able to identify the estimates, their standard errors, and the estimate of sigma^2. The table or matrix of variances and covariances has to be assembled with a little work. The estimated variances of the coefficient estimates are the squares of their standard errors. The estimated covariance between two coefficient estimates is the correlation between them times the product of their standard errors, i.e., correlation(beta1-hat, beta2-hat)*SE(beta1-hat)*SE(beta2-hat). The other parts for c-e can be obtained from the output (review pages 264-268).
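Assembling the variance-covariance matrix from the correlation matrix and the standard errors is a one-line calculation. A Python sketch (the correlation and SE numbers are made up for illustration):

```python
import numpy as np

# Hypothetical correlation matrix of the coefficient estimates, and their SEs.
corr = np.array([[ 1.0, -0.5,  0.3],
                 [-0.5,  1.0, -0.9],
                 [ 0.3, -0.9,  1.0]])
se = np.array([2.0, 0.1, 0.004])

# cov(bi, bj) = corr(bi, bj) * SE(bi) * SE(bj); the diagonal is SE^2.
vcov = corr * np.outer(se, se)
print(vcov[0, 0])   # 4.0  (= SE(b0)^2)
print(vcov[0, 1])   # -0.1 (= -0.5 * 2.0 * 0.1)
```

Replace corr and se with the correlation matrix of estimates and the standard errors from your S-Plus output.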

4. Write out the least squares regression equation for the model distance ~ I(height-500) + I((height-500)^2), and rewrite it so that the mean is a function of height and height^2 (not height - 500). Verify that the fitted equation is the same as what is reported on page 265, i.e., the coefficients multiplying height and height^2 should be the same as on page 265. Even though we are using a different "parameterization" of the model, we get the same fitted values.
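The algebra in step 4 expands as b0 + b1(h-500) + b2(h-500)^2 = (b0 - 500 b1 + 500^2 b2) + (b1 - 1000 b2) h + b2 h^2. A quick numerical check of this expansion (the centered coefficients are placeholders, not the Galileo estimates):

```python
# Placeholder centered coefficients b0, b1, b2.
b0, b1, b2 = 355.0, 0.24, -0.0005

# Expand b0 + b1*(h - 500) + b2*(h - 500)^2 into powers of h.
a0 = b0 - 500 * b1 + 500**2 * b2   # new intercept
a1 = b1 - 2 * 500 * b2             # coefficient on height
a2 = b2                            # coefficient on height^2 is unchanged

h = 321.0                          # any height gives the same fitted value
centered = b0 + b1 * (h - 500) + b2 * (h - 500) ** 2
expanded = a0 + a1 * h + a2 * h ** 2
print(abs(centered - expanded) < 1e-9)  # True
```

Substituting your own estimates for b0, b1, b2 should reproduce the page-265 coefficients.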

5. Use the Transform menu to calculate se.pred, the standard error of predicted values, from the se.fit column. Also use the Transform option to create the upper and lower prediction intervals. The formula on page 185 still applies, although SE.fit has changed.
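The transformation in step 5 can be sketched in Python (the numbers are placeholders; the page-185 relation is SE(pred) = sqrt(sigma-hat^2 + SE(fit)^2), and scipy's t.ppf supplies the t multiplier):

```python
import numpy as np
from scipy.stats import t

sigma_hat = 10.0                  # residual SE from the fit (placeholder)
df_resid = 14                     # residual df (placeholder)
fit = np.array([250.0, 400.0])    # fitted means (placeholders)
se_fit = np.array([4.0, 3.0])     # SEs of the fitted means (placeholders)

# SE(pred) = sqrt(sigma_hat^2 + SE(fit)^2)
se_pred = np.sqrt(sigma_hat**2 + se_fit**2)

# 95% prediction intervals: fit +/- t(0.975, df) * SE(pred)
tmult = t.ppf(0.975, df_resid)
lower = fit - tmult * se_pred
upper = fit + tmult * se_pred
```

In S-Plus the same arithmetic goes into the Transform dialog, with sigma-hat and the residual df taken from your regression output.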

6. Create a scatter plot of the data. Use the Insert Plot feature as described earlier to add
• fitted mean (from the quadratic regression)
• 95% confidence intervals for the mean (the values created by S-Plus)
• 95% prediction intervals for future observations (what you just calculated above)

7. For the energetic student (do not turn in): For height = 100 punti, verify by hand, using the estimated variance-covariance matrix, that the standard error of the mean is the same as that calculated and added to the dataframe by S-Plus. (See section 10.4.3 for the expression.)
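The hand check in step 7 is the quadratic form from section 10.4.3: Var(estimated mean at x0) = x0' Cov(beta-hat) x0, with x0 = (1, height, height^2). A Python sketch with a made-up covariance matrix (use the matrix you assembled in step 3 for the real check):

```python
import numpy as np

# Hypothetical variance-covariance matrix of (b0, b1, b2), for illustration.
vcov = np.array([[ 4.0,  -0.02,  1e-5],
                 [-0.02,  0.01, -9e-6],
                 [ 1e-5, -9e-6,  1e-8]])

height = 100.0
x0 = np.array([1.0, height, height**2])  # design-matrix row at this height

# Var(estimated mean) = x0' Cov x0; the SE is its square root.
var_mean = x0 @ vcov @ x0
se_mean = np.sqrt(var_mean)
print(se_mean)
```

The result should match the se.fit value that S-Plus saved for the corresponding row of the dataframe.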

### Exercise 17

This exercise looks at the effect of adding variables on R-squared and adjusted R-squared.
1. Fit the simple linear regression on height, and save the fitted values. Rename the column to, say, fit1. (Double-click on the space for the column name to enter a name.)

2. You do not need to refit the quadratic model if you have the R-squared and mean squared residuals. Rename the fitted values from Exercise 16 to fit2.

3. To fit the cubic regression, use the model formula

distance ~ height + I(height^2) + I(height^3)

Save the fitted values and rename them as fit3.

4. For the next model add I(height^4) to the formula above. Save the fitted values as fit4. Repeat until you have the output for all models.

5. The multiple R-squared in the Report Window is what the text calls R-squared: the regression Sum of Squares divided by the total Sum of Squares. Using the expression on page 276, calculate the adjusted R-squared for each model. Note: the total mean square is just the sample variance, which can be obtained from the Statistics, Data Summaries, Summary Statistics menu, or from the sequential ANOVA table by adding up all of the Sums of Squares (for each variable plus the residual) and dividing by the total df = n - 1. Create a table of R-squared and adjusted R-squared values.
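The step-5 calculation compares the residual mean square to the total mean square. A Python sketch (the SS, df, and n values are placeholders; the equivalent textbook form is 1 - (n-1)/(n-p-1) * (1 - R^2)):

```python
def adjusted_r2(ss_resid, df_resid, ss_total, n):
    """Adjusted R-squared: 1 - (residual MS / total MS)."""
    ms_resid = ss_resid / df_resid
    ms_total = ss_total / (n - 1)   # total MS = sample variance of y
    return 1 - ms_resid / ms_total

# Placeholder values: n = 13 observations, a model with 10 residual df.
r2 = 1 - 100.0 / 10000.0            # plain R-squared = 1 - SS_resid/SS_total
adj = adjusted_r2(ss_resid=100.0, df_resid=10, ss_total=10000.0, n=13)
print(round(r2, 4), round(adj, 4))  # 0.99 0.988
```

Note that the adjusted value is always at most the unadjusted one, and it can decrease when an added term does not reduce the residual mean square enough.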

6. Create a scatter plot of the data. Add the fitted values for each model using Insert Plot > LinePlot. Does increasing the number of terms improve the fit? Can you tell the difference between the fits after a certain point? How is this reflected in the R-squared values? the adjusted R-squared values?