Assignment:

Using the data for this case, provided in the spreadsheet file entitled CROQPAIN.xls, do the following:

(a) Examine the operating earnings regression model output obtained from the 60 stores, as shown in Table 6.27. Try to improve the model by eliminating certain independent variables or by making any other changes that you think make good sense. You should strive for a model that is simple (i.e., has few independent variables), and that does not violate any of the basic assumptions of multiple regression (e.g., multicollinearity, heteroscedasticity), but nevertheless has good predictive power.

Solution. We will try to improve the model by taking various steps.

Please find below the step-wise solution for improving the model-

1. Exploratory Data Analysis to check the Outliers: -

We performed the Exploratory Data Analysis, where we see the scatter plots for all the variables to see if they are containing any outliers or not.

For example, we can see the scatter plot of K below-

[pic 1]

We can see that the above 3-4 points are in the different range from most of the values. So, these can be considered the outliers.

2. Outlier Analysis through inter-quartile range-

We calculated the first and third quartile for each variable, and then calculated the inter-quartile range. From that interquartile range, we calculated the lower and upper bound for each variable, and any outlier will lie outside that range.

[pic 2]

3. Winsorization Method to perform outlier treatment-

Instead of deleting the outliers for different variables, we replaced them by the nearest upper bound or lower bound variable so that it lies within the allowed range of lower and upper bound.

4. Checking the Correlation factor between all the variables-

We checked the Pearson’s correlation value between all the variables to see how much each variable is correlated with each other. The significance value is taken as 0.05 and it is for 2-tailed distribution.

[pic 3]

We can see from here that the variable EARN is well correlated with the variable INC, and the significance value is 0.00, which lies in the acceptable region. Similarly, we can see the correlation values for rest of the combinations. It is just for a basic idea of relation of these variables.

5. Checking of the multicollinearity and performing the linear regression-

“EARN” has been considered the dependent variable, and rest of the variables excluding “STORE” have been considered as the independent variables. Now, Linear Regression has been performed on the remaining independent variables to check multicollinearity between them. VIF (Variance Inflation Factor) is obtained from this operation. It measures the impact of collinearity among the variables and their dependency on another variable in a regression model. Independent variables with VIF Values greater than 3 were removed (3 has been considered as the benchmark to ensure minimum risk).

Here is the screenshot of the output[pic 4]

[pic 5]

Here, we can see that the significance F value of ANOVA is 0.000, which shows that the model is statistically significant.

Now, we will look into the coefficient table. Variables with coefficient equal to zero, significance value greater than 0.05, and VIF values greater than 3 will be removed to remove multicollinearity and insignificant variables. So, the remaining independent variables are – SIZE, INC, NREST, and PRICE.

Now, we will again perform the linear regression with EARN as dependent variables, and SIZE, INC, NREST, and PRICE as independent variables.

[pic 6]

We can see that the overall significance value of ANOVA is 0.00, but the PRICE variable is having significance value as 0.066, which is greater than 0.05. So, we will remove this variable, and further build the model.

The new output will be –

[pic 7]

Here, every parameter lies in the allowable range. So, we can conclude this output. And the linear regression model will be as –

EARN = -352.218 + 0.748*SIZE + 11.506*INC + 1.642*NREST

(b) Michel thinks that a good way to validate the model obtained with data from the 60 stores is to see how a similar model, obtained from the 50 stores opened before 1994, would have performed in predicting the performance of the last ten stores opened. Step back one year prior to the opening of the last ten restaurants. Amend the model you have developed using only the data from the first fifty stores. Using Croq’Pain’s performance ratio target of 26%, which of the ten stores would you have opened in 1994?