For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

(a) The sample size n is extremely large, and the number of predictors p is small.

(b) The number of predictors p is extremely large, and the number of observations n is small.

(c) The relationship between the predictors and response is highly non-linear.

(d) The variance of the error terms, i.e. σ² = Var(ϵ), is extremely high.

(a) better - with a large sample size and few predictors, a flexible approach can fit the data more closely without overfitting, so we would expect a better fit than with an inflexible approach

(b) worse - a flexible method would overfit the small number of observations

(c) better - with more degrees of freedom, a flexible model would obtain a better fit

(d) worse - flexible methods fit to the noise in the error terms and increase variance

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

(a) regression. inference. quantitative output of CEO salary based on the firm's features. n = 500 (firms in the US). p = 3 (profit, number of employees, industry).

(b) classification. prediction. predicting new product's success or failure. n = 20 (similar products previously launched). p = 13 (price charged, marketing budget, competition price, ten other variables).

(c) regression. prediction. quantitative output of % change in the USD/Euro exchange rate. n = 52 (weeks of 2012 weekly data). p = 3 (% change in US market, % change in British market, % change in German market).

We now revisit the bias-variance decomposition.

(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

(b) Explain why each of the five curves has the shape displayed in part (a).

(a) See 3a.jpg.

(b) all 5 lines >= 0

i. (squared) bias - decreases monotonically because increases in flexibility yield a closer fit

ii. variance - increases monotonically because increases in flexibility yield overfit

iii. training error - decreases monotonically because increases in flexibility yield a closer fit

iv. test error - concave up curve because increase in flexibility yields a closer fit before it overfits

v. Bayes (irreducible) error - defines the lower limit, the test error is bounded below by the irreducible error due to variance in the error (epsilon) in the output values (0 <= value). When the training error is lower than the irreducible error, overfitting has taken place. The Bayes error rate is defined for classification problems and is determined by the ratio of data points which lie at the 'wrong' side of the decision boundary, (0 <= value < 1).

You will now think of some real-life applications for statistical learning.

(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

(c) Describe three real-life applications in which cluster analysis might be useful.

(a)

i. stock market price direction, prediction, response: up, down, input: yesterday's price movement % change, two previous day price movement % change, etc.

ii. illness classification, inference, response: ill, healthy, input: resting heart rate, resting breath rate, mile run time

iii. car part replacement, prediction, response: needs to be replaced, good, input: age of part, mileage used for, current amperage

(b)

i. CEO salary. inference. predictors: age, industry experience, industry, years of education. response: salary.

ii. car part replacement. inference. response: life of car part. predictors: age of part, mileage used for, current amperage.

iii. life expectancy. prediction. response: age at death. predictors: current age, gender, resting heart rate, resting breath rate, mile run time.

(c)

i. cancer type clustering. diagnose cancer types more accurately.

ii. Netflix movie recommendations. recommend movies based on users who have watched and rated similar movies.

iii. marketing survey. clustering of demographics for a product(s) to see which clusters of consumers buy which products.

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

The advantages of a very flexible approach for regression or classification are a better fit for non-linear relationships and decreased bias. The disadvantages are that it requires estimating a greater number of parameters and can follow the noise too closely (overfit), increasing variance. A more flexible approach would be preferred to a less flexible approach when we are interested in prediction and not the interpretability of the results. A less flexible approach would be preferred to a more flexible approach when we are interested in inference and the interpretability of the results.

The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.

(b) What is our prediction with K = 1? Why?

(c) What is our prediction with K = 3? Why?

(d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?

(b) Green. Observation #5 is the closest neighbor for K = 1.

(c) Red. Observations #2, 5, 6 are the closest neighbors for K = 3. 2 is Red, 5 is Green, and 6 is Red.

(d) Small. A small K would be flexible for a non-linear decision boundary, whereas a large K would try to fit a more linear boundary because it takes more points into consideration.
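The training table is not reproduced in these notes. As a rough sketch, the snippet below assumes the six observations are those of the corresponding textbook exercise (an assumption; the values are at least consistent with the answers to (b) and (c) above). It computes the Euclidean distances asked for in (a) and then takes a majority vote among the K nearest neighbors.

```python
import math
from collections import Counter

# ASSUMED training data (X1, X2, X3, Y); the table is missing from the notes.
train = [
    ((0, 3, 0), "Red"),
    ((2, 0, 0), "Red"),
    ((0, 1, 3), "Red"),
    ((0, 1, 2), "Green"),
    ((-1, 0, 1), "Green"),
    ((1, 1, 1), "Red"),
]
test_point = (0, 0, 0)


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Part (a): distance from each observation to the test point.
distances = [euclidean(x, test_point) for x, _ in train]


def knn_predict(k):
    """Majority vote among the k observations closest to the test point."""
    ranked = sorted(zip(distances, (label for _, label in train)))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


for i, d in enumerate(distances, start=1):
    print(f"obs {i}: distance = {d:.2f}")
print("K = 1:", knn_predict(1))
print("K = 3:", knn_predict(3))
```

With these assumed values, observation 5 is nearest (distance √2), followed by observations 6 and 2, which matches the neighbors named in the answers.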

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

A parametric approach reduces the problem of estimating f down to one of estimating a set of parameters because it assumes a form for f. A non-parametric approach does not assume a functional form for f and so requires a very large number of observations to accurately estimate f. The advantages of a parametric approach to regression or classification are the simplifying of modeling f to a few parameters and not as many observations are required compared to a non-parametric approach. The disadvantages of a parametric approach to regression or classification are a potential to inaccurately estimate f if the form of f assumed is wrong or to overfit the observations if more flexible models are used.

Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.

In Table 3.4, the null hypothesis for "TV" is that in the presence of radio ads and newspaper ads, TV ads have no effect on sales. Similarly, the null hypothesis for "radio" is that in the presence of TV and newspaper ads, radio ads have no effect on sales. (And there is a similar null hypothesis for "newspaper".) The low p-values of TV and radio are strong evidence against the null hypotheses for TV and radio. The high p-value of newspaper means that we cannot reject the null hypothesis for newspaper: there is no evidence that newspaper ads affect sales once TV and radio ads are accounted for.

Carefully explain the differences between the KNN classifier and KNN regression methods.

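KNN classifier and KNN regression methods are closely related in formula: both start by finding the K training observations nearest to a test point. However, the KNN classifier outputs a class for Y (qualitative), chosen as the most common class among those neighbors, whereas KNN regression predicts a quantitative value for f(X), taken as the average of the neighbors' responses.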
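To make the contrast concrete, here is a minimal sketch with a single predictor. The function names and the toy data are hypothetical, chosen only for illustration: both methods share the neighbor search and differ only in how the neighbors' responses are combined.

```python
from collections import Counter


def nearest(train, x0, k):
    """Responses of the k training points closest to x0 (one predictor)."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - x0))
    return [y for _, y in ranked[:k]]


def knn_classify(train, x0, k):
    # Qualitative response: majority vote among the k neighbors.
    return Counter(nearest(train, x0, k)).most_common(1)[0][0]


def knn_regress(train, x0, k):
    # Quantitative response: average of the k neighbors' responses.
    ys = nearest(train, x0, k)
    return sum(ys) / len(ys)


# Hypothetical toy data: (x, response) pairs.
labels = [(1.0, "A"), (2.0, "A"), (3.0, "B"), (6.0, "B")]
values = [(1.0, 10.0), (2.0, 12.0), (3.0, 20.0), (6.0, 40.0)]

print(knn_classify(labels, 2.2, k=3))
print(knn_regress(values, 2.2, k=3))
```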

Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Level (1 for College and 0 for High School), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Level. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get β̂0 = 50, β̂1 = 20, β̂2 = 0.07, β̂3 = 35, β̂4 = 0.01, β̂5 = −10.

(a) Which answer is correct, and why?

i. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates.

ii. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates.

iii. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates provided that the GPA is high enough.

iv. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates provided that the GPA is high enough.

(b) Predict the salary of a college graduate with IQ of 110 and a GPA of 4.0.

(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

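Y = 50 + 20(gpa) + 0.07(iq) + 35(level) + 0.01(gpa * iq) - 10 (gpa * level)

(a) For a fixed GPA = k_1 and IQ = k_2: Y = 50 + 20 k_1 + 0.07 k_2 + 35 level + 0.01(k_1 * k_2) - 10 (k_1 * level)
high school: (level = 0) 50 + 20 k_1 + 0.07 k_2 + 0.01(k_1 * k_2)
college: (level = 1) 50 + 20 k_1 + 0.07 k_2 + 35 + 0.01(k_1 * k_2) - 10 (k_1)
The difference, college minus high school, is 35 - 10 k_1, which is negative when the GPA exceeds 3.5. Once the GPA is high enough, high school graduates earn more on average. => iii.

(b) Y(Level = 1, IQ = 110, GPA = 4.0) = 50 + 20 * 4 + 0.07 * 110 + 35 + 0.01 (4 * 110) - 10 * 4 = 137.1, i.e. a predicted starting salary of $137,100.

(c) False. We must examine the p-value of the regression coefficient to determine if the interaction term is statistically significant or not. The size of a coefficient depends on the scale of its predictors, so a small coefficient alone says little about the evidence for an effect.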
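The arithmetic for parts (a) and (b) can be checked with a short sketch. The function name is hypothetical, and the third predictor is coded as Level (1 for college, 0 for high school), following the question statement.

```python
def predicted_salary(gpa, iq, level):
    """Fitted model from the exercise; salary in thousands of dollars.

    level is 1 for college and 0 for high school.
    """
    return (50 + 20 * gpa + 0.07 * iq + 35 * level
            + 0.01 * gpa * iq - 10 * gpa * level)


# Part (b): college graduate with IQ 110 and GPA 4.0.
print(round(predicted_salary(4.0, 110, 1), 1))  # 137.1

# Part (a): at fixed GPA and IQ, the college-minus-high-school gap is
# 35 - 10 * GPA, so it changes sign at GPA = 3.5.
for gpa in (3.0, 4.0):
    gap = predicted_salary(gpa, 110, 1) - predicted_salary(gpa, 110, 0)
    print(f"GPA {gpa}: college - high school = {gap:+.1f}")
```

The gap does not depend on IQ, because the IQ terms are identical for both levels and cancel in the difference.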

(a) I would expect the polynomial regression to have a lower training RSS than the linear regression because its extra flexibility lets it fit the training data, including the noise from the irreducible error (Var(ϵ)), more tightly.

(b) Converse to (a), I would expect the polynomial regression to have a higher test RSS, as overfitting the training data would lead to more error than the linear regression.

(c) Polynomial regression has lower training RSS than the linear fit because of its higher flexibility: no matter what the underlying true relationship is, the more flexible model will follow the points more closely and reduce training RSS. An example of this behavior is shown in Figure 2.9 from Chapter 2.

(d) There is not enough information to tell which test RSS would be lower for either regression, given that the problem statement is defined as not knowing "how far it is from linear". If it is closer to linear than cubic, the linear regression test RSS could be lower than the cubic regression test RSS. Or, if it is closer to cubic than linear, the cubic regression test RSS could be lower than the linear regression test RSS. This is due to the bias-variance tradeoff: it is not clear what level of flexibility will fit the data better.

Using (3.4), argue that in the case of simple linear regression, the least squares line always passes through the point (x̄, ȳ).

The least squares line is y = B_0 + B_1 x.
From (3.4): B_0 = avg(y) - B_1 avg(x).
The point (avg(x), avg(y)) is on the line if B_0 + B_1 avg(x) - avg(y) = 0. Substituting B_0:
(avg(y) - B_1 avg(x)) + B_1 avg(x) - avg(y) = 0
0 = 0
So the line always passes through (avg(x), avg(y)).
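As a numerical sanity check of the argument, the sketch below fits a simple linear regression using the closed-form estimates and evaluates the fitted line at the mean of x. The data values and function name are hypothetical.

```python
def least_squares(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 4.0, 7.0, 9.0]
ys = [2.1, 2.9, 6.2, 8.8, 13.0]

b0, b1 = least_squares(xs, ys)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# The fitted value at x_bar should equal y_bar up to rounding error.
print(abs((b0 + b1 * x_bar) - y_bar) < 1e-9)
```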

We now examine the differences between LDA and QDA.

(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

(c) In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

(d) True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

If the Bayes decision boundary is linear, we expect QDA to perform better on the training set because its higher flexibility will yield a closer fit. On the test set, we expect LDA to perform better than QDA because QDA's extra flexibility adds variance without any reduction in bias when the true boundary is linear.

If the Bayes decision boundary is non-linear, we expect QDA to perform better on both the training and test sets.

We expect the test prediction accuracy of QDA relative to LDA to improve, in general, as the sample size n increases, because a more flexible method yields a better fit when more samples are available and its higher variance is offset by the larger sample size.

False. With fewer sample points, the variance from using a more flexible method, such as QDA, would lead to overfitting, yielding a higher test error rate than LDA.

Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?

Logistic regression: 20% training error rate, 30% test error rate KNN(K=1): average error rate of 18%

For KNN with K=1, the training error rate is 0% because for any training observation, its nearest neighbor is the observation itself. Since the training and test sets are equally sized, the average is (0% + test error rate)/2 = 18%, so KNN has a test error rate of 36%. I would choose logistic regression because of its lower test error rate of 30%.

1−1/n

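In the bootstrap, we sample with replacement, so each observation in the bootstrap sample has the same, independent 1/n chance of equaling the jth observation, and hence a 1 − 1/n chance of not equaling it. Applying the product rule over all n observations in the bootstrap sample, the probability that the jth observation is not in the bootstrap sample is (1 − 1/n)^n.

n = 5: Pr(in) = 1 − Pr(out) = 1 − (1 − 1/5)^5 = 1 − (4/5)^5 = 67.2%

n = 100: Pr(in) = 1 − Pr(out) = 1 − (1 − 1/100)^100 = 1 − (99/100)^100 = 63.4%

n = 10000: Pr(in) = 1 − (1 − 1/10000)^10000 = 63.2%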
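These percentages can be reproduced with a few lines. The function name is hypothetical; the last line prints the limiting value 1 − 1/e, which the probabilities approach as n grows.

```python
import math


def prob_in_bootstrap(n):
    """Probability that a given observation appears in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n


for n in (5, 100, 10000):
    print(f"n = {n}: {prob_in_bootstrap(n):.1%}")

# Limit as n grows: (1 - 1/n)^n tends to 1/e.
print(f"limit: {1 - math.exp(-1):.1%}")
```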

We now review k-fold cross-validation.

(a) Explain how k-fold cross-validation is implemented.

(b) What are the advantages and disadvantages of k-fold cross-validation relative to:

i. The validation set approach?

ii. LOOCV?

k-fold cross-validation is implemented by taking the set of n observations and randomly splitting into k non-overlapping groups. Each of these groups acts as a validation set and the remainder as a training set. The test error is estimated by averaging the k resulting MSE estimates.

i. The validation set approach is conceptually simple and easily implemented as you are simply partitioning the existing training data into two sets. However, there are two drawbacks: (1.) the estimate of the test error rate can be highly variable depending on which observations are included in the training and validation sets. (2.) the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.

ii. LOOCV is a special case of k-fold cross-validation with k = n. Thus, LOOCV is the most computationally intense method since the model must be fit n times. Also, LOOCV has higher variance, but lower bias, than k-fold CV.
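The procedure can be sketched in a few lines. Everything here is illustrative: the function names and data are hypothetical, and the "model" is deliberately trivial (it predicts the mean of the training responses) so that the splitting and averaging logic stays in focus.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly split indices 0..n-1 into k non-overlapping, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean(train_ys):
    # Placeholder model (an assumption for this sketch): predict the training mean.
    return sum(train_ys) / len(train_ys)


def k_fold_cv_mse(ys, k, seed=0):
    """Average of the k validation MSEs, each from a model fit on the other k-1 folds."""
    fold_mses = []
    for held_out in k_fold_indices(len(ys), k, seed):
        held = set(held_out)
        train_ys = [y for i, y in enumerate(ys) if i not in held]
        pred = fit_mean(train_ys)
        mse = sum((ys[i] - pred) ** 2 for i in held_out) / len(held_out)
        fold_mses.append(mse)
    return sum(fold_mses) / k


# Hypothetical responses.
ys = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.5, 6.5, 4.5, 9.0]

print("5-fold CV estimate:", k_fold_cv_mse(ys, k=5))
print("LOOCV estimate (k = n):", k_fold_cv_mse(ys, k=len(ys)))
```

Setting k equal to n gives LOOCV, which requires n model fits instead of k. That is the computational cost mentioned above.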

Suppose that we use some statistical learning method to make a prediction for the response Y for a particular value of the predictor X. Carefully describe how we might estimate the standard deviation of our prediction.

We might estimate the standard deviation of our prediction by using the bootstrap approach. The bootstrap approach works by repeatedly sampling n observations (with replacement) from the original data set B times, for some large value of B, each time fitting a new model and obtaining a prediction for the given value of X. The standard deviation of these B predictions is our estimate of the standard deviation of the prediction.
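A minimal sketch of this bootstrap procedure follows. It assumes simple linear regression as the learning method, purely for illustration; the data, function names, and the choice B = 1000 are hypothetical. The spread is summarized as the standard deviation of the B predictions at the point of interest.

```python
import random
import statistics


def fit_line(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def bootstrap_prediction_sd(xs, ys, x0, B=1000, seed=0):
    """Standard deviation of the predictions at x0 across B bootstrap refits."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    while len(preds) < B:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: the slope is undefined
        by = [ys[i] for i in idx]
        b0, b1 = fit_line(bx, by)
        preds.append(b0 + b1 * x0)
    return statistics.stdev(preds)


# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 2.3, 2.9, 4.4, 4.8, 6.5, 6.9, 8.1]

print(bootstrap_prediction_sd(xs, ys, x0=4.5))
```

Any other learning method could be substituted for `fit_line`, as long as it can be refit on each resampled data set and asked for a prediction at the same point.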

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . ,p predictors. Explain your answers:

(a) Which of the three models with k predictors has the smallest training RSS?

(b) Which of the three models with k predictors has the smallest test RSS?

(c) True or False:

i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.

ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by backward stepwise selection.

iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by forward stepwise selection.

iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.

v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection.

Best subset selection has the smallest training RSS because the other two methods determine models with a path dependency on which predictors they pick first as they iterate to the k'th model.

Best subset selection may have the smallest test RSS because it considers more models than the other methods. However, the other methods might, by chance, pick a model that fits the test data better, so there is no guarantee.

i. True. ii. True. iii. False. iv. False. v. False.

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

(b) Repeat (a) for ridge regression relative to least squares.

(c) Repeat (a) for non-linear methods relative to least squares.

iii. Less flexible, and gives better predictions when its increase in bias is less than its decrease in variance.

Same as lasso. iii.

ii. More flexible, less bias, more variance

(iv) Steadily decreases: As we increase s from 0, all βs grow in magnitude from 0 to their least squares estimate values. Training error is at its maximum when all βs are 0, and it steadily decreases to the ordinary least squares RSS.

(ii) Decrease initially, and then eventually start increasing in a U shape: When s = 0, all βs are 0, the model is extremely simple and has a high test RSS. As we increase s, the βs assume non-zero values and the model starts fitting well on test data, so test RSS decreases. Eventually, as the βs approach their full OLS values, they start overfitting to the training data, increasing test RSS.

(iii) Steadily increase: When s = 0, the model effectively predicts a constant and has almost no variance. As we increase s, the model includes more βs and their values start increasing. At this point, the values of the βs become highly dependent on the training data, thus increasing the variance.

(iv) Steadily decrease: When s = 0, the model effectively predicts a constant and hence the prediction is far from the actual value. Thus bias is high. As s increases, more βs become non-zero and the model continues to fit the training data better. Thus, bias decreases.

(v) Remains constant: By definition, irreducible error is model independent and hence, irrespective of the choice of s, remains constant.

(iii) Steadily increase: As we increase λ from 0, all βs shrink from their least squares estimate values toward 0. Training error is at its minimum for the full OLS βs, and it steadily increases as the βs are reduced toward 0.

(ii) Decrease initially, and then eventually start increasing in a U shape: When λ = 0, all βs have their least squares estimate values. In this case, the model tries to fit hard to the training data and hence test RSS is high. As we increase λ, the βs start shrinking toward zero and some of the overfitting is reduced. Thus, test RSS initially decreases. Eventually, as the βs approach 0, the model becomes too simple and test RSS increases.

(iv) Steadily decreases: When λ = 0, the βs have their least squares estimate values. The actual estimates heavily depend on the training data and hence variance is high. As we increase λ, the βs start decreasing and the model becomes simpler. In the limiting case of λ approaching infinity, all βs reduce to zero and the model predicts a constant and has no variance.

(iii) Steadily increases: When λ = 0, the βs have their least squares estimate values and hence have the least bias. As λ increases, the βs start reducing towards zero, the model fits the training data less accurately and hence bias increases. In the limiting case of λ approaching infinity, the model predicts a constant and hence bias is maximum.

(v) Remains constant: By definition, irreducible error is model independent and hence, irrespective of the choice of λ, remains constant.
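The training RSS behavior can be illustrated with a small sketch, assuming that λ here is the ridge regression penalty (the notes do not restate the question). With a single predictor and an unpenalized intercept, the ridge slope has the closed form Sxy / (Sxx + λ), so no library is needed. The data and function names are hypothetical.

```python
def ridge_fit(xs, ys, lam):
    """Single-predictor ridge fit with an unpenalized intercept.

    Returns (intercept, slope); lam = 0 recovers ordinary least squares.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / (sxx + lam)
    return y_bar - slope * x_bar, slope


def rss(xs, ys, intercept, slope):
    """Residual sum of squares of the fitted line on the given data."""
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical training data.
xs = [1.0, 2.0, 4.0, 7.0, 9.0]
ys = [2.1, 2.9, 6.2, 8.8, 13.0]

lambdas = [0, 1, 10, 100, 1000]
fits = [ridge_fit(xs, ys, lam) for lam in lambdas]
train_rss = [rss(xs, ys, b0, b1) for b0, b1 in fits]

for lam, (b0, b1), r in zip(lambdas, fits, train_rss):
    print(f"lambda = {lam}: slope = {b1:.3f}, training RSS = {r:.2f}")
```

The slope shrinks toward zero and the training RSS rises as λ grows. Test RSS, variance, and bias would need simulated test data to illustrate and are not covered by this sketch.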
