Question Book


Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.


(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.


(a) Regression. Inference. The response, CEO salary, is quantitative, and the goal is to understand which factors affect it. n = 500 (the top 500 US firms); p = 3 (profit, number of employees, industry).

(b) Classification. Prediction. The response, success or failure, is qualitative, and we want to predict the outcome for a new product. n = 20 (similar products previously launched); p = 13 (price charged, marketing budget, competition price, and ten other variables).

(c) Regression. Prediction. The response, the % change in the USD/Euro exchange rate, is quantitative, and we want to predict it. n = 52 (weekly observations for 2012); p = 3 (% change in the US, British, and German markets).

You will now think of some real-life applications for statistical learning.


(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

(c) Describe three real-life applications in which cluster analysis might be useful.

(a)

i. stock market price direction. prediction. response: up or down. predictors: % price change yesterday, % price change the day before, etc.

ii. illness classification. inference. response: ill or healthy. predictors: resting heart rate, resting breathing rate, mile run time.

iii. car part replacement. prediction. response: needs to be replaced or still good. predictors: age of part, mileage driven, current amperage.


(b)

i. CEO salary. inference. predictors: age, industry experience, industry, years of education. response: salary.

ii. car part lifetime. inference. response: life of the car part. predictors: age of part, mileage driven, current amperage.

iii. life expectancy. prediction. response: age at death. predictors: current age, gender, resting heart rate, resting breathing rate, mile run time.

(c)

i. cancer type clustering. diagnose cancer types more accurately.

ii. Netflix movie recommendations. recommend movies based on users who have watched and rated similar movies.

iii. marketing survey. cluster consumers by demographics to see which groups of consumers buy which products (see the sketch below).
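For example, the marketing-survey idea could be prototyped with k-means. This is only a minimal sketch, assuming scikit-learn is available and using made-up demographic columns (age, income, yearly spend), so all of the variable names and numbers below are hypothetical:

```python
# Minimal sketch: cluster survey respondents by (hypothetical) demographics.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical survey columns: age, household income, yearly spend on the product.
X = np.column_stack([
    rng.normal(45, 15, 300),          # age
    rng.normal(60_000, 20_000, 300),  # income
    rng.normal(500, 200, 300),        # yearly spend
])

X_scaled = StandardScaler().fit_transform(X)  # scale so income does not dominate
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(np.bincount(kmeans.labels_))            # respondents per cluster
```

Standardizing the features first matters here, since otherwise the income column would dominate the Euclidean distances k-means uses.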

Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Level (1 for College and 0 for High School), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Level. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get β̂0 = 50, β̂1 = 20, β̂2 = 0.07, β̂3 = 35, β̂4 = 0.01, β̂5 = −10.


(a) Which answer is correct, and why?

i. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates.

ii. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates.

iii. For a fixed value of IQ and GPA, high school graduates earn

more, on average, than college graduates provided that the GPA is high enough.

iv. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates provided that the GPA is high enough.

(b) Predict the salary of a college graduate with IQ of 110 and a GPA of 4.0.

(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.


Ŷ = 50 + 20(GPA) + 0.07(IQ) + 35(Level) + 0.01(GPA × IQ) − 10(GPA × Level)

(a) High school (Level = 0): Ŷ = 50 + 20(GPA) + 0.07(IQ) + 0.01(GPA × IQ)

College (Level = 1): Ŷ = 50 + 20(GPA) + 0.07(IQ) + 35 + 0.01(GPA × IQ) − 10(GPA)

For fixed IQ and GPA, the college graduate earns more only while 35 − 10(GPA) > 0, i.e. GPA < 3.5. Once the GPA is high enough (above 3.5), high school graduates earn more on average. => iii.


(b) Ŷ(Level = 1, IQ = 110, GPA = 4.0) = 50 + 20(4.0) + 0.07(110) + 35 + 0.01(4.0 × 110) − 10(4.0) = 137.1, i.e. a predicted starting salary of about $137,100.

(c) False. The size of a coefficient by itself says nothing about the strength of the evidence; we must examine the p-value (or another hypothesis test) for the interaction coefficient to determine whether the interaction effect is statistically significant.
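As a quick check of the arithmetic in (b), here is a small sketch that evaluates the fitted model (coefficients taken from the answer above) for a college graduate with IQ 110 and GPA 4.0; the function name is just for illustration:

```python
# Sketch: evaluate the fitted model for a college graduate (Level = 1),
# IQ = 110, GPA = 4.0. Salary is in thousands of dollars.
def predicted_salary(gpa, iq, level):
    """Fitted model from the answer above."""
    return (50
            + 20 * gpa
            + 0.07 * iq
            + 35 * level
            + 0.01 * gpa * iq
            - 10 * gpa * level)

print(predicted_salary(gpa=4.0, iq=110, level=1))  # 137.1 -> about $137,100
```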





We now examine the differences between LDA and QDA.

(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

(c) In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

(d) True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

a.

If the Bayes decision boundary is linear, we expect QDA to perform better on the training set because its higher flexibility will yield a closer fit to the training data. On the test set, we expect LDA to perform better than QDA, because QDA's extra flexibility risks overfitting when the true boundary is linear.

b.

If the Bayes decision boundary is non-linear, we expect QDA to perform better on both the training and test sets.

c.

We expect the test prediction accuracy of QDA relative to LDA to improve, in general, as the sample size n increases, because with more data the extra variance incurred by the more flexible QDA fit is less of a concern, so its lower bias can pay off.

d.

False. With few sample points, the extra variance from using a more flexible method such as QDA leads to overfitting, yielding a higher test error rate than LDA when the true boundary is linear.
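A small simulation can make (a) and (d) concrete. This is only a sketch, assuming scikit-learn's LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis, and two Gaussian classes with a common covariance matrix so that the true Bayes boundary is linear; the sample sizes and class means are arbitrary choices:

```python
# Sketch: compare LDA and QDA when the true (Bayes) boundary is linear.
# Two Gaussian classes with equal covariance -> the Bayes boundary is linear.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(1)

def make_data(n):
    X0 = rng.multivariate_normal([0, 0], np.eye(2), n // 2)
    X1 = rng.multivariate_normal([1.5, 1.5], np.eye(2), n // 2)
    X = np.vstack([X0, X1])
    y = np.r_[np.zeros(n // 2), np.ones(n // 2)]
    return X, y

X_train, y_train = make_data(100)    # small training set
X_test, y_test = make_data(10_000)   # large test set approximates test error

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "train acc:", model.score(X_train, y_train),
          "test acc:", model.score(X_test, y_test))
# Typically QDA matches or beats LDA on the training set but not on the test set.
```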

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, ..., p predictors. Explain your answers:

(a) Which of the three models with k predictors has the smallest training RSS?

(b) Which of the three models with k predictors has the smallest test RSS?

(c) True or False:

i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.

ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by backward stepwise selection.

iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by forward stepwise selection.

iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.

v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection.

a

Best subset selection has the smallest training RSS, because it considers every model with k predictors, whereas the two stepwise methods are path-dependent: the predictors they pick in earlier steps constrain which k-predictor model they can end up with.

b (*)

Best subset selection may have the smallest test RSS because it considers more models than the other methods. However, the other approaches might, by chance, select a model that happens to fit the test data better.

c

i. True. ii. True. iii. False. iv. False. v. False.
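To illustrate c.i, the nesting property of forward stepwise selection can be checked directly. A sketch, assuming scikit-learn's SequentialFeatureSelector (available in recent versions) on a synthetic regression problem; because the greedy forward path is deterministic here, the 3-variable set should be contained in the 4-variable set:

```python
# Sketch: nesting property of forward stepwise selection (part c.i).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

def forward_subset(k):
    """Indices of the k predictors chosen by greedy forward selection."""
    sfs = SequentialFeatureSelector(LinearRegression(),
                                    n_features_to_select=k,
                                    direction="forward")
    sfs.fit(X, y)
    return set(np.where(sfs.get_support())[0])

s3, s4 = forward_subset(3), forward_subset(4)
print(s3, s4, s3.issubset(s4))  # the 3-variable model should be nested in the 4-variable one
```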



Suppose we estimate the regression coefficients in a linear regression model by minimizing the residual sum of squares subject to a shrinkage penalty with tuning parameter λ (as in ridge regression or the lasso). For each of (a) training RSS, (b) test RSS, (c) variance, (d) (squared) bias, and (e) irreducible error, indicate how the quantity changes as we increase λ from 0.

a

(iii) Steadily increase: As we increase λ from 0, all the βs shrink from their least-squares estimates toward 0. Training RSS is smallest for the full least-squares fit (λ = 0) and steadily increases as the βs are shrunk toward 0.

b

(ii) Decrease initially, and then eventually start increasing in a U shape: When λ = 0, the βs take their least-squares values and the model fits the training data very hard, so test RSS is relatively high. As we increase λ, the βs shrink toward zero and some of the overfitting is removed, so test RSS initially decreases. Eventually, as the βs approach 0, the model becomes too simple and test RSS increases again.
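A rough sketch of (a) and (b), assuming scikit-learn's Ridge (its alpha parameter plays the role of λ here) on a synthetic data set; the exact numbers are arbitrary, but training RSS should rise with λ while test RSS traces the U shape described above:

```python
# Sketch: training RSS rises with lambda; test RSS is roughly U-shaped.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=120, n_features=50, n_informative=10,
                       noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# A near-zero penalty approximates the ordinary least-squares fit (lambda = 0).
for lam in [1e-3, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    model = Ridge(alpha=lam).fit(X_tr, y_tr)
    rss_tr = np.sum((y_tr - model.predict(X_tr)) ** 2)
    rss_te = np.sum((y_te - model.predict(X_te)) ** 2)
    print(f"lambda={lam:>7}: train RSS={rss_tr:10.1f}  test RSS={rss_te:10.1f}")
```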

c

(iv) Steadily decrease: When λ = 0, the βs take their least-squares values, which depend heavily on the particular training data, so variance is high. As we increase λ, the βs shrink and the model becomes simpler, so variance falls. In the limit as λ approaches infinity, all the βs are reduced to zero, the model predicts a constant, and the variance goes to zero.

d

(iii) Steadily increase: When λ = 0, the βs take their least-squares values and the bias is smallest. As λ increases, the βs shrink toward zero, the model fits the training data less closely, and bias increases. In the limit as λ approaches infinity, the model predicts a constant and the bias is at its maximum.
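To see why (c) and (d) behave this way, one can watch the coefficients shrink as λ grows. A sketch under the same Ridge-as-λ assumption as before, on arbitrary synthetic data:

```python
# Sketch: ridge coefficients shrink toward zero as lambda grows, which is why
# variance steadily falls (c) and bias steadily rises (d).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

for lam in [1e-3, 1.0, 100.0, 1e4]:
    coefs = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:>8}: coefficient norm = {np.linalg.norm(coefs):.2f}")
# The coefficient norm decreases toward zero; in the limit the model predicts a constant.
```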

e

(v) Remain constant: By definition, the irreducible error does not depend on the model, and hence remains constant regardless of the choice of λ.
