Which of the following modeling techniques performs Feature Selection?
Linear Discriminant Analysis
Least Squares
Linear Regression with Forward Selection
Support Vector Machines
Linear Regression with Forward Selection correct
We perform best subset and forward stepwise selection on a single dataset. For both approaches, we obtain p+1 models, containing 0,1,2,…,p predictors. Which of the two models with k predictors is guaranteed to have training RSS no larger than the other model?
Best Subset
Forward Stepwise
They always have the same training RSS
Not enough information is given to know
We perform best subset and forward stepwise selection on a single dataset. For both approaches, we obtain p+1 models, containing 0,1,2,…,p predictors. Which of the two models with k predictors has the smallest test RSS?
They always have the same test RSS
We know that Best Subset selection will always have the lowest training RSS (that is how it is defined). That said, we don't know which model will perform better on a test set.
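A minimal simulation (a sketch assuming numpy; the synthetic data and variable names are illustrative) checking this guarantee: for every subset size k, exhaustive best subset attains a training RSS no larger than greedy forward stepwise.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def rss(cols):
    # Least-squares fit (with intercept) on the given predictor columns.
    Xs = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return float(np.sum((y - Xs @ beta) ** 2))

# Best subset: exhaustive search over all size-k subsets.
best = {k: min(rss(list(c)) for c in itertools.combinations(range(p), k))
        for k in range(p + 1)}

# Forward stepwise: greedily add the predictor that most reduces RSS.
chosen, fwd = [], {0: rss([])}
for k in range(1, p + 1):
    nxt = min((c for c in range(p) if c not in chosen),
              key=lambda c: rss(chosen + [c]))
    chosen.append(nxt)
    fwd[k] = rss(chosen)

for k in range(p + 1):
    assert best[k] <= fwd[k] + 1e-9  # best subset never worse in training RSS
```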
You are fitting a linear model to data assumed to have Gaussian errors. The model has up to p=5 predictors and n=100 observations. Which of the following is most likely true of the relationship between Cp and AIC in terms of using the statistic to select a number of predictors to include?
Cp will select a model with more predictors than AIC
Cp will select a model with fewer predictors than AIC
Cp will select the same model as AIC
Not enough information is given to decide
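A sketch of the algebra behind this question, using ISLR's formulas for a linear model with Gaussian errors (d predictors, and σ̂² an estimate of the error variance): Cp and AIC are proportional, so they are minimized by, and therefore select, the same model.

```latex
C_p = \tfrac{1}{n}\bigl(\mathrm{RSS} + 2 d \hat{\sigma}^2\bigr),
\qquad
\mathrm{AIC} = \tfrac{1}{n\hat{\sigma}^2}\bigl(\mathrm{RSS} + 2 d \hat{\sigma}^2\bigr)
= \tfrac{1}{\hat{\sigma}^2}\, C_p
```

Since AIC is Cp divided by the constant σ̂², ranking models by one criterion ranks them identically by the other.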
You are doing a simulation in order to compare the effect of using Cross-Validation or a Validation set. For each iteration of the simulation, you generate new data and then use both Cross-Validation and a Validation set in order to determine the optimal number of predictors. Which of the following is most likely?
The Cross-Validation method will result in a higher variance of optimal number of predictors
The Validation set method will result in a higher variance of optimal number of predictors
Both methods will produce results with the same variance of optimal number of predictors
The Validation set method will result in a higher variance of optimal number of predictors correct
You perform ridge regression on a problem where your third predictor, x3, is measured in dollars. You decide to refit the model after changing x3 to be measured in cents. Which of the following is true?
β^3 and y^ will remain the same.
β^3 will change but y^ will remain the same.
β^3 will remain the same but y^ will change.
β^3 and y^ will both change.
β^3 and y^ will both change. correct
Which of the following is NOT a benefit of the sparsity imposed by the Lasso?
Sparse models are generally easier to interpret
The Lasso does variable selection by default
Using the Lasso penalty helps to decrease the bias of the fits
Using the Lasso penalty helps to decrease the variance of the fits
Using the Lasso penalty helps to decrease the bias of the fits correct
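A tiny illustration of the selection benefit (a sketch assuming scikit-learn is available; the synthetic data are illustrative): as the penalty grows, the lasso drives some coefficients exactly to zero, while ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X[:, 0] * 3 - X[:, 1] * 2 + rng.normal(size=100)

for alpha in (0.01, 0.1, 1.0):
    lasso = Lasso(alpha=alpha).fit(X, y)
    ridge = Ridge(alpha=alpha).fit(X, y)
    # Count exact zeros: lasso zeros out weak coefficients, ridge does not.
    print(alpha, np.sum(lasso.coef_ == 0), np.sum(ridge.coef_ == 0))
```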
Which of the following would be the worst metric to use to select λ in the Lasso?
Cross-Validated error
Validation set error
RSS
We compute the principal components of our p predictor variables. The RSS in a simple linear regression of Y onto the largest principal component will always be no larger than the RSS in a simple regression of Y onto the second largest principal component. True or False?
True
False
You are working on a regression problem with many variables, so you decide to do Principal Components Analysis first and then fit the regression to the first 2 principal components. Which of the following would you expect to happen?
A subset of the features will be selected
Model Bias will decrease relative to the full least squares model
Variance of fitted values will decrease relative to the full least squares model
Model interpretability will improve relative to the full least squares model
While some forms of dimension reduction will cause the first or fourth option to occur, that is not the case with PCA. When using dimension reduction we restrict ourselves to simpler models, so we expect bias to increase and the variance of the fitted values to decrease.
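A toy simulation of the variance claim (a sketch using numpy only; the data, dimensions, and names are illustrative): with X held fixed and the noise redrawn each run, fitted values from a 2-component principal components regression vary less across runs than the full least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, B = 60, 10, 500
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)
beta_true = rng.normal(size=p)

# First two principal component score vectors of X (via SVD).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                      # n x 2 matrix of PC scores

fits_ols, fits_pcr = [], []
for _ in range(B):
    y = X @ beta_true + rng.normal(size=n)
    fits_ols.append(X @ np.linalg.lstsq(X, y, rcond=None)[0])
    fits_pcr.append(Z @ np.linalg.lstsq(Z, y, rcond=None)[0])

# Per-point variance of fitted values across simulations, averaged.
var_ols = np.var(np.array(fits_ols), axis=0).mean()
var_pcr = np.var(np.array(fits_pcr), axis=0).mean()
print(var_ols, var_pcr)  # PCR variance is smaller; its bias is larger
```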
You are analyzing a dataset where each observation is an age, height, length, and width of a particular turtle. You want to know if the data can be well described by fewer than four dimensions (maybe for plotting), so you decide to do Principal Component Analysis. Which of the following is most likely to be the loadings of the first Principal Component?
(1, 1, 1, 1)
(.5, .5, .5, .5)
(.71, -.71, 0, 0)
(1, -1, -1, -1)
Suppose we have a data set where each data point represents a single student's scores on a math test, a physics test, a reading comprehension test, and a vocabulary test.
We find the first two principal components, which capture 90% of the variability in the data, and interpret their loadings. We conclude that the first principal component represents overall academic ability, and the second represents a contrast between quantitative ability and verbal ability.
What loadings would be consistent with that interpretation? Choose all that apply.
(0.5, 0.5, 0.5, 0.5) and (0.71, 0.71, 0, 0)
(0.5, 0.5, 0.5, 0.5) and (0, 0, -0.71, -0.71)
(0.5, 0.5, 0.5, 0.5) and (0.5, 0.5, -0.5, -0.5)
(0.5, 0.5, 0.5, 0.5) and (-0.5, -0.5, 0.5, 0.5)
(0.71, 0.71, 0, 0) and (0, 0, 0.71, 0.71)
(0.71, 0, -0.71, 0) and (0, 0.71, 0, -0.71)
True or False: If we use k-means clustering, we will get the same cluster assignments for each point whether or not we standardize the variables.
True or False: If we cut the dendrogram at a lower point, we will tend to get more clusters (and cannot get fewer clusters).
More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Lasso’s advantage over least squares is rooted in the bias-variance trade-off. When the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias. This consequently can generate more accurate predictions. In addition, lasso performs variable selection which makes it easier to interpret than other methods like ridge regression.
Explanation: Ridge regression and the lasso's advantage over least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. This relationship between λ, variance, and bias is the key to understanding the method. When there is a small change in the training data, the least squares coefficients can change substantially, producing high variance, whereas ridge regression can still perform well by trading a small increase in bias for a large decrease in variance. Hence, between these two methods, ridge regression works best in situations where the least squares estimates have high variance. The big difference between ridge and the lasso is that the lasso performs variable selection, which makes it easier to interpret.
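A small sketch of ridge's closed form (assuming numpy; toy data, illustrative names): as λ grows the fit becomes less flexible and the coefficients shrink toward zero, trading a little bias for a large drop in variance.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
y = X @ np.array([3., 0., -2., 0., 1.]) + rng.normal(size=50)

def ridge(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam I)^{-1} X^T y.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in (0.0, 1.0, 10.0, 100.0):
    # lam = 0 recovers least squares; larger lam shrinks all coefficients.
    print(lam, np.round(ridge(X, y, lam), 2))
```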
ii. More flexible and hence will give improved accuracy when its increase in variance is less than its decrease in bias.
Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
Regression - CEO salary is continuous.
Inference - we are looking to understand the relationship between the predictors on CEO salary.
n = 500, p = 3 (profit, number of employees, industry).
Classification - products are either success or failure.
Prediction - primarily concerned with whether product will succeed or fail.
n = 20, p = 13 (price charged for product, marketing budget, competition price, +10 other variables).
Regression - percentage change in USD/Euro exchange rate over time is continuous.
Prediction - we are seeking to predict % change in USD/Euro exchange rate.
n = 52, p = 3 (% change US, % change British market, % change German market).
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
There is a tradeoff between prediction accuracy and model interpretability. A very flexible approach tends to have greater prediction accuracy, lower bias, and fewer assumptions about the form of the function, and it handles non-linear relationships well. The disadvantages of a very flexible approach are reduced interpretability, a need for more parameters, and increased variance with the attendant risk of overfitting the data.
A flexible approach may be preferred over a less flexible approach when prediction accuracy is the primary concern, when there are a large number of variables, and when the relationship appears to be non-linear in nature.
A less flexible approach may be preferred when we are seeking to prioritize interpretability or inference about our predictor variables’ impact on the response, or when the relationships of the data seem linear. In certain cases, a less flexible approach can have greater predictive power due to the potential of overfitting in more flexible methods.
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?
A parametric statistical learning approach assumes the functional form of f and then fits/trains the model (e.g., regression), whereas a non-parametric approach does not make explicit assumptions about the functional form of f (e.g., thin plate splines) which gives the potential to fit a wider range of model shapes.
The advantages of a parametric approach to regression or classification are that it reduces the problem to estimating a small, fixed set of parameters, and the resulting models are generally more interpretable.
The disadvantages of a parametric approach to regression or classification are that, because assumptions are made about the functional form of f, the chosen form may not match the true unknown form of f, and such models often have less predictive power than non-parametric approaches.
Carefully explain the differences between the KNN classifier and KNN regression methods.
KNN (K-Nearest Neighbors) is a simple, non-parametric method that can be used for both classification and regression. However, there are differences in the way the method is applied and the results it produces in each case.
KNN Classifier: The KNN classifier is a supervised learning algorithm used for classification. In KNN classification, the goal is to predict the class label of a new data point based on the class labels of its nearest neighbors in the training data. The algorithm works by calculating the distances between the new data point and all the points in the training set, then selecting the K nearest neighbors (based on a distance metric such as Euclidean distance), and finally assigning the new point to the class label that is most common among its K nearest neighbors.
KNN Regression: The KNN regression is a supervised learning algorithm used for regression. In KNN regression, the goal is to predict a continuous target value for a new data point based on the values of its nearest neighbors in the training data. The algorithm works similarly to the classification method, but instead of choosing the most common class label, the average target value of the K nearest neighbors is used to make the prediction.
So, in summary, the main difference between KNN classifier and KNN regression is the type of prediction being made. In classification, the goal is to predict a class label based on a majority vote of the K nearest neighbors, whereas in regression the goal is to predict a continuous target value based on the average of the target values of the K nearest neighbors.
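A minimal sketch of both versions (assuming numpy; the function and variable names are illustrative, not from any particular library), sharing the same neighbor search and differing only in the final aggregation step:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k, mode="classify"):
    # Indices of the k nearest training points by Euclidean distance.
    idx = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    if mode == "classify":
        # Classification: majority vote over the neighbors' labels.
        return Counter(y_train[idx].tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' target values.
    return float(y_train[idx].mean())
```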
This problem involves simple linear regression without an intercept.
(a) Recall that the coefficient estimate βˆ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
From (3.38), the no-intercept slope for regressing Y onto X is β̂ = Σᵢ xᵢyᵢ / Σᵢ xᵢ², and for regressing X onto Y it is β̂′ = Σᵢ xᵢyᵢ / Σᵢ yᵢ². The numerators are identical, so the two coefficient estimates are equal exactly when the denominators are equal:
Σᵢ yᵢ² = Σᵢ xᵢ²
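A quick numeric check (a sketch assuming numpy; toy vectors rescaled so the sums of squares match) that the two no-intercept slopes then agree exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = rng.normal(size=100)
y *= np.sqrt((x**2).sum() / (y**2).sum())  # force sum(y^2) == sum(x^2)

b_yx = (x * y).sum() / (x**2).sum()   # slope of Y onto X (no intercept)
b_xy = (x * y).sum() / (y**2).sum()   # slope of X onto Y (no intercept)
print(np.isclose(b_yx, b_xy))          # True
```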
PCA is a technique for...
feature extraction
variance normalisation
dimensionality reduction
data augmentation
When performing PCA we want to:
find the most meaningful basis
estimate the number of dimensions
find orthogonal vectors
find the components of the dataset
Every observation (i.e. a vector with dimensionality m) in the dataset can be represented as:
linear combination of some unit vectors
a set of orthonormal vectors
linear combination of some basis vectors
unit vectors
If p_1 and p_2 are both principal component vectors, which statements about them are correct?
variance along p_1 is bigger than variance along p_2
variance along p_2 is bigger than variance along p_1
p_1 is parallel to p_2
p_1 is orthogonal to p_2
One of the key ideas for solving PCA with eigenvalue decomposition is that a symmetric matrix can be diagonalized by an orthogonal matrix of its eigenvectors.
PCA with SVD is based on the idea that any matrix can be decomposed into a product of an orthogonal matrix, an identity matrix, and another orthogonal matrix.
true
false
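A numeric sketch of both ideas side by side (assuming numpy; toy data) — note that the middle SVD factor is a diagonal matrix of singular values, not the identity:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the (symmetric) covariance matrix.
C = Xc.T @ Xc / (len(Xc) - 1)
evals, evecs = np.linalg.eigh(C)       # C = evecs @ diag(evals) @ evecs.T

# Route 2: SVD of the centered data, X_c = U S V^T (S diagonal).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Same principal directions (up to sign) and the same variances.
print(np.allclose(np.abs(evecs[:, ::-1]), np.abs(Vt.T)))
print(np.allclose(evals[::-1], S**2 / (len(Xc) - 1)))
```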
Which of the following are supervised learning problems? More than one box can be checked.
⃝ Predict whether a website user will click on an ad.
⃝ Find clusters of genes that interact with each other.
⃝ Find stocks that are likely to rise.
⃝ Classify a handwritten digit as 0-9 from labeled examples.
Predict whether a website user will click on an ad.
Classify a handwritten digit as 0-9 from labeled examples.
You’ve just finished training a decision tree for spam classification, and it is getting abnormally bad performance on your test set, but good performance on your training set. Your implementation has no bugs. What could be causing the problem?
⃝ You have too few trees in your ensemble.
⃝ Your bagging implementation is randomly sampling sample points without replacement.
⃝ Your decision trees are too deep.
⃝ You are randomly sampling too many features when you choose a split.
You have too few trees in your ensemble.
LASSO regression relative to Ordinary Least Squares Regression will give
⃝ improved prediction accuracy when its increase in bias is less than its decrease of the irreducible error.
⃝ improved prediction accuracy when its increase in variance is less than its increase in bias.
⃝ improved prediction accuracy when its increase in variance is less than its decrease in bias.
⃝ improved prediction accuracy when its increase in bias is less than its decrease in variance.
Which of the following is NOT a benefit of the sparsity imposed by the Lasso?
⃝ Using the Lasso penalty helps to decrease the bias of the fits.
⃝ Sparse models are generally more easy to interpret.
⃝ Using the Lasso penalty helps to decrease the variance of the fits.
⃝ The Lasso does variable selection by default.
Using the Lasso penalty helps to decrease the bias of the fits.
In neural networks, the ReLu activation function
⃝ saturates for high and low values of the input.
⃝ always outputs values between 0 and 1.
⃝ is only applied to units in the output layer.
⃝ always has a non-negative gradient.
always has a non-negative gradient.
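Why that answer holds, from the standard definition:

```latex
\mathrm{ReLU}(x) = \max(0, x), \qquad
\frac{d}{dx}\,\mathrm{ReLU}(x) =
\begin{cases} 0, & x < 0 \\ 1, & x > 0 \end{cases}
```

The gradient is always 0 or 1, hence non-negative; the output is unbounded above, so it neither saturates for large inputs nor stays within [0, 1], and it is applied to hidden units, not only the output layer.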
Bootstrap aggregation (bagging)
⃝ selects random subsamples of the sample observations with replacement.
⃝ is ineffective with classification.
⃝ reduces the bias relative to the base learner.
⃝ reduces the variance relative to the base learner.
reduces the variance relative to the base learner.
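A minimal bagging sketch (assuming numpy; the hypothetical base_fit/base_predict pair stands in for any base learner) showing the defining steps — resampling with replacement and averaging:

```python
import numpy as np

def bag_predict(X_train, y_train, X_test, base_fit, base_predict, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap: sample WITH replacement
        model = base_fit(X_train[idx], y_train[idx])  # hypothetical base learner
        preds.append(base_predict(model, X_test))
    # Averaging B roughly-decorrelated fits lowers variance, not bias.
    return np.mean(preds, axis=0)
```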
Which of the following tools would be well suited for predicting if a student will get an A in a class based on the student’s height and parents’ income? Select all that apply:
⃝ Linear Discriminant Analysis
⃝ Logistic Regression
⃝ Linear Regression
⃝ Random Guess
A fitted model with more predictors will necessarily have a lower Training Set Error than a model with fewer predictors.
⃝ True
⃝ False
If we use ten-fold cross-validation as a means of model selection, the cross-validation estimate of test error is:
⃝ biased downward.
⃝ potentially any of the answers.
⃝ biased upward.
⃝ unbiased.
potentially any of the answers.
Which of the following is the best example of a Qualitative Variable?
⃝ Color
⃝ Age
⃝ Speed
⃝ Height
Color
Choose the correct answer: suppose you have a dataset that can be fit with 100% training accuracy by a decision tree of depth 8.
Depth 5 will be high variance and low bias
Depth 2 will be low variance and low bias
Depth 6 will be low variance and high bias
Depth 10 will be high variance and high bias
What will happen when you add a new feature to a linear regression model?
increase the r-square value
decrease the r-square value
no-effect
depends on the feature
What is one reason not to use the same data for both your training set and your testing set?
The model will overfit the data.
You will choose the wrong algorithm.
You will not have enough data for both.
The model will underfit the data.
What is an example of a commercial application for a machine learning system?
data entry system
product recommendation system
massive data repository
data warehouse system
What type of machine learning algorithm is suitable for predicting a dependent variable that takes two different values?
Logistic Regression
Linear Regression
Multiple Linear Regression
Polynomial Regression
The entropy of a given dataset is zero. What does this imply?
further splitting is required
no further splitting is required
Need some other information to decide splitting
None of the Mentioned
Consider a confusion matrix of a classifier where True Positives = 61, False Positives = 8, True Negatives = 38, and False Negatives = 5. Which of the following statements is correct?
Accuracy is 81%
Misclassification Rate is 19%
Type-I Error is 8
Type-II Error is 13
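Working the numbers (a quick check in Python; everything follows directly from the four counts): the total is 112 cases, accuracy is 99/112 ≈ 88.4%, the misclassification rate is 13/112 ≈ 11.6%, and the Type-I (false positive) count is 8, so only the Type-I statement matches.

```python
TP, FP, TN, FN = 61, 8, 38, 5
total = TP + FP + TN + FN                 # 112
accuracy = (TP + TN) / total              # 99/112 ~ 0.884
misclassification = (FP + FN) / total     # 13/112 ~ 0.116
type_1 = FP                               # 8  (false positives)
type_2 = FN                               # 5  (false negatives)
print(accuracy, misclassification, type_1, type_2)
```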
Suppose there is a basket filled with fresh fruits, and the task is to group the same type of fruit together. There is no information about the fruits beforehand; it is the first time they are being seen or discovered. What kind of machine learning technique is this?
Reinforcement Learning
Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Which of the following is a disadvantage of decision trees?
Factor analysis
Decision trees are robust to outliers
Decision trees are prone to overfitting
Decision Tree requires less effort for data preparation
Which statement about outliers is true?
outliers should be identified and removed from a dataset.
outliers should be part of the training dataset but should not be present in the test data.
outliers should be part of the test dataset but should not be present in the training data.
The nature of the problem determines how outliers are used.
If a given dataset contains 100 observations, of which 50 belong to class 1 and the other 50 belong to class 2, what will be the entropy of the dataset?
0
1
-1
0.5
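A worked check of this 50/50 case, from the definition of entropy with base-2 logarithms:

```latex
H = -\sum_{i} p_i \log_2 p_i
  = -\Bigl(\tfrac{1}{2}\log_2 \tfrac{1}{2} + \tfrac{1}{2}\log_2 \tfrac{1}{2}\Bigr)
  = 1
```

so a perfectly balanced two-class dataset has the maximum entropy of 1 bit.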
Simple regression is a ............. relationship between 2 or more variables
Linear
Non-Linear
Categorical
Systematical
What is the process of making a system able to learn called?
Training
Testing
Labelling
Classifying
What is the purpose of performing cross-validation?
To assess the predictive performance of the models
To judge how the trained model performs outside the sample on test data
Both A and B
None of the above
A. data mining
B. artificial intelligence
C. big data computing
D. internet of things
A. data mining.
descriptive model
predictive model
reinforcement learning
all of the above
B. predictive model
unsupervised learning
supervised learning
active learning
forward feature selection
backward feature selection
both a and b
none of the above
forward feature selection
A. true
A. scalable
B. accuracy
C. fast
D. all of the above
D
What characterizes unlabeled examples in machine learning?
A. there is no prior knowledge
B. there is no confusing knowledge
C. there is prior knowledge
D. there is plenty of confusing knowledge
What does dimensionality reduction reduce?
A. stochastics
B. collinearity
C. performance
D. entropy
B
Data used to build a data mining model.
A. training data
B. validation data
C. test data
D. hidden data
A
The problem of finding hidden structure in unlabeled data is called…
A. supervised learning
B. unsupervised learning
C. reinforcement learning
D. none of the above
Of the following examples, which would you address using a supervised learning algorithm?
A. given email labeled as spam or not spam, learn a spam filter
B. given a set of news articles found on the web, group them into sets of articles about the same story
C. given a database of customer data, automatically discover market segments and group customers into different market segments
D. find the patterns in market basket analysis
Dimensionality reduction algorithms are one of the possible ways to reduce the computation time required to build a model.
B. false
You are given reviews of a few Netflix series marked as positive, negative, and neutral. Classifying reviews of a new Netflix series is an example of
C. semisupervised learning
D. reinforcement learning
Which of the following is a good test dataset characteristic?
A. large enough to yield meaningful results
B. is representative of the dataset as a whole
C. both a and b
C
The following are types of supervised learning:
A. classification
B. regression
C. subgroup discovery
A matrix decomposition model is a type of:
A. descriptive model
C. logical model
The following are powerful distance metrics used by geometric models:
A. euclidean distance
B. manhattan distance
C. both a and b
D. square distance
The output of the training process in machine learning is
A. machine learning model
B. machine learning algorithm
C. null
D. accuracy
A feature F1 can take the values A, B, C, D, E, and F, and represents the grades of students from a college. Here the feature type is
A. nominal
B. ordinal
C. categorical
D. boolean
PCA is
A. forward feature selection
B. backward feature selection
C. feature extraction
Which of the following techniques would perform better for reducing the dimensions of a data set?
A. removing columns which have too many missing values
B. removing columns which have high variance in data
C. removing columns with dissimilar data trends
D. none of these
Supervised learning and unsupervised clustering both require which of the following?
A. output attribute.
B. hidden attribute.
C. input attribute.
D. categorical attribute
What characterizes a hyperplane in the geometric model of machine learning?
A. a plane with one dimension fewer than the number of input attributes
B. a plane with two dimensions fewer than the number of input attributes
C. a plane with one dimension more than the number of input attributes
D. a plane with two dimensions more than the number of input attributes
a) Should not set it to zero since otherwise it will cause overfitting
b) Should not set it to zero since otherwise (stochastic) gradient descent will explore a very small space
c) Should set it to zero since otherwise it causes a bias
d) Should set it to zero in order to preserve symmetry across all neurons
b
a) The number of hidden nodes
b) The learning rate
c) The initial choice of weights
d) The use of a constant-term unit input
a
You've just finished training a decision tree for spam classification, and it is getting abnormally bad performance on both your training and test sets. You know that your implementation has no bugs, so what could be causing the problem?
a) Your decision trees are too shallow.
b) You need to increase the learning rate.
c) You are overfitting.
d) None of the above.
___________ refers to a model that can neither model the training data nor generalize to new data.
a) good fitting
b) overfitting
c) underfitting
d) all of the above
c
When we fit a model to data, which is typically larger?
[] Test Error
[] Training Error
Test Error
What are reasons why test error could be LESS than training error?
[] By chance, the test set has easier cases than the training set.
[] The model is highly complex, so training error systematically overestimates test error
[] The model is not very complex, so training
By chance, the test set has easier cases than the training set.
Suppose that we perform forward stepwise regression and use cross-validation to choose the best model size. Using the full data set to choose the sequence of models is the WRONG way to do cross-validation (we need to redo the model selection step within each training fold). If we do cross-validation the WRONG way, which of the following is true?
[] The selected model will probably be too complex
[] The selected model will probably be too simple
The selected model will probably be too complex correct
Why can't we use the standard bootstrap for some time series data?
[] The data points in most time series aren't i.i.d.
[] Some points will be used twice in the same sample
[] The standard bootstrap doesn't accurately mimic the real-world data-generating mechanism
The data points in most time series aren't i.i.d.
The standard bootstrap doesn't accurately mimic the real-world data-generating mechanism
In the expression Sales ≈ f(TV, Radio, Newspaper), "Sales" is the:
Response
Training Data
Independent Variable
Feature
Identify the type of learning in which labeled training data is used.
Semi-supervised
Supervised
Reinforcement
Unsupervised
Identify whether true or false: In PCA the number of input dimensions is equal to the number of principal components.
Which of the following machine learning algorithms is based upon the idea of bagging?