Übungen und R Knowledge

Buffl

Data Science

by Felix S.

Aapply

Ü4 Task 1

“Make a list with the variables a = 1, b = 2:5 und c = 5:50. Apply the sum()

function on each element of the list.”

# TASK 1

# Create the requested list with the variables a, b and c.

list_apply <- list(a = 1, b = 2:5, c = 5:50)

# Apply the sum function:

# There are two possible ways: Input is list -> sapply (Output: array) OR lapply (list).

# Way 1: Input: list -> Output: array

output_sapply <- sapply(list_apply, FUN = sum)

# Alternative Way 1: Since FUN is already in the second place according to the function

# definition in the documentation, FUN does not have to be specified explicitly again.

# REASON: If no parameters are explicitly specified in the function, R automatically assumes

# the order according to the definition in the documentation.

output_sapply <- sapply(list_apply, sum) # FUN is placed on

####

# Way 2: Input: list -> Output: list

output_lapply <- lapply(list_apply, FUN = sum)

# Alternative Way 2:

output_lapply <- lapply(list_apply, sum)

Was ist der Unterschied zwischen [,1] und [1,]

[1,]:
- Extrahiert die gesamte erste Zeile eines Data Frame oder einer Matrix.
- Beispiel: df[1,] gibt die erste Zeile von df zurück.
[,1]:
- Extrahiert die gesamte erste Spalte eines Data Frame oder einer Matrix.
- Beispiel: df[,1] gibt die erste Spalte von df zurück.

Split Aapply

Ü4 Task 2

We are interested in the differences of our male and female consumer. Please compute

the rank of the foot length for each consumer according to his or her gender. You also

have to plot the different subsets. (Hint: A scatterplot is appropriate for this purpose)

# TASK 2

# Import "footsize.csv" for this task:

footsize <- read.csv("footsize.csv")

### Split - Apply - Combine

# split: Split the dataset by the available genders (binary: f / m)

pieces <- split(footsize, footsize$gender)

# Just nice to know:

pieces["f"]

pieces["m"]

## apply:

# Declare a temporary list. This list will be filled with the results later on.

# Produce a vector of the given length (-> length(pieces)) and mode (-> "list")

# Length is 2; pieces.transformed is a list (as defined by mode: see "?vector" for more info)

pieces.transformed <- vector(mode="list", length=length(pieces))

for(i in 1:length(pieces)){ # Start of the function: Iterate over the two sets male and female

# Get the element i from the list "pieces" which is also

# a list containing information about a specific gender i.

# Assign the specific gender list to the variable "piece".

piece <- pieces[[i]]

# Plot a scatterplot by using the length as x-axis and footsize as y-axis:

plot(piece$length, piece$footsize) # plot each set

# Calculate the Rank and add (or "bind") it as a new column by using cbind().

# Translated into lists data-type, it means adding a new list of the same length

# as all the other lists in the list. Use the ties-method "first":

# For more information about ranks: https://en.wikipedia.org/wiki/Ranking

# For more information about rank()-function: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rank

# Assign the result to the variable "piece".

piece <- cbind(piece, rank = rank(piece$length, ties.method = "first"))

# Add the new information to index i of the list "pieces.transformed"

pieces.transformed[[i]] <- piece

} # End of the function

# combine the results to a new dataframe by binding the lists by rows

# and assign it to the variable "pieces.preprocessed"

pieces.preprocessed <- do.call("rbind", pieces.transformed)

Create a plot

Ü4 Task 3

“Plot the variables length and footsize from the footsize.csv dataset.

Use this symbol ( ) for plotting and draw the points in blue. (Hint: shape = 7)“

# TASK 3

### Using ggplot

library(ggplot)

footsize <- read.csv("footsize.csv")

ggplot(footsize, aes(x = length, y = footsize)) + geom_point(shape = 7, col = "blue")

# Additional (but not requested: Make points larger @size):

ggplot(footsize, aes(x = length, y = footsize)) + geom_point(shape = 7, col = "blue", size = 2.5)

Create a plot

Ü4 Task 4

“What happens if you specify a variable name in the arguments color or shape?

Can you find a way to show the three variables gender, length, and footsize

from the footsize.csv dataset in one ggplot object?”

# TASK 4

library(ggplot)

footsize <- read.csv("footsize.csv")

ggplot(footsize, aes(x = length, y = footsize, col = gender)) + geom_point()

# If you specify a variable name in the arguments color or shape, R is going to

# set the color or shape by the corresponding variable. It is going to use the

# shape / color to seperate by categories / different unique values in the variable.

# or another possibility: Differentiating by shape

ggplot(footsize, aes(x = length, y = footsize, shape = gender, size=3)) + geom_point()

# or another possibility: Differentiating by shape and color

ggplot(footsize, aes(x = length, y = footsize, shape = gender, size=3)) + geom_point()

Create a plot

Ü4 Task 5

Please reload the “students.csv”. Use the plyr package to calculate the

median age for each “survey_time”. Save the result in a list.

Create the following plot with the ggplot2 package (the point size is 2):

# TASK 5

# Compound tasks

library(plyr)

library(ggplot)

# Import dataset "students.csv"

students <- read.csv("students.csv")

# Input: dataframe -> Output: list => dlply

stud.list <- dlply(students, "survey_time", transform, m.age = median(age))

ggplot(students, aes(x = shoe_size, y = size, color = gender, shape = survey_time))+

geom_point(size=2) +

scale_color_manual(values=c("blue","red"))+ # Specify your own set of colors (we know there are only two levels: female (w) and male (m) -> so we need to set two colors

scale_shape_manual(values=c(3:6)) # Specify your own set of shapes (we know there are 4 levels: SuSe18, SuSe 19, WiSe1819, WiSe1920 -> so we need to set four shapes (task choice: 3,4,5,6)

Conjoint Analysis

Ü5 Task 2

“Please load the tea data set and get an overview of the data. Answer the following questions:

2.1 Which tea profile has the highest average rating? What kind of tea is that?

2.2 Draw a boxplot of all ratings to see how ratings are distributed generally.

Extra:

Now install the package conjoint and run a traditional conjoint analysis on the tea dataset for the first respondent as shown on the last slide. Try to explain what the estimates imply. Can you find out whether black, green or red tea is preferred by this participant?“

# Exercise 5 Task 2

# 2.1

library(conjoint)

library(ggplot2)

data(tea)

##2.1##

# step1

sapply(tprefm, mean) #mean from every tprefm

# step2

max(sapply(tprefm, mean)) #highest mean

# step3

which.max(max(sapply(tprefm, mean))) #name of the highest one

tprof[1,]

## 2.2 ##

ggplot(tpref) + geom_boxplot(aes(y=tpref[,1]))

# nice-to-know

mean(as.numeric(tprefm[1,])) # mean of the first respondent

extra -> use

library(conjoint)

caModel(y=tprefm[1,], x=tprof)

Conjoint Analysis

Ü6 Task 1

“Please run a complete conjoint analysis and try to estimate the average importance of the factors for the tea dataset. Which attribute is the “most important” one?“

#z = matrix of level names

Conjoint(y=tpref, x=tprof, z=tlevn)

head(data.cbc)

K means Clustering

Ü8 Business Case 1

“The Junglivet Whiskey Company is interested in the geographical and spatial distribution of the Whisky market in Scottland. They plan to put the new 10 years Junglivet on the market. He is very body and smoky. For that purpose, cluster the whiskies and find the most body and smoky group of whiskies. And plot them with ggplot on a map of scottland to support the marketing and sales team.Try to structure your work according the process modell from the lecture. Before you build the cluster, try to understand the data. Please take a look if it is true that the most smoky and body Whisky destilleries are from Islay. Use the whiskies.csv file for your analysis.“

# Exercise 8 Task 1

whiskies <- read.csv("whiskies.csv", row.names=1, stringsAsFactors=FALSE)

summary(whiskies)

View(whiskies)

# we need to rescale before applying kmeans

str(whiskies)

#Data Understanding

library(ggplot2)

ggplot(whiskies, aes(Body)) + geom_bar()

ggplot(whiskies, aes(Smoky)) + geom_bar()

# Preparation:

sum(is.na(whiskies)) # no missing values

#example (has nothing to do with exercise)

fit <- kmeans(scale(whiskies[,1:5]), 3)

# preparation for kmeans

whiskies_k <- scale(whiskies[,1:12]) # rescale (prerequisite) for kmeans

summary(whiskies_k)

ssplot <- function(data, maxCluster=9){

SSw <- vector()

for (i in 2:maxCluster) {

set.seed(675)

SSw[i] <- sum(kmeans(data, centers=i)$withinss)

}

plot(1:maxCluster, SSw, type="b", xlab="Number of Clusters", ylab="Within groups of sum of squares")

}

ssplot(whiskies_k)

whiskies <- read.csv("whiskies.csv", row.names=1, stringsAsFactors=FALSE)

whiskies_k <- scale(whiskies[,1:12])

set.seed(675)

fit <- kmeans(whiskies_k, centers=5)

whiskies <- data.frame(whiskies, fit$cluster)

fit$centers

taste.centers <- as.data.frame(fit$centers)

ggplot(taste.centers , aes(x = Body, y = Smoky, label=rownames(taste.centers))) +

geom_point() +

geom_label(size=10)

# ending 06/06/2024

whiskies$fit.cluster==3

body_smoky_whiskies <- subset(whiskies, fit.cluster == 3)[,1:13]

body_smoky_whiskies

whiskies_r <- whiskies[c(1:12, 16)]

candidates <- by(whiskies_r[-13], whiskies_r[13], function(data){

# go through each subset describing a cluster

# calculate the sum of squared deviations.

dists <- sapply(data, function(x) (x - mean(x))^2)

dists <- rowSums(dists)

# return the names of the whiskey distillery most representative of the

# cluster in question (for which the distance from the mean is the smallest)

rownames(data)[dists == min(dists)]

})

whiskies[row.names(whiskies) %in% unlist(candidates), c(1:12,16)]

# nice-to-know (ggmap not relevant for exam!)

library(ggmap)

library(tmaptools)

whiskyMap <- ggmap(get_stamenmap(rbind(as.numeric(paste(geocode_OSM("Scotland")$bbox))), zoom = 6), darken = 0.5)

whiskyMap + geom_point(data = whiskies,

aes(x = lon,

y = lat,

colour = fit.cluster,

size = 3))

# Islay

whiskyMap <- ggmap(get_stamenmap(rbind(as.numeric(paste(geocode_OSM("Islay")$bbox))), zoom = 10), darken = 0.7)

location.mat <- matrix(c(xmin=-6.6, ymin=55.55, xmax=-5.9, ymax=55.95), nrow=1)

whiskyMap <- ggmap(get_stamenmap(location.mat, zoom = 10), darken = 0.7)

whiskyMap + geom_text(data = whiskies,

aes(x = lon,

y = lat,

label = rownames(whiskies),

color = fit.cluster,

size=4))

K means Clustering

Ü8 Business Case 2

“After your excellent analytical market report about the geographical distribution of the whisky market in Scotland, the marketing team is interested in knowledge about the

consumers. For that purpose, they gathered the purchase history for 21 brands of whisky over a one year time period from 2218 consumers.

You will find the relevant information in the “Scotch” dataset from the package

“bayesm”. Install the package and load the dataset in R.

1. Use your data exploration skill to gather all relevant information about the dataset and the distribution of the preferred whisky brands: View the dataset. What is shown in rows and columns? What do the entries stand for?

2. How many consumers bought more than 18 different whiskies per year? What is the most popular whisky brand and how many consumers bought it?

3. Plot the number of consumers who bought each whisky brand“

# Task 2

library(bayesm)

data(Scotch)

# Whisky The package bayesm includes the dataset Scotch, which reports which

# brands of whisky 2218 respondents consumed in the previous year.

# row = consumers

# columns: whisky brands

# entries: 1 = consumer bought this brand (=1), else(=0)

# How many consumers bought more than 18 different whiskies per year?

sum(rowSums(Scotch) > 18)

# What is the most popular whisky brand and how many consumers bought it?

which.max(colSums(Scotch))

#Plot the number of consumers who bought each whisky brand

number.whiskies <- colSums(Scotch)

plot(number.whiskies)

K means Clustering (relevant??)

Ü8 Business Case 3

“After your excellent analytical market report about the geographical distribution of the whisky market in Scotland, the marketing team is interested in knowledge about the

consumers. For that purpose they gathered the purchase history for 21 brands of whisky over a one year time period from 2218 consumers.

You will find the relevant information in the “Scotch” dataset from the package “bayesm”.

Install the package and load the dataset in R.

1. Build a market segmentation model with k-means:

▪ Your market team is wondering how many clusters there are. Find a suitable

parameter k using AIC and conduct a k-means clustering

2. Find out the sizes of the clusters

3. Create a matrix with the cluster centers. What do the cluster centers stand for?

4. Have a look specifically at cluster 1. Which whisky brand was consumed the most?

5. Try to make a few statements about your clusters.“

# Task 3

# wss + 2 * k * d

library(plyr)

d <- ncol(Scotch)

wss <- aaply(2:30, 1, function(k) {

set.seed(174)

fit <- kmeans(Scotch, k)

fit$tot.withinss

})

aic <- aaply(1:29, 1, function(k){

wss.k <- wss[k]

wss.k + 2*k*d

})

plot(aic, type="b", ylab="AIC", xlab="Number of Clusters")

# size of clusters (2)

set.seed(174)

cl <- kmeans(Scotch, 5)

cl$size

#** Task 3.3.

# Add cluster assignments to the scotch dataset as a new column named "cluster"

Scotch$cluster <- cl$cluster

cl.eval <- function(x){

# Create a matrix containing the cluster centers.

# For this, the required matrix dimension is predictable:

# The number of rows must match that of the cluster column

# in the data frame and the number of columns must match

# that of the scotch data frame.

clMeans <- matrix(nrow=max(x$cluster), ncol=ncol(x))

# Take over the column names of the data set for the

# column names of the matrix

colnames(clMeans) <- colnames(x)

# Create an

# Create an empty vector that later contains the

# percentages of the most consumed whisky per cluster.

clMax <- vector()

# Create an empty vector that will later contain

# the name of the most consumed whisky per cluster.

clWhichMax <- vector()

# Go through each cluster

for(i in 1:max(x$cluster)){

# Add all rows of the respective cluster in the

# iteration to the new variable clx.

clx <- x[x$cluster==i,]

# We want to inherit all columns except the column

# with the cluster assignments "Cluster" from clx.

# For this we locate the column location and remove

# it from the new dataframe "clx".

del <- which( colnames(clx)=="cluster" )

clx <- clx[,-del]

# Now we start to fill out the columns of the matrix:

# 1.) add the cluster means column-wise.

# the last row of each column represents the (nth) cluster number

clMeans[i,] <- c(as.vector( colMeans(clx) ),i)

# 2.) Save the buying rate (percentage) of most consumed whisky per cluster

clMax[i] <- max(colMeans(clx))

# 3.) Save the name of the most consumed whisky per cluster

clWhichMax[i] <- names( which.max(colMeans(clx)) )

}

# Return the list with our results saved in three different entries.

return(list(t(clMeans),clMax,clWhichMax))

}

# run the evaluation

results <- cl.eval(Scotch)

results

results[[1]]

results[[2]]

results[[3]]

HC Clustering

Ü9 Task 1

“the marketing team is interested in knowledge about the consumers. For that purpose they gathered the purchase history for 21 brands of whisky over a one year time period from 2218 consumers

“Use hierarchical clustering to cluster the whisky consumers, plot a dendogram, cut the clustering, evaluate”

# Task 1

library(bayesm)

data(Scotch)

# Hierarchical clustering

set.seed(174)

clusters <- hclust(dist(Scotch))

plot(clusters)

abline(h=3.9, col="green", lty="dashed")

clusterCut <- cutree(clusters, 5)

table(clusterCut) # Determine cluster sizes

Scotch$cluster <- clusterCut

cl.eval <- function(x){

# Create a matrix containing the cluster centers.

# For this, the required matrix dimension is predictable:

# The number of rows must match that of the cluster column

# in the data frame and the number of columns must match

# that of the scotch data frame.

clMeans <- matrix(nrow=max(x$cluster), ncol=ncol(x))

# Take over the column names of the data set for the

# column names of the matrix

colnames(clMeans) <- colnames(x)

# Create an

# Create an empty vector that later contains the

# percentages of the most consumed whisky per cluster.

clMax <- vector()

# Create an empty vector that will later contain

# the name of the most consumed whisky per cluster.

clWhichMax <- vector()

# Go through each cluster

for(i in 1:max(x$cluster)){

# Add all rows of the respective cluster in the

# iteration to the new variable clx.

clx <- x[x$cluster==i,]

# We want to inherit all columns except the column

# with the cluster assignments "Cluster" from clx.

# For this we locate the column location and remove

# it from the new dataframe "clx".

del <- which( colnames(clx)=="cluster" )

clx <- clx[,-del]

# Now we start to fill out the columns of the matrix:

# 1.) add the cluster means column-wise.

# the last row of each column represents the (nth) cluster number

clMeans[i,] <- c(as.vector( colMeans(clx) ),i)

# 2.) Save the buying rate (percentage) of most consumed whisky per cluster

clMax[i] <- max(colMeans(clx))

# 3.) Save the name of the most consumed whisky per cluster

clWhichMax[i] <- names( which.max(colMeans(clx)) )

}

# Return the list with our results saved in three different entries.

return(list(t(clMeans),clMax,clWhichMax))

}

cl.eval(Scotch)

Association Rule

Ü9 Task 2

“Load the Scotch dataset again:

▪ Create a submatrix of the Scotch dataset which only consists of the whisky brands Chivas.Regal and Dewar.s.White.Label

▪ We consider an association rule in which we treat Chivas.Regal as LHS which implies the buying behavior of Dewar.s.White.Label (RHS)

▪ Calculate the support (percentage of customers who bought both whiskies)

▪ Calculate the confidence of this association rule“

##### Association Rules

# Task 2

library(bayesm)

data(Scotch)

# For general info: The two scotches "Chivas.Regal" and "Dewar.s.White.Label" are bought the most.

colSums(Scotch)

scotch.ar <- Scotch[,1:2]

rs <- rowSums(scotch.ar)

sum(rs==2) / nrow(scotch.ar) # Support = 0.0825

sub <- scotch.ar[scotch.ar[,1]==1,]

mean(sub[,2]) # confidence = 0.2270

# slides

library(arules)

library(arulesViz)

shopping.data <- read.transactions("scottish-supermarket.csv", format = "basket", sep=",")

itemFrequencyPlot(shopping.data, topN=20, type="absolute")

summary(shopping.data) # nice-to-know

rules <- apriori(data=shopping.data, parameter=list(supp=0.001, conf=0.001),

appearance = list(default="lhs", rhs="whisky"),

control=list(verbose=F))

inspect(rules[1:5])

rules <- sort(rules, decreasing=T, by="confidence")

inspect(rules[1:5])

plot(rules[1:5], method="graph")

Different Methods on Classification

Ü10
- CART
- C5.0
- RF

“use Dataset Iris, build classification model, use it to classify, evaluate it“

CART

library(rpart)

fit <- rpart(Species~., data=iris) # build

predictions <- predict(fit, iris[,1:4], type="class") # classify

table(predictions, iris$Species) # evaluate

C5.0

library(C50)

fit <- C5.0(Species~., data=iris) # build

summary(fit)

predictions <- predict(fit, iris[,1:4]) # classify

table(predictions, iris$Species) # evaluate

library(randomForest)

fit <- randomForest(Species~., data=iris)

predictions <- predict(fit, iris[,1:4])

table(predictions, iris$Species)

Random Forest

Ü10 Task 1

“In the StudIP folder you will find the whisky dataset scotch.csv. Load the data into your R workspace and delete the first column (with the names) or take it as row names. Then build a classification tree with the Random Forest algorithm to predict the region of the whiskies

How many predictions were correct?“

# Task 1

# Tasks

library(randomForest)

scotchs <- read.csv("scotch.csv", row.names = 1, header = T, sep=",")

fit <- randomForest(region ~ ., data = scotchs) # build

predictions <- predict(fit, scotchs[,-72]) # classify

table(predictions, scotchs$region) # evaluate

# All 109 classifications were correct.

C5.0

Ü10 Task 2

“Please reload the footsize.csv in your R workspace. We want to build a classification tree for the gender with a training and a test data set

1. Please create a test data set with the first 33 observations and take the remaining 67 observations as training data set

2. Build a classification tree with the C5.0 algorithm with the training data set and predict the gender of the observations in the test data set.

3. Use the error matrix from table() to calculate the error rate for the test data set. (Hint: the summary only shows the error rate for the training data set)

4. Repeat steps 1-3, but this time use every third observation as test data set and the other observations as training data set.

5. What could be the reason that the error rate in the classification tree of step 4 is lower than in the first classification tree.

6. Show that on average males are taller than females in the footsize data set. Then have a look how length is used in the decision tree of step 4. What is odd? What could be the reason for the unintuitive classification rule?“

# Task 2

# Footsize dataset

library(C50)

foot <- read.csv("footsize.csv")

# 1. Build train and test data set

foot.train <- foot[-(1:33),]

foot.test <- foot[1:33,]

# 2. Build the tree to predict gender

fit <- C5.0(gender ~ ., data = foot.train) # build

summary(fit)

# 3. Calculate the error rate

predictions <- predict(fit, foot.test[,-1]) # classify (filter out first column)

treetab <- table(predictions, foot.test[,1]) # evaluate

treetab

sum(treetab[1,2], treetab[2,1]) / sum(treetab)

# Error rate 15.15 %

# An error rate below 10% would have been great.

# This may be due to the unbalanced training and test data set.

# We have 50 f and 17 m (table(foot.train$gender)) // 0 f and 33 m (table(foot.test$gender))

table(foot.train$gender)

table(foot.test$gender)

# 4. Change test and training data set

# Task citation:

# „but this time use every third observation

# as test data set and the other observations

# as training data set.“

# Think about the first tutorials here.

# If we pass a vector which is smaller in

# length than another vector/list, then this

# vector is applied cyclically! And exactly

# this cyclic component we make ourselves of use!

# Every third observation is used I the test data

# set and the rest as train data set. By this we

# try to balance the data set.

foot.train <- foot[c(T,T,F),]

foot.test <- foot[c(F,F,T),]

fit <- C5.0(gender ~ ., data = foot.train)

summary(fit)

predictions <- predict(fit, foot.test[,-1])

treetab <- table(predictions, foot.test[,1])

sum(treetab[1,2], treetab[2,1]) / sum(treetab)

# Error rate 9.09 %

# 5.

# The first split in training and test data was not

# representative, because we only have males in the test data.

# The second time we used a balanced partition, therefore

# we get a better classification tree.

# 6. tapply: create group summaries based on factor levels.

tapply(foot$length, foot$gender, mean) # f=172.7049 m=186.7503

summary(fit)

# Between a footsize of 40.23 and 42.48, the classification

# tree detects shorter persons as male and taller persons

# as female, but the mean values suggest the opposite.

# Reason: Men are generally slightly larger than women in

# the data set. Tall men are therefore most likely to be

# in the branch with footsizes over 42.48. Small women,

# on the other hand, are very likely to be in the branch

# with footsizes below 40.23. In the area in between there

# are mainly tall women and small men. The query is

# therefore reversed here.

SVM

Ü11
Requirements
- library("e1071")
  library(GGally)
  library(ggplot2)
  library(caret)
Task:
- Use Iris Dataset
- get in touch with the structure of the variables
- plot a ggpair
- Tune the parameter (Training / Testing)
- Modelling parameter
- fetch the best svm model
- predict species
- which kernel perform best

structure and plot overview

str(iris)

head(iris)

ggpairs(iris, aes(color = Species, alpha = 0.4))

Tune Parameter:

set.seed(42)

#Create balanced splits of the data. The y argument is a #factor, so the random sampling occurs within each class #and should preserve the overall class distribution of the #data.

train_val_limiter <-CreateDataPartition(iris$Species,p=0.8,times=1,list=FALSE)

# Locate training dataset by indexing

train_validation <- iris[train_val_limiter,]

# Locate test dataset by indexing

test <- iris[- train_val_limiter,]

modelling parameter

tmodel = tune(svm, Species~.,

data= train_validation,

ranges=list(kernel=c("linear","polynomial","radial"),

cost = 10^(-2:2))

)

summary(tmodel)

fetch best svm model

mymodel = tmodel$best.model

summary(mymodel)

mymodel$tot.nSV # Result: 11

svm_fitted_model <- svm(Species ~ ., data=train, kernel="linear", cost= 100)

plot(x, data, formula, slice = list(), color.palette = cm.colors)

plot(svm_fitted_model, data=train_validation, Petal.Width~Petal.Length,

slice = list(Sepal.Width=3, Sepal.Length=4),

color.palette = cm.colors)

predict species

predict_species = predict(svm_fitted_model, test)

confusion_matrix = table(Predicted = predict_species, Actual = test$Species)

error_rate = 1-sum(diag(confusion_matrix)/sum(confusion_matrix))

which kernel perform best

#linear

# Error rate: 0.06666667 # 6.67% / 11 Support Vectors / (2; 3; 6)

#polynomial

svm_fitted_model <- svm(Species ~ ., data=test, kernel="polynomial",

cost= 10)

# Error rate: 0.10000000 # 10.00% / 21 Support Vectors / (1; 11; 9)

#Radial

svm_fitted_model <- svm(Species ~ ., data=test, kernel="radial",

cost= 1)

# Error rate: 0.13333333 # 13.33% / 44 Support Vectors / (6; 19; 19)

Join Course

Preview

Author

Felix S.

Information

Last changed
a year ago

Report course