
Exercises and R Knowledge

by Felix S.

Split-Apply-Combine

  • Ü4 Task 2

We are interested in the differences between our male and female consumers. Please compute

the rank of the foot length for each consumer according to his or her gender. You also

have to plot the different subsets. (Hint: A scatterplot is appropriate for this purpose)

# TASK 2

# Import "footsize.csv" for this task:

footsize <- read.csv("footsize.csv")

### Split - Apply - Combine

# split: Split the dataset by the available genders (binary: f / m)

pieces <- split(footsize, footsize$gender)

# Just nice to know:

pieces["f"]

pieces["m"]

## apply:

# Declare a temporary list. This list will be filled with the results later on.

# Produce a vector of the given length (-> length(pieces)) and mode (-> "list")

# Length is 2; pieces.transformed is a list (as defined by mode: see "?vector" for more info)

pieces.transformed <- vector(mode="list", length=length(pieces))

# Iterate over the two subsets (female and male):
for (i in 1:length(pieces)) {

  # Get element i from the list "pieces": a data frame containing
  # the observations of one specific gender.
  # Assign it to the variable "piece".
  piece <- pieces[[i]]

  # Plot a scatterplot with length on the x-axis and footsize on the y-axis:
  plot(piece$length, piece$footsize) # plot each subset

  # Calculate the rank and add ("bind") it as a new column using cbind().
  # In list terms (a data frame is a list of columns), this means adding a
  # new column of the same length as all the other columns. Use the
  # ties method "first".
  # For more information about ranks: https://en.wikipedia.org/wiki/Ranking
  # For more information about the rank() function: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rank
  # Assign the result to the variable "piece".
  piece <- cbind(piece, rank = rank(piece$length, ties.method = "first"))

  # Add the new information at index i of the list "pieces.transformed":
  pieces.transformed[[i]] <- piece

} # end of the loop

# Combine the results into a new data frame by binding the pieces row-wise

# and assign it to the variable "pieces.preprocessed":

pieces.preprocessed <- do.call("rbind", pieces.transformed)
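
# For comparison (not part of the exercise solution): the same
# split-apply-combine pipeline can be written more compactly with lapply().
# Note that do.call("rbind", ...) on the named list returned by split()
# prefixes the row names with the gender.
pieces.transformed2 <- lapply(split(footsize, footsize$gender), function(piece) {
  cbind(piece, rank = rank(piece$length, ties.method = "first"))
})
pieces.preprocessed2 <- do.call("rbind", pieces.transformed2)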

K-means Clustering

  • Ü8 Business Case 1

“The Junglivet Whiskey Company is interested in the geographical and spatial distribution of the whisky market in Scotland. They plan to put the new 10-year Junglivet on the market. It is very full-bodied and smoky. For that purpose, cluster the whiskies and find the most full-bodied and smoky group of whiskies, and plot them with ggplot on a map of Scotland to support the marketing and sales team. Try to structure your work according to the process model from the lecture. Before you build the clusters, try to understand the data. Please check whether it is true that the most smoky and full-bodied whisky distilleries are from Islay. Use the whiskies.csv file for your analysis.“


# Exercise 8 Task 1

whiskies <- read.csv("whiskies.csv", row.names=1, stringsAsFactors=FALSE)

summary(whiskies)

View(whiskies)

# we need to rescale before applying kmeans

str(whiskies)

# Data Understanding

library(ggplot2)

ggplot(whiskies, aes(Body)) + geom_bar()

ggplot(whiskies, aes(Smoky)) + geom_bar()

# Preparation:

sum(is.na(whiskies)) # no missing values

# Example only (not related to the exercise): k-means with 3 centers on the first five columns

fit <- kmeans(scale(whiskies[,1:5]), 3)

# preparation for kmeans

whiskies_k <- scale(whiskies[,1:12]) # rescale (prerequisite) for kmeans

summary(whiskies_k)
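
# Sanity check of what scale() did (z-standardization): column means
# should now be approximately 0 and standard deviations exactly 1.
round(colMeans(whiskies_k), 10) # ~0 after centering
apply(whiskies_k, 2, sd)        # 1 after scaling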

# Elbow plot: within-cluster sum of squares for k = 2 .. maxCluster
ssplot <- function(data, maxCluster = 9) {
  SSw <- vector()
  for (i in 2:maxCluster) {
    set.seed(675)
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "b",
       xlab = "Number of Clusters", ylab = "Within-groups sum of squares")
}

ssplot(whiskies_k)

# Re-read the raw data so that the following steps start from a clean state:
whiskies <- read.csv("whiskies.csv", row.names=1, stringsAsFactors=FALSE)

whiskies_k <- scale(whiskies[,1:12])

set.seed(675)

fit <- kmeans(whiskies_k, centers=5)

# Append the cluster assignments as a new column named "fit.cluster":
whiskies <- data.frame(whiskies, fit$cluster)

fit$centers

taste.centers <- as.data.frame(fit$centers)

ggplot(taste.centers, aes(x = Body, y = Smoky, label = rownames(taste.centers))) +
  geom_point() +
  geom_label(size = 10)


# With set.seed(675), cluster 3 turns out to be the most full-bodied and smoky group:
whiskies$fit.cluster == 3

body_smoky_whiskies <- subset(whiskies, fit.cluster == 3)[, 1:13]

body_smoky_whiskies

# Keep the 12 taste columns plus the cluster assignment (column 16):
whiskies_r <- whiskies[c(1:12, 16)]

candidates <- by(whiskies_r[-13], whiskies_r[13], function(data) {
  # Go through each subset describing a cluster and
  # calculate the sum of squared deviations from the cluster mean:
  dists <- sapply(data, function(x) (x - mean(x))^2)
  dists <- rowSums(dists)
  # Return the name of the whisky distillery most representative of the
  # cluster in question (smallest distance from the cluster mean):
  rownames(data)[dists == min(dists)]
})

whiskies[row.names(whiskies) %in% unlist(candidates), c(1:12,16)]

# Nice to know (ggmap is not relevant for the exam!)
# Note: Stamen map tiles have since moved to Stadia Maps, so get_stamenmap()
# may no longer work with current ggmap versions.

library(ggmap)

library(tmaptools)

whiskyMap <- ggmap(get_stamenmap(rbind(as.numeric(paste(geocode_OSM("Scotland")$bbox))), zoom = 6), darken = 0.5)

whiskyMap + geom_point(data = whiskies,
                       aes(x = lon, y = lat, colour = fit.cluster),
                       size = 3) # size is constant, so it belongs outside aes()

# Islay

whiskyMap <- ggmap(get_stamenmap(rbind(as.numeric(paste(geocode_OSM("Islay")$bbox))), zoom = 10), darken = 0.7)

# Alternatively, define the bounding box manually:
location.mat <- matrix(c(xmin=-6.6, ymin=55.55, xmax=-5.9, ymax=55.95), nrow=1)

whiskyMap <- ggmap(get_stamenmap(location.mat, zoom = 10), darken = 0.7)

whiskyMap + geom_text(data = whiskies,
                      aes(x = lon, y = lat,
                          label = rownames(whiskies),
                          color = fit.cluster),
                      size = 4)


K-means Clustering

  • Ü8 Business Case 2


“After your excellent analytical market report about the geographical distribution of the whisky market in Scotland, the marketing team is interested in knowledge about the

consumers. For that purpose, they gathered the purchase history for 21 brands of whisky over a one-year time period from 2218 consumers.

You will find the relevant information in the “Scotch” dataset from the package

“bayesm”. Install the package and load the dataset in R.

1. Use your data exploration skills to gather all relevant information about the dataset and the distribution of the preferred whisky brands: View the dataset. What is shown in rows and columns? What do the entries stand for?

2. How many consumers bought more than 18 different whiskies per year? What is the most popular whisky brand and how many consumers bought it?

3. Plot the number of consumers who bought each whisky brand“

# Task 2

library(bayesm)

data(Scotch)

# The package bayesm includes the dataset Scotch, which reports which
# brands of whisky 2218 respondents consumed in the previous year.
# rows: consumers
# columns: whisky brands
# entries: 1 = consumer bought this brand, 0 = otherwise

# How many consumers bought more than 18 different whiskies per year?

sum(rowSums(Scotch) > 18)

# What is the most popular whisky brand and how many consumers bought it?

which.max(colSums(Scotch)) # the most popular brand (name and column index)

max(colSums(Scotch)) # how many consumers bought it

# Plot the number of consumers who bought each whisky brand

number.whiskies <- colSums(Scotch)

plot(number.whiskies)
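
# Optional: plot() on a named vector only shows column indices on the
# x-axis; a sorted bar plot with the brand names as labels is easier to read.
barplot(sort(number.whiskies, decreasing = TRUE),
        las = 2, cex.names = 0.6,
        ylab = "Number of consumers")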


K-means Clustering (exam relevance unclear)

  • Ü8 Business Case 3

“After your excellent analytical market report about the geographical distribution of the whisky market in Scotland, the marketing team is interested in knowledge about the

consumers. For that purpose, they gathered the purchase history for 21 brands of whisky over a one-year time period from 2218 consumers.

You will find the relevant information in the “Scotch” dataset from the package “bayesm”.

Install the package and load the dataset in R.

1. Build a market segmentation model with k-means:

▪ Your market team is wondering how many clusters there are. Find a suitable

parameter k using AIC and conduct a k-means clustering

2. Find out the sizes of the clusters

3. Create a matrix with the cluster centers. What do the cluster centers stand for?

4. Have a look specifically at cluster 1. Which whisky brand was consumed the most?

5. Try to make a few statements about your clusters.“


# Task 3

# AIC for k-means: AIC = wss + 2 * k * d, where wss is the total
# within-cluster sum of squares, k the number of clusters, and d the
# number of dimensions (columns).

library(plyr)

d <- ncol(Scotch)

wss <- aaply(2:30, 1, function(k) {
  set.seed(174)
  fit <- kmeans(Scotch, k)
  fit$tot.withinss
})

# wss[i] holds the WSS for i + 1 clusters (the loop above ran over k = 2:30),
# so the penalty term has to use k itself, not the vector index:
aic <- aaply(2:30, 1, function(k) {
  wss.k <- wss[k - 1]
  wss.k + 2 * k * d
})

plot(2:30, aic, type = "b", ylab = "AIC", xlab = "Number of Clusters")
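
# Instead of reading the minimum off the plot, the best k can be taken
# from the vector directly (k.range / k.best are illustrative names):
k.range <- 2:30
k.best <- k.range[which.min(aic)]
k.best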

# 2. Sizes of the clusters

set.seed(174)

cl <- kmeans(Scotch, 5)

cl$size

# 3. Create a matrix with the cluster centers

# Add the cluster assignments to the Scotch dataset as a new column named "cluster":

Scotch$cluster <- cl$cluster

cl.eval <- function(x){

  # Create a matrix containing the cluster centers.
  # The required matrix dimensions are predictable:
  # the number of rows must match the number of clusters
  # and the number of columns must match that of the
  # Scotch data frame (21 brands + the cluster column).
  clMeans <- matrix(nrow = max(x$cluster), ncol = ncol(x))

  # Take over the column names of the data set for the
  # column names of the matrix:
  colnames(clMeans) <- colnames(x)

  # Create an empty vector that will later contain the
  # buying rates (percentages) of the most consumed whisky per cluster:
  clMax <- vector()

  # Create an empty vector that will later contain
  # the name of the most consumed whisky per cluster:
  clWhichMax <- vector()

  # Go through each cluster:
  for(i in 1:max(x$cluster)){

    # Select all rows of the current cluster i:
    clx <- x[x$cluster == i, ]

    # We want to keep all columns except the column with the
    # cluster assignments. Locate that column and remove it:
    del <- which(colnames(clx) == "cluster")
    clx <- clx[, -del]

    # Now fill row i of the matrix:
    # 1.) the column-wise cluster means; the last entry of each
    #     row stores the cluster number i.
    clMeans[i, ] <- c(as.vector(colMeans(clx)), i)

    # 2.) Save the buying rate (percentage) of the most consumed whisky per cluster:
    clMax[i] <- max(colMeans(clx))

    # 3.) Save the name of the most consumed whisky per cluster:
    clWhichMax[i] <- names(which.max(colMeans(clx)))
  }

  # Return a list with our results saved in three different entries:
  return(list(t(clMeans), clMax, clWhichMax))
}

# run the evaluation

results <- cl.eval(Scotch)

results

results[[1]] # transposed matrix of cluster centers

results[[2]] # buying rate of the most consumed whisky per cluster

results[[3]] # name of the most consumed whisky per cluster

Hierarchical Clustering

  • Ü9 Task 1

“The marketing team is interested in knowledge about the consumers. For that purpose, they gathered the purchase history for 21 brands of whisky over a one-year time period from 2218 consumers.”

“Use hierarchical clustering to cluster the whisky consumers, plot a dendrogram, cut the clustering, and evaluate.”

# Task 1

library(bayesm)

data(Scotch)

# Hierarchical clustering

set.seed(174) # not strictly needed: dist() and hclust() are deterministic

clusters <- hclust(dist(Scotch))

plot(clusters)

abline(h=3.9, col="green", lty="dashed")
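
# Optional: with 2218 observations the leaf labels overlap into a black
# band; plot.hclust() can suppress them to keep the dendrogram readable.
plot(clusters, labels = FALSE)
abline(h = 3.9, col = "green", lty = "dashed")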

clusterCut <- cutree(clusters, 5)

table(clusterCut) # Determine cluster sizes

Scotch$cluster <- clusterCut

# cl.eval() is the same evaluation helper as defined above in Ü8 Business
# Case 3 and is reused here unchanged.

cl.eval(Scotch)
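
# Illustrative extra (not part of the task): if the k-means fit "cl" from
# Ü8 Business Case 3 is still in the workspace, the two partitions can be
# cross-tabulated to see how far they agree.
table(hierarchical = clusterCut, kmeans = cl$cluster)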

C5.0

  • Ü10 Task 2

“Please reload the footsize.csv in your R workspace. We want to build a classification tree for the gender with a training and a test data set

1. Please create a test data set with the first 33 observations and take the remaining 67 observations as training data set

2. Build a classification tree with the C5.0 algorithm with the training data set and predict the gender of the observations in the test data set.

3. Use the error matrix from table() to calculate the error rate for the test data set. (Hint: the summary only shows the error rate for the training data set)

4. Repeat steps 1-3, but this time use every third observation as test data set and the other observations as training data set.

5. What could be the reason that the error rate in the classification tree of step 4 is lower than in the first classification tree?

6. Show that on average males are taller than females in the footsize data set. Then have a look how length is used in the decision tree of step 4. What is odd? What could be the reason for the unintuitive classification rule?“

# Task 2

# Footsize dataset

library(C50)

foot <- read.csv("footsize.csv")

# C5.0 requires a factor as target variable; since R 4.0 read.csv() returns
# character columns by default, so convert gender explicitly:
foot$gender <- as.factor(foot$gender)

# 1. Build train and test data set

foot.train <- foot[-(1:33),]

foot.test <- foot[1:33,]

# 2. Build the tree to predict gender

fit <- C5.0(gender ~ ., data = foot.train) # build

summary(fit)

# 3. Calculate the error rate

predictions <- predict(fit, foot.test[,-1]) # classify (filter out first column)

treetab <- table(predictions, foot.test[,1]) # evaluate

treetab

sum(treetab[1,2], treetab[2,1]) / sum(treetab)

# Error rate 15.15 %

# An error rate below 10% would have been great.

# This may be due to the unbalanced training and test data set.

# We have 50 f and 17 m (table(foot.train$gender)) // 0 f and 33 m (table(foot.test$gender))

table(foot.train$gender)

table(foot.test$gender)
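
# The manual sum over the two off-diagonal cells only works for a 2x2
# table; a small helper generalizes the error rate to any square
# confusion matrix ("err.rate" is a hypothetical name, not from the exercise):
err.rate <- function(tab) 1 - sum(diag(tab)) / sum(tab) # 1 - accuracy
err.rate(treetab) # same 15.15 % as computed above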

# 4. Change test and training data set

# Task citation:

# „but this time use every third observation

# as test data set and the other observations

# as training data set.“

#

# Think back to the first tutorials:
# if we pass a vector that is shorter than another
# vector/list, it is applied cyclically (recycled),
# and it is exactly this recycling we make use of here.
# Every third observation is used in the test data
# set and the rest as training data set. This way we
# try to keep the data sets balanced.

foot.train <- foot[c(T,T,F),]

foot.test <- foot[c(F,F,T),]
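
# The recycling at work on a small example (purely illustrative):
(1:9)[c(TRUE, TRUE, FALSE)]  # 1 2 4 5 7 8 -> every third element dropped
(1:9)[c(FALSE, FALSE, TRUE)] # 3 6 9       -> every third element kept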

fit <- C5.0(gender ~ ., data = foot.train)

summary(fit)

predictions <- predict(fit, foot.test[,-1])

treetab <- table(predictions, foot.test[,1])

sum(treetab[1,2], treetab[2,1]) / sum(treetab)

# Error rate 9.09 %

# 5.

# The first split in training and test data was not

# representative, because we only have males in the test data.

# The second time we used a balanced partition, therefore

# we get a better classification tree.

# 6. tapply: create group summaries based on factor levels.

tapply(foot$length, foot$gender, mean) # f=172.7049 m=186.7503

summary(fit)

# Between a foot size of 40.23 and 42.48, the classification
# tree labels shorter persons as male and taller persons
# as female, although the mean values suggest the opposite.
# Reason: men are on average taller than women in the data set.
# Tall men are therefore most likely to end up in the branch
# with foot sizes over 42.48, and short women in the branch
# with foot sizes below 40.23. The region in between mainly
# contains tall women and short men, so the rule is reversed there.


SVM

  • Ü11

  • Requirements

    • library("e1071")

      library(GGally)

      library(ggplot2)

      library(caret)

  • Task:

    • Use Iris Dataset

    • get in touch with the structure of the variables

    • plot a ggpairs overview

    • tune the parameters (training/testing split)

    • model parameters

    • fetch the best SVM model

    • predict species

    • which kernel performs best


  • structure and plot overview

str(iris)

head(iris)

ggpairs(iris, aes(color = Species, alpha = 0.4))


  • Tune parameters:

set.seed(42)

# Create balanced splits of the data. The y argument is a
# factor, so the random sampling occurs within each class
# and should preserve the overall class distribution of the data.

# caret's function is createDataPartition (lower-case c):
train_val_limiter <- createDataPartition(iris$Species, p = 0.8, times = 1, list = FALSE)


# Locate training dataset by indexing

train_validation <- iris[train_val_limiter,]


# Locate test dataset by indexing

test <- iris[- train_val_limiter,]


  • model parameters

tmodel <- tune(svm, Species ~ .,
               data = train_validation,
               ranges = list(kernel = c("linear", "polynomial", "radial"),
                             cost = 10^(-2:2)))

summary(tmodel)
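
# summary(tmodel) prints the full grid, but the raw cross-validation
# results are also available as a data frame via the tune object's
# "performances" element, which makes comparing the kernels easy:
tmodel$performances # error and dispersion per kernel/cost combination
aggregate(error ~ kernel, data = tmodel$performances, FUN = min) # best error per kernel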

  • fetch the best SVM model

mymodel <- tmodel$best.model

summary(mymodel)

mymodel$tot.nSV # Result: 11

# Refit the winning configuration (linear kernel) on the training data;
# note that the training data set is called train_validation:
svm_fitted_model <- svm(Species ~ ., data = train_validation, kernel = "linear", cost = 100)

# Generic signature of plot.svm, for reference:
# plot(x, data, formula, slice = list(), color.palette = cm.colors)

plot(svm_fitted_model, data = train_validation, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4),
     color.palette = cm.colors)


  • predict species

predict_species <- predict(svm_fitted_model, test)

confusion_matrix <- table(Predicted = predict_species, Actual = test$Species)

error_rate <- 1 - sum(diag(confusion_matrix)) / sum(confusion_matrix)


  • which kernel performs best

# linear
# Error rate: 0.06666667 # 6.67 % / 11 support vectors / (2; 3; 6)

# polynomial (fit on the training data, then evaluate on test as above)
svm_fitted_model <- svm(Species ~ ., data = train_validation, kernel = "polynomial",
                        cost = 10)
# Error rate: 0.10000000 # 10.00 % / 21 support vectors / (1; 11; 9)

# radial
svm_fitted_model <- svm(Species ~ ., data = train_validation, kernel = "radial",
                        cost = 1)
# Error rate: 0.13333333 # 13.33 % / 44 support vectors / (6; 19; 19)
