Apply
Ü4 Task 1
“Make a list with the variables a = 1, b = 2:5 and c = 5:50. Apply the sum()
function on each element of the list.”
# TASK 1
# Create the requested list with the variables a, b and c.
list_apply <- list(a = 1, b = 2:5, c = 5:50)
# Apply the sum function:
# There are two possible ways: Input is list -> sapply (Output: array) OR lapply (list).
# Way 1: Input: list -> Output: array
output_sapply <- sapply(list_apply, FUN = sum)
# Alternative Way 1: Since FUN is the second parameter in the function
# definition (see the documentation), it does not have to be named explicitly.
# REASON: If arguments are not named, R matches them by position,
# following the order in the function definition.
output_sapply <- sapply(list_apply, sum) # FUN is matched by position
####
# Way 2: Input: list -> Output: list
output_lapply <- lapply(list_apply, FUN = sum)
# Alternative Way 2:
output_lapply <- lapply(list_apply, sum)
What is the difference between [,1] and [1,]?
[1,]:
Extracts the entire first row of a data frame or matrix.
Example: df[1,] returns the first row of df.
[,1]:
Extracts the entire first column of a data frame or matrix.
Example: df[,1] returns the first column of df.
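A minimal sketch with a small example matrix m (hypothetical, for illustration only):
m <- matrix(1:9, nrow = 3) # columns are filled first: 1:3, 4:6, 7:9
m[1, ] # first row: 1 4 7
m[, 1] # first column: 1 2 3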
Split Apply
Ü4 Task 2
We are interested in the differences between our male and female consumers. Please compute
the rank of the foot length for each consumer according to his or her gender. You also
have to plot the different subsets. (Hint: A scatterplot is appropriate for this purpose)
# TASK 2
# Import "footsize.csv" for this task:
footsize <- read.csv("footsize.csv")
### Split - Apply - Combine
# split: Split the dataset by the available genders (binary: f / m)
pieces <- split(footsize, footsize$gender)
# Just nice to know:
pieces["f"]
pieces["m"]
## apply:
# Declare a temporary list. This list will be filled with the results later on.
# Produce a vector of the given length (-> length(pieces)) and mode (-> "list")
# Length is 2; pieces.transformed is a list (as defined by mode: see "?vector" for more info)
pieces.transformed <- vector(mode="list", length=length(pieces))
for(i in 1:length(pieces)){ # Start of the loop: iterate over the two sets male and female
# Get the element i from the list "pieces" which is also
# a list containing information about a specific gender i.
# Assign the specific gender list to the variable "piece".
piece <- pieces[[i]]
# Plot a scatterplot by using the length as x-axis and footsize as y-axis:
plot(piece$length, piece$footsize) # plot each set
# Calculate the Rank and add (or "bind") it as a new column by using cbind().
# In terms of the list data type, this means adding a new element of the
# same length as the existing columns. Use the ties.method "first":
# For more information about ranks: https://en.wikipedia.org/wiki/Ranking
# For more information about rank()-function: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rank
# Assign the result to the variable "piece".
piece <- cbind(piece, rank = rank(piece$length, ties.method = "first"))
# Add the new information to index i of the list "pieces.transformed"
pieces.transformed[[i]] <- piece
} # End of the loop
# combine the results to a new dataframe by binding the lists by rows
# and assign it to the variable "pieces.preprocessed"
pieces.preprocessed <- do.call("rbind", pieces.transformed)
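# Nice-to-know: the plyr package wraps this whole split-apply-combine pattern
# in a single call (a sketch, assuming plyr is installed; the package is
# introduced in Task 5 below):
library(plyr)
pieces.plyr <- ddply(footsize, "gender", transform, rank = rank(length, ties.method = "first"))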
Create a plot
Ü4 Task 3
“Plot the variables length and footsize from the footsize.csv dataset.
Use this symbol ( ) for plotting and draw the points in blue. (Hint: shape = 7)“
# TASK 3
### Using ggplot
library(ggplot2)
ggplot(footsize, aes(x = length, y = footsize)) + geom_point(shape = 7, col = "blue")
# Additional (but not requested): make the points larger via size:
ggplot(footsize, aes(x = length, y = footsize)) + geom_point(shape = 7, col = "blue", size = 2.5)
Ü4 Task 4
“What happens if you specify a variable name in the arguments color or shape?
Can you find a way to show the three variables gender, length, and footsize
from the footsize.csv dataset in one ggplot object?”
# TASK 4
ggplot(footsize, aes(x = length, y = footsize, col = gender)) + geom_point()
# If you specify a variable name in the arguments color or shape, R is going to
# set the color or shape by the corresponding variable. It is going to use the
# shape / color to separate by categories / different unique values in the variable.
# Another possibility: differentiating by shape
ggplot(footsize, aes(x = length, y = footsize, shape = gender)) + geom_point(size = 3)
# Another possibility: differentiating by shape and color (see the sketch below)
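# A sketch mapping gender to shape and color at once:
ggplot(footsize, aes(x = length, y = footsize, shape = gender, col = gender)) + geom_point(size = 3)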
Ü4 Task 5
Please reload the “students.csv”. Use the plyr package to calculate the
median age for each “survey_time”. Save the result in a list.
Create the following plot with the ggplot2 package (the point size is 2):
# TASK 5
# Compound tasks
library(plyr)
# Import dataset "students.csv"
students <- read.csv("students.csv")
# Input: dataframe -> Output: list => dlply
stud.list <- dlply(students, "survey_time", transform, m.age = median(age))
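# A quick check (sketch) that the result is indeed a list with one
# data frame per survey_time level:
str(stud.list, max.level = 1)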
ggplot(students, aes(x = shoe_size, y = size, color = gender, shape = survey_time))+
geom_point(size=2) +
scale_color_manual(values=c("blue","red"))+ # Specify your own set of colors (we know there are only two levels: female (w) and male (m) -> so we need to set two colors)
scale_shape_manual(values=c(3:6)) # Specify your own set of shapes (we know there are 4 levels: SuSe18, SuSe19, WiSe1819, WiSe1920 -> so we need to set four shapes; task choice: 3,4,5,6)
Conjoint Analysis
Ü5 Task 2
“Please load the tea data set and get an overview of the data. Answer the following questions:
2.1 Which tea profile has the highest average rating? What kind of tea is that?
2.2 Draw a boxplot of all ratings to see how ratings are distributed generally.
Extra:
Now install the package conjoint and run a traditional conjoint analysis on the tea dataset for the first respondent as shown on the last slide. Try to explain what the estimates imply. Can you find out whether black, green or red tea is preferred by this participant?“
# Exercise 5 Task 2
# 2.1
library(conjoint)
library(ggplot2)
data(tea)
## 2.1 ##
# step1
sapply(tprefm, mean) # mean rating of every profile in tprefm
# step2
max(sapply(tprefm, mean)) # highest mean
# step3
which.max(sapply(tprefm, mean)) # index/name of the profile with the highest mean
tprof[which.max(sapply(tprefm, mean)), ] # the corresponding tea profile
## 2.2 ##
ggplot(tpref) + geom_boxplot(aes(y=tpref[,1]))
# nice-to-know
mean(as.numeric(tprefm[1,])) # mean of the first respondent
# Extra: run a traditional conjoint analysis for the first respondent:
caModel(y=tprefm[1,], x=tprof)
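# To see whether black, green or red tea is preferred, the part-worth
# utilities of the individual levels are easier to read than the raw model
# estimates. A sketch, assuming the conjoint package's caUtilities() helper:
caUtilities(y=tprefm[1,], x=tprof, z=tlevn)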
Ü6 Task 1
“Please run a complete conjoint analysis and try to estimate the average importance of the factors for the tea dataset. Which attribute is the “most important” one?“
#z = matrix of level names
Conjoint(y=tpref, x=tprof, z=tlevn)
head(data.cbc)
K means Clustering
Ü8 Business Case 1
“The Junglivet Whiskey Company is interested in the geographical and spatial distribution of the whisky market in Scotland. They plan to put the new 10 years Junglivet on the market. It is very full-bodied and smoky. For that purpose, cluster the whiskies and find the most full-bodied and smoky group of whiskies. And plot them with ggplot on a map of Scotland to support the marketing and sales team. Try to structure your work according to the process model from the lecture. Before you build the cluster, try to understand the data. Please check whether it is true that the most smoky and full-bodied whisky distilleries are from Islay. Use the whiskies.csv file for your analysis.“
# Exercise 8 Task 1
whiskies <- read.csv("whiskies.csv", row.names=1, stringsAsFactors=FALSE)
summary(whiskies)
View(whiskies)
# we need to rescale before applying kmeans
str(whiskies)
#Data Understanding
ggplot(whiskies, aes(Body)) + geom_bar()
ggplot(whiskies, aes(Smoky)) + geom_bar()
# Preparation:
sum(is.na(whiskies)) # no missing values
#example (has nothing to do with exercise)
fit <- kmeans(scale(whiskies[,1:5]), 3)
# preparation for kmeans
whiskies_k <- scale(whiskies[,1:12]) # rescale (prerequisite) for kmeans
summary(whiskies_k)
ssplot <- function(data, maxCluster=9){
  SSw <- vector()
  for (i in 2:maxCluster) {
    set.seed(675)
    SSw[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:maxCluster, SSw, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
}
ssplot(whiskies_k)
fit <- kmeans(whiskies_k, centers=5)
whiskies <- data.frame(whiskies, fit$cluster)
fit$centers
taste.centers <- as.data.frame(fit$centers)
ggplot(taste.centers , aes(x = Body, y = Smoky, label=rownames(taste.centers))) +
geom_point() +
geom_label(size=10)
# Inspect which whiskies fall into cluster 3 (note: without a fixed seed the
# cluster numbering can change between runs; in this run cluster 3 was the
# full-bodied and smoky group):
whiskies$fit.cluster==3
body_smoky_whiskies <- subset(whiskies, fit.cluster == 3)[,1:13]
body_smoky_whiskies
whiskies_r <- whiskies[c(1:12, 16)]
candidates <- by(whiskies_r[-13], whiskies_r[13], function(data){
# go through each subset describing a cluster
# calculate the sum of squared deviations.
dists <- sapply(data, function(x) (x - mean(x))^2)
dists <- rowSums(dists)
# return the names of the whiskey distillery most representative of the
# cluster in question (for which the distance from the mean is the smallest)
rownames(data)[dists == min(dists)]
})
whiskies[row.names(whiskies) %in% unlist(candidates), c(1:12,16)]
# nice-to-know (ggmap not relevant for exam!)
library(ggmap)
library(tmaptools)
whiskyMap <- ggmap(get_stamenmap(rbind(as.numeric(paste(geocode_OSM("Scotland")$bbox))), zoom = 6), darken = 0.5)
whiskyMap + geom_point(data = whiskies,
aes(x = lon,
y = lat,
colour = fit.cluster,
size = 3))
# Islay
whiskyMap <- ggmap(get_stamenmap(rbind(as.numeric(paste(geocode_OSM("Islay")$bbox))), zoom = 10), darken = 0.7)
location.mat <- matrix(c(xmin=-6.6, ymin=55.55, xmax=-5.9, ymax=55.95), nrow=1)
whiskyMap <- ggmap(get_stamenmap(location.mat, zoom = 10), darken = 0.7)
whiskyMap + geom_text(data = whiskies,
aes(x = lon,
y = lat,
label = rownames(whiskies),
color = fit.cluster),
size = 4)
Ü8 Business Case 2
“After your excellent analytical market report about the geographical distribution of the whisky market in Scotland, the marketing team is interested in knowledge about the
consumers. For that purpose, they gathered the purchase history for 21 brands of whisky over a one year time period from 2218 consumers.
You will find the relevant information in the “Scotch” dataset from the package
“bayesm”. Install the package and load the dataset in R.
1. Use your data exploration skill to gather all relevant information about the dataset and the distribution of the preferred whisky brands: View the dataset. What is shown in rows and columns? What do the entries stand for?
2. How many consumers bought more than 18 different whiskies per year? What is the most popular whisky brand and how many consumers bought it?
3. Plot the number of consumers who bought each whisky brand“
# Task 2
library(bayesm)
data(Scotch)
# The package bayesm includes the dataset Scotch, which reports which
# brands of whisky 2218 respondents consumed in the previous year.
# rows: consumers
# columns: whisky brands
# entries: 1 if the consumer bought this brand, 0 otherwise
# How many consumers bought more than 18 different whiskies per year?
sum(rowSums(Scotch) > 18)
# What is the most popular whisky brand and how many consumers bought it?
which.max(colSums(Scotch))
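# The count itself answers the second half of the question:
max(colSums(Scotch)) # number of consumers who bought the most popular brand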
#Plot the number of consumers who bought each whisky brand
number.whiskies <- colSums(Scotch)
plot(number.whiskies)
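# A more readable alternative (sketch): sort the counts and label the bars
# with the brand names:
barplot(sort(number.whiskies, decreasing = TRUE), las = 2, cex.names = 0.6)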
K means Clustering (relevant??)
Ü8 Business Case 3
“The marketing team is interested in knowledge about the consumers. For that purpose they gathered the purchase history for 21 brands of whisky over a one year time period from 2218 consumers.
You will find the relevant information in the “Scotch” dataset from the package “bayesm”.
Install the package and load the dataset in R.
1. Build a market segmentation model with k-means:
▪ Your market team is wondering how many clusters there are. Find a suitable
parameter k using AIC and conduct a k-means clustering
2. Find out the sizes of the clusters
3. Create a matrix with the cluster centers. What do the cluster centers stand for?
4. Have a look specifically at cluster 1. Which whisky brand was consumed the most?
5. Try to make a few statements about your clusters.“
# Task 3
# AIC for k-means: wss + 2 * k * d, with d = number of dimensions
library(plyr) # for aaply
d <- ncol(Scotch)
k.range <- 2:30
wss <- aaply(k.range, 1, function(k) {
  set.seed(174)
  fit <- kmeans(Scotch, k)
  fit$tot.withinss
})
aic <- wss + 2 * k.range * d
plot(k.range, aic, type="b", ylab="AIC", xlab="Number of Clusters")
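# One way to read off a candidate k programmatically (a sketch; the
# tutorial settled on k = 5 by eye):
k.range[which.min(aic)] # k with the lowest AIC value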
# size of clusters (2)
set.seed(174) # keep the cluster numbering reproducible
cl <- kmeans(Scotch, 5)
cl$size
# Task 3.3
# Add cluster assignments to the scotch dataset as a new column named "cluster"
Scotch$cluster <- cl$cluster
cl.eval <- function(x){
# Create a matrix containing the cluster centers.
# For this, the required matrix dimensions are predictable:
# one row per cluster (max(x$cluster)) and one column per
# column of the scotch data frame (the last column will
# hold the cluster number).
clMeans <- matrix(nrow=max(x$cluster), ncol=ncol(x))
# Take over the column names of the data set for the
# column names of the matrix
colnames(clMeans) <- colnames(x)
# Create an empty vector that later contains the
# percentages of the most consumed whisky per cluster.
clMax <- vector()
# Create an empty vector that will later contain
# the name of the most consumed whisky per cluster.
clWhichMax <- vector()
# Go through each cluster
for(i in 1:max(x$cluster)){
# Add all rows of the respective cluster in the
# iteration to the new variable clx.
clx <- x[x$cluster==i,]
# We want to inherit all columns except the column
# with the cluster assignments "Cluster" from clx.
# For this we locate the column location and remove
# it from the new dataframe "clx".
del <- which( colnames(clx)=="cluster" )
clx <- clx[,-del]
# Now we start to fill out the rows of the matrix:
# 1.) add the cluster means column-wise;
# the last entry of each row is the cluster number i
clMeans[i,] <- c(as.vector( colMeans(clx) ),i)
# 2.) Save the buying rate (percentage) of most consumed whisky per cluster
clMax[i] <- max(colMeans(clx))
# 3.) Save the name of the most consumed whisky per cluster
clWhichMax[i] <- names( which.max(colMeans(clx)) )
} # end of the loop over the clusters
# Return the list with our results saved in three different entries.
return(list(t(clMeans),clMax,clWhichMax))
} # end of cl.eval
# run the evaluation
results <- cl.eval(Scotch)
results
results[[1]]
results[[2]]
results[[3]]
HC Clustering
Ü9 Task 1
“The marketing team is interested in knowledge about the consumers. For that purpose they gathered the purchase history for 21 brands of whisky over a one year time period from 2218 consumers”
“Use hierarchical clustering to cluster the whisky consumers, plot a dendrogram, cut the clustering, evaluate”
# Task 1
# Hierarchical clustering
data(Scotch) # reload first, so the k-means cluster column from Task 3 is not part of the distances
clusters <- hclust(dist(Scotch))
plot(clusters)
abline(h=3.9, col="green", lty="dashed")
clusterCut <- cutree(clusters, 5)
table(clusterCut) # Determine cluster sizes
Scotch$cluster <- clusterCut
cl.eval(Scotch)
Association Rules
Ü9 Task 2
“Load the Scotch dataset again:
▪ Create a submatrix of the Scotch dataset which only consists of the whisky brands Chivas.Regal and Dewar.s.White.Label
▪ We consider an association rule in which we treat Chivas.Regal as LHS which implies the buying behavior of Dewar.s.White.Label (RHS)
▪ Calculate the support (percentage of customers who bought both whiskies)
▪ Calculate the confidence of this association rule“
##### Association Rules
# For general info: The two scotches "Chivas.Regal" and "Dewar.s.White.Label" are bought the most.
colSums(Scotch)
scotch.ar <- Scotch[,1:2]
rs <- rowSums(scotch.ar)
sum(rs==2) / nrow(scotch.ar) # Support = 0.0825
sub <- scotch.ar[scotch.ar[,1]==1,]
mean(sub[,2]) # confidence = 0.2270
# slides
library(arules)
library(arulesViz)
shopping.data <- read.transactions("scottish-supermarket.csv", format = "basket", sep=",")
itemFrequencyPlot(shopping.data, topN=20, type="absolute")
summary(shopping.data) # nice-to-know
rules <- apriori(data=shopping.data, parameter=list(supp=0.001, conf=0.001),
appearance = list(default="lhs", rhs="whisky"),
control=list(verbose=F))
inspect(rules[1:5])
rules <- sort(rules, decreasing=T, by="confidence")
plot(rules[1:5], method="graph")
Different Methods of Classification
Ü10
CART
C5.0
RF
“use Dataset Iris, build classification model, use it to classify, evaluate it“
library(rpart)
fit <- rpart(Species~., data=iris) # build
predictions <- predict(fit, iris[,1:4], type="class") # classify
table(predictions, iris$Species) # evaluate
library(C50)
fit <- C5.0(Species~., data=iris) # build
summary(fit)
predictions <- predict(fit, iris[,1:4]) # classify
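# Evaluation step, analogous to the rpart and randomForest examples:
table(predictions, iris$Species) # evaluate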
library(randomForest)
fit <- randomForest(Species~., data=iris)
predictions <- predict(fit, iris[,1:4])
table(predictions, iris$Species)
Random Forest
Ü10 Task 1
“In the StudIP folder you will find the whisky dataset scotch.csv. Load the data into your R workspace and delete the first column (with the names) or take it as row names. Then build a classification tree with the Random Forest algorithm to predict the region of the whiskies
How many predictions were correct?“
# Tasks
scotchs <- read.csv("scotch.csv", row.names = 1, header = T, sep=",")
scotchs$region <- factor(scotchs$region) # needed on R >= 4.0, where read.csv no longer creates factors by default
fit <- randomForest(region ~ ., data = scotchs) # build
predictions <- predict(fit, scotchs[,-72]) # classify
table(predictions, scotchs$region) # evaluate
# All 109 classifications were correct.
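# A direct way to count the correct predictions (sketch):
sum(predictions == scotchs$region) # 109 = all rows classified correctly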
Ü10 Task 2
“Please reload the footsize.csv in your R workspace. We want to build a classification tree for the gender with a training and a test data set
1. Please create a test data set with the first 33 observations and take the remaining 67 observations as training data set
2. Build a classification tree with the C5.0 algorithm with the training data set and predict the gender of the observations in the test data set.
3. Use the error matrix from table() to calculate the error rate for the test data set. (Hint: the summary only shows the error rate for the training data set)
4. Repeat steps 1-3, but this time use every third observation as test data set and the other observations as training data set.
5. What could be the reason that the error rate in the classification tree of step 4 is lower than in the first classification tree.
6. Show that on average males are taller than females in the footsize data set. Then have a look how length is used in the decision tree of step 4. What is odd? What could be the reason for the unintuitive classification rule?“
# Footsize dataset
foot <- read.csv("footsize.csv")
foot$gender <- factor(foot$gender) # C5.0 expects a factor target (needed on R >= 4.0)
# 1. Build train and test data set
foot.train <- foot[-(1:33),]
foot.test <- foot[1:33,]
# 2. Build the tree to predict gender
fit <- C5.0(gender ~ ., data = foot.train) # build
# 3. Calculate the error rate
predictions <- predict(fit, foot.test[,-1]) # classify (filter out first column)
treetab <- table(predictions, foot.test[,1]) # evaluate
treetab
sum(treetab[1,2], treetab[2,1]) / sum(treetab)
# Error rate 15.15 %
# An error rate below 10% would have been great.
# This may be due to the unbalanced training and test data set.
# We have 50 f and 17 m (table(foot.train$gender)) // 0 f and 33 m (table(foot.test$gender))
table(foot.train$gender)
table(foot.test$gender)
# 4. Change test and training data set
# Task citation:
# „but this time use every third observation
# as test data set and the other observations
# as training data set.“
#
# Think back to the first tutorials here:
# if we index with a logical vector that is
# shorter than the object it indexes, the vector
# is recycled cyclically, and exactly this
# recycling we make use of here. Every third
# observation goes into the test data set and
# the rest into the training data set. This way
# we try to balance the data sets.
foot.train <- foot[c(T,T,F),]
foot.test <- foot[c(F,F,T),]
fit <- C5.0(gender ~ ., data = foot.train)
predictions <- predict(fit, foot.test[,-1])
treetab <- table(predictions, foot.test[,1])
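# Same error-rate computation as in step 3:
treetab
sum(treetab[1,2], treetab[2,1]) / sum(treetab)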
# Error rate 9.09 %
# 5.
# The first split in training and test data was not
# representative, because we only have males in the test data.
# The second time we used a balanced partition, therefore
# we get a better classification tree.
# 6. tapply: create group summaries based on factor levels.
tapply(foot$length, foot$gender, mean) # f=172.7049 m=186.7503
# Between a footsize of 40.23 and 42.48, the classification
# tree classifies shorter persons as male and taller persons
# as female, but the mean values suggest the opposite.
# Reason: Men are generally slightly taller than women in
# the data set. Tall men are therefore most likely to be
# in the branch with footsizes over 42.48. Small women,
# on the other hand, are very likely to be in the branch
# with footsizes below 40.23. In the area in between there
# are mainly tall women and small men. The query is
# therefore reversed here.
SVM
Ü11
Requirements
library("e1071")
library(GGally)
library(caret)
Task:
Use Iris Dataset
get familiar with the structure of the variables
plot a ggpair
Tune the parameter (Training / Testing)
Modelling parameter
fetch the best svm model
predict species
which kernel performs best
structure and plot overview
str(iris)
head(iris)
ggpairs(iris, aes(color = Species, alpha = 0.4))
Tune Parameter:
set.seed(42)
# Create balanced splits of the data. The y argument is a
# factor, so the random sampling occurs within each class
# and should preserve the overall class distribution of the
# data.
train_val_limiter <- createDataPartition(iris$Species, p=0.8, times=1, list=FALSE)
# Locate training dataset by indexing
train_validation <- iris[train_val_limiter,]
# Locate test dataset by indexing
test <- iris[- train_val_limiter,]
modelling parameter
tmodel = tune(svm, Species~.,
data= train_validation,
ranges=list(kernel=c("linear","polynomial","radial"),
cost = 10^(-2:2))
)
summary(tmodel)
fetch best svm model
mymodel = tmodel$best.model
summary(mymodel)
mymodel$tot.nSV # Result: 11
svm_fitted_model <- svm(Species ~ ., data=train_validation, kernel="linear", cost= 100)
# Generic signature from the docs: plot(x, data, formula, slice = list(), color.palette = cm.colors)
plot(svm_fitted_model, data=train_validation, Petal.Width~Petal.Length,
slice = list(Sepal.Width=3, Sepal.Length=4),
color.palette = cm.colors)
predict_species = predict(svm_fitted_model, test)
confusion_matrix = table(Predicted = predict_species, Actual = test$Species)
error_rate = 1-sum(diag(confusion_matrix)/sum(confusion_matrix))
#linear
# Error rate: 0.06666667 # 6.67% / 11 Support Vectors / (2; 3; 6)
#polynomial
svm_fitted_model <- svm(Species ~ ., data=train_validation, kernel="polynomial",
cost= 10)
# Error rate: 0.10000000 # 10.00% / 21 Support Vectors / (1; 11; 9)
#Radial
svm_fitted_model <- svm(Species ~ ., data=train_validation, kernel="radial",
cost= 1)
# Error rate: 0.13333333 # 13.33% / 44 Support Vectors / (6; 19; 19)
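# A compact sketch comparing all three kernels in one go: fit each on the
# training data and report its error rate on the held-out test set
# (assumes train_validation and test from above):
for (k in c("linear", "polynomial", "radial")) {
  m <- svm(Species ~ ., data = train_validation, kernel = k)
  err <- mean(predict(m, test) != test$Species)
  cat(k, "test error rate:", round(err, 4), "\n")
}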