rep(6:9, 2)
6 7 8 9 6 7 8 9
c(3 != sqrt(9), TRUE == (3 > 8))
FALSE FALSE
Let x <- c(1, 6, 3, 2). What is the output of sort(x)?
x <- c(1, 6, 3, 2)
sort(x)
1 2 3 6
# REMEMBER: sort(x) sorts the elements of a vector and returns them;
# order(x), instead, outputs the ordered indexes of the elements of the input vector.
Let x <- c(1, 6, 3, 2). What is the output of sort(x, decreasing = TRUE)?
sort(x, decreasing = TRUE)
6 3 2 1
Let x <- c(1, 6, 3, 2). What is the output of order(x)?
order(x)
1 4 3 2
Let x <- c(1, 6, 3, 2). What is the output of order(x, decreasing = TRUE)?
order(x, decreasing = TRUE)
2 3 4 1
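The relationship between sort and order can be checked directly: indexing x by order(x) reproduces sort(x). A minimal base-R sketch:

```r
x <- c(1, 6, 3, 2)
sort(x)       # values in ascending order: 1 2 3 6
order(x)      # indexes of the sorted values: 1 4 3 2
x[order(x)]   # indexing by order(x) gives the same result as sort(x)
identical(sort(x), x[order(x)])  # TRUE
```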
What is the sum of the first 100 positive integers? The formula for the sum of integers from 1 to n is n(n + 1)/2. Define n = 100 and then use R to compute the sum from 1 to 100 using the formula. What is the value of the sum?
n <- 100
sum_n <- n*(n+1)/2
sum_n
n <- 1000
x <- seq(1, n)
sum(x)
Based on the result, what do you think the functions seq and sum do? You can use help.
seq creates a vector of consecutive numbers and sum adds them up.
In math and programming, we say that we evaluate a function when we replace the argument with a given value. So if we type sqrt(4), we evaluate the sqrt function. In R, you can evaluate a function inside another function. The evaluations happen from the inside out. Use one line of code to compute the logarithm, in base 10, of the square root of 100.
log10(sqrt(100))
## [1] 1
log(sqrt(100), base = 10)
## [1] 1
Assuming x is numeric, which of the following will always return the value stored in x? You can try out examples and use the help system if you want.
log(exp(x))
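Since log and exp are inverse functions, log(exp(x)) returns x for any numeric x. A small check, using an arbitrary value:

```r
x <- 7.5           # arbitrary numeric value for illustration
log(exp(x))        # the natural log undoes exp: returns 7.5
log10(10^x)        # the base-10 version of the same identity
exp(log(x))        # also returns x, but only valid for x > 0
```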
What is the outcome of the following sum: 1 + 1/2² + 1/3² + ⋯ + 1/100²? Thanks to Euler, we know it approximates π²/6. Compute the sum and check that it is close to the approximation. Note that R has a variable for π stored as pi.
sum(1/seq(1, 100)^2)
## [1] 1.634984
pi^2/6
## [1] 1.644934
library(dslabs) # the murders dataset comes from the dslabs package
data(murders)
str(murders)
Use str to examine the structure of the murders object. Which of the following best describes the variables represented in this data frame?
What are the column names used by the data frame?
# We can use the function names()
names(murders)
# alternatively we can also use the function colnames()
colnames(murders)
Class of an object (e.g. a column in a data frame)
a <- murders$abb # access a column of a data frame with $
class(a) # get the class; it is also possible to use str()
We saw that the region column stores a factor. You can corroborate this by typing:
???
With one line of code, use the function levels and length to determine the number of regions defined by this dataset.
class(murders$region)
length(levels(murders$region))
function table
5. The function table takes one or multiple vectors and returns the frequency of each element. For example:
x <- c('DC', 'Alabama', 'Florida', 'Florida', 'DC', 'DC')
table(x)
## x
## Alabama DC Florida
## 1 3 2
You can quickly see how many states are in each region by applying this function. Use this function in one line of code to create a table of states per region.
table(murders$region)
## Northeast South North Central West
## 9 17 12 13
function c
Use the function c to create a vector with the average high temperatures in January for Beijing, Lagos, Paris, Rio de Janeiro, San Juan, and Toronto, which are 35, 88, 42, 84, 81, and 30 degrees Fahrenheit. Call the object temp.
temp <- c(35, 88, 42, 84, 81, 30)
temp
## [1] 35 88 42 84 81 30
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
city
function names()
names(temp) <- city
temp
## Beijing Lagos Paris Rio de Janeiro San Juan
## 35 88 42 84 81
## Toronto
## 30
access the temperature of the first three cities on the list.
access the temperature of Paris and San Juan.
temp[1:3]
## Beijing Lagos Paris
## 35 88 42
temp[c("Paris", "San Juan")]
## Paris San Juan
## 42 81
: operator
37:73
## [1] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
## [26] 62 63 64 65 66 67 68 69 70 71 72 73
Create a vector containing all the positive odd numbers smaller than 100.
Create a vector of numbers that starts at 6, does not pass 55, and adds numbers in increments of 4/7: 6, 6 + 4/7, 6 + 8/7, and so on. How many numbers does the vector have? Hint: use seq and length
seq(1, 99, 2)
## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
## [26] 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99
v <- seq(6, 55, 4/7)
length(v)
## [1] 86
What is the class of the following object a <- seq(1, 10, 0.5)?
What is the class of the following object a <- seq(1, 10)?
The class of 1 is numeric, not integer. R defaults to numeric; to force an integer, you need to add the letter L. Confirm that the class of 1L is integer.
Define the following character vector:
x <- c("1", "3", "5")
and coerce it to get integers.
a <- seq(1, 10, 0.5)
class(a)
## [1] "numeric"
a <- seq(1, 10)
class(a)
## [1] "integer"
class(1L)
x_int <- as.integer(x)
x_int
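Coercion can also fail: if a string does not represent a number, as.integer returns NA and issues a warning. A small sketch of both cases:

```r
x <- c("1", "3", "5")
as.integer(x)         # coerces cleanly: 1 3 5
class(as.integer(x))  # "integer"
as.integer("b")       # cannot be coerced: NA, with a warning
```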
minimum value
pop <- murders$population
pop <- sort(pop) # sorts ascending by default
pop[1] # smallest value at the first position
# OR
min(pop) # same value as above
find the index of the entry with the smallest population size.
Now we know what the smallest population is and the row which contains it. What is the name of this state?
smallest_idx <- order(pop) # order() outputs the indexes of the ordered vector, not the values
smallest_idx[1] # index with the smallest population
## [1] 51
pop[smallest_idx[1]] # smallest population value
## [1] 563626
which.min(murders$population) # which.min() outputs the index containing the minimal value
murders$state[which.min(murders$population)]
## [1] "Wyoming"
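The three approaches are equivalent: order(x)[1] and which.min(x) return the same index, and x at that index is min(x). A sketch on a toy vector (the values are made up for illustration):

```r
pop <- c(563626, 4779736, 710231)  # toy population values
which.min(pop)       # index of the smallest entry: 1
order(pop)[1]        # same index via order()
pop[which.min(pop)]  # the smallest value itself, same as min(pop)
```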
create a data frame
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)
rank function
Use the rank function to determine the population rank of each state from smallest population size to biggest. Save these ranks in an object called ranks, then create a data frame with the state name and its rank. Call the data frame my_df.
ranks <- rank(murders$population)
my_df <- data.frame(state = murders$state, rank = ranks)
head(my_df)
find NAs
ind <- is.na(na_example) # is.na() outputs a Boolean
table(ind)
## FALSE TRUE
## 855 145
# alternatively we can just sum the Boolean
# sum(ind)
# or subset the vector to contain only the NA's and obtain its length
# length(ind[ind])
# to obtain the number of NA's.
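Because TRUE counts as 1 in arithmetic, summing the Boolean vector counts the NAs directly. A sketch on a toy vector:

```r
v <- c(2, NA, 5, NA, 1)     # toy vector with missing values
ind <- is.na(v)
sum(ind)                    # number of NAs: 2
mean(v[!ind])               # mean over the non-NA entries only
mean(v, na.rm = TRUE)       # same result via the na.rm argument
```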
compute the average again, but only for the entries that are not NA.
mean(na_example, na.rm=TRUE)
## [1] 2.301754
# Can also be achieved using our ind vector:
ind <- is.na(na_example)
mean(na_example[!ind])
How do you create this?
awesome.dt <- data.table(x = order(1:6, decreasing = TRUE), y = rep(c(TRUE, FALSE), each = 3))
Find the last 10 flights arriving in LAX on Christmas Eve (24. December). Don’t worry about sorting for now, just find the last 10 entries in the table.
flights[MONTH == 12 & DAY == 24 & DESTINATION_AIRPORT == "LAX"] %>% tail(n=10)
Let iris.dt <- data.table(iris). What happens if we run iris.dt[Species != "setosa" | Sepal.Length <= 5, .N, by = Species]?
Get the number of rows for each unique value of Species, for all the rows where Species is not setosa or with Sepal.Length less or equal to 5.
Calculate the total number of outbound flights in the summer months (June to August) for each origin airport.
flights[MONTH %in% 6:8, .N, by = ORIGIN_AIRPORT]
Download the datasets needed for the exercise from Moodle, extract them, and put them in a folder called extdata.
check class of all columns
change class
data_folder_name <- "extdata"
users_dt < fread(file.path("extdata","BXUsers.csv"))
books_dt < fread(file.path("extdata","BXBooks.csv"))
ratings_dt < fread(file.path("extdata", "BXBookRatings.csv"))
sapply(users_dt, class)
## Change the type of Age to be numeric
users_dt[, Age := as.numeric(Age)]
Return the first 5 and last 5 observations of the table ratings_dt.
ratings_dt
## running `ratings_dt` for a data table automatically displays the following
# head(ratings_dt, n=5)
# tail(ratings_dt, n=5)
gsub() function
s <- 'R is fun'
gsub('R', 'DataViz', s)
## [1] "DataViz is fun"
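The same idea applies to cleaning column names, e.g. replacing hyphens with underscores. A sketch on a hypothetical vector of BX-style column names:

```r
nms <- c("User-ID", "Book-Rating", "Age")  # hypothetical column names
gsub("-", "_", nms)                # "User_ID" "Book_Rating" "Age"
gsub("-", "_", nms, fixed = TRUE)  # fixed = TRUE treats the pattern literally, not as a regex
```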
colnames(users_dt) <- gsub("-", "_", colnames(users_dt))
colnames(books_dt) <- gsub("-", "_", colnames(books_dt))
colnames(ratings_dt) <- gsub("-", "_", colnames(ratings_dt))
Delete the columns
books_dt[, c("Image_URL_S", "Image_URL_M", "Image_URL_L"):=NULL]
Create a table book_dt_2 that contains all the books published between 1900 and 2019 (inclusive) from the table books_dt.
book_dt_2 <- books_dt[Year_Of_Publication >= 1900 & Year_Of_Publication <= 2019]
How many different authors are included in the table books_dt?
How many different authors are included for each year of publication between 2000 and 2010 (inclusive) in books_dt?
books_dt[, uniqueN(Book_Author)]
## [1] 98540
# uniqueN() is the same as length(unique())
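The equivalence of uniqueN() and length(unique()) can be checked on a toy vector (made-up author names):

```r
authors <- c("King", "Christie", "King", "Keene")  # toy vector with a duplicate
length(unique(authors))   # 3 distinct authors
# data.table::uniqueN(authors) returns the same count
```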
books_dt[Year_Of_Publication >= 2000 & Year_Of_Publication <= 2010,
         uniqueN(Book_Author),
         by = Year_Of_Publication]
In how many observations is the age information missing in the ratings table users_dt?
What is the maximum rating value in the ratings table?
What is the most common rating value larger than 0?
Which are the book identifiers (ISBN) with the highest ratings?
Reorder the ratings table according to the rating value of each book in descending order
users_dt[is.na(Age), .N] # or users_dt[, sum(is.na(Age))]
ratings_dt[, max(Book_Rating, na.rm=TRUE)]
ratings_dt[Book_Rating > 0, .N, by=Book_Rating][N == max(N)]
ratings_dt[Book_Rating == max(Book_Rating, na.rm=TRUE), "ISBN"] %>% head
# ratings_dt <- ratings_dt[order(-Book_Rating)]
# or
setorder(ratings_dt, -Book_Rating)
ifelse() function
observations are considered to be a high ranking
ratings_dt[, High_Rating := ifelse(Book_Rating > 7, 1, 0)]
ratings_dt[, sum(High_Rating)] # absolute
## [1] 219361
ratings_dt[, sum(High_Rating)/.N] # relative
## [1] 0.2166789
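The ifelse pattern works the same way on any numeric vector: it maps a condition to 1/0, and the sum and mean of the result give absolute and relative counts. A sketch on toy rating values:

```r
ratings <- c(3, 9, 8, 0, 7)          # toy rating values
high <- ifelse(ratings > 7, 1, 0)    # 1 where rating > 7, else 0
sum(high)                            # absolute count of high ratings: 2
sum(high) / length(high)             # relative share: 0.4
```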
Which users did not give any rating to any book? Filter these users out from users_dt. Hint: There’s no need to merge users_dt with ratings_dt, we are simply interested in the users that are not in ratings_dt.
What is the most common age of users who rated at least one book?
On average, how many books did a user rate?
What is the title of the first published book with the highest ranking?
In which year was a book with the largest number of ratings last published?
Add to the table ratings_dt the highest ranking that each book received as a new column called
Max_Book_Ranking.
users_who_rated <- ratings_dt[, User_ID]
users_dt[!(User_ID %in% users_who_rated)]
users_dt[User_ID %in% users_who_rated & !is.na(Age), .N, by=Age][N == max(N)]
## Age N
## 1: 26 1558
ratings_dt[, .N, by=User_ID][, mean(N, na.rm=TRUE)]
## [1] 11.2414
ratings_dt[order(Year_Of_Publication, -Book_Rating), .(Book_Title, Year_Of_Publication, Book_Rating)] %>% head(1)
## Book_Title Year_Of_Publication Book_Rating
## 1: Darcys Utopia 0 10
ratings_dt[, Rating_Count := .N, by=ISBN]
ratings_dt[Rating_Count == max(Rating_Count), max(Year_Of_Publication)]
ratings_dt[, Max_Book_Rating := max(Book_Rating), by=ISBN]
ratings_dt
Subset the ratings_dt ratings table to contain only books written by the following authors:
How many ratings has each author from the previous exercise 9? What is their max and average ranking?
authors <- c("Agatha Christie", "William Shakespeare", "Stephen King", "Ann M. Martin",
             "Carolyn Keene", "Francine Pascal", …)
ratings_dt_sub <- ratings_dt[Book_Author %in% authors]
ratings_dt_sub[, .(mean(Book_Rating), max(Book_Rating), .N), by=Book_Author]
Using the summer_olympic_medals.xlsx file, which athlete won most bronze medals?
library(readxl)
oly_file <- file.path("extdata", "summer_olympic_medals.xlsx")
oly_df <- read_xlsx(oly_file, sheet='ALL MEDALISTS')
head(oly_df)
# There are different solutions for this
oly_dt <- as.data.table(oly_df)
bronze <- oly_dt[Medal == "Bronze",]
# Using the .N command from data.table. More on this in the data.table lecture.
bronze[, N := .N, by = Athlete]
bronze[N == max(N), unique(Athlete)]
Are the columns Gender and Event_gender consistent? Find inconsistent gender entries.
# There was a male bronze medal winner listed for the ladies' marathon in 2000.
oly_dt[, unique(Gender)]
## [1] "Men" "Women"
oly_dt[, unique(Event_gender)]
## [1] "M" "X" "W"
oly_dt[Gender == "Men" & !Event_gender %in% c("M", "X")]
## City Edition Sport Discipline Athlete NOC Gender Event
## 1: Sydney 2000 Athletics Athletics CHEPCHUMBA, Joyce KEN Men marathon
## Event_gender Medal
## 1: W Bronze
oly_dt[Gender == "Women" & !Event_gender %in% c("W", "X")]
## Empty data.table (0 rows and 10 cols): City,Edition,Sport,Discipline,Athlete,NOC...
3. Which country won most medals? Which country has the highest ratio of silver medals? Use the data in the country summary sheet starting at row 147 of the summer_olympic_medals.xlsx file
Which countries did participate, but without winning medals? Assume, that all countries listed in the IOC COUNTRY CODES sheet participated. Hint you can quick fix the column names with make.names and find set differences with setdiff.
# There is also a summary sheet for nations
nation_medal_df <- read_excel(oly_file, sheet='COUNTRY TOTALS', range="A147:F286")
nation_medal_dt <- as.data.table(nation_medal_df)
head(nation_medal_dt)
# Remove Grand.Total row
nation_medal_dt <- nation_medal_dt[!is.na(Country)]
# Get max
nation_medal_dt[`Grand Total` == max(`Grand Total`, na.rm = T)]
# Get highest ratio
nation_medal_dt[, silver.ratio := Silver/`Grand Total`, by = Country]
nation_medal_dt[silver.ratio == max(silver.ratio, na.rm = TRUE), Country]
participants <- read_excel(oly_file, sheet='IOC COUNTRY CODES', range="A1:C202")
participants.dt <- as.data.table(participants)
head(participants.dt)
## make sure to have proper variable names
colnames(participants.dt) <- make.names(colnames(participants.dt))
no_medals <- setdiff(participants.dt[, Int.Olympic.Committee.code], nation_medal_dt[, NOC])
length(no_medals)
## [1] 78
participants.dt[Int.Olympic.Committee.code %in% no_medals, Country]
Is the following dataset tidy?
3. What transformations are required to tidy the following data?
4. What transformations are required to tidy the following data?
#Yes, each row (observation) is a car model and all columns contain different #variables, therefore, it is tidy.
#No, the column names are values not variable names.
2. melt only. Tidy form:
melt, unite and cast.
5. How do you perform the data table merge pictured here?
4. A right merge: all.y = TRUE
1. Visually examine the dataset AirPassengers. Which of the following is true:
AirPassengers is not tidy: to be tidy we would have to wrangle it to have three columns (year, month
and value), then each passenger count would have a row.
Visually examine the dataset ChickWeight
ChickWeight is tidy: each observation (a weight) is represented by one row. The chick from which this measurement came is one of the variables.
Visually examine the dataset spanish_vowels. Is the data set tidy?
# No it is not tidy because the label contains multiple values (participant, sex, vowel).
Transform product_dt into a long format using data.table commands.
Transform the table from the long format back into a wide format. Check that it is equal to the original data.table.
dcast(long_dt, ... ~ product)
## name producta productb
## 1: John Doe NA 12
## 2: John Johnson. 5 1
## 3: Marry Doe 3 1
How many rows does the inner merge of dt1 and dt2 have?
How many rows does the left merge of dt1 and dt2 have?
How many rows does the outer merge of dt1 and dt2 have?
inner_dt <- merge(dt1, dt2, by='carname')
inner_dt
inner_dt[, .N]
## [1] 6
left_dt <- merge(dt1, dt2, by='carname', all.x = T)
left_dt
left_dt[, .N]
## [1] 21
outer_dt <- merge(dt1, dt2, by='carname', all = T)
outer_dt
outer_dt[, .N]
## [1] 25
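The inner/left/outer row counts can be reproduced with base R's merge on toy tables (the car names and columns below are made up for illustration):

```r
t1 <- data.frame(carname = c("a", "b", "c"), hp = c(100, 110, 120))
t2 <- data.frame(carname = c("b", "c", "d"), mpg = c(30, 25, 40))
nrow(merge(t1, t2, by = "carname"))                # inner: 2 (only b and c match)
nrow(merge(t1, t2, by = "carname", all.x = TRUE))  # left: 3 (all of t1 kept)
nrow(merge(t1, t2, by = "carname", all = TRUE))    # outer: 4 (union of keys)
```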
Why is the weather dataset messy?
What would a tidy version of it look like? Do not give the code; only describe how the tidy table would look.
Create a tidy version of the weather dataset.
## Why is it messy?
## 1. Variables are stored as columns (days)
## 2. A single entity is scattered across many cells (date)
## 3. Element column is not a variable.
## Tidy version: id, date, tmin, tmax
## wide -> long
dt < melt(messy_dt, id.vars = c("id", "year", "month", "element"), variable.name = "day",
value.name = 'temp')
# you can ignore the warning message
dt[, day := as.integer(gsub("d", "", day))]
dt[, date := paste(year, month, day, sep = "-")]
# remove redundant columns
dt[, c("year", "month", "day") := NULL]
# alternatively one can use unite from the tidyr package
# dt <- unite(dt, "date", c("year", "month", "day"), sep = "-", remove = TRUE)
dt[, element := tolower(element)] # TMAX -> tmax
dt <- dcast(dt, ... ~ element, value.var = "temp") # long -> wide
dt <- dt[!(is.na(tmax) & is.na(tmin))] # remove entries with both values NA; na.omit(dt) would also do the job
head(dt)
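The melt/dcast round trip above can be sketched on a tiny toy table (columns d1, d2 stand in for the day columns; assumes the data.table package is available):

```r
library(data.table)
wide <- data.table(id = c("a", "b"), d1 = c(10, 20), d2 = c(11, 21))
# wide -> long: one row per (id, day) pair
long <- melt(wide, id.vars = "id", variable.name = "day", value.name = "temp")
# long -> wide again
back <- dcast(long, id ~ day, value.var = "temp")
all.equal(wide, back, check.attributes = FALSE)
```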
Create a list containing all file paths in the folder
name the list elements by the filenames
Read in the data from all files into one table.
Is the data tidy? If not, tidy it up.
files < list.files("extdata/babynames", full.names = TRUE)
# See one file
head(fread(files[1]))
names(files) < basename(files)
# read all files at once into a list of data.tables
tables < lapply(files, fread)
# bind all tables into one using rbindlist,
# keeping the list names (the filenames) as an id column.
dt < rbindlist(tables, idcol = 'filename')
# The data is not tidy because one column contains both year and sex
dt <- separate(dt, col = "filename", into = c("year", "sex"), extra = "drop")
Come up with a strategy, how you can transform the two tables shown above into the single table shown below.
Write code that implements your strategy to transform the two tables into the one shown above.
#  melt each table.
#  merge them by the strain column
#  and convert the character columns into factors.
gt <- fread('extdata/eqtl/genotype.txt')
dim(gt)
## [1] 158 1001
head(gt[,1:5])
function qnorm()
# qnorm can be used to find different types of quantiles
qnorm(seq(0.25, 0.75, 0.25)) # quartiles of the standard normal
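A few qnorm calls to make the idea concrete: it inverts the normal CDF, returning the value below which the given probability mass lies.

```r
qnorm(0.5)                     # median of the standard normal: 0
qnorm(c(0.25, 0.5, 0.75))      # the three quartiles
qnorm(0.975)                   # ~1.96, the familiar 95% critical value
qnorm(0.5, mean = 10, sd = 2)  # quantiles of other normals via mean/sd: 10
```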