rep(6:9, 2)
6 7 8 9 6 7 8 9
c(3 != sqrt(9), TRUE == (3 > 8))
FALSE FALSE
Let x <- c(1, 6, 3, 2). What is the output of sort(x)?
x <- c(1, 6, 3, 2)
sort(x)
1 2 3 6
# REMEMBER: the function sort(x) sorts the elements of a vector and returns them;
# the function order(x), instead, returns the indexes that would sort the elements of the input
Let x <- c(1, 6, 3, 2). What is the output of sort(x, decreasing = TRUE)?
sort(x, decreasing = TRUE)
6 3 2 1
Let x <- c(1, 6, 3, 2). What is the output of order(x)?
order(x)
1 4 3 2
Let x <- c(1, 6, 3, 2). What is the output of order(x, decreasing = TRUE)?
order(x, decreasing = TRUE)
2 3 4 1
What is the sum of the first 100 positive integers? The formula for the sum of integers from 1 to n is n(n + 1)/2. Define n = 100 and then use R to compute the sum from 1 to 100 using the formula. What is the value of the sum?
n <- 100
sum_n <- n*(n+1)/2
sum_n
n <- 1000
x <- seq(1, n)
sum(x)
Based on the result, what do you think the functions seq and sum do? You can use help.
seq creates a vector of consecutive numbers and sum adds them up.
In math and programming, we say that we evaluate a function when we replace the argument with a given value. So if we type sqrt(4), we evaluate the sqrt function. In R, you can evaluate a function inside another function. The evaluations happen from the inside out. Use one line of code to compute the logarithm, in base 10, of the square root of 100.
log10(sqrt(100))
log(sqrt(100), base=10)
## [1] 1
Assuming x is numeric, which of the following will always return the value stored in x? You can try out examples and use the help system if you want.
log(exp(x))
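For a quick sanity check, try an arbitrary value (x = 2.5 here is just an example, not from the exercise):
x <- 2.5
log(exp(x)) # returns 2.5; log() undoes exp() for any numeric x
exp(log(x)) # also 2.5 here, but this only works when x is positive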
What is the outcome of the following sum: 1 + 1/2^2 + 1/3^2 + ... + 1/100^2? Thanks to Euler, we know it approximates π^2/6. Compute the sum and check that it is close to the approximation. Note that R has a variable for π stored as pi.
sum(1/seq(1:100)^2)
## [1] 1.634984
pi^2/6
## [1] 1.644934
data(murders) # the murders dataset comes from the dslabs package
str(murders)
Use str to examine the structure of the murders object. Which of the following best describes the variables represented in this data frame?
What are the column names used by the data frame?
# We can use the function names()
names(murders)
# alternatively we can also use the function colnames()
colnames(murders)
Class of an object (e.g. a column in a data frame)
a <- murders$abb # Access the column of a dataframe with $
class(a) # get the class, also possible to use str()
We saw that the region column stores a factor. You can corroborate this by typing:
class(murders$region)
With one line of code, use the function levels and length to determine the number of regions defined by this dataset.
class(murders$region)
length(levels(murders$region))
function table
5. The function table takes one or multiple vectors and returns the frequency of each element. For example:
x <- c('DC', 'Alabama', 'Florida', 'Florida', 'DC', 'DC')
table(x)
## x
## Alabama DC Florida
## 1 3 2
You can quickly see how many states are in each region by applying this function. Use this function in one line of code to create a table of states per region.
table(murders$region)
## Northeast South North Central West
## 9 17 12 13
function c
Use the function c to create a vector with the average high temperatures in January for Beijing, Lagos, Paris, Rio de Janeiro, San Juan, and Toronto, which are 35, 88, 42, 84, 81, and 30 degrees Fahrenheit. Call the object temp.
temp <- c(35, 88, 42, 84, 81, 30)
temp
## [1] 35 88 42 84 81 30
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
city
function names()
names(temp) <- city
## Beijing Lagos Paris Rio de Janeiro San Juan
## 35 88 42 84 81
## Toronto
## 30
access the temperature of the first three cities on the list.
access the temperature of Paris and San Juan.
temp[1:3]
## Beijing Lagos Paris
## 35 88 42
temp[c("Paris", "San Juan")]
## Paris San Juan
## 42 81
: operator
37:73
37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
62 63 64 65 66 67 68 69 70 71 72 73
Create a vector containing all the positive odd numbers smaller than 100.
Create a vector of numbers that starts at 6, does not pass 55, and adds numbers in increments of 4/7: 6, 6 + 4/7, 6 + 8/7, and so on. How many numbers does the vector have? Hint: use seq and length
seq(1, 99, 2)
## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
## [26] 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99
v <- seq(6, 55, 4/7)
length(v)
## [1] 86
What is the class of the following object a <- seq(1, 10, 0.5)?
What is the class of the following object a <- seq(1, 10)?
The class of a <- 1 is numeric, not integer. R defaults to numeric; to force an integer, you need to add the letter L. Confirm that the class of 1L is integer.
Define the following character vector:
x <- c("1", "3", "5")
and coerce it to get integers.
a <- seq(1, 10, 0.5)
class(a)
## [1] "numeric"
a <- seq(1, 10)
class(a)
## [1] "integer"
class(1L)
x_int <- as.integer(x)
x_int
minimum value
pop <- murders$population
pop_sorted <- sort(pop) # sorts ascending by default
pop_sorted[1] # smallest value at the first position
OR
min(pop) # same value as above
find the index of the entry with the smallest population size.
Now we know what the smallest population is and the row which contains it. What is the name of this state?
smallest_idx <- order(pop) # order() outputs the indexes of the ordered vector, not the values
smallest_idx[1] # index with the smallest population
## [1] 51
pop[smallest_idx[1]] # smallest population value
## [1] 563626
which.min(murders$population) # which.min() outputs the index containing the minimal value
murders$state[which.min(murders$population)]
## [1] "Wyoming"
create a data frame
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro","San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)
rank function
Use the rank function to determine the population rank of each state from smallest population size to biggest. Save these ranks in an object called ranks, then create a data frame with the state name and its rank. Call the data frame my_df.
ranks <- rank(murders$population)
my_df <- data.frame(state=murders$state, rank=ranks)
head(my_df)
find NAs
ind <- is.na(na_example) # is.na() outputs a Boolean
table(ind)
## FALSE TRUE
## 855 145
# alternatively we can just sum the Boolean
# sum(ind)
# or subset the vector to contain only the NAs and obtain its length
# length(ind[ind]) # to obtain the number of NAs
compute the average again, but only for the entries that are not NA.
mean(na_example, na.rm=TRUE)
## [1] 2.301754
# Can also be achieved using our ind vector:
ind <- is.na(na_example)
mean(na_example[!ind])
How do you create this?
awesome.dt <- data.table(x = order(1:6, decreasing = T), y = rep(c(TRUE, FALSE), each = 3))
Find the last 10 flights arriving in LAX on Christmas Eve (24. December). Don’t worry about sorting for now, just find the last 10 entries in the table.
flights[MONTH == 12 & DAY == 24 & DESTINATION_AIRPORT == "LAX"] %>% tail(n=10)
Let iris.dt <- data.table(iris). What happens if we run iris.dt[Species != "setosa" | Sepal.Length <= 5, .N, by = Species]?
Get the number of rows for each unique value of Species, for all the rows where Species is not setosa or with Sepal.Length less or equal to 5.
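Running the snippet verbatim makes the grouping easier to see; iris ships with base R, so this sketch should work as-is:
library(data.table)
iris.dt <- data.table(iris)
iris.dt[Species != "setosa" | Sepal.Length <= 5, .N, by = Species]
# one row per Species value that passes the filter, with .N = number of matching rows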
Calculate the number of outbound flights in the summer months (June to August) for each origin airport.
flights[MONTH %in% 6:8, .N, by = ORIGIN_AIRPORT]
Download the datasets needed for the exercise from Moodle, extract them, and put them in a folder called extdata.
check class of all columns
change class
data_folder_name <- "extdata"
users_dt <- fread(file.path("extdata","BX-Users.csv"))
books_dt <- fread(file.path("extdata","BX-Books.csv"))
ratings_dt <- fread(file.path("extdata", "BX-Book-Ratings.csv"))
sapply(users_dt, class)
## Change the type of Age to be numeric
users_dt[, Age := as.numeric(Age)]
Return the first 5 and last 5 observations of the table ratings_dt.
ratings_dt
## running `ratings_dt` for a data table automatically displays the following
# head(ratings_dt, n=5)
# tail(ratings_dt, n=5)
gsub() function
s <- 'R is fun'
gsub('R', 'DataViz', s)
## [1] "DataViz is fun"
colnames(users_dt) <- gsub("-", "_", colnames(users_dt))
colnames(books_dt) <- gsub("-", "_", colnames(books_dt))
colnames(ratings_dt) <- gsub("-", "_", colnames(ratings_dt))
Delete the image URL columns (Image_URL_S, Image_URL_M, Image_URL_L) from books_dt.
books_dt[, c("Image_URL_S", "Image_URL_M", "Image_URL_L"):=NULL]
Create a table book_dt_2 that contains all the books published between 1900 and 2019 (inclusive) from the table books_dt.
book_dt_2 <- books_dt[Year_Of_Publication >= 1900 & Year_Of_Publication <= 2019]
How many different authors are included in the table books_dt?
How many different authors are included for each year of publication between 2000 and 2010 (inclusive) in books_dt?
books_dt[, uniqueN(Book_Author)]
## [1] 98540
# uniqueN() is the same as length(unique())
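A tiny check of that equivalence on a made-up vector (not part of the exercise data; uniqueN() comes from data.table):
v <- c("a", "b", "a", "c")
uniqueN(v) # 3
length(unique(v)) # 3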
books_dt[
  Year_Of_Publication >= 2000 & Year_Of_Publication <= 2010,
  uniqueN(Book_Author),
  by = Year_Of_Publication
]
In how many observations is the age information missing in the ratings table users_dt?
What is the maximum rating value in the ratings table?
What is the most common rating value larger than 0?
Which are the book identifiers (ISBN) with the highest ratings?
Reorder the ratings table according to the rating value of each book in descending order
users_dt[is.na(Age), .N] # or users_dt[, sum(is.na(Age))]
ratings_dt[, max(Book_Rating, na.rm=TRUE)]
ratings_dt[Book_Rating>0, .N, by=Book_Rating][N==max(N)]
ratings_dt[Book_Rating == max(Book_Rating, na.rm=TRUE), "ISBN"] %>% head
# ratings_dt <- ratings_dt[order(-Book_Rating)]
# or
setorder(ratings_dt, -Book_Rating)
ifelse() function
Add a column High_Rating that is 1 if Book_Rating is larger than 7 and 0 otherwise. How many observations are considered to be a high rating, in absolute and in relative terms?
ratings_dt[, High_Rating := ifelse(Book_Rating > 7, 1, 0)]
ratings_dt[, sum(High_Rating)] # absolute
## [1] 219361
ratings_dt[, sum(High_Rating)/.N] # relative
## [1] 0.2166789
Which users did not give any rating to any book? Filter these users out from users_dt. Hint: there is no need to merge users_dt with ratings_dt; we are simply interested in the users that are not in ratings_dt.
What is the most common age of users who rated at least one book?
On average, how many books did a user rate?
What is the title of the first published book with the highest ranking?
In which year was a book with the largest number of ratings last published?
Add to the table ratings_dt the highest ranking that each book received as a new column called Max_Book_Ranking.
users_who_rated <- ratings_dt[, User_ID]
users_dt[!(User_ID %in% users_who_rated)]
users_dt[User_ID %in% users_who_rated & !is.na(Age), .N, by=Age][N == max(N)]
## Age N
## 1: 26 1558
ratings_dt[, .N, by=User_ID][, mean(N, na.rm=TRUE)]
## [1] 11.2414
ratings_dt[order(Year_Of_Publication, -Book_Rating), .(Book_Title, Year_Of_Publication, Book_Rating)] %>% head(1)
## Book_Title Year_Of_Publication Book_Rating
## 1: Darcys Utopia 0 10
ratings_dt[, Rating_Count := .N, by=ISBN]
ratings_dt[Rating_Count == max(Rating_Count), max(Year_Of_Publication)]
ratings_dt[, Max_Book_Rating := max(Book_Rating), by=ISBN]
ratings_dt
Subset the ratings table ratings_dt to contain only books written by the following authors:
How many ratings does each author from the previous exercise have? What are their max and average ratings?
authors <- c("Agatha Christie", "William Shakespeare", "Stephen King", "Ann M. Martin",
             "Carolyn Keene", "Francine Pascal", …)
ratings_dt_sub <- ratings_dt[Book_Author %in% authors]
ratings_dt_sub[, .(mean(Book_Rating), max(Book_Rating), .N), by=Book_Author]
Using the summer_olympic_medals.xlsx file, which athlete won the most bronze medals?
library(readxl)
oly_file <- file.path("extdata","summer_olympic_medals.xlsx")
oly_df <- read_xlsx(oly_file, sheet='ALL MEDALISTS')
head(oly_df)
# There are different solutions for this
oly_dt <- as.data.table(oly_df)
bronze <- oly_dt[Medal == "Bronze",]
# 4. Using the .N command from data.table. More on this in the data.table lecture
bronze[, N := .N, by = Athlete]
bronze[N == max(N), unique(Athlete)]
Are the columns Gender and Event_gender consistent? Find inconsistent gender entries.
# There was a male Bronze-medal winner in ladies marathon in 2000.
oly_dt[, unique(Gender)]
## [1] "Men" "Women"
oly_dt[, unique(Event_gender)]
## [1] "M" "X" "W"
oly_dt[Gender == "Men" & !Event_gender %in% c("M", "X")]
## City Edition Sport Discipline Athlete NOC Gender Event
## 1: Sydney 2000 Athletics Athletics CHEPCHUMBA, Joyce KEN Men marathon
## Event_gender Medal
## 1: W Bronze
oly_dt[Gender == "Women" & !Event_gender %in% c("W", "X")]
## Empty data.table (0 rows and 10 cols): City,Edition,Sport,Discipline,Athlete,NOC...
3. Which country won the most medals? Which country has the highest ratio of silver medals? Use the data in the country summary sheet starting at row 147 of the summer_olympic_medals.xlsx file.
Which countries participated but did not win any medals? Assume that all countries listed in the IOC COUNTRY CODES sheet participated. Hint: you can quickly fix the column names with make.names and find set differences with setdiff.
# There is also a summary sheet for nations
nation_medal_df <- read_excel(oly_file, sheet='COUNTRY TOTALS', range="A147:F286")
nation_medal_dt <- as.data.table(nation_medal_df)
head(nation_medal_dt)
# Remove Grand.Total row
nation_medal_dt <- nation_medal_dt[!is.na(Country)]
# Get max
nation_medal_dt[`Grand Total` == max(`Grand Total`, na.rm = T)]
# Get highest ratio
nation_medal_dt[, silver.ratio := Silver/`Grand Total`, by = Country]
nation_medal_dt[silver.ratio == max(silver.ratio, na.rm = TRUE), Country]
participants <- read_excel(oly_file, sheet='IOC COUNTRY CODES', range="A1:C202")
participants.dt <- as.data.table(participants)
head(participants.dt)
## make sure to have proper variable names
colnames(participants.dt) <- make.names(colnames(participants.dt))
no_medals <- setdiff(participants.dt[, Int.Olympic.Committee.code], nation_medal_dt[, NOC])
length(no_medals)
## [1] 78
participants.dt[Int.Olympic.Committee.code %in% no_medals, Country]
Is the following dataset tidy?
3. What transformations are required to tidy the following data?
4. What transformations are required to tidy the following data?
# Yes, each row (observation) is a car model and all columns contain different variables; therefore, it is tidy.
# No, the column names are values, not variable names.
2. melt only. Tidy form:
melt, unite and cast.
5. How do you perform the data table merge pictured here?
4. Right merge, with all.y = TRUE
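A minimal, made-up illustration of that answer; the table and column names below are invented, not the ones from the picture:
library(data.table)
toy_x <- data.table(id = c(1, 2, 3), a = c("x1", "x2", "x3"))
toy_y <- data.table(id = c(2, 3, 4), b = c("y2", "y3", "y4"))
merge(toy_x, toy_y, by = "id", all.y = TRUE) # keeps every row of toy_y; id 4 gets NA for a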
1. Visually examine the dataset AirPassengers. Which of the following is true:
AirPassengers is not tidy: to be tidy we would have to wrangle it into three columns (year, month, and value); then each passenger count would have its own row.
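A minimal sketch of that wrangling, assuming the built-in AirPassengers time series (monthly values from 1949 to 1960); the column names simply follow the description above:
library(data.table)
ap_tidy <- data.table(year  = rep(1949:1960, each = 12),
                      month = rep(month.abb, times = 12),
                      value = as.numeric(AirPassengers))
head(ap_tidy)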
Visually examine the dataset ChickWeight. Is the data set tidy?
ChickWeight is tidy: each observation (a weight) is represented by one row. The chick from which this measurement came is one of the variables.
Visually examine the dataset spanish_vowels. Is the data set tidy?
# No it is not tidy because the label contains multiple values (participant, sex, vowel).
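A hypothetical sketch of how such a label column could be split; the toy data below is invented, since the real spanish_vowels columns are not reproduced in these notes:
library(data.table)
library(tidyr)
toy_vowels <- data.table(label = c("p101-male-a", "p102-female-o"), f1 = c(650, 480))
separate(toy_vowels, col = "label", into = c("participant", "sex", "vowel"), sep = "-")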
Transform product_dt into a long format using data.table commands.
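The melt step itself is not written out in these notes; a plausible version, assuming product_dt has a name column plus one column per product (which is what the dcast call and output below suggest), would be:
long_dt <- melt(product_dt, id.vars = "name",
                variable.name = "product", value.name = "value")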
Transform the table from the long format back into a wide format. Check that it is equal to the original data.table.
dcast(long_dt, ... ~ product)
## name producta productb
## 1: John Doe NA 12
## 2: John Johnson. 5 1
## 3: Marry Doe 3 1
How many rows does the inner merge of dt1 and dt2 have?
How many rows does the left merge of dt1 and dt2 have?
How many rows does the outer merge of dt1 and dt2 have?
inner_dt <- merge(dt1, dt2, by='carname')
inner_dt
inner_dt[, .N]
## [1] 6
left_dt <- merge(dt1, dt2, by='carname', all.x = T)
left_dt
left_dt[, .N]
## [1] 21
outer_dt <- merge(dt1, dt2, by='carname', all = T)
outer_dt
outer_dt[, .N]
## [1] 25
Why is the following weather dataset messy?
What would a tidy version of it look like? Do not give the code, only describe how the tidy table would look.
Create a tidy version of the weather dataset.
## Why is it messy?
## 1. Variables are stored as columns (days)
## 2. A single entity is scattered across many cells (date)
## 3. Element column is not a variable.
## Tidy version: id, date, tmin, tmax
## wide -> long
dt <- melt(messy_dt, id.vars = c("id", "year", "month", "element"), variable.name = "day",
value.name = 'temp')
# you can ignore the warning message
dt[, day := as.integer(gsub("d", "", day))]
dt[, date := paste(year, month, day, sep = "-")] # option using paste
dt[, c("year", "month", "day") := NULL] # remove redundant columns
# alternatively one can use unite from the tidyr package
# dt <- unite(dt, "date", c("year", "month", "day"), sep = "-", remove = TRUE)
dt[, element := tolower(element)] # TMAX -> tmax
dt <- dcast(dt, ... ~ element, value.var = "temp") # long -> wide
dt <- dt[!(is.na(tmax) & is.na(tmin))] # remove entries with both NA values
# na.omit(dt) would also do the job
head(dt)
Create a list containing all file paths in the folder
name the list elements by the filenames
Read in the data from all files into one table.
Is the data tidy? If not, tidy it up.
files <- list.files("extdata/baby-names", full.names = TRUE)
# See one file
head(fread(files[1]))
names(files) <- basename(files)
# read all files at once into a list of data.tables
tables <- lapply(files, fread)
# bind all tables into one using rbindlist,
# keeping the list names (the filenames) as an id column.
dt <- rbindlist(tables, idcol = 'filename')
# The data is not tidy because one column contains both year and sex
dt <- separate(dt, col = "filename", into = c("year", "sex"), extra = "drop") # separate() comes from the tidyr package
Come up with a strategy for how you can transform the two tables shown above into the single table shown below.
Write code that implements your strategy to transform the two tables into the one shown above.
# - melt each table.
# - merge them by the strain column
# - and convert the character columns into factors.
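A hypothetical sketch of that strategy; the table and column names below are invented, since the original tables are only pictured in the exercise sheet:
library(data.table)
growth_wide <- data.table(strain = c("s1", "s2"), medium_A = c(1.1, 2.2), medium_B = c(3.3, 4.4))
marker_wide <- data.table(strain = c("s1", "s2"), mrk_1 = c("Lab", "Wild"), mrk_2 = c("Wild", "Lab"))
growth_long <- melt(growth_wide, id.vars = "strain", variable.name = "medium", value.name = "growth_rate")
marker_long <- melt(marker_wide, id.vars = "strain", variable.name = "marker", value.name = "genotype")
merged <- merge(growth_long, marker_long, by = "strain", allow.cartesian = TRUE)
# convert the character columns into factors
char_cols <- names(merged)[sapply(merged, is.character)]
merged[, (char_cols) := lapply(.SD, as.factor), .SDcols = char_cols]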
gt <- fread('extdata/eqtl/genotype.txt')
dim(gt)
## [1] 158 1001
head(gt[,1:5])
function qnorm()
# qnorm can be used to find different types of quantiles
qnorm(seq(0.25, 0.75, 0.25)) # quartiles of the standard normal
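A couple of quick extra checks (standard normal by default, i.e. mean = 0 and sd = 1):
qnorm(0.975)        # ~1.96, the upper 2.5% cutoff of the standard normal
pnorm(qnorm(0.975)) # pnorm() is the inverse of qnorm(), so this returns 0.975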