2nd Week, 4 Data Aggregation and Group Operations

by Luca I.

What is data wrangling?

Cleaning, structuring and enriching raw data into a desired format -> better decision making in less time

Combining and Merging Datasets

Database Style DataFrame Join

-> default inner join, only keys present in both are used
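A minimal sketch of the default inner join, assuming two small made-up frames (df1, df2 and their values are hypothetical, the original slide data is not shown):

```python
import pandas as pd

# Hypothetical example frames sharing a column named "key".
df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": range(7)})
df2 = pd.DataFrame({"key": ["a", "b", "d"],
                    "data2": range(3)})

# how="inner" is the default: "c" (only in df1) and "d" (only in df2) are dropped.
inner = pd.merge(df1, df2, on="key")
print(inner)
```

Passing on="key" explicitly is optional here, but it avoids surprises when the frames share more than one column name.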

How to join two datasets if the key column has different names?

-> pass left_on and right_on; still an inner join by default, only keys present in both are used
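A sketch of merging on differently named key columns, assuming hypothetical frames df3 and df4 with key columns called lkey and rkey:

```python
import pandas as pd

# Hypothetical frames: the key is called "lkey" on the left and "rkey" on the right.
df3 = pd.DataFrame({"lkey": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": range(7)})
df4 = pd.DataFrame({"rkey": ["a", "b", "d"],
                    "data2": range(3)})

# left_on / right_on name the key column on each side; the join is still inner.
merged = pd.merge(df3, df4, left_on="lkey", right_on="rkey")
print(merged)
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is redundant.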

What will the inner join using both key columns look like?
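One possible reading is a merge on two key columns at once. A sketch under that assumption, with made-up frames (left, right, key1, key2 are hypothetical names):

```python
import pandas as pd

# Hypothetical frames with a two-part key.
left = pd.DataFrame({"key1": ["foo", "foo", "bar"],
                     "key2": ["one", "two", "one"],
                     "lval": [1, 2, 3]})
right = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
                      "key2": ["one", "one", "one", "two"],
                      "rval": [4, 5, 6, 7]})

# A row survives only if the (key1, key2) combination exists on both sides.
both = pd.merge(left, right, on=["key1", "key2"], how="inner")
print(both)
```

Here (foo, two) and (bar, two) each exist on one side only, so they do not appear in the result.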

Different join arguments

What does the left join look like here?

Each b row in df1 is paired with every b row from df2
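A sketch of a many-to-many left join, assuming hypothetical frames in which the key b occurs several times on both sides:

```python
import pandas as pd

# Hypothetical frames: "b" appears 3 times in df1 and 2 times in df2.
df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                    "data1": range(6)})
df2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"],
                    "data2": range(5)})

# Every df1 row is kept; each is repeated once per matching df2 row.
# "c" has no match in df2, so its data2 value is NaN.
left_join = pd.merge(df1, df2, on="key", how="left")
print(left_join)
```

The 3 b rows in df1 times the 2 b rows in df2 give 6 b rows in the result.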

What does the right join look like here?

Each b row in df2 is paired with every b row from df1
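The mirror case as a sketch, again with hypothetical frames:

```python
import pandas as pd

# Hypothetical frames: "b" appears 3 times in df1 and 2 times in df2.
df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                    "data1": range(6)})
df2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"],
                    "data2": range(5)})

# Every df2 row is kept; "d" has no match in df1, so its data1 value is NaN.
# "c" exists only in df1 and is dropped.
right_join = pd.merge(df1, df2, on="key", how="right")
print(right_join)
```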

Outer Join
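An outer join keeps the keys from both sides. A sketch with hypothetical frames; the indicator=True argument is an optional extra that marks where each row came from:

```python
import pandas as pd

# Hypothetical frames: "c" exists only in df1, "d" only in df2.
df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                    "data1": range(6)})
df2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"],
                    "data2": range(5)})

# Union of keys; missing values on either side are filled with NaN.
# indicator=True adds a "_merge" column: left_only, right_only or both.
outer = pd.merge(df1, df2, on="key", how="outer", indicator=True)
print(outer)
```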

How to concat these frames?
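A sketch of stacking frames row-wise with pd.concat, assuming two made-up frames with identical columns (df_a and df_b are hypothetical):

```python
import pandas as pd

# Hypothetical frames with the same columns.
df_a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
df_b = pd.DataFrame({"x": [5, 6], "y": [7, 8]})

# Default axis=0 stacks rows; ignore_index=True builds a fresh 0..n-1 index
# instead of repeating the original labels 0, 1, 0, 1.
stacked = pd.concat([df_a, df_b], ignore_index=True)
print(stacked)
```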

How to concat these two DataFrames along the columns axis?
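A sketch of concatenating along the columns axis, assuming two hypothetical frames whose row labels only partly overlap:

```python
import numpy as np
import pandas as pd

# Hypothetical frames: df2 has no row "b".
df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=["a", "b", "c"], columns=["one", "two"])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=["a", "c"], columns=["three", "four"])

# axis=1 places the frames side by side, aligned on the row index.
# Row "b" is missing in df2, so its new columns are NaN.
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)
```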

Giving names to the MultiIndex levels
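A sketch of labelling the hierarchical column index created by concat, assuming hypothetical frames and made-up level names (upper, lower):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=["a", "b", "c"], columns=["one", "two"])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=["a", "c"], columns=["three", "four"])

# keys adds an outer column level identifying the source frame;
# names labels the two levels of the resulting column MultiIndex.
named = pd.concat([df1, df2], axis=1,
                  keys=["level1", "level2"],
                  names=["upper", "lower"])
print(named)
```

Selecting named["level1"] then returns only the columns that came from df1.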

Pandas compared to SQL

Pandas can perform more flexible and powerful data aggregation and transformation through split-apply-combine operations, BUT it requires more programming effort

Create a group correspondence and sum the columns by group

Map columns a, b, e to red and c, d to blue. Then sum across the columns within each group
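A sketch of grouping columns through a mapping, with a made-up frame (people and its values are hypothetical). Older material often writes groupby(mapping, axis=1); that form is deprecated since pandas 2.1, so this sketch transposes instead:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 3 rows, columns a..e.
people = pd.DataFrame(np.arange(15).reshape(3, 5),
                      columns=list("abcde"),
                      index=["Joe", "Steve", "Wanda"])

# Group correspondence: column label -> group name.
mapping = {"a": "red", "b": "red", "e": "red",
           "c": "blue", "d": "blue"}

# Transpose so column labels become the index, group by the mapping,
# sum within each group, then transpose back.
by_color = people.T.groupby(mapping).sum().T
print(by_color)
```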

Methods to use with the GroupBy function

The Aggregate function, df.agg()

Calculate the mean tip percentage at the day and smoker level
Calculate the standard deviation, the mean and the difference between the highest and lowest tip_pct
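A sketch of both tasks on a tiny made-up stand-in for the tips data (the rows are invented; peak_to_peak is a hypothetical helper name):

```python
import pandas as pd

# Invented stand-in for the tips dataset.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 20.0, 50.0],
    "tip": [1.0, 4.0, 3.0, 8.0, 2.0, 5.0],
    "smoker": ["No", "No", "Yes", "Yes", "No", "Yes"],
    "day": ["Sat", "Sat", "Sat", "Sat", "Sun", "Sun"],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

grouped = tips.groupby(["day", "smoker"])["tip_pct"]

# Task 1: mean tip percentage per (day, smoker) group.
mean_pct = grouped.mean()

# Task 2: several aggregations at once, including a custom function.
def peak_to_peak(values):
    """Difference between the highest and lowest value in a group."""
    return values.max() - values.min()

stats = grouped.agg(["std", "mean", peak_to_peak])
print(mean_pct)
print(stats)
```

Groups with a single row get NaN for std, since the sample standard deviation needs at least two values.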

The Aggregate function, df.agg()

Calculate the standard deviation and the mean for the grouped object, but rename mean to ‘foo’ and std to ‘bar’
Calculate the count, mean and max for the tip_pct and total_bill columns
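A sketch of both tasks on invented tips-like data. The renaming uses named aggregation, which is one of several ways to do it and may differ from the original slide:

```python
import pandas as pd

# Invented stand-in for the tips dataset.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 20.0, 50.0],
    "tip": [1.0, 4.0, 3.0, 8.0, 2.0, 5.0],
    "smoker": ["No", "No", "Yes", "Yes", "No", "Yes"],
    "day": ["Sat", "Sat", "Sat", "Sat", "Sun", "Sun"],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]
grouped = tips.groupby(["day", "smoker"])

# Named aggregation: output column name = aggregation to apply.
renamed = grouped["tip_pct"].agg(foo="mean", bar="std")

# The same list of functions applied to two columns gives
# hierarchical columns: (column, function).
multi = grouped[["tip_pct", "total_bill"]].agg(["count", "mean", "max"])
print(renamed)
print(multi)
```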

General split-apply-combine:
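A sketch of the general pattern with groupby(...).apply, on invented tips-like data (top is a hypothetical helper): split by smoker, apply a function to each piece, and let pandas combine the results.

```python
import pandas as pd

# Invented stand-in for the tips dataset.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 20.0, 50.0],
    "tip": [1.0, 4.0, 3.0, 8.0, 2.0, 5.0],
    "smoker": ["No", "No", "Yes", "Yes", "No", "Yes"],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

def top(df, n=1, column="tip_pct"):
    """Return the n rows with the largest values in `column`."""
    return df.sort_values(column, ascending=False)[:n]

# Split: one sub-frame per smoker value.
# Apply: top() runs on each sub-frame.
# Combine: the pieces are concatenated into one result.
top_rows = tips.groupby("smoker")[["total_bill", "tip_pct"]].apply(top, n=1)
print(top_rows)
```

Unlike agg, the applied function may return a frame of any shape, which is what makes this the most general groupby tool.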

apply vs. applymap vs. map
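A sketch contrasting the three on a small made-up frame. DataFrame.applymap was renamed to DataFrame.map in pandas 2.1, so the sketch picks whichever one the installed version offers:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 5.0]})

# DataFrame.apply: the function receives a whole column
# (or a whole row with axis=1).
col_range = frame.apply(lambda col: col.max() - col.min())

# Element-wise on a DataFrame: applymap (named DataFrame.map from pandas 2.1).
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
formatted = elementwise(lambda x: f"{x:.2f}")

# Series.map: element-wise on a single column.
a_doubled = frame["a"].map(lambda x: x * 2)

print(col_range)
print(formatted)
print(a_doubled)
```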

Pivot table to summarize

Using a pivot table (PT): group by day and smoker, then show the mean of the remaining columns at this aggregation level
Achieving the same without a PT
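A sketch showing both routes on invented tips-like data. The numeric columns are listed explicitly, which is an assumption made here so that text columns never reach the mean:

```python
import pandas as pd

# Invented stand-in for the tips dataset.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 20.0, 50.0],
    "tip": [1.0, 4.0, 3.0, 8.0, 2.0, 5.0],
    "smoker": ["No", "No", "Yes", "Yes", "No", "Yes"],
    "day": ["Sat", "Sat", "Sat", "Sat", "Sun", "Sun"],
    "size": [2, 3, 2, 4, 2, 3],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]
numeric_cols = ["size", "tip", "tip_pct", "total_bill"]

# With a pivot table: mean is the default aggregation.
with_pt = tips.pivot_table(index=["day", "smoker"], values=numeric_cols)

# Without a pivot table: plain groupby + mean.
without_pt = tips.groupby(["day", "smoker"])[numeric_cols].mean()

print(with_pt)
print(without_pt)
```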

Pivot Table

Add time and day as the index (row groups) and smoker as a column group. Then show the mean of tip_pct and size in the table
Do the same, but in addition to the means for Yes and No, also show the mean across both categories
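A sketch of both pivot tables on invented tips-like data. The margins=True argument is one way to add the combined mean; it creates an extra All column and an All row:

```python
import pandas as pd

# Invented stand-in for the tips dataset.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 20.0, 50.0],
    "tip": [1.0, 4.0, 3.0, 8.0, 2.0, 5.0],
    "smoker": ["No", "No", "Yes", "Yes", "No", "Yes"],
    "day": ["Sat", "Sat", "Sat", "Sat", "Sun", "Sun"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Lunch", "Lunch"],
    "size": [2, 3, 2, 4, 2, 3],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

# Rows grouped by (time, day), columns split by smoker, cell value = mean.
table = tips.pivot_table(index=["time", "day"], columns="smoker",
                         values=["tip_pct", "size"], aggfunc="mean")

# margins=True adds "All": the mean over both smoker categories per row,
# plus a final row with the mean over all rows.
with_all = tips.pivot_table(index=["time", "day"], columns="smoker",
                            values=["tip_pct", "size"], aggfunc="mean",
                            margins=True)
print(table)
print(with_all)
```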