What is data wrangling?
Cleaning, structuring and enriching raw data into a desired format -> better decision making in less time
Combining and Merging Datasets
pandas.merge(): connects rows based on one or more keys -> defaults: how='inner', sort=False (in current pandas)
pandas.concat(): stacks objects together along an axis
Database Style DataFrame Join
combine df1 and df2 based on the key
-> default inner join, only keys present in both are used
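A minimal sketch of the default inner join. The contents of df1 and df2 are illustrative assumptions, since the frames from the original card are not shown in these notes.

```python
import pandas as pd

# Illustrative frames (assumed data, not taken from the notes)
df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": range(7)})
df2 = pd.DataFrame({"key": ["a", "b", "d"],
                    "data2": range(3)})

# how="inner" is the default, so only keys present in both frames survive
merged = pd.merge(df1, df2, on="key")

# "c" (only in df1) and "d" (only in df2) do not appear in the result
print(merged)
```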
How to join two datasets if the key column has different names?
left_on='lkey' and right_on='rkey'
What will the inner join using both key columns look like?
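A small sketch of such a merge, assuming two frames named df3 and df4 with made-up contents. With left_on and right_on, both key columns are kept in the result.

```python
import pandas as pd

# Assumed example frames whose key columns have different names
df3 = pd.DataFrame({"lkey": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": range(7)})
df4 = pd.DataFrame({"rkey": ["a", "b", "d"],
                    "data2": range(3)})

# Inner join matching df3.lkey against df4.rkey
merged = pd.merge(df3, df4, left_on="lkey", right_on="rkey")

# The result carries both lkey and rkey; they hold the same value in every row
print(merged)
```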
Different join arguments
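What does the left join look like here?
every 'b' row in df1 is paired with each 'b' row in df2; keys that exist only in df1 are kept, with NaN in the df2 columns
What does the right join look like here?
every 'b' row in df2 is paired with each 'b' row in df1; keys that exist only in df2 are kept, with NaN in the df1 columns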
Outer Join
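The join types can be compared side by side in a short sketch. The two frames are assumed example data with repeated keys (a many-to-many merge), not the frames from the original cards.

```python
import pandas as pd

# Assumed frames: "b" and "a" repeat on both sides, "c" and "d" are one-sided
df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                    "data1": range(6)})
df2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"],
                    "data2": range(5)})

# Left join: all keys from df1; "c" has no partner, so data2 is NaN
left = pd.merge(df1, df2, on="key", how="left")

# Right join: all keys from df2; "d" has no partner, so data1 is NaN
right = pd.merge(df1, df2, on="key", how="right")

# Outer join: union of the keys from both frames
outer = pd.merge(df1, df2, on="key", how="outer")

print(len(left), len(right), len(outer))
```

Each of the three 'b' rows in df1 matches both 'b' rows in df2, which is why the row counts grow beyond the size of either input.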
How to concat these frames?
How to concat these two DataFrames along the columns axis?
Giving names to the MultiIndex levels
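A sketch of the three concat variants asked about above. The frames and the labels "level1", "level2", "upper" and "lower" are illustrative assumptions.

```python
import pandas as pd

# Assumed example frames with partly overlapping row labels
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
df2 = pd.DataFrame({"c": [5, 6], "d": [7, 8]}, index=["x", "z"])

# Default axis=0: rows are stacked, missing columns are filled with NaN
stacked = pd.concat([df1, df2])

# axis=1: frames are placed side by side, aligned on the row index
side_by_side = pd.concat([df1, df2], axis=1)

# keys adds an outer column level; names labels the MultiIndex levels
labeled = pd.concat([df1, df2], axis=1,
                    keys=["level1", "level2"],
                    names=["upper", "lower"])

print(labeled)
```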
Pandas compared to SQL
Pandas can perform more flexible and powerful data aggregation and transformation through split-apply-combine operations, BUT it requires more programming effort
Split: the data is split into groups based on one or more keys
Apply: a function is applied to each group -> new value
Combine: the results are combined into a single object
Get the mean of all values grouped by ‘key1’
Get the mean of the values in data2, grouped by key1 and key2
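Both groupby questions can be sketched as follows. The frame is assumed example data; only the column names key1, key2 and data2 come from the cards, and data1 is added for illustration.

```python
import pandas as pd

# Assumed example frame
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Mean of the numeric columns per key1 group
# (selecting them explicitly keeps the string column key2 out of the mean)
mean_by_key1 = df.groupby("key1")[["data1", "data2"]].mean()

# Mean of data2 per (key1, key2) combination -> Series with a MultiIndex
data2_means = df.groupby(["key1", "key2"])["data2"].mean()

print(mean_by_key1)
print(data2_means)
```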
Create a group correspondence (mapping) and sum the columns by group
Map columns a, b, e to red and c, d to blue, then sum across the columns within each group
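One way to sketch the column mapping. The frame named people and its values are assumptions. Older material often writes groupby(mapping, axis=1); that form is deprecated in recent pandas, so this sketch transposes instead, which is a swapped-in but equivalent approach.

```python
import pandas as pd

# Assumed example frame with columns a..e
people = pd.DataFrame(
    [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Ann"],
)

# Group correspondence: column name -> group label
mapping = {"a": "red", "b": "red", "e": "red", "c": "blue", "d": "blue"}

# Transpose so the columns become rows, group by the mapping, sum, transpose back
by_color = people.T.groupby(mapping).sum().T

print(by_color)
```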
Methods to use with the GroupBy function
The Aggregate function, df.agg()
Calculate the mean tip percentage at the day and smoker level
Calculate the standard deviation, the mean and the difference between the highest and lowest tip_pct
Calculate the standard deviation and the mean for the grouped object, but rename mean to 'foo' and std to 'bar'
Calculate the count, mean and max for the tip_pct and total_bill columns
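The four aggregation tasks above can be sketched on a tiny stand-in for the tips data. The rows, the definition tip_pct = tip / total_bill, and the helper peak_to_peak are assumptions made for this example; the renaming step uses named aggregation, which is one of several ways to do it.

```python
import pandas as pd

# Small assumed stand-in for the tips dataset
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip":        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
    "smoker":     ["No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"],
    "day":        ["Sun", "Sun", "Sun", "Sun", "Sat", "Sat", "Sat", "Sat"],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]  # assumed definition

def peak_to_peak(arr):
    """Hypothetical helper: difference between highest and lowest value."""
    return arr.max() - arr.min()

grouped_pct = tips.groupby(["day", "smoker"])["tip_pct"]

mean_pct = grouped_pct.agg("mean")                       # one function
stats = grouped_pct.agg(["std", "mean", peak_to_peak])   # several functions
renamed = grouped_pct.agg(foo="mean", bar="std")         # custom column names

# Same list of functions applied to two columns -> hierarchical columns
multi = tips.groupby(["day", "smoker"])[["tip_pct", "total_bill"]].agg(
    ["count", "mean", "max"]
)
print(multi)
```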
General split-apply-combine:
Show the top 5 tip_pct values grouped by smoker
For each combination of smoker and day, show the row with the highest total bill
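A sketch of both tasks with GroupBy.apply. The data and the helper top are assumptions made for illustration; the columns are selected explicitly so that the grouping keys are not passed into the function.

```python
import pandas as pd

# Small assumed stand-in for the tips dataset
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29,
                   8.77, 26.88, 15.04, 14.78, 10.27, 35.26],
    "tip":        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71,
                   2.00, 3.12, 1.96, 3.23, 1.71, 5.00],
    "smoker":     ["No", "No", "Yes", "Yes", "No", "Yes",
                   "No", "Yes", "No", "Yes", "No", "Yes"],
    "day":        ["Sun", "Sun", "Sun", "Sat", "Sat", "Sat",
                   "Sun", "Sun", "Sat", "Sat", "Sat", "Sun"],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]  # assumed definition

def top(df, n=5, column="tip_pct"):
    """Hypothetical helper: the n rows with the largest values in column."""
    return df.sort_values(column, ascending=False).head(n)

# Top 5 tip_pct rows within each smoker group
top5 = tips.groupby("smoker")[["total_bill", "tip", "day", "tip_pct"]].apply(top)

# Row with the highest total_bill for each smoker/day combination;
# extra keyword arguments are forwarded to top()
top_bill = tips.groupby(["smoker", "day"])[["total_bill", "tip", "tip_pct"]].apply(
    top, n=1, column="total_bill"
)
print(top_bill)
```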
Apply vs. Applymap vs Map
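A brief sketch of the difference, on assumed example data. Series.map works element-wise on one Series, DataFrame.apply passes a whole column or row to the function, and DataFrame.applymap works element-wise on every cell. Since pandas 2.1 applymap is deprecated in favor of DataFrame.map, so the sketch picks whichever name is available.

```python
import pandas as pd

# Assumed example frame
frame = pd.DataFrame({"x": [1.0, 2.5], "y": [3.0, 5.0]})

# Series.map: the function sees one value of a single column at a time
as_text = frame["x"].map(lambda v: f"{v:.1f}")

# DataFrame.apply: the function sees a whole column (axis=0, the default)
col_range = frame.apply(lambda col: col.max() - col.min())

# Element-wise over the whole frame: DataFrame.map on newer pandas,
# DataFrame.applymap on older versions
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
doubled = elementwise(lambda v: v * 2)

print(as_text, col_range, doubled, sep="\n")
```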
Pivot table to summarize
Using a pivot table: group by day and smoker, then show the mean of the remaining columns at this aggregation level
Achieving the same without a pivot table
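Both routes can be sketched side by side. The data is an assumed stand-in for the tips dataset, and the numeric columns are listed explicitly (an assumption of this sketch) so that non-numeric columns never reach the mean.

```python
import pandas as pd

# Small assumed stand-in for the tips dataset
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip":        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
    "size":       [2, 3, 3, 2, 4, 4, 2, 4],
    "smoker":     ["No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"],
    "day":        ["Sun", "Sun", "Sun", "Sun", "Sat", "Sat", "Sat", "Sat"],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]  # assumed definition

numeric_cols = ["size", "tip", "tip_pct", "total_bill"]

# With a pivot table: the default aggregation function is the mean
pt = tips.pivot_table(values=numeric_cols, index=["day", "smoker"])

# Without a pivot table: group by the same keys and take the mean
gb = tips.groupby(["day", "smoker"])[numeric_cols].mean()

print(pt)
print(gb)
```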
Pivot Table
Add time and day as the index (row groups) and smoker as a column group, then show the mean of tip_pct and size in the table
Do the same, but in addition to the means for Yes and No, also show the mean across both categories
1 -> index=['time', 'day'] and columns='smoker'
2 -> margins=True
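A sketch of both pivot tables on an assumed stand-in for the tips data. With margins=True pandas adds an 'All' row and an 'All' column holding the means across the respective categories.

```python
import pandas as pd

# Small assumed stand-in for the tips dataset
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29,
                   8.77, 26.88, 15.04, 14.78, 10.27, 35.26],
    "tip":        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71,
                   2.00, 3.12, 1.96, 3.23, 1.71, 5.00],
    "size":       [2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 3, 4],
    "smoker":     ["No", "No", "Yes", "Yes", "No", "Yes",
                   "No", "Yes", "No", "Yes", "No", "Yes"],
    "day":        ["Sun", "Sun", "Sun", "Sat", "Sat", "Sat",
                   "Sun", "Sun", "Sat", "Sat", "Sat", "Sun"],
    "time":       ["Dinner", "Dinner", "Dinner", "Dinner", "Lunch", "Lunch",
                   "Dinner", "Dinner", "Lunch", "Lunch", "Dinner", "Dinner"],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]  # assumed definition

# 1: time and day as row groups, smoker as column group, mean of tip_pct and size
table = tips.pivot_table(values=["tip_pct", "size"],
                         index=["time", "day"], columns="smoker")

# 2: the same table plus 'All' margins (means across both smoker categories)
with_margins = tips.pivot_table(values=["tip_pct", "size"],
                                index=["time", "day"], columns="smoker",
                                margins=True)
print(with_margins)
```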
Options for pivot tables
values is the first positional argument of DataFrame.pivot_table, so when you pass ['tip_pct', 'size'] without a name, pandas treats it as values=...
Cross-Tabulations
Display the data so that nationality forms the index and handedness forms the columns
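A minimal sketch with pandas.crosstab, which counts how often each combination occurs. The sample rows and the column names Sample, Nationality and Handedness are assumptions for illustration.

```python
import pandas as pd

# Assumed example data
data = pd.DataFrame({
    "Sample": range(1, 11),
    "Nationality": ["USA", "Japan", "USA", "Japan", "Japan",
                    "Japan", "USA", "USA", "Japan", "USA"],
    "Handedness": ["Right-handed", "Left-handed", "Right-handed",
                   "Right-handed", "Left-handed", "Right-handed",
                   "Right-handed", "Left-handed", "Right-handed",
                   "Right-handed"],
})

# Nationality becomes the index, handedness the columns;
# margins=True adds 'All' totals for rows and columns
counts = pd.crosstab(data["Nationality"], data["Handedness"], margins=True)
print(counts)
```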