Pandas Data Structures
Series:
one dimensional array-like object
contains a sequence of values and an associated array of data labels -> the index
DataFrame:
rectangular table of data
contains ordered collection of columns
each column can have a different value type
rows and columns both have an index
Operations on Series:
change/ assign index
get specific value/ assign value per index
filtering
scalar multiplication
math functions
create Series from dict
create new index order for sdata
-> as the Aveiro index is not present in sdata, it is assigned the NaN value
-> as Alentejo is not present in the new index order, it is not in the new Series
add two Series (+): values are aligned by index label
check for not null/ null values
rename entire series and index
get the whole series
get index
get the values
Whole Series:
Get Index:
Get Values:
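Example for the Series operations listed above. A minimal sketch: sdata and the district names come from the notes, but the numbers and the variable names (s, s2, districts) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values per district (numbers are invented).
sdata = {"Lisboa": 35000, "Porto": 71000, "Alentejo": 16000, "Coimbra": 5000}
s = pd.Series(sdata)                      # create Series from dict

# New index order: Aveiro is not in sdata -> NaN; Alentejo is left out.
districts = ["Aveiro", "Lisboa", "Porto", "Coimbra"]
s2 = pd.Series(sdata, index=districts)

porto = s["Porto"]                        # get a specific value by label
big = s[s > 10000]                        # filtering with a boolean condition
doubled = s * 2                           # scalar multiplication
roots = np.sqrt(s)                        # NumPy math functions keep the index

total = s + s2                            # "+" aligns on index labels
missing = s2.isnull()                     # True where the value is NaN

s2.name = "population"                    # name of the Series
s2.index.name = "district"                # name of the index

print(s2)          # the whole Series
print(s2.index)    # the index
print(s2.values)   # the values as a NumPy array
```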
DataFrame
dict -> dataframe
what happens if you pass a column that is not contained in the dict?
-> missing columns have NaN values
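Example for dict -> DataFrame. A small sketch with invented data; the column names (district, year, pop, debt) are assumptions chosen to match the rest of these notes.

```python
import pandas as pd

# Hypothetical data: dict of equal-length lists, one key per column.
data = {
    "district": ["Aveiro", "Aveiro", "Lisboa", "Lisboa", "Porto", "Porto"],
    "year": [2001, 2002, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data)

# columns= fixes the column order; "debt" is not in the dict,
# so that column is created and filled with NaN.
frame2 = pd.DataFrame(data, columns=["year", "district", "pop", "debt"])
print(frame2)
```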
Selecting rows and columns in a DataFrame
either by dict-like notation or by attribute
DataFrame
retrieve a row by position or by label (e.g. 2002, Aveiro, NaN, NaN)
retrieve specific value (Aveiro)
Rows can also be retrieved by name with loc or by position with iloc
first the index, then the column
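Example for column and row selection. A sketch with invented data; the row labels "one", "two", "three" are an assumption.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {
        "year": [2001, 2002, 2001],
        "district": ["Aveiro", "Aveiro", "Lisboa"],
        "pop": [1.5, 1.7, 3.6],
    },
    index=["one", "two", "three"],
)

by_key = frame2["district"]    # dict-like notation, works for any column name
# Attribute notation only works for valid Python names that do not clash
# with DataFrame methods (frame2.pop is a method, not the "pop" column).
by_attr = frame2.district

row_by_label = frame2.loc["two"]        # row by name
row_by_pos = frame2.iloc[1]             # same row by position
value = frame2.loc["two", "district"]   # index first, then column
```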
Assigning columns with values
each entry 16.5 / values 0 - 5
by index with specific values
In this case the debt column
values attribute with DataFrame
-> returns the data contained in the DataFrame as a two-dimensional ndarray
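Example for assigning columns and for the values attribute. A sketch with invented data; the Series val and its index labels are made up.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame(
    {
        "year": [2001, 2002, 2001, 2002, 2001, 2002],
        "district": ["Aveiro", "Aveiro", "Lisboa", "Lisboa", "Porto", "Porto"],
    }
)

frame2["debt"] = 16.5              # scalar: each entry becomes 16.5
frame2["debt"] = np.arange(6.0)    # array: values 0 - 5, length must match

# Assigning a Series aligns on the index; rows without a label get NaN.
val = pd.Series([-1.2, -1.5, -1.7], index=[1, 3, 4])
frame2["debt"] = val

arr = frame2[["year", "debt"]].values   # two-dimensional ndarray
print(arr.shape)
```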
Index and Column objects for DataFrames
Every Series and DataFrame has an Index object that holds the row labels (and a little metadata like a name).
df.columns is also an Index object—just for column labels.
Default row index is RangeIndex(0, 1, 2, …), but you can use anything hashable (strings, dates, tuples).
Index objects are immutable and thus can’t be modified by the user
Instead:
Reindexing:
-> creates new object with data conformed to new index
Reindexing columns:
-> the columns keyword is used
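Example for immutable Index objects and reindexing. A sketch with an invented 3x3 frame.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["Aveiro", "Lisboa", "Porto"],
    columns=["one", "two", "three"],
)

# Index objects are immutable: item assignment raises TypeError.
immutable = False
try:
    frame.index[0] = "Braga"
except TypeError:
    immutable = True

# reindex creates a new object conformed to the new index;
# Coimbra does not exist in frame, so its row is all NaN.
rows = frame.reindex(["Aveiro", "Coimbra", "Lisboa", "Porto"])

# Reindexing columns uses the columns keyword.
cols = frame.reindex(columns=["three", "one", "four"])
```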
dropping entries
drop the Porto and Coimbra rows
drop the two column, or the two and four columns
-> .drop() returns new object
-> to drop columns, use either axis=1 or axis='columns'
using .drop() without creating a new object
-> use inplace=True
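Example for dropping entries. A sketch with an invented 4x4 frame; the column names one to four follow the notes.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Aveiro", "Coimbra", "Lisboa", "Porto"],
    columns=["one", "two", "three", "four"],
)

no_rows = frame.drop(["Porto", "Coimbra"])            # drop rows, new object
no_two = frame.drop("two", axis=1)                    # drop one column
no_cols = frame.drop(["two", "four"], axis="columns") # drop two columns

# inplace=True modifies the object itself and returns None.
work = frame.copy()
work.drop("Aveiro", inplace=True)
```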
Indexing, Selection and Filtering
get the Aveiro and Lisboa rows
get the “two” column
get the “three” and “one” columns, in this order
• Slicing or selecting data with a boolean array:
get boolean array for all values in column three that are larger than 5
get only Lisboa, Coimbra and Porto
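Example for indexing, selection and boolean filtering. A sketch with an invented 4x4 frame, chosen so that the filter on column three keeps Coimbra, Lisboa and Porto.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Aveiro", "Coimbra", "Lisboa", "Porto"],
    columns=["one", "two", "three", "four"],
)

rows = frame.loc[["Aveiro", "Lisboa"]]   # rows by label
two = frame["two"]                       # a single column (Series)
cols = frame[["three", "one"]]           # several columns, in this order

mask = frame["three"] > 5                # boolean array, one entry per row
filtered = frame[mask]                   # keeps only rows where mask is True
print(filtered)
```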
Since the 'c' and 'e' columns are not found in both DataFrame objects, they appear as all missing in the result. The same holds for the rows whose labels are not common to both objects.
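Example for arithmetic between two DataFrames. A sketch: the column labels b, c, d, e match the sentence above, while the row labels and values are assumptions.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.arange(9.0).reshape(3, 3),
    columns=list("bcd"),
    index=["Aveiro", "Lisboa", "Porto"],
)
df2 = pd.DataFrame(
    np.arange(12.0).reshape(4, 3),
    columns=list("bde"),
    index=["Braga", "Aveiro", "Lisboa", "Coimbra"],
)

# The result uses the union of row and column labels;
# any label not present in both frames yields NaN.
total = df1 + df2
print(total)
```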
Operations between DataFrame and Series
By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows.
-> would not work here, as the Series index labels do not match the column labels
axis='index': match the Series index against the row index
axis='columns' (default): match the Series index against the column labels
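Example for operations between a DataFrame and a Series. A sketch with invented data; it uses the sub method, whose axis argument selects which labels the Series is matched on.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape(4, 3),
    columns=list("bde"),
    index=["Aveiro", "Coimbra", "Lisboa", "Porto"],
)

row = frame.iloc[0]      # Series indexed by the column labels b, d, e
by_cols = frame - row    # default: match on columns, broadcast down the rows

col = frame["d"]         # Series indexed by the row labels
wrong = frame - col      # labels do not match the columns -> everything NaN
by_rows = frame.sub(col, axis="index")   # match on the row index instead
```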
Function application and mapping
get the difference between max and min value for each column
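format each value as a string with two decimal places ('%.2f')
the lambda is applied column-wise: different columns can hold different value types (string, boolean, etc.), so it makes sense to apply it per column by default
apply: applies a function on one-dimensional arrays to each column or row
axis='index' (default): the function receives one column at a time (it runs across the rows)
axis='columns': the function receives one row at a time (it runs across the columns)
applymap performs an element-wise operation on each value (renamed to DataFrame.map in pandas 2.1)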
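Example for function application. A sketch with invented data; to stay independent of the pandas version, the element-wise step here uses Series.map inside apply instead of applymap.

```python
import pandas as pd

frame = pd.DataFrame(
    {"b": [1.0, 4.5, 2.25], "d": [10.0, 7.0, 8.5], "e": [0.5, 0.75, 3.0]},
    index=["Aveiro", "Lisboa", "Porto"],
)

spread = lambda x: x.max() - x.min()   # difference between max and min

per_column = frame.apply(spread)                 # default: one result per column
per_row = frame.apply(spread, axis="columns")    # one result per row

fmt = lambda v: f"{v:.2f}"             # two-decimal string
formatted_e = frame["e"].map(fmt)      # element-wise on a single Series
formatted = frame.apply(lambda column: column.map(fmt))  # element-wise on all columns
```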
Sorting
by index
by column
column/ descending
Sorting Values
sorting values in column b
sorting values first based on b, then on a
like in Excel when you click the arrow in a table header
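Example for sorting. A sketch with invented data; the column names a and b follow the notes.

```python
import pandas as pd

frame = pd.DataFrame(
    {"b": [4, 7, 4, 2], "a": [1, 0, 0, 1]},
    index=["Porto", "Aveiro", "Lisboa", "Coimbra"],
)

by_index = frame.sort_index()                         # sort rows by row label
by_column = frame.sort_index(axis=1)                  # sort columns by column label
by_column_desc = frame.sort_index(axis=1, ascending=False)

by_b = frame.sort_values(by="b")                      # sort rows by values in b
by_b_then_a = frame.sort_values(by=["b", "a"])        # ties in b are ordered by a
```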
Summary statistics for DataFrames
sum up the values in each column
sum up the values in each row
multiple summary statistics with one function
Available summary statistics
correlations and covariance
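Example for summary statistics. A sketch with invented data; note that NaN values are skipped by default.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"one": [1.4, 7.1, np.nan, 0.75], "two": [2.0, -4.5, 1.0, -1.3]},
    index=["Aveiro", "Lisboa", "Coimbra", "Porto"],
)

col_sums = frame.sum()                  # sum of each column
row_sums = frame.sum(axis="columns")    # sum of each row

summary = frame.describe()              # count, mean, std, min, quartiles, max

corr = frame.corr()                     # pairwise correlation of the columns
cov = frame.cov()                       # pairwise covariance of the columns
```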
Unique values, value counts
unique values
count values and sort them
count values without sorting
-> array(['c', 'a', 'd', 'b'], dtype=object) (not sorted)
-> .value_counts() automatically sorts
Membership
DataFrames
get Mask array
use mask to filter dataset
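Example for unique values, value counts and membership. A sketch: the letters are chosen so that unique() gives the array shown above; the names obj, mask and subset are assumptions.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

uniques = obj.unique()                    # in order of appearance, not sorted
counts = obj.value_counts()               # sorted by count, highest first
unsorted = obj.value_counts(sort=False)   # counts without sorting

mask = obj.isin(["b", "c"])               # boolean mask: membership test
subset = obj[mask]                        # use the mask to filter the data
```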
How is missing data represented in pandas?
-> as the floating-point value NaN (Not a Number)
check for missing data:
-> df.isnull()
NA handling methods
.dropna() -> drops rows
.dropna(axis=1) -> drops columns
.fillna() -> fill missing data
.isnull() -> Returns a Boolean mask (True for missing values).
.notnull() -> opposite of isnull, returns a Boolean mask (True for non-missing values).
-> All create a new object
-> Can modify in place: only dropna and fillna, with inplace=True
-> Never in-place: isnull, notnull.
Filtering out missing data per row:
drop only rows with at least one missing value
drop only rows where all values are NaN
df.data.dropna()
Filtering out null columns
remove this column
fillna() function
replace NaN values with 0
replace NaN values differently across columns
fill each NaN value with the value before it (forward fill)
fillna arguments
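Example for handling missing data. A sketch with invented data; forward filling uses .ffill() here, since recent pandas versions deprecate the older fillna(method='ffill') spelling.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    [[1.0, 6.5, 3.0], [1.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.0]],
    columns=["a", "b", "c"],
)

mask = data.isnull()                     # Boolean mask, True for missing values

any_dropped = data.dropna()              # drop rows with at least one NaN
all_dropped = data.dropna(how="all")     # drop only rows where all values are NaN

with_empty = data.assign(d=np.nan)       # add a column that is entirely NaN
no_empty_cols = with_empty.dropna(axis=1, how="all")   # remove this column

zeros = data.fillna(0)                   # replace NaN values with 0
per_col = data.fillna({"a": 0, "b": 0.5})   # different fill value per column
forward = data.ffill()                   # fill NaN with the value before
```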
Removing duplicates
Boolean Series indicating which rows are duplicates
remove duplicates
remove duplicates based on a specific column
keeping the last duplicate
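Example for removing duplicates. A sketch with invented data; the column names k1 and k2 are assumptions.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "k1": ["one", "two", "one", "two", "one", "two", "two"],
        "k2": [1, 1, 2, 3, 3, 4, 4],
    }
)

dup = data.duplicated()                        # True for rows seen before
deduped = data.drop_duplicates()               # keeps the first occurrence
by_k1 = data.drop_duplicates(subset=["k1"])    # duplicates judged on k1 only
keep_last = data.drop_duplicates(keep="last")  # keeps the last occurrence
```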
Transforming Data using Function or Mapping
add a new column animal that maps each food to the animal it comes from
map the animal directly into the food column, replacing the food names
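Example for transforming data with a mapping. A sketch: the food names and the dict meat_to_animal are invented for illustration; only the column names food and animal come from the notes.

```python
import pandas as pd

data = pd.DataFrame(
    {"food": ["bacon", "pulled pork", "pastrami", "corned beef", "honey ham"]}
)

# Hypothetical lookup table: food -> animal.
meat_to_animal = {
    "bacon": "pig",
    "pulled pork": "pig",
    "pastrami": "cow",
    "corned beef": "cow",
    "honey ham": "pig",
}

# New column: Series.map looks every value up in the dict.
data["animal"] = data["food"].map(meat_to_animal)

# Alternative: overwrite the food column itself (done on a copy here).
replaced = data.copy()
replaced["food"] = replaced["food"].map(meat_to_animal)
```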
Replacing Values
one value with another value
multiple values with one value
multiple values with multiple values
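Example for the three replace variants. A sketch with invented data, treating -999 and -1000 as placeholder codes (an assumption).

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

one = s.replace(-999, np.nan)                            # one value with another value
many_to_one = s.replace([-999, -1000], np.nan)           # multiple values with one value
many_to_many = s.replace([-999, -1000], [np.nan, 0])     # multiple values with multiple values
as_dict = s.replace({-999: np.nan, -1000: 0})            # same mapping written as a dict
```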
Renaming Axis Indexes
rename Lisboa to Cascais and three to threeandahalf
rename Porto to Braga and modify the original dataset
rename returns a new DataFrame unless inplace=True
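Example for renaming axis labels. A sketch with an invented frame; the labels follow the notes.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=["Lisboa", "Porto", "Aveiro"],
    columns=["one", "two", "three", "four"],
)

# Returns a new DataFrame; frame itself is unchanged.
renamed = frame.rename(
    index={"Lisboa": "Cascais"}, columns={"three": "threeandahalf"}
)

# inplace=True modifies the original dataset and returns None.
frame.rename(index={"Porto": "Braga"}, inplace=True)
```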
Discretization and binning
bin the ages into the intervals 18, 25, 35, 60, 100
get the interval number (code) for each age
get the interval index (the categories)
assign 'Youth', 'YoungAdult', 'MiddleAged' and 'Senior' as bin names
equal length bins based on min and max values in the data
based on quantiles
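Example for discretization and binning. A sketch: the bin edges and label names come from the notes, the ages are invented.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]   # hypothetical ages
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)       # intervals (18, 25], (25, 35], (35, 60], (60, 100]
codes = cats.codes              # interval number for each age
intervals = cats.categories     # the interval index

labels = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
named = pd.cut(ages, bins, labels=labels)   # bins with names

equal_width = pd.cut(ages, 4)   # 4 equal-length bins between min and max
quartiles = pd.qcut(ages, 4)    # 4 bins based on quantiles (equal counts)
```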
detecting and filtering outliers
filter out all values larger than 3 for a specific column
filter out all values larger than 3 for the whole dataset
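Example for detecting and filtering outliers. A sketch with invented data; it treats values larger than 3 in absolute value as outliers, which is an assumption (drop .abs() for a one-sided check).

```python
import pandas as pd

data = pd.DataFrame(
    {
        "a": [0.5, -3.2, 1.0, 2.9],
        "b": [3.5, 0.1, -0.4, 1.2],
        "c": [0.0, 1.0, -1.0, 4.1],
    }
)

col = data["b"]
col_outliers = col[col.abs() > 3]            # outliers in one specific column

has_outlier = (data.abs() > 3).any(axis="columns")   # True if any value in the row exceeds 3
row_outliers = data[has_outlier]             # rows containing at least one outlier
kept = data[~has_outlier]                    # rows without any outlier
```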
Computing indicator/ Dummy Variables:
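Example for indicator (dummy) variables. A sketch using pd.get_dummies; the frame and its column names key and data1 are invented.

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"], "data1": range(6)})

# One indicator column per distinct value in key.
dummies = pd.get_dummies(df["key"], prefix="key")

# Attach the indicator columns to the remaining data.
joined = df[["data1"]].join(dummies)
print(joined)
```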
Pandas String Manipulation
array indicating which email uses gmail domain
display the elements at positions 0 to 4
uppercase all mail addresses
replace gmail with yahoo
get length of string
checking if characters in strings are just letters
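Example for the string methods under the .str accessor. A sketch with invented addresses; reading "positions 0 to 4" as a slice of each string is an assumption.

```python
import pandas as pd

emails = pd.Series(
    {"Ana": "ana@gmail.com", "Rui": "rui@yahoo.com", "Ines": "ines@gmail.com"}
)

is_gmail = emails.str.contains("gmail")     # which addresses use the gmail domain
first_five = emails.str[:5]                 # characters at positions 0 to 4
upper = emails.str.upper()                  # uppercase all mail addresses
swapped = emails.str.replace("gmail", "yahoo", regex=False)   # replace gmail with yahoo
lengths = emails.str.len()                  # length of each string
alpha = emails.str.isalpha()                # True only if all characters are letters
```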