Data Curation

by Luca I.

Pandas Data Structures

Series:

one-dimensional array-like object
contains a sequence of values and an associated array of data labels -> the index
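
As a minimal sketch of the definition above, a Series can be built from a list of values plus a list of labels. The values and labels here are made up for illustration.

```python
import pandas as pd

# Illustrative data: four values, each paired with a label in the index.
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# A value is looked up through its label, not its position.
value_a = obj2["a"]
print(obj2)
print(value_a)
```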

DataFrame:

rectangular table of data
contains ordered collection of columns
each column can have a different value type
both rows and columns have an index
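
The properties listed above can be checked on a small table. This is only a sketch; the column names and numbers are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative table: each column holds a different value type.
frame = pd.DataFrame({
    "district": ["Aveiro", "Lisboa", "Porto"],  # strings
    "year": [2000, 2001, 2002],                 # integers
    "pop": [1.5, 1.7, 3.6],                     # floats
})

print(frame.dtypes)   # one dtype per column
print(frame.index)    # row index
print(frame.columns)  # column index
```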

Operations on Series:

change / assign the index
get a specific value / assign a value by index label
filtering
scalar multiplication
math functions
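
The operations in this list could look as follows. The data is illustrative and the variable names are assumptions, not taken from the original card.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

obj.index = ["w", "x", "y", "z"]  # assign a new index
obj["y"] = 6                      # assign a value by index label
positive = obj[obj > 4]           # filtering with a boolean condition
doubled = obj * 2                 # scalar multiplication
exp = np.exp(obj)                 # NumPy math function; the index is preserved
```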

Operations on Series:

create a Series from a dict
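
A sketch of the dict-to-Series conversion, assuming that sdata is a plain dict whose keys are district names. The numbers are invented.

```python
import pandas as pd

# Assumed example data: dict keys become the index labels.
sdata = {"Alentejo": 35000, "Lisboa": 71000, "Porto": 16000, "Coimbra": 5000}
obj3 = pd.Series(sdata)
print(obj3)
```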

Operations on Series:

create a new index order for sdata

-> since the Aveiro label is not present in sdata, it is assigned the NaN value

-> since Alentejo is not present in the new index order, it is not in the new Series
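
The two effects described above can be reproduced with a sketch like this one, assuming sdata is a dict and the new order is passed as the index argument. The numbers are invented.

```python
import pandas as pd

sdata = {"Alentejo": 35000, "Lisboa": 71000, "Porto": 16000, "Coimbra": 5000}

# New label order: Aveiro is added, Alentejo is left out.
districts = ["Aveiro", "Lisboa", "Porto", "Coimbra"]
obj4 = pd.Series(sdata, index=districts)
print(obj4)
```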

Operations on Series:

check for null / not-null values
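
A minimal sketch of the null checks, using an illustrative Series with one missing value:

```python
import pandas as pd

sdata = {"Alentejo": 35000, "Lisboa": 71000, "Porto": 16000, "Coimbra": 5000}
obj4 = pd.Series(sdata, index=["Aveiro", "Lisboa", "Porto", "Coimbra"])

missing = obj4.isnull()   # True where the value is NaN
present = obj4.notnull()  # True where a value exists
print(missing)
print(present)
```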

Operations on Series:

name the entire Series and its index
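
Presumably this card refers to the name attributes of the Series and of its index. A sketch under that assumption, with invented names:

```python
import pandas as pd

obj4 = pd.Series({"Lisboa": 71000, "Porto": 16000})

obj4.name = "population"      # name of the whole Series
obj4.index.name = "district"  # name of the index
print(obj4)
```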

Operations on Series:

get the whole Series
get the index
get the values

Whole Series:

Get Index:

Get Values:
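
The three lookups can be sketched on a Series without an explicit index, which then gets the default integer index. The values are illustrative.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

print(obj)         # whole Series: labels on the left, values on the right
print(obj.index)   # index object (a RangeIndex by default)
print(obj.values)  # values as a NumPy array
```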

DataFrame

dict -> DataFrame

What happens if you pass a column that is not contained in the dict?

-> the missing column is filled with NaN values
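
A sketch of this behaviour, assuming a dict of equal-length lists and a requested "debt" column that the dict does not contain. All names and numbers are illustrative.

```python
import pandas as pd

data = {
    "district": ["Aveiro", "Aveiro", "Aveiro", "Lisboa", "Lisboa", "Lisboa"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}

# "debt" is not a key of the dict, so the column is created with NaN values.
frame2 = pd.DataFrame(data, columns=["year", "district", "pop", "debt"])
print(frame2)
```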

Selecting rows and columns in a DataFrame

a column can be retrieved either by dict-like notation or by attribute access
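
Both notations are sketched below on illustrative data. Attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute; a column called "pop", for example, is shadowed by the DataFrame.pop method.

```python
import pandas as pd

frame2 = pd.DataFrame({
    "district": ["Aveiro", "Lisboa", "Porto"],
    "year": [2000, 2001, 2002],
    "pop": [1.5, 1.7, 3.6],
})

by_key = frame2["district"]  # dict-like notation, works for any column name
by_attr = frame2.year        # attribute access
pop_column = frame2["pop"]   # frame2.pop would be the method, not the column
```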

DataFrame

retrieve a row by position or by label (here: 2002, Aveiro, NaN, NaN)
retrieve a specific value (Aveiro)

Rows can also be retrieved by name with the special loc attribute or by position with iloc

First the index, then the column
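
A sketch of label-based and position-based lookup. The row labels "one" to "three" and the data are assumptions for illustration.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {
        "year": [2000, 2001, 2002],
        "district": ["Lisboa", "Porto", "Aveiro"],
    },
    index=["one", "two", "three"],
)

row_by_label = frame2.loc["three"]            # whole row, selected by label
row_by_position = frame2.iloc[2]              # same row, selected by position
value = frame2.loc["three", "district"]       # index first, then column
value_by_position = frame2.iloc[2, 1]         # row position, then column position
```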

Assigning values to columns

assign a scalar (every entry becomes 16.5) or an array of values (0 to 5)
assign a Series to set specific values by index label

In this case, the debt column is assigned
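
The three kinds of assignment can be sketched as follows. The frame, its row labels and the Series values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame(
    {"year": [2000, 2001, 2002, 2001, 2002, 2003]},
    index=["one", "two", "three", "four", "five", "six"],
)

frame2["debt"] = 16.5             # scalar: every entry becomes 16.5
after_scalar = frame2["debt"].tolist()

frame2["debt"] = np.arange(6.0)   # array: length must match the DataFrame
after_array = frame2["debt"].tolist()

# Series: values are aligned on the index, all other rows become NaN.
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
frame2["debt"] = val
```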

values attribute of a DataFrame

-> returns the data contained in the DataFrame as a two-dimensional ndarray
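
A short sketch of the values attribute on an illustrative all-float table:

```python
import pandas as pd

frame = pd.DataFrame(
    {"Aveiro": [1.5, 1.7], "Lisboa": [2.4, 2.9]},
    index=[2001, 2002],
)

arr = frame.values  # two-dimensional NumPy array, labels are dropped
print(arr)
print(arr.ndim, arr.shape)
```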

Index and column objects for DataFrames

Every Series and DataFrame has an Index object that holds the row labels (and a little metadata like a name).
df.columns is also an Index object, just for column labels.
The default row index is RangeIndex(0, 1, 2, …), but you can use anything hashable (strings, dates, tuples).
Index objects are immutable, so their labels cannot be modified in place by the user
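
Immutability can be observed directly: assigning to a single position of an Index raises a TypeError. A minimal sketch with illustrative labels:

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

try:
    index[1] = "d"    # item assignment is not supported on an Index
    modified = True
except TypeError:
    modified = False

print(modified)
print(list(index))
```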

Instead:

Reindexing:

DataFrame

-> creates a new object with the data conformed to the new index

Reindexing columns:

the columns keyword is used
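
Row and column reindexing could be sketched like this. The labels and numbers are illustrative; labels that did not exist before are filled with NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Aveiro", "Lisboa", "Porto"],
)

# Reindex the rows: "b" is new, so its row is all NaN.
frame_rows = frame.reindex(["a", "b", "c", "d"])

# Reindex the columns with the columns keyword: "Coimbra" is new.
frame_cols = frame.reindex(columns=["Lisboa", "Coimbra", "Porto"])
```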

Dropping entries

DataFrame

drop the Porto and Coimbra rows
drop the column "two", then the columns "two" and "four"

-> .drop() returns a new object

-> to drop columns, use either axis=1 or axis='columns'

using .drop() without creating a new object

pass inplace=True
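
The drop variants above, sketched on an illustrative 4x4 table (row and column labels are assumptions):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Aveiro", "Lisboa", "Coimbra", "Porto"],
    columns=["one", "two", "three", "four"],
)

rows_dropped = data.drop(["Porto", "Coimbra"])             # new object, rows removed
col_dropped = data.drop("two", axis=1)                     # new object, one column removed
cols_dropped = data.drop(["two", "four"], axis="columns")  # same, with the axis name

# inplace=True modifies data itself and returns None.
data.drop("Aveiro", inplace=True)
```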

Indexing, Selection and Filtering

get the Aveiro and Lisboa rows
get the "two" column
get the "three" and "one" columns, in this order

• Slicing or selecting data with a boolean array:

get a boolean array for all values in column "three" that are larger than 5
get only the Lisboa, Coimbra and Porto rows
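
The selections from this section could be written as below. The table is an illustrative assumption, chosen so that the boolean filter keeps exactly the Lisboa, Coimbra and Porto rows.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Aveiro", "Lisboa", "Coimbra", "Porto"],
    columns=["one", "two", "three", "four"],
)

first_two = data[:2]                # row slice: Aveiro and Lisboa
two = data["two"]                   # single column as a Series
three_one = data[["three", "one"]]  # several columns, in the given order

mask = data["three"] > 5            # boolean Series, one entry per row
filtered = data[mask]               # keeps only the rows where mask is True
```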

When two DataFrame objects are added, columns that are not found in both of them (here 'c' and 'e') appear as all missing in the result. The same holds for the rows whose labels are not common to both objects.
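
A sketch of this alignment behaviour. The two frames, including the row label "Faro", are illustrative assumptions; only the column labels 'b', 'c', 'd' and 'e' follow the text.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.arange(9.0).reshape((3, 3)),
    columns=list("bcd"),
    index=["Aveiro", "Lisboa", "Porto"],
)
df2 = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Coimbra", "Aveiro", "Lisboa", "Faro"],
)

# The result uses the union of both indexes and both column sets.
# Only cells whose row AND column exist in both frames get a value.
total = df1 + df2
print(total)
```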

Operations between DataFrame and Series

By default, arithmetic between a DataFrame and a Series matches the index of the Series against the DataFrame's columns and broadcasts down the rows.
-> this would not work here, because the Series index does not match the column labels

axis = 'index': match the Series index against the DataFrame's row index

axis = 'columns' (default): match the Series index against the column labels
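
Both matching modes, plus the mismatch case, are sketched below on illustrative data. The arithmetic method sub is used because the operator form has no axis argument.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Aveiro", "Lisboa", "Coimbra", "Porto"],
)

row = frame.iloc[0]       # Series indexed by the column labels b, d, e
by_columns = frame - row  # default: match on columns, broadcast down the rows

col = frame["d"]          # Series indexed by the row labels
mismatched = frame - col  # labels do not match the columns: everything is NaN
by_rows = frame.sub(col, axis="index")  # match on the row index instead
```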

Function application and mapping

get the difference between the max and min value for each column
format each value as a string with two decimal places (%.2f)

With apply, the lambda is applied column-wise by default: different columns can hold different value types (string, boolean, etc.), so it makes sense to apply a function per column automatically
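
A sketch of both operations on illustrative data. The element-wise formatting is written with Series.map inside apply, because the name of the element-wise DataFrame method differs between pandas versions (applymap in older releases, DataFrame.map from 2.1 on).

```python
import pandas as pd

frame = pd.DataFrame(
    [[1.25, -0.5, 2.0], [0.5, 3.75, -1.0], [2.5, 0.0, 0.125]],
    columns=list("bde"),
    index=["Aveiro", "Lisboa", "Porto"],
)

spread = lambda x: x.max() - x.min()

col_range = frame.apply(spread)                  # default: once per column
row_range = frame.apply(spread, axis="columns")  # once per row

# Element-wise formatting: each column is passed to the outer lambda,
# and Series.map formats every single value in it.
formatted = frame.apply(lambda column: column.map(lambda v: "%.2f" % v))
```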