Displaying first few rows of the dataset
-> first 5 rows
-> first 10 rows
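Both are covered by pandas head() (5 rows is the default):
dataset.head()
dataset.head(10)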
Checking which data types our dataset contains
column titles
information about one column (age)
displaying selected columns
getting unique values for one column
min and max
count: absolute and relative frequency of one variable
dataset.dtypes (no brackets)
dataset.columns -> Index(['ID', 'age', …], dtype='object')
dataset.age or dataset['age'] (each entry + a summary line at the end with name, length and dtype)
dataset[['age', 'ID']]
dataset.age.unique()
dataset.min() and dataset.max()
dataset.gender.value_counts() or dataset.gender.value_counts(normalize=True)*100
Creating plots
histogram
kernel density plot
sns.histplot(dataset['columnname'], fill=True, color='salmon')
sns.kdeplot(dataset['columnname'], fill=True, color='salmon')
functions to get
mean
median
quantile 0.25
quantile 0.75
mode
dataset.column.mean()
dataset.column.median()
dataset.column.quantile(0.25)
dataset.column.quantile(0.75)
dataset.column.mode()
functions to get
Variance
Standard deviation
dataset.columnname.var()
dataset.columnname.std()
functions to calculate
skewness
kurtosis
from scipy.stats import skew, kurtosis
skew(dataset.columnname, axis=0, bias=True)
kurtosis(dataset.columnname, axis=0, bias=True)
log- and ln-transformed values
ln
log2
log10
np.log(dataset.column)
np.log2(dataset.column)
np.log10(dataset.column)
create correlation matrix
spearman
pearson
matrix = dataset[['ColName', 'ColName2']].corr(method='spearman')
-> rank correlation
matrix = dataset[['ColName', 'ColName2']].corr(method='pearson')
-> linear correlation
conditional probability for
"an individual has a premium subscription, given that they prefer 'action' movies as their favorite genre?"
pd.crosstab(index=sampled_users['premSub'], columns=sampled_users['favGenre'], margins=True)
-> cell count for (premium, 'action') divided by the 'action' column total: 3241 / 3998 ≈ 0.81
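A minimal sketch of reading that ratio off the table (assuming premSub is boolean and 'action' is the genre label; adjust to the actual codes):
import pandas as pd
ct = pd.crosstab(index=sampled_users['premSub'], columns=sampled_users['favGenre'], margins=True)
# P(premSub | favGenre='action') = joint count / column total for 'action'
p = ct.loc[True, 'action'] / ct.loc['All', 'action']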
ploting normal distribution
plt.plot(x, stats.norm.pdf(x, mu, std), label='label')
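A minimal sketch, assuming mu and std are known and x spans a few standard deviations around the mean:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
mu, std = 0, 1  # example parameters
x = np.linspace(mu - 4*std, mu + 4*std, 200)  # grid for the curve
plt.plot(x, stats.norm.pdf(x, mu, std), label='normal pdf')
plt.legend()
plt.show()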
df.iloc[2, 1]
df is the dataframe
-> 55000 (3rd row, 2nd column; iloc indexing is zero-based)
Difference between stats.norm.ppf (percent point function) and stats.norm.cdf (cumulative distribution function)
stats.norm.cdf() -> takes a value -> returns a probability (area under the curve) -> "XYZ% of the observations are smaller than some value x" (we are looking for XYZ%)
stats.norm.ppf() -> takes a probability (i.e., an area under the curve, e.g., 97.5%) -> returns a value of the normal distribution (e.g., 1.96) -> "XYZ% of the observations are smaller than some value x" (we are looking for x)
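The two functions are inverses of each other; for the standard normal:
from scipy import stats
stats.norm.cdf(1.96)   # ≈ 0.975: 97.5% of observations lie below 1.96
stats.norm.ppf(0.975)  # ≈ 1.96: the value below which 97.5% of observations lie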
How to formulate a one sample test in Python:
one-sided
two-sided
One-sided:
t_statistic_1s_g, p_value_1s_g = stats.ttest_1samp(a=sample.income, popmean=50000, alternative="greater")
t_statistic_1s_l, p_value_1s_l = stats.ttest_1samp(a=sample.income, popmean=50000, alternative="less")
Two-sided:
t_statistic_2s, p_value_2s = stats.ttest_1samp(a=sample.income, popmean=50000, alternative="two-sided")
How to formulate two sample test in Python:
one-sided
two-sided
One-sided:
t_statistic_g, p_value_g = stats.ttest_ind(sample1.value, sample2.value, alternative="greater")
Two-sided:
t_statistic_2s, p_value_2s = stats.ttest_ind(sample1.value, sample2.value, alternative="two-sided")
compute the t-score (critical value for a two-sided test) with Python
t_score = stats.t.ppf(1 - alpha/2, df=degrees_of_freedom)
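For example, at alpha = 0.05 with 30 degrees of freedom:
from scipy import stats
t_score = stats.t.ppf(1 - 0.05/2, df=30)  # ≈ 2.042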
Difference in Python formula: one sample vs. two sample
stats.ttest_1samp(a=sampled_users.satisfaction, popmean=4, alternative="two-sided")
stats.ttest_ind(sample_portuguese.satisfaction, sample_german.satisfaction, alternative="greater")
linear regression
logistic regression
smf.ols(formula='avgHoursWatched ~ satisfaction + income', data=sampled_users).fit()
smf.logit('premSub ~ income + avgHoursWatched + satisfaction', data=sampled_users_logit).fit()
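A minimal sketch of fitting and inspecting both models with statsmodels' formula API:
import statsmodels.formula.api as smf
ols_res = smf.ols('avgHoursWatched ~ satisfaction + income', data=sampled_users).fit()
logit_res = smf.logit('premSub ~ income + avgHoursWatched + satisfaction', data=sampled_users_logit).fit()
print(ols_res.summary())    # coefficients, p-values, R-squared
print(logit_res.summary())  # coefficients on the log-odds scale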
Kmeans cluster
kmeans = KMeans(n_clusters=k, random_state=42)
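A minimal sketch, assuming data is a numeric DataFrame and k has been chosen (e.g., via the elbow method):
from sklearn.cluster import KMeans
k = 3  # example value
kmeans = KMeans(n_clusters=k, random_state=42)
labels = kmeans.fit_predict(data)   # cluster assignment per row
centers = kmeans.cluster_centers_   # one centroid per cluster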
cluster labels
cluster_labels = fcluster(Z, t=4, criterion='maxclust')
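Here Z is the linkage matrix from hierarchical clustering; a minimal sketch, assuming Ward linkage on a numeric DataFrame data:
from scipy.cluster.hierarchy import linkage, fcluster
Z = linkage(data, method='ward')                          # build the hierarchy
cluster_labels = fcluster(Z, t=4, criterion='maxclust')   # cut it into 4 clusters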
Discretize variables on a scale from 1 to 5 (use rounding to generate discrete data)
def discretize_data(data, min_val=1, max_val=5):
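The notes only give the signature; a minimal sketch of one possible body, assuming min-max scaling to [min_val, max_val] before rounding:
import numpy as np
def discretize_data(data, min_val=1, max_val=5):
    # assumption: rescale to the target range, then round to integers
    scaled = min_val + (data - data.min()) / (data.max() - data.min()) * (max_val - min_val)
    return np.round(scaled).astype(int)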
correlation heatmap
sns.heatmap(data_summary.corr(), annot=True, cmap='coolwarm', fmt='.2f')
calculate KMO
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all, kmo_model = calculate_kmo(efa_data)
cronbach alpha
import pingouin as pg
alpha_f1 = pg.cronbach_alpha(data=df_factor[['passion1', 'passion2', 'passion3', 'passion4']])  # returns (alpha, confidence interval)
Calculate VIF
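A minimal sketch using statsmodels, assuming X is a DataFrame of predictors with a constant added (column names are hypothetical):
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = sm.add_constant(sampled_users[['income', 'satisfaction']])  # hypothetical predictors
vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
})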
panel data effects models
time fixed effects
random effects
time_fixed_effects_model = PanelOLS(y, X, time_effects=True)
random_effects_model = RandomEffects(y, X)
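Both come from linearmodels and expect data indexed by (entity, time); a minimal sketch with hypothetical column names:
from linearmodels.panel import PanelOLS, RandomEffects
panel = df.set_index(['user_id', 'year'])   # MultiIndex: entity, then time
y = panel['outcome']                        # hypothetical dependent variable
X = panel[['income', 'satisfaction']]       # hypothetical regressors
time_fe_res = PanelOLS(y, X, time_effects=True).fit()
re_res = RandomEffects(y, X).fit()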
Hausman Test
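linearmodels has no single-call Hausman test; a common manual sketch, assuming fe_res and re_res are fitted fixed- and random-effects results on the same regressors:
import numpy as np
from scipy import stats
diff = (fe_res.params - re_res.params).values
cov_diff = (fe_res.cov - re_res.cov).values
stat = float(diff @ np.linalg.inv(cov_diff) @ diff)  # chi-squared statistic
p_value = stats.chi2.sf(stat, df=len(diff))          # small p -> prefer fixed effects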
Dependent (y) and independent (X) variables for the effects models: y is the outcome series, X the matrix of regressors (see the panel sketch above).