Displaying first few rows of the dataset
-> first 5 rows
-> first 10 rows
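Both are covered by pandas head() (5 rows is the default):
dataset.head()
dataset.head(10)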
Checking which data types our dataset contains
column titles
information about one column (age)
displaying selected columns
getting unique values for one column
min and max
count: absolute and relative frequency of one variable
dataset.dtypes (no brackets)
dataset.columns -> Index(['ID', 'age', …], dtype='object')
dataset.age or dataset['age'] (each entry + a summary line at the end with name, length and dtype)
dataset[['age', 'ID']]
dataset.age.unique()
dataset.min() and dataset.max()
dataset.gender.value_counts() or dataset.gender.value_counts(normalize=True)*100
Creating plots
histogram
kernel density plot
sns.histplot(dataset['columnname'], fill=True, color='salmon')
sns.kdeplot(dataset['columnname'], fill=True, color='salmon')
functions to get
mean
median
quantile 0.25
quantile 0.75
mode
dataset.column.mean()
dataset.column.median()
dataset.column.quantile(0.25)
dataset.column.quantile(0.75)
dataset.column.mode()
functions to get
Variance
Standard deviation
dataset.columnname.var()
dataset.columnname.std()
functions to calculate
skewness
kurtosis
from scipy.stats import skew, kurtosis
skew(dataset.columnname, axis=0, bias=True)
kurtosis(dataset.columnname, axis=0, bias=True)
log- and ln-transformed values
ln
log2
log10
np.log(dataset.column)
np.log2(dataset.column)
np.log10(dataset.column)
create correlation matrix
spearman
pearson
matrix = dataset[['ColName', 'ColName2']].corr(method='spearman')
-> rank correlation
matrix = dataset[['ColName', 'ColName2']].corr(method='pearson')
-> linear correlation
conditional probability for
"an individual has a premium subscription, given that they prefer 'action' movies as their favorite genre?"
pd.crosstab(index=sampled_users['premSub'], columns=sampled_users['favGenre'], margins=True)
-> cell count for (premium, 'action') divided by the 'action' column total: 3241 / 3998 ≈ 0.81
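A minimal sketch of reading that ratio off the table (assuming premSub is boolean and 'action' is the genre label; adjust to the actual codes):
import pandas as pd
ct = pd.crosstab(index=sampled_users['premSub'], columns=sampled_users['favGenre'], margins=True)
# P(premSub | favGenre='action') = joint count / column total for 'action'
p = ct.loc[True, 'action'] / ct.loc['All', 'action']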
ploting normal distribution
plt.plot(x, stats.norm.pdf(x, mu, std), label='label')
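A minimal sketch, assuming mu and std are known and x spans a few standard deviations around the mean:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
mu, std = 0, 1  # example parameters
x = np.linspace(mu - 4*std, mu + 4*std, 200)  # grid for the curve
plt.plot(x, stats.norm.pdf(x, mu, std), label='normal pdf')
plt.legend()
plt.show()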
df.iloc[2, 1]
df is the dataframe
-> 55000 (3rd row, 2nd column; iloc indexing is zero-based)
Difference between stats.norm.ppf (percent point function) and stats.norm.cdf (cumulative distribution function)
stats.norm.cdf() -> takes a value -> returns a probability (area under the curve) -> "XYZ% of the observations are smaller than some value x" (we are looking for XYZ%)
stats.norm.ppf() -> takes a probability (i.e., an area under the curve, e.g., 97.5%) -> returns a value of the normal distribution (e.g., 1.96) -> "XYZ% of the observations are smaller than some value x" (we are looking for x)
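The two functions are inverses of each other; for the standard normal:
from scipy import stats
stats.norm.cdf(1.96)   # ≈ 0.975: 97.5% of observations lie below 1.96
stats.norm.ppf(0.975)  # ≈ 1.96: the value below which 97.5% of observations lie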
How to formulate a one sample test in Python:
one-sided
two-sided
One-sided:
t_statistic_1s_g, p_value_1s_g = stats.ttest_1samp(a=sample.income, popmean=50000, alternative="greater")
t_statistic_1s_l, p_value_1s_l = stats.ttest_1samp(a=sample.income, popmean=50000, alternative="less")
Two-sided:
t_statistic_2s, p_value_2s = stats.ttest_1samp(a=sample.income, popmean=50000, alternative="two-sided")
How to formulate two sample test in Python:
one-sided
two-sided
One-sided:
t_statistic_g, p_value_g = stats.ttest_ind(sample1.value, sample2.value, alternative="greater")
Two-sided:
t_statistic_2s, p_value_2s = stats.ttest_ind(sample1.value, sample2.value, alternative="two-sided")
compute the t-score (critical value for a two-sided test) with Python
t_score = stats.t.ppf(1 - alpha/2, df=degrees_of_freedom)
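For example, at alpha = 0.05 with 30 degrees of freedom:
from scipy import stats
t_score = stats.t.ppf(1 - 0.05/2, df=30)  # ≈ 2.042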
Difference in Python formula: one sample vs. two sample
stats.ttest_1samp(a=sampled_users.satisfaction, popmean=4, alternative="two-sided")
stats.ttest_ind(sample_portuguese.satisfaction, sample_german.satisfaction, alternative="greater")
linear regression
logistic regression
smf.ols(formula='avgHoursWatched ~ satisfaction + income', data=sampled_users).fit()
smf.logit('premSub ~ income + avgHoursWatched + satisfaction', data=sampled_users_logit).fit()
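A minimal sketch of fitting and inspecting both models with statsmodels' formula API:
import statsmodels.formula.api as smf
ols_res = smf.ols('avgHoursWatched ~ satisfaction + income', data=sampled_users).fit()
logit_res = smf.logit('premSub ~ income + avgHoursWatched + satisfaction', data=sampled_users_logit).fit()
print(ols_res.summary())    # coefficients, p-values, R-squared
print(logit_res.summary())  # coefficients on the log-odds scale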
Kmeans cluster
kmeans = KMeans(n_clusters=k, random_state=42)
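A minimal sketch, assuming data is a numeric DataFrame and k has been chosen (e.g., via the elbow method):
from sklearn.cluster import KMeans
k = 3  # example value
kmeans = KMeans(n_clusters=k, random_state=42)
labels = kmeans.fit_predict(data)   # cluster assignment per row
centers = kmeans.cluster_centers_   # one centroid per cluster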
cluster labels
cluster_labels = fcluster(Z, t=4, criterion='maxclust')
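Here Z is the linkage matrix from hierarchical clustering; a minimal sketch, assuming Ward linkage on a numeric DataFrame data:
from scipy.cluster.hierarchy import linkage, fcluster
Z = linkage(data, method='ward')                          # build the hierarchy
cluster_labels = fcluster(Z, t=4, criterion='maxclust')   # cut it into 4 clusters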
Discretize variables on a scale from 1 to 5 (use rounding to generate discrete data)
def discretize_data(data, min_val=1, max_val=5):
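The notes only give the signature; a minimal sketch of one possible body, assuming min-max scaling to [min_val, max_val] before rounding:
import numpy as np
def discretize_data(data, min_val=1, max_val=5):
    # assumption: rescale to the target range, then round to integers
    scaled = min_val + (data - data.min()) / (data.max() - data.min()) * (max_val - min_val)
    return np.round(scaled).astype(int)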
correlation heatmap
sns.heatmap(data_summary.corr(), annot=True, cmap='coolwarm', fmt='.2f')
calculate KMO
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all, kmo_model = calculate_kmo(efa_data)
cronbach alpha
import pingouin as pg
alpha_f1 = pg.cronbach_alpha(data=df_factor[['passion1', 'passion2', 'passion3', 'passion4']])  # returns (alpha, confidence interval)
Calculate VIF
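A minimal sketch using statsmodels, assuming X is a DataFrame of predictors with a constant added (column names are hypothetical):
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = sm.add_constant(sampled_users[['income', 'satisfaction']])  # hypothetical predictors
vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
})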
panel data effects models
time fixed effects
random effects
time_fixed_effects_model = PanelOLS(y, X, time_effects=True)
random_effects_model = RandomEffects(y, X)
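Both come from linearmodels and expect data indexed by (entity, time); a minimal sketch with hypothetical column names:
from linearmodels.panel import PanelOLS, RandomEffects
panel = df.set_index(['user_id', 'year'])   # MultiIndex: entity, then time
y = panel['outcome']                        # hypothetical dependent variable
X = panel[['income', 'satisfaction']]       # hypothetical regressors
time_fe_res = PanelOLS(y, X, time_effects=True).fit()
re_res = RandomEffects(y, X).fit()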
Hausman Test
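linearmodels has no single-call Hausman test; a common manual sketch, assuming fe_res and re_res are fitted fixed- and random-effects results on the same regressors:
import numpy as np
from scipy import stats
diff = (fe_res.params - re_res.params).values
cov_diff = (fe_res.cov - re_res.cov).values
stat = float(diff @ np.linalg.inv(cov_diff) @ diff)  # chi-squared statistic
p_value = stats.chi2.sf(stat, df=len(diff))          # small p -> prefer fixed effects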
Dependent (y) and independent (X) variables for the effects models: y is the outcome series, X the matrix of regressors (see the panel sketch above).