
RMBA

by Luca I.

Characteristics: Science, Theory and Facts:

Science: use of evidence, testable explanations and predictions

Theory: comprehensive explanation supported by a vast body of evidence

Facts: observations tested and confirmed so many times that there is no reasonable doubt

Inductivism

Inductivism: Bottom-up reasoning → from observations to general laws (creates theories, but not certain).

-> infer laws from examined cases

Problem: no non-circular way to justify predictions

Deduction: Top-down reasoning → from general laws to specific cases (tests/applies theories, certain if premises are true).

guess → predict → test → try to disprove.

Deductive research approach

Formulate research question
Theory & hypotheses
Method selection
Data gathering
Data cleaning
Exploratory data analysis
Modelling

Criteria for a good research question:

• (new, interesting, relevant)

• Clarity

• Focus

• Feasibility

• Testability

• Should not be leading

Research methods:

	Qualitative	Quantitative
Data collection techniques	Interviews, Delphi, textual analysis	Survey, experiment, web scraping
Data analysis techniques	Case writing, mind maps, word frequency counting	Machine learning, cluster analysis, factor analysis

Empirical research

Econometrics vs. Machine Learning

Econometrics
Helps us with… understanding why something happens
Theory-driven, causal inference, interpretability

Machine learning
Helps us with… predicting what will happen
Data-driven, predictive power, flexibility

Variables

Name	Explanation	Examples
Qualitative variables	Cannot be ordered in a logical or natural way	hair color, names
Quantitative variables	Measurable quantities	body height, temperature
Discrete variables	Take a finite (or countable) number of values	ratings from 1 to 5
Continuous variables	Can take infinitely many values within a range	body height
Grouped variables	Also called categorical	income groups
Binary variables	Grouped variable with only two values	adult/child

Scales

Empirical cumulative distribution function (ECDF), for ordinal values:

10% have a satisfaction (st) of 1
23% have a st of exactly 5
25% have a st of 2 or less

Empirical cumulative distribution function (ECDF), for metric values:

40% watch 1 hour or less
40% watch between 1 and 2 hours
80% watch 2 hours or less
the small spike at the end means that all values > 10 were combined
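Statements like "25% have a value of 2 or less" are read directly off the empirical CDF. A minimal sketch of how such a function can be computed, assuming a small made-up list of ratings (the data and the helper name `ecdf` are illustrative, not from the course):

```python
def ecdf(values):
    """Return (value, share of observations <= value) for each distinct value."""
    n = len(values)
    points = sorted(set(values))
    return [(x, sum(v <= x for v in values) / n) for x in points]

# Hypothetical satisfaction ratings on a 1-5 scale (illustrative only)
ratings = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

for value, share in ecdf(ratings):
    print(f"{share:.0%} have a rating of {value} or less")
```

The share belonging to exactly one value is the jump of the ECDF at that value, i.e. the difference between two neighbouring cumulative shares.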

Arithmetic mean:

“average” (German: Durchschnitt)
the sum of all deviations from the mean is zero
gives a good intuition if the data is normally distributed
Caution! Outliers can severely bias this intuition

Median

(measure of central tendency)

compared to the mean, the median is more robust to outliers

Mean vs. median, interpretation:

Mean and median are similar if the data is symmetrically distributed.
If the data has more than one center, neither mean nor median has a meaningful interpretation.
Mean and median may differ if the data is skewed.
If the data has outliers it is better to use the median, as it is robust to outliers.
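The effect of an outlier on mean and median can be checked with a few lines. This is a small sketch using Python's standard `statistics` module; the income figures are hypothetical:

```python
from statistics import mean, median

incomes = [2000, 2200, 2400, 2600, 2800]   # hypothetical, symmetric data
with_outlier = incomes + [50000]           # same data plus one extreme value

# Symmetric data: mean and median coincide
print(mean(incomes), median(incomes))

# With the outlier the mean jumps, the median barely moves
print(mean(with_outlier), median(with_outlier))

# The deviations from the mean always sum to zero
deviation_sum = sum(x - mean(incomes) for x in incomes)
print(deviation_sum)
```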

quantiles & percentiles

Measures of central tendency

Q-Q plots:

To compare the distribution of two variables

Luigi is slower than Domenico
Mario and Salvadore deliver equally fast -> both variables have the same distribution

Mode

Measures of central tendency

the value that occurs most often in the data

Mean vs. Median vs. Mode

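As the distribution is right (positively) skewed, the mean is greater than the median: the few large values in the right tail pull the mean upward, while the median depends only on the order of the values and is not “influenced” by how extreme they are.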

What is skewness?

a measure of asymmetry
Skewness (German: Schiefe) = slant, asymmetry
positive = right-skewed, negative = left-skewed

How is this plot skewed?

right/ positively skewed

“Where the tail is”, so the following chart is right-skewed

Why positive? Because in a right-skewed distribution:

The mean is pulled to the right (towards larger, positive deviations from the median/mode).
When you calculate the skewness formula, those big positive deviations dominate, so the result is > 0.
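The sign argument can be verified numerically. The sketch below assumes the common moment-based definition (third central moment divided by the variance to the power 3/2); the course may use a slightly different sample correction, and the data is made up:

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population version, an assumption)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n   # variance
    m3 = sum((x - m) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_tailed = [1, 1, 2, 2, 3, 10]          # long tail to the right
left_tailed = [-x for x in right_tailed]    # mirrored: long tail to the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed))   # positive
print(skewness(left_tailed))    # negative
print(skewness(symmetric))      # zero
```

Because the deviations are cubed, the large positive deviation of the value 10 outweighs all the small negative ones.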

Measures of dispersion:

Absolute deviation
Mean squared error
Variance
Standard deviation

Absolute deviation:

We take the absolute value of each deviation because otherwise positive and negative deviations would cancel each other out

Pros:

less sensitive to outliers than the MSE or Variance
easy to interpret

Cons:

Variance

MSE around the mean (the mean squared deviation from the mean)
Pros: Mathematically convenient
Cons: hard to interpret (squared units) and sensitive to outliers

Standard deviation:

square root of variance
Pros: Interpretable, widely used
Cons: Sensitive to outliers
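A short sketch comparing the three measures on one made-up data set. It assumes the population versions (division by n), which is what `pvariance` and `pstdev` in Python's `statistics` module compute:

```python
from statistics import mean, pvariance, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations
m = mean(data)

# Mean absolute deviation: average distance from the mean
mean_abs_dev = mean([abs(x - m) for x in data])

# Variance: average squared distance from the mean (squared units)
variance = pvariance(data)

# Standard deviation: square root of the variance (original units)
std_dev = pstdev(data)

print(mean_abs_dev, variance, std_dev)
```

Squaring weights large deviations more heavily, which is why variance and standard deviation react more strongly to outliers than the absolute deviation.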

Standardization:

A variable is standardized if its mean is 0 and its variance is 1
To standardize a variable, we subtract its mean and divide by its standard deviation
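The recipe translates directly into code. A minimal sketch with hypothetical body heights; the function name `standardize` is illustrative and the population standard deviation is assumed:

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(x - m) / s for x in values]

heights_cm = [160, 170, 175, 180, 190]   # hypothetical data
z = standardize(heights_cm)

print(z)
print(mean(z), pstdev(z))   # mean 0 and standard deviation 1
```

Each standardized value says how many standard deviations the original value lies above or below the mean.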

Why standardization:

makes values more comparable
faster convergence (convergence means reaching the optimal solution in iterative algorithms)
makes computed distances more appropriate

Box Plots

How can we interpret the boxplot for the 8th month?

IQR (interquartile range): quite large → high variability. Note that the IQR is not the variance: it only describes the spread of the middle 50% of the data
No extreme outliers visible for this month
The median being closer to Q1 means the lower half of the data (Q1 → median) is more concentrated (shorter spread)
If this were shown as a density curve: the longer spread above the median suggests a right/positive skew

Kurtosis with example

measures the tailedness of a distribution (whereas skewness measures the asymmetry)
A low kurtosis value for a portfolio indicates a more stable and predictable return profile, which may indicate lower risk. Investors may intentionally seek investments with lower kurtosis values when they're building safer, less volatile portfolios

Covariance

Measures the degree to which two variables vary together (values from -inf to +inf)
The unit of the covariance is the product of the units of both variables → difficult to interpret
Positive covariance: the variables increase or decrease together
Negative covariance: if one increases, the other decreases
Zero covariance -> no linear relationship

For example, the covariance between hours watched and income is a number expressed in hours × € -> difficult to interpret

Pearson's correlation coefficient

Correlation measures both the strength and direction of the linear relationship between two variables
Normalized and unitless version of covariance (-1 to 1)

Interpretation:

A correlation close to +1 or −1 indicates a strong linear relationship
A correlation close to 0 indicates a weak or no linear relationship
While covariance measures only the direction, correlation also indicates the strength of the relationship
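The unit problem of the covariance, and how the correlation removes it, can be seen by expressing the same income once in euros and once in cents. This is a sketch with made-up numbers; it assumes the population covariance (division by n):

```python
from statistics import mean

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson(xs, ys):
    """Covariance divided by the product of both standard deviations."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

hours = [1, 2, 3, 4, 5]                        # hypothetical hours watched
income_eur = [3000, 2800, 2500, 2600, 2000]    # hypothetical income in €
income_cent = [100 * v for v in income_eur]    # same income in cents

print(covariance(hours, income_eur), covariance(hours, income_cent))
print(pearson(hours, income_eur), pearson(hours, income_cent))
```

Changing the unit multiplies the covariance by 100 but leaves the correlation unchanged.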

Spearman’s rank correlation coefficient

The values of R (unitless) lie between −1 and +1 and measure the degree of correlation between the ranks of X and Y.

Interpretation:

R = 1 → every observation has the same rank in X and in Y
R = -1 → the ranks in X and Y are exactly reversed
While covariance measures only the direction,
correlation also indicates the strength of the relationship

Spearman vs Pearson

Pearson’s correlation looks at raw values → best for linear, normally distributed data.
Spearman’s correlation looks at ranks → useful for ordinal data, non-linear monotonic trends, robust to outliers.

Example: If exam scores increase with study hours but in a curved (non-linear) way, Pearson might be low, but Spearman will still be high, since the ranking order is preserved
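The exam-score example can be reproduced with a small sketch. Spearman's coefficient is computed here as Pearson's coefficient applied to ranks; the helper functions, the data, and the curved relationship (fourth power of study hours) are illustrative assumptions, and ties are not handled:

```python
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def ranks(values):
    """Rank 1 for the smallest value; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
points = [h ** 4 for h in study_hours]   # hypothetical curved but monotonic relation

print(pearson(study_hours, points))    # below 1: the relation is not linear
print(spearman(study_hours, points))   # 1: the ranking order is fully preserved
```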

Random Variables and Probability Density Function (PDF)

Discrete random variables
Example: number of heads appearing in 10 flips of a coin
Function: a probability function can be written down directly (all probabilities sum to 1)

Continuous random variables
Examples: unemployment rate, max. temperature on a given day
Function: probability density function (PDF)

As a continuous variable takes any single real value with probability zero, we use the density instead.

The density describes the relative likelihood of values near a point. To obtain the probability that the variable falls into an interval, we need to integrate the density over this interval. (This gives us the area under the curve.)
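"Integrating over the interval" can be approximated numerically. The sketch below uses the standard normal density as an example PDF and a simple midpoint rule; both choices, and the function names, are assumptions made for illustration:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def probability(pdf, a, b, steps=10_000):
    """Approximate P(a <= X <= b) as the area under pdf (midpoint rule)."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width

print(probability(normal_pdf, -1.0, 1.0))   # area within one standard deviation
print(probability(normal_pdf, -8.0, 8.0))   # practically the whole area
print(probability(normal_pdf, 0.5, 0.5))    # a single point has probability 0
```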

Probability density function

Cumulative distribution function

The cumulative distribution function (CDF) describes the probability that a random variable takes a value less than or equal to a given number
The CDF is defined for both continuous and discrete random variables

Independent Variables

(Joint distribution)

Two random variables, X and Y are independent if and only if:
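The standard textbook condition is that the joint distribution factorizes into the two marginal distributions. The notation $f_{X,Y}$, $f_X$, $f_Y$ for the joint and marginal probability (density) functions is an assumption here:

```latex
f_{X,Y}(x, y) = f_X(x)\, f_Y(y) \quad \text{for all } x, y
```

In words: knowing the value of X does not change the distribution of Y.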

Conditional Distribution

We are usually interested in how one random variable is related to one or more other random variables.
The conditional distribution of Y given X tells us the distribution of Y conditional on us having information about X

Example: probability of blue eyes given that a person is Portuguese

Conditional Distribution Example

Bayes’ Theorem Calculation Example:
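A possible calculation, sketched with entirely hypothetical probabilities for the blue-eyes example (none of the numbers come from the course). Bayes' theorem is P(A | B) = P(B | A) · P(A) / P(B); exact fractions are used to avoid rounding:

```python
from fractions import Fraction

def bayes(p_b_given_a, p_a, p_b):
    """P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical inputs, for illustration only
p_blue = Fraction(1, 5)                    # P(blue eyes)
p_portuguese = Fraction(1, 10)             # P(Portuguese)
p_portuguese_given_blue = Fraction(1, 20)  # P(Portuguese | blue eyes)

p_blue_given_portuguese = bayes(p_portuguese_given_blue, p_blue, p_portuguese)
print(p_blue_given_portuguese)   # 1/10
```

With these inputs, blue eyes are less likely among Portuguese people (1/10) than in the population overall (1/5).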

Interesting features of probability distributions

Measures of central tendency
Measures of variability or spread
Measures of association between two random variables

Expected Value

Measure of central tendency

The expected value of a random variable X, E(X), is a weighted average of all possible values of X.

Variance

Measure of variability

The variance of a random variable X is a measure of its dispersion around the expected value, μ ≡ E(X):

Standard deviation

Measure of variability

The standard deviation of a random variable, sd(X), is simply the square root of the variance:
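For a discrete random variable the three quantities can be computed directly from the probability function. A minimal sketch for a fair six-sided die (the die example and the function names are illustrative assumptions):

```python
import math
from fractions import Fraction

# Probability function of a fair die: each face has probability 1/6
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def expected_value(pmf):
    """E(X) = sum of x * P(X = x): a probability-weighted average."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - mu)^2] with mu = E(X)."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

mu = expected_value(pmf)
var = variance(pmf)
sd = math.sqrt(var)   # sd(X) = square root of Var(X)

print(mu, var, sd)
```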

Covariance

Measure of association

It is useful to have summary measures of how two random variables vary with one another

Correlation

Measure of association

The correlation coefficient is a measure of the degree of linear relationship between X and Y
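For reference, the usual textbook definitions of both measures of association, written with $\mu_X \equiv E(X)$ and $\mu_Y \equiv E(Y)$ (the symbols $\mu_X$ and $\mu_Y$ are notation assumed here):

```latex
\operatorname{Cov}(X, Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big]
\qquad
\operatorname{Corr}(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\operatorname{sd}(X)\,\operatorname{sd}(Y)}
```

Dividing by both standard deviations removes the units and bounds the result between -1 and 1.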

Covariance vs Correlation

Covariance = raw measure of joint variability, scale-dependent.

Correlation = normalized covariance, scale-independent, bounded between -1 and 1.

-> Think of correlation as the unit-free, standardized version of covariance

Deepdive correlation:

Zero correlation only rules out linear association.
Independence always implies zero correlation.
Zero correlation does not imply independence (counterexample: Y=X^2).
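The counterexample can be checked exactly. The sketch assumes that X takes the values -2, -1, 0, 1, 2 with equal probability (a choice made for illustration) and that Y = X^2:

```python
from fractions import Fraction

xs = [-2, -1, 0, 1, 2]          # assumed support of X, all equally likely
p = Fraction(1, len(xs))

e_x = sum(x * p for x in xs)             # E(X)
e_y = sum(x ** 2 * p for x in xs)        # E(Y) with Y = X^2
e_xy = sum(x * x ** 2 * p for x in xs)   # E(XY) = E(X^3)
cov_xy = e_xy - e_x * e_y                # Cov(X, Y) = E(XY) - E(X)E(Y)

# Dependence: knowing X = 2 fixes Y = 4, although P(Y = 4) is only 2/5
p_y4 = sum(p for x in xs if x ** 2 == 4)
p_y4_given_x2 = Fraction(1)

print(cov_xy, p_y4, p_y4_given_x2)
```

The covariance is zero because positive and negative values of X cancel, yet Y is completely determined by X.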

Standard continuous distributions

Uniform distribution
Normal distribution
Exponential distribution

Uniform distribution

Standard continuous distributions

Waiting time for a bus (if buses arrive every 10 minutes, and you arrive at a random time → Uniform(0,10)).
Randomly selecting a decimal number between 0 and 1 → Uniform(0,1).

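Normal distribution

Standard continuous distributions

N(μ, σ²) -> normal distribution with mean μ and variance σ²; N(0,1) -> standard normal distribution
μ = mean: where the peak is; changing μ shifts the curve left or right
σ² = variance: how wide the curve is; a larger σ² stretches and flattens the curve, a smaller σ² compresses it into a narrower, taller peak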

Central Limit Theorem (CLT)

The Central Limit Theorem states that when you take a sufficiently large sample size n from a population with any shape of distribution (with a finite mean and variance), the distribution of the standardized sample mean will approximate a standard normal distribution. This approximation improves as the sample size increases, regardless of the original distribution's shape.

CLT Example, Coin Flip with sample n=3 and 10:

n=3 -> there are 2^3 = 8 equally likely outcomes; enumerating them and computing the sample mean of each gives the distribution of the sample mean

n=10 -> more like a bell curve
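The enumeration can be reproduced exactly for both sample sizes. A sketch assuming a fair coin coded as heads = 1 and tails = 0 (the coding and the function name are assumptions); it prints a rough text histogram of the sample mean:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sample_mean_distribution(n):
    """Exact distribution of the mean of n fair coin flips (heads = 1, tails = 0)."""
    counts = Counter(Fraction(sum(flips), n) for flips in product((0, 1), repeat=n))
    total = 2 ** n   # number of equally likely outcomes
    return {m: Fraction(c, total) for m, c in sorted(counts.items())}

for n in (3, 10):
    print(f"n={n}")
    for m, prob in sample_mean_distribution(n).items():
        bar = "#" * round(float(prob) * 60)
        print(f"  mean={float(m):.2f}  p={float(prob):.3f}  {bar}")
```

For n=3 the sample mean can only take four values; for n=10 it takes eleven, and the bars already trace a bell-like shape around 0.5.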
