which of the following statements are true for correlation analysis
Correlation analyses can test for a linear relationship between variables
correlations do not prove causal relationships
correlation analyses measure how one variable changes with another
correlation analyses can test for linear and non-linear relationships
you have run a correlation analysis and found p (two-tailed) < 0.01, r = -0.95, and r² = 0.9. What does this tell you?
there is a strong negative correlation between the two variables
90% of the variance within the data can be explained by the model
the correlation is statistically significant
what are semi-partial correlation analyses
analyses in which a third variable is controlled for while correlating two others; the third variable's influence is removed from only one of the two correlated variables
what are possible problems arising from correlation analysis
with a high number of observations, even correlations with a low r-value can become significant
causal direction cannot be identified: it is not possible to tell which variable is influencing which
it can never be ensured that a third variable does not influence at least one of the two observed variables
Take home messages
ALWAYS think about your statistics before you collect data.
CHECK your data for outliers, normality, correlations, missing values, etc.!
Participate in stats-courses whenever possible, refresh your knowledge! PRACTICE!
GENERAL PROCEDURE IN R:
1) PLOT your data
2) DRAW the model
3) CHECK for:
a) homogeneity of variances
b) normality of errors
c) autocorrelation between residuals
d) influential data points
4) INTERPRET your model
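A minimal sketch of this procedure in R, using simulated data (all variable names and numbers are made up for illustration):

    set.seed(1)
    x <- runif(30, 0, 10)              # hypothetical predictor
    y <- 2 + 0.5 * x + rnorm(30)       # hypothetical response with noise

    plot(x, y)                         # 1) PLOT your data
    m <- lm(y ~ x)                     # 2) DRAW (fit) the model
    abline(m)
    par(mfrow = c(2, 2))
    plot(m)                            # 3) CHECK: residuals vs. fitted (homogeneity of
                                       #    variances), Q-Q plot (normality of errors),
                                       #    residuals vs. leverage (influential points)
    summary(m)                         # 4) INTERPRET your model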
Where do scientific questions come from?
- observing patterns
- deriving models
- formulating hypotheses
How to answer scientific questions?
- designing an experiment/observational study
- collecting data
- visualising/analysing the data
model
model = possible explanation for a driving factor of a pattern
pattern
pattern = decrease in chaos; requires energy & driving factor
explained and unexplained variation
unexplained variation = variation around the means (= noise) -> within-group variability
explained variation = variation of the means (= signal) -> between-group variability (difference between groups)
the signal can be covered by the noise (-> too much unexplained variation)
accuracy and precision
accuracy = measurements are as close to the actual value as possible
precision = repeated measurements are as close to each other as possible
Precision has to do with the resolution and quality of devices we use to obtain data, accuracy with the way we calibrate them.
What you want is high accuracy (strong signal) and high precision (low noise) to obtain a high test power.
test power
test power is the likelihood of finding an effect, if there is one
Data-quality:
- has different levels (nominal < ordinal < interval)
- interval data contain the highest amount of information and give the highest test power
- Example: limpets in intertidal horizons
variables
independent variables = what you manipulate, also explanatory variable, predictor variable, factor
dependent variables = what you measure, also response variable, explained variable
- the dependent variable should be a function of the independent variable
- can be categorical or continuous
- type of variables determines which statistic to use
difference between population and sample
population = totality of all units characterised by a variable
sample = analysed part of the population
A population corresponds to the real world (unmeasurable), whereas a sample is an approach to describe the real world.
unit
unit = sampling unit = replicate = parallel
statistical population
statistical population = population of sampling units
treatment level
treatment level = experimental group
what should replicates be
Replicates…
- need to be independent
- shouldn’t be repeated measures
- shouldn’t be grouped together at one place
- should be of an appropriate spatial scale
Else you get pseudoreplication!
statistic and parameter
statistic = measure of some attribute of a sample, e.g. the sample mean; it estimates a parameter (the population mean)
parameter = measure of some attribute of a population, e.g. the population mean
mean and median
mean = sum of all values / number of values (x̄ = Σxᵢ / n); influenced by outliers
median = the observation that has equal numbers of observations above and below it, good for non-normal data
Box-Whisker plots:
- are separated into four quartiles (25% of the data points each)
- show the interquartile range (Q3 - Q1) as a box
- are the best way to summarise your data graphically, because they indicate the distribution of the data
- good for non-normal data (use the median)
variance
Variance = sum of squared distances from mean / degrees of freedom
Measures how far a set of numbers (replicates) are spread out from their mean (variability around the mean).
Standard deviation:
Measures the amount of variation of a set of data values.
Software uses variances for calculating. We use SD for communicating.
Standard error
Standard error (of the mean):
Estimates the reliability of a sample statistic (most commonly of the mean). It is the standard deviation of sample means.
Driven by variability of population and number of replicates (n).
How to calculate the standard error:
calculate means of samples -> calculate mean of means -> SD for mean of means = SE
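In practice the SE is usually computed directly from a single sample as SD / √n, which estimates the SD of the sample means described above. A minimal sketch in R, with made-up values:

    x <- c(4.2, 5.1, 4.8, 5.6, 4.9)   # one sample of replicates (hypothetical values)
    se <- sd(x) / sqrt(length(x))      # SE = SD / sqrt(n)
    se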
Coefficient of variation
CV = SD / mean; a unitless measure of relative variability, often expressed in %
Confidence interval
Indicates the precision of an estimated parameter (e.g. the population mean).
It gives you the probability (most often 95%, confidence level) with which the true population parameter lies inside the borders of the calculated interval.
Factors affecting the width of the confidence interval are sample size, confidence level and variability in the sample.
- in a normal distribution, 95% of all values lie within ±1.96 SD of the mean
- t = correction factor for small samples (n < 100)
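A minimal sketch of a 95% confidence interval in R, using the t correction for a small sample (values are made up):

    x <- c(4.2, 5.1, 4.8, 5.6, 4.9)        # hypothetical sample
    n <- length(x)
    se <- sd(x) / sqrt(n)
    t_crit <- qt(0.975, df = n - 1)         # t replaces 1.96 for small n
    mean(x) + c(-1, 1) * t_crit * se        # lower and upper bound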
core of statistical hypothesis testing
test statistic = explained variation / unexplained variation (signal / noise)
- smaller than 1: the result is not significant
- the bigger the test statistic, the higher the test power
- it is influenced by the effect size, the unexplained variation and the sample size
t-test
can only handle two samples
getting a large t-value:
1) avoid noise
2) increase effect size
3) increase sample size
Student’s t-distribution:
It is a continuous probability distribution that is strongly related to the standard normal distribution and was developed to deal with low sample sizes. The difference is that it does not relate to the whole population, but only to a sample. This is because the population standard deviation is almost always unknown, so the sample standard deviation is used instead. Thus, there are multiple t-distributions (one per sample size) for a population, and all of them have a higher variance than the standard normal distribution (except for a t-distribution with infinite degrees of freedom, which equals the standard normal distribution).
To construct a 95% confidence interval for a normal distribution, the critical value is 1.96. For a t-distribution it will be greater than 1.96, due to the distribution's greater variance.
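This is easy to verify in R by comparing the critical values of the two distributions:

    qnorm(0.975)           # 1.959964 (standard normal distribution)
    qt(0.975, df = 5)      # 2.570582 (wider for a small sample)
    qt(0.975, df = 10000)  # approaches 1.96 as the degrees of freedom grow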
different type of errors
Type I Error: Rejecting the H0 although it’s right. We believe there is an effect, but there is not. > We see something, where there is nothing.
Type II Error: Accepting the H0 though it’s wrong. We believe there is no effect, but there is one. > We overlook the effect.
one and two tailed tests:
A directional hypothesis is called one-tailed.
A non-directional hypothesis is called two-tailed.
advantage/disadvantage of one-tailed test
Advantage:
> you increase test power
Disadvantage:
> you'll miss the effect in case it goes in the other direction
correlation
Correlations quantify how strongly two variables covary with each other.
correlation coefficient
-> r
- measures the degree of correlation
- ranges from -1 to +1 | perfect negative to perfect positive correlation
Most common one:
Pearson correlation coefficient:
= measure of the linear correlation between two variables x and y
rank correlation
Spearman’s rank correlation coefficient:
= measure of the correlation between the ranking of two variables
- requires ordinal data (ranks)
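A minimal sketch of both coefficients in R, using the built-in trees data set:

    data(trees)                                               # girth and height of 31 trees
    cor.test(trees$Girth, trees$Height, method = "pearson")   # linear correlation: r and p
    cor.test(trees$Girth, trees$Height, method = "spearman")  # rank correlation: rho and p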
Partial correlations:
Reveal the unique variance explained by one variable, while controlling for a third variable.
different statistical tests: for comparing proportions -> e.g. the χ2 test
univariate analysis
stat tests for comparing medians -> e.g. Mann-Whitney U, Wilcoxon signed rank, Kruskal-Wallis test
stat tests for comparing means -> e.g. t-test, paired t-test, ANOVA
statistical tests for independent variables
multivariate analysis
● ANOSIM (analysis of similarity)
● PERMANOVA (permutational multivariate analysis of variance)
● ordination techniques and related methods, like PCA or MDS
main: statistical modelling
models express a mechanistic understanding of the explanatory variables
should be:
accurate
convenient
adequate (explain sufficient amount of data)
minimal (high explanatory power) -> estimating as few parameters as possible
can contain
categorical factors
interactions between factors
continuous covariates
purpose of any model
minimise error term
linear models
They follow the straight-line equation: y = a + b·x + ε (intercept a, slope b, error term ε).
The error term is quantified by the residuals: residual = observed value − fitted value.
general linear models vs. generalized linear models (both abbreviated GLM)
The term general linear model (GLM) usually refers to conventional linear regression models for a continuous response variable given continuous and/or categorical predictors. It includes multiple linear regression, as well as ANOVA and ANCOVA (with fixed effects only).
The term generalized linear model (GLM) refers to a class of models in which the response variables have error distributions that do not follow the normal distribution, but an exponential family distribution (e.g. the binomial, Poisson or gamma distribution).
simple linear regression
In regression analysis we can test for the influence of one or more continuous variables on one dependent variable.
A simple linear regression (SLR) is a regression model with one response and one explanatory variable (both continuous).
method of least squares
Intuitively, the best-fitting line would be the one that minimizes the sum of the absolute values of the residuals. In practice, however, it is the line with the lowest sum of squared residuals, which makes the estimate less prone to errors.
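A minimal sketch in R showing that the least-squares formulas give the same estimates as lm(), using the built-in mtcars data:

    x <- mtcars$wt; y <- mtcars$mpg
    b1 <- cov(x, y) / var(x)        # least-squares slope = covariance / variance of x
    b0 <- mean(y) - b1 * mean(x)    # intercept through the means
    c(b0, b1)
    coef(lm(y ~ x))                 # identical estimates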
total sum of squares
total sum of squares (SST) = squared differences between the mean and the observed values
SSM = SST – SSR
residual sum of squares
(SSR) = squared differences between the line of best fit and the observed values
model sum of squares
(SSM) = the increase in accuracy by replacing the simplest model by the best fit model
explained variation:
Coefficient of determination (R2)
Pearson's correlation coefficient (r or R)
Coefficient of determination (R2) is the fraction of variance in the dependent variable that’s explained by the independent variable(s)
| R2 * 100 = explained variation in percent
Pearson’s correlation coefficient (r or R) is a measure of the linear correlation between two variables X and Y
F-ratio
The F-ratio is a test statistic that measures the ratio of the variation explained by the model to the unexplained variation (the difference between the model and the observed data), using mean squares: F = MSM / MSR
adjusted R2
The adjusted R2 takes into consideration how many predictors are in the model and how many data points are in the data set.
linear regression
For linear regression, we assume that…
- residuals are normally distributed,
- the variance in the residuals is constant,
- the residuals are not correlated,
- there are no influential data points.
One-factorial ANOVA
A one-way analysis of variance (ANOVA) compares means of three or more samples to each other. It tests for the influence of a categorical independent variable on a dependent variable.
A series of pairwise t-tests does the same, but it would increase the probability of a type I error and also the number of hypotheses.
ANOVA
ANOVA = compares the variability between samples with the variability within samples
H0 of an ANOVA:
Sample means and underlying population means do not differ from each other.
output of ANOVA
An ANOVA partitions the total sum of squares into two components:
SST (total sum of squares) = SSR (sum of squares within samples; error/residual) + SSM (sum of squares between samples; treatment/model)
mean squares
MS = SS / df (sum of squares / degrees of freedom)
F-statistic
- F larger than 1: more explained than unexplained variation
- the bigger 'n – 1' gets, the larger F gets -> large experiment = high test power
also for ANOVA: R2 = explained variation / total variation
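A minimal sketch of a one-way ANOVA in R, using the built-in PlantGrowth data (one categorical factor with three levels):

    data(PlantGrowth)                              # plant weight under ctrl, trt1, trt2
    m <- aov(weight ~ group, data = PlantGrowth)
    summary(m)                                     # F = MSM / MSR and the p-value
    summary(lm(weight ~ group, data = PlantGrowth))$r.squared  # R2 = explained / total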
ANOVA for an unbalanced design: PERMANOVA
For a one-factorial ANOVA, we assume that
- the samples are independent random samples,
- the populations are normally distributed,
- the population variances are equal.
Assumptions of ANOVA:
1. Independency of replicates
2. Independency of samples
3. Normal distribution
4. Populations have common variance
5. Additive factor effects
-> shows if there is a difference, but not where exactly
-> compares the means of treatments (explained and unexplained variation)
response variables
nominal: yes or no
interval: counting
imbalanced design
if you lose replicates (imbalanced design), rebalance the design (kick out replicates or a whole sample).
Post-hoc tests
With one-factorial ANOVAs, Kruskal-Wallis and median tests, one only tests whether or not there are significant differences in a group of means. Post-hoc testing does pairwise (or groupwise) mean comparisons that indicate which means (or groups of means) differ from one another.
Post-hoc testing – the shotgun approach:
Problem:
- comparing multiple means to one another increases the type I error rate
Solution:
- the Bonferroni correction:
method to counteract the problem of multiple comparisons
this, of course, increases the type II error rate!
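A minimal sketch of Bonferroni-corrected pairwise comparisons in R, reusing the built-in PlantGrowth data from the ANOVA example:

    data(PlantGrowth)
    pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                    p.adjust.method = "bonferroni")   # p-values multiplied by the
                                                      # number of comparisons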
orthogonal contrast
linear combination of parameters or statistics whose coefficients add up to 0
ANOVA vs. regression
ANOVA: qualitative questions; the independent variable is categorical; doesn't expect a relationship between x & y; lower test power
Regression: quantitative questions; the independent variable is continuous; expects a linear relationship between x & y; higher test power
Regression = best case scenario
> more powerful than ANOVA
> more information content than ANOVA
smoothers
smoothers = approximating functions that attempt to capture important patterns in the data
Generalized additive models (GAMs):
Moving average:
Analyse data points by creating a series of averages of different subsets of the full data set.
Local regression (LOESS method):
> LOcal regrESSion
Fits simple models to localized subsets of the data to build up a function that describes its variation.
Splines
Split the data into bins and apply polynomial regression to each bin.
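A minimal sketch of a LOESS smoother in R, using the built-in cars data:

    data(cars)                              # speed vs. stopping distance
    plot(cars$speed, cars$dist)
    lo <- loess(dist ~ speed, data = cars)  # local regression on subsets of the data
    lines(cars$speed, predict(lo))          # add the smoothed curve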
Multifactorial ANOVA
Examines the influence of multiple categorical independent variables on one continuous dependent variable.
combining all treatment levels of the factors with each other
-> able to investigate interactions between factors
E.g.: Partial pressure of CO2 and temperature on phytoplankton net growth.
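A minimal sketch of such a two-factorial (crossed) ANOVA in R; the data frame phyto and its columns are simulated stand-ins for the example above:

    # hypothetical fully crossed design: every CO2 level combined with every temperature
    phyto <- expand.grid(co2 = factor(c("low", "high")),
                         temperature = factor(c("12C", "18C")),
                         rep = 1:5)
    phyto$growth <- rnorm(nrow(phyto))                  # placeholder response values
    m <- aov(growth ~ co2 * temperature, data = phyto)  # '*' fits main effects + interaction
    summary(m)                                          # the interaction term tests whether
                                                        # the CO2 effect depends on temperature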
orthogonality
independent variables that affect a dependent variable are uncorrelated
Nested designs:
Crossed vs. nested factors
Two factors are crossed when every category of one factor co-occurs in the design with every category of the other factor.
A factor is nested within another factor when each category of the first factor co-occurs with only one category of the other.
Mixed-effect modelling
Mixed-effect models consider not only fixed factors (your treatments) but also random factors (e.g. the geographic distribution of samples in the field) in your experiment.
Random factors contribute to the unexplained variation; mixed-effect models prevent them from reducing the test power of your analysis.
- useful for repeated measurements
- good in dealing with missing values
fixed effects vs. random
Fixed effects: CO2 partial pressure, nutrient concentration, light regime, predator presence/absence, sex
Random effects: genotype, plot within region, block within experiment, split plot within a plot
to apply mixed effects the data has to be
either crossed or nested
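A minimal sketch of a mixed-effect model in R, assuming the add-on package lme4 is installed; the data frame dat and its columns are hypothetical stand-ins:

    library(lme4)                                    # widely used mixed-model package
    dat <- data.frame(block = factor(rep(1:6, each = 4)),
                      treatment = factor(rep(c("ctrl", "fert"), 12)),
                      growth = rnorm(24))            # placeholder response values
    m <- lmer(growth ~ treatment + (1 | block), data = dat)  # fixed treatment effect,
                                                              # random intercept per block
    summary(m)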
Repeated measures ANOVA: pros and cons
Can handle data that stem from repeated sampling of identical sampling units.
+ reduced unexplained variation
+ reduced amount of sampling units (e.g. organisms)
- require a further assumption about the data: sphericity
- special statistical tests needed to analyse data
Most interesting bit in repeated measures: does my effect change over time?
Multiple regression
Modelling the relationship between a dependent variable and more than one independent variable (predictors).
ANCOVA
analysis of covariance
An ANOVA tests for the influence of one or more categorical independent variables on one dependent variable. Regression analysis tests for the influence of one or more continuous variables on one dependent variable.
Analysis of covariance (ANCOVA) is a method that combines ANOVA and regression. It measures the influence of one or more categorical predictors and one or more continuous independent variables on one dependent variable. The continuous independent variable is called the covariate and is normally not of interest itself.
[ANCOVA variance-partitioning (pie) plot]
the blue (unexplained) area gets divided into the variance explained by the error and the variance explained by the covariate
ANCOVA goal:
Estimate the influence of the covariate on the dependent variable.
By incorporating covariates in the model, the amount of unexplained variation is reduced.
⇨ increased test power due to noise reduction
⇨ smaller effects become visible
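A minimal sketch of an ANCOVA in R; the data frame dat and its columns final_size, initial_size and treatment are hypothetical stand-ins:

    # simulated stand-in data: initial size as covariate, treatment as factor
    dat <- data.frame(treatment = factor(rep(c("ctrl", "fert"), each = 15)),
                      initial_size = runif(30, 1, 5))
    dat$final_size <- dat$initial_size * 2 + rnorm(30)
    m <- aov(final_size ~ initial_size + treatment, data = dat)  # covariate entered first,
    summary(m)                                                   # so its variance is removed
                                                                 # before testing the factor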
Test for two samples: nominal
Test for two samples: ordinal
Tests for two samples: interval data
if your interval data are not normally distributed, you can switch to non-parametric tests (Mann-Whitney U, Wilcoxon)
-> transformation of interval to ranked data
Paired tests are powerful
Paired tests compare two sets of measurements to assess whether their population means differ. They reduce the amount of unexplained variability in the experiment.
Therefore, it’s easier to detect differences between treatments, i.e. higher test power.
χ2 test example
Presence or absence of snails in two tidal horizons (yes/no). H0: The proportion of snails is the same in the two tidal horizons.
independent samples
Mann-Whitney-U test: example
Germination of previously cooled or uncooled seeds. H0: There's no difference in the median of the two populations.
t-test: example
Effect of fertilizer on the canopy height of a crop. H0: There's no difference in the mean of the two populations.
Sign test: example
Fouling on structured and unstructured shells. H0: No difference in the number of more-fouled structured and more-fouled unstructured shells.
dependent samples
Wilcoxon signed rank test: example
Number of invertebrate larvae in the middle and at the edges of streams. H0: No median differences between members of pairs of larvae (from middle/edges).
Paired t-test: example
Fouling on structured and unstructured shells. H0: No difference in fouling rates between structured and unstructured shells.
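Minimal sketches of the R calls behind the example tests above, on simulated stand-in data (all variable names are made up):

    set.seed(1)
    a <- rnorm(10); b <- rnorm(10)               # two made-up sets of measurements
    chisq.test(table(sample(c("upper", "lower"), 40, TRUE),
                     sample(c("yes", "no"), 40, TRUE)))  # χ2 test on a contingency table
    wilcox.test(a, b)                            # Mann-Whitney U (unpaired Wilcoxon)
    t.test(a, b)                                 # unpaired t-test
    binom.test(sum(a - b > 0), sum(a - b != 0))  # sign test via binomial test on the signs
    wilcox.test(a, b, paired = TRUE)             # Wilcoxon signed rank test (paired)
    t.test(a, b, paired = TRUE)                  # paired t-test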
def mean squares
sum of squares / degrees of freedom
t-test definition
to get a significant effect -> you need more explained than unexplained variation
Difference between correlation and interaction
Correlation is the simplest form of interaction; an interaction is described by calculating statistics or making graphs.
Difference between correlation and regression
Correlation quantifies the strength of the linear relationship between a pair of variables, whereas regression expresses the relationship in the form of an equation.
assumptions of ANCOVA:
generally (as for ANOVA):
independent observations
normally distributed
homogeneity of variances
additionally:
homogeneity of regression slopes
linearity
covariate is measured without error