which of the following statements are true for correlation analysis
Correlation analyses can test for a linear relationship between variables
correlations do not prove causal relationships
correlation analyses measure how one variable changes with another
correlation analyses can test for linear and non-linear relationships
you have run a correlation analysis and found p (two-tailed) < 0.01, r = -0.95, and r² = 0.9. What does this tell you?
there is a strong negative correlation between the two variables
90% of the variance within the data can be explained by the model
the correlation is statistically significant
what are semi-partial correlation analyses
analyses in which a third variable is controlled for while correlating two others; the third variable's influence is removed from only one of the two correlated variables
what are possible problems arising from correlation analysis
with a high number of observations, even correlations with a low r-value can become significant
causal direction cannot be identified: it is not possible to tell which variable is influencing which
it can never be ensured that a third variable does not influence at least one of the two observed variables
Take home messages
ALWAYS think about your statistics before you collect data.
CHECK your data for outliers, normality, correlations, missing values, etc.!
Participate in stats-courses whenever possible, refresh your knowledge! PRACTICE!
GENERAL PROCEDURE IN R:
1) PLOT your data
2) DRAW the model
3) CHECK for:
a) homogeneity of variances
b) normality of errors
c) autocorrelation between residuals
d) influential data points
4) INTERPRET your model
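A minimal sketch of this procedure in R, using simulated data (all variable names and numbers are made up for illustration):

    set.seed(1)
    x <- runif(30, 0, 10)              # hypothetical predictor
    y <- 2 + 0.5 * x + rnorm(30)       # hypothetical response with noise

    plot(x, y)                         # 1) PLOT your data
    m <- lm(y ~ x)                     # 2) DRAW (fit) the model
    abline(m)
    par(mfrow = c(2, 2))
    plot(m)                            # 3) CHECK: residuals vs. fitted (homogeneity of
                                       #    variances), Q-Q plot (normality of errors),
                                       #    residuals vs. leverage (influential points)
    summary(m)                         # 4) INTERPRET your model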
Where do scientific questions come from?
- observing patterns
- deriving models
- formulating hypotheses
How to answer scientific questions?
- designing an experiment/observational study
- collecting data
- visualising/analysing the data
model
model = possible explanation for a driving factor of a pattern
pattern
pattern = decrease in chaos; requires energy & driving factor
explained and unexplained variation
unexplained variation = variation around the means (= noise) -> within-group variability
explained variation = variation of the means (= signal) -> between-group variability (difference between groups)
the signal can be covered by the noise (-> too much unexplained variation)
accuracy and precision
accuracy = measurements are as close to the actual value as possible
precision = repeated measurements are as close to each other as possible
Precision has to do with the resolution and quality of devices we use to obtain data, accuracy with the way we calibrate them.
What you want is high accuracy (strong signal) and high precision (low noise) to obtain a high test power.
test power
test power is the likelihood of finding an effect, if there is one
Data-quality:
- has different levels (nominal < ordinal < interval)
- interval data contain the highest amount of information and give the highest test power
- Example: limpets in intertidal horizons
variables
independent variables = what you manipulate, also explanatory variable, predictor variable, factor
dependent variables = what you measure, also response variable, explained variable
- the dependent variable should be a function of the independent variable
- can be categorical or continuous
- type of variables determines which statistic to use
difference between population and sample
population = totality of all units characterised by a variable
sample = analysed part of the population
A population corresponds to the real world (unmeasurable), whereas a sample is an approach to describe the real world.
unit
unit = sampling unit = replicate = parallel
statistical population
statistical population = population of sampling units
treatment level
treatment level = experimental group
what should replicates be
Replicates…
- need to be independent
- shouldn’t be repeated measures
- shouldn’t be grouped together at one place
- should be of an appropriate spatial scale
Else you get pseudoreplication!
statistic and parameter
statistic = measure of some attribute of a sample, e.g. the sample mean; it estimates a parameter (the population mean)
parameter = measure of some attribute of a population, e.g. the population mean
mean and median
mean = sum of all values / number of values (x̄ = Σxᵢ / n); influenced by outliers
median = the observation that has equal numbers of observations above and below it, good for non-normal data
Box-Whisker plots:
- are separated into four quartiles (25% of the data points each)
- show the interquartile range (Q3 - Q1) as a box
- are the best way to summarise your data graphically, because they indicate the distribution of the data
- good for non-normal data (use the median)
variance
Variance = sum of squared distances from mean / degrees of freedom
Measures how far a set of numbers (replicates) are spread out from their mean (variability around the mean).
Standard deviation:
Measures the amount of variation of a set of data values.
Software uses variances for calculating. We use SD for communicating.
Standard error
Standard error (of the mean):
Estimates the reliability of a sample statistic (most commonly of the mean). It is the standard deviation of sample means.
Driven by variability of population and number of replicates (n).
How to calculate the standard error:
calculate means of samples -> calculate mean of means -> SD for mean of means = SE
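In practice the SE is usually computed directly from a single sample as SD / √n, which estimates the SD of the sample means described above. A minimal sketch in R, with made-up values:

    x <- c(4.2, 5.1, 4.8, 5.6, 4.9)   # one sample of replicates (hypothetical values)
    se <- sd(x) / sqrt(length(x))      # SE = SD / sqrt(n)
    se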
Coefficient of variation
CV = SD / mean; a unitless measure of relative variability, often expressed in %
Confidence interval
Indicates the precision of an estimated parameter (e.g. the population mean).
It gives you the probability (most often 95%, confidence level) with which the true population parameter lies inside the borders of the calculated interval.
Factors affecting the width of the confidence interval are sample size, confidence level and variability in the sample.
- in a normal distribution, 95% of all values lie within ±1.96 SD of the mean
- t = correction factor for small samples (n < 100)
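A minimal sketch of a 95% confidence interval in R, using the t correction for a small sample (values are made up):

    x <- c(4.2, 5.1, 4.8, 5.6, 4.9)        # hypothetical sample
    n <- length(x)
    se <- sd(x) / sqrt(n)
    t_crit <- qt(0.975, df = n - 1)         # t replaces 1.96 for small n
    mean(x) + c(-1, 1) * t_crit * se        # lower and upper bound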
core of statistical hypothesis testing
test statistic = explained variation / unexplained variation (signal / noise)
- smaller than 1: the result is not significant
- the bigger the test statistic, the higher the test power
- it is influenced by the effect size, the unexplained variation and the sample size
t-test
can only handle two samples
getting a large t-value:
1) avoid noise
2) increase effect size
3) increase sample size
Student’s t-distribution:
It is a continuous probability distribution that is strongly related to the standard normal distribution and was developed to deal with low sample sizes. The difference is that it does not relate to the whole population, but only to a sample. This is because the population standard deviation is almost always unknown, so the sample standard deviation is used instead. Thus, there are multiple t-distributions (one per sample size) for a population, and all of them have a higher variance than the standard normal distribution (except for a t-distribution with infinite degrees of freedom, which equals the standard normal distribution).
To construct a 95% confidence interval for a normal distribution, the critical value is 1.96. For a t-distribution it will be greater than 1.96, due to the distribution's greater variance.
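This is easy to verify in R by comparing the critical values of the two distributions:

    qnorm(0.975)           # 1.959964 (standard normal distribution)
    qt(0.975, df = 5)      # 2.570582 (wider for a small sample)
    qt(0.975, df = 10000)  # approaches 1.96 as the degrees of freedom grow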
different type of errors
Type I Error: Rejecting the H0 although it’s right. We believe there is an effect, but there is not. > We see something, where there is nothing.
Type II Error: Accepting the H0 though it’s wrong. We believe there is no effect, but there is one. > We overlook the effect.
one and two tailed tests:
A directional hypothesis is called one-tailed.
A non-directional hypothesis is called two-tailed.
advantage/disadvantage of one-tailed test
Advantage:
> you increase test power
Disadvantage:
> you'll miss the effect in case it goes in the other direction
correlation
Correlations quantify how strongly two variables covary with each other.
correlation coefficient
-> r
- measures the degree of correlation
- ranges from -1 to +1 | perfect negative to perfect positive correlation
Most common one:
Pearson correlation coefficient:
= measure of the linear correlation between two variables x and y
rank correlation
Spearman’s rank correlation coefficient:
= measure of the correlation between the ranking of two variables
- requires ordinal data (ranks)
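A minimal sketch of both coefficients in R, using the built-in trees data set:

    data(trees)                                               # girth and height of 31 trees
    cor.test(trees$Girth, trees$Height, method = "pearson")   # linear correlation: r and p
    cor.test(trees$Girth, trees$Height, method = "spearman")  # rank correlation: rho and p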
Partial correlations:
Reveal the unique variance explained by one variable, while controlling for a third variable.
different statistical tests: for comparing proportions -> e.g. the χ2 test
univariate analysis
stat tests for comparing medians -> e.g. Mann-Whitney U, Wilcoxon signed rank, Kruskal-Wallis test
stat tests for comparing means -> e.g. t-test, paired t-test, ANOVA
statistical tests for independent variables
multivariate analysis
● ANOSIM (analysis of similarity)
● PERMANOVA (permutational multivariate analysis of variance)
● ordination techniques and related methods, like PCA or MDS
main: statistical modelling
models express a mechanistic understanding of the explanatory variables
should be:
accurate
convenient
adequate (explain sufficient amount of data)
minimal (high explanatory power) -> estimating as few parameters as possible
can contain
categorical factors
interactions between factors
continuous covariates
purpose of any model
minimise error term
linear models
They follow the straight-line equation: y = a + b·x + ε (intercept a, slope b, error term ε).
The error term is quantified by the residuals: residual = observed value − fitted value.
general linear models vs. generalized linear models (both abbreviated GLM)
The term general linear model (GLM) usually refers to conventional linear regression models for a continuous response variable given continuous and/or categorical predictors. It includes multiple linear regression, as well as ANOVA and ANCOVA (with fixed effects only).
The term generalized linear model (GLM) refers to a class of models in which the response variables have error distributions that do not follow the normal distribution, but an exponential family distribution (e.g. the binomial, Poisson or gamma distribution).
simple linear regression
In regression analysis we can test for the influence of one or more continuous variables on one dependent variable.
A simple linear regression (SLR) is a regression model with one response and one explanatory variable (both continuous).
method of least squares
Intuitively, the best-fitting line would be the one that minimizes the sum of the absolute values of the residuals. In practice, however, it is the line with the lowest sum of squared residuals, which makes the estimate less prone to errors.
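A minimal sketch in R showing that the least-squares formulas give the same estimates as lm(), using the built-in mtcars data:

    x <- mtcars$wt; y <- mtcars$mpg
    b1 <- cov(x, y) / var(x)        # least-squares slope = covariance / variance of x
    b0 <- mean(y) - b1 * mean(x)    # intercept through the means
    c(b0, b1)
    coef(lm(y ~ x))                 # identical estimates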
total sum of squares
total sum of squares (SST) = squared differences between the mean and the observed values
SSM = SST – SSR
residual sum of squares
(SSR) = squared differences between the line of best fit and the observed values
model sum of squares
(SSM) = the increase in accuracy by replacing the simplest model by the best fit model
explained variation:
Coefficient of determination (R2)
Pearson's correlation coefficient (r or R)
Coefficient of determination (R2) is the fraction of variance in the dependent variable that’s explained by the independent variable(s)
| R2 * 100 = explained variation in percent
Pearson’s correlation coefficient (r or R) is a measure of the linear correlation between two variables X and Y
F-ratio
The F-ratio is a test statistic that measures the ratio of the variation explained by the model to the unexplained variation (the difference between the model and the observed data), using mean squares: F = MSM / MSR
adjusted R2
The adjusted R2 takes into consideration how many predictors are in the model and how many data points are in the data set.
linear regression
For linear regression, we assume that…
- residuals are normally distributed,
- the variance in the residuals is constant,
- the residuals are not correlated,
- there are no influential data points.
One-factorial ANOVA
A one-way analysis of variance (ANOVA) compares means of three or more samples to each other. It tests for the influence of a categorical independent variable on a dependent variable.
A series of pairwise t-tests does the same, but it would increase the probability of a type I error and also the number of hypotheses.
ANOVA
ANOVA = compares the variability between samples with the variability within samples
H0 of an ANOVA:
Sample means and underlying population means do not differ from each other.
output of ANOVA
An ANOVA partitions the total sum of squares into two components:
SST (total sum of squares) = SSR (sum of squares within samples; error/residual) + SSM (sum of squares between samples; treatment/model)
mean squares
MS = SS / df (sum of squares / degrees of freedom)
F-statistic
- F larger than 1: more explained than unexplained variation
- the bigger 'n – 1' gets, the larger F gets -> large experiment = high test power
also for ANOVA: R2 = explained variation / total variation
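A minimal sketch of a one-way ANOVA in R, using the built-in PlantGrowth data (one categorical factor with three levels):

    data(PlantGrowth)                              # plant weight under ctrl, trt1, trt2
    m <- aov(weight ~ group, data = PlantGrowth)
    summary(m)                                     # F = MSM / MSR and the p-value
    summary(lm(weight ~ group, data = PlantGrowth))$r.squared  # R2 = explained / total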
ANOVA for an unbalanced design: PERMANOVA
For a one-factorial ANOVA, we assume that
- the samples are independent random samples,
- the populations are normally distributed,
- the population variances are equal.
Assumptions of ANOVA:
1. Independency of replicates
2. Independency of samples
3. Normal distribution
4. Populations have common variance
5. Additive factor effects
-> shows if there is a difference, but not where exactly
-> compares the means of treatments (explained and unexplained variation)
response variables
nominal: yes or no
interval: counting
imbalanced design
if you lose replicates (imbalanced design), rebalance the design (kick out replicates or a whole sample).
Post-hoc tests
With one-factorial ANOVAs, Kruskal-Wallis and median tests, one only tests whether or not there are significant differences in a group of means. Post-hoc testing does pairwise (or groupwise) mean comparisons that indicate which means (or groups of means) differ from one another.
Post-hoc testing – the shotgun approach:
Problem:
- comparing multiple means to one another increases the type I error rate
Solution:
- the Bonferroni correction:
method to counteract the problem of multiple comparisons
this, of course, increases the type II error rate!
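A minimal sketch of Bonferroni-corrected pairwise comparisons in R, reusing the built-in PlantGrowth data from the ANOVA example:

    data(PlantGrowth)
    pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                    p.adjust.method = "bonferroni")   # p-values multiplied by the
                                                      # number of comparisons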
orthogonal contrast
linear combination of parameters or statistics whose coefficients add up to 0
ANOVA vs. regression
ANOVA: qualitative questions; the independent variable is categorical; doesn't expect a relationship between x & y; lower test power
Regression: quantitative questions; the independent variable is continuous; expects a linear relationship between x & y; higher test power
Regression = best case scenario
> more powerful than ANOVA
> more information content than ANOVA
smoothers
smoothers = approximating functions that attempt to capture important patterns in the data
Generalized additive models (GAMs):
Moving average:
Analyse data points by creating a series of averages of different subsets of the full data set.
Local regression (LOESS method):
> LOcal regrESSion
Fits simple models to localized subsets of the data to build up a function that describes its variation.
Splines
Split the data into bins and apply polynomial regression to each bin.
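A minimal sketch of a LOESS smoother in R, using the built-in cars data:

    data(cars)                              # speed vs. stopping distance
    plot(cars$speed, cars$dist)
    lo <- loess(dist ~ speed, data = cars)  # local regression on subsets of the data
    lines(cars$speed, predict(lo))          # add the smoothed curve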
Multifactorial ANOVA
Examines the influence of multiple categorical independent variables on one continuous dependent variable.
combining all treatment levels of the factors with each other
-> able to investigate interactions between factors
E.g.: Partial pressure of CO2 and temperature on phytoplankton net growth.
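A minimal sketch of such a two-factorial (crossed) ANOVA in R; the data frame phyto and its columns are simulated stand-ins for the example above:

    # hypothetical fully crossed design: every CO2 level combined with every temperature
    phyto <- expand.grid(co2 = factor(c("low", "high")),
                         temperature = factor(c("12C", "18C")),
                         rep = 1:5)
    phyto$growth <- rnorm(nrow(phyto))                  # placeholder response values
    m <- aov(growth ~ co2 * temperature, data = phyto)  # '*' fits main effects + interaction
    summary(m)                                          # the interaction term tests whether
                                                        # the CO2 effect depends on temperature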
orthogonality
independent variables that affect a dependent variable are uncorrelated
Nested designs:
Crossed vs. nested factors
Two factors are crossed when every category of one factor co-occurs in the design with every category of the other factor.
A factor is nested within another factor when each category of the first factor co-occurs with only one category of the other.
Mixed-effect modelling
Mixed-effect models consider not only fixed factors (your treatments) but also random factors (e.g. the geographic distribution of samples in the field) in your experiment.
Random factors contribute to the unexplained variation; mixed-effect models prevent them from reducing the test power of your analysis.
- useful for repeated measurements
- good in dealing with missing values
fixed effects vs. random
Fixed effects: CO2 partial pressure, nutrient concentration, light regime, predator presence/absence, sex
Random effects: genotype, plot within region, block within experiment, split plot within a plot
to apply mixed effects the data has to be
either crossed or nested
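A minimal sketch of a mixed-effect model in R, assuming the add-on package lme4 is installed; the data frame dat and its columns are hypothetical stand-ins:

    library(lme4)                                    # widely used mixed-model package
    dat <- data.frame(block = factor(rep(1:6, each = 4)),
                      treatment = factor(rep(c("ctrl", "fert"), 12)),
                      growth = rnorm(24))            # placeholder response values
    m <- lmer(growth ~ treatment + (1 | block), data = dat)  # fixed treatment effect,
                                                              # random intercept per block
    summary(m)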
Repeated measures ANOVA: pros and cons
Can handle data that stem from repeated sampling of identical sampling units.
+ reduced unexplained variation
+ reduced amount of sampling units (e.g. organisms)
- require a further assumption about the data: sphericity
- special statistical tests needed to analyse data
Most interesting bit in repeated measures: does my effect change over time?
Multiple regression
Modelling the relationship between a dependent variable and more than one independent variable (predictors).
ANCOVA
analysis of covariance
An ANOVA tests for the influence of one or more categorical independent variables on one dependent variable. Regression analysis tests for the influence of one or more continuous variables on one dependent variable.
Analysis of covariance (ANCOVA) is a method that combines ANOVA and regression. It measures the influence of one or more categorical predictors and one or more continuous independent variables on one dependent variable. The continuous independent variable is called the covariate and is normally not of interest itself.
[ANCOVA variance-partitioning (pie) plot]
the blue (unexplained) area gets divided into the variance explained by the error and the variance explained by the covariate
ANCOVA goal:
Estimate the influence of the covariate on the dependent variable.
By incorporating covariates in the model, the amount of unexplained variation is reduced.
⇨ increased test power due to noise reduction
⇨ smaller effects become visible
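A minimal sketch of an ANCOVA in R; the data frame dat and its columns final_size, initial_size and treatment are hypothetical stand-ins:

    # simulated stand-in data: initial size as covariate, treatment as factor
    dat <- data.frame(treatment = factor(rep(c("ctrl", "fert"), each = 15)),
                      initial_size = runif(30, 1, 5))
    dat$final_size <- dat$initial_size * 2 + rnorm(30)
    m <- aov(final_size ~ initial_size + treatment, data = dat)  # covariate entered first,
    summary(m)                                                   # so its variance is removed
                                                                 # before testing the factor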
Test for two samples: nominal
Test for two samples: ordinal
Tests for two samples: interval data
if your interval data are not normally distributed, you can switch to non-parametric tests (Mann-Whitney U, Wilcoxon)
-> transformation of interval to ranked data
Paired tests are powerful
Paired tests compare two sets of measurements to assess whether their population means differ. They reduce the amount of unexplained variability in the experiment.
Therefore, it’s easier to detect differences between treatments, i.e. higher test power.
χ2 test example
Presence or absence of snails in two tidal horizons (yes/no). H0: The proportion of snails is the same in the two tidal horizons.
independent samples
Mann-Whitney-U test: example
Germination of previously cooled or uncooled seeds. H0: There's no difference in the median of the two populations.
t-test: example
Effect of fertilizer on the canopy height of a crop. H0: There's no difference in the mean of the two populations.
Sign test: example
Fouling on structured and unstructured shells. H0: No difference in the number of more-fouled structured and more-fouled unstructured shells.
dependent samples
Wilcoxon signed rank test: example
Number of invertebrate larvae in the middle and at the edges of streams. H0: No median differences between members of pairs of larvae (from middle/edges).
Paired t-test: example
Fouling on structured and unstructured shells. H0: No difference in fouling rates between structured and unstructured shells.
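Minimal sketches of the R calls behind the example tests above, on simulated stand-in data (all variable names are made up):

    set.seed(1)
    a <- rnorm(10); b <- rnorm(10)               # two made-up sets of measurements
    chisq.test(table(sample(c("upper", "lower"), 40, TRUE),
                     sample(c("yes", "no"), 40, TRUE)))  # χ2 test on a contingency table
    wilcox.test(a, b)                            # Mann-Whitney U (unpaired Wilcoxon)
    t.test(a, b)                                 # unpaired t-test
    binom.test(sum(a - b > 0), sum(a - b != 0))  # sign test via binomial test on the signs
    wilcox.test(a, b, paired = TRUE)             # Wilcoxon signed rank test (paired)
    t.test(a, b, paired = TRUE)                  # paired t-test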
def mean squares
sum of squares / degrees of freedom
t-test definition
to get a significant effect -> you need more explained than unexplained variation
Difference between correlation and interaction
Correlation is the simplest form of interaction; an interaction is described by calculating statistics or making graphs.
Difference between correlation and regression
Correlation quantifies the strength of the linear relationship between a pair of variables, whereas regression expresses the relationship in the form of an equation.
assumptions of ANCOVA:
generally (as for ANOVA):
independent observations
normally distributed
homogeneity of variances
additionally:
homogeneity of regression slopes
linearity
covariate is measured without error