What are the five steps of theory construction (methodology)?
Identifying relevant phenomena
Should be generalizations which fulfill the following criteria:
robust
stable
reproducible
Finding a good, stable phenomenon is very important, since it is the basis of the theory
Better to choose a boring but stable phenomenon than a spectacular one
Observation: People go to sleep when it gets dark
Formulate a prototheory
Done with abductive reasoning
Based on a small set of generally valid principles which could explain the phenomena
Normally represented verbally at this step
Idea: darkness makes people tired
Develop a formal model
Generally valid principles are translated into rules or equations
Not the same as a data model
Should help transcend cognitive limitations
The longer it is dark, the more tired people get
Check the adequacy of the formal model
Does it work? If it doesn’t, go get more data
Evaluate the overall worth of the constructed theory
How are latent growth models and general structural equation models (SEM) different?
Latent growth models are part of SEM BUT
LG models specifically model change over different time points
SEM more generally models relationships between observed and latent variables (or latent and latent, or observed and observed), which can be causal paths or factor loadings
What are intercepts broadly speaking and what is their meaning in
Regression Models
Latent Growth Models
General SEM Models
Defining features of intercepts across all models
General terms
reflects the constant value which is added to variables
Provides a reference point for understanding the baseline or average level of a variable when the other predictors in the model are zero or have no effect
Intercepts in regression Models
B0
Predicted value of the dependent variable if all predictors/independent variables (x1, x2, etc.) are zero
Intercepts in latent Growth Models
Represents the initial level or starting point of an individual's trajectory over time
Captures individual differences in initial levels across participants when random variations are allowed
there might also be an intercept for some or all observed variables, representing their deviation above or below the trajectory implied by the latent intercept
Intercepts in general SEM Models
Each observed variable has an intercept, representing its expected value when all predictors (direct paths towards it) are zero
Intercepts contribute to the mean structure in SEM, enabling it to represent non-zero means across variables
Example in CFA model on personality traits:
Intercepts represent average item response or latent trait means
How are theory, data and phenomena related to each other?
Theory - phenomena
Theory explains/predicts phenomena
which is made visible by data
Theory is abducted from phenomena
If the world is how theory A says it is, then phenomena A must be true
Phenomena - data
Data is generalized into a phenomenon
A phenomenon predicts (a specific pattern in) data
Data offers evidence for the existence of phenomena
Explain the differences between theories and models.
Theory
Conceptual framework
Explains or predicts phenomena
Model
Simplified, formal representation of a specific relationship
Derived from theory, but NOT a theory itself
Tests, explores and refines theories
Provides a mathematical/empirical framework to operationalize theoretical constructs
Why should we formalize theories as statistical/computational models?
Models clarify the understanding of complex phenomena
Theories can be vague and imprecise
Models translate those conceptual frameworks into mathematical or computational form, forcing
precision
clarity
Benefit
clarifies assumptions
allows objective scrutiny
Helps identify which parts of a theory are supported by data
Models facilitate iterative theory development
Models serve as a tool to
test theories
explore theories
Refine theories
When model predictions diverge from empirical results, it drives the refinement/rejection of theoretical assumptions
Benefit: promotes continuous improvement and understanding of the world
Models make testable predictions
Theories are generalized statements about relationships
Models allow us to make quantitative predictions
This offers testable and falsifiable predictions
Models quantify how much one variable affects another
Benefits: offer a framework to test hypotheses and validate theoretical claims with empirical data
Models allow generalization and prediction of future outcomes
Models use historical data to predict future observations
Benefit: useful for forecasting in both research and applied settings
Models inform intervention and policy decisions
Theories suggest how variables should be related
Models can simulate the effects of interventions and predict how those changes will affect outcomes
Benefit: help policymakers, clinicians or researchers evaluate the impact of interventions before implementing them -> better informed decisions
Explain the differences between predictive and causal models
Predictive models
Predict outcomes in similar contexts with high accuracy without needing to understand causality (A and B often occur together; why they occur together is not relevant)
Enough for pure prediction
Problematic when it comes to intervention
Causal models
Generalize predictions to new outcomes
Causal drivers make for more effective interventions
What is data?
Observations and measurements from the real world
Starting point of most theories
What is model structure?
Mathematical framework/equation which broadly describes how observations are related
Example: simple regression model with
y = b0 + b1 x Temperature + e
no specific values for the parameters yet, but there is a slope, a baseline (intercept), etc.
General framework of how observations are related
What are model parameters?
The values of the parameters which best explain the data are estimated (i.e. what is the numerical value of the slope?)
Model fit procedures adjust the parameters to minimize the difference between model predictions and actual data
Criterion: typically done by minimizing an objective function like the sum of squared errors (SSE) or maximizing the likelihood
SSE = sum((observed - predicted)^2)
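A minimal R sketch of this (the data and object names are made up for illustration):
set.seed(1)
d <- data.frame(x = 1:10)
d$y <- 2 + 0.5 * d$x + rnorm(10)       # simulated observations
fit <- lm(y ~ x, data = d)             # lm() estimates b0 and b1 by least squares
sse <- sum((d$y - predict(fit))^2)     # the SSE criterion that the fit minimizes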
Which are the key components of statistical models?
Data
Model structure
Model parameters
What are nested observations?
Grouped observations (couples, families, school classes, repeated measurements within an individual)
Ignoring nesting means assuming independence between observations, leading to:
Biased standard errors (Type I and Type II errors)
Inaccurate effect estimation
Observations within the same group are correlated
Random Effects in Linear Mixed Models account for this dependency by modelling within-group variability
Accounting for nesting allows us to distinguish population-level trends from individual- or group-level deviations, leading to more
Accurate predictions
Inference
Theories
What are linear models?
Include fixed effects
Capture the relationship between predictors and outcomes
Fixed effects (betas): average effects of predictors across all individuals
eij: residual error for the j-th observation of individual i, assumed to be
independent
normally distributed
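A generic form of this model (not written out in these notes; assuming a single predictor x): y_ij = b0 + b1 x x_ij + e_ij, where b0 and b1 are the fixed effects and e_ij is the residual error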
What are linear mixed effects models?
Models which extend traditional linear models by incorporating random effects to account for individual- or group-level variability
Can include only a random intercept, or also a random slope
What are the characteristics of a random intercept in a linear mixed effects model? How is it different from the residual error?
Formula for LMM with solely a random intercept:
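A plausible form of this formula (assuming a single predictor x): y_ij = b0 + u0i + b1 x x_ij + e_ij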
u0i: random effect specific to individual i
The residual error varies for every individual and every observation, while the random intercept only varies across individuals but is constant across observations
The random intercept captures a potentially interesting difference (it might, for instance, signify a different baseline level in different classes), while the residual error is noise (hence it has to vary across observations and cannot be constant; it wouldn't be random otherwise)
It stands for an individual's baseline level, which deviates from the overall trend (and is thus added to the fixed intercept)
What are the characteristics of a random slope in a linear mixed effects model?
Formula for LMM with both random intercept and random slope
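A plausible form of this formula (assuming a single predictor x1): y_ij = (b0 + u0i) + (b1 + u1i) x x1_ij + e_ij, where u1i is the random slope deviation for individual i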
The random slope allows the effect of predictor X1 to vary across individuals
Which distribution are the random effects in Linear Mixed Effects Models expected to follow?
Multivariate normal distribution
bi = vector of random effects for individual or group i
N = multivariate normal distribution
0 = mean vector, assumed to be zero because the assumption is that most people do not deviate from the average trend
G = Covariance matrix
What is a multivariate normal distribution?
If multiple normally distributed variables are laid over each other (their joint distribution)
Highest point: where all variables "cross" (all are near their means)
Outer regions are, for instance, people who fall within the normal range for two variables, but not for the third
What can be said about the covariance matrix G in relation to random effects in Linear Mixed Effects Models?
Covariance matrix of the random effects includes
Variance (diagonal): how much the random effects vary from individual to individual (how strongly can the intercept or the slope differ from the average for any given individual?)
Covariance of random intercept and slope: how (if at all) are those random effects correlated with each other?
G: the deviations follow a structured pattern
G can include both random intercepts and random slope, allowing for flexibility in modelling individual or group level variability
sigma^2_intercept: variance of the random intercept
sigma^2_slope: variance of the random slope
sigma_intercept,slope: covariance between random intercept and slope
Positive: individuals with a higher than average intercept also have a higher than average slope
Their starting point is higher than average and increases/decreases "faster"
Negative: individuals with a higher than average intercept tend to have a lower than average slope
their starting point is higher, but it grows/diminishes more slowly
Close to zero: likely no relationship between slope and intercept
Correlation between random intercept and slope helps us understand whether people with higher baseline outcomes also show greater or lesser sensitivity to predictors (random slopes)
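With one random intercept and one random slope, G is a 2 x 2 matrix (a sketch of the layout described above):
G = [ sigma^2_intercept        sigma_intercept,slope
      sigma_intercept,slope    sigma^2_slope ]
In R (lme4), the estimated variances, covariance and correlation of the random effects can be inspected with VarCorr() on a fitted model, e.g. VarCorr(LMM2) for the model fitted two cards below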
How can a (simple) linear model be created in R?
Linearm <- lm(formula = Reaction ~ Days, data = sleepstudy)
lm = linear model
Reaction ~ Days: Reaction predicted by Days
data = sleepstudy: used data
summary(Linearm)
How can a linear mixed effects model with a random intercept be created in R?
LMM1 <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
lmer = function for linear mixed (effects) models, from the lme4 package
(1 | Subject): the 1 means the intercept is allowed to vary, grouped by Subject
summary(LMM1)
How can a linear mixed effects model with a random intercept and a random slope be created in R?
LMM2 <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(LMM2)
What are SEM/Structural Equation Models?
Framework that integrates
Factor Analysis: modeling relationships between observed and latent variables
Linear Regression/Path models: modeling causal relationships among variables (observed and latent)
-> encompasses measurement models (relating observed to latent variables) and structural models (relating latent to latent variables)
In SEM diagrams, factor models are represented with
ovals for latent variables
rectangles for observed variables
Why are SEM/structural equation models useful?
Allows for testing complex relationships between multiple
Dependent and independent variables
Latent and manifest variables
Major reasons for its use in psychology are:
Ability to use
multiple variables
noisy variables
observed variables
to estimate a latent variable
The use of multiple dependent variables sometimes better represents a theoretical idea than any individual indicator
The ability to include more than one type of dependent variable in the model allows for models that represent entire theories rather than small pieces
Visual representation sometimes helps to understand implications and correct problems
How should SEM/structural equation models be interpreted?
Path coefficients (arrows)
represent the strength and direction of a relationship between variables
Direct effect: the effect of one variable directly on another
Analogous to regression coefficients -> represent how much one variable changes in response to a change in another
Direct effects from a latent factor to an indicator (manifest variable) are sometimes called factor loadings (as in factor analysis)
Indirect effect: the effect of one variable on another through a mediator
Covariances (unstandardized) and correlations (standardized)
Variances
Which assumptions do SEM/structural equation models make?
Linearity: relationships between variables are linear
Multivariate Normality: Residuals should be normally distributed (most near the expected value)
Independence of Residuals: residuals should not contain information about other residuals
No measurement error in predictors: assume no error in the measurement of exogenous variables (predictors) unless explicitly modeled
How do SEM/structural equation models work?
Compare observed mean and covariance matrix to model-implied mean and covariance matrix, given estimated parameters
Evaluate fit: does the model account for the observed means, variances and covariances
Typical basis of comparison: “saturated” (freely estimated) means and covariance matrix
Chi-square difference tests are typical for this, but many fit indices exist (a non-significant result = good fit)
What is the general goal of SEM/Structural equation models (or also statistical models in general)?
Enough parameters to represent relationship between data
Relate parameters to theoretical concerns
Avoid “over-fitting” -> minimum number of parameters necessary to explain data
How do you fit a structural equation model that represents change in reaction time without any random effects in R with lavaan?
Specify the simple linear growth model (model structure without model parameters)
name <- '
i =~ 1*Day0 + 1*Day2 + 1*Day4 + 1*Day9
-> this defines the intercept factor; it loads on all four measurement days with the same weight of 1
-> these are the measurement occasions (intervals), not the data recorded on those days
s =~ 0*Day0 + 2*Day2 + 4*Day4 + 9*Day9
-> the slope loadings "grow": the slope has no effect on day 0 of sleep deprivation, twice its effect on day 2, four times on day 4, etc.
i ~ imean * 1
s ~ smean * 1
-> asks lavaan to also estimate the intercept and slope means
Day0 ~~ residualVar * Day0
Day2 ~~ residualVar * Day2
Day4 ~~ residualVar * Day4
Day9 ~~ residualVar * Day9
-> all days get the same residual variance, which captures whatever variability was not yet captured
'
Fit the model to the data -> now adding actual data to the structure
fit_1 <- lavaan(name, data = sleepstudy)
specify structure + dataset
summary(fit_1)
What is the output of a summary of the fit between a structural equation model and data?
Model Test User Model:
Test statistic -> chi-square, higher value = worse fit
Degrees of freedom
p-value (of the chi-square; below .05 = bad fit)
Intercept and slope with p-values
Variances with p-values -> how much of the observed variance is NOT explained by the model -> lower is better
How do you create a path diagram on a fitted model?
Graph layout: specify what goes where:
graph <- matrix(c(NA, NA, NA, 's',
                  'i', NA, NA, NA,
                  'Day0', 'Day2', 'Day4', 'Day9'), ncol = 4, byrow = TRUE)
-> the first row is the visual layout with 4 positions, where s sits completely on the right
-> ncol = 4 means four columns (which makes sense, because there are 4 elements per row)
This is a general layout with no values inserted yet
Insert actual values
graph_sem(model = fit_1, layout = graph, spacing_y = 2, variance_diameter = .3)
graph_sem = function (from tidySEM)
model = fit_1 -> the fitted model with structure and data, which was fitted before
layout = the layout which was specified before with matrix()
spacing_y = how much space vertically
variance_diameter = size of the variance circles
How do you add a
random slope
random intercept
covariance between random intercept and random slope
Zero covariance between random intercept and slope
in a structural model equation in R (with lavaan)?
random slope: s ~~ s
random intercept: i ~~ i
covariance between random intercept and random slope: i ~~ s
Zero covariance between random intercept and slope: i ~~ 0*s
How do you compare two structural equation models with R?
lavTestLRT(fit1, fit2)
How can this result of the comparison of two structural equation models with lavTestLRT be interpreted?
fit4 has a slightly smaller AIC and BIC -> better fit
Their Chisq values are quite similar, with fit3 showing a smaller chi-square (and thus a somewhat better raw fit)
Chisq diff: difference in Chisq between fit3 and fit4
Df diff: difference in degrees of freedom; more df means fewer free parameters -> fit4 is the simpler model
Pr(>Chisq) = 0.3697, which is bigger than 0.05 -> insignificant, i.e. the difference in fit between the models is not statistically significant
-> the two models are not statistically significantly different, but fit4 is simpler and might therefore be preferred
What do path coefficients represent in Structural equation models and which distinction needs to be made?
Path coefficients represent the strength and direction of relationships between variables, similar to regression coefficients (betas) in standard linear models
When predictors are uncorrelated, each path coefficient represents the direct effect of that predictor on that specific variable (DV)
e.g. the influence of one predictor (e.g. hours of sleep) on the dependent variable (e.g. concentration) does not depend on another predictor (e.g. what someone ate)
When predictors are correlated, the path coefficients represent the unique direct effect after accounting for the covariance with the other predictor
e.g. the path sleep -> concentration represents the effect of sleep alone, while the effects shared with, for instance, mood are partialled out of this path (and accounted for elsewhere)
What are variance and covariance?
Variance: the spread of a variable, i.e. how far the data strays from the expected value (mean)
sqrt(variance) = standard deviation
±1 SD around the mean covers about 68% of the data (for a normal distribution)
Covariance: how two variables change together, i.e. whether one variable increases when another increases
For instance the relationship between x and y
If they are uncorrelated, the covariance is 0
Correlation: standardized, takes on values between -1 and 1 (covariance can range from negative to positive infinity)
How does variance change based on whether predictors are correlated or uncorrelated (exemplified with the money distribution allegory)?
Experiment:
Both groups get two random amounts of money (with a mean of 100 and a SD of 10) -> normal distribution
One group draws lots twice (independent/uncorrelated)
correlation = 0, Covariance = 0
Another group draws lots once and then gets the exact same amount they have drawn a second time in the second round -> perfectly correlated/covariance
Correlation = 1.0, Covariance = 100
-> Both groups will receive (approximately) the same amount of money on average, but the variance of the totals in the second group will be considerably bigger
How is variance calculated when predictors are correlated vs. when they are uncorrelated?
Variance of a single variable: sum of (actual value - mean)^2 / number of included values
Variance of a sum of two draws (general case):
Variance draw 1 + Variance draw 2 + 2 x Covariance(draw 1, draw 2)
-> if the draws are uncorrelated, the covariance is 0 and the total is simply the sum of the two variances
-> if the draws are correlated, the 2 x covariance term adds additional variance (with perfect correlation here: 2 x 100 = 200 extra); see the simulation sketch below
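A small simulation sketch of the allegory (illustrative; variable names are made up):
set.seed(42)
n <- 100000
draw1 <- rnorm(n, mean = 100, sd = 10)
draw2 <- rnorm(n, mean = 100, sd = 10)   # independent second draw
var(draw1 + draw2)                       # close to 200 (= 100 + 100)
var(draw1 + draw1)                       # close to 400 (= 100 + 100 + 2 x 100)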
What do the single components of this SEM represent?
Triangles
on top: intercept/starting point
Triangle on the side: weight of other constants, e.g. here how much money is already in the wallet
Circles:
Predictors, here the two draws of money
have their respective variance on the side
Path coefficients
100 = expected mean value
b1 = 1.0 -> weight of the path coefficient, similar to betas
Dashed line: in some cases covariance, in others no covariance
Square: Result, dependent variable
Resvar = residual variance, unexplained
How would the total variance of y be calculated based on this model?
Var(y) = var(x1) x b1^2 + var(x2) x b2^2 + 2 x cov(x1, x2) x b1 x b2 + resvar(y)
= 100 x 1^2 + 100 x 1^2 + 2 x cov (either 0 or 100) x 1 x 1 + resvar(y)
How would the expected mean of y be calculated based on this model?
E(y) = 1 x expected mean of x1 + 1 x expected mean of x2 + 1 x 50 (baseline amount via the constant)
= 1 x 100 + 1 x 100 + 50 = 250
What is RAM notation and what do the single components of RAM notation mean?
Compact way of expressing the different SEM relations, used in some software
I = identity matrix, diagonal of 1
tells you simply how many variables there are
A = Asymmetric (direct path) matrix
shows how strong effects are
Asymmetric because it might show the effect of x on y but not vice versa
S = Symmetric matrix (non-direct paths/covariances)
shows covariances -> symmetric, because variables move together
Also shows variances and residual variances
M = Means Vector
average means of variables
What is the model implied covariance and how is it calculated in RAM?
Model implied covariance: how all variables move together
i.e. if you’re connected by a rope and someone pulls, how much does everyone move?
-> model implied means: where does everyone land after the pull?
normally model implied covariance calculated as follows:
(I - A)^-1 x S x ((I - A)^-1)^T
= the inverse of (identity matrix minus the asymmetric matrix of one-directional effects/betas), times the symmetric matrix (covariances), times the transpose of that inverse
The inverse of (I - A) takes direct and indirect (chained) effects into account; S contributes the variances and covariances; multiplying by the transpose makes the result a proper symmetric covariance matrix
Combination of direct, indirect and mutual influence between variables
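A minimal numeric sketch in R (a hypothetical two-variable example with a single path x -> y of 0.5; all values are made up):
A <- matrix(c(0,   0,
              0.5, 0), nrow = 2, byrow = TRUE)   # asymmetric matrix: directed paths
S <- matrix(c(1, 0,
              0, 0.75), nrow = 2, byrow = TRUE)  # symmetric matrix: (residual) variances
I <- diag(2)                                     # identity matrix
impliedCov <- solve(I - A) %*% S %*% t(solve(I - A))  # model-implied covariance matrix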
What is the model implied mean and how is it calculated in RAM?
i.e. if you’re connected by a rope and someone pulls, where does everyone land?
-> model implied covariance: i.e. if e pulls, how much does everyone move?
RAM calculation
(I - A)^-1 x M
(I - A)^-1 takes direct and indirect relationships into account
M = basic means = adjusting average values of variables based on how they influence each other
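Continuing the numeric sketch from the previous card (the M values are made up): with M <- c(0, 2), the model-implied means are impliedMeans <- solve(I - A) %*% M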
What needs to be done to make different measurement types (i.e. in latent variables) equivalent? What are the implications for latent variables in psychology?
To make things equivalent, we might need to allow for differences in
Factor loading
Measurement specific intercepts
With latent variables (and thus often in psychological measurement) we do not know the intercepts and the factor loadings, therefore we have to estimate them
What is an important step to take before interpreting estimated model parameters and why is it relevant?
Before interpreting estimated model parameters, it is important to ensure that the model which has been fit to the data provides a good representation of the underlying relationships in the data
e.g. data interpretation will be faulty if certain covariances are not represented in the model
What is model fit?
When a model which has been fit to data provides a good representation of the underlying relationship
What happens when data is fit to a model which does not adequately represent the relationship between the variables (i.e. bad model fit)?
Model is likely to make poor predictions regarding future data
The estimated parameters may misinform us about relations between psychological constructs
Extreme examples: Assuming independence between different extraversion measures
When is the least squares approach a good approach to find the best model fit and when does it make sense to use another statistical approach?
LS good choice for estimating the parameters of linear regression models
Residual variation is assumed to be the same across all data points -> it makes sense to use least squares, as this approach finds the estimates which minimise the sum of squared residuals
Not a very good choice for SEM
There are different dependent variables, hence also often different residual variances
A residual variance of 100 for one variable might not be much, while for another variable a residual variance of 0.2 might be a lot
LS would, however, focus on minimising the residual variance of the first variable because it is numerically bigger (neglecting how this changes the residual variance of the 0.2 variable) -> the relative importance of the residuals is weighted very wrongly
-> Some means is needed to weight the residuals while accounting for their relative importance (i.e. how large a prediction error they represent) -> likelihood approach
What defines the likelihood of a row of data in typical SEM modelling? What makes for a better likelihood?
The multivariate normal distribution with an expected mean and covariance matrix
A higher likelihood results when an observation
is close to the expected mean
has a small variance
has a small covariance
What can be said about the connection between likelihood and log?
In SEM, for numerical reasons we normally work with the log likelihood of all the data we are fitting
This is done by taking the sum of the log likelihoods of each row of data
= the sum of the log of the multivariate normal density of each set of observations y, given the expected mean and expected covariance matrix
In SEM modelling it has become convention to sometimes use two times the negative log likelihood (-2LL) instead, and to minimise it rather than maximise it
What are nested models? Which model comparison can be used for nested models?
Nested models: model that can be obtained from a more complex model by constraining some of its parameters -> simpler
i.e. setting factor loading or covariance to zero
removing a predictor
-> fewer free parameters which have to be estimated
Also called a null model or restricted model
Alternative model: more complex model which includes additional free parameters
also called unrestricted model
-> if the more complex model is not a significantly better fit, the null model is kept because otherwise there is a risk of overfitting the data
Test used for comparison: the Likelihood Ratio Test (LRT)
What does the likelihood test ratio do?
Test used to compare null model and more complex model
The test compares the log likelihoods of the two models
LLunrestricted = log likelihood of the more complex/unrestricted model
LLrestricted = Log likelihood of the simpler (restricted)/ Null model
Likelihood ratio test statistic (chi^2) is given by
chi^2 = -2 x (LLrestricted - LLunrestricted)
-> always results in a positive value, since LLrestricted <= LLunrestricted (and the difference is then multiplied by -2)
The statistic follows a chi-square distribution with degrees of freedom equal to the difference between the number of free parameters between the two models
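A minimal sketch of this calculation (assuming two nested, already fitted lavaan models named fit_restricted and fit_unrestricted; the names are illustrative):
LL_r <- logLik(fit_restricted)
LL_u <- logLik(fit_unrestricted)
chi2 <- -2 * (as.numeric(LL_r) - as.numeric(LL_u))
df   <- attr(LL_u, "df") - attr(LL_r, "df")   # difference in the number of free parameters
pchisq(chi2, df = df, lower.tail = FALSE)     # p-value of the likelihood ratio test
In practice, lavTestLRT(fit_restricted, fit_unrestricted) or anova() performs this test directly.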
What is a chi^2 distribution?
Distribution of the (squared, standardized) differences between expected and observed values, e.g. frequencies
Answers the question of whether such a difference between expected and observed values is still likely to arise by chance
How should the likelihood ratio test be interpreted?
If the test statistic is significant (p-value below a specified threshold like 0.05), it suggests that the difference in log likelihood is unlikely to be due to random sampling variation
-> the more complex model provides a significantly better fit to the data than the simple model
If the test statistic is not significant, it means that the simpler model fits the data just as well as the more complex model; thus the simpler model should be preferred
The more complex model might still fit the data slightly better, but not enough to justify the inclusion of the extra parameters
The extra parameters are also more likely to be overfitting the data, thus worsening the performance for predicting new data and for inferring relations
What simple way exists to check whether a model actually fits the data? How can this be implemented in R?
Comparing
the model implied means with the actual means
the model implied covariance matrix and the actual covariance matrix (variance and covariance)
In R
inspectSampleCov(model_1, big5)
= sample covariance & means matrix of actual data
OR
lavInspect(fit_piqDuration, "sampstat")
lavInspect(fit_1, what = 'exp')
expected (model-implied) covariance and mean matrix
lavInspect(fit_1, what = 'res')
-> output is the difference observed - expected -> residual covariances, means and variances
residuals(fit_piqDuration, type = 'cor')
BUT: this does not tell us whether those differences are significant
Could be checked by adding an additional parameter and conducting a likelihood ratio test (thus comparing the two models)
When does standardization of data make sense?
Sometimes working with parameter estimates based on the raw units of measurement makes more sense, especially if the unit of measurement has obvious or well-understood implications (things like age, time, temperature, etc.)
"increase in the number of words known per year of age in children"
For psychological variables there is often no such implicit meaning of the measurement scales used - a 5 on extraversion does not mean anything without further context, and extraversion might have a scale of 0 to 5 in one measurement and a scale of 0 to 100 in another
-> in such cases it is much easier to interpret standardized estimates
Besides fixing the latent variance to 1 and scaling the observed variables so they have a variance of 1, there's an easier solution -> calculate the appropriate standardized form of the estimates from the unstandardized estimates
What are standardized factor loadings and how are they calculated?
Raw factor loadings (Lambda): how much an observed variable changes for a one-unit change in the latent variable
If the observations have different scales (e.g. one observation measured in kg, another in meters) the raw factor loadings are difficult to compare (is an increase by 5 kg more than an increase by 5 meters?) -> solution: standardization
Calculating a standardized factor loading:
Lambda_std = sqrt(sigma^2_explained / sigma^2_x)
sigma^2_explained = the variance of the observed variable which is explained by the latent factor
sigma^2_x = the total variance of the observed variable
If the variables are already standardized (which is often the case in SEM), i.e. the total variance is set to 1, this becomes
Lambda_std = sqrt((1 - resvar of the observed variable) / 1) = sqrt(1 - resvar) = sqrt(explained variance)
The standardized factor loading is equivalent to the correlation between the latent factor and the observed variable
interpretation: if the latent factor increases by one standard deviation, the observed variable increases by the standardized factor loading
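Worked example (hypothetical numbers): if an observed variable has total variance 1 and residual variance 0.36, then Lambda_std = sqrt(1 - 0.36) = 0.8, i.e. a one-SD increase in the latent factor goes with a 0.8-SD increase in the indicator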
What can be said about standardized variance in SEM?
In a standardized solution, all variances are expressed with respect to a total variance for each variable (latent and observed) of 1.00
Latent factor variance:
The variance of the latent factor is set to 1 in the standardized solution
therefore the latent factor is interpreted as a standardized latent construct
Residual variance: the residual variance is also standardized and represents the proportion of variance in an observed variable that is not explained by the latent factor
= 1 - Lambda_std^2, since Lambda_std^2 represents the proportion of explained variance
How can covariances between latent factors or observed variables be standardized to correlations?
rho_XY = Cov(X, Y) / (SD(X) x SD(Y))
SD(X) = standard deviation of variable X
Cov -> assuming there is one latent factor Y connected to two observed variables X1 and X2:
Cov(X1, X2) = Variance(Y) x factor loading (Y -> X1) x factor loading (Y -> X2)
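Worked example (hypothetical numbers): with Var(Y) = 1 and loadings 0.8 (Y -> X1) and 0.7 (Y -> X2), Cov(X1, X2) = 1 x 0.8 x 0.7 = 0.56; if X1 and X2 also have total variance 1, this is already the correlation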
What are the benefits of standardization?
Interpretability: Standardized estimates are easier to interpret because they express relationships as proportions of variance or as correlations -> unit-free and comparable across variables
Comparison: Direct comparison of
factor loadings
correlations
variances
across different
models
variables
studies
Even if they use different measurement scales!
In software like lavaan in R, standardized estimates are often included in the output by using options like standardized = TRUE or by requesting standardized solutions in model summaries
How is the covariance between two observed variables calculated in SEM?
Assuming there is one latent factor Y connected to two observed variables X1 and X2:
Cov(X1, X2) = factor loading (Y -> X1) x factor loading (Y -> X2) x Var(Y)
How can two (fitted) models be compared and what is the output thereof?
input: anova(model_1, model_2)
Output: AIC, BIC, ChiSq
How would you interpret those model-comparison outputs:
Slope model has
Smaller AIC
Smaller BIC
Additionally, the more complex (slope) model has a significant chi^2 difference (p < .05), indicating it is a better fit
Without the direct model comparison, the amount of explained variance also indicated a better fit for the slope model
Fit a model in r for a structural equation model representing the relationship between two latent factors (linear regression). What would be needed for a random intercept and random slope?
Create model (model structure):
model1 <- '
i =~ 1*time0 + 1*time1 + 1*time2 + 1*time3
s =~ 0*time0 + 1*time1 + 2*time2 + 3*time3
i ~ imean*1
s ~ smean*1
time0 ~~ residualVar * time0
time1 ~~ residualVar * time1
time2 ~~ residualVar * time2
time3 ~~ residualVar * time3
i ~~ ivar * i
s ~~ svar * s
i ~~ covsi * s -> allows for covariance between intercept and slope
'
Fit to data/parameters:
fit_model <- lavaan(model = model1, data = modeldata)
Represent data
summary(fit_model)
What is the formula for calculating the expected variance of an observed variable in SEM?
Observed variable: A
Latent variables influencing A: B, C
Formula
Var(A) = residual variance of A
+ Variance(B) x (path B -> A)^2
+ Variance(C) x (path C -> A)^2
+ 2 x Cov(B, C) x (path B -> A) x (path C -> A)
-> take the square root of this for the standard deviation
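Worked example (hypothetical numbers): with residual variance of A = 0.5, Var(B) = Var(C) = 1, path B -> A = 0.6, path C -> A = 0.4 and Cov(B, C) = 0.3: Var(A) = 0.5 + 1 x 0.6^2 + 1 x 0.4^2 + 2 x 0.3 x 0.6 x 0.4 = 0.5 + 0.36 + 0.16 + 0.144 = 1.164, and SD(A) = sqrt(1.164) ≈ 1.08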
What can be said about latent factor models over time?
A factor representing the same concept at different times can look just like a model with multiple latent factors relating to each other
The "two" latent factors (e.g. extraversion as a child vs. as an adult) probably have a certain covariance -> this could tell us something about the stability of the latent factor
Requirements for estimation are the same for factors over time
we need to fix a certain amount of parameters to specific values
Requirements for interpretation might differ
Measurement parameters might be different, e.g. one item on an extraversion scale might be highly relevant as a child (playing with others) and far less relevant as an adult
What is measurement invariance (in SEM)?
Measurement invariance = when none of the measurement parameters differ across factors, i.e. the measurement properties do not vary
For instance, if extraversion is measured in childhood (E1) and adulthood (E2), measurement invariance would mean that a manifest variable predicts extraversion equally well at both time points
Comparisons are easiest when there's measurement invariance
If all loadings and measurement parameters between two (or more) models are the same, it's called total measurement invariance
normally not the same
What assumptions does SEM make regarding measurement invariance and what can be said about this assumption?
Classical SEM perspective:
Comparisons regarding latent variables (e.g. a change up or down in the mean of the latent variable) are only possible when there's measurement invariance across the variables of interest (e.g. time)
This assumption is not really true
e.g. imagine studying extraversion over time, but some measurements occurred during the pandemic lockdowns -- average responses to questions about party attendance etc. could be substantially lower
Doesn't necessarily reflect changes in underlying extraversion
Reflects a change in the measurement property, not in the latent factor
Interpretation does, however, get harder when measurement properties change
meaning of the latent factor is always determined by
the measurements used
the relation between the measures and the latent factor
How can we test measurement invariance for a model in SEM?
Create a model where the two latent factors have the same measurement parameters
Create a model where the measurement parameters can vary
Compare the more restrictive and the freer model
What does this image represent?
Latent Variable of six people over 50 different time points
Black Dots: latent Variable
Latent Factor is constant across individuals (in terms of intercept and slope) and does not change over time
Residual errors and residual variance are not included
The manifest variable matches the latent factor perfectly and is measured perfectly (without error), therefore it is not visible ("hidden" behind the latent factor)
What does this image represent
Residual errors and residual variance are now included
Development of one latent variable over 50 time points for multiple individuals
Different intercepts and slopes which are likely correlated -> people with lower intercepts have bigger slopes
Orange points: residual errors of manifest variable (/manifest variable + residual errors)
development of one latent variable for six individuals measured with three different manifest variables (red, blue, orange)
The manifest variables have the same factor loading but different intercept/mean structure
They all have approximately a factor loading of 1, but blue has a lower intercept and thus starts further down
The factor loading can be deduced from the slope
What does the blue line represent here
Representation of the development of one latent variable over 50 time points for six individuals
Measured with three different manifest variables
The variables all have the same factor loading -> their slopes look alike and follow the latent factor's slope -> probably 1
The blue line has less residual variance (0.1 instead of 1) -> indicator for better measurement for latent factor
Which aspects indicate that a manifest variable is a better indicator for a latent variable?
Lower residual variance
Higher factor loading
What does the blue line represent in this picture
Represented are the development of a latent variable for six individuals over 50 time points, measured with 3 manifest variables
The blue manifest variable has a higher factor loading -> it goes up more for each one-unit increase in the latent factor than the latent factor itself does -> this manifest variable is better (a stronger indicator)
The expected value for a measurement is calculated as follows:
expected value = latent variable x factor loading + intercept of the manifest variable
therefore: factor loading = (expected value - intercept of the manifest variable) / latent variable
What meaning does the scale of a latent variable have?
There is no true scale for a latent variable
Scale is simply set to represent things in relation to it
It does not change the relationship of the observed variables
this does not mean that the scale is meaningless - without a scaled latent factor, a change of 2 in a manifest variable could not be interpreted
When the latent factor has a mean of 100 and an SD of 15, a change of 2 is negligible
When the latent factor has a mean of 0 and an SD of 1, a change of 2 is huge
Imagine you have to create a SEM model for how the IQ of patients changes at three different measurement points (piq_1, piq_2, piq_3). What meanings would the different components have?
model <- '
i =~ 1*piq_1 + 1*piq_2 + 1*piq_3
s =~ 0*piq_1 + 1*piq_2 + 2*piq_3
i ~ imean * 1
s ~ smean * 1
-> specifying the intercept and slope means (and thereby scaling them)
piq_1 ~~ residualVar*piq_1
piq_2 ~~ residualVar*piq_2
piq_3 ~~ residualVar*piq_3
-> implies the same residual variance for all three measurements
-> piq_1 ~~ piq_1 (without the shared label) would imply freely estimated residual variances
'
fit_piq <- lavaan(model, data = …)
summary(fit_piq)
Output
Intercept and slope
How are the requirements for latent factor models over time similar to and different from the requirements for models with multiple latent factors?
Equal: same requirements for estimation
Fixing one variance/factor loading and one mean
Often variance 1, latent mean 0
Different: requirements for interpretation
Comparisons are easiest when none of the measurement parameters differ over time
If a measurement of extraversion changes from year 1 to year 2 in terms of loading, it is hard to tell whether a change was due to time or due to measurement
What is measurement invariance? How is it connected to SEM?
When none of the measurement parameters vary across factors (= measurement properties do not vary)
Classical SEM view: comparison regarding latent variables only possible when measurement properties are invariant across all variables of interest (including time) -> often not very realistic though, since the context also influences those things
How can measurement invariance be tested between models capturing change over time?
Create one model where the parameters for both time points are fixed to be the same and a freer model where some or all parameters are allowed to vary across time
Compare the more restrictive and the freer model
What is the relationship between variance and standard deviation?
Var(X) = (SD(X))^2
SD(X) = sqrt(Var(X))
-> used in SEM to calculate the total variance of an observed variable:
SD(observed variable) = sqrt(Var(F) x loading(F -> observed variable)^2 + residual variance)
How can correlation between two observed variables be calculated in SEM?
Corr(x1, x2) = Cov(x1, x2) / (SD(x1) x SD(x2))
Explicitly, assuming x1 and x2 are influenced by the same latent factor F:
calculate the covariance:
Cov(x1, x2) = loading(F -> x1) x loading(F -> x2) x Var(F)
calculate the SDs:
SD = sqrt(total variance)
Total variance(x1) = Var(F) x loading(F -> x1)^2 + residual variance(x1)
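Worked example (hypothetical numbers): with Var(F) = 1, loading(F -> x1) = 0.8, loading(F -> x2) = 0.7, residual variances 0.36 and 0.51: Cov(x1, x2) = 0.8 x 0.7 x 1 = 0.56, total variances are 0.64 + 0.36 = 1 and 0.49 + 0.51 = 1, so Corr(x1, x2) = 0.56 / (1 x 1) = 0.56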
How can the proportion of explained variance be calculated?
R^2_Y = 1 - Var(Y)residual / Var(Y)
How can the implied mean of a variable X in a SEM be calculated?
Mean(X) = loading(F -> X) x mean(F) + intercept of X
How can the variance of an observed variable be calculated if it is influenced by multiple latent factors?
Var(x) =
loading(L1 -> x)^2 x Var(L1)
+ loading(L2 -> x)^2 x Var(L2)
+ 2 x Cov(L1, L2) x loading(L1 -> x) x loading(L2 -> x)
+ residual variance of x
If L1 and L2 do not covary, the covariance term is zero and drops out
How can RAM notation be used to draw SEM models?
Draw a rectangle for any observed variable
Draw a circle for every latent variable
Draw a triangle with a 1 inside for the constant (means)
For every element of the A matrix which is not zero, draw a single headed arrow
For every element of the S matrix which is not zero, draw a double headed arrow
if it is the same variable -> double headed arrow starting and ending on same variable
For the M matrix, draw a path from the triangle to every variable whose element is not zero
How is the implied covariance matrix different from RAM?
Smaller matrices sometimes used as an alternative to RAM
Assumes that the observed variables are all indicators of the latent factor
have no direct paths between themselves
Have no covariances between themselves
Looking at this output, answer the following questions:
are there obvious patterns in the data that the model is not capturing? What are these patterns, and what might they represent conceptually, i.e., with respect to the subjects and the research questions?
What aspect/s could we include in our model to represent the idea that initial performance after coma is (possibly) related to the duration of the coma? And what if the recovery rate was related to the duration of the coma?
If such a feature were important, and we included it in the model, what would you expect to see happen with the residual correlations?
Covariance between observed variables is not being captured by model
They might represent that better performance at point 1 goes hand in hand with better performance at point 2
we would need to include the duration variable and model an effect from it on the latent intercept
This path would allow the initial performance IQ to be influenced by the duration of the coma
We would do similar for the latent slope - make it depend on the duration of the coma.
i ~ imean * 1 + duration
-> introduces a relationship between the intercept and duration, meaning changes in duration affect the initial level
s ~ smean * 1 + duration
duration ~ durationmean * 1
# Residual variances
piq_1 ~~ residualVar *piq_1
piq_2 ~~ residualVar *piq_2
piq_3 ~~ residualVar *piq_3
duration ~~ durationVar * duration
Residual correlations would decrease, since the variance explained (by the latent factors) would increase
Based on this image, does adding effects of duration to this model make sense/make for a better model fit? Why or why not?
No, it doesn't
The path coefficients from duration to the slope/intercept are pretty much zero
If compared with the other graph, the residual variances have not decreased -> no significant amount of residual variance has been explained by adding duration
Fit a regression model where “piq” is predicted by duration and the interaction of duration with time
lm_piq <- lm(data = Wong, piq ~ duration + time*duration)
What can be said about time and causality in relation to SEM?
Not always so clear -> change in one thing can lead to change in another thing and so on
The mediating role of a variable is often not immediate, but rather unfolds at a significantly later time point
E.g. the mediating role of exercise motivation on people's fitness -> intervening on motivation (A) will immediately impact daily exercise (B) but won't change fitness (C) one day later
We can, however, measure change in motivation and estimate how this translates into later change in fitness
What is state dependent change and how is it implemented in SEM?
State dependent change: how a variable changes may depend on its current or past values, or on the current/past values of other variables
This is different from a correlated intercept and slope: a correlated intercept and slope means that a lower intercept at point A goes together with a constantly lower/higher slope, not that a lower intercept means the slope becomes steeper (or that learning becomes faster or slower) as time progresses
What is represented here
State dependence
Image 1:
state dependent slope -> differences in where individuals start, and thus in how steep the slope is at any given point
All end up at same point: at lower skill level, skills increase faster and then come to a plateau
Image 2
State dependence with individual differences in equilibrium
The slope curve is similar, but the plateau is at a different level
Might represent a learning curve by age (where at some point getting better is harder)
Random change in latent variable
In reality, latent variables do not change smoothly and predictably most of the time
Two sources of noise here
Residual (measurement) error -> everything we are not interested in
Random fluctuations in latent variable -> unpredictable changes in variable we are interested in
What are residuals and which different kinds of residuals exist?
Residuals in general
Term residual generally used for what is left after a prediction
Raw data residuals
Residual for each specific data point
e.g. how much noise is there around Day 3?
Difference between prediction and actual data point = residual
Prediction is made using model implied means and covariances
Each of those should be independent of all other raw data residuals
if not, it means there is information left in the residual that we could use to make better predictions (e.g. our model does not allow for covariance between two variables but there's actually covariance there) and the model is misspecified
if one residual can be used to predict another -> bad
Covariance/correlation matrix residuals
Not specifically about the raw data, but relates to higher-level patterns in the data -> covariances/correlations
Appears if there is a difference between the sample data covariance matrix and the model implied covariance matrix
Observing non-zero residual covariances is one way we can know that the "raw data residuals" will not be independent
Residual variances
The amount of variance we expect to see in the observed variable after accounting for the predictors, based on all other elements of the model
i.e. total variance - explained variance
if this is incorrect, the model is misspecified and inferences may be wrong
What can be said about modeling covariance between observed variables?
Explicitly modelling covariation between two observed variables is undesirable because it implies the latent factor cannot explain all the covariation in the data
But this is not a problem with regard to the assumptions of SEM
Can be a way to actually avoid violating assumptions
The pattern of variances and covariances in the data needs to match the model, otherwise the raw data residuals will not be independent
What is the issue with large n's and what can be done about this?
With lots of data, even minor imperfections will be statistically significant (with the chi-squared difference test)
There’s no single solution to this
the balance between model imperfection and model simplicity/complexity depends on the context
For predictions in a new scenario -> simpler model
For predictions in the same context -> no need to understand the model, choose the best performer
For interpretation, sacrificing a small amount of predictive power for the value of much simpler and more general concepts makes sense
RMSEA as a possible solution -> takes misfit into account while not simply favouring more complex models when more data is available
What is RMSEA and what does it do?
Model fit index
An index which does not simply favour more complex models when more data is available (as the chi^2 index does), by
including sample size in the model fit calculation
it incorporates a division by sample size -> makes it less sensitive to dataset size
Population focused: RMSEA estimates how well the model would perform in the population, not just the sample
How should RMSEA fit indices be interpreted?
RMSEA < 0.05
good fit
model closely approximates the population covariance matrix
0.05 < RMSEA < 0.08
acceptable fit
Model reasonable but could be improved
RMSEA > .10
Poor fit
Model fails to adequately represent data
What are interactions in SEM and what other name exists for interactions?
Also called moderators
Interactions: when we want to know how a particular parameter (e.g. a correlation between latent factors, the mean of a slope or intercept factor, etc.) might differ as a function of a covariate like age, gender, treatment, etc.
What is the difference between simple and complex interactions in SEM and how can they be included in a SEM model?
Simple interactions: when we want to examine whether a variable makes a difference in terms of a mean or variance AND the interaction (moderator) variable is observed (i.e. not a latent variable), we can include it in the model as a covariate
More complex interactions: when we want to know whether there is a difference in one or more
covariances
variance terms as a function of some variable
How are interactions in SEM different when the moderator is a discrete vs. a continuous variable?
Discrete variable:
we can use multiple-group SEM to estimate different models for each group, specified by the interaction (grouping) variable
Example: create a model where measurement works the same way for young and old people, then create a model which allows a difference between the two -> compare the two models
Continuous variable
SEM cannot handle continuous interactions/moderators
For the same reason, SEM does not handle interactions between latent variables
How can you visualize group models in R for a SEM model?
Create the model first:
goop <- '
Neuroticism =~ N1 + N2 + N3
Conscientiousness =~ C1 + C2 + C3
Neuroticism ~~ 1*Neuroticism (-> latent variance, fixed to 1 for scaling)
Conscientiousness ~~ 1*Conscientiousness
Neuroticism ~~ Conscientiousness (-> structural relationship)
Residual variances
N1 ~~ N1
N2 ~~ N2
C1 ~~ C1 etc.
Mean structure
N1 ~ 1
N2 ~ 1
etc.'
Fit the model separately
fit_males <- lavaan(goop, data = big5[big5$gender == 'male', ])
fit_females <- lavaan(goop, data = big5[big5$gender == 'female', ])
create a path diagram of the fitted models
graphLayout <- matrix(c('N1', 'N2', 'N3',
                        'Conscientiousness', NA, 'Neuroticism',
                        'C1', 'C2', 'C3'), byrow = TRUE, ncol = 3)
Draw the path diagrams from the fitted models
graph_sem(model = fit_males, layout = graphLayout)
graph_sem(model = fit_females, layout = graphLayout)
-> visual representation
summary(fit_males, standardized = TRUE) for a summary including standardized estimates
How can we figure out with SEM whether there's a group difference in terms of an effect (assuming the groups are discrete and not continuous)?
Create general model where effect of group can vary
Create model where groups are forced to have same relationship
model_restricted <- '
Neuroticism ~~ c(corrNC, corrNC) * Conscientiousness '
-> structural relationship: both groups are forced to have the same correlation between Neuroticism and Conscientiousness
-> to allow a different correlation per group: c(corrMale, corrFemale) * Conscientiousness
fit_restricted <- lavaan(model_restricted, data = big5, group = 'gender')
summary(fitrestricted)
Compare the two models
anova(fit_unrestricted, fit_restricted)
-> the restricted model has fewer free parameters, so if the test is significant the more complex (unrestricted) model is a better fit
What does a saturated model look like in R and what is the point of saturated models?
Saturated models are the most unrestricted models in a way, allowing for all kinds of correlations and covariations
The model must be constructed in a way that all variables are free to have whatever relationships they want
saturated <- '
N1 ~~ N2 + N3 + C1 + C2 + C3
N2 ~~ N3 + C1 + C2 + C3
N3 ~~ C1 + C2 + C3
C1 ~~ C2 + C3
C2 ~~ C3
N2 ~~ N2 etc.
N3 ~ 1 etc.'
-> fit model
saturated models allow us to see whether the fitted model really makes more sense than freely estimated covariances -> if the saturated model is a significantly better fit, we have to inspect what went wrong (e.g. whether we've not paid attention to a covariance) and can try to build those missing relationships into our model
We could now extract the residuals from the fitted data to see what has been forgotten about
How can the residuals of a fitted model be extracted in r?
residualsgroup1 <- residuals(fit_group1, type = "cor")
print(residualsgroup1$cov)
Which different regression syntaxes exist in R to incorporate an interaction in a linear regression model, given
dataset = people
dependent variable = fear
independent variable 1 = bees
independent variable 2 = big noses
thehorrors <- lm(data = people, fear ~ bees*bignose) OR
thehorrors <- lm(data = people, fear ~ bees + bees*bignose) OR
thehorrors <- lm(data = people, fear ~ bignose + bees*bignose) OR
thehorrors <- lm(data = people, fear ~ bees + bignose + bees*bignose)
-> when we include an interaction via * we automatically also include the main effects (i.e. we look at the influence of the interaction of bees and big noses on fear, but also automatically at the effects of big noses alone and bees alone on fear), so all four calls fit the same model
What is the meaning behind the regression syntax in R for Linear random effect model?
lmer(data = Wong, piq ~ (1 | id))
-> 1 = intercept, a constant which is allowed to depend on id -> random intercept
lmer(data = Wong, piq ~ (time | id))
same as lmer(data = Wong, piq ~ (1 + time | id))
-> time = a predictor whose effect is allowed to depend on id -> random slope; the 1 (constant depending on id -> random intercept) is also modelled implicitly
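A runnable sketch putting this together (assuming the Wong data frame has the columns piq, time and id; any names not in these notes are illustrative):
library(lme4)
fit_ri <- lmer(piq ~ time + (1 | id), data = Wong)        # fixed effect of time + random intercept per id
fit_rs <- lmer(piq ~ time + (1 + time | id), data = Wong) # additionally a random slope of time per id
summary(fit_rs)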
What is autocorrelation? When does it happen?
Autocorrelation: the score at the first observation is related to the score at the second observation, etc.
Happens often with multiple observations of the same thing (i.e. nested data)
Individual differences in slope and intercept are one source of autocorrelation, but not the only one
What is the difference between auto- and cross correlation and which different kinds of auto- and crosscorrelations exist?
Autocorrelation: earlier values of one variable are correlated with/can predict later values of that same variable
Positive: if at one point the variable goes up, it is likely to go up even more at the next point
Negative (very rare and unlikely): if the variable goes up at point 1, it goes down at point 2, etc.
Crosscorrelation: earlier values of one variable are correlated with/can predict later values of a different variable
Positive crosscorrelation: when variable A goes up, variable B goes up shortly after (overlapping parallel lines)
Negative crosscorrelation: when variable A goes up, variable B goes down shortly after (overlapping mirrored lines)
Relation between variables over time can be very informative
When is model comparison used?
Comparing competing theories
Extending theories
Checking whether a theoretical model matches data/observations in the world
What does the chi^2 test do in terms of model comparison and how should it be interpreted? What other name is there for the chi^2 test?
Also called the likelihood ratio test
Chi^2 used to compare two model fits
one model is simple and one more complex
Gives probability of observing the difference in likelihood of the two model fits
Null hypothesis/assumption: the simpler model fits as well as the more complex model
if p < .05, there is only a 5% probability of observing such a difference if the simpler model were truly as good as the more complex model (technically there's still a certain probability that it is) -> should lead to rejection of the null hypothesis
What does the AIC do and how should it be interpreted?
Model comparison tool
Broadly applicable, not only to nested models (unlike the chi^2/likelihood ratio test)
Combines likelihood of a model with number of parameters
Interpretation
Lower AIC = this model is expected to perform/predict better than the one with higher AIC
Makes no claims about statistical significance though
What is the meaning of theories?
Intellectually:
important for collective history
allow new intellectual viewpoints
Practical
Facilitate understanding of our surroundings and of empirical phenomena
Help us predict and control phenomena in our world
What is the role of theories in psychology?
Lack of strong theories in psychology -> could partly explain the replication crisis
Lack of theory construction
Strong focus on theory testing rather than theory creation
Toothbrush problem: theories in psychology are mostly the product of single individuals (like toothbrushes, nobody wants to use anyone else's)
Loyalty to the hypothetico-deductive method: the idea that scientific progress depends on repeated testing of theories
Why is the lack of strong theories in psychology an issue?
Danger of repeatedly reinventing the wheel
Lack of overview of existing theories
Lack of understanding of the connections between phenomena
Lack of understanding of whether and which phenomena come from the same source
Without strong theories it is hard to create effective interventions
Without strong theories it is harder to create/operationalize studies
Which two different starting points of scientific methodology exist?
hypothetico-deductive science
A putative theory which is repeatedly tested
Theory construction methodology (TCM)
Starting point: a set of relevant phenomena
Endpoint: a theory which explains the phenomena
What are phenomena and what is their role in science?
Stable and generally valid characteristics/features of the world
empirical generalizations
Science tries to explain phenomena
What is data and how is it different from phenomena?
Quite direct observations or reports about observations in the world
Distinct, related to specific investigative context
Have specific empirical patterns (while phenomena have general empirical patterns)
Data is NOT/only indirectly explained by theories
Theories explain phenomena, which are made visible through data
What are theories?
Theories help explain phenomena
Set of linked statements
at least one of the statements expresses a general principle
Which ways exist to evaluate the overall worth of a constructed theory?
hypothetico-deductive method: evaluating a theory based on how well it can predict phenomena
Other criteria -> Kuhn's five features to evaluate a theory:
accuracy
consistency
scope (breadth)
simplicity
productivity
Inference to the best explanation -> TCM
Which way of evaluating the overall worth of a constructed theory does TCM prefer?
Inference to the best explanation
Theory of coherence of explanations
Three criteria:
Explanatory breadth: the number of phenomena which are explained by the theory
Criterion of analogy: analogical thinking, repeated success
Criterion of simplicity: prefers theories with fewer parameters
What is TCM?
Haig's abductive theory of method
Scientific research subdivided into two categories
Discovering empirical phenomena
Explaining phenomena through theories constructed to explain them
Multicriterial perspective: theories have two purposes
Predictive
Explanatory
What is analogical abduction?