What is a mediation analysis?
A mediation analysis examines whether the effect of an independent variable on an outcome variable is transmitted through an intermediate variable called a mediator.
Instead of asking:
Does X affect Y?
it asks:
Through which mechanism does X affect Y?
Describe a classical mediation model
A mediation model contains four important quantities:
1) The total effect (c), which represents the overall effect before considering any mediator. In my study, this indicates higher reuse intention for Pia overall.
2) Path a, which measures how strongly the condition affects the mediator: “Pia increased trust compared to Gina.”
3) Path b, which indicates how strongly the mediator predicts reuse: “Higher trust differences predicted higher reuse differences” or “Higher empathy differences predicted higher reuse differences.”
4) The indirect effect (a × b), which estimates how much of the condition effect travels through the mediator.
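Written as regression equations, the classical single-mediator model is usually expressed as shown below. The direct effect c' (the effect of X on Y after controlling for the mediator M), the intercepts i and the error terms e are standard textbook notation and are not taken from the thesis itself.

```latex
\begin{align}
Y &= i_1 + cX + e_1 \\
M &= i_2 + aX + e_2 \\
Y &= i_3 + c'X + bM + e_3 \\
c &= c' + ab
\end{align}
```

The last line shows why a × b is read as the part of the total effect that travels through the mediator: in linear models estimated with least squares, the total effect splits into the direct effect and the indirect effect.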
Why was a special mediation model needed?
A standard mediation analysis would not have been possible because, due to the within-subjects design, I did not have two independent groups. Every participant experienced both Gina and Pia, whereas standard mediation models assume independent groups.
What is special about the Montoya and Hayes model?
Instead of using raw scores, difference scores are calculated for each participant. This turns the repeated-measures design into a regression problem.
The model is specified as a parallel multiple mediator model. This is important because empathy and trust are correlated. If they were tested separately, one mediator might receive credit for variance actually explained by the other.
Running them in parallel allows estimation of their unique contributions.
Why use bootstrapping?
The indirect effect is the product of paths a and b (a × b). Such product terms are usually not normally distributed, which is why traditional approaches that assume normality become problematic.
Bootstrapping repeatedly resamples the original data (with replacement) and estimates the indirect effect thousands of times (e.g., 5000 samples) to create empirical confidence intervals. If the confidence interval does not include zero, the indirect effect is considered significant.
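The resampling logic can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the analysis used in the thesis: it assumes a single mediator, takes path a as the mean mediator difference and path b as the simple regression slope of the outcome difference on the mediator difference, and leaves out the additional covariates of the full Montoya and Hayes model. The function names and the data are hypothetical.

```python
import random
import statistics


def slope(x, y):
    """Least-squares slope of y regressed on x (simple regression)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def indirect_effect(m_diff, y_diff):
    """Simplified indirect effect a * b computed from difference scores."""
    a = statistics.fmean(m_diff)   # path a: mean mediator difference
    b = slope(m_diff, y_diff)      # path b: slope of outcome diff on mediator diff
    return a * b


def bootstrap_ci(m_diff, y_diff, n_boot=5000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(m_diff)
    estimates = []
    for _ in range(n_boot):
        # Resample participants with replacement, keeping their pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        m = [m_diff[i] for i in idx]
        y = [y_diff[i] for i in idx]
        if len(set(m)) < 2:
            continue  # slope is undefined when the mediator has no variance
        estimates.append(indirect_effect(m, y))
    estimates.sort()
    k = len(estimates)
    lower = estimates[int((alpha / 2) * k)]
    upper = estimates[min(k - 1, int((1 - alpha / 2) * k))]
    return lower, upper


# Hypothetical difference scores (Pia minus Gina) for eight participants.
empathy_diff = [0.5, 1.0, -0.5, 1.5, 0.0, 2.0, 1.0, 0.5]
reuse_diff = [0.0, 1.0, -1.0, 2.0, 0.5, 1.5, 1.0, 0.0]
print(indirect_effect(empathy_diff, reuse_diff))
print(bootstrap_ci(empathy_diff, reuse_diff))
```

If the interval returned by `bootstrap_ci` excludes zero, the indirect effect would be treated as significant under the decision rule described above.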
Why should I be careful when interpreting the mediation results of my thesis?
The thesis repeatedly emphasizes:
exploratory analysis
wide confidence interval
lower bound close to zero
small sample size
limited statistical power
Therefore the correct interpretation is:
The data are consistent with empathy functioning as a possible mediating pathway, but they do not provide strong confirmatory evidence.
The confirmatory analyses demonstrated that Pia increased empathy, trust, and reuse intention. However, they did not explain why reuse intention increased. Therefore, I conducted an exploratory within-subjects mediation analysis to investigate whether changes in perceived empathy and perceived trust could explain the relationship between linguistic style and reuse intention. Following Montoya and Hayes, mediation was modeled using difference scores because the study employed a repeated-measures design. The results suggested a possible indirect pathway through perceived empathy, whereas the indirect effect through trust was not statistically reliable. However, because the analysis was exploratory, based on a small sample, and produced wide confidence intervals, these findings should be interpreted as preliminary evidence rather than confirmatory proof.
My thesis does not only test whether personalization directly increases reuse. I wanted to highlight in particular the internal organism effects that shape user responses, looking beyond two-dimensional stimulus-response frameworks.
This is also why I included mediation analyses, to form a starting point for follow-up studies.
What does CASA say?
It states that people apply social rules to computers even when they know they are interacting with a machine. One example is politeness: you may have experienced this yourself when saying “thank you” to a conversational agent.
This is the basis for the study, as personalization would not work if people treated agents as pure tools.
Why then also MET?
It extends the CASA paradigm by stating that people interact with media and technology using the same social rules they use with humans.
What is social response theory?
It describes mindless social responding: people often respond to social cues automatically. This fits my context because in low-demand contexts such as charging, such automatic processing can be assumed to unfold more easily.
Why should using a user's name matter?
According to Self-Reference Theory, information linked to the self receives deeper cognitive processing. A person's own name is one of the strongest self-referential cues, making communication feel more personally relevant and engaging.
Explain Gross’s Process Model
Emotions can be regulated at different stages:
Cognitive reappraisal (changing how the situation is interpreted)
Response modulation (changing responses that have already emerged, e.g., guided breathing)
Why can you claim that linguistic style caused the differences?
Functionality was intentionally held constant across conditions. The intervention, voice, timing, and informational content remained the same. Therefore, observed differences are theoretically attributed to the linguistic style profile, which served as the manipulated stimulus within the S-O-R framework.
I chose a within-subjects design because the study focused on subtle differences in linguistic style. By exposing each participant to both agents, every participant served as their own control. This reduced inter-individual variability, increased statistical power, and allowed a more sensitive comparison between the generic-empathic and personalized-empathic agent. Additionally, fewer participants were required compared to a between-subjects design.
The paired t-test assumes that the difference scores between conditions are normally distributed. I tested this assumption using the Shapiro-Wilk test. Since normality was violated for several dependent variables, I selected the Wilcoxon Signed-Rank Test, which is the appropriate non-parametric alternative for paired observations.
What exactly does the Wilcoxon test evaluate?
It evaluates whether the median difference between paired observations differs significantly from zero. In this study, it assessed whether participants consistently rated one agent higher than the other.
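To make the mechanics concrete, here is a minimal sketch of the signed-rank procedure using only the Python standard library. It relies on the normal approximation without tie or continuity correction, which is an assumption of this sketch and may differ from what statistical software reports, especially for small samples. The ratings are hypothetical.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Return (W+, Z, two-sided p) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # no tie correction
    z = (w_plus - mean_w) / sd_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, z, p


# Hypothetical ratings of the two agents by the same six participants.
pia = [6, 5, 7, 6, 4, 7]
gina = [4, 5, 5, 3, 5, 4]
print(wilcoxon_signed_rank(pia, gina))
```

A large W+ relative to its expected value means that the positive differences (Pia rated higher) carry most of the rank mass, which is what "consistently rated one agent higher" means in practice.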
In repeated-measures designs, the normality assumption applies to the distribution of the paired differences, not the individual condition scores. Therefore, I tested the difference scores between Pia and Gina. This is the correct procedure when deciding whether a paired t-test is appropriate.
The manipulation check was essential to establish construct validity. Before interpreting differences in trust, empathy, or reuse intention, I first needed evidence that participants actually perceived Pia as more personalized and empathic than Gina. The manipulation check confirmed that participants consistently recognized the intended distinction between the two agents.
What would it mean if the manipulation check had failed?
It would weaken the interpretation of all subsequent findings because we could no longer be confident that participants perceived the experimental manipulation as intended.
Therefore, I also conducted pilot testing to gain insight into how the manipulation works.
The manipulation check produced binary outcomes. Participants selected which agent appeared more personalized or more empathic. Since the responses were categorical and involved two possible outcomes, a Binomial Test was the appropriate method for evaluating whether the observed distribution differed significantly from chance.
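As a sketch of what this test computes, the following Python function performs an exact two-sided binomial test against a chance level of 0.5. The function name and the example counts are hypothetical and do not reproduce the thesis data.

```python
from math import comb


def binomial_test_vs_chance(successes, n):
    """Exact two-sided binomial test against a chance level of 0.5.

    Sums the probabilities of all outcomes that are at most as likely
    as the observed one. With p = 0.5 every outcome has probability
    comb(n, i) / 2**n, so the counts can be compared as integers.
    """
    observed = comb(n, successes)
    extreme = sum(comb(n, i) for i in range(n + 1) if comb(n, i) <= observed)
    return extreme / 2 ** n


# Hypothetical example: 17 of 20 participants chose Pia as more personalized.
print(binomial_test_vs_chance(17, 20))
```

A small p-value here means that such a lopsided split would be unlikely if participants were choosing between the two agents at random.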
Statistical significance indicates whether an observed effect is unlikely to have occurred by chance under the null hypothesis. Effect size, on the other hand, describes the magnitude of that effect. While significance tells us whether an effect can be distinguished from chance, effect size tells us how large, and therefore how practically meaningful, that effect is.
Why are effect sizes important?
Because statistical significance is influenced by sample size. Large samples can make very small effects significant, whereas effect sizes provide information about practical relevance.
How do you interpret effect size benchmarks?
Common guidelines suggest that r = .10 is small, r = .30 is medium, and r = .50 or above is large.
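For nonparametric tests, r is commonly derived from the standardized test statistic as r = |Z| / sqrt(N). The sketch below assumes this conversion and maps the result onto the benchmarks above. Which N is used (number of pairs or number of observations) differs between conventions, so it is left to the caller. The label "below small" for r < .10 is an addition of this sketch.

```python
import math


def effect_size_r(z, n):
    """Convert a standardized test statistic Z into the effect size r."""
    return abs(z) / math.sqrt(n)


def benchmark(r):
    """Map r onto the common small / medium / large guidelines."""
    if r >= 0.50:
        return "large"
    if r >= 0.30:
        return "medium"
    if r >= 0.10:
        return "small"
    return "below small"


# Hypothetical example: Z = 2.8 from a test based on N = 25.
r = effect_size_r(2.8, 25)
print(round(r, 2), benchmark(r))
```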
The study was specifically designed to isolate linguistic style as the manipulated variable. Functionality, voice, intervention structure, timing, and informational content were held constant across conditions. The primary systematic difference between the two agents was their linguistic style profile. Therefore, the observed differences in empathy, trust, and reuse intention can reasonably be attributed to the linguistic manipulation.
Can you identify which specific linguistic cue caused the effect?
No. The study manipulated a bundle of cues, including personalization, self-reference, agreeableness, relational framing, and empathic language. Therefore, I can conclude that the overall linguistic profile influenced the outcomes, but not which individual cue was primarily responsible. Isolating specific cues would require a factorial design and represents an important direction for future research.
Can you explain the architecture of your application?
The application follows a modular client-side architecture based on HTML, CSS, and JavaScript. HTML defines the structure and experiment flow, CSS handles the visualization and participant interface, and JavaScript contains the experiment logic, state management, timer synchronization, audio control, and CSV data logging. Communication between experimenter and participant windows is implemented using BroadcastChannel to enable synchronized interaction.
Why use BroadcastChannel?
BroadcastChannel allowed synchronized communication between experimenter and participant windows without requiring a backend server.
Why not WebSockets?
Because the experiment ran locally and only required communication between browser windows on the same device. This also kept latency low and controllable.
What design pattern does your architecture resemble?
The architecture loosely follows a state-driven event architecture with separation between presentation, application state, and event handling.
Why Wizard-of-Oz instead of a fully autonomous agent?
I used a Wizard-of-Oz setup because the goal was not to test autonomous speech recognition or natural language generation, but to test the effect of controlled linguistic agent profiles. By using predefined MP3 snippets, I could keep functionality, timing, voice, and content comparable across conditions and isolate the influence of linguistic style.
How did you prevent data loss?
Affect Grid responses were stored immediately in the application state and then appended to a master CSV file. I also used a CSV write queue to serialize write operations and avoid race conditions, so that multiple writes could not interfere with each other.
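The idea of a write queue can be illustrated independently of the prototype. The prototype itself is written in JavaScript. The following Python sketch is only an analogy under that caveat: all rows pass through one queue and are written by a single worker, so two writes can never overlap. The class name, file name and row layout are hypothetical.

```python
import csv
import queue
import threading


class CsvWriteQueue:
    """Serializes CSV appends through one worker thread."""

    def __init__(self, path):
        self._path = path
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def append(self, row):
        """Enqueue a row; callers never touch the file directly."""
        self._queue.put(row)

    def _run(self):
        while True:
            row = self._queue.get()
            try:
                if row is None:  # sentinel: stop the worker
                    return
                # Open, append and close per row so each row reaches the file promptly.
                with open(self._path, "a", newline="", encoding="utf-8") as f:
                    csv.writer(f).writerow(row)
            finally:
                self._queue.task_done()

    def close(self):
        """Flush all pending rows and stop the worker."""
        self._queue.put(None)
        self._queue.join()
        self._worker.join()
```

A caller would create one queue for the master file, call `append([...])` for every Affect Grid response, and call `close()` at the end of the session. Because only the worker opens the file, the order of rows matches the order in which they were enqueued.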
How did you synchronize experimenter and participant views?
The experimenter and participant windows were synchronized using the BroadcastChannel API. The experimenter window sent events such as timer progress, Affect Grid prompts, reset signals, and voice-status updates. The participant window received these events and updated its interface accordingly. This allowed both views to run in separate browser windows without requiring a backend server.
How did you ensure reproducibility?
Reproducibility was supported by using standardized MP3 files, a fixed experimental flow, controlled timing, and the same interface structure for all participants. The system loaded the audio snippets from a selected folder and generated the flow automatically, which reduced manual variation during the experiment.
Why did you use modular functions?
I structured the code into modular functions for tasks such as timer control, audio playback, Affect Grid handling, participant synchronization, and CSV logging. This made the prototype easier to debug, maintain, and adapt during the study.
How do you know the prototype itself did not influence the results?
I standardized the interface and interaction flow across conditions and only manipulated the predefined linguistic cue bundle. Therefore, differences should primarily reflect the intended manipulation rather than technical variation.
What happens if communication between experimenter and participant fails?
The system was designed to degrade gracefully. The experimenter controlled the flow manually and participant responses were additionally stored locally before export.
Why not build a fully automated pipeline?
Because the goal was experimental control rather than autonomous interaction. Wizard-of-Oz reduced variability and allowed isolation of the linguistic manipulation.
What was the biggest technical challenge?
Synchronizing experimenter control, participant interaction, audio playback, and affect measurements while maintaining a smooth user experience and reliable data collection.
How did you validate that the software worked correctly?
I conducted pilot testing, verified synchronization manually, tested data export, and standardized the experimental procedure before running the main study.
Why include a standardization task?
Mainly to reduce floor effects and ensure sufficient variance in arousal prior to the breathing intervention. If a participant's arousal is already very low before the intervention, there is naturally little left to regulate.
The task was introduced prior to the session and then started only by visual cues. After that, only the word sequence was played, to minimize additional linguistic cues that could blur the manipulation and the agent persona.
How do you manage Type I errors (false positives)? How did you control for that risk?
I was mindful of the risk of Type I errors. First, I set my significance threshold at a standard level—usually p < 0.05. Where I conducted multiple comparisons, I considered corrections like Bonferroni to adjust for multiple testing. Most importantly, I interpreted my results holistically—considering effect sizes, confidence intervals, and theoretical plausibility. In other words, I didn’t rely on p-values alone. If anything appeared marginal, I highlighted it as exploratory rather than definitive.
I used a convenience sample because participants were recruited based on accessibility within the university environment, which allowed efficient data collection under practical constraints. However, this limits external validity because the sample may not represent the broader population of EV users. Since the study focused on comparing experimental conditions under controlled settings, internal validity was prioritized over representativeness. Future research should include more diverse and heterogeneous populations.
Would your results generalize?
The findings should be interpreted cautiously and mainly as evidence within this target population.
Your participants were technology-affine students. Could that have biased trust ratings?
Yes, this is a potential source of bias. Participants with prior experience with conversational agents may evaluate trust and empathy differently than less experienced users. I addressed this by collecting prior experience data in the pre-survey and reflecting this limitation in the discussion. Future studies could explicitly model agent experience as a conditional factor or include it as a random effect in a linear mixed model.
Would you expect stronger or weaker effects?
It is difficult to predict. Experienced users may show reduced novelty effects but potentially more differentiated evaluations.
Why was a representative sample not necessary for your study?
Because the primary goal of my study was not population estimation but theory testing under controlled conditions. The objective was to investigate mechanisms of perceived trust, empathy, and emotional regulation rather than estimate prevalence in the general population. Therefore, experimental control and internal validity were prioritized.
Which design principle do you consider the most important?
I would argue that the most important design principle is coherence across the linguistic profile, which I describe as a coherent cue bundle. My findings suggest that relational depth does not emerge from isolated cues, but from the alignment between foreground and background cues. For example, a personalized opening using the participant’s name only becomes effective when it is supported by consistent pronouns, interaction metaphors, warmth, and relational framing throughout the interaction. Otherwise, personalization may appear inconsistent or even uncanny.
Why not personalization as a whole as the main implication?
Personalization is a broad term. However, my findings suggest that personalization itself is not sufficient. Personalization becomes effective only when embedded into a coherent social script.
Why calculate Cronbach’s alpha?
Cronbach’s alpha was calculated to assess internal consistency reliability, meaning whether the items of a scale consistently measured the same construct. It does not prove validity but supports that the scale operates reliably within the current sample.
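The computation behind the coefficient can be sketched as follows, using the standard formula alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score). The data layout (one row per participant, one column per item) and the example ratings are assumptions of this sketch.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of participant rows (one value per item)."""
    k = len(rows[0])                      # number of items in the scale
    items = list(zip(*rows))              # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical ratings of four participants on a three-item scale.
ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
]
print(cronbach_alpha(ratings))
```

Alpha approaches 1 when the items rise and fall together across participants, and drops when the items vary independently of each other.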
What is grand mean centering in the parallel mediator model?
In the parallel multiple mediator model, grand mean centering means that mediator values are shifted relative to the overall sample mean. This does not change the mediation effect itself, but makes coefficients easier to interpret and improves estimation stability when multiple mediators are included simultaneously.
When did you use Welch’s t-test or the Mann–Whitney U test, and why?
For order-effect analyses, I compared the Pia–Gina difference scores between the independent order groups AB and BA. I used Welch’s t-test when normality held and Mann–Whitney U when it did not.
Why not always use Mann–Whitney?
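Because Mann–Whitney assumes independent samples. That fits the comparison of the two order groups, but my main analyses compared repeated measurements from the same participants, which calls for a paired test such as the Wilcoxon signed-rank test.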
Why did you conduct Harman’s Single Factor Test and what did it show?
I conducted Harman’s Single Factor Test to assess potential common method bias resulting from collecting multiple self-report measures within the same study. The idea is that if one dominant latent factor explains most of the variance, the observed relationships could partly reflect the measurement method rather than actual constructs. In my analysis, the first factor did not explain the majority of variance, suggesting that common method bias was unlikely to substantially distort the results.
What exactly is common method bias?
Common method bias refers to artificial correlations that arise because variables are measured using the same method, for example self-report questionnaires collected in the same session. This can inflate observed relationships between constructs.
How do you interpret the Z value in your analysis?
The Z value is the standardized test statistic produced by nonparametric tests such as the Wilcoxon signed-rank test or Mann–Whitney U test. It indicates how far the observed result deviates from the expected value under the null hypothesis, measured in standard error units. Larger absolute Z values indicate stronger evidence against the null hypothesis. However, the Z value itself does not indicate practical importance and should be interpreted together with the p value and ideally with an effect size.
Why did you report confidence intervals in addition to p-values?
Confidence intervals provide information beyond statistical significance by showing the range of plausible population effects. In my analysis, the confidence interval included zero, indicating that both no effect and small positive or negative effects remained compatible with the observed data. Therefore, the result was interpreted as inconclusive rather than evidence for absence of an effect.
What would have been a different method for the qualitative analysis?
A more interpretive alternative would have been reflexive thematic analysis, placing greater emphasis on researcher interpretation and meaning construction.
Another option is grounded theory, which would have focused more on generating new theory directly from the data rather than interpreting findings within predefined theoretical assumptions.
What quality and rigor measures did you apply in the qualitative research?
Iterative coding: the coding framework was continuously and critically reviewed and checked.
Independent coder: a second coder independently coded a subset of the data.
What quality measure could be used besides an intercoder check?
Member checking: reviewing the findings together with participants, so that participants are included in the analysis process.
How did you ensure that the quantitative study did not bias the qualitative study, and vice versa?
I ensured there was no bias by keeping both analyses separate. I completed the quantitative analysis first—testing hypotheses and documenting results—before starting the qualitative analysis. The qualitative findings were used afterward to help interpret the patterns we had already established quantitatively. This way, qualitative insights provided explanations without influencing the quantitative outcomes.
How ecologically valid is a Wizard-of-Oz experiment?
It offers strong experimental control but lower ecological validity than a field study. The approach was appropriate for investigating causal effects of linguistic design before investing in a fully implemented system.
Could personalization become manipulative?
Yes. Personalization can increase trust and emotional engagement, which creates ethical responsibilities. Therefore, transparency, informed consent, and user control should be central design requirements.
Why focus on EV charging breaks and not another waiting situation?
Charging breaks are emotionally relevant, involve uncertainty and waiting, and occur in a vehicle context where conversational agents are already established. This makes them a particularly suitable application domain.
Why did you use a breathing exercise instead of a more sophisticated intervention?
I wanted a highly standardized micro-intervention that could be kept identical across conditions. This ensured that differences could be attributed to linguistic style rather than differences in intervention content.
What does a p-value actually mean?
A p-value represents the probability of observing data at least as extreme as the collected data if the null hypothesis were true.
Why can a non-significant result still be meaningful?
A non-significant result does not prove that no effect exists. It may indicate a small effect, insufficient power, high variability, or genuinely no difference. Interpretation must consider context and confidence intervals.
Illustrate this with an example from my own non-significant effects.
Which result surprised you the most?
The most surprising result was that the personalized-empathic profile substantially increased perceived empathy, trust, and reuse intention, while not affecting the immediate effectiveness of the identical breathing intervention. This ultimately led to the concept of temporal dissociation discussed in the thesis.
What was the dependent variable in your LMM?
The dependent variable was the change in arousal extremity.
What were the fixed effects?
The fixed effects were the agent condition, baseline arousal extremity, and their interaction.
What was the random effect?
Participant was included as a random effect to account for repeated measurements and individual differences.
What does the interaction term mean in your LMM?
The interaction tested whether the effect of the agent condition depended on participants’ initial arousal extremity.
What did your interaction result show of your LMM/H1b?
The interaction was not significant, meaning that the difference between Pia and Gina did not depend on participants’ initial arousal extremity.
b = 0.32, p = .230, where b is the estimated regression coefficient.
What is the main takeaway of your LMM?
The breathing exercise reduced arousal extremity mainly as a function of participants’ initial state, but this reduction was not enhanced by the personalized linguistic style.