Item difficulty
Definition
Interpretation
optimal difficulty
Mean item response score
p low (< .20): difficult test item
p moderate (.20 - .80): moderately difficult
p high (> .80): easy item
=> Variance / SD is maximized at p = 0.5 for dichotomous items
Item discrimination
Linear relationship of item to total scale score (scale score: mean item responses per participant or sum score)
Correlation between item score and total score on test
– Total score: Sum score or mean response on all items per test taker
Item discrimination answers the question: Does the item differentiate among test takers varying in their ability / personality trait?
Item Discrimination
Why do some items show high discrimination and others low discrimination?
Consider the following mathematical ability items
How many quarters are in three bushels? a) 12 b) 24
What is 10 times 10? a) 10 b) 100
Both items require the ability to perform multiplication
The first item, however, also requires knowledge of what a bushel is. This kind of knowledge is irrelevant to math ability and therefore induces error variance in item responses
Consequently, this item would have a low item discrimination as it is only weakly related to math ability
Dichotomous items
Rating scales
Point-biserial correlation (i.e., correlation between a dichotomous and a continuous variable)
Pearson correlation
Positive values closer to 1 are desirable
– What do negative discriminations imply?
Check the scoring key => is the item reverse-coded?
Item-total correlations are directly related to reliability
– Because the more each item correlates with the test as a whole, the higher all items correlate with each other
Part-whole corrected item discriminations
Item discriminations tend to be spuriously inflated (biased) because each item is correlated with a total score that includes the item itself => Part of the correlation is simply the item's correlation with itself
This is why we usually interpret part-whole corrected item discriminations (based on correction formulas)
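The part-whole correction can be sketched in a few lines of Python. This is a minimal illustration, not the course's own procedure: the response matrix `data` is made up, the helper names `pearson` and `item_discriminations` are hypothetical, and the correction used here is the simple "correlate the item with the sum of the remaining items" variant rather than a specific correction formula.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def item_discriminations(data):
    """Return (uncorrected, corrected) item-total correlations per item.

    data: one row per test taker, one column per item.
    """
    totals = [sum(row) for row in data]
    results = []
    for j in range(len(data[0])):
        item = [row[j] for row in data]
        # Part-whole correction: remove the item from the total score.
        rest = [t - s for t, s in zip(totals, item)]
        results.append((pearson(item, totals), pearson(item, rest)))
    return results

# Hypothetical data: 6 test takers, 4 dichotomous items (1 = correct).
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

results = item_discriminations(data)
for j, (raw, corrected) in enumerate(results, start=1):
    print(f"item {j}: uncorrected = {raw:.3f}, corrected = {corrected:.3f}")
```

In this toy data set every corrected value is lower than its uncorrected counterpart, which is the inflation the correction is meant to remove. For dichotomous items the Pearson correlation computed here is the point-biserial correlation.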
Point-biserial correlation
Formula
Explanation
At what value of p is r pbis maximized?
Relationship between p-value (difficulty) and item discrimination
=> It is not unusual to see item discrimination values < .30 for very hard or very easy items! These low item discriminations can be a mathematical artifact
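The formula on this card did not survive in these notes. A standard textbook form of the point-biserial correlation, given here as an assumption about what the card showed, makes the dependence on p explicit:

```latex
r_{pbis} = \frac{\bar{X}_1 - \bar{X}_0}{s_X} \sqrt{p\,q}, \qquad q = 1 - p
```

Here \bar{X}_1 and \bar{X}_0 are the mean total scores of test takers who answered the item correctly and incorrectly, s_X is the standard deviation of the total scores (computed with divisor n), and p is the item difficulty. The factor \sqrt{p q} is largest at p = .5, where it equals .5, and shrinks toward 0 as p approaches 0 or 1. For example, p = .9 gives \sqrt{.9 \cdot .1} = .30. With the standardized mean difference held fixed, r_pbis is therefore maximized at p = .5, and very easy or very hard items are pushed toward low discrimination values.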
Guidelines Summary (3)
Consider dropping or revising items with discriminations lower than .30
But be careful: Low item discrimination can be a mathematical artifact of extreme item difficulty!
Not unlikely to see item discrimination values < .30 for very hard or easy items!
Not advised to put too much emphasis on maximizing Cronbach’s alpha (i.e., omitting items from the test to maximize alpha)
Binary [0, 1]
The proportion of people who answered the item correctly (p)
Used with dichotomously scored items
– Correct answer: score = 1
– Incorrect answer: score = 0
Item difficulty a.k.a. p-value (but not to be confused with the p-value of significance tests!)
• Dichotomous items
Mean = p
• Example with n = 5: p = (1+1+0+1+0)/5 = .6
Var(X) = p*q, where q = 1-p
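The n = 5 example can be checked with a short Python sketch (standard library only; the function name `item_difficulty` is made up for illustration). It computes p as the item mean and compares p*q with the variance computed directly from the scores, using the population variance (divisor n), which is the version the p*q identity holds for.

```python
from statistics import mean, pvariance

def item_difficulty(scores):
    """Return (p, p*q) for a dichotomously scored item (0/1 scores)."""
    p = mean(scores)       # proportion of correct answers
    return p, p * (1 - p)  # q = 1 - p

scores = [1, 1, 0, 1, 0]   # the n = 5 example from the notes
p, var = item_difficulty(scores)

print(f"p = {p:.2f}")                                    # item difficulty
print(f"p*q = {var:.2f}")                                # variance via p*q
print(f"population variance = {pvariance(scores):.2f}")  # direct computation
```

Trying other score patterns shows p*q peaking at .25 when p = .5 and dropping to 0 when everyone answers the same way.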
Interpretation of R output
item difficulty
Variability of item scores
Sample size
Guidelines
Should we only choose items of p = .50?
Not necessarily ...
1) When the goal is to select only the very top group of applicants (e.g., admission to university or medical school).
=> Cutoffs may be much higher (e.g., p <= .20)
2) Other institutions want to ensure a minimum level (e.g., a minimum reading level)
=> Cutoffs may be much lower (e.g., p >= .80)
Item Difficulty
Guidelines (4)
High p-values, item is easy; low p-values, item is hard
If p-value = 1 (or 0), everyone answered the question correctly (or incorrectly) and there is no variability in item scores
If the p-value is too low, the item is too difficult and needs revision, or perhaps the test is too long (i.e., not all participants could complete the test)
Good to have a mixture of difficulty in items on test
Rating scale items
The mean of the item responses on the rating scale
Example: Rating scale items with 5-point Likert-scale: “Strongly disagree”(1) to “Strongly agree”(5)