You have the right domains and an individual baseline. Now the hard part: knowing that a low score means impaired — not noise, not a bad night, not "still learning the game." This lesson is the measurement science that separates a real detector from a plausible-looking one, worked through your three tasks.
Your setup, and why this lesson is aimed at it
You track reaction time on every task and compare each person to their own ever-changing baseline —
both excellent choices. But an ever-changing baseline hides a specific failure mode, and your goal ("detect
any impairment") makes one metric — sensitivity — matter more than any other. Let's
make your detector trustworthy on purpose.
Recall first
Which domain is closest to a signature of cannabis?
Time-perception distortion is close to unique to cannabis, which is why a time-estimation task helps you classify the type of impairment. But typing is your nice-to-have; this lesson is about the must-have: detecting impairment at all.
The core question
Every impairment decision is really one question: is today's score far enough below normal that it
can't just be noise? "Normal" is the baseline. "Far enough" is a decision rule. Get either wrong and
you either cry wolf (false alarms) or sleep through the fire (misses). Three things make the answer
trustworthy — a good baseline, a principled threshold, and a task with room to show a drop.
① The baseline — and the "ever-changing" trap
You chose an individual baseline over population norms. Right call:
norms falsely flag naturally slow people and falsely clear naturally fast ones; a personal baseline cancels
the between-person differences you named — ability and task familiarity.[1] But
"ever-changing" introduces two ways the baseline can quietly betray you:
Trap A — Practice absorption
Early sessions are the learning curve. Your tasks are strategic and
learnable, so a user keeps improving for many sessions. If those early, slower sessions feed the
baseline, it lags behind the user's current ability and its spread is inflated by the learning trend.
Later, real impairment only brings them back to an old level, so it reads as "normal."
Trap B — Impairment absorption
If the baseline updates too fast and includes recent impaired sessions, it chases the user downward.
Someone impaired every Friday night slowly teaches the baseline that slow-and-erratic is their
normal — and the detector goes blind to a recurring problem.
Fix A
Let the baseline mature — don't trust it until performance has plateaued past the learning curve
(watch each user's curve flatten), or explicitly model/subtract the practice trend.
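One way to make "watch the curve flatten" concrete is to compare the medians of two adjacent windows of recent sessions and trust the baseline only once they agree. This is a minimal sketch, not a validated rule: the function name `has_plateaued`, the window of 5 sessions, and the 3% tolerance are all illustrative assumptions.

```python
from statistics import median

def has_plateaued(session_rts_ms, window=5, tolerance=0.03):
    """Return True when the last two windows of sessions have similar medians.

    session_rts_ms: one summary RT per session (ms), oldest first.
    window, tolerance: illustrative defaults (assumptions), not validated values.
    """
    if len(session_rts_ms) < 2 * window:
        return False  # not enough sessions to judge the curve yet
    previous = median(session_rts_ms[-2 * window:-window])
    recent = median(session_rts_ms[-window:])
    # Relative change from the previous window to the most recent one.
    change = (previous - recent) / previous
    return abs(change) <= tolerance

# Hypothetical data: a user still learning, then the same user after settling.
learning = [520, 490, 470, 455, 440, 430, 422, 415, 410, 406]
settled = learning + [404, 405, 403, 406, 404, 405, 403, 404, 405, 404]
print(has_plateaued(learning))  # False: still about 12% faster window to window
print(has_plateaued(settled))   # True: the two latest windows match
```

Using medians keeps a single unusually slow or fast session from deciding whether the curve counts as flat.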
Fix B
Update robustly and slowly: use a median or trimmed mean over a rolling window, and exclude
sessions you already flagged as impaired from the baseline. The baseline should represent the
person un-impaired.
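Fix B can be sketched as a rolling median over clean sessions only. The tuple layout, the 20-session window, and the 8-session minimum below are assumptions chosen for illustration, not recommended values.

```python
from statistics import median

def robust_baseline(sessions, window=20, min_sessions=8):
    """Median RT over the most recent sessions that were not flagged as impaired.

    sessions: list of (rt_ms, flagged_impaired) tuples, oldest first.
    Returns None until enough clean sessions exist to trust the estimate.
    window and min_sessions are illustrative assumptions.
    """
    clean = [rt for rt, flagged in sessions if not flagged]
    recent = clean[-window:]  # rolling window over clean sessions only
    if len(recent) < min_sessions:
        return None
    return median(recent)

# Hypothetical history: eight clean sessions and two flagged ones.
history = [
    (402, False), (398, False), (405, False), (400, False), (520, True),
    (399, False), (401, False), (530, True), (403, False), (397, False),
]
print(robust_baseline(history))  # 400.5: the flagged 520 and 530 are ignored
```

Excluding flagged sessions before taking the window is what keeps the baseline from chasing a recurring impaired state. The median additionally limits the pull of any impaired session the detector missed.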
② The decision rule — Reliable Change Index
A baseline is a center point; you also need its spread. The
Reliable Change Index (RCI) asks whether today's change exceeds normal
test-retest noise, by measuring the change in units of the person's own variability.[2]
Roughly:
change score = (today − baseline) ÷ (baseline's own standard deviation)
If that number is further in the impaired direction than a chosen cutoff, the change is unlikely to be noise. Two payoffs for you: (1) it
turns "seems slow today" into a defensible statistical call, and (2) it needs each person's
variability — which means the RT variability you're already able to compute is doing
double duty: it's both a fatigue signal and the denominator of your decision rule. Track and store it
deliberately, not just mean RT.
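The simplified formula above can be sketched in a few lines. Note the assumptions: this follows the lesson's rough version (dividing by the baseline's standard deviation), whereas the classic RCI divides by a standard error of the difference derived from test-retest reliability. The names `change_score` and `exceeds_cutoff` are hypothetical. The default cutoff of 1.645 is the one-sided 5% point of a standard normal, used here only as an illustrative starting value.

```python
from statistics import mean, stdev

def change_score(today, baseline_sessions):
    """Today's deviation from baseline, in units of the baseline's own SD."""
    if len(baseline_sessions) < 2:
        raise ValueError("need at least two baseline sessions to estimate spread")
    spread = stdev(baseline_sessions)
    if spread == 0:
        raise ValueError("baseline has no variability; cannot scale the change")
    return (today - mean(baseline_sessions)) / spread

def exceeds_cutoff(today, baseline_sessions, cutoff=1.645, higher_is_worse=True):
    """Flag a change in the impaired direction (slower RT, or lower accuracy)."""
    z = change_score(today, baseline_sessions)
    return z >= cutoff if higher_is_worse else z <= -cutoff

# Hypothetical baseline of per-session mean RTs in ms.
baseline_rt_ms = [400, 410, 390, 405, 395]
print(round(change_score(430, baseline_rt_ms), 2))  # 3.79
print(exceeds_cutoff(430, baseline_rt_ms))          # True
print(exceeds_cutoff(405, baseline_rt_ms))          # False
```

The `higher_is_worse` switch matters because RT gets worse as it rises while accuracy gets worse as it falls; the same rule covers both.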
Don't average away your best signal
Mean reaction time is the obvious metric — but impairment, especially fatigue, often shows up as
increased variability and lapses before the mean moves.[3]
A user can keep a normal average while their responses become erratic. Compute intra-individual variability
and a lapse count (e.g., responses beyond a threshold) on every task, not just the average.
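A per-session summary that keeps variability and lapses alongside the mean might look like this sketch. The 500 ms lapse threshold is a common convention for the PVT; treating it as the threshold for other tasks is an assumption, as are the function and field names.

```python
from statistics import mean, stdev

def rt_summary(rts_ms, lapse_threshold_ms=500):
    """Summarize one session's trial-level RTs (ms).

    lapse_threshold_ms: assumed threshold; tune it per task.
    """
    m = mean(rts_ms)
    sd = stdev(rts_ms)
    return {
        "mean_rt": m,
        "sd_rt": sd,        # intra-individual variability
        "cv": sd / m,       # variability relative to speed
        "lapses": sum(1 for rt in rts_ms if rt >= lapse_threshold_ms),
    }

# Hypothetical sessions with identical means but very different consistency.
steady = [300, 310, 290, 305, 295, 300, 310, 290]
erratic = [250, 240, 260, 245, 255, 250, 650, 250]
print(rt_summary(steady)["mean_rt"], rt_summary(erratic)["mean_rt"])  # 300 300
print(rt_summary(steady)["lapses"], rt_summary(erratic)["lapses"])    # 0 1
```

Both sessions average 300 ms, so a mean-only detector would treat them as equivalent; the standard deviation and lapse count are what separate them.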
③ Sensitivity vs. specificity — pick your dial
Every threshold trades two errors: sensitivity (catching truly
impaired people) against specificity (not falsely flagging sober
ones). You can't max both — moving the cutoff to catch more impaired people also flags more sober ones.
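The trade-off is easy to see by sweeping a cutoff over labelled sessions. This sketch assumes you have some ground-truth labels (sessions known to be sober or impaired), which the lesson does not specify how to obtain; the scores and labels below are made up for illustration.

```python
def sensitivity_specificity(scores, impaired, cutoff):
    """Compute (sensitivity, specificity) for one cutoff.

    scores: change scores where larger means more impaired.
    impaired: ground-truth booleans for the same sessions (assumed available).
    """
    pairs = list(zip(scores, impaired))
    tp = sum(1 for s, y in pairs if y and s >= cutoff)
    fn = sum(1 for s, y in pairs if y and s < cutoff)
    tn = sum(1 for s, y in pairs if not y and s < cutoff)
    fp = sum(1 for s, y in pairs if not y and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labelled sessions: five sober, five impaired.
scores = [0.2, 0.8, 1.2, 1.7, 0.5, 1.4, 2.1, 2.8, 1.0, 3.2]
impaired = [False, False, False, False, False, True, True, True, True, True]

print(sensitivity_specificity(scores, impaired, cutoff=1.0))  # (1.0, 0.6)
print(sensitivity_specificity(scores, impaired, cutoff=2.0))  # (0.6, 1.0)
```

In this toy data the looser cutoff catches every impaired session but flags two sober ones, while the stricter cutoff clears every sober session and misses two impaired ones.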
Your stated goal — detect impairment of any kind — is a decision to prioritize
sensitivity: a missed impaired user is worse than a second look at a sober one. So set the RCI
cutoff looser, and consider an "any task flags it" (OR) rule across your three tasks rather
than requiring all three. The cost is more false positives — which you manage with a cheap confirmatory retest,
not by tightening until you start missing real impairment.
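The OR rule plus a confirmatory retest could be wired together roughly as follows. The task keys, the cutoff of 1.5, and the three outcome labels are hypothetical; the sketch assumes each task has already been converted to a change score where larger means more impaired.

```python
def any_task_flags(task_scores, cutoff=1.5):
    """OR rule: a single task past the cutoff is enough to flag the session."""
    return any(z >= cutoff for z in task_scores.values())

def decide(first_attempt, retest=None, cutoff=1.5):
    """Return 'clear', 'retest', or 'impaired' (labels are assumptions).

    first_attempt, retest: dicts mapping task name to change score.
    """
    if not any_task_flags(first_attempt, cutoff):
        return "clear"
    if retest is None:
        return "retest"  # cheap second look before committing to a call
    return "impaired" if any_task_flags(retest, cutoff) else "clear"

# Hypothetical sessions across three tasks.
sober = {"task_1": 0.3, "task_2": -0.2, "task_3": 0.9}
flagged = {"task_1": 0.4, "task_2": 2.1, "task_3": 0.7}

print(decide(sober))                      # clear
print(decide(flagged))                    # retest
print(decide(flagged, retest=sober))      # clear
print(decide(flagged, retest=flagged))    # impaired
```

The retest earns its place because the OR rule compounds false alarms. If each task alone falsely flagged a sober user with probability p, and the tasks were independent, the combined false-alarm probability would be 1 − (1 − p)³, roughly 14% for p = 5%.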
④ Ceiling & practice — the task-design guardrails
Two effects can silently zero out sensitivity, and both bite game-like tasks hardest:
Ceiling effect. If sober users score ~100%, there's no room for impairment to show. Watch
Task 3 (memorize 3 shapes): three items is below normal memory span, so accuracy may sit at
the ceiling and detect nothing. Your escape hatch is exactly your strength — reaction time still has
headroom even when accuracy is perfect, so score the speed and variability of the recall,
not just whether it was right. (Or raise load to 4–5 items to reintroduce accuracy signal.)
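A quick check for a ceiling effect is the share of sober sessions that already hit the maximum score. This is a rough sketch; the function name and the idea of reviewing a task once most sessions sit at the maximum are assumptions, not an established criterion.

```python
def ceiling_rate(accuracies, max_score=1.0):
    """Fraction of sessions scoring at the maximum possible accuracy."""
    if not accuracies:
        raise ValueError("no sessions to evaluate")
    return sum(1 for a in accuracies if a >= max_score) / len(accuracies)

# Hypothetical sober-session accuracies on a three-item memory task.
task3_accuracy = [1.0, 1.0, 1.0, 0.67, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(ceiling_rate(task3_accuracy))  # 0.9
```

A value this high suggests accuracy alone has little room to drop, which is the cue to score RT and variability on that task or to raise the load.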
Practice effect. Learnable tasks keep improving, masking impairment and (per Trap A above)
corrupting the baseline. The Psychomotor Vigilance Task (PVT) is the gold standard largely because it barely improves with
practice;[3] your richer tasks won't share that gift, so you must handle the
learning curve explicitly rather than assume it away.
On your time budget
You want the whole thing under ~2 minutes on a phone. Good news: brief tasks can work — a 3-minute
PVT-B retains much of the full PVT's sensitivity to sleep loss.[4]
The honest caveat: shortening is a real trade-off — at least one study found a 3-minute version diverging from
the 10-minute reference under some conditions.[5] So if you add a vigilance task,
validate your short version against a longer one rather than assuming the sensitivity carries over.
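A first-pass comparison of a short version against a longer reference could look at how well paired scores track each other and whether the short version is systematically off. This sketch assumes the same people complete both versions under the same conditions; the data are invented, and a correlation plus a mean difference is only a starting point, not a full validity study.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between paired measurements."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def mean_bias(short_scores, long_scores):
    """Average (short - long) difference; nonzero means a systematic offset."""
    return mean(s - l for s, l in zip(short_scores, long_scores))

# Hypothetical paired lapse counts: short version vs. longer reference.
short_lapses = [1, 2, 3]
long_lapses = [2, 4, 6]
print(round(pearson_r(short_lapses, long_lapses), 3))  # 1.0
print(mean_bias(short_lapses, long_lapses))            # -2
```

The toy data show why both numbers are worth reporting: the versions correlate perfectly, yet the short one consistently records fewer lapses, so a cutoff tuned on the long version would not transfer unchanged.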
Check yourself
An ever-changing baseline that updates too fast mainly risks...
If recent impaired sessions feed the baseline, it drifts toward the impaired state and stops flagging it. Fix: slow/robust updates and exclude flagged sessions.
For a "detect any impairment" goal, you should tune the threshold toward...
Catching every impaired person is the priority, so accept more false alarms and clean them up with a quick confirmatory retest.
A memory task's accuracy is always perfect. You can still detect impairment by...
When accuracy hits the ceiling, RT and its variability still have room to move — which is why tracking RT everywhere (as you do) rescues a ceiling-prone task.
Your single win
You can now state what makes your detector trustworthy: a baseline that has matured past the learning
curve and excludes impaired sessions; a decision rule (RCI) that judges today's drop
in units of the person's own variability; a threshold tuned toward sensitivity; and tasks
kept off the ceiling by scoring RT, not just accuracy. Every one of those hangs on the RT-variability
you're already positioned to capture — so the biggest immediate lever is to store and use variability
and lapse counts, not just mean RT.
I'm your teacher — ask me anything. Want to work out a concrete RCI cutoff, design
the rule for "how many sessions before a baseline is trusted," or spec a <40-second vigilance task that fits
your 2-minute budget? Bring it to the chat.
References
[1] Evidence for Added Value of Baseline Testing in Computer-Based Cognitive Assessment, PMC (2013).
[2] Reliable Change on Neuropsychological Tests in the Uniform Data Set, PMC (2016).
[3] Basner & Dinges, Maximizing Sensitivity of the PVT to Sleep Loss, SLEEP (2011); Van Dongen et al. (2003).
[4] Basner, Mollicone & Dinges, Validity and Sensitivity of a Brief PVT (PVT-B) (2011).
[5] The 3-Minute PVT Demonstrates Inadequate Convergent Validity…, Frontiers in Neuroscience (2022).