Real clinical data is messy: irregular visit schedules, rater variability, extreme dropout, missing observations. Can the LCMM-Soft pipeline survive it all? 1,100 simulations across 11 stress conditions reveal a 7.5-fold power advantage, along with genuine vulnerabilities at the extremes.
Simulations are only as convincing as their assumptions. Our previous experiments (EXP-001 through EXP-004) used clean, well-behaved data: perfectly timed visits, consistent measurement, predictable dropout. Real ALS trials look nothing like that. Patients miss visits. Raters disagree. People drop out in droves.
This experiment throws everything at the pipeline: visit timing jitter of up to ±2 months, rater noise that doubles total measurement error, dropout rates that eliminate half the sample, 40% randomly missing visits, and the worst-case combination of all of these at once (sketched in code below).
The result: LCMM-Soft achieves 76–100% power across most conditions (90% on clean data), with Type I error controlled at 0–6% everywhere. The standard LMM maintains nominal Type I error but achieves only 8–22% power, a 7.5-fold deficit on clean data. Two conditions genuinely hurt the pipeline: severe rater noise (48% power) and combined severe degradation (22%). The LMM isn't broken; it's blind.
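As a rough illustration, these stressors reduce to a handful of knobs applied to each simulated data set. A minimal sketch follows, assuming a long-format data frame with one row per subject-visit and columns `id`, `month`, and `alsfrs` (the data-generating process itself is sketched in the next section); the function, its arguments, and the dropout mechanism are illustrative rather than the experiment's actual code. In particular, dropout here uses a random cut-off visit, whereas the experiment's dropout appears to be informative (see Finding 4).

```r
# Illustrative stressor knobs (not the experiment's actual implementation):
# jitter_mo = visit-timing jitter in months, rater_sd = extra measurement noise,
# extra_dropout = fraction of subjects censored early, miss_prob = fraction of
# post-baseline visits dropped at random.
apply_stress <- function(long, jitter_mo = 0, rater_sd = 0,
                         extra_dropout = 0, miss_prob = 0, seed = 1) {
  set.seed(seed)

  # Visit-timing jitter: shift each post-baseline visit by up to +/- jitter_mo
  post <- long$month > 0
  long$month[post] <- long$month[post] + runif(sum(post), -jitter_mo, jitter_mo)

  # Rater noise: extra measurement error on every observation
  long$alsfrs <- long$alsfrs + rnorm(nrow(long), 0, rater_sd)

  # Extra dropout: censor a random fraction of subjects after a random visit
  ids      <- unique(long$id)
  dropouts <- sample(ids, round(extra_dropout * length(ids)))
  cutoff   <- setNames(sample(c(3, 6, 9), length(dropouts), replace = TRUE),
                       dropouts)
  censored <- long$id %in% dropouts &
              long$month > cutoff[as.character(long$id)]
  long <- long[!censored, ]

  # Missing visits: drop a random fraction of remaining post-baseline rows
  keep <- long$month == 0 | runif(nrow(long)) > miss_prob
  long[keep, ]
}
```

The eleven conditions in the tables below correspond to settings of these knobs; combined severe, for example, applies the worst level of each at once (±2-month jitter, rater SD = 5, +50% dropout, 40% missing visits).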
Data-Generating Process. Three-class ALS trajectory model with within-class random effects (random-intercept SD = 3.0, random-slope SD = 0.15). N = 200 per arm, 5 visits over 12 months. Class proportions: 40% slow, 35% fast, 25% crash progressors. Two scenarios: null (no treatment effect) and class-specific (50% slowing of decline in slow progressors only).
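A minimal simulation sketch under these settings. The per-class mean slopes, baseline score, and residual SD below are illustrative placeholders (the experiment's exact values aren't reproduced here); the sample size, visit schedule, class proportions, random-effect SDs, and slow-class-only treatment effect follow the description above.

```r
# Sketch of the three-class DGP. Values marked "assumed" are illustrative.
simulate_trial <- function(slowing_in_slow = 0.5, seed = 1) {
  set.seed(seed)
  n_per_arm <- 200
  visits    <- c(0, 3, 6, 9, 12)                          # months
  props     <- c(slow = 0.40, fast = 0.35, crash = 0.25)
  slopes    <- c(slow = -0.4, fast = -1.2, crash = -2.5)  # assumed pts/month
  base      <- 40                                          # assumed baseline score
  resid_sd  <- 2                                           # assumed residual SD

  n     <- 2 * n_per_arm
  arm   <- rep(c(0, 1), each = n_per_arm)                  # 0 = placebo, 1 = active
  class <- sample(names(props), n, replace = TRUE, prob = props)

  # Within-class random effects: RI SD = 3.0, RS SD = 0.15
  b0 <- rnorm(n, 0, 3.0)
  b1 <- rnorm(n, 0, 0.15)

  # Treatment slows decline only in slow progressors (null scenario: set to 0)
  slow_factor <- 1 - slowing_in_slow * arm * (class == "slow")
  slope_i     <- slopes[class] * slow_factor + b1

  long <- expand.grid(id = seq_len(n), month = visits)
  long$arm    <- arm[long$id]
  long$alsfrs <- base + b0[long$id] + slope_i[long$id] * long$month +
                 rnorm(nrow(long), 0, resid_sd)
  long
}
```

At 1,100 simulations over 11 conditions and two scenarios, that works out to 50 replicates per cell.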
Implementation: Entirely in R 4.5.2 using lme4 for LMM fitting and vectorized EM for LCMM estimation. M=5 pseudo-class draws (computational shortcut; the full pipeline specifies M=20).
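For concreteness, a sketch of the two analysis arms on one simulated data set `long`. It assumes a matrix `post` of per-subject posterior class probabilities produced by the (not shown) EM step, with one row per subject (rownames = subject IDs) and one column per class; the formulas and the `pseudo_class_fits` helper are illustrative, not the pipeline's exact code.

```r
library(lme4)

# Standard LMM: one population-averaged treatment-by-time effect
lmm <- lmer(alsfrs ~ month * arm + (1 + month | id), data = long, REML = FALSE)

# LCMM-Soft pseudo-class step: draw hard class labels from the posterior M times,
# refit with class-specific treatment-by-time terms, then pool across draws.
pseudo_class_fits <- function(long, post, M = 5) {
  lapply(seq_len(M), function(m) {
    draw <- apply(post, 1, function(p) sample(colnames(post), 1, prob = p))
    long$pclass <- factor(draw[as.character(long$id)])
    lmer(alsfrs ~ month * arm * pclass + (1 + month | id),
         data = long, REML = FALSE)
  })
}
```

The class-specific treatment-by-time terms from the M fits would then be pooled (e.g., Rubin-style) into a single test; M = 5 keeps this cheap at the cost of extra Monte Carlo noise relative to the full pipeline's M = 20.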
Under clean conditions, LCMM-Soft detects the class-specific treatment effect 90% of the time. The LMM detects it 12% of the time. The standard method isn't broken; it simply can't see treatment effects that are concentrated in a subgroup.
LCMM-Soft results: Type I error controlled at 0–6% across all conditions; power 76–100% across most conditions, with two genuine vulnerabilities.
| Stress Condition | LCMM-Soft Power | LCMM-Soft Type I Error |
|---|---|---|
| Clean | 90% | 2% |
| Jitter ยฑ1mo | 100% | 4% |
| Jitter ยฑ2mo | 92% | 4% |
| Rater noise SD=2 | 88% | 2% |
| Rater noise SD=5 | 48% | 0% |
| Dropout +30% | 86% | 0% |
| Dropout +50% | 84% | 6% |
| Missing 20% | 84% | 2% |
| Missing 40% | 86% | 2% |
| Combined mild | 76% | 4% |
| Combined severe | 22% | 2% |
LMM results: Type I error nominal at 2–6%, with one exception (dropout +30%, at 14%); power severely limited at 8–22%.
| Stress Condition | LMM Power | LMM Type I Error |
|---|---|---|
| Clean | 12% | 6% |
| Jitter ยฑ1mo | 14% | 2% |
| Jitter ยฑ2mo | 16% | 6% |
| Rater noise SD=2 | 22% | 2% |
| Rater noise SD=5 | 12% | 4% |
| Dropout +30% | 22% | 14% |
| Dropout +50% | 16% | 4% |
| Missing 20% | 18% | 6% |
| Missing 40% | 14% | 4% |
| Combined mild | 12% | 4% |
| Combined severe | 8% | 4% |
Finding 1: The LMM is blind, not broken. With the corrected DGP (including within-class random effects), the LMM maintains nominal Type I error (2–6%) across most conditions. Its problem isn't that it gives wrong answers about nothing; it's that it can't see treatment effects concentrated in a subgroup. A drug that helps 40% of patients looks like noise to the LMM.
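A back-of-envelope illustration of that dilution, using hypothetical per-class decline rates rather than the experiment's exact values:

```r
# Hypothetical per-class decline rates (points/month), illustrative only
slopes <- c(slow = -0.4, fast = -1.2, crash = -2.5)
props  <- c(slow = 0.40, fast = 0.35, crash = 0.25)

# A 50% slowing in slow progressors only
effect_by_class <- c(slow = 0.5 * 0.4, fast = 0, crash = 0)   # +0.20 in slow class

# The population-averaged effect the LMM estimates...
marginal_effect <- sum(props * effect_by_class)   # 0.08 points/month
# ...against the mean decline it must be detected over
mean_decline    <- sum(props * slopes)            # about -1.2 points/month
```

Under these illustrative numbers the population-averaged effect is about 0.08 points/month against a mean decline of roughly 1.2 points/month, exactly the kind of signal that washes out when slopes vary this much between classes; the LCMM route estimates the slow-class effect directly instead.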
Finding 2: LCMM-Soft delivers 7.5× more power on clean data. 90% vs 12% power for class-specific effects. This advantage is maintained across moderate degradation: 100% under mild jitter, 92% under severe jitter, 88% with moderate rater noise, 84–86% under dropout and missing data, and 76% under combined mild degradation.
Finding 3: Two genuine vulnerabilities exist. Severe rater noise (SD=5, nearly doubling total measurement error) drops LCMM power to 48%. Combined severe degradation (jitter + noise + dropout + missingness simultaneously) drops it to 22%. In both cases, the degradation obscures the class structure that LCMM relies on. Even here, LCMM still matches or exceeds the LMM (48% vs 12% for rater SD=5; 22% vs 8% for combined severe).
Finding 4: Type I error is universally controlled. LCMM-Soft maintains 0–6% Type I error across all eleven conditions; none exceeds the nominal rate by more than simulation noise. The one LMM condition that raises concern is dropout +30% (14%), likely driven by informative censoring interacting with the averaging across heterogeneous classes.
The LCMM-Soft pipeline doesn't just work under ideal conditions; it works under the full spectrum of degradation that multi-site ALS trials produce. Its power advantage narrows under extreme conditions but never disappears. The standard LMM isn't wrong; it's answering a different question. When treatment effects are heterogeneous, the LMM's question gives the wrong answer 88% of the time.
This experiment addresses the "clean data" objection. The LCMM-Soft pipeline isn't a fragile theoretical construct. It survives visit jitter, rater noise, catastrophic dropout, and massive missingness. Its advantages are real and robust.
The honest finding is also the most interesting one: the LMM isn't anti-conservative under the corrected DGP; it's properly calibrated but underpowered. This means ALS trials using standard analysis aren't producing false positives at alarming rates. They're producing false negatives. Treatments that work for subgroups are being declared failures because the signal is diluted across a heterogeneous population.
That's the cost of linearity. Not wrong answers, but missed answers.