EXP-005 stress-tested the LCMM pipeline's power. This experiment tests its calibration: does it reject the null at the stated 5% rate when there is genuinely no treatment effect?
We ran 150 simulations under the null (no treatment effect) across three data degradation conditions, with full-pipeline permutation testing (B=199) to compare parametric and permutation-based inference. The answer: well calibrated under clean data, with nuanced patterns under degradation that no single method fully resolves.
150 SIMULATIONS · 3 CONDITIONS · B=199 PERMUTATIONS · 3 INFERENCE METHODS · R 4.5.2 / lme4 · FEBRUARY 18, 2026
Each condition comprises 50 simulated trials under the null (no treatment effect, N=200 per arm), each run through the full pipeline with all three inference methods.
Three degradation conditions: clean (baseline DGP), jitter ±2 months (visit-timing perturbation), and dropout +30% (excess attrition). All use the corrected DGP with within-class random effects (random-intercept SD = 3.0, random-slope SD = 0.15).
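The degradation machinery can be sketched as follows. This is a minimal illustration, not the experiment's actual simulation code: it collapses the three-class DGP to a single class for brevity, and the baseline mean, visit schedule, and residual SD are hypothetical; only N=200 per arm, the RI/RS SDs, the ±2-month jitter, and the +30% excess dropout come from the writeup.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_null_trial(n_per_arm=200, visits=(0, 6, 12, 18, 24),
                        ri_sd=3.0, rs_sd=0.15,
                        jitter_months=0.0, extra_dropout=0.0):
    """Simulate one null trial: the treatment label carries no effect,
    so both arms share the same outcome distribution."""
    rows = []
    for subj in range(2 * n_per_arm):
        arm = subj % 2                        # label only; no effect under the null
        intercept = rng.normal(20.0, ri_sd)   # within-class random intercept
        slope = rng.normal(-0.5, rs_sd)       # within-class random slope
        n_obs = len(visits)
        if rng.random() < extra_dropout:      # excess attrition: truncate follow-up
            n_obs = rng.integers(1, len(visits))
        for t in visits[:n_obs]:
            # perturb every post-baseline visit time by up to +/- jitter_months
            t_obs = t + rng.uniform(-jitter_months, jitter_months) if t > 0 else t
            y = intercept + slope * t_obs + rng.normal(0.0, 1.0)
            rows.append((subj, arm, t_obs, y))
    return np.array(rows)

clean = simulate_null_trial()
jittered = simulate_null_trial(jitter_months=2.0)
dropout = simulate_null_trial(extra_dropout=0.30)
```

Each simulated dataset then feeds the same downstream pipeline, so any rejection is by construction a false positive.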
| Condition | N Sims | LMM Parametric | LMM Permutation | LCMM Permutation |
|---|---|---|---|---|
| Clean | 50 | 2% | 2% | 4% |
| Jitter ±2 months | 50 | 10% | 8% | 8% |
| Dropout +30% | 50 | 8% | 8% | 0% |
Nominal Type I error: 5%. All rates tested via exact binomial test (none significant at p < 0.05).
A separate question: does the LMM itself have correct Type I error when the world is simple?
| Implementation | DGP | Type I Error | Verdict |
|---|---|---|---|
| R / lme4 | Single-class (homogeneous) | 7.5% | PASS |
| Python / statsmodels | Single-class (homogeneous) | 13.5% | FAIL |
| R / lme4 | Three-class null (with random effects) | 6% | PASS |
R/lme4 gives 7.5% on the homogeneous DGP, within simulation noise of the nominal 5%. On the three-class null with within-class random effects (the v2 DGP), it gives 6%, also consistent with nominal.
The 22% Type I error observed under the v1 DGP (without random effects) was an artifact: removing within-class random effects made all trajectory variation structural, leaving the LMM systematically misspecified. With the corrected DGP, the LMM is properly calibrated. Its limitation is power, not calibration.
Python/statsmodels gives 13.5% even on the easy case, attributable to fitting only a random intercept when the DGP includes both random intercepts and slopes. All confirmatory results use R/lme4.
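The random-effects gap is visible directly in the model specification. Below is a sketch, not the experiment's code: it uses a hypothetical single-class DGP with the stated RI/RS SDs, and contrasts statsmodels' default random-intercept fit with one that adds a random slope via `re_formula` (the real statsmodels mechanism for this). Omitting the random slope shrinks the standard error of the treatment-by-time interaction, which is exactly the anti-conservative direction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for i in range(100):
    arm = i % 2
    b0 = rng.normal(20.0, 3.0)    # random intercept, SD 3.0
    b1 = rng.normal(-0.5, 0.15)   # random slope, SD 0.15
    for t in (0.0, 6.0, 12.0, 18.0, 24.0):
        rows.append((i, arm, t, b0 + b1 * t + rng.normal(0.0, 1.0)))
df = pd.DataFrame(rows, columns=["id", "arm", "time", "y"])

# Random intercept only: ignores between-subject slope variation
m_ri = smf.mixedlm("y ~ time * arm", df, groups=df["id"]).fit()
# Random intercept + slope: matches the DGP's random-effects structure
m_rs = smf.mixedlm("y ~ time * arm", df, groups=df["id"],
                   re_formula="~time").fit()

print(m_ri.bse["time:arm"], m_rs.bse["time:arm"])
```

The intercept-only fit reports a smaller standard error for `time:arm`, understating uncertainty about exactly the slope contrast a trial analysis cares about.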
Under clean data, the pipeline is well calibrated. All three inference methods maintain nominal Type I error (2–4%). No correction is needed when visits are regular and dropout is at baseline levels.
Visit jitter is the hardest condition for calibration. Irregular visit timing disrupts the temporal structure that both LMM and LCMM rely on. Permutation helps but doesn't fully resolve it: the 8–10% elevation across all methods suggests this is a fundamental challenge, not a pipeline-specific issue.
The LCMM trades sensitivity for safety under dropout. Zero false positives in 50 trials under excess dropout means the LCMM permutation test is conservative โ it won't reject when it shouldn't, even at the cost of potentially missing real effects. For a confirmatory clinical trial, this is a defensible trade-off.
Permutation inference remains the recommended approach for the full pipeline. While it doesn't eliminate all calibration challenges under degradation, it provides the most defensible null distribution for a two-stage procedure where standard asymptotics may not apply.
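The full-pipeline permutation scheme, at B=199 as in the experiment, can be sketched generically. The statistic below (difference in arm means) is a stand-in for illustration; in the actual pipeline the statistic would be the entire two-stage LCMM fit and treatment contrast, re-run on each relabeled dataset so the null distribution absorbs both stages.

```python
import numpy as np

rng = np.random.default_rng(7)

def permutation_pvalue(stat_fn, y, arm, B=199):
    """Permute treatment labels B times, re-running the full analysis
    each time; p uses the add-one estimator, which is valid at finite B."""
    observed = abs(stat_fn(y, arm))
    exceed = sum(abs(stat_fn(y, rng.permutation(arm))) >= observed
                 for _ in range(B))
    return (1 + exceed) / (B + 1)

# Stand-in statistic: difference in arm means (hypothetical placeholder
# for the pipeline's LCMM-based treatment contrast)
diff_means = lambda y, arm: y[arm == 1].mean() - y[arm == 0].mean()

y = rng.normal(size=400)          # null data: no treatment effect
arm = np.repeat([0, 1], 200)
p = permutation_pvalue(diff_means, y, arm)
print(p)
```

Because the labels are exchangeable under the null, this p-value is valid regardless of whether the two-stage procedure's parametric standard errors are, which is the core argument for using it as the confirmatory method.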
With EXP-006, the simulation battery is complete:
6 experiments. ~14,650 simulated trials. Every angle tested.
Preprint v4 finalized. Next stop: arXiv stat.ME.