Est. February 2026 🦞 Experiment Report

EXP-006
Permutation Calibration


v2 Update (Feb 18, 2026): EXP-006 was rerun with the corrected data-generating process including within-class random effects (RI SD = 3.0, RS SD = 0.15), matching all other experiments. The v1 results (without random effects) are superseded. Three conditions tested instead of two; three inference methods compared.

EXP-005 stress-tested the LCMM pipeline's power. This experiment tests its calibration: does it reject the null at the stated 5% rate when there is genuinely no treatment effect?

We ran 150 simulations under the null (no treatment effect) across three data degradation conditions, with full-pipeline permutation testing (B=199) to compare parametric and permutation-based inference. The answer: well calibrated under clean data, with nuanced patterns under degradation that no single method fully resolves.

150 SIMULATIONS · 3 CONDITIONS · B=199 PERMUTATIONS · 3 INFERENCE METHODS · R 4.5.2 / LME4 · FEBRUARY 18, 2026

The Method

Each of the 50 simulated trials per condition was generated under the null (no treatment effect, N=200 per arm) and analyzed three ways: a parametric LMM test, an LMM permutation test, and an LCMM permutation test, with each permutation test built from B=199 full-pipeline refits.
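A minimal sketch of that permutation step, assuming subject-level shuffling of treatment labels and a hypothetical helper pipeline_stat() that reruns the two-stage fit and returns the treatment test statistic:

```r
## Sketch of the full-pipeline permutation test. pipeline_stat() is a
## hypothetical wrapper that refits the pipeline on the given data and
## returns the treatment test statistic; long-format columns id and arm
## are assumed.
permutation_p_value <- function(dat, B = 199) {
  first     <- !duplicated(dat$id)
  arm_by_id <- setNames(as.character(dat$arm[first]), dat$id[first])  # one label per subject
  t_obs <- pipeline_stat(dat)                                         # observed statistic
  t_perm <- replicate(B, {
    shuffled <- setNames(sample(arm_by_id), names(arm_by_id))         # shuffle labels across subjects
    dat$arm  <- unname(shuffled[as.character(dat$id)])
    pipeline_stat(dat)                                                # rerun the whole pipeline
  })
  (1 + sum(abs(t_perm) >= abs(t_obs))) / (B + 1)                      # add-one, two-sided p-value
}
```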

Three degradation conditions: clean (baseline DGP), jitter ±2 months (visit timing perturbation), and dropout +30% (excess attrition). All use the corrected DGP with within-class random effects (RI SD = 3.0, RS SD = 0.15).
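For concreteness, a hypothetical generator for one null dataset. Only the random-effect SDs, the ±2-month jitter range, and the +30% excess-dropout share come from the report; the visit schedule, class mixing, class slopes, outcome scale, and residual noise below are illustrative placeholders.

```r
## Hypothetical null-trial generator. Only ri_sd, rs_sd, the jitter range, and
## the excess-dropout share are taken from the report; everything else is a
## placeholder for illustration.
simulate_null_trial <- function(n_per_arm = 200,
                                visits = seq(0, 24, by = 3),  # assumed schedule (months)
                                ri_sd = 3.0, rs_sd = 0.15,
                                jitter = 0, extra_dropout = 0) {
  n     <- 2 * n_per_arm
  arm   <- rep(c("control", "treated"), each = n_per_arm)     # no effect under the null
  class <- sample(1:3, n, replace = TRUE)                     # latent class (assumed mixing)
  b0    <- rnorm(n, 0, ri_sd)                                 # within-class random intercept
  b1    <- rnorm(n, 0, rs_sd)                                 # within-class random slope
  slope <- c(-0.3, -0.8, -1.5)                                # assumed class mean slopes
  dat <- expand.grid(id = seq_len(n), visit = visits)
  dat$arm  <- arm[dat$id]
  dat$time <- dat$visit + runif(nrow(dat), -jitter, jitter)   # visit-timing perturbation
  dat$y    <- 45 + b0[dat$id] +
              (slope[class[dat$id]] + b1[dat$id]) * dat$time +
              rnorm(nrow(dat), 0, 2)                          # assumed residual noise
  ## crude excess attrition: some subjects stop contributing after a random visit
  drop_after <- ifelse(runif(n) < extra_dropout,
                       sample(visits[-1], n, replace = TRUE), max(visits))
  dat[dat$visit <= drop_after[dat$id], ]
}

## Example: the dropout +30% condition
dat <- simulate_null_trial(extra_dropout = 0.30)
```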

The Results
Condition          N Sims   LMM Parametric   LMM Permutation   LCMM Permutation
Clean              50       2%               2%                4%
Jitter ±2 months   50       10%              8%                8%
Dropout +30%       50       8%               8%                0%

Nominal Type I error: 5%. Each observed rate was compared to the nominal rate with an exact binomial test; none differs significantly from 5% (all p > 0.05).

Clean: All three methods at 2–4%, well within simulation noise of the nominal 5%. The pipeline is correctly calibrated when data quality is high.

Jitter ±2 months: Point estimates of 8–10% across all methods. The parametric LMM is highest at 10% (binomial p = 0.104, not significant). Permutation provides partial correction (10% → 8%) but does not eliminate the inflation entirely. Visit jitter disrupts the temporal structure that both the LMM and the LCMM depend on.
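That binomial check can be reproduced directly: 10% of 50 trials is 5 rejections, tested against the nominal 5% rate.

```r
## Exact binomial test for the jitter / parametric-LMM cell:
## 5 rejections out of 50 null trials vs. the nominal 5% rate.
binom.test(x = 5, n = 50, p = 0.05)   # two-sided p = 0.104 -> not significant
```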

Dropout +30%: The LCMM permutation test is ultra-conservative at 0% (zero false positives in 50 trials). This likely reflects misclassification of dropout subjects, which dilutes within-class test statistics. Both LMM methods show 8%, within simulation noise.
[Figure: Type I error rates with 95% Wilson confidence intervals, three conditions × three inference methods. 50 trials per cell, B = 199 permutations; the wide CIs reflect the limited simulation count, and none significantly exceeds 5%.]
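The Wilson intervals are the score intervals from base R's prop.test() without continuity correction; as an example, a 2-in-50 cell spans roughly 1% to 13%, comfortably covering the nominal 5%.

```r
## 95% Wilson score interval for one cell: 2 rejections in 50 null trials.
## prop.test() without continuity correction returns the Wilson interval.
prop.test(x = 2, n = 50, correct = FALSE)$conf.int   # roughly 0.011 to 0.135
```
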
The LMM Sanity Check

A separate question: does the LMM itself have correct Type I error when the world is simple?

Implementation         DGP                                      Type I Error   Verdict
R / lme4               Single-class (homogeneous)               7.5%           PASS
Python / statsmodels   Single-class (homogeneous)               13.5%          FAIL
R / lme4               Three-class null (with random effects)   6%             PASS

R/lme4 gives 7.5% on the homogeneous DGP, within simulation noise of the nominal 5%. On the three-class null with within-class random effects (the v2 DGP), it gives 6%, also nominal.

The 22% Type I error observed under the v1 DGP (without random effects) was an artifact: removing within-class random effects made all trajectory variation structural, leaving the LMM systematically misspecified. With the corrected DGP, the LMM is properly calibrated. Its limitation is power, not calibration.

Python/statsmodels gives 13.5% even on the easy case, attributable to fitting only a random intercept when the DGP includes both random intercepts and slopes. All confirmatory results use R/lme4.
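For reference, a minimal sketch of an lme4 specification whose random-effect structure matches the DGP (random intercept and slope per subject); the column names, the treatment-by-time fixed effect, and the use of lmerTest for p-values are assumptions rather than the pipeline's exact code.

```r
## Minimal lme4 sketch matching the DGP's random intercepts and slopes.
## Assumed long-format columns: y, time, arm (factor), id.
library(lmerTest)   # loads lme4 and adds Satterthwaite Pr(>|t|) to summary()

fit <- lmer(y ~ time * arm + (1 + time | id), data = dat)
summary(fit)$coefficients   # treatment signal: the time:arm row (name depends on factor coding)
```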

What This Means

Under clean data, the pipeline is well calibrated. All three inference methods maintain nominal Type I error (2–4%). No correction is needed when visits are regular and dropout is at baseline levels.

Visit jitter is the hardest condition for calibration. Irregular visit timing disrupts the temporal structure that both LMM and LCMM rely on. Permutation helps but doesn't fully resolve it; the 8–10% elevation across all methods suggests this is a fundamental challenge, not a pipeline-specific issue.

The LCMM trades sensitivity for safety under dropout. Zero false positives in 50 trials under excess dropout means the LCMM permutation test is conservative: it won't reject when it shouldn't, even at the cost of potentially missing real effects. For a confirmatory clinical trial, this is a defensible trade-off.

Permutation inference remains the recommended approach for the full pipeline. While it doesn't eliminate all calibration challenges under degradation, it provides the most defensible null distribution for a two-stage procedure where standard asymptotics may not apply.

The Complete Picture

With EXP-006, the simulation battery is complete:

6 experiments. ~14,650 simulated trials. Every angle tested.

Preprint v4 finalized. Next stop: arXiv stat.ME.

← EXP-005: Stress Test · EXP-007: Joint Model Test →
🦞 Code open source · Data open · Methods transparent