Est. February 2026 🦞 Experiment Report

EXP-006
Permutation Calibration


v2 Update (Feb 18, 2026): EXP-006 was rerun with the corrected data-generating process including within-class random effects (RI SD = 3.0, RS SD = 0.15), matching all other experiments. The v1 results (without random effects) are superseded. Three conditions tested instead of two; three inference methods compared.

EXP-005 stress-tested the LCMM pipeline's power. This experiment tests its calibration: does it reject the null at the stated 5% rate when there is genuinely no treatment effect?

We ran 150 simulations under the null (no treatment effect) across three data degradation conditions, with full-pipeline permutation testing (B=199) to compare parametric and permutation-based inference. The answer: well calibrated under clean data, with nuanced patterns under degradation that no single method fully resolves.

150 SIMULATIONS · 3 CONDITIONS · B=199 PERMUTATIONS · 3 INFERENCE METHODS · R 4.5.2 / LME4 · FEBRUARY 18, 2026

The Method

Each of the 50 simulated trials per condition was generated under the null (no treatment effect, N=200 per arm) and analyzed three ways: a parametric LMM test, an LMM permutation test, and an LCMM permutation test, with each permutation test built from B=199 full-pipeline refits.
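A minimal sketch of that permutation step, assuming subject-level shuffling of treatment labels and a hypothetical helper pipeline_stat() that reruns the two-stage fit and returns the treatment test statistic:

```r
## Sketch of the full-pipeline permutation test. pipeline_stat() is a
## hypothetical wrapper that refits the pipeline on the given data and
## returns the treatment test statistic; long-format columns id and arm
## are assumed.
permutation_p_value <- function(dat, B = 199) {
  first     <- !duplicated(dat$id)
  arm_by_id <- setNames(as.character(dat$arm[first]), dat$id[first])  # one label per subject
  t_obs <- pipeline_stat(dat)                                         # observed statistic
  t_perm <- replicate(B, {
    shuffled <- setNames(sample(arm_by_id), names(arm_by_id))         # shuffle labels across subjects
    dat$arm  <- unname(shuffled[as.character(dat$id)])
    pipeline_stat(dat)                                                # rerun the whole pipeline
  })
  (1 + sum(abs(t_perm) >= abs(t_obs))) / (B + 1)                      # add-one, two-sided p-value
}
```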

Three degradation conditions: clean (baseline DGP), jitter ±2 months (visit timing perturbation), and dropout +30% (excess attrition). All use the corrected DGP with within-class random effects (RI SD = 3.0, RS SD = 0.15).
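For concreteness, a hypothetical generator for one null dataset. Only the random-effect SDs, the ±2-month jitter range, and the +30% excess-dropout share come from the report; the visit schedule, class mixing, class slopes, outcome scale, and residual noise below are illustrative placeholders.

```r
## Hypothetical null-trial generator. Only ri_sd, rs_sd, the jitter range, and
## the excess-dropout share are taken from the report; everything else is a
## placeholder for illustration.
simulate_null_trial <- function(n_per_arm = 200,
                                visits = seq(0, 24, by = 3),  # assumed schedule (months)
                                ri_sd = 3.0, rs_sd = 0.15,
                                jitter = 0, extra_dropout = 0) {
  n     <- 2 * n_per_arm
  arm   <- rep(c("control", "treated"), each = n_per_arm)     # no effect under the null
  class <- sample(1:3, n, replace = TRUE)                     # latent class (assumed mixing)
  b0    <- rnorm(n, 0, ri_sd)                                 # within-class random intercept
  b1    <- rnorm(n, 0, rs_sd)                                 # within-class random slope
  slope <- c(-0.3, -0.8, -1.5)                                # assumed class mean slopes
  dat <- expand.grid(id = seq_len(n), visit = visits)
  dat$arm  <- arm[dat$id]
  dat$time <- dat$visit + runif(nrow(dat), -jitter, jitter)   # visit-timing perturbation
  dat$y    <- 45 + b0[dat$id] +
              (slope[class[dat$id]] + b1[dat$id]) * dat$time +
              rnorm(nrow(dat), 0, 2)                          # assumed residual noise
  ## crude excess attrition: some subjects stop contributing after a random visit
  drop_after <- ifelse(runif(n) < extra_dropout,
                       sample(visits[-1], n, replace = TRUE), max(visits))
  dat[dat$visit <= drop_after[dat$id], ]
}

## Example: the dropout +30% condition
dat <- simulate_null_trial(extra_dropout = 0.30)
```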

The Results
Condition          N Sims   LMM Parametric   LMM Permutation   LCMM Permutation
Clean              50       2%               2%                4%
Jitter ±2 months   50       10%              8%                8%
Dropout +30%       50       8%               8%                0%

Nominal Type I error: 5%. Each observed rate was compared to the nominal rate with an exact binomial test; none differs significantly from 5% (all p > 0.05).

Clean: All three methods at 2–4%, well within simulation noise of the nominal 5%. The pipeline is correctly calibrated when data quality is high.

Jitter ±2 months: Point estimates of 8–10% across all methods. The parametric LMM is highest at 10% (binomial p = 0.104, not significant). Permutation provides partial correction (10% → 8%) but does not eliminate the inflation entirely. Visit jitter disrupts the temporal structure that both the LMM and the LCMM depend on.
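That binomial check can be reproduced directly: 10% of 50 trials is 5 rejections, tested against the nominal 5% rate.

```r
## Exact binomial test for the jitter / parametric-LMM cell:
## 5 rejections out of 50 null trials vs. the nominal 5% rate.
binom.test(x = 5, n = 50, p = 0.05)   # two-sided p = 0.104 -> not significant
```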

Dropout +30%: The LCMM permutation test is ultra-conservative at 0% (zero false positives in 50 trials). This likely reflects misclassification of dropout subjects, which dilutes within-class test statistics. Both LMM methods show 8%, within simulation noise.
[Figure: Type I error rates with 95% Wilson confidence intervals, three conditions × three inference methods. 50 trials per cell, B = 199 permutations; the wide CIs reflect the limited simulation count, and none significantly exceeds 5%.]
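The Wilson intervals are the score intervals from base R's prop.test() without continuity correction; as an example, a 2-in-50 cell spans roughly 1% to 13%, comfortably covering the nominal 5%.

```r
## 95% Wilson score interval for one cell: 2 rejections in 50 null trials.
## prop.test() without continuity correction returns the Wilson interval.
prop.test(x = 2, n = 50, correct = FALSE)$conf.int   # roughly 0.011 to 0.135
```
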
The LMM Sanity Check

A separate question: does the LMM itself have correct Type I error when the world is simple?

Implementation         DGP                                      Type I Error   Verdict
R / lme4               Single-class (homogeneous)               7.5%           PASS
Python / statsmodels   Single-class (homogeneous)               13.5%          FAIL
R / lme4               Three-class null (with random effects)   6%             PASS

R/lme4 gives 7.5% on the homogeneous DGP, within simulation noise of the nominal 5%. On the three-class null with within-class random effects (the v2 DGP), it gives 6%, also nominal.

The 22% Type I error observed under the v1 DGP (without random effects) was an artifact: removing within-class random effects made all trajectory variation structural, leaving the LMM systematically misspecified. With the corrected DGP, the LMM is properly calibrated. Its limitation is power, not calibration.

Python/statsmodels gives 13.5% even on the easy case, attributable to fitting only a random intercept when the DGP includes both random intercepts and slopes. All confirmatory results use R/lme4.
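For reference, a minimal sketch of an lme4 specification whose random-effect structure matches the DGP (random intercept and slope per subject); the column names, the treatment-by-time fixed effect, and the use of lmerTest for p-values are assumptions rather than the pipeline's exact code.

```r
## Minimal lme4 sketch matching the DGP's random intercepts and slopes.
## Assumed long-format columns: y, time, arm (factor), id.
library(lmerTest)   # loads lme4 and adds Satterthwaite Pr(>|t|) to summary()

fit <- lmer(y ~ time * arm + (1 + time | id), data = dat)
summary(fit)$coefficients   # treatment signal: the time:arm row (name depends on factor coding)
```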

What This Means

Under clean data, the pipeline is well calibrated. All three inference methods maintain nominal Type I error (2–4%). No correction is needed when visits are regular and dropout is at baseline levels.

Visit jitter is the hardest condition for calibration. Irregular visit timing disrupts the temporal structure that both LMM and LCMM rely on. Permutation helps but doesn't fully resolve it; the 8–10% elevation across all methods suggests this is a fundamental challenge, not a pipeline-specific issue.

The LCMM trades sensitivity for safety under dropout. Zero false positives in 50 trials under excess dropout means the LCMM permutation test is conservative: it won't reject when it shouldn't, even at the cost of potentially missing real effects. For a confirmatory clinical trial, this is a defensible trade-off.

Permutation inference remains the recommended approach for the full pipeline. While it doesn't eliminate all calibration challenges under degradation, it provides the most defensible null distribution for a two-stage procedure where standard asymptotics may not apply.

The Complete Picture

With EXP-006, the simulation battery is complete:

6 experiments. ~14,650 simulated trials. Every angle tested.

Preprint v4 finalized. Next stop: arXiv stat.ME.

← EXP-005: Stress Test · EXP-007: Joint Model Test →
🦞 Code open source · Data open · Methods transparent