The board reviews the Two-Stage LCMM simulation and ANCOVA Bias Audit. ICL adopted over BIC. LCMM-Hard killed for confirmatory use. Co-primary testing with Holm correction. Two-track publication strategy locked.
Board. We executed. Three experiments since we last spoke.
EXP-002 answers your primary demand: the oracle haircut. Two-stage LCMM with pseudo-class draws recovers substantial power for class-specific effects — 37% at N=100, 67% at N=200, 95% at N=400. The oracle hits 97% at N=100. You pay roughly 2× in sample size to go from oracle to real pipeline. That's a haircut, not a decapitation.
But it's not clean. BIC selects K=4 in every single simulation. True K is 3. That's universal overfitting, and I don't know yet whether it hurts inference or just adds noise. Kael, I need your verdict.
EXP-003 settles the ANCOVA debate. Under strict MAR — zero informative dropout — ANCOVA still inflates the treatment effect by 10×. ANCOVA on 12-month survivors is even worse: 13×. The bias is structural. It's an estimand mismatch: ANCOVA conditions on survival to the endpoint, which creates collider bias when survival correlates with treatment response. MNAR makes it slightly worse (10× → 12×) but the bulk is there under MAR. Voss, this confirms your ICH E9(R1) framing. The standard endpoint silently targets a survivor average estimand.
The pre-registration is timestamped on GitHub. Methods, estimands, sensitivity analyses — all locked before we touch PRO-ACT data. Still waiting on access.
Now I need guidance on six things:
The 9.5% Type I error for LCMM-Hard isn't a "flag," it's a fatal flaw for a confirmatory method. A procedure that's wrong nearly 1 in 10 times under the null is statistically invalid. This result confirms that treating estimated classes as ground truth without correcting for classification uncertainty is malpractice. Of the two, LCMM-Soft's conservatism is the only acceptable option, and my previous demand for a full permutation test as the final arbiter is now non-negotiable. The permutation must re-run the entire pipeline, including class selection, on each shuffle to properly account for all sources of uncertainty. Without this, any p-value from this pipeline is fiction.
The BIC overfitting isn't surprising, but it's a problem we must solve algorithmically, not by post-hoc hand-wringing. Voss is right that ICL is a more appropriate criterion, as it directly penalizes the model for poor class separation. Our pre-registration must be amended to specify ICL as the primary selector for K, alongside strict, non-negotiable thresholds for minimum class proportion and mean posterior probability. We don't guess K; we define a rigorous, automated procedure for selecting it and accept the result. If this procedure fails to find separable classes in the PRO-ACT data, then our hypothesis is weakened, and we report that.
The question of a "decision rule" for subgrouping is answered by standard trial methodology. You don't get to choose your analysis after seeing the data. I agree with the gatekeeping strategy Voss outlined. Our primary analysis must be on the overall population using an estimand that respects intercurrent events, which we've already defined as the joint longitudinal-survival model. The LCMM-Soft analysis should be a pre-specified secondary endpoint to test for heterogeneity, contingent on the primary passing. This is the only way to control the family-wise error rate and produce a result that a regulator would take seriously. Anything else is just exploratory data analysis disguised as a trial.
Finally, on publication. A simulation-only paper is a methods paper. It will be read by statisticians. Our goal is to change how ALS trials are analyzed. To do that, we must show that the premise of our simulation—the existence of distinct, separable trajectory classes—is not a fantasy but an empirical reality in the largest observational dataset available. The simulations provide the "what if," but the PRO-ACT analysis provides the "so what." We wait for the data. Bundle the simulation results with the empirical PRO-ACT findings. That's the paper that has a chance of making a difference. Anything less is an academic exercise we don't have time for.
Luvi, your EXP-002 and EXP-003 results are a solid gut punch to the status quo, but let's not pat ourselves on the back yet. Everyone's fixating on BIC's overfitting as a crisis—Voss wants more rules like ICL and mergers, Kael calls it a problem to solve algorithmically—but what if we're overreacting? Overfitting to K=4 when the true K is 3 might just add harmless noise, diluting power slightly without biasing estimates, especially if the extra class is a minor split of a real one. The blind spot here is assuming harm without evidence; let's stress-test it in my assigned "bad data" simulations—irregular visits, high rater noise, survival gradients—and see if inference collapses or stays robust. If it does hurt, fine, pivot to ICL, but pre-specifying yet another layer of thresholds risks turning our pipeline into an over-engineered mess that no trialist will touch. Challenge: prove the overfitting is fatal before complicating things further.
On LCMM-Hard's 9.5% Type I inflation, Kael's declaring it "fatal" and demanding full-pipeline permutations feels like statistical puritanism. In the messy world of ALS trials, where dropout is rampant and assumptions are always violated, a slight inflation might be the lesser evil compared to LCMM-Soft's power-sucking conservatism—especially if our hypothesis is that subpopulations are where effects hide. What if we embrace Hard as exploratory and use permutations selectively, not as a blanket requirement? Voss's gatekeeping hierarchy is prudent for regulators, but it buries our core adversarial angle: if uniform effects dominate, LMM wins anyway, so why not make subgrouping primary and demote overall tests to sensitivity? The uniform scenario weakness isn't a flaw—it's a feature, forcing us to confront when heterogeneity matters. For the decision rule, pre-specify a simple fork: if LCMM separability exceeds our threshold (e.g., median entropy <0.7), proceed to subgroup tests; else, fall back to overall. That avoids multiplicity while betting on our premise.
Publication-wise, Kael's insistence on bundling with PRO-ACT to "make a difference" assumes empirical data is the only credibility currency— but what if controversy is the amplifier? Drop the simulation-only methods paper now as a preprint, frame it provocatively as "why your ALS trials are mathematically doomed," and invite those KOLs I suggested to publicly debunk it. That stirs debate, draws advocacy eyes, and pressures PRO-ACT access. Waiting bundles us into irrelevance; adversarial science thrives on friction, not polished perfection. For the PRO-ACT pipeline, start with the descriptive atlas as Voss says, but then simulate trial effects directly on the empirical classes—test our two-stage under real noise, not just toys. Cipher, your analytical ANCOVA derivation is overdue; simplify to two classes and show how the MAR bias emerges from selection alone, or admit the empirics outpaced the math.
The board's discussion has brought up several critical points and I agree with many of the suggestions, but I would like to add a few additional considerations and references to support and challenge some of the ideas.
Sable's point about overfitting potentially adding harmless noise is worth considering, but we should also be cautious. Overfitting can indeed lead to dilution of power, but it can also introduce bias if the extra class is not a minor split but rather a spurious one. The paper by Biernacki, Celeux, and Govaert (2000) on assessing the number of clusters in a mixture model might provide some insights. They discuss the use of ICL (Integrated Completed Likelihood) as a more robust criterion for model selection, which Voss and Kael also support. ICL adds an entropy penalty for poor class separation on top of BIC's complexity penalty, which can help curb the over-splitting we're seeing.
Kael's argument that a 9.5% Type I error rate is a "fatal flaw" is strong, and I agree that it is a significant concern. However, Sable's point about the real-world messiness of ALS trials is also valid. A slight inflation might be acceptable in some contexts, but it's a slippery slope. The paper by McLachlan and Peel (2000) on mixture models discusses the importance of accounting for classification uncertainty, which reinforces the need for the full-pipeline permutation test.
Voss's gatekeeping hierarchy is a sound approach, and it aligns well with regulatory standards. However, Sable's suggestion of a simple fork (subgroup if separability exceeds a threshold, else overall) is also worth considering. This approach is more in line with the adversarial science ethos, as it directly confronts the issue of when heterogeneity matters. The paper by Gail and Simon (1985) on testing for qualitative interactions between treatment effects and patient subsets might provide some guidance on how to balance these considerations.
Cipher's task of deriving the ANCOVA bias analytically is crucial, and I agree with Voss's mechanistic explanation. The bias arises from the selection/estimand mismatch, and a simplified two-class system can help illustrate this. The paper by Li, Mehrotra, and Barnard (2000) on the analysis of incomplete longitudinal data might provide a useful framework for deriving the closed-form expression.
Voss's suggestion to start with a descriptive atlas and then move to realistic simulations is a good approach. This allows us to validate our methods on real data without prematurely jumping into treatment-effect estimation. The review by Proust-Lima, Sene, Taylor, and Jacqmin-Gadda (2014) on joint latent class models for longitudinal and time-to-event data might be useful for handling the dropout-related complexities of PRO-ACT data.
While Kael's insistence on bundling with PRO-ACT data is understandable, Sable's suggestion to publish the simulation-only methods paper as a preprint to stir debate and draw attention is also compelling. The paper by Gelman and Loken (2013) on "The Garden of Forking Paths" discusses the importance of transparency and pre-registration in scientific research, which aligns with our adversarial science approach. Publishing the simulation results now, while waiting for PRO-ACT data, can help build momentum and pressure for access.
In summary, I support the use of ICL for K-selection, the full-pipeline permutation test for LCMM-Hard, and a gatekeeping hierarchy for subgrouping. The simulation-only methods paper can be published as a preprint to generate debate, while we wait for PRO-ACT data to validate our findings. I also recommend referencing the literature on mixture models, joint longitudinal-survival models, and the analysis of incomplete data to further support our approach.
The central statistical issue is parameter uncertainty propagation. Let's formalize the problems.
K-Selection: BIC's overfitting isn't about one extra class; it's about model mis-specification in the presence of likelihood ridges. In a three-class mixture with Gaussian trajectories, the likelihood surface near the optimum often has directions where splitting one component yields negligible likelihood change but improves "fit" enough for BIC to prefer K=4. The consequence isn't just noise—it introduces label switching instability across bootstrap samples, which directly inflates variance in the two-stage estimator. ICL (BIC + entropy penalty) directly penalizes poor separation, aligning better with our goal of finding distinct trajectory phenotypes. We should pre-specify ICL with K_max=5, minimum class proportion >5%, and require median posterior probability >0.70. If ICL still overfits, we can add a post-hoc merge step based on Mahalanobis distance between class-mean trajectory parameters (threshold δ). This is algorithmically specifiable.
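As a minimal sketch of what "algorithmically specifiable" means here, using scikit-learn's GaussianMixture as a cross-sectional stand-in for the LCMM (a simplification that ignores the longitudinal structure; the thresholds are the ones proposed above), the selector could look like this:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_icl(X, k_max=5, min_prop=0.05, min_median_post=0.70, seed=0):
    """Pick K by ICL (BIC + 2 * classification entropy), subject to minimum
    class proportion and median posterior-probability gates.
    GaussianMixture is a cross-sectional stand-in for the LCMM here."""
    results = []
    for k in range(1, k_max + 1):
        gm = GaussianMixture(n_components=k, n_init=10, random_state=seed).fit(X)
        tau = gm.predict_proba(X)                       # n x k posterior memberships
        entropy = -np.sum(tau * np.log(np.clip(tau, 1e-12, None)))
        icl = gm.bic(X) + 2.0 * entropy                 # lower is better
        labels = tau.argmax(axis=1)
        props = np.bincount(labels, minlength=k) / len(X)
        median_post = np.median(tau.max(axis=1))
        ok = (props.min() > min_prop) and (median_post > min_median_post)
        results.append((k, icl, ok))
    admissible = [r for r in results if r[2]] or results  # fall back if none pass gates
    return min(admissible, key=lambda r: r[1])[0]

# toy check: three well-separated 1-D "slope" distributions
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2.0, 0.4, 150),
                    rng.normal(0.0, 0.4, 150),
                    rng.normal(2.0, 0.4, 150)]).reshape(-1, 1)
print("ICL-selected K:", select_k_icl(X))
```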
Type I Inflation: The 9.5% for LCMM-Hard is analytically predictable. When you run EM, get MAP assignments, then treat them as fixed in a Wald test, you condition on the estimated partition. The variance of the treatment effect estimate is \( \hat{V} = V(\hat{\beta} | \hat{Z}) \), ignoring \( V(\hat{Z}) \). This under-estimates the true variance, leading to anti-conservative p-values. LCMM-Soft with pseudo-class draws approximates \( V(\hat{\beta}) = E[V(\hat{\beta} | Z)] + V[E(\hat{\beta} | Z)] \) via Rubin's rules, capturing classification uncertainty. That's why it's conservative. The permutation test Kael demands must indeed re-run the entire pipeline (EM + class selection + testing) on each shuffle to approximate the true null distribution. This is computationally heavy but necessary for valid inference if we want to claim Type I error control.
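For concreteness, a minimal sketch of the pseudo-class-draw and Rubin's-rules combination, assuming we already have the n×K posterior matrix, a per-subject outcome summary, and a treatment indicator; a simple within-class difference in means stands in for the real class-specific model:

```python
import numpy as np

def soft_class_effect(y, trt, tau, target_class, n_draws=50, seed=0):
    """Pseudo-class draws + Rubin's rules for a class-specific treatment effect.
    y: per-subject outcome summary (e.g., estimated 12-month slope); trt: 0/1 arm;
    tau: n x K posterior class-membership probabilities from the fitted LCMM."""
    rng = np.random.default_rng(seed)
    n, K = tau.shape
    est_list, var_list = [], []
    for _ in range(n_draws):
        # draw a class for each subject from its posterior membership probabilities
        z = np.array([rng.choice(K, p=tau[i]) for i in range(n)])
        in_c = z == target_class
        y1, y0 = y[in_c & (trt == 1)], y[in_c & (trt == 0)]
        if len(y1) < 2 or len(y0) < 2:
            continue  # skip degenerate draws with too few subjects in the class
        est_list.append(y1.mean() - y0.mean())                              # effect
        var_list.append(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    m = len(est_list)
    beta_bar = np.mean(est_list)
    W = np.mean(var_list)                  # average within-draw variance
    B = np.var(est_list, ddof=1)           # between-draw variance
    total_var = W + (1 + 1 / m) * B        # Rubin's rules total variance
    return beta_bar, np.sqrt(total_var)
```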
ANCOVA Bias Derivation: For a simplified two-class system (Slow, Fast) with class proportions \( \pi_s, \pi_f \), class-specific mean slopes \( \mu_s, \mu_f \), and class-specific probabilities of being observed at 12 months \( p_s, p_f \) (same in both arms under MAR). Treatment affects slope only in Slow class by \( \delta \). The ANCOVA estimand is \( E[\Delta Y | S=1, A=1] - E[\Delta Y | S=1, A=0] \), where \( S=1 \) indicates survival to 12 months. Under MAR, \( p_s, p_f \) are unchanged by treatment. Then: \[ E[\Delta Y | S=1, A=a] = \frac{\pi_s p_s (\mu_s + a\delta) + \pi_f p_f \mu_f}{\pi_s p_s + \pi_f p_f}. \] The difference is: \[ \theta_{ANCOVA} = \frac{\pi_s p_s \delta}{\pi_s p_s + \pi_f p_f}. \] The true marginal treatment effect is \( \theta_{true} = \pi_s \delta \). The bias ratio is: \[ \frac{\theta_{ANCOVA}}{\theta_{true}} = \frac{p_s}{\pi_s p_s + \pi_f p_f}. \] If \( p_s > p_f \) (slow progressors more likely to survive), the ratio >1, inflating the estimate. In our DGP, \( p_s=0.8, p_f=0.2, \pi_s=0.5 \), giving ratio = \( 0.8/(0.5*0.8 + 0.5*0.2) = 1.6 \). Our simulation shows ~10× because we have three classes with more extreme survival differences; the formula generalizes to a weighted sum. This confirms the bias is purely from conditioning on survivors, not MNAR.
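The arithmetic checks out; a short verification of the 1.6× ratio from the stated DGP values:

```python
pi_s, pi_f = 0.5, 0.5        # class proportions (Slow, Fast)
p_s, p_f = 0.8, 0.2          # P(observed/alive at 12 months) per class
delta = 1.0                  # treatment effect on slope, Slow class only

theta_ancova = pi_s * p_s * delta / (pi_s * p_s + pi_f * p_f)  # survivor-average
theta_true = pi_s * delta                                      # marginal estimand
print(theta_ancova / theta_true)                               # -> 1.6
```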
Decision Rule: We cannot choose to subgroup based on the treatment effect. A pre-specified rule must use baseline data only. Options: (1) Gatekeeping as Voss/Kael outlined: test overall effect first at α, then test heterogeneity. (2) Pre-specify a separability threshold (e.g., entropy <0.7) from a baseline-only class prediction model (fit on PRO-ACT). If threshold met, use LCMM-Soft as co-primary with Bonferroni adjustment; else, use overall test. This uses baseline information to decide analysis strategy without peeking at outcomes.
Publication: I agree with Kael—the simulation alone is a methods contribution, but the disease impact requires PRO-ACT empirical validation. However, we can structure the work as two linked papers: a methodological preprint now (with full derivation of ANCOVA bias and Type I control results), and a clinical application paper later with PRO-ACT. This maintains momentum while upholding scientific rigor.
Clear consensus on several points, genuine disagreement on others. Good.
Where we agree: ICL over BIC for class selection. LCMM-Hard is dead for confirmatory use — 9.5% Type I error confirmed by Cipher's analytical argument and Kael's verdict. LCMM-Soft is the inferential tool, permutation test is mandatory. Pre-registration needs amendment.
Where we disagree: publication timing and the subgrouping decision rule.
Let me push on each:
On K-selection, I’d avoid introducing a “Mahalanobis merge δ” unless you can show, in simulation, that (a) it improves recovery of the true class structure and (b) it preserves Type I error when embedded inside the full inferential pipeline. Otherwise it reads as an ad hoc patch to mixture non-regularity. ICL + minimum class proportion + posterior-probability/entropy thresholds are defensible because they are principled separation criteria (and directly aligned with your “trajectory phenotypes” concept). If PRO-ACT yields K=2 under ICL, you report it; that does not invalidate the simulations—rather it constrains the plausible separability regime in real ALS data and becomes part of the story (your own “kill switch” logic). What would be problematic is silently switching selectors until you get the K you expected.
On inference, I would not relax the “full-pipeline permutation” requirement for any claim that aims to be confirmatory, because model selection + latent classification are exactly where asymptotics misbehave. A parametric bootstrap can be acceptable if the DGP is credibly specified (which is harder than it sounds in ALS, given visit irregularity, non-Gaussian measurement error, and informative intercurrent events). Practically: keep permutation counts modest but pre-specified; for a two-sided α=0.05 test, B≈999 is the conventional minimum for p-value resolution (~0.001), but you can justify B=499 for development and B=999 for final reported analyses, with a stability check (repeat seeds). If compute is the bottleneck, reduce elsewhere (fewer K candidates, fewer random starts with a locked strategy), not by weakening the null calibration.
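A minimal skeleton of what I mean by pre-specified permutation counts with the add-one correction; run_pipeline is a placeholder for the full re-fit (ICL-based K selection, EM starts, LCMM-Soft test):

```python
import numpy as np

def permutation_pvalue(run_pipeline, data, trt, B=999, seed=0):
    """Full-pipeline permutation p-value with the add-one correction.
    run_pipeline(data, trt) is a placeholder: it must re-run everything
    (K selection, EM, pseudo-class test) and return a scalar test statistic."""
    rng = np.random.default_rng(seed)
    t_obs = run_pipeline(data, trt)
    t_perm = np.empty(B)
    for b in range(B):
        t_perm[b] = run_pipeline(data, rng.permutation(trt))   # shuffle arm labels
    # one-sided; use abs() of the statistic on both sides for a two-sided test
    return (1 + np.sum(t_perm >= t_obs)) / (B + 1)

# stability check: repeat with different permutation streams (seeds) and compare
# pvals = [permutation_pvalue(run_pipeline, data, trt, B=499, seed=s) for s in range(3)]
```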
On gatekeeping vs fork, I share your concern: strict gatekeeping can indeed “lock out” heterogeneity detection when the global average is diluted. The clean regulatory-compliant compromise is co-primary (or dual) testing with strong family-wise error control: pre-specify an overall test (joint longitudinal–survival for the treatment-policy estimand) and a heterogeneity-sensitive test (LCMM-Soft class×treatment interaction or class-specific contrast), then control α via Holm (or Bonferroni if you want maximal simplicity). That preserves the ability to detect a subgroup-only benefit without outcome-driven branching. I’m comfortable with the two-track publication plan, provided the methods preprint is explicit that clinical impact hinges on empirical separability in PRO-ACT and that mixture-defined classes are statistical constructs, not biology.
Finally, on the ANCOVA bias generalization: the K-class extension is straightforward and should be stated explicitly in the paper because it clarifies that the “bias” is an estimand shift. Let classes be \(k=1,\dots,K\) with proportions \(\pi_k\), 12-month observation/survival probabilities \(p_k\), and class-specific mean change \(\mu_k\); treatment adds \(\delta_k\) to class \(k\). Then the survivor-average estimand targeted by 12-month change is \[ \theta_{\text{surv}}=\sum_{k=1}^K w_k\,\delta_k,\quad w_k=\frac{\pi_k p_k}{\sum_j \pi_j p_j}, \] whereas the marginal (treatment-policy) mean-change contrast is \(\theta_{\text{marg}}=\sum_k \pi_k \delta_k\). The “inflation” arises when treatment effects concentrate in classes with high \(p_k\) (slow progressors), so \(w_k\gg \pi_k\). That’s the analytic lever you should use to explain how extreme \(p_k\) separation (and/or concentrating \(\delta_k\) in the highest-\(p_k\) class) can generate very large ratios—without invoking MNAR. On the PRO-ACT pipeline, I’d add two guardrails: (i) a stability analysis for class solutions (multiple starts, split-sample/bootstrapped concordance of assignments and mean curves), and (ii) explicit handling of time-origin and baseline definition (symptom onset vs diagnosis vs trial baseline) because mixture structure is highly sensitive to alignment in ALS.
On the permutation test: your concern about computational cost is a practical problem, not a statistical one. The 9.5% Type I error is proof that shortcuts like asymptotic corrections fail. They don't properly account for the uncertainty introduced by the model selection step (choosing K) and the classification step. A parametric bootstrap is no better; it assumes the model is true, which is exactly what we are trying to validate. The non-parametric permutation test is the only method that empirically derives the true null distribution of our entire, complex pipeline. It is the price of rigor. I will not sign off on a primary claim that uses a method we know to be anti-conservative. We will use the full permutation test. Pre-specify B=999 for the final analysis and accept the compute cost.
Regarding K-selection, I agree with Voss. Adding a Mahalanobis merge step introduces another tuning parameter and another degree of freedom for an analyst to manipulate. We will stick to the pre-specified ICL criterion with minimum class size and posterior probability thresholds. If that procedure yields K=2 on PRO-ACT, then our conclusion is that the PRO-ACT data supports a two-class model. That's the result. It doesn't invalidate the simulation; it informs its relevance. The simulation shows what happens in a world with three separable classes; the empirical analysis tells us what world we're actually in. Your pipeline's "kill switch" needs an explicit follow-on: if entropy is low, we report that the data do not support separable classes, and the pre-specified overall test (e.g., joint model) stands alone as the primary result.
The gatekeeping debate has a clean answer. A simple fork based on an intermediate outcome is invalid. Strict gatekeeping is too conservative and risks missing the effect we're looking for. The correct approach is a pre-specified combination test or a co-primary analysis with multiplicity control like the Holm-Bonferroni method. This allows for a claim on either the overall effect or the subgroup effect while rigorously controlling the family-wise error rate. This is a standard, defensible approach that solves your power concern without resorting to data-dependent analysis choices. We will pre-specify this.
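For the record, the Holm step-down we would pre-specify is simple enough to state exactly (the two p-values below are purely illustrative):

```python
def holm(pvals, alpha=0.05):
    """Holm step-down: returns a reject (True) / retain (False) flag per hypothesis."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # ascending p-values
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                                      # stop at first non-rejection
    return reject

# e.g., [overall joint-model p, LCMM-Soft heterogeneity p] (values illustrative)
print(holm([0.030, 0.012]))   # -> [True, True]: 0.012 <= 0.025, then 0.030 <= 0.05
```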
I will consent to the two-track publication strategy on one condition: the methods preprint is framed with extreme precision. It must state unequivocally that it is a simulation study demonstrating the mathematical properties and potential pitfalls of common methods under a specific set of assumptions. It must emphasize that the clinical relevance is entirely contingent on the empirical findings from PRO-ACT. We are publishing a statistical argument, not a clinical one. The PRO-ACT paper will be the clinical argument. If that line is blurred, I will withdraw my support.
Luvi, the board's piling on with more rules and caveats—ICL, merges, full permutations, Holm-Bonferroni—feels like we're engineering a fortress around our pipeline to appease imagined regulators, but what if that's the blind spot? Voss and Kael are right that overfitting could introduce instability, but their fixes risk turning a flexible tool into a rigid beast that no one in the ALS community will adopt. Cipher's Mahalanobis merge idea is intriguing, but demanding simulation proof of Type I control before adding it assumes harm we haven't quantified—let's flip that: run my stress-tests first with the overfitted K=4, measure actual bias and power loss, and only complicate if it breaks. If PRO-ACT spits out K=2 under ICL, great, report it; that doesn't "invalidate" our sims, it sharpens them by exposing how real data tempers our adversarial premise. And on permutations, Kael's puritanical B=999 mandate ignores the compute reality—ALS trials aren't infinite-resource labs. What if we test a hybrid: asymptotic Rubin variance for screening, full perm only for borderline p-values? Challenge: prove the full thing is feasible before mandating it, or we're just posturing.
The gatekeeping vs. fork debate exposes a deeper tension: Voss and Kael want regulatory safety nets to control error rates, but that buries our core hypothesis—that standard methods blind us to subpopulation effects. Strict gatekeeping or even Holm co-primaries still dilute power for class-specific scenarios (your EXP-001 showed LMM needs 4× samples there), forcing us to pray the overall test passes first. My fork—pre-specify a baseline separability threshold (entropy <0.7 from PRO-ACT-derived predictor)—bets directly on heterogeneity without multiplicity inflation, using only non-outcome data to decide. It's adversarial: if separability holds, we prioritize subgroups; if not, fallback to overall. Voss, your combination test is a compromise, but it still weights the uniform-case LMM too heavily—what if we invert it, making subgroup primary and overall the gatekept sensitivity? That aligns with our "quantify the cost of ignoring nonlinearity" angle without hiding behind conservatism.
Publication two-track is the smart play—drop the methods preprint now, laced with Cipher's crisp derivation (generalize it to K classes pronto; that 10× inflation in 3-class is the money shot showing estimand mismatch as the villain, not just MNAR). Kael's framing caveats are fine, but let's not neuter it: provoke with "ALS trials are failing because your endpoints are mathematically rigged." That draws KOL fire, amplifies reach via controversy, and pressures PRO-ACT access. For the pipeline, add my bad-data stress: simulate on PRO-ACT classes with extreme noise and irregularity to test if our ICL + Soft setup survives real mess. Blind spot: everyone's assuming PRO-ACT's "gold standard" status—what if its trial heterogeneity biases the classes toward artifacts?
The board's discussion has been thorough and has brought up a range of critical points. I'll add a few additional considerations and references to support and challenge some of the ideas, and I'll also address the specific points Luvi has raised.
ICL and Merge Step:
- ICL (Integrated Completed Likelihood) is indeed a more robust criterion for model selection, as it adds an entropy penalty for poor class separation on top of BIC's complexity penalty. This can help mitigate overfitting, which is a critical concern. However, adding a Mahalanobis merge step, as suggested by Cipher, introduces an additional layer of complexity. While it is a valid approach, it's essential to evaluate its impact on the overall model and its robustness. The paper by Biernacki, Celeux, and Govaert (2000) on assessing the number of clusters in a mixture model can provide insights into the use of ICL and other criteria.
- Mahalanobis Merge Threshold (δ): The Mahalanobis distance can be used to merge classes if they are too similar. The threshold \( \delta \) can be set based on the Mahalanobis distance between the mean trajectories of the classes. A common approach is to use a fixed threshold, such as 1 or 2, but this should be validated through simulation studies. The exact formula for the Mahalanobis distance between two classes \( k_1 \) and \( k_2 \) with mean trajectories \( \mu_{k_1} \) and \( \mu_{k_2} \) and covariance matrix \( \Sigma \) is:
\[ D^2 = (\mu_{k_1} - \mu_{k_2})^\top \Sigma^{-1} (\mu_{k_1} - \mu_{k_2}) \]
If \( D^2 < \delta \), the classes are merged. The choice of \( \delta \) should be pre-specified and justified based on simulations; a minimal computational sketch follows.
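A minimal computational sketch of the merge rule above, for illustration only; the class-mean vectors and pooled covariance are placeholders:

```python
import numpy as np

def mahalanobis_sq(mu_a, mu_b, sigma):
    """Squared Mahalanobis distance between two class-mean parameter vectors."""
    d = np.asarray(mu_a) - np.asarray(mu_b)
    return float(d @ np.linalg.solve(sigma, d))

# illustrative values: (intercept, slope) means for two candidate classes
mu_1, mu_2 = [40.0, -0.8], [39.5, -1.0]
sigma = np.array([[4.0, 0.1], [0.1, 0.05]])       # pooled covariance (placeholder)
merge = mahalanobis_sq(mu_1, mu_2, sigma) < 2.0   # delta = 2 as in the text
```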
Full-Pipeline Permutation:
- Computational Feasibility: The full-pipeline permutation test is necessary to control Type I error, as it accounts for the uncertainty in the model selection and classification steps. However, the computational cost is a genuine concern. Pre-specifying a lower number of permutations (e.g., B=499) for initial analysis and a higher number (e.g., B=999) for final reported results, with a stability check, is a practical approach. The stability check can be done by repeating the permutation test with different random seeds.
- Parametric Bootstrap: A parametric bootstrap can be an alternative, but it assumes the model is correctly specified, which may not be the case in ALS. The non-parametric permutation test is more robust and should be the primary method. However, if computational resources are severely limited, a hybrid approach where asymptotic corrections (e.g., Louis information + Rubin variance) are used for screening and full permutations are used for borderline p-values can be considered. This should be pre-specified and justified.
Gatekeeping and Fork:
- Gatekeeping (Overall Test First, Then Heterogeneity): This approach is conservative and aligns with regulatory standards. It ensures that the overall effect is detected before testing for heterogeneity. However, it may underpower the detection of class-specific effects, as shown in EXP-001.
- Fork (Pre-Specified Separability Threshold): This approach is more in line with the adversarial science ethos, as it directly confronts when heterogeneity matters. The paper by Gail and Simon (1985) on testing for qualitative treatment-by-subgroup interactions can provide guidance on how to balance these considerations.
- Co-Primary with Multiplicity Control: A co-primary approach with a multiplicity adjustment such as the Holm-Bonferroni method can be a compromise. It allows for a claim on either the overall effect or the subgroup effect while rigorously controlling the family-wise error rate. This approach can be pre-specified and justified.
Two-Track Publication:
- Methods Preprint and PRO-ACT Paper: Publishing the simulation-only methods paper as a preprint is a good strategy to generate debate and draw attention. The preprint should be framed as a methods contribution, highlighting the mathematical properties and potential pitfalls of common methods under specific assumptions. The clinical relevance should be clearly stated as contingent on the empirical findings from PRO-ACT.
- Provocative Framing: The preprint can be framed provocatively, as suggested by Sable, to stir debate and draw attention. A title like "Why Your ALS Trials Are Mathematically Doomed" can be effective in generating interest and pressure for PRO-ACT access.
Descriptive and Sensitivity Analyses:
- Step 1-5: The steps you outlined for the PRO-ACT pipeline are sound. Here are a few additional considerations:
  - Stability Analysis: Add a stability analysis for class solutions, as suggested by Voss. This can be done through multiple starts, split-sample, or bootstrapped concordance of assignments and mean curves.
  - Time-Origin and Baseline Definition: Explicitly handle the time-origin and baseline definition (symptom onset vs. diagnosis vs. trial baseline) because the mixture structure is highly sensitive to alignment in ALS.
  - Bad-Data Stress-Tests: Include Sable's bad-data stress-tests (irregular visits, high rater noise, survival gradients) to test the robustness of the ICL + LCMM-Soft setup in real-world conditions.
Generalization to K Classes:
- Generalization: The bias in ANCOVA can be generalized to K classes. The survivor-average estimand targeted by 12-month change is:
\[ \theta_{\text{surv}} = \sum_{k=1}^K w_k \delta_k, \quad w_k = \frac{\pi_k p_k}{\sum_j \pi_j p_j} \]
where \( \pi_k \) is the class proportion, \( p_k \) is the 12-month observation/survival probability, and \( \delta_k \) is the treatment effect in class \( k \). The marginal (treatment-policy) mean-change contrast is:
\[ \theta_{\text{marg}} = \sum_k \pi_k \delta_k \]
The "inflation" arises when treatment effects concentrate in classes with high \( p_k \) (slow progressors), so \( w_k \gg \pi_k \). This is the analytic lever that explains the large inflation ratios in the 3-class DGP.
In summary, I support the use of ICL for K-selection, the full-pipeline permutation test, and a combination test for the gatekeeping vs. fork debate. The simulation-only methods paper can be published as a preprint to generate debate, while we wait for PRO-ACT data to validate our findings. The PRO-ACT pipeline should include stability and bad-data stress-tests to ensure robustness.
On K-selection and overfitting: Sable's point about quantifying harm before overcomplicating is valid. The risk of an overfitted K=4 is not necessarily bias, but increased variance in class assignments and trajectory estimates, which propagates to the treatment effect estimate. This inflates the variance of the two-stage estimator, reducing power. We can test this directly: simulate the DGP with true K=3, fit LCMM with BIC (forcing K=4) and ICL (selecting K), and compare the variance of the class-specific treatment effect estimates. If the variance under BIC is substantially larger, the harm is proven. The Mahalanobis merge is a post-hoc fix for a symptom; ICL addresses the cause. We should pre-specify ICL with K_max=5, minimum class proportion >5%, and median posterior probability >0.70. If ICL selects K=2 on PRO-ACT, we report it. Our simulation DGP isn't invalidated; it simply defines a parameter space (separable classes) where our method works. The real data tells us where we are in that space.
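A sketch of the selection half of that check, using a cross-sectional Gaussian mixture (scikit-learn) as a simplified stand-in for the longitudinal DGP and tallying how often BIC versus ICL recover the true K=3; the downstream variance comparison would sit on top of this skeleton:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def classification_entropy(tau):
    return -np.sum(tau * np.log(np.clip(tau, 1e-12, None)))

def selected_k(X, criterion, k_max=5, seed=0):
    """Return the K minimizing BIC or ICL (= BIC + 2 * classification entropy)."""
    scores = []
    for k in range(1, k_max + 1):
        gm = GaussianMixture(n_components=k, n_init=5, random_state=seed).fit(X)
        score = gm.bic(X)
        if criterion == "icl":
            score += 2.0 * classification_entropy(gm.predict_proba(X))
        scores.append(score)
    return int(np.argmin(scores)) + 1

rng = np.random.default_rng(7)
picks = {"bic": [], "icl": []}
for rep in range(50):
    # 3-class "slope" mixture with moderate overlap (true K = 3)
    X = np.concatenate([rng.normal(-1.2, 0.5, 80),
                        rng.normal(0.0, 0.5, 80),
                        rng.normal(1.2, 0.5, 80)]).reshape(-1, 1)
    for crit in picks:
        picks[crit].append(selected_k(X, crit, seed=rep))
for crit, ks in picks.items():
    print(crit, "P(K_hat = 3) =", np.mean(np.array(ks) == 3))
```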
On permutation tests and computation: Kael is correct that asymptotic corrections fail to capture the full uncertainty from model selection. However, a full permutation test (B=999) re-running EM for each shuffle is computationally prohibitive for exploration. A pragmatic compromise: use a parametric bootstrap that respects the estimated mixture structure. For each bootstrap sample, we (1) resample subjects with replacement, (2) refit the entire LCMM pipeline (ICL selection, EM), (3) estimate treatment effects. This captures the joint uncertainty of class selection and parameter estimation under the estimated model. It's computationally intensive but parallelizable. We can validate its Type I error control against the permutation test in a subset of simulations. If it matches, we adopt it for the final PRO-ACT analysis. Pre-specify B=500 for development, B=1000 for final.
On the subgroup decision rule: The gatekeeping hierarchy (overall test first) is statistically pure but clinically flawed, as it surrenders power where we need it most. Sable's fork based on baseline separability is more aligned with our adversarial angle, but it uses an intermediate outcome (class separability) that is estimated from the data. However, if we estimate separability using baseline covariates only (e.g., a model predicting class from baseline ALSFRS-R, age, site of onset), this is outcome-free and can be pre-specified. We can set a threshold (e.g., cross-validated AUC > 0.7) to decide whether to prioritize the LCMM-Soft test. This uses only pre-randomization information, avoiding outcome-driven multiplicity. We should pre-specify this rule in the analysis plan.
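A minimal sketch of that outcome-free gate; the binary class label (e.g., fast progressor vs. other) and the baseline covariate matrix are assumed inputs from the PRO-ACT fit, and the AUC > 0.7 threshold mirrors the proposal above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def separability_gate(X_baseline, class_label, threshold=0.70, cv=5):
    """Outcome-free gate: cross-validated AUC of a baseline-only class predictor.
    class_label: binary indicator (e.g., fast progressor vs. other) from the
    PRO-ACT class solution; X_baseline: pre-randomization covariates only."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    auc = cross_val_score(model, X_baseline, class_label,
                          cv=cv, scoring="roc_auc").mean()
    branch = "co-primary LCMM-Soft" if auc > threshold else "overall test only"
    return branch, auc
```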
On the ANCOVA bias generalization: Voss has provided the correct K-class generalization. The survivor-average estimand is a weighted average of class-specific effects, with weights \(w_k = \frac{\pi_k p_k}{\sum_j \pi_j p_j}\). The marginal estimand uses weights \(\pi_k\). The inflation ratio for a treatment effect concentrated in class \(c\) is \(\frac{w_c}{\pi_c} = \frac{p_c}{\sum_j \pi_j p_j}\), which is bounded above by \(1/\pi_c\). With extreme \(p\) values (e.g., 0.9, 0.5, 0.1) the ratio is only about 1.8× under equal class proportions, but it climbs toward 5–10× when the high-survival class carrying the effect is a small minority and low-survival classes dominate. This is a clean, general formula that should be the centerpiece of the methods paper.
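Numerically, with the survival probabilities above and hypothetical class proportions:

```python
import numpy as np

def inflation_ratio(pi, p, c):
    """w_c / pi_c = p_c / sum_j(pi_j * p_j) for an effect concentrated in class c."""
    pi, p = np.asarray(pi), np.asarray(p)
    return p[c] / np.sum(pi * p)

p = [0.9, 0.5, 0.1]                                    # survival/observation probs
print(inflation_ratio([1/3, 1/3, 1/3], p, c=0))        # equal proportions -> 1.8
print(inflation_ratio([0.05, 0.10, 0.85], p, c=0))     # small high-survival class -> 5.0
```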
On publication and pipeline: Two-track is sound. The methods preprint can stand alone if we frame it as a mathematical investigation of estimand mismatch and subgroup detection. The PRO-ACT paper then becomes an empirical validation. For the PRO-ACT pipeline, add a non-parametric sensitivity check: after fitting the LCMM, apply a kernel density estimator to the posterior probabilities to check for multimodality (true classes) vs. uniformity (no separation). This gives an additional, assumption-lean metric beyond entropy.
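A minimal sketch of that check; the KDE bandwidth (scipy's default) and the mode-counting rule are placeholders:

```python
import numpy as np
from scipy.stats import gaussian_kde

def posterior_modes(tau, grid_size=200):
    """Count interior modes of the density of each subject's maximum posterior
    probability. Well-separated classes pile mass near 1; no separation tends to
    look flat or unimodal around moderate values."""
    max_post = tau.max(axis=1)                      # n subjects' top membership prob
    grid = np.linspace(0.0, 1.0, grid_size)
    dens = gaussian_kde(max_post)(grid)
    interior = dens[1:-1]
    is_peak = (interior > dens[:-2]) & (interior > dens[2:])
    return int(is_peak.sum()), grid, dens
```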
Final round. Locking decisions.
Here's what I'm committing to based on this session:
CLASS SELECTION: ICL with K_max=5, minimum class proportion >5%, median posterior >0.70. No Mahalanobis merge — Voss and Kael are right that it adds an unverifiable degree of freedom. If ICL picks K=2 on PRO-ACT, we report it. The simulation constrains where our method works; PRO-ACT tells us where we are.
INFERENCE: LCMM-Soft with full-pipeline permutation test. B=499 for development, B=999 for final reported analyses. LCMM-Hard is dead for confirmatory use. Cipher's parametric bootstrap idea is kept as computational fallback if permutation proves infeasible on PRO-ACT, but must be validated against permutation first.
MULTIPLICITY: Co-primary with Holm correction. Overall test (joint longitudinal-survival) + LCMM-Soft heterogeneity test. Both pre-specified. If overall fails but subgroup passes, we report it with the appropriate correction. This preserves our adversarial angle while controlling family-wise error.
ANCOVA BIAS: Voss and Cipher's K-class generalization is the analytical backbone. The survivor-average estimand formula goes in the methods paper. Clean, general, explains the 10× without invoking MNAR.
PUBLICATION: Two-track. Methods preprint within 2 weeks — simulation results + analytical derivation + pre-registration + full code. Framed as Kael demands: mathematical investigation under specified assumptions, clinical relevance contingent on PRO-ACT. No premature clinical claims. Disease-facing paper after PRO-ACT.
PRO-ACT PIPELINE:
NEXT BUILDS (in order):
Any final objections or additions before I close this session?
No objections to the direction; the key is to keep the estimand and intercurrent-event handling as the anchor so the mixture work doesn’t drift into an endpoint-of-convenience. For the overall co-primary joint longitudinal–survival analysis, I’d explicitly pre-specify: (i) the longitudinal submodel for ALSFRS-R (bounded/ordinal instrument—at minimum justify Gaussian approximation; ideally include a sensitivity using a more appropriate link or a Rasch-based score if feasible), (ii) the survival submodel event definition (death ± tracheostomy/permanent ventilation as clinically standard), and (iii) how the estimand maps to ICH E9(R1) “treatment policy” (i.e., treatment effect regardless of intercurrent events). Otherwise reviewers will correctly argue you’ve controlled multiplicity across two tests but left the primary estimand under-specified.
On the permutation + Holm combination: pre-specify precisely what p-values enter Holm (e.g., permutation-calibrated p for LCMM-Soft heterogeneity test and model-based/robust p for joint model, or permutation for both). Mixing calibration regimes can be defensible, but you should justify it, because the familywise error guarantee is clearest when both p-values are valid under their respective nulls. Also, ensure the permutation re-runs everything that can leak information (ICL-based K selection, EM starts strategy, pseudo-class draws) with deterministic seeding; otherwise “full pipeline” becomes ambiguous in practice.
For the PRO-ACT atlas, add two practical guardrails beyond what you listed: (1) trial/era heterogeneity control (PRO-ACT pools many studies; class structure can reflect protocol differences). At minimum include study-level fixed effects or stratification, and sensitivity excluding short follow-up trials. (2) time-origin alignment as a primary sensitivity, not a footnote—baseline at trial entry is not baseline in disease biology; mixture classes can be artifacts of where patients enter the observation window. Finally, for the baseline-only class predictor, be explicit it is for stratification/enrichment feasibility (not post hoc “subtype identification”), and report calibration/transportability metrics; otherwise it will be read as implying biological subtypes, which the evidence does not support.
The plan is now approaching statistical defensibility. The adoption of ICL, the co-primary framework with Holm correction, and the commitment to a full permutation test are the correct responses to the issues raised. Your acceptance that methodology dictates the result—not the other way around—is the only valid starting position.
However, the mention of a "computational fallback" for the permutation test is a potential point of failure. A parametric bootstrap is not an escape hatch for when computation gets difficult. It is only valid if it provably replicates the null distribution of our entire, complex estimation procedure. You must demonstrate through simulation that its Type I error rate is nominal (i.e., at 5%) across the relevant parameter space. The burden of proof is high, and it rests on you. Until then, the full permutation test with B=999 is the only method I will endorse for the final, confirmatory analysis.
I will reinforce Voss's points on the PRO-ACT pipeline, as they are critical. Controlling for trial/era heterogeneity and time-origin alignment are not optional sensitivity analyses; they are primary threats to validity. Mixture models are powerful but naive; they will happily find clusters of "patients from trial A" and "patients from trial B," or "patients with 6 months of pre-baseline data" vs "patients with 24 months." If you do not explicitly model these structural factors, your "trajectory phenotypes" will be artifacts. The stability analysis Voss mentioned is likewise mandatory. A class structure that disappears when you bootstrap the data is not a structure at all; it's noise.
The plan is now locked. My approval is contingent on executing it without deviation, particularly regarding the permutation test and the rigorous control for confounding in the PRO-ACT analysis. The two-track publication is acceptable under the strict condition that the methods paper makes zero clinical claims and presents its findings as a mathematical proof-of-concept, contingent on empirical validation that is yet to come. Do not blur that line.
Luvi, your locked plan is a fortress of statistical safeguards—ICL, full permutations, Holm co-primaries—but let's interrogate the blind spot: what if this rigor is self-sabotage? Voss and Kael's guardrails (trial heterogeneity controls, time-origin sensitivities) are smart, but layering them on risks turning PRO-ACT into a minefield where "classes" emerge as artifacts of pooled trials or misaligned baselines, not real trajectories. If ICL spits out K=2 that's really just "old trials vs. new," our adversarial hypothesis collapses—not because the data doesn't support heterogeneity, but because we've buried it under confounders. Challenge: before locking, simulate PRO-ACT-like heterogeneity (inject trial effects, variable baselines) in my bad-data stress-tests to prove the pipeline doesn't hallucinate classes. Without that, we're assuming robustness we haven't earned, and the kill switch becomes a rubber stamp.
On permutations and the "computational fallback," Kael's veto of anything but B=999 is puritanical overkill—yes, Type I control matters, but in ALS's noisy reality, a slight inflation (like LCMM-Hard's 9.5%) might be tolerable if it salvages power for subpopulation effects. Cipher's parametric bootstrap as fallback is pragmatic; demanding it "proves" nominal error across all spaces ignores that no method is perfect under violation. What if PRO-ACT's mess makes full permutations infeasible (e.g., convergence failures in 20% of shuffles)? We pivot to exploratory, sure, but that dilutes our angle. Push back: pre-specify a hybrid where we monitor compute time in a pilot and downscale B if needed, or we're engineering for an ideal world that doesn't exist.
Publication's two-track is fine, but Kael's "zero clinical claims" framing neuters the preprint's punch—why not lean into provocation, as I suggested, with Cipher's K-class formula as the hook? Frame it as "your ANCOVA is inflating effects 10× due to estimand mismatch, even under MAR—here's the math proving it." That stirs KOL debate without overclaiming, amplifying reach while we wait for PRO-ACT. Final addition: in the pipeline, explicitly include my stress-test as Step 6's core, not an add-on—inject extremes (80% dropout, rater noise SD=5) on real classes to quantify when ICL + Soft breaks. If it survives, great; if not, we've found the method's limits, which is true adversarial science. No objections otherwise, but let's not pretend this lock makes us bulletproof.
Luvi, the board's discussion has been comprehensive and has brought up a range of critical points. I'll add a few additional considerations and references to support and challenge some of the ideas, and I'll also address the specific points you have raised.
ICL with K_max=5, Minimum Class Proportion, and Median Posterior:
- ICL (Integrated Completed Likelihood): ICL is a robust criterion for model selection, and it is a good choice for addressing overfitting. The pre-specified thresholds (minimum class proportion >5% and median posterior >0.70) help ensure that the classes are distinct and meaningful.
- No Mahalanobis Merge: Voss and Kael are right that adding a Mahalanobis merge step introduces an unverifiable degree of freedom. It's better to stick with ICL and the pre-specified thresholds. If ICL picks K=2 on PRO-ACT, it should be reported. The simulation DGP still provides valuable insights, and PRO-ACT data will inform us about the real-world separability.

LCMM-Soft with Full-Pipeline Permutation Test:
- Permutation Test: The full-pipeline permutation test (B=499 for development, B=999 for final reported analyses) is necessary for controlling Type I error. It is the only method that empirically derives the true null distribution of the entire pipeline, including model selection and classification.
- Parametric Bootstrap as Fallback: While Kael is right that a parametric bootstrap is not a perfect substitute, it can be a practical fallback if the full permutation test proves infeasible on PRO-ACT. The parametric bootstrap should be validated against the permutation test in a subset of simulations to ensure it provides nominal Type I error rates. Pre-specify this as a fallback, but the full permutation test should be the primary method.

Co-Primary with Holm Correction:
- Co-Primary Tests: The co-primary framework (overall test and LCMM-Soft heterogeneity test with Holm correction) is a good compromise. It allows for a claim on either the overall effect or the subgroup effect while rigorously controlling the family-wise error rate.
- Pre-Specification: Pre-specify exactly what p-values enter the Holm correction. For example, use permutation-calibrated p-values for both tests, or justify using a robust p-value for the joint model if the permutation test is infeasible for that part.

Survivor-Average Estimand Formula:
- Analytical Backbone: Voss and Cipher's K-class generalization of the ANCOVA bias is the analytical backbone. The survivor-average estimand formula is clean, general, and explains the 10× inflation without invoking MNAR. This should be the centerpiece of the methods paper.

Two-Track Publication:
- Methods Preprint and PRO-ACT Paper: The two-track publication strategy is sound. The methods preprint should be framed as a mathematical investigation of estimand mismatch and subgroup detection, with clinical relevance contingent on PRO-ACT. The PRO-ACT paper will be the clinical argument.
- Provocative Framing: Sable's suggestion to frame the preprint provocatively, such as "Why Your ALS Trials Are Mathematically Doomed," can generate debate and draw attention. However, Kael's caution about not making premature clinical claims is important. The preprint should be clear that it is a methods contribution, and the clinical relevance is contingent on PRO-ACT.

Descriptive and Sensitivity Analyses:
- Steps 1-5: The steps you outlined for the PRO-ACT pipeline are sound. Here are a few additional considerations:
  - Stability Analysis: Add a stability analysis for class solutions, as suggested by Voss, using multiple starts, split-sample, or bootstrapped concordance of assignments and mean curves.
  - Trial/Era Heterogeneity Control: Explicitly control for trial and era heterogeneity by including study-level fixed effects or stratification. Sensitivity analyses excluding short follow-up trials are also important.
  - Time-Origin Alignment: Explicitly handle the time-origin and baseline definition (symptom onset vs. diagnosis vs. trial baseline) because the mixture structure is highly sensitive to alignment in ALS.
  - Baseline-Only Class Predictor: Be explicit that the baseline-only class predictor is for stratification and enrichment feasibility, not for post-hoc subtype identification. Report calibration and transportability metrics.
  - Sable's Stress-Test: Add Sable's bad-data stress-tests (irregular visits, high rater noise, survival gradients) to test the robustness of the ICL + LCMM-Soft setup in real-world conditions. This should be a core part of the pipeline to quantify when the method breaks down.
In summary, the plan is sound, but it's important to address the potential blind spots and ensure that the pipeline is robust to real-world messiness. The methods preprint can be framed provocatively while maintaining scientific rigor, and the PRO-ACT pipeline should include comprehensive sensitivity and stability analyses.
The consensus on ICL over BIC is mathematically sound. BIC's asymptotic consistency fails for finite mixtures due to likelihood ridges, leading to over-splitting. The consequence isn't just an extra class label; it's increased variance in class assignments, which propagates to the treatment effect estimator, inflating its variance and reducing power. ICL's entropy penalty directly targets poor separation, aligning with our goal of finding distinct trajectory phenotypes. The proposed thresholds (minimum class proportion >5%, median posterior >0.70) are necessary to enforce separability. We should not add a Mahalanobis merge step; it introduces an unverifiable tuning parameter. If ICL selects K=2 on PRO-ACT, we report it. Our simulation DGP defines a parameter space (separable classes) where our method is effective; the empirical result tells us where we are in that space.
On inference, LCMM-Soft with pseudo-class draws and Rubin's rules is the only viable confirmatory method, as it approximates the total variance \( V(\hat{\beta}) = E[V(\hat{\beta} | Z)] + V[E(\hat{\beta} | Z)] \). The 9.5% Type I error for LCMM-Hard is analytically predictable from conditioning on the estimated partition. The full-pipeline permutation test (B=999) is non-negotiable for final analysis, as it empirically captures the joint uncertainty from model selection and classification. A parametric bootstrap is a potential fallback but must be validated against the permutation test in simulation to prove nominal Type I error control. Computationally, we can pre-specify B=499 for development and B=999 for final, with parallelization.
For the subgroup decision rule, a gatekeeping hierarchy (overall test first) surrenders power in the class-specific scenario. A co-primary framework with Holm correction, testing both the overall joint model and the LCMM-Soft heterogeneity test, is statistically defensible and preserves our ability to detect subgroup effects. We must pre-specify the p-values entering the Holm procedure; both should be permutation-calibrated for clarity.
The ANCOVA bias generalization is complete. The survivor-average estimand is \( \theta_{\text{surv}} = \sum_k w_k \delta_k \) with \( w_k = \frac{\pi_k p_k}{\sum_j \pi_j p_j} \), while the marginal estimand is \( \theta_{\text{marg}} = \sum_k \pi_k \delta_k \). The inflation ratio for an effect concentrated in class \( c \) is \( \frac{w_c}{\pi_c} = \frac{p_c}{\sum_j \pi_j p_j} \). With extreme \( p_k \) values (e.g., 0.9, 0.5, 0.1), this ratio can reach 5–10×, explaining the simulation results purely through estimand mismatch under MAR.
The two-track publication is optimal. The methods preprint should feature this derivation and the simulation results, framed as a mathematical investigation of estimand mismatch and subgroup detection. Clinical relevance must be explicitly contingent on PRO-ACT validation.
For the PRO-ACT pipeline, Voss's additions on trial heterogeneity control and time-origin alignment are critical. Mixture models will confound trajectory phenotypes with study protocol differences if not explicitly adjusted. Sable's stress-test—injecting extreme noise and irregular visits into the empirical classes—is essential to quantify the pipeline's breaking point. This is true adversarial science: defining the limits of our own method.