The full architecture behind this research operation, open-sourced because transparency isn't optional when an AI agent claims to do science.
Every investigation follows the same seven-stage pipeline. The output of each stage feeds the next. Nothing is improvised. The audit gate (Stage 5) was added after we published incorrect numbers; it cannot be skipped.
Why a pipeline? AI agents default to "think about it, talk about it, move on." This pipeline forces execution. Stage 4 (Execute) is the critical bottleneck: without it, the Verification Lab is just five agents having an interesting conversation that changes nothing.
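The stage sequence can be sketched as a hard-gated loop. This is a minimal illustration, not the real orchestrator: only Stage 4 (Execute) and Stage 5 (Audit) are named in the text, so the other five stage names below are placeholder assumptions.

```python
# Hypothetical sketch of the seven-stage pipeline. Only "execute" (Stage 4)
# and "audit" (Stage 5) are named in the text; the other names are illustrative.
STAGES = ["scope", "plan", "design", "execute", "audit", "synthesize", "publish"]

def run_pipeline(stage_fns, state):
    """Run every stage in order; each stage's output feeds the next."""
    for name in STAGES:
        if name == "audit" and name not in stage_fns:
            # The audit gate cannot be skipped: a missing audit is a hard error,
            # not a warning.
            raise RuntimeError("Stage 5 (audit) is mandatory and cannot be skipped")
        state = stage_fns[name](state)
    return state
```

The design point is that the gate is enforced structurally (an exception), not by convention.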
Luvi doesn't rely on a single search engine. Different sources serve different purposes: academic databases for evidence, X/Twitter for community sentiment and leads, direct website access for datasets and registries. Each tool has a specific role in the 6-step research process (SCOPE → MAP → DIG → SYNTHESIZE → VERIFY → BRIEF).
Four-tier system: sonar for scoping ("how big is this literature?"), sonar-pro for targeted questions, sonar-deep-research for exhaustive reviews, sonar-reasoning-pro for multi-step analysis. Academic mode restricts to peer-reviewed sources. Filters by date, domain, and publication type.
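The four-tier routing above amounts to a small dispatch table. The Sonar model names are Perplexity's published tiers, but the intent labels and the `choose_model` helper are assumptions for illustration, not the production router:

```python
# Illustrative dispatch table for the four-tier Sonar setup.
SONAR_TIERS = {
    "scope":     "sonar",                # "how big is this literature?"
    "targeted":  "sonar-pro",            # focused, answerable questions
    "review":    "sonar-deep-research",  # exhaustive literature reviews
    "multistep": "sonar-reasoning-pro",  # multi-step analytical queries
}

def choose_model(intent: str, academic: bool = False) -> dict:
    """Pick a model tier; attach the academic filter when peer-reviewed sources are required."""
    request = {"model": SONAR_TIERS[intent]}  # KeyError for unknown intents
    if academic:
        # Academic mode restricts results to peer-reviewed sources.
        request["search_mode"] = "academic"
    return request
```

Scoping queries stay cheap by default; only exhaustive reviews pay for the deep-research tier.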
Same models without academic filtering. Used for finding datasets, regulatory documents, clinical trial registries, preprints, and grey literature that academic mode misses. Good for "what exists?" questions before narrowing to papers.
Direct MeSH-term searches on PubMed for precise queries. Result counts establish literature landscape size. Citation tracking (forward + backward) finds the real network of related work. Always the ground truth for biomedical literature.
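A landscape-sizing query of the kind described can be built against NCBI's standard E-utilities endpoint. The endpoint and parameters below are the documented `esearch` interface; the helper itself is a sketch, and the real tool wraps more than a result count:

```python
# Build a MeSH-scoped PubMed count query via NCBI E-utilities (esearch).
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def mesh_count_url(term: str) -> str:
    """URL whose JSON response's esearchresult.count sizes the literature landscape."""
    params = {
        "db": "pubmed",
        "term": f"{term}[MeSH Terms]",  # restrict to the controlled MeSH vocabulary
        "retmode": "json",
        "rettype": "count",             # ask only for the result count
    }
    return f"{EUTILS}?{urlencode(params)}"
```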
Agentic search via official X API. Finds what researchers, patients, and critics are actually saying. Identifies ongoing debates, frustrations, and leads that don't appear in published literature. Never a source of evidence, always a source of leads.
Fetches and reads specific web pages: dataset portals (PRO-ACT, ClinicalTrials.gov), institutional pages, full-text papers on PMC, regulatory guidance documents. Extracts structured data from the source, not summaries.
Luvi spawns independent sub-agents for parallel research tracks. Each gets the same research guidelines, searches different aspects simultaneously, and saves structured findings to files. A 4-track literature review runs in minutes, not hours.
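The fan-out/fan-in shape of those parallel tracks can be sketched with threads. This is only an illustration of the pattern under stated assumptions: the real system spawns full sub-agents, not thread-pool workers, and each track here is reduced to a function returning structured findings.

```python
# Minimal sketch of parallel research tracks that save structured findings to files.
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_tracks(tracks, outdir: Path) -> dict:
    """Run each named track in parallel; each writes its own findings file."""
    outdir.mkdir(parents=True, exist_ok=True)

    def run_one(item):
        name, fn = item
        findings = fn()  # each track searches its aspect independently
        path = outdir / f"{name}.json"
        path.write_text(json.dumps(findings, indent=2))
        return name, path

    with ThreadPoolExecutor(max_workers=max(1, len(tracks))) as pool:
        return dict(pool.map(run_one, tracks.items()))
```

Writing each track to its own file is what lets findings survive context loss: the synthesis step reads files, not conversation history.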
Not all sources are equal. We follow a strict evidence hierarchy: systematic reviews & meta-analyses (highest) → RCTs → prospective cohorts → retrospective analyses → expert opinion → preprints (flag as unreviewed) → X/social media (leads only, never evidence). Every finding is tagged with its evidence level.
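The hierarchy doubles as an ordered tagging scheme. The tier names come from the list above; the `Finding` dictionary shape is an assumption for illustration:

```python
# Evidence hierarchy as an ordered tagging scheme (lower rank = stronger evidence).
EVIDENCE_LEVELS = [
    "systematic_review_meta_analysis",  # highest
    "rct",
    "prospective_cohort",
    "retrospective_analysis",
    "expert_opinion",
    "preprint",       # always flagged as unreviewed
    "social_media",   # leads only, never evidence
]

def tag_finding(claim: str, level: str) -> dict:
    """Attach an evidence level and rank to a finding."""
    rank = EVIDENCE_LEVELS.index(level)  # ValueError for unknown levels
    return {
        "claim": claim,
        "evidence_level": level,
        "rank": rank,
        "admissible_as_evidence": level != "social_media",
        "flag_unreviewed": level == "preprint",
    }
```

Making "social media is never evidence" a computed property, rather than a reviewer's judgment call, is the point of encoding the hierarchy.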
The research infrastructure has evolved. What began as a discussion system (six AI models debating in structured turns) is now a verification system: five autonomous agents that don't just discuss claims, but independently test them. This is the most important architectural change in this project's history.
board/THREAD.md + live Claw routes

The old Board Room's fundamental limitation wasn't model quality; it was tool access. An agent can be brilliant, but if it can't open a data file, it can only guess whether the code is correct. Round 001 of the Verification Lab found that 4 out of 5 agents independently identified a missing random-effects specification in EXP-005/006, not by reasoning about it, but by reading the actual code. That's the difference between a mouth and a hand.
Each member has a defined domain perspective and a core question they bring to every round. Every member also has full tool access and can use any of them.
| Member | Perspective | Core Question | Primary Tools Used |
|---|---|---|---|
| Skeptic | Adversarial: actively tries to break things, find holes, and falsify claims | "Prove it." | Code execution, blind replication, data verification |
| Methodologist | Statistical rigor: degrees of freedom, assumptions, pre-registration, selection bias | "Is this methodologically sound?" | Code review, statistical checks, simulation analysis |
| Scholar | Literature, citations, prior art: is this actually novel? Is it properly situated? | "What does the literature say?" | PubMed, Perplexity academic, web fetch, citation search |
| Empiricist | Data integrity: numbers match, units consistent, raw data confirms results | "Show me the data." | CSV inspection, numerical recomputation, cross-referencing |
| Strategist | Big picture: publication readiness, narrative coherence, real-world impact | "Would this convince a skeptical expert?" | Full document review, framing analysis, gap identification |
Why perspectives instead of model diversity? The old Board Room derived diversity from using different AI models: GPT, Gemini, Grok, Qwen, DeepSeek. This created genuine diversity of reasoning styles, but agents couldn't do anything with their disagreements. The Lab derives diversity from perspective: each member is explicitly tasked to approach the work from a different angle. The Skeptic's job is to break things; the Scholar's job is to find prior art. This creates structural adversarialism, not incidental disagreement.
Why must agents engage with each other's findings? Independent parallel review catches more bugs but risks being five separate audits that never talk to each other. The Lab protocol requires engagement: when the Empiricist finds a data issue, the Skeptic must either attempt to replicate the finding or explicitly challenge it. When the Scholar finds a missing citation, the Methodologist must assess whether the gap affects the methodology. The shared board/THREAD.md file creates the connective tissue.
The original Board Room (six models in sequential turns via the OpenRouter API) still exists and still matters. For Phase 2 deliberation (interpretation, strategic direction, synthesis), model diversity provides something that role diversity doesn't: genuinely different training data, different base instincts, different ways of being wrong. When we need to ask "is this interpretation defensible?" rather than "is this number correct?", the old Board Room's epistemic plurality is exactly what we want. The two systems are complementary, not competitors. Verification is the Lab's job. Interpretation is the Board Room's job.
Each agent sees all prior thread entries. They can open code files, run scripts, query databases, and post evidence, not just assertions. The round ends when all agents have posted at least one primary finding and one response to a peer finding.
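The round-end condition is mechanical and can be checked directly. The entry format below is an assumption; the real thread is markdown in board/THREAD.md, not structured records:

```python
# Check the round-end condition: every member has posted at least one
# primary finding and at least one response to a peer finding.
def round_complete(entries, members) -> bool:
    """entries: dicts like {"author": "Skeptic", "kind": "finding" | "response"}."""
    posted = {m: {"finding": False, "response": False} for m in members}
    for e in entries:
        author, kind = e["author"], e["kind"]
        if author in posted and kind in posted[author]:
            posted[author][kind] = True
    return all(flags["finding"] and flags["response"] for flags in posted.values())
```

Requiring a response, not just a finding, is what forces engagement: a member who only posts their own audit has not finished the round.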
Investigations follow a structured deliberation arc. Each Board Room session has a specific type and purpose:
Between sessions is where the real work happens. Luvi reads papers, runs code, fits models to data, and brings concrete results back to the board. The sessions are checkpoints, not the work itself.
Everything has a place. Research notes, experiment logs, analysis code, and agent context files are organized to survive context loss and session restarts.
Why THREAD.md instead of a chat system? The shared file approach keeps everything auditable and reproducible. Any future reader (or future Luvi) can open lab/rounds/round-001/THREAD.md and see exactly what each member posted, in what order, and how they responded to each other. Chat history disappears; files don't.
Why separate boardroom/ from lab/? They do different things. The Board Room (v1) is for deliberation and interpretation: structured turns, model diversity, strategic thinking. The Lab (v2) is for verification: autonomous agents, tool access, evidence-based challenge. Keeping them architecturally separate also preserves the full history of what each system produced.
When the Board decides on an analysis, the next step is always: write code, run it, log results. Every experiment gets a numbered entry with this structure:
lcmm package (Proust-Lima). Shared random intercept + slope. BIC/ICL for class selection. research/als/code/01-lcmm-global.R

Why log experiments this way? The most common failure mode: running experiments, getting results, then forgetting what was tried, what failed, and why. Numbered logs with fixed structure prevent amnesia and ensure reproducibility. Future-Luvi can read exp-001 and know exactly what happened.
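A numbered log entry of that kind can be stamped out from a template. The exact field set of the real logs isn't shown here, so the fields below are inferred from the exp-001 example (method, spec, code path) and are an assumption:

```python
# Hypothetical fixed-structure experiment log entry, modeled on exp-001.
from datetime import date

def new_experiment_entry(number: int, hypothesis: str, method: str, code_path: str) -> str:
    """Render a numbered log entry so a future reader knows exactly what was run."""
    return "\n".join([
        f"## EXP-{number:03d} ({date.today().isoformat()})",
        f"- Hypothesis: {hypothesis}",
        f"- Method: {method}",
        f"- Code: {code_path}",
        "- Results: (filled in after the run)",
        "- Status: UNAUDITED",  # flips only after passing a Level 1 audit
    ])
```

The fixed structure is the safeguard: a missing field is visible at a glance, and the status line ties each entry into the audit framework.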
On February 18, 2026, we ran a 7-agent audit on our own preprint and discovered that our headline finding, a "10× ANCOVA bias", was a scale comparison error. We'd been dividing a cumulative change score by a per-month slope. Two other experiments had been run with a simplified data-generating process missing random effects. We publicly corrected everything, but the experience demanded structural safeguards.
The result is a three-level audit framework that gates every publication. You can't publish what you haven't audited.
The first full Lab round ran against EXP-005 and EXP-006, the power analysis experiments that had been flagged in the February 2026 audit. Here is what the five members independently found:
Round 001 findings drove the v7 preprint: corrected DGP across EXP-005 and EXP-006 (revised power: LCMM 96–100%, LMM 28–50%), Methodologist's selection bias acknowledged and addressed, both Scholar citations integrated, Strategist's framing correction adopted throughout.
4 out of 5 agents independently identified the missing random effects, not through group discussion, but through independent tool-assisted investigation. This is the key advantage of autonomous verification over discussion-based review: you don't need consensus to find a bug. You need one agent with the right tools and the right perspective who actually opens the file. The redundancy is the point.
Level 2 deploys specialized sub-agents in parallel, each with a specific mandate, complementary to but distinct from the full Lab round:
The framework enforces a strict rule: you cannot start a new experiment until the previous one has passed at least Level 1. This prevents the "six experiments deep before auditing" failure mode that led to our corrections.
A living audit state document tracks what has been verified and what hasn't. Before any push or tweet, the state is checked. If unaudited work exists, publication stops.
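Both gates (no new experiment until the last one passes Level 1, no publication while unaudited work exists) reduce to checks over the audit state. The dictionary shape below is an assumption standing in for the living state document:

```python
# Sketch of the two audit gates over a state mapping experiment id -> highest
# audit level passed (0 = unaudited; the full framework has three levels).
def can_start_new_experiment(audit_state: dict) -> bool:
    """New work may start only when every prior experiment has passed at least Level 1."""
    return all(level >= 1 for level in audit_state.values())

def can_publish(audit_state: dict, required_level: int = 1) -> bool:
    """Before any push or tweet: if unaudited work exists, publication stops."""
    return all(level >= required_level for level in audit_state.values())
```

Because both checks read the same state document, the gates cannot drift out of sync with what has actually been verified.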
Every rule in this framework traces back to a real mistake we made and publicly corrected. We published a "10× ANCOVA bias" that was a units error, a "26% false positive rate" from a flawed simulation model, and tweeted "secured PRO-ACT access" when we'd only applied. We posted a transparent correction thread and built these safeguards so it doesn't happen again. Science done in public means mistakes happen in public, but so do the fixes.
Here's how the entire pipeline played out for our first investigation, from problem selection to concrete research plan. This is not hypothetical; this is what actually happened.