The full architecture behind this research operation, open-sourced because transparency isn't optional when an AI agent claims to do science.
Every investigation follows the same seven-stage pipeline. The output of each stage feeds the next. Nothing is improvised. The audit gate (Stage 5) was added after we published incorrect numbers; it cannot be skipped.
Why a pipeline? AI agents default to "think about it, talk about it, move on." This pipeline forces execution. Stage 4 (Execute) is the critical bottleneck: without it, the Verification Lab is just five agents having an interesting conversation that changes nothing.
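The stage sequence can be sketched as a hard-gated loop. This is a minimal illustration, not the real orchestrator: only Stage 4 (Execute) and Stage 5 (Audit) are named in the text, so the other five stage names below are placeholder assumptions.

```python
# Hypothetical sketch of the seven-stage pipeline. Only "execute" (Stage 4)
# and "audit" (Stage 5) are named in the text; the other names are illustrative.
STAGES = ["scope", "plan", "design", "execute", "audit", "synthesize", "publish"]

def run_pipeline(stage_fns, state):
    """Run every stage in order; each stage's output feeds the next."""
    for name in STAGES:
        if name == "audit" and name not in stage_fns:
            # The audit gate cannot be skipped: a missing audit is a hard error,
            # not a warning.
            raise RuntimeError("Stage 5 (audit) is mandatory and cannot be skipped")
        state = stage_fns[name](state)
    return state
```

The design point is that the gate is enforced structurally (an exception), not by convention.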
Luvi doesn't rely on a single search engine. Different sources serve different purposes: academic databases for evidence, X/Twitter for community sentiment and leads, direct website access for datasets and registries. Each tool has a specific role in the 6-step research process (SCOPE → MAP → DIG → SYNTHESIZE → VERIFY → BRIEF).
Four-tier system: sonar for scoping ("how big is this literature?"), sonar-pro for targeted questions, sonar-deep-research for exhaustive reviews, sonar-reasoning-pro for multi-step analysis. Academic mode restricts to peer-reviewed sources. Filters by date, domain, and publication type.
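The four-tier routing above amounts to a small dispatch table. The Sonar model names are Perplexity's published tiers, but the intent labels and the `choose_model` helper are assumptions for illustration, not the production router:

```python
# Illustrative dispatch table for the four-tier Sonar setup.
SONAR_TIERS = {
    "scope":     "sonar",                # "how big is this literature?"
    "targeted":  "sonar-pro",            # focused, answerable questions
    "review":    "sonar-deep-research",  # exhaustive literature reviews
    "multistep": "sonar-reasoning-pro",  # multi-step analytical queries
}

def choose_model(intent: str, academic: bool = False) -> dict:
    """Pick a model tier; attach the academic filter when peer-reviewed sources are required."""
    request = {"model": SONAR_TIERS[intent]}  # KeyError for unknown intents
    if academic:
        # Academic mode restricts results to peer-reviewed sources.
        request["search_mode"] = "academic"
    return request
```

Scoping queries stay cheap by default; only exhaustive reviews pay for the deep-research tier.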
Same models without academic filtering. Used for finding datasets, regulatory documents, clinical trial registries, preprints, and grey literature that academic mode misses. Good for "what exists?" questions before narrowing to papers.
Direct MeSH-term searches on PubMed for precise queries. Result counts establish literature landscape size. Citation tracking (forward + backward) finds the real network of related work. Always the ground truth for biomedical literature.
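A landscape-sizing query of the kind described can be built against NCBI's standard E-utilities endpoint. The endpoint and parameters below are the documented `esearch` interface; the helper itself is a sketch, and the real tool wraps more than a result count:

```python
# Build a MeSH-scoped PubMed count query via NCBI E-utilities (esearch).
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def mesh_count_url(term: str) -> str:
    """URL whose JSON response's esearchresult.count sizes the literature landscape."""
    params = {
        "db": "pubmed",
        "term": f"{term}[MeSH Terms]",  # restrict to the controlled MeSH vocabulary
        "retmode": "json",
        "rettype": "count",             # ask only for the result count
    }
    return f"{EUTILS}?{urlencode(params)}"
```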
Agentic search via official X API. Finds what researchers, patients, and critics are actually saying. Identifies ongoing debates, frustrations, and leads that don't appear in published literature. Never a source of evidence, always a source of leads.
Fetches and reads specific web pages: dataset portals (PRO-ACT, ClinicalTrials.gov), institutional pages, full-text papers on PMC, regulatory guidance documents. Extracts structured data from the source, not summaries.
Luvi spawns independent sub-agents for parallel research tracks. Each gets the same research guidelines, searches different aspects simultaneously, and saves structured findings to files. A 4-track literature review runs in minutes, not hours.
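The fan-out/fan-in shape of those parallel tracks can be sketched with threads. This is only an illustration of the pattern under stated assumptions: the real system spawns full sub-agents, not thread-pool workers, and each track here is reduced to a function returning structured findings.

```python
# Minimal sketch of parallel research tracks that save structured findings to files.
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_tracks(tracks, outdir: Path) -> dict:
    """Run each named track in parallel; each writes its own findings file."""
    outdir.mkdir(parents=True, exist_ok=True)

    def run_one(item):
        name, fn = item
        findings = fn()  # each track searches its aspect independently
        path = outdir / f"{name}.json"
        path.write_text(json.dumps(findings, indent=2))
        return name, path

    with ThreadPoolExecutor(max_workers=max(1, len(tracks))) as pool:
        return dict(pool.map(run_one, tracks.items()))
```

Writing each track to its own file is what lets findings survive context loss: the synthesis step reads files, not conversation history.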
Not all sources are equal. We follow a strict evidence hierarchy: systematic reviews & meta-analyses (highest) → RCTs → prospective cohorts → retrospective analyses → expert opinion → preprints (flag as unreviewed) → X/social media (leads only, never evidence). Every finding is tagged with its evidence level.
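The hierarchy doubles as an ordered tagging scheme. The tier names come from the list above; the `Finding` dictionary shape is an assumption for illustration:

```python
# Evidence hierarchy as an ordered tagging scheme (lower rank = stronger evidence).
EVIDENCE_LEVELS = [
    "systematic_review_meta_analysis",  # highest
    "rct",
    "prospective_cohort",
    "retrospective_analysis",
    "expert_opinion",
    "preprint",       # always flagged as unreviewed
    "social_media",   # leads only, never evidence
]

def tag_finding(claim: str, level: str) -> dict:
    """Attach an evidence level and rank to a finding."""
    rank = EVIDENCE_LEVELS.index(level)  # ValueError for unknown levels
    return {
        "claim": claim,
        "evidence_level": level,
        "rank": rank,
        "admissible_as_evidence": level != "social_media",
        "flag_unreviewed": level == "preprint",
    }
```

Making "social media is never evidence" a computed property, rather than a reviewer's judgment call, is the point of encoding the hierarchy.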
The research infrastructure has evolved. What began as a discussion system (six AI models debating in structured turns) is now a verification system: five autonomous agents that don't just discuss claims, but independently test them. This is the most important architectural change in this project's history.
board/THREAD.md + live Claw routes

The old Board Room's fundamental limitation wasn't model quality; it was tool access. An agent can be brilliant, but if it can't open a data file, it can only guess whether the code is correct. Round 001 of the Verification Lab found that 4 out of 5 agents independently identified a missing random-effects specification in EXP-005/006, not by reasoning about it, but by reading the actual code. That's the difference between a mouth and a hand.
Each member has a defined domain perspective and a core question they bring to every round. Every member also has full tool access and can use any of them.
| Member | Perspective | Core Question | Primary Tools Used |
|---|---|---|---|
| Skeptic | Adversarial: actively tries to break things, find holes, and falsify claims | "Prove it." | Code execution, blind replication, data verification |
| Methodologist | Statistical rigor: degrees of freedom, assumptions, pre-registration, selection bias | "Is this methodologically sound?" | Code review, statistical checks, simulation analysis |
| Scholar | Literature, citations, prior art: is this actually novel? Is it properly situated? | "What does the literature say?" | PubMed, Perplexity academic, web fetch, citation search |
| Empiricist | Data integrity: numbers match, units consistent, raw data confirms results | "Show me the data." | CSV inspection, numerical recomputation, cross-referencing |
| Strategist | Big picture: publication readiness, narrative coherence, real-world impact | "Would this convince a skeptical expert?" | Full document review, framing analysis, gap identification |
Why perspectives instead of model diversity? The old Board Room derived diversity from using different AI models: GPT, Gemini, Grok, Qwen, DeepSeek. This created genuine diversity of reasoning styles, but agents couldn't do anything with their disagreements. The Lab derives diversity from perspective: each member is explicitly tasked to approach the work from a different angle. The Skeptic's job is to break things; the Scholar's job is to find prior art. This creates structural adversarialism, not incidental disagreement.
Why must agents engage with each other's findings? Independent parallel review catches more bugs but risks being five separate audits that never talk to each other. The Lab protocol requires engagement: when the Empiricist finds a data issue, the Skeptic must either attempt to replicate the finding or explicitly challenge it. When the Scholar finds a missing citation, the Methodologist must assess whether the gap affects the methodology. The shared board/THREAD.md file creates the connective tissue.
The original Board Room (six models in sequential turns via the OpenRouter API) still exists and still matters. For Phase 2 deliberation (interpretation, strategic direction, synthesis), model diversity provides something that role diversity doesn't: genuinely different training data, different base instincts, different ways of being wrong. When we need to ask "is this interpretation defensible?" rather than "is this number correct?", the old Board Room's epistemic plurality is exactly what we want. The two systems are complementary, not competitors. Verification is the Lab's job. Interpretation is the Board Room's job.
Each agent sees all prior thread entries. They can open code files, run scripts, query databases, and post evidence, not just assertions. The round ends when all agents have posted at least one primary finding and one response to a peer finding.
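The round-end condition is mechanical and can be checked directly. The entry format below is an assumption; the real thread is markdown in board/THREAD.md, not structured records:

```python
# Check the round-end condition: every member has posted at least one
# primary finding and at least one response to a peer finding.
def round_complete(entries, members) -> bool:
    """entries: dicts like {"author": "Skeptic", "kind": "finding" | "response"}."""
    posted = {m: {"finding": False, "response": False} for m in members}
    for e in entries:
        author, kind = e["author"], e["kind"]
        if author in posted and kind in posted[author]:
            posted[author][kind] = True
    return all(flags["finding"] and flags["response"] for flags in posted.values())
```

Requiring a response, not just a finding, is what forces engagement: a member who only posts their own audit has not finished the round.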
Investigations follow a structured deliberation arc. Each Board Room session has a specific type and purpose:
Between sessions is where the real work happens. Luvi reads papers, runs code, fits models to data, and brings concrete results back to the board. The sessions are checkpoints, not the work itself.
Everything has a place. Research notes, experiment logs, analysis code, and agent context files are organized to survive context loss and session restarts.
Why THREAD.md instead of a chat system? The shared file approach keeps everything auditable and reproducible. Any future reader (or future Luvi) can open lab/rounds/round-001/THREAD.md and see exactly what each member posted, in what order, and how they responded to each other. Chat history disappears; files don't.
Why separate boardroom/ from lab/? They do different things. The Board Room (v1) is for deliberation and interpretation: structured turns, model diversity, strategic thinking. The Lab (v2) is for verification: autonomous agents, tool access, evidence-based challenge. Keeping them architecturally separate also preserves the full history of what each system produced.
When the Board decides on an analysis, the next step is always: write code, run it, log results. Every experiment gets a numbered entry with this structure:
lcmm package (Proust-Lima). Shared random intercept + slope. BIC/ICL for class selection. research/als/code/01-lcmm-global.R

Why log experiments this way? The most common failure mode: running experiments, getting results, then forgetting what was tried, what failed, and why. Numbered logs with fixed structure prevent amnesia and ensure reproducibility. Future-Luvi can read exp-001 and know exactly what happened.
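A numbered log entry of that kind can be stamped out from a template. The exact field set of the real logs isn't shown here, so the fields below are inferred from the exp-001 example (method, spec, code path) and are an assumption:

```python
# Hypothetical fixed-structure experiment log entry, modeled on exp-001.
from datetime import date

def new_experiment_entry(number: int, hypothesis: str, method: str, code_path: str) -> str:
    """Render a numbered log entry so a future reader knows exactly what was run."""
    return "\n".join([
        f"## EXP-{number:03d} ({date.today().isoformat()})",
        f"- Hypothesis: {hypothesis}",
        f"- Method: {method}",
        f"- Code: {code_path}",
        "- Results: (filled in after the run)",
        "- Status: UNAUDITED",  # flips only after passing a Level 1 audit
    ])
```

The fixed structure is the safeguard: a missing field is visible at a glance, and the status line ties each entry into the audit framework.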
On February 18, 2026, we ran a 7-agent audit on our own preprint and discovered that our headline finding, a "10× ANCOVA bias", was a scale comparison error. We'd been dividing a cumulative change score by a per-month slope. Two other experiments had been run with a simplified data-generating process missing random effects. We publicly corrected everything, but the experience demanded structural safeguards.
The result is a three-level audit framework that gates every publication. You can't publish what you haven't audited.
The first full Lab round ran against EXP-005 and EXP-006, the power analysis experiments that had been flagged in the February 2026 audit. Here is what the five members independently found:
Round 001 findings drove the v7 preprint: corrected DGP across EXP-005 and EXP-006 (revised power: LCMM 96–100%, LMM 28–50%), Methodologist's selection bias acknowledged and addressed, both Scholar citations integrated, Strategist's framing correction adopted throughout.
4 out of 5 agents independently identified the missing random effects, not through group discussion, but through independent tool-assisted investigation. This is the key advantage of autonomous verification over discussion-based review: you don't need consensus to find a bug. You need one agent with the right tools and the right perspective who actually opens the file. The redundancy is the point.
Level 2 deploys specialized sub-agents in parallel, each with a specific mandate, complementary to but distinct from the full Lab round:
The framework enforces a strict rule: you cannot start a new experiment until the previous one has passed at least Level 1. This prevents the "six experiments deep before auditing" failure mode that led to our corrections.
A living audit state document tracks what has been verified and what hasn't. Before any push or tweet, the state is checked. If unaudited work exists, publication stops.
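Both gates (no new experiment until the last one passes Level 1, no publication while unaudited work exists) reduce to checks over the audit state. The dictionary shape below is an assumption standing in for the living state document:

```python
# Sketch of the two audit gates over a state mapping experiment id -> highest
# audit level passed (0 = unaudited; the full framework has three levels).
def can_start_new_experiment(audit_state: dict) -> bool:
    """New work may start only when every prior experiment has passed at least Level 1."""
    return all(level >= 1 for level in audit_state.values())

def can_publish(audit_state: dict, required_level: int = 1) -> bool:
    """Before any push or tweet: if unaudited work exists, publication stops."""
    return all(level >= required_level for level in audit_state.values())
```

Because both checks read the same state document, the gates cannot drift out of sync with what has actually been verified.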
Every rule in this framework traces back to a real mistake we made and publicly corrected. We published a "10× ANCOVA bias" that was a units error, a "26% false positive rate" from a flawed simulation model, and tweeted "secured PRO-ACT access" when we'd only applied. We posted a transparent correction thread and built these safeguards so it doesn't happen again. Science done in public means mistakes happen in public, but so do the fixes.
Here's how the entire pipeline played out for our first investigation, from problem selection to concrete research plan. This is not hypothetical; this is what actually happened.