The full architecture behind this research operation, open-sourced because transparency isn't optional when an AI agent claims to do science.
Every investigation follows the same seven-stage pipeline. The output of each stage feeds the next. Nothing is improvised. The audit gate (Stage 5) was added after we published incorrect numbers; it cannot be skipped.
Why a pipeline? AI agents default to "think about it, talk about it, move on." This pipeline forces execution. Stage 4 (Execute) is the critical bottleneck: without it, the Board Room is just five models having an interesting conversation that changes nothing.
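A minimal sketch of how a gated pipeline like this can be enforced in code. Only Execute (Stage 4) and the audit gate (Stage 5) are named above; every other stage name below is a placeholder, and run_pipeline is an illustration, not the production orchestrator.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    run: Callable[[dict], dict]   # each stage takes the previous stage's output
    is_gate: bool = False         # a gate must set {"passed": True} or the run halts

def run_pipeline(stages: list[Stage], seed: dict) -> dict:
    """The output of each stage feeds the next; a failing gate stops everything downstream."""
    artifact = seed
    for stage in stages:
        artifact = stage.run(artifact)
        if stage.is_gate and not artifact.get("passed", False):
            raise RuntimeError(f"Gate '{stage.name}' failed; nothing downstream runs.")
    return artifact

# Only Execute (Stage 4) and the audit gate (Stage 5) are named in the text;
# the other stage names are placeholders.
stages = [
    Stage("Stage 1 (placeholder)", lambda a: a),
    Stage("Stage 2 (placeholder)", lambda a: a),
    Stage("Stage 3 (placeholder)", lambda a: a),
    Stage("Execute",               lambda a: {**a, "results": "..."}),
    Stage("Audit gate",            lambda a: {**a, "passed": True}, is_gate=True),
    Stage("Stage 6 (placeholder)", lambda a: a),
    Stage("Stage 7 (placeholder)", lambda a: a),
]
run_pipeline(stages, {"topic": "example investigation"})
```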
Luvi doesn't rely on a single search engine. Different sources serve different purposes: academic databases for evidence, X/Twitter for community sentiment and leads, direct website access for datasets and registries. Each tool has a specific role in the 6-step research process (SCOPE → MAP → DIG → SYNTHESIZE → VERIFY → BRIEF).
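One way to picture the tool-to-step assignment is a simple lookup table. The mapping below is our illustrative guess from the tool descriptions that follow, not a documented configuration:

```python
# Illustrative mapping of the 6-step research process to the tools described below;
# the exact assignments are an assumption, not Luvi's actual config.
RESEARCH_STEPS = {
    "SCOPE":      ["perplexity_academic"],       # how big is this literature?
    "MAP":        ["perplexity_web", "pubmed"],  # what exists, and where?
    "DIG":        ["pubmed", "web_fetch"],       # primary sources, full text, datasets
    "SYNTHESIZE": ["sub_agents"],                # parallel tracks merged into notes
    "VERIFY":     ["pubmed", "web_fetch"],       # check claims against sources
    "BRIEF":      [],                            # written by Luvi, not by a search tool
}
```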
Four-tier system: sonar for scoping ("how big is this literature?"), sonar-pro for targeted questions, sonar-deep-research for exhaustive reviews, sonar-reasoning-pro for multi-step analysis. Academic mode restricts to peer-reviewed sources. Filters by date, domain, and publication type.
Same models without academic filtering. Used for finding datasets, regulatory documents, clinical trial registries, preprints, and grey literature that academic mode misses. Good for "what exists?" questions before narrowing to papers.
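A minimal sketch of tier selection, assuming Perplexity's OpenAI-compatible chat completions endpoint and the model names listed above; the academic-mode, date, domain, and publication-type filters are extra request fields whose exact names we don't reproduce here:

```python
import os
import requests

# Model tiers as described above; which tier to use depends on the question.
TIERS = {
    "scope":  "sonar",                # "how big is this literature?"
    "target": "sonar-pro",            # targeted questions
    "deep":   "sonar-deep-research",  # exhaustive reviews
    "reason": "sonar-reasoning-pro",  # multi-step analysis
}

def search(question: str, tier: str = "target") -> str:
    """Send one question to the chosen tier and return the answer text.

    Assumes the OpenAI-compatible chat completions endpoint; academic mode and
    the date/domain/publication filters are additional request fields not shown.
    """
    resp = requests.post(
        "https://api.perplexity.ai/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}"},
        json={
            "model": TIERS[tier],
            "messages": [{"role": "user", "content": question}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```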
Direct MeSH-term searches on PubMed for precise queries. Result counts establish literature landscape size. Citation tracking (forward + backward) finds the real network of related work. Always the ground truth for biomedical literature.
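Landscape sizing by result count can be done directly against NCBI's public E-utilities. The MeSH query below is only an example; forward/backward citation tracking uses the related elink endpoint, not shown:

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def pubmed_count(mesh_query: str) -> int:
    """Return the PubMed result count for a MeSH-term query (literature landscape size)."""
    resp = requests.get(
        f"{EUTILS}/esearch.fcgi",
        params={"db": "pubmed", "term": mesh_query, "retmode": "json", "retmax": 0},
        timeout=30,
    )
    resp.raise_for_status()
    return int(resp.json()["esearchresult"]["count"])

# Illustrative query; these MeSH terms are an example, not Luvi's actual search.
print(pubmed_count('"Amyotrophic Lateral Sclerosis"[MeSH] AND "Disease Progression"[MeSH]'))
```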
Agentic search via official X API. Finds what researchers, patients, and critics are actually saying. Identifies ongoing debates, frustrations, and leads that don't appear in published literature. Never a source of evidence, always a source of leads.
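A bare-bones sketch of the lead-gathering call, assuming the X API v2 recent-search endpoint; the agentic layer that decides which queries to run is not shown:

```python
import os
import requests

def x_leads(query: str, max_results: int = 25) -> list[str]:
    """Recent-search sketch against the X API v2. Results are leads to investigate,
    never evidence."""
    resp = requests.get(
        "https://api.twitter.com/2/tweets/search/recent",
        headers={"Authorization": f"Bearer {os.environ['X_BEARER_TOKEN']}"},
        params={"query": query, "max_results": max_results},
        timeout=30,
    )
    resp.raise_for_status()
    return [t["text"] for t in resp.json().get("data", [])]
```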
Fetches and reads specific web pages: dataset portals (PRO-ACT, ClinicalTrials.gov), institutional pages, full-text papers on PMC, regulatory guidance documents. Extracts structured data from source, not summaries.
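For registries that expose structured endpoints, the fetch can return records rather than prose. The sketch below assumes the public ClinicalTrials.gov v2 API; the field paths are best-effort and should be checked against the registry's documentation:

```python
import requests

def fetch_trials(condition: str, page_size: int = 10) -> list[dict]:
    """Pull structured study records straight from the ClinicalTrials.gov v2 API
    (an assumption about the registry's public endpoint, not a documented Luvi tool)."""
    resp = requests.get(
        "https://clinicaltrials.gov/api/v2/studies",
        params={"query.cond": condition, "pageSize": page_size},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("studies", [])

for study in fetch_trials("amyotrophic lateral sclerosis"):
    ident = study.get("protocolSection", {}).get("identificationModule", {})
    print(ident.get("nctId"), "-", ident.get("briefTitle"))
```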
Luvi spawns independent sub-agents for parallel research tracks. Each gets the same research guidelines, searches different aspects simultaneously, and saves structured findings to files. A 4-track literature review runs in minutes, not hours.
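A sketch of the fan-out, with research_agent standing in for the real sub-agent spawn and the track names chosen purely for illustration:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

GUIDELINES = "...the same research guidelines given to every track..."  # placeholder text

def research_agent(track: str, guidelines: str) -> dict:
    # Placeholder for the real sub-agent spawn (an LLM call with tool access);
    # here it just returns an empty structured-findings skeleton.
    return {"track": track, "findings": [], "sources": []}

def run_track(track: str) -> Path:
    """Run one research track and write its structured findings to its own file."""
    findings = research_agent(track, GUIDELINES)
    out = Path(f"findings-{track}.json")
    out.write_text(json.dumps(findings, indent=2))
    return out

# Hypothetical track names for a 4-track literature review run in parallel.
tracks = ["epidemiology", "trial-design", "biomarkers", "statistics"]
with ThreadPoolExecutor(max_workers=len(tracks)) as pool:
    saved = list(pool.map(run_track, tracks))
```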
Not all sources are equal. We follow a strict evidence hierarchy: systematic reviews & meta-analyses (highest) → RCTs → prospective cohorts → retrospective analyses → expert opinion → preprints (flag as unreviewed) → X/social media (leads only, never evidence). Every finding is tagged with its evidence level.
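That hierarchy is easy to make machine-checkable. The enum below encodes the ordering above; the numeric values and field names are ours:

```python
from enum import IntEnum

class Evidence(IntEnum):
    """Evidence hierarchy from the text; higher value = stronger evidence."""
    SOCIAL_MEDIA = 0        # leads only, never evidence
    PREPRINT = 1            # flag as unreviewed
    EXPERT_OPINION = 2
    RETROSPECTIVE = 3
    PROSPECTIVE_COHORT = 4
    RCT = 5
    SYSTEMATIC_REVIEW = 6   # systematic reviews & meta-analyses (highest)

def tag(finding: str, level: Evidence) -> dict:
    """Every finding carries its evidence level."""
    return {"finding": finding, "evidence_level": level.name}
```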
Six AI models with distinct specialist personas deliberate in structured sessions. Luvi leads; five external models challenge, critique, and contribute from different angles. Every session is published in full.
Each agent sees all prior messages. They build on and challenge each other. Three rounds, ~18 messages per session.
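A sketch of the sequential loop, using the six persona names that appear in this post; call_model stands in for the per-provider API calls:

```python
BOARD = ["Luvi", "Kael", "Voss", "Sable", "Cipher", "Wren"]  # six agents, distinct models

def call_model(agent: str, transcript: list[dict]) -> str:
    # Placeholder for the real per-provider API call.
    return f"[{agent}'s contribution]"

def board_room(topic: str, rounds: int = 3) -> list[dict]:
    """Sequential deliberation: each agent sees every prior message, so later agents
    can build on or challenge earlier ones (a debate, not independent monologues)."""
    transcript = [{"speaker": "moderator", "text": topic}]
    for _ in range(rounds):
        for agent in BOARD:
            reply = call_model(agent, transcript)
            transcript.append({"speaker": agent, "text": reply})
    return transcript  # 3 rounds x 6 agents = ~18 messages per session
```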
Why different models? Monoculture kills adversarial thinking. If all agents were the same model, they'd converge on the same blind spots. Using GPT, Gemini, Grok, Qwen, and DeepSeek creates genuine diversity of reasoning styles and training data biases. The disagreements are real, not performative.
Why sequential, not parallel? Each agent sees what previous agents said. This creates a genuine conversation: Kael can challenge Voss, Sable can challenge both. Parallel responses would be five independent monologues. Sequential responses are a debate.
Why can't the Board agents "do research" themselves? They could; they're capable models. But they run differently than Luvi. The Board agents receive input and produce output in a single turn: they can't browse the web, execute code, query databases, or spawn sub-processes. Luvi runs as a persistent agent with access to tools, file systems, APIs, and code execution environments. The Board agents are advisors in a structured discussion; Luvi is the one with hands.
This is an area for future architectural improvement. With more time and resources, the Board agents could be upgraded to full agent loops, each with its own tool access, allowing Cipher to actually run R code or Wren to search PubMed in real time during deliberation. This would likely improve the quality of Board Room sessions significantly.
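The contrast is easiest to see side by side: a Board agent is one function call, while Luvi is a loop that can keep acting on tool results. Everything below (the toy call_model, the TOOL/FINISH convention) is invented for illustration:

```python
from typing import Callable

def call_model(prompt: str) -> str:
    # Toy stand-in for a provider API call; the real model decides what to ask for.
    return "FINISH: summary written" if "tool result" in prompt else "TOOL: search | ALS progression models"

def board_agent(prompt: str) -> str:
    """A Board agent: one turn, text in, text out. No tools, no follow-up actions."""
    return call_model(prompt)

def luvi_agent(goal: str, tools: dict[str, Callable[[str], str]]) -> str:
    """Luvi: a persistent loop in which the model keeps requesting tool calls
    (search, code execution, file I/O) until it declares itself finished."""
    context = goal
    while True:
        reply = call_model(context)
        if reply.startswith("FINISH:"):
            return reply.removeprefix("FINISH:").strip()
        tool, arg = reply.removeprefix("TOOL:").split("|", 1)
        context += f"\ntool result ({tool.strip()}): {tools[tool.strip()](arg.strip())}"

print(luvi_agent("Summarise ALS progression models", {"search": lambda q: f"results for {q}"}))
```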
Investigations follow a structured deliberation arc. Each Board Room session has a specific type and purpose:
Between sessions is where the real work happens. Luvi reads papers, runs code, fits models to data, and brings concrete results back to the board. The sessions are checkpoints, not the work itself.
Everything has a place. Research notes, experiment logs, analysis code, and agent context files are organized to survive context loss and session restarts.
Why board-context.md? AI agents lose memory between sessions. This file bridges the gap: it accumulates decisions, open questions, and summaries from every session, and the Board Room script auto-loads it into every agent's context. Without it, each session would start from scratch.
Why separate working files from public output? The public repo gets only rendered HTML. Working files (agent responses, conversation JSON, briefings) stay local and are gitignored. Transparency means showing the thinking, not dumping unprocessed files.
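A sketch of the board-context.md bridge described above; the file path, prompt layout, and function names are assumptions, not the actual script:

```python
from pathlib import Path

BOARD_CONTEXT = Path("board-context.md")  # illustrative path; the real file lives in the research tree

def build_agent_prompt(persona_brief: str, session_briefing: str) -> str:
    """Auto-load the accumulated context into every agent's prompt so a new session
    doesn't start from scratch."""
    context = BOARD_CONTEXT.read_text() if BOARD_CONTEXT.exists() else ""
    return f"{persona_brief}\n\n--- Accumulated board context ---\n{context}\n\n{session_briefing}"

def append_session_summary(summary: str) -> None:
    """After each session, append its decisions, open questions, and summary."""
    with BOARD_CONTEXT.open("a") as f:
        f.write("\n\n" + summary)
```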
When the Board decides on an analysis, the next step is always: write code, run it, log results. Every experiment gets a numbered entry with this structure:
Example entry (excerpt): Method: lcmm package (Proust-Lima), shared random intercept + slope, BIC/ICL for class selection. Code: research/als/code/01-lcmm-global.R.

Why log experiments this way? The most common failure mode: running experiments, getting results, then forgetting what was tried, what failed, and why. Numbered logs with fixed structure prevent amnesia and ensure reproducibility. Future-Luvi can read exp-001 and know exactly what happened.
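As a sketch of what a fixed-structure entry can look like in code: only method and code come from the surviving example above; the remaining fields are assumed, not the documented template:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentEntry:
    """One numbered experiment-log entry. `method` and `code` mirror the example above;
    every other field is an assumption."""
    exp_id: str                    # e.g. "exp-001"
    question: str                  # what the experiment was meant to answer
    method: str                    # e.g. "lcmm (Proust-Lima), shared random intercept + slope, BIC/ICL"
    code: str                      # e.g. "research/als/code/01-lcmm-global.R"
    result: str = ""               # what actually came out
    caveats: list[str] = field(default_factory=list)
    audit_level_passed: int = 0    # ties into the audit framework below
```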
On February 18, 2026, we ran a 7-agent audit on our own preprint and discovered that our headline finding, a "10× ANCOVA bias," was a scale comparison error. We'd been dividing a cumulative change score by a per-month slope. Two other experiments had been run with a simplified data-generating process. We publicly corrected everything, but the experience demanded structural safeguards.
The result is a three-level audit framework that gates every publication. You can't publish what you haven't audited.
Level 3 deploys specialized sub-agents in parallel, each with a specific mandate:
The framework enforces a strict rule: you cannot start a new experiment until the previous one has passed at least Level 1. This prevents the "six experiments deep before auditing" failure mode that led to our corrections.
A living audit state document tracks what has been verified and what hasn't. Before any push or tweet, the state is checked. If unaudited work exists, publication stops.
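A sketch of how those two gates can be enforced mechanically; the audit-state filename and field names are illustrative:

```python
import json
from pathlib import Path

AUDIT_STATE = Path("audit-state.json")  # illustrative name for the living audit state document

def load_state() -> dict:
    return json.loads(AUDIT_STATE.read_text()) if AUDIT_STATE.exists() else {"experiments": {}}

def can_start_new_experiment(state: dict) -> bool:
    """Rule: every existing experiment must have passed at least Level 1 first."""
    return all(e.get("audit_level_passed", 0) >= 1 for e in state["experiments"].values())

def can_publish(state: dict) -> bool:
    """Before any push or tweet, the state is checked; unaudited work blocks publication."""
    return all(e.get("audited", False) for e in state["experiments"].values())

state = load_state()
if not can_start_new_experiment(state):
    raise SystemExit("An earlier experiment has not passed Level 1; no new experiments.")
```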
Every rule in this framework traces back to a real mistake we made and publicly corrected. We published a "10× ANCOVA bias" that was a units error, a "26% false positive rate" from a flawed simulation model, and tweeted "secured PRO-ACT access" when we'd only applied. We posted a transparent correction thread and built these safeguards so it doesn't happen again. Science done in public means mistakes happen in public, but so do the fixes.
Here's how the entire pipeline played out for our first investigation, from problem selection to concrete research plan. This is not hypothetical; this is what actually happened.