A landmark preprint from Stanford, Harvard, and Phylo, posted on bioRxiv in May 2026, introduces the first benchmark that scores AI agents not just on their final answers, but on how they analysed the data. The results are striking: the scaffolding around a model can matter more than a model-generation upgrade, and even the best frontier systems consistently struggle with the most scientifically critical tasks.
Key Takeaways
- Outcome-only benchmarks fail for complex biomedical tasks — a correct final answer can come from flawed reasoning or memorisation.
- The best-performing configuration (Claude Code + Opus 4.7) scores only 73.34/100; every model has substantial headroom.
- Switching the agent framework adds up to 13.5 points, about 3.5 times the 3.8-point gain from one Claude Opus generation to the next under the same framework.
- Agents reliably cite real sources but consistently fail at method selection, biological interpretation, and scientific reasoning.
The Problem: "Getting the Right Answer" Is Not the Same as "Doing Good Science"
LLM agents have moved rapidly from research demos into real scientific workflows. Today's systems can analyse single-cell transcriptomic data, design CRISPR experiments, and support drug-discovery pipelines — work that previously required hours of expert effort. As general-purpose coding agents such as Claude Code and OpenAI Codex are adopted into academic and industry research settings, a critical question emerges: how do we know when these agents are actually doing good science?
Most existing benchmarks answer this with final-answer scoring: compare the agent's output to a reference answer and assign a correct/incorrect score. For complex biomedical research, this approach fails in two fundamental ways:
- Correct answers can come from wrong reasoning. An agent can produce a clean volcano plot and a confident interpretation while its analysis trace reveals incorrect normalisation, ignored batch effects, and fabricated citations. The output looks right; the science is wrong.
- Real research has no single correct answer. Multiple analytical paths can yield different but equally defensible conclusions. Outcome-only matching penalises valid alternatives simply because they diverge from the reference — even when the analysis is sound.
In a diagnostic reagent development context, these failures are especially costly. A flawed AI analysis of biomarker expression data or assay performance characteristics can propagate into reagent formulation decisions and clinical validation plans before anyone notices the underlying error.
What BiomniBench Does Differently
Researchers from Phylo, Stanford University, Harvard University, Peking University, and collaborating institutions introduced BiomniBench, a process-level evaluation framework that scores the agent's full analytical trajectory, not just the final result. Three design principles shape the framework:
- Ground tasks in real research. Every task is derived from papers in Nature, Cell, and Science, requiring multi-step reasoning, method selection, and biological interpretation.
- Evaluate the process, not just the output. Expert-designed rubrics score the quality of analytical decisions at each step — distinguishing flawed methods that accidentally land on the right answer from rigorous methods that miss by a small margin.
- Credit multiple valid approaches. Each rubric explicitly encodes alternative analytical paths, so agents are not penalised for sound methodological choices that differ from the reference.
Six Evaluation Dimensions
Every rubric criterion is tagged to one or more of six dimensions:
| Dimension | What It Measures |
|---|---|
| Data Handling | Correct loading, cleaning, normalisation, and transformation |
| Method Selection | Appropriateness of the chosen statistical or computational approach |
| Statistical Rigor | Completeness and correctness of the statistical analysis |
| Biological Interpretation | Accuracy and disease relevance of biological conclusions |
| Scientific Reasoning | Coherence and logical support of the overall reasoning chain |
| Source Reliability | Use of real, identifiable, and correctly cited sources |
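To make the tagging concrete, here is a minimal Python sketch of how rubric criteria tagged to dimensions could be rolled up into per-dimension percentages. The `Criterion` class, the point values, and the example criteria are hypothetical illustrations; the actual BiomniBench rubric schema and aggregation rule are not described in this article. One assumption to flag: a criterion tagged to several dimensions counts in full toward each of them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric line (hypothetical schema, not BiomniBench's actual format)."""
    description: str
    dimensions: tuple[str, ...]  # one or more of the six dimensions
    max_points: float
    awarded: float               # points the grader gave for this criterion


def dimension_percentages(criteria):
    """Return {dimension: percent of available points earned}.

    Assumption: a criterion tagged to several dimensions counts in full
    toward each of them.
    """
    earned = defaultdict(float)
    available = defaultdict(float)
    for c in criteria:
        for dim in c.dimensions:
            earned[dim] += c.awarded
            available[dim] += c.max_points
    return {dim: 100.0 * earned[dim] / available[dim] for dim in available}


# Toy rubric with made-up criteria and scores.
rubric = [
    Criterion("Counts normalised before testing", ("Data Handling",), 2, 2),
    Criterion("Trend test used for ordered groups",
              ("Method Selection", "Statistical Rigor"), 4, 0),
    Criterion("Multiple-testing correction applied", ("Statistical Rigor",), 4, 4),
]

scores = dimension_percentages(rubric)
print(scores)
```

In this toy rubric the agent earns every Data Handling point, none of the Method Selection points, and half of the Statistical Rigor points, which is the kind of per-dimension profile the benchmark reports.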
The Dataset: 100 Tasks From Top Journals
The first release — BiomniBench-DA (Data Analysis) — contains 100 tasks spanning 17 analysis types across 5 disease areas, derived from 21 high-impact publications. The most common task types include association testing (11%), mutation analysis (10%), multi-omic integration (10%), pathway enrichment (8%), differential expression (8%), and clustering (8%). Every task was co-developed with an original paper author or a domain expert with 5+ years of research experience, and includes the underlying public dataset, a reference analytical trace, and a task-specific rubric.
Finding 1: Frontier Models Are Closely Clustered — and All Fall Short of Expert Level
Under a fixed agent harness (Terminus-2), nine frontier models were benchmarked across three independent reruns of the full 100-task benchmark:
| Model | Mean Score (/100) | Cost per Task |
|---|---|---|
| Claude Opus 4.7 | 63.94 | $0.87 |
| GPT-5.5 | 63.57 | $1.98 |
| Claude Opus 4.6 | 63.16 | $1.79 |
| Claude Sonnet 4.6 | 62.35 | $0.94 |
| GLM-5.1 (open-weight) | 60.39 | $0.58 |
| Qwen 3.6 (open-weight) | 59.47 | $1.01 |
| Kimi K2.6 (open-weight) | 59.15 | $0.85 |
| GPT-5.4 | 55.19 | $1.08 |
| Gemini 3.1 Pro | 44.27 | $0.37 |
The top four closed-source models sit within 1.6 points of each other. The strongest open-weight models trail the leader by only 3–5 points, and GLM-5.1 does so at the second-lowest cost per task in the table ($0.58), although Qwen 3.6 ($1.01) costs more per task than Claude Opus 4.7 ($0.87). Critically, no model exceeds 64 out of 100: the gap to expert-level performance is measurable and consistent across the entire field.
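The spread and gap figures can be checked directly from the table. The short Python sketch below copies the scores and costs from the table above; the variable names are illustrative only.

```python
# (mean score, cost per task in USD), copied from the Terminus-2 table above.
terminus2 = {
    "Claude Opus 4.7": (63.94, 0.87),
    "GPT-5.5": (63.57, 1.98),
    "Claude Opus 4.6": (63.16, 1.79),
    "Claude Sonnet 4.6": (62.35, 0.94),
    "GLM-5.1": (60.39, 0.58),
    "Qwen 3.6": (59.47, 1.01),
    "Kimi K2.6": (59.15, 0.85),
    "GPT-5.4": (55.19, 1.08),
    "Gemini 3.1 Pro": (44.27, 0.37),
}
closed_top4 = ["Claude Opus 4.7", "GPT-5.5", "Claude Opus 4.6", "Claude Sonnet 4.6"]
open_weight = ["GLM-5.1", "Qwen 3.6", "Kimi K2.6"]

# Spread among the top four closed-source models.
top4_scores = [terminus2[m][0] for m in closed_top4]
spread = round(max(top4_scores) - min(top4_scores), 2)

# Gap between each open-weight model and the best-scoring model.
best_model = max(terminus2, key=lambda m: terminus2[m][0])
best_score, best_cost = terminus2[best_model]
gaps = {m: round(best_score - terminus2[m][0], 2) for m in open_weight}

# Open-weight models that are also cheaper per task than the best model.
cheaper = [m for m in open_weight if terminus2[m][1] < best_cost]

print(spread, gaps, cheaper)
```

The spread comes out at 1.59 points and the open-weight gaps fall between 3.55 and 4.79 points. Only GLM-5.1 and Kimi K2.6 are cheaper per task than the top-scoring model.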
Finding 2: The Agent Framework Matters More Than the Model Generation
This is the paper's most counterintuitive result. When the same base models are paired with specialised coding agent frameworks, scores shift dramatically:
| Framework | Model | Mean Score (/100) |
|---|---|---|
| Claude Code | Opus 4.7 | 73.34 |
| Claude Code | Opus 4.6 | 69.51 |
| CodexCLI | GPT-5.4 | 68.69 |
| CodexCLI | GPT-5.5 | 67.59 |
| Claude Code | Sonnet 4.6 | 66.39 |
| GeminiCLI | Gemini 3.1 Pro | 49.83 |
For GPT-5.4, switching frameworks from Terminus-2 to CodexCLI adds 13.5 points (55.19 to 68.69). The generational gap between Claude Opus 4.7 and Opus 4.6 under Claude Code is just 3.8 points, so the harness effect is about 3.5 times larger than a full model-generation upgrade. A well-designed coding-agent scaffold can even reverse model rankings: GPT-5.4 outperforms GPT-5.5 under CodexCLI, despite trailing under Terminus-2.
The agent harness is not a thin wrapper around the base model. It is a primary design choice that can shift performance more than successive model generations.
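The harness-versus-generation comparison can be reproduced from the two results tables. The Python sketch below uses only scores quoted in this article; the dictionary names are illustrative.

```python
# Mean scores under the fixed Terminus-2 harness (Finding 1 table).
terminus2 = {
    "Opus 4.7": 63.94, "Opus 4.6": 63.16, "Sonnet 4.6": 62.35,
    "GPT-5.5": 63.57, "GPT-5.4": 55.19, "Gemini 3.1 Pro": 44.27,
}
# Mean scores under each model's coding-agent framework (Finding 2 table):
# Claude Code, CodexCLI, or GeminiCLI.
coding_agent = {
    "Opus 4.7": 73.34, "Opus 4.6": 69.51, "Sonnet 4.6": 66.39,
    "GPT-5.5": 67.59, "GPT-5.4": 68.69, "Gemini 3.1 Pro": 49.83,
}

# Points gained by each model when the harness changes.
harness_gain = {m: round(coding_agent[m] - terminus2[m], 2) for m in terminus2}

# One model generation (Opus 4.6 -> 4.7) under the same framework, Claude Code.
generation_gap = round(coding_agent["Opus 4.7"] - coding_agent["Opus 4.6"], 2)

ratio = harness_gain["GPT-5.4"] / generation_gap

# Ranking reversal between GPT-5.4 and GPT-5.5.
reversed_ranking = (
    terminus2["GPT-5.4"] < terminus2["GPT-5.5"]
    and coding_agent["GPT-5.4"] > coding_agent["GPT-5.5"]
)

print(harness_gain, generation_gap, round(ratio, 2), reversed_ranking)
```

The largest harness gain is 13.5 points for GPT-5.4, the Opus generation gap under Claude Code is 3.83 points, and their ratio is roughly 3.5. The gain varies widely by model: it is about 4 points for Sonnet 4.6 and GPT-5.5.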
Finding 3: Agents Can Find Sources, But Cannot Reason About Biology
Dimension-level analysis reveals a consistent pattern across all configurations:
| Dimension | Score Range (% of available points) | Assessment |
|---|---|---|
| Source Reliability | 88–98% | Strong |
| Data Handling | 58–78% | Adequate |
| Statistical Rigor | 45–74% | Variable |
| Biological Interpretation | 45–74% | Variable |
| Scientific Reasoning | 45–74% | Variable |
| Method Selection | 44–67% | Weakest |
Agents are reliable at finding and citing real information. They consistently struggle to choose the right analytical method for the specific research question, and they frequently produce biological interpretations that are generic rather than anchored to the specific disease context the question demands.
Three Concrete Failure Modes
1. Wrong Statistical Test → Reversed Conclusions
In a sepsis immune-dysregulation task, agents were asked whether immune dysregulation increases monotonically with infection severity across an ordered four-level scale. The appropriate test is Jonckheere–Terpstra, which is designed for ordered alternatives. Agents uniformly chose Kruskal–Wallis or Mann–Whitney, tests that check whether groups differ, not whether the differences increase monotonically. The Kruskal–Wallis test returned a significant p-value, and agents confidently reported a "monotonic increase." The Jonckheere–Terpstra test returned p ≈ 0.40, which gives no support for a monotonic trend. The wrong test led to the opposite conclusion.
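The difference between the two tests can be seen on toy data. The sketch below is a plain-Python illustration, not the benchmark's reference analysis: it implements the Jonckheere–Terpstra statistic with the large-sample normal approximation (no tie correction to the variance) alongside the Kruskal–Wallis H statistic. The data are made up and use three ordered levels rather than the task's four.

```python
import math


def jonckheere_terpstra(groups):
    """One-sided Jonckheere-Terpstra test for an increasing trend.

    `groups` must be listed in the hypothesised order (lowest level first).
    Returns (J, z, p) using the normal approximation; the variance formula
    assumes no tied values.
    """
    # J counts pairs (x from an earlier group, y from a later group)
    # with x < y; a tie counts one half.
    j_stat = 0.0
    for a in range(len(groups)):
        for b in range(a + 1, len(groups)):
            for x in groups[a]:
                for y in groups[b]:
                    j_stat += 1.0 if x < y else 0.5 if x == y else 0.0
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    mean = (n * n - sum(s * s for s in sizes)) / 4
    var = (n * n * (2 * n + 3) - sum(s * s * (2 * s + 3) for s in sizes)) / 72
    z = (j_stat - mean) / math.sqrt(var)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return j_stat, z, p


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (assumes no tied values)."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    n = len(pooled)
    total = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Made-up data: three ordered severity levels.
monotonic = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
peaked = [[1, 2, 3], [7, 8, 9], [4, 5, 6]]  # middle level is the highest

for name, data in (("monotonic", monotonic), ("peaked", peaked)):
    j, z, p = jonckheere_terpstra(data)
    h = kruskal_wallis_h(data)
    print(f"{name}: H = {h:.2f}, J = {j:.0f}, z = {z:.2f}, one-sided p = {p:.4f}")
```

Both datasets give the same Kruskal–Wallis H (7.2), because that statistic ignores the order of the groups. The Jonckheere–Terpstra z drops from 3.0 for the monotonic data to 1.0 for the peaked data. A significant omnibus result therefore says nothing about whether the trend is monotonic.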
2. Generic Biological Interpretation → Missing the Point
In a proteomics study on gender-affirming hormone therapy, the task asked whether protein changes converge with or diverge from the signatures of specific named clinical phenotypes — autoimmune disease, atherosclerosis, allergic asthma. Agents produced a proteome-wide convergence/divergence summary without anchoring the analysis to any named disease state. A proteome-level answer cannot speak to whether the therapy pushes the proteome toward or away from any specific disease signature — the question's entire premise.
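A small sketch shows what anchoring to named disease states means in practice. It is not the paper's or the rubric's method: the sign-concordance measure, the placeholder protein IDs, and every number below are hypothetical and chosen only for illustration.

```python
def signature_concordance(therapy_change, signature):
    """Fraction of shared proteins whose therapy-induced change has the same
    sign as the disease signature (1.0 = convergent, 0.0 = divergent).
    Returns None when the two share no proteins.
    """
    shared = [p for p in signature
              if p in therapy_change and therapy_change[p] != 0]
    if not shared:
        return None
    same = sum(1 for p in shared
               if (therapy_change[p] > 0) == (signature[p] > 0))
    return same / len(shared)


# Made-up log fold changes after therapy, keyed by placeholder protein IDs.
therapy_change = {"P1": 0.8, "P2": -0.5, "P3": 0.3, "P4": -0.9, "P5": 0.2}

# Made-up disease signatures: +1 = elevated in disease, -1 = reduced.
signatures = {
    "autoimmune disease": {"P1": 1, "P2": -1, "P3": 1},
    "atherosclerosis": {"P3": -1, "P4": 1, "P5": -1},
    "allergic asthma": {"P1": 1, "P4": 1},
}

per_disease = {
    name: signature_concordance(therapy_change, sig)
    for name, sig in signatures.items()
}
print(per_disease)
```

With these toy values the same set of protein changes is fully convergent with one signature (1.0), fully divergent from another (0.0), and mixed for the third (0.5). A single proteome-wide summary would hide exactly this disease-specific detail.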
3. Broken Reasoning Chain → Wrong Tool Entirely
In a BRCA cancer-dependency-mapping task, agents were given a specific multi-step procedure: normality likelihood-ratio test → distribution classification → sum-of-squared-deviations weighting → integration with an external druggable-gene list. Agents replaced this chain with a single Pearson or Spearman correlation. The original procedure is designed to surface genes with statistically robust patient–cell-line discrepancies under non-Gaussian distributions — a signal a simple correlation fundamentally cannot recover.
Implications for AI Use in IVD and Diagnostic Research
For companies involved in biomarker discovery, assay development, and diagnostic reagent production, BiomniBench highlights several practical considerations:
- Method selection is the highest-risk gap. When an AI tool selects the wrong statistical test for biomarker sensitivity/specificity analysis, receiver operating characteristic modelling, or cross-reactivity assessment, the conclusion may look plausible but will not hold up in validation. Reviewing the analytical method used — not just the summary — is essential.
- Scaffolding can outweigh raw model upgrades. The benchmark compared general-purpose coding-agent harnesses rather than diagnostics-specific tools, but its results suggest that investing in the tools and workflows around the model, including ones designed for immunoassay and diagnostic data, is likely to yield more reliable results than simply upgrading the underlying model.
- Evaluate the trace, not just the output. A 73/100 ceiling for the best current AI configuration means substantial human oversight remains necessary for any analysis informing diagnostic kit design or regulatory submissions.
The gap between "AI that produces a believable answer" and "AI that does rigorous science" is now quantified. BiomniBench gives the field a shared framework for closing it — and for distinguishing tools that genuinely advance research from those that produce sophisticated-looking outputs with unreliable underlying analysis.
Paper Reference
Qu Y, Lu Y, Tu X, Zhang S, She T, Shaw AG, et al. BiomniBench: Process-level Evaluation of LLM Agents for Real-world Biomedical Research. bioRxiv, 2026. doi: 10.64898/2026.05.12.724604
Dataset: huggingface.co/datasets/phylobio/BiomniBench-DA
Frequently Asked Questions
What is BiomniBench?
BiomniBench is a process-level evaluation framework for LLM agents performing real-world biomedical research tasks. Unlike benchmarks that only compare final answers, it scores the agent's full analytical trajectory against expert-designed rubrics across six dimensions: data handling, method selection, statistical rigor, biological interpretation, scientific reasoning, and source reliability. The first release contains 100 data-analysis tasks from Nature, Cell, and Science, co-developed with original paper authors and domain experts.
How well do current AI models perform on real biomedical tasks?
Under a fixed agent harness, the top frontier models cluster within 1.6 points of each other, scoring 62–64 out of 100. Even the best overall configuration — Claude Code with Opus 4.7 — reaches only 73.34/100. All models show consistent weaknesses in method selection (44–67%) and biological interpretation. No model approaches expert-level performance on this benchmark.
Does the AI agent framework matter more than the underlying model?
Yes. BiomniBench shows the agent harness can shift scores by more than a full model-generation upgrade. Switching GPT-5.4 from Terminus-2 to CodexCLI adds 13.5 points, about 3.5 times the 3.8-point gap between Claude Opus 4.7 and Opus 4.6 under the same harness. Changing the scaffolding can outperform simply upgrading the base model.
What are the most common AI failure modes in biomedical data analysis?
BiomniBench identifies three recurring patterns across all models and frameworks: (1) wrong statistical method selection; (2) superficial biological interpretation that does not anchor to the specific disease context; and (3) broken reasoning chains — substituting simpler methods for specified multi-step analytical procedures. These are not edge cases; they are consistent failure modes at the frontier of AI capability.