The Future of Biology Is Agentic

From orchestration to validation: agent-native skill libraries as the path to clinical-grade genomic AI
Manuel Corpas
Senior Lecturer in Genomics, AI, and Data Science
University of Westminster
UK DRI seminar · Imperial College London · 29 April 2026
Two papers underpin this talk. First, a Perspective in revision at Cell Genomics with Segun Fatumo and Heinner Guio defining what agentic genomics is, what it is not, and the validation framework we propose. Second, an empirical benchmark with Alfredo Iacoangeli, Fatumo, and Guio, in submission to Briefings in Bioinformatics, that tests whether the framework actually delivers on clinical-grade pharmacogenomics. TIMING: 1 min.

Two Waves of LLMs in Biology

FIRST WAVE: information retrieval · chat · summarisation
SECOND WAVE: tool use, code execution · autonomous execution · multi-step planning
Bottleneck: code production → validation & judgement
Researcher: producer → evaluator

First wave: information retrieval

  • Summarising papers, answering pathway questions, extracting structured data from text.
  • Useful but incremental.

Second wave: autonomous execution

  • Modern LLMs can write, debug, and execute code.
  • Connected to file systems, databases, and command-line tools, they plan multi-step operations and adapt to intermediate results.
  • The researcher's role shifts from producing analyses to evaluating them.

Corpas, Fatumo, Guio. Agentic Genomics: From Pipeline Automation to Autonomous Validation. Cell Genomics (in revision), 2026.

The point I want to plant: this is not a chatbot story. The interesting thing is autonomous tool use with consequence. Once the model is acting, the rate-limiting step shifts. TIMING: 1.5 min.

Defining Agentic Genomics

AGENTIC GENOMICS
  1. Autonomy: decisions during execution, not a static workflow.
  2. Domain constraint: structured library of validated skills, not ad hoc code.
  3. Iterative refinement: evaluates intermediate results, recovers from errors.
  4. Natural-language mediation: researcher describes intent; agent translates to execution.

Jointly necessary. Falsifiable via the perturbation test: an agent that ignores perturbed intermediate outputs is not agentic.
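A minimal sketch of how the perturbation test could be scored, assuming an agent run can be captured as an ordered decision trace; the trace format and function below are illustrative, not the ClawBio implementation:

```python
# Illustrative perturbation test: a decision trace is modelled here as an
# ordered list of (step, tool, parameters) tuples. Hypothetical format, not
# the ClawBio log schema.

def passes_perturbation_test(baseline_trace, perturbed_trace, perturbation_step):
    """Return True if decisions after the perturbation point differ, i.e. the
    agent reacted to the perturbed intermediate output. Identical downstream
    decisions mean the perturbation was ignored: not agentic."""
    def after(trace):
        return [d for d in trace if d[0] > perturbation_step]
    return after(baseline_trace) != after(perturbed_trace)

# Example: perturbing the step-2 genotype call should change the step-3 decision.
baseline  = [(1, "load_vcf", {}), (2, "call_genotype", {}), (3, "report", {"phenotype": "NM"})]
perturbed = [(1, "load_vcf", {}), (2, "call_genotype", {}), (3, "flag_for_review", {"reason": "QC fail"})]
print(passes_perturbation_test(baseline, perturbed, perturbation_step=2))  # True
```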

Full definition (delivered verbally): "The use of autonomous AI agents, powered by large language models and operating within domain-constrained skill libraries, to discover, plan, execute, and iteratively refine multi-step genomic analyses, where the agent exercises runtime decision-making over tool selection, parameterisation, error handling, and output evaluation." These four conditions are deliberately strict. They exclude workflow automation (Nextflow, Snakemake, Galaxy, no runtime decisions). They exclude AutoML (search within a fixed space). They exclude LLM-assisted scripting (you execute, not the agent). And they exclude general-purpose biomedical copilots (information retrieval, no multi-step execution against real data). TIMING: 2 min.

The Paradigm Shift: Code Production to Validation

Figure 1: Traditional vs Agentic Workflow

(A) Traditional workflow. The researcher writes code, configures tools, runs pipelines, and interprets results. The bottleneck is code production.
(B) Agentic workflow. The researcher describes intent in natural language; an AI agent discovers and executes skills from a modular library; the researcher validates results. The bottleneck shifts to validation and judgement.

This is the central diagram of the Cell Genomics Perspective (in revision). The skills shown in panel B are real ClawBio skills: pharmacogenomics, variant annotation, ancestry estimation, drug safety, genome QC, PRS calculation, nutri-genomics, structural variants. Same audience question for the rest of the talk: what does it take to make panel B trustworthy? TIMING: 1.5 min.

The Validation Bottleneck: Silent, Plausible-Looking Failure

AI agents produce results faster than humans can verify them.

AUTOBA

Pipelines omitted critical steps; wrong tool selected for the data type.

Zhou et al., Adv. Sci. 2024.

SINGLE-CELL AGENTS

Incomplete experimental designs; inconsistent recommendations for identical queries.

Zhou et al., Brief. Bioinform. 2025.

BOIKO ET AL. · NATURE 2023

Syntactically correct, scientifically invalid protocols passing basic execution checks.

Boiko et al., Nature 624, 570–578 (2023).

CLAWBIO EARLY AUDIT

A skill silently returned "all normal" for 51 drugs on an empty input file.

Independent audit (S. Kornilov, clawbio_bench); ClawBio v0.5.0, Zenodo 2026.

Silent degradation to plausible-looking but incorrect results.

Each of these is from an independently developed system. The convergence is the point: this is structural, not a one-off bug. The 51-drug ClawBio incident is mine, surfaced by a community auditor. We discovered it because the platform is open. That's an argument for transparency. TIMING: 2 min.

A Tiered Validation Framework

Figure: three validation tiers ordered by increasing validation rigour and consequence of error: Research-grade (hypothesis exploration; unit tests, all outputs reviewed) → Benchmarked (publishable analyses; public references such as GIAB, published metrics) → Clinical-grade (patient care and diagnostic reporting; external multi-site validation, signed bundles, CLIA/CAP).

Research-grade

  • Hypothesis exploration
  • Unit tests, adversarial inputs
  • All outputs reviewed
  • False positives tolerable

Benchmarked

  • Publishable analyses
  • Public references (GIAB, scRNA-seq)
  • Independent benchmarking
  • Published metrics & failure modes

Clinical-grade

  • Patient care, diagnostic reporting
  • External multi-site validation
  • FDA/EMA alignment
  • Signed reproducibility bundles
  • CLIA/CAP compliance

Validation proportional to consequence. Agent platforms must enforce tier boundaries and expose deterministic replay: a logged sequence of decisions must be exactly reproducible.
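One way deterministic replay could be checked in practice, as a hedged sketch with assumed interfaces; the decision-log fields and function names below are not ClawBio's API:

```python
# Sketch of deterministic replay verification. Assumes each agent decision is
# logged as a JSON-serialisable dict, e.g. {"step": 1, "tool": "pharmgx-reporter",
# "params": {...}, "output_hash": "..."}. Illustrative only.
import hashlib
import json

def decision_digest(decisions):
    """Canonical SHA-256 over an ordered decision log."""
    canonical = json.dumps(decisions, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()

def replay_is_exact(logged_decisions, replayed_decisions):
    """Clinical-grade gate: the replayed run must reproduce the logged decision
    sequence exactly, not merely an equivalent end result."""
    return decision_digest(logged_decisions) == decision_digest(replayed_decisions)
```

Under this sketch, the digest of the original run would live inside the signed reproducibility bundle, so a verifier can re-execute the logged sequence and compare hashes without trusting the original environment.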

Corpas, Fatumo & Guio. Cell Genomics (in revision), 2026 · Table 2.

This is the structural response to silent failure. The tiers are not prescriptive labels; they are calibrated to risk. A skill at research-grade can be invoked freely for hypothesis generation. The same skill cannot be invoked for clinical use without external multi-site validation, signed bundles, and deterministic replay. The next slides ask whether any currently shippable skill can satisfy clinical-grade. TIMING: 2 min.

ClawBio

An agent-native skill library for bioinformatics.

Open-source · local-first · reproducible

58 skills · 767 GitHub stars · 20+ contributors · MIT license
pharmgx-reporter · variant-annotation · claw-ancestry-pca · gwas-prs · scrna-orchestrator · mendelian-randomisation · wes-clinical-report-en · methylation-clock · + 50 more

github.com/ClawBio/ClawBio · the artifact under test in the empirical benchmark that follows.

Hero slide. Establish ClawBio as a real, public, open-source artifact before the benchmark. The 58 skills span pharmacogenomics, variant annotation, ancestry/PCA, GWAS/PRS, single-cell, multi-omics, MR, clinical reporting (EN + ES). The benchmark in the next slides tests ONE skill: pharmgx-reporter. The framework slide above (tiered validation) is what ClawBio implements; this slide is the bridge between abstract framework and concrete empirical test. TIMING: 1 min.

Can a Plain-Text SKILL.md Reach Clinical-Grade?

An empirical test: does specification close the gap?

Frame the question crisply: pharmacogenomics is the right test bed because the guideline is fixed (CPIC) and the stakes are clinically measurable. If specification does not help here, it does not help anywhere. Next slide: the experimental design. TIMING: 1 min.

1,728 Evaluations: A Factorial Benchmark

FACTORIAL DESIGN
  • 8 frontier LLMs (4 vendors)
  • × 12 PGx test cases (8 genes)
  • × 3 populations (EUR / AMR / AFR)
  • × 2 conditions (with / without specification)
  • × 3 independent runs (stochastic test)
  = 1,728 total evaluations
The ClawBio skill specification is the experimental treatment: the factor varied in the second condition.

Skill under test: pharmgx-reporter · ClawBio v0.5.0 · genotype → phenotype → drug recommendation

Models: Claude Opus 4, Sonnet 4 · GPT-5.2, GPT-4.1, o3, o4-mini · Gemini 2.5 Flash · DeepSeek V3.

Population contexts: European (Corpasome family WGS), admixed Latin American (Peruvian Genome Project, 109 WGS, 7 sub-populations), East African (Uganda Genome Resource, 6,407 WGS), curated by Heinner Guio (Peruvian) and Segun Fatumo (Ugandan) respectively. The factorial design lets us isolate each factor: model, gene/test-case, population, treatment, run-to-run stochasticity. TIMING: 1 min.
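A quick sanity check on the factorial arithmetic, using only the counts stated on the slide (not the benchmark's own code):

```python
# Factorial grid from the slide above: the product reproduces the 1,728 total.
models      = 8   # frontier LLMs across 4 vendors
test_cases  = 12  # PGx test cases covering 8 genes
populations = 3   # EUR / AMR / AFR
conditions  = 2   # with / without the SKILL.md specification
runs        = 3   # independent runs per cell

total_evaluations   = models * test_cases * populations * conditions * runs
cells_per_condition = models * test_cases * populations

assert total_evaluations == 1_728  # total evaluations reported
assert cells_per_condition == 288  # (model, test case, population) cells per arm
```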

Frontier LLMs Alone Are Not Clinical-Grade

92.4% mean phenotype accuracy (A1)
61% worst model (Gemini 2.5 Flash)
7 DPYD lethal-case errors (of 67 parsed)
87.5% perfect 3-run consistency

Average accuracy hides the tail. The tail is what hurts patients.

The consistency rate (87.5 percent) is the fraction of model × test-case × population combinations where all parsed runs agreed and were correct, requiring at least 2 of 3 runs to produce parseable output. Next slide shows the per-model bars. TIMING: 1.5 min.
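As a hedged sketch, the consistency metric described above could be computed as follows; the field names are assumptions, not the benchmark's actual schema:

```python
# Illustrative computation of perfect 3-run consistency over evaluable cells.
def perfect_consistency_rate(cells):
    """cells: iterable of dicts, one per (model, test case, population) combination,
    each with 'runs': a list of {'parseable': bool, 'correct': bool} for the 3 runs."""
    evaluable = perfect = 0
    for cell in cells:
        parsed = [r for r in cell["runs"] if r["parseable"]]
        if len(parsed) < 2:                    # fewer than 2 of 3 parseable runs: not evaluable
            continue
        evaluable += 1
        if all(r["correct"] for r in parsed):  # all parsed runs agree and are correct
            perfect += 1
    return perfect / evaluable                 # 239/273 ≈ 87.5% without the specification
```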

Specification Closes the Gap on Every Tier-A Axis

Figure 2: Tier A clinical correctness, with vs without ClawBio specification

Three Tier A axes across 8 models: phenotype accuracy, drug recommendation, clinical safety. Grey = without specification — wide spread, long error bars (Gemini 2.5 Flash, o3 on drug recommendation). Green = with ClawBio specification — previewed here; the consistency mechanism is the next slide.

The figure makes the variance visible: it is not just that some models are worse on average; the spread within each model on each axis is large. The error bars are 95% CIs across the 12 cases × 3 populations = 36 evaluable cells per model per condition, each with 3 runs. TIMING: 1 min.

Specification Eliminates Stochastic Variation

WITHOUT SPECIFICATION: 87.5% perfect 3-run consistency (239 / 273 evaluable cells).
WITH SKILL.md SPECIFICATION: 100% across all 8 models (288 / 288 evaluable cells).
SAME MODEL · SAME INPUT · SAME OUTPUT
Evaluable = (model, test case, population) cells with ≥2 of 3 parseable runs.

Same model, same input, same output, across every Tier-A axis and every population.

The headline number: stochastic drift drops from 12.5% (34 of 273 cells) to 0% (288 of 288 cells). Three failure modes eliminated in one intervention: stochastic variation, format non-compliance, miscalls. TIMING: 1.5 min.

Per-Model × Per-Test-Case Consistency Heatmap

How to read: rows = 8 models · columns = 12 PGx test cases · cell colour = % of 3 runs that returned the correct phenotype (green = 100%, red = 0%).

Figure: Consistency heatmap, no_spec vs with_spec

Left without spec: scattered red cells — same model, same input, different runs disagree. Right with ClawBio specification: the red disappears entirely. Every cell across 8 models × 12 test cases × 3 populations returns the correct phenotype on every run.

Red cells on the left are stochastic (same model, same input, different runs disagree). With spec they vanish entirely across all 8 models. TIMING: 1.5 min.

The Population Gap: 6 of 7 Lethal Errors Were Non-European

DPYD homozygote · standard fluorouracil · without specification

DPYD homozygote lethal-case errors by population (no_spec, all 8 models, 3 runs each):
  • European (Corpasome family WGS): 1 lethal-case error / 23 parsed runs
  • Latin American (Peruvian Genome Project, 109 WGS): 3 lethal-case errors / 20 parsed runs
  • East African (Uganda Genome Resource, 6,407 WGS): 3 lethal-case errors / 24 parsed runs
6 of 7 lethal errors occurred in non-European populations.

Models default to European-dominant training data. With specification, every error vanishes.

Errors counted as A1=0 on dpyd_hom (parsed runs only). Lethality per CPIC DPYD Guideline (Henricks et al., Clin Pharmacol Ther 2020; doi:10.1002/cpt.1830).
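A hedged sketch of the tally rule in that footnote; the record fields are assumptions, not the benchmark's schema:

```python
# Illustrative tally of DPYD lethal-case errors by population, following the
# footnote above: A1 = 0 on the dpyd_hom case, parsed runs only, no_spec arm.
from collections import Counter

def lethal_errors_by_population(runs):
    errors, denominators = Counter(), Counter()
    for r in runs:
        if r["case"] == "dpyd_hom" and r["condition"] == "no_spec" and r["parseable"]:
            denominators[r["population"]] += 1
            if r["A1"] == 0:                  # incorrect phenotype call on the lethal case
                errors[r["population"]] += 1
    return errors, denominators               # slide reports EUR 1/23, AMR 3/20, AFR 3/24
```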

Why this is structural: models default to European resources because that is what is statistically abundant in their training data. The ClawBio specification carries curated population-specific data from Fatumo (East African) and Guio (Latin American) that the model would otherwise confabulate. The next slide shows the per-population accuracy curves. TIMING: 1.5 min.

Five Principles for Responsible Agentic Genomics

  01 · Domain expertise is irreducible
  02 · Validation proportional to consequence
  03 · Transparency is non-negotiable
  04 · Skills testable by design
  05 · Equity must be engineered

THE CENTRAL THESIS

Agentic genomics shifts the bottleneck from pipeline construction to validation. A plain-text skill specification can satisfy two of three clinical-grade requirements; external multi-site validation is the open work.

Corpas, Fatumo, Guio. Cell Genomics (in revision), 2026 · Corpas, Iacoangeli, Fatumo, Guio. Briefings in Bioinformatics, in submission 2026.

Close on the central thesis, not on a pitch. Read the principles, then the closing line. The question is no longer whether agentic genomics will be adopted; it is whether the field will establish the standards required to make it trustworthy before it becomes ubiquitous. TIMING: 2 min, leaving 5 to 7 min Q&A within the 30-min recorded slot.