From orchestration to validation: agent-native skill libraries as the path to clinical-grade genomic AI
Manuel Corpas
Senior Lecturer in Genomics, AI, and Data Science
University of Westminster
UK DRI seminar · Imperial College London · 29 April 2026
Two papers underpin this talk. First, a Perspective in revision at Cell Genomics with Segun
Fatumo and Heinner Guio defining what agentic genomics is, what it is not,
and the validation framework we propose. Second, an empirical benchmark
with Alfredo Iacoangeli, Fatumo, and Guio, in submission to Briefings in
Bioinformatics, that tests whether the framework actually delivers on
clinical-grade pharmacogenomics. TIMING: 1 min.
The shift
Two Waves of LLMs in Biology
First wave: information retrieval
Summarising papers, answering pathway questions, extracting structured data from text.
Useful but incremental.
Second wave: autonomous execution
Modern LLMs can write, debug, and execute code.
Connected to file systems, databases, and command-line tools, they plan multi-step operations and adapt to intermediate results.
The researcher's role shifts from producing analyses to evaluating them.
Corpas, Fatumo, Guio. Agentic Genomics: From Pipeline Automation to Autonomous Validation. Cell Genomics (in revision), 2026.
The point I want to plant: this is not a chatbot story. The interesting
thing is autonomous tool use with consequence. Once the model is acting,
the rate-limiting step shifts. TIMING: 1.5 min.
Definition
Defining Agentic Genomics
The four conditions on the slide are jointly necessary, and the definition is falsifiable via the perturbation test: an agent that ignores perturbed intermediate outputs is not agentic.
Full definition (delivered verbally): "The use of autonomous AI agents,
powered by large language models and operating within domain-constrained
skill libraries, to discover, plan, execute, and iteratively refine
multi-step genomic analyses, where the agent exercises runtime
decision-making over tool selection, parameterisation, error handling,
and output evaluation."
These four conditions are deliberately strict. They exclude workflow
automation (Nextflow, Snakemake, Galaxy: no runtime decisions). They
exclude AutoML (search within a fixed space). They exclude LLM-assisted
scripting (you execute, not the agent). And they exclude general-purpose
biomedical copilots (information retrieval, no multi-step execution
against real data). A runnable sketch of the perturbation test follows
these notes. TIMING: 2 min.
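The perturbation test can be made concrete. A minimal, self-contained sketch; the two toy "agents" are illustrative stand-ins, not ClawBio code, and the QC fields are invented for the example.

```python
# Toy sketch of the perturbation test. Both "agents" map an intermediate
# QC result to a next action; the test injects a fault into that
# intermediate output and asks whether the next action changes.

def scripted_pipeline(qc_result: dict) -> str:
    # Fixed plan: the next step never consults intermediate state.
    return "run_variant_annotation"

def agentic_step(qc_result: dict) -> str:
    # Runtime decision: inspect the intermediate output before proceeding.
    if qc_result["n_variants"] == 0:
        return "halt_and_flag_empty_input"
    if qc_result["contamination"] > 0.05:
        return "rerun_qc_with_strict_filters"
    return "run_variant_annotation"

def passes_perturbation_test(step) -> bool:
    """An agent that ignores perturbed intermediate outputs is not agentic."""
    normal    = {"n_variants": 4_512_332, "contamination": 0.01}
    perturbed = {"n_variants": 0, "contamination": 0.01}  # injected fault
    return step(normal) != step(perturbed)

assert passes_perturbation_test(agentic_step)           # diverges: agentic
assert not passes_perturbation_test(scripted_pipeline)  # blind: not agentic
```

A scripted pipeline fails because its plan never consults intermediate state; that is the falsifiability criterion in executable form.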
Figure 1 · Cell Genomics (in revision)
The Paradigm Shift: Code Production to Validation
(A) Traditional workflow. Researcher writes code, configures tools, runs pipelines, interprets results. The bottleneck is code production.
(B) Agentic workflow. Researcher describes intent in natural language; an AI agent discovers and executes skills from a modular library; researcher validates results. The bottleneck shifts to validation and judgement.
This is the central diagram of the Cell Genomics Perspective (in revision). The skills shown
in panel B are real ClawBio skills: pharmacogenomics, variant annotation,
ancestry estimation, drug safety, genome QC, PRS calculation, nutrigenomics,
structural variants. One question frames the rest of the talk:
what does it take to make panel B trustworthy? TIMING: 1.5 min.
The new bottleneck
The Validation Bottleneck: Silent, Plausible-Looking Failure
AI agents produce results faster than humans can verify them.
AUTOBA
Pipelines omitted critical steps; wrong tool selected for the data type.
Zhou et al., Adv. Sci. 2024.
SINGLE-CELL AGENTS
Incomplete experimental designs; inconsistent recommendations for identical queries.
CLAWBIO
A skill silently returned "all normal" for 51 drugs on an empty input file.
Independent audit (S. Kornilov, clawbio_bench); ClawBio v0.5.0, Zenodo 2026.
The common pattern: silent degradation to plausible-looking but incorrect results.
Each of these is from an independently developed system. The convergence is
the point: this is structural, not a one-off bug. The 51-drug ClawBio
incident is mine, surfaced by a community auditor. We discovered it because
the platform is open. That's an argument for transparency. TIMING: 2 min.
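The structural fix the 51-drug incident points to is a fail-closed input guard. A hedged sketch; the function name, file format, and gene list are hypothetical, not the pharmgx-reporter implementation.

```python
# Fail-closed input validation: refuse to report rather than silently
# return "all normal". Names and formats are illustrative assumptions.

import sys
from pathlib import Path

def load_genotypes(path: str) -> dict:
    p = Path(path)
    if not p.exists() or p.stat().st_size == 0:
        # The exact failure mode to prevent: an empty file must never
        # come back as "all 51 drugs normal". Fail loudly instead.
        sys.exit(f"ERROR: genotype input {path!r} missing or empty; "
                 "no report generated.")
    calls = {}
    for line in p.read_text().splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            sys.exit(f"ERROR: malformed genotype line: {line!r}")
        calls[fields[0]] = fields[1]
    required = {"CYP2D6", "CYP2C19", "DPYD", "TPMT"}  # illustrative subset
    missing = required - calls.keys()
    if missing:
        sys.exit(f"ERROR: no calls for {sorted(missing)}; refusing to "
                 "emit default 'normal' phenotypes.")
    return calls
```

The design choice is that absence of evidence surfaces as an error, never as a clean bill of health.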
Framework
A Tiered Validation Framework
Research-grade · Hypothesis exploration
Unit tests, adversarial inputs · all outputs reviewed · false positives tolerable
Benchmarked · Publishable analyses
Public references (GIAB, scRNA-seq) · independent benchmarking · published metrics & failure modes
Clinical-grade · Patient care, diagnostic reporting
External multi-site validation · FDA/EMA alignment · signed reproducibility bundles · CLIA/CAP compliance
Validation proportional to consequence. Agent platforms must enforce tier boundaries and expose deterministic replay: a logged sequence of decisions must be exactly reproducible.
This is the structural response to silent failure. The tiers are not
prescriptive labels; they are calibrated to risk. A skill at research-grade
can be invoked freely for hypothesis generation. The same skill cannot be
invoked for clinical use without external multi-site validation, signed
bundles, and deterministic replay. The next slides ask whether any
currently shippable skill can satisfy clinical-grade. TIMING: 2 min.
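Deterministic replay, named on the framework slide, reduces to a checkable property. A minimal sketch; the log schema and hashing scheme are assumptions for illustration, not the ClawBio implementation.

```python
# Deterministic replay as a checkable property: every runtime decision is
# logged with a digest of its output, and a replay must match exactly.

import hashlib
import json

def log_decision(log: list, step: int, tool: str, params: dict, output: bytes) -> None:
    """Record one runtime decision together with a digest of its output."""
    log.append({
        "step": step,
        "tool": tool,
        "params": params,
        "output_sha256": hashlib.sha256(output).hexdigest(),
    })

def replay_is_exact(original: list, replayed: list) -> bool:
    """Exact reproducibility: every decision and every output digest match."""
    canon = lambda entries: json.dumps(entries, sort_keys=True)
    return canon(original) == canon(replayed)
```

A clinical-grade platform would treat any replay divergence as a hard failure, not a warning.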
The skill library under test
ClawBio
An agent-native skill library for bioinformatics.
Open-source · local-first · reproducible
58 skills · 767 GitHub stars · 20+ contributors · MIT license
pharmgx-reporter · variant-annotation · claw-ancestry-pca · gwas-prs · scrna-orchestrator · mendelian-randomisation · wes-clinical-report-en · methylation-clock · + 50 more
github.com/ClawBio/ClawBio · the artifact under test in the empirical benchmark that follows.
Hero slide. Establish ClawBio as a real, public, open-source artifact
before the benchmark. The 58 skills span pharmacogenomics, variant
annotation, ancestry/PCA, GWAS/PRS, single-cell, multi-omics, MR,
clinical reporting (EN + ES). The benchmark in the next slides tests
ONE skill: pharmgx-reporter. The framework slide above (tiered
validation) is what ClawBio implements; this slide is the bridge
between abstract framework and concrete empirical test. TIMING: 1 min.
Empirical question
Can a Plain-Text SKILL.md Reach Clinical-Grade?
Domain: pharmacogenomics. Genotype to phenotype to drug recommendation, ground truth from CPIC guidelines.
If specification cannot improve reliability here, where the guideline is fixed and the consequences are clinically measurable, it is unlikely to help in less structured domains.
An empirical test: does specification close the gap?
Frame the question crisply: pharmacogenomics is the right test bed
because the guideline is fixed (CPIC) and the stakes are clinically
measurable. If specification does not help here, it does not help
anywhere. Next slide: the experimental design. TIMING: 1 min.
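Operationally, "ground truth from CPIC" means the diplotype-to-phenotype-to-recommendation chain is a deterministic lookup, which is exactly what a plain-text specification can pin down. A tiny illustrative DPYD excerpt; the wording paraphrases the guideline and the table is not exhaustive.

```python
# The deterministic core a specification constrains the agent to follow.
# Illustrative DPYD excerpt only; consult the CPIC guideline for the full
# allele-function tables and dosing language.

DPYD_TABLE = {
    # diplotype    (phenotype,                  recommendation)
    "*1/*1":   ("normal metaboliser",       "standard fluoropyrimidine dosing"),
    "*1/*2A":  ("intermediate metaboliser", "reduce starting dose, then titrate"),
    "*2A/*2A": ("poor metaboliser",         "avoid fluoropyrimidines"),
}

def pgx_call(diplotype: str) -> tuple:
    # No default branch: an unknown diplotype must surface as an error,
    # never fall through to "normal".
    if diplotype not in DPYD_TABLE:
        raise ValueError(f"no CPIC mapping for DPYD {diplotype}")
    return DPYD_TABLE[diplotype]
```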
Experimental design
1,728 Evaluations: A Factorial Benchmark
Skill under test: pharmgx-reporter · ClawBio v0.5.0 · genotype → phenotype → drug recommendation
Models: Claude Opus 4, Sonnet 4 · GPT-5.2, GPT-4.1, o3, o4-mini · Gemini 2.5 Flash · DeepSeek V3.
Population contexts: European (Corpasome family WGS), admixed Latin
American (Peruvian Genome Project, 109 WGS, 7 sub-populations), East
African (Uganda Genome Resource, 6,407 WGS). Curated by Heinner Guio
and Segun Fatumo respectively. The factorial design lets us isolate
each factor: model, gene/test-case, population, treatment, run-to-run
stochasticity. TIMING: 1 min.
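The cell count is worth a sanity check: the five factors multiply out to the 1,728 in the slide title. Per-factor counts are read off the slides (8 models, 12 test cases, 3 populations, with/without specification, 3 repeat runs).

```python
# Sanity arithmetic for the factorial design.
from itertools import product

MODELS, CASES, POPULATIONS, TREATMENTS, RUNS = 8, 12, 3, 2, 3

grid = list(product(range(MODELS), range(CASES), range(POPULATIONS),
                    range(TREATMENTS), range(RUNS)))
assert len(grid) == 8 * 12 * 3 * 2 * 3 == 1_728
```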
Result 1 · without specification
Frontier LLMs Alone Are Not Clinical-Grade
92.4% · Mean phenotype accuracy (A1)
61% · Worst model (Gemini 2.5 Flash)
7 · DPYD lethal-case errors (of 67 parsed)
87.5% · Perfect 3-run consistency
Same model, same input, different runs: stochastic drift across reruns.
"92% accuracy" sounds fine. For the DPYD homozygote, 92% means roughly 1 in 12 patients receives a potentially lethal recommendation.
Failure modes: confabulated population data, format non-compliance, miscalled metaboliser status.
Average accuracy hides the tail. The tail is what hurts patients.
The consistency rate (87.5%) is computed as the fraction of model ×
test-case × population combinations where all parsed runs agreed and were
correct, with at least 2 of 3 runs producing parseable output; the sketch
below writes it down exactly. Next slide shows the per-model bars.
TIMING: 1.5 min.
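The metric, written down exactly as described; the data structures are illustrative.

```python
# Consistency rate: a model × test-case × population cell counts as
# consistent only if all parsed runs agree and match ground truth; cells
# with fewer than 2 of 3 parseable runs are excluded from the denominator.

def consistency_rate(cells: dict, truth: dict) -> float:
    """cells: (model, case, population) -> list of 3 run outputs, with
    None marking an unparseable run. truth: same key -> correct phenotype."""
    eligible = consistent = 0
    for key, runs in cells.items():
        parsed = [r for r in runs if r is not None]
        if len(parsed) < 2:              # require >= 2 of 3 parseable runs
            continue
        eligible += 1
        if len(set(parsed)) == 1 and parsed[0] == truth[key]:
            consistent += 1
    return consistent / eligible
```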
Result 1 · per-model view
Specification Closes the Gap on Every Tier-A Axis
Three Tier-A axes across 8 models: phenotype accuracy, drug recommendation, clinical safety. Grey = without specification: wide spread, long error bars (Gemini 2.5 Flash and o3 on drug recommendation). Green = with ClawBio specification, previewed here; the consistency mechanism comes on the next slide.
The figure makes the variance visible: it is not just that some models
are worse on average; the spread within each model on each axis is
large. The error bars are 95% CIs across 3 runs × 3 populations × 12
cases (108 trials per model per condition). TIMING: 1 min.
Result 2 · with ClawBio specification
Specification Eliminates Stochastic Variation
Same model, same input, same output, across every Tier-A axis and every population.
The headline number: stochastic drift drops from 12.5% (34 of 273 cells)
to 0% (all 288 cells consistent). Three failure modes eliminated in one
intervention: stochastic variation, format non-compliance, miscalls.
TIMING: 1.5 min.
TIMING: 1.5 min.
Result 2 · per-cell view
Per-Model × Per-Test-Case Consistency Heatmap
How to read: rows = 8 models · columns = 12 PGx test cases · cell colour = % of 3 runs that returned the correct phenotype (green = 100%, red = 0%).
Left, without specification: scattered red cells, where the same model on the same input disagrees across runs. Right, with the ClawBio specification: the red disappears entirely. Every cell across 8 models × 12 test cases × 3 populations returns the correct phenotype on every run.
Red cells on the left are stochastic (same model, same input, different
runs disagree). With spec they vanish entirely across all 8 models.
TIMING: 1.5 min.
Result 3 · equity
The Population Gap: 6 of 7 Lethal Errors Were Non-European
DPYD homozygote · standard fluorouracil · without specification
Without curated inputs, models fall back on European-dominant training data. With specification, every error vanishes.
Errors counted as A1=0 on dpyd_hom (parsed runs only). Lethality per CPIC DPYD Guideline (Henricks et al., Clin Pharmacol Ther 2020; doi:10.1002/cpt.1830).
Why this is structural: models default to European resources because
that is what is statistically abundant in their training data. The
ClawBio specification carries curated population-specific data from
Fatumo (East African) and Guio (Latin American) that the model would
otherwise confabulate. The next slide shows the per-population
accuracy curves. TIMING: 1.5 min.
Closing thesis
Five Principles for Responsible Agentic Genomics
01 · Domain expertise is irreducible
02 · Validation proportional to consequence
03 · Transparency is non-negotiable
04 · Skills testable by design
05 · Equity must be engineered
THE CENTRAL THESIS
Agentic genomics shifts the bottleneck from pipeline construction to validation. A plain-text skill specification can satisfy two of three clinical-grade requirements; external multi-site validation is the open work.
Corpas, Fatumo, Guio. Cell Genomics (in revision), 2026 · Corpas, Iacoangeli, Fatumo, Guio. Briefings in Bioinformatics, in submission 2026.
Close on the central thesis, not on a pitch. Read the principles, then
the closing line. The question is no longer whether agentic genomics will
be adopted; it is whether the field will establish the standards required
to make it trustworthy before it becomes ubiquitous. TIMING: 2 min,
leaving 5 to 7 min Q&A within the 30-min recorded slot.