From orchestration to validation: agent-native skill libraries as the path to clinical-grade genomic AI
Manuel Corpas
Senior Lecturer in Genomics, AI, and Data Science
University of Westminster
UK DRI seminar · Imperial College London · 29 April 2026
Two papers underpin this talk. First, a Perspective in revision at Cell Genomics with Segun
Fatumo and Heinner Guio defining what agentic genomics is, what it is not,
and the validation framework we propose. Second, an empirical benchmark
with Alfredo Iacoangeli, Fatumo, and Guio, in submission to Briefings in
Bioinformatics, that tests whether the framework actually delivers on
clinical-grade pharmacogenomics. TIMING: 1 min.
The shift
Two Waves of LLMs in Biology
First wave: information retrieval
Summarising papers, answering pathway questions, extracting structured data from text.
Useful but incremental.
Second wave: autonomous execution
Modern LLMs can write, debug, and execute code.
Connected to file systems, databases, and command-line tools, they plan multi-step operations and adapt to intermediate results.
The researcher's role shifts from producing analyses to evaluating them.
Corpas, Fatumo, Guio. Agentic Genomics: From Pipeline Automation to Autonomous Validation. Cell Genomics (in revision), 2026.
The point I want to plant: this is not a chatbot story. The interesting
thing is autonomous tool use with consequence. Once the model is acting,
the rate-limiting step shifts. TIMING: 1.5 min.
Definition
Defining Agentic Genomics
The four conditions on the slide are jointly necessary, and the definition is falsifiable via the perturbation test: an agent that ignores perturbed intermediate outputs is not agentic.
Full definition (delivered verbally): "The use of autonomous AI agents,
powered by large language models and operating within domain-constrained
skill libraries, to discover, plan, execute, and iteratively refine
multi-step genomic analyses, where the agent exercises runtime
decision-making over tool selection, parameterisation, error handling,
and output evaluation."
These four conditions are deliberately strict. They exclude workflow
automation (Nextflow, Snakemake, Galaxy: no runtime decisions). They
exclude AutoML (search within a fixed space). They exclude LLM-assisted
scripting (you execute, not the agent). And they exclude general-purpose
biomedical copilots (information retrieval, no multi-step execution
against real data). A runnable sketch of the perturbation test follows
these notes. TIMING: 2 min.
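The perturbation test can be made concrete. A minimal, self-contained sketch; the two toy "agents" are illustrative stand-ins, not ClawBio code, and the QC fields are invented for the example.

```python
# Toy sketch of the perturbation test. Both "agents" map an intermediate
# QC result to a next action; the test injects a fault into that
# intermediate output and asks whether the next action changes.

def scripted_pipeline(qc_result: dict) -> str:
    # Fixed plan: the next step never consults intermediate state.
    return "run_variant_annotation"

def agentic_step(qc_result: dict) -> str:
    # Runtime decision: inspect the intermediate output before proceeding.
    if qc_result["n_variants"] == 0:
        return "halt_and_flag_empty_input"
    if qc_result["contamination"] > 0.05:
        return "rerun_qc_with_strict_filters"
    return "run_variant_annotation"

def passes_perturbation_test(step) -> bool:
    """An agent that ignores perturbed intermediate outputs is not agentic."""
    normal    = {"n_variants": 4_512_332, "contamination": 0.01}
    perturbed = {"n_variants": 0, "contamination": 0.01}  # injected fault
    return step(normal) != step(perturbed)

assert passes_perturbation_test(agentic_step)           # diverges: agentic
assert not passes_perturbation_test(scripted_pipeline)  # blind: not agentic
```

A scripted pipeline fails because its plan never consults intermediate state; that is the falsifiability criterion in executable form.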
Figure 1 · Cell Genomics (in revision)
The Paradigm Shift: Code Production to Validation
(A) Traditional workflow. Researcher writes code, configures tools, runs pipelines, interprets results. The bottleneck is code production.
(B) Agentic workflow. Researcher describes intent in natural language; an AI agent discovers and executes skills from a modular library; researcher validates results. The bottleneck shifts to validation and judgement.
This is the central diagram of the Cell Genomics Perspective (in revision). The skills shown
in panel B are real ClawBio skills: pharmacogenomics, variant annotation,
ancestry estimation, drug safety, genome QC, PRS calculation, nutrigenomics,
structural variants. One question frames the rest of the talk:
what does it take to make panel B trustworthy? TIMING: 1.5 min.
The new bottleneck
The Validation Bottleneck: Silent, Plausible-Looking Failure
AI agents produce results faster than humans can verify them.
AUTOBA
Pipelines omitted critical steps; wrong tool selected for the data type.
Zhou et al., Adv. Sci. 2024.
SINGLE-CELL AGENTS
Incomplete experimental designs; inconsistent recommendations for identical queries.
CLAWBIO
A skill silently returned "all normal" for 51 drugs on an empty input file.
Independent audit (S. Kornilov, clawbio_bench); ClawBio v0.5.0, Zenodo 2026.
The common pattern: silent degradation to plausible-looking but incorrect results.
Each of these is from an independently developed system. The convergence is
the point: this is structural, not a one-off bug. The 51-drug ClawBio
incident is mine, surfaced by a community auditor. We discovered it because
the platform is open. That's an argument for transparency. TIMING: 2 min.
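The structural fix the 51-drug incident points to is a fail-closed input guard. A hedged sketch; the function name, file format, and gene list are hypothetical, not the pharmgx-reporter implementation.

```python
# Fail-closed input validation: refuse to report rather than silently
# return "all normal". Names and formats are illustrative assumptions.

import sys
from pathlib import Path

def load_genotypes(path: str) -> dict:
    p = Path(path)
    if not p.exists() or p.stat().st_size == 0:
        # The exact failure mode to prevent: an empty file must never
        # come back as "all 51 drugs normal". Fail loudly instead.
        sys.exit(f"ERROR: genotype input {path!r} missing or empty; "
                 "no report generated.")
    calls = {}
    for line in p.read_text().splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            sys.exit(f"ERROR: malformed genotype line: {line!r}")
        calls[fields[0]] = fields[1]
    required = {"CYP2D6", "CYP2C19", "DPYD", "TPMT"}  # illustrative subset
    missing = required - calls.keys()
    if missing:
        sys.exit(f"ERROR: no calls for {sorted(missing)}; refusing to "
                 "emit default 'normal' phenotypes.")
    return calls
```

The design choice is that absence of evidence surfaces as an error, never as a clean bill of health.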
Framework
A Tiered Validation Framework
Research-grade · Hypothesis exploration
Unit tests, adversarial inputs · all outputs reviewed · false positives tolerable
Benchmarked · Publishable analyses
Public references (GIAB, scRNA-seq) · independent benchmarking · published metrics & failure modes
Clinical-grade · Patient care, diagnostic reporting
External multi-site validation · FDA/EMA alignment · signed reproducibility bundles · CLIA/CAP compliance
Validation proportional to consequence. Agent platforms must enforce tier boundaries and expose deterministic replay: a logged sequence of decisions must be exactly reproducible.
This is the structural response to silent failure. The tiers are not
prescriptive labels; they are calibrated to risk. A skill at research-grade
can be invoked freely for hypothesis generation. The same skill cannot be
invoked for clinical use without external multi-site validation, signed
bundles, and deterministic replay. The next slides ask whether any
currently shippable skill can satisfy clinical-grade. TIMING: 2 min.
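Deterministic replay, named on the framework slide, reduces to a checkable property. A minimal sketch; the log schema and hashing scheme are assumptions for illustration, not the ClawBio implementation.

```python
# Deterministic replay as a checkable property: every runtime decision is
# logged with a digest of its output, and a replay must match exactly.

import hashlib
import json

def log_decision(log: list, step: int, tool: str, params: dict, output: bytes) -> None:
    """Record one runtime decision together with a digest of its output."""
    log.append({
        "step": step,
        "tool": tool,
        "params": params,
        "output_sha256": hashlib.sha256(output).hexdigest(),
    })

def replay_is_exact(original: list, replayed: list) -> bool:
    """Exact reproducibility: every decision and every output digest match."""
    canon = lambda entries: json.dumps(entries, sort_keys=True)
    return canon(original) == canon(replayed)
```

A clinical-grade platform would treat any replay divergence as a hard failure, not a warning.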
The skill library under test
ClawBio
An agent-native skill library for bioinformatics.
Open-source · local-first · reproducible
58 skills · 767 GitHub stars · 20+ contributors · MIT license
pharmgx-reporter · variant-annotation · claw-ancestry-pca · gwas-prs · scrna-orchestrator · mendelian-randomisation · wes-clinical-report-en · methylation-clock · + 50 more
github.com/ClawBio/ClawBio · the artifact under test in the empirical benchmark that follows.
Hero slide. Establish ClawBio as a real, public, open-source artifact
before the benchmark. The 58 skills span pharmacogenomics, variant
annotation, ancestry/PCA, GWAS/PRS, single-cell, multi-omics, MR,
clinical reporting (EN + ES). The benchmark in the next slides tests
ONE skill: pharmgx-reporter. The framework slide above (tiered
validation) is what ClawBio implements; this slide is the bridge
between abstract framework and concrete empirical test. TIMING: 1 min.
Empirical question
Can a Plain-Text SKILL.md Reach Clinical-Grade?
Domain: pharmacogenomics. Genotype to phenotype to drug recommendation, ground truth from CPIC guidelines.
If specification cannot improve reliability here, where the guideline is fixed and the consequences are clinically measurable, it is unlikely to help in less structured domains.
An empirical test: does specification close the gap?
Frame the question crisply: pharmacogenomics is the right test bed
because the guideline is fixed (CPIC) and the stakes are clinically
measurable. If specification does not help here, it does not help
anywhere. Next slide: the experimental design. TIMING: 1 min.
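Operationally, "ground truth from CPIC" means the diplotype-to-phenotype-to-recommendation chain is a deterministic lookup, which is exactly what a plain-text specification can pin down. A tiny illustrative DPYD excerpt; the wording paraphrases the guideline and the table is not exhaustive.

```python
# The deterministic core a specification constrains the agent to follow.
# Illustrative DPYD excerpt only; consult the CPIC guideline for the full
# allele-function tables and dosing language.

DPYD_TABLE = {
    # diplotype    (phenotype,                  recommendation)
    "*1/*1":   ("normal metaboliser",       "standard fluoropyrimidine dosing"),
    "*1/*2A":  ("intermediate metaboliser", "reduce starting dose, then titrate"),
    "*2A/*2A": ("poor metaboliser",         "avoid fluoropyrimidines"),
}

def pgx_call(diplotype: str) -> tuple:
    # No default branch: an unknown diplotype must surface as an error,
    # never fall through to "normal".
    if diplotype not in DPYD_TABLE:
        raise ValueError(f"no CPIC mapping for DPYD {diplotype}")
    return DPYD_TABLE[diplotype]
```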
Experimental design
1,728 Evaluations: A Factorial Benchmark
Skill under test: pharmgx-reporter · ClawBio v0.5.0 · genotype → phenotype → drug recommendation
Models: Claude Opus 4, Sonnet 4 · GPT-5.2, GPT-4.1, o3, o4-mini · Gemini 2.5 Flash · DeepSeek V3.
Population contexts: European (Corpasome family WGS), admixed Latin
American (Peruvian Genome Project, 109 WGS, 7 sub-populations), East
African (Uganda Genome Resource, 6,407 WGS). Curated by Heinner Guio
and Segun Fatumo respectively. The factorial design lets us isolate
each factor: model, gene/test-case, population, treatment, run-to-run
stochasticity. TIMING: 1 min.
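The cell count is worth a sanity check: the five factors multiply out to the 1,728 in the slide title. Per-factor counts are read off the slides (8 models, 12 test cases, 3 populations, with/without specification, 3 repeat runs).

```python
# Sanity arithmetic for the factorial design.
from itertools import product

MODELS, CASES, POPULATIONS, TREATMENTS, RUNS = 8, 12, 3, 2, 3

grid = list(product(range(MODELS), range(CASES), range(POPULATIONS),
                    range(TREATMENTS), range(RUNS)))
assert len(grid) == 8 * 12 * 3 * 2 * 3 == 1_728
```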
Result 1 · without specification
Frontier LLMs Alone Are Not Clinical-Grade
92.4% · Mean phenotype accuracy (A1)
61% · Worst model (Gemini 2.5 Flash)
7 · DPYD lethal-case errors (of 67 parsed)
87.5% · Perfect 3-run consistency
Same model, same input, different runs: stochastic drift across reruns.
"92% accuracy" sounds fine. For the DPYD homozygote, 92% means roughly 1 in 12 patients receives a potentially lethal recommendation.
Failure modes: confabulated population data, format non-compliance, miscalled metaboliser status.
Average accuracy hides the tail. The tail is what hurts patients.
The consistency rate (87.5%) is computed as the fraction of model ×
test-case × population combinations where all parsed runs agreed and were
correct, with at least 2 of 3 runs producing parseable output; the sketch
below writes it down exactly. Next slide shows the per-model bars.
TIMING: 1.5 min.
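The metric, written down exactly as described; the data structures are illustrative.

```python
# Consistency rate: a model × test-case × population cell counts as
# consistent only if all parsed runs agree and match ground truth; cells
# with fewer than 2 of 3 parseable runs are excluded from the denominator.

def consistency_rate(cells: dict, truth: dict) -> float:
    """cells: (model, case, population) -> list of 3 run outputs, with
    None marking an unparseable run. truth: same key -> correct phenotype."""
    eligible = consistent = 0
    for key, runs in cells.items():
        parsed = [r for r in runs if r is not None]
        if len(parsed) < 2:              # require >= 2 of 3 parseable runs
            continue
        eligible += 1
        if len(set(parsed)) == 1 and parsed[0] == truth[key]:
            consistent += 1
    return consistent / eligible
```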
Result 1 · per-model view
Specification Closes the Gap on Every Tier-A Axis
Three Tier-A axes across 8 models: phenotype accuracy, drug recommendation, clinical safety. Grey = without specification: wide spread, long error bars (Gemini 2.5 Flash and o3 on drug recommendation). Green = with ClawBio specification, previewed here; the consistency mechanism comes on the next slide.
The figure makes the variance visible: it is not just that some models
are worse on average; the spread within each model on each axis is
large. The error bars are 95% CIs across 3 runs × 3 populations × 12
cases (108 trials per model per condition). TIMING: 1 min.
Result 2 · with ClawBio specification
Specification Eliminates Stochastic Variation
Same model, same input, same output, across every Tier-A axis and every population.
The headline number: stochastic drift drops from 12.5% (34 of 273 cells)
to 0% (all 288 cells consistent). Three failure modes eliminated in one
intervention: stochastic variation, format non-compliance, miscalls.
TIMING: 1.5 min.
TIMING: 1.5 min.
Result 2 · per-cell view
Per-Model × Per-Test-Case Consistency Heatmap
How to read: rows = 8 models · columns = 12 PGx test cases · cell colour = % of 3 runs that returned the correct phenotype (green = 100%, red = 0%).
Left, without specification: scattered red cells, where the same model on the same input disagrees across runs. Right, with the ClawBio specification: the red disappears entirely. Every cell across 8 models × 12 test cases × 3 populations returns the correct phenotype on every run.
Red cells on the left are stochastic (same model, same input, different
runs disagree). With spec they vanish entirely across all 8 models.
TIMING: 1.5 min.
Result 3 · equity
The Population Gap: 6 of 7 Lethal Errors Were Non-European
DPYD homozygote · standard fluorouracil · without specification
Without curated inputs, models fall back on European-dominant training data. With specification, every error vanishes.
Errors counted as A1=0 on dpyd_hom (parsed runs only). Lethality per CPIC DPYD Guideline (Henricks et al., Clin Pharmacol Ther 2020; doi:10.1002/cpt.1830).
Why this is structural: models default to European resources because
that is what is statistically abundant in their training data. The
ClawBio specification carries curated population-specific data from
Fatumo (East African) and Guio (Latin American) that the model would
otherwise confabulate. The next slide shows the per-population
accuracy curves. TIMING: 1.5 min.
Closing thesis
Five Principles for Responsible Agentic Genomics
01 · Domain expertise is irreducible
02 · Validation proportional to consequence
03 · Transparency is non-negotiable
04 · Skills testable by design
05 · Equity must be engineered
THE CENTRAL THESIS
Agentic genomics shifts the bottleneck from pipeline construction to validation. A plain-text skill specification can satisfy two of three clinical-grade requirements; external multi-site validation is the open work.
Corpas, Fatumo, Guio. Cell Genomics (in revision), 2026 · Corpas, Iacoangeli, Fatumo, Guio. Briefings in Bioinformatics, in submission 2026.
Close on the central thesis, not on a pitch. Read the principles, then
the closing line. The question is no longer whether agentic genomics will
be adopted; it is whether the field will establish the standards required
to make it trustworthy before it becomes ubiquitous. TIMING: 2 min,
leaving 5 to 7 min Q&A within the 30-min recorded slot.