@manuelcorpas
Agentic Genomics:
From Pipeline Automation to
Autonomous Validation
Manuel Corpas
Senior Lecturer, Research Centre for Optimal Health, University of Westminster
MSc AI + Digital Health Course Director · Creator of ClawBio
SCGG Away Day · King's College London · 15 April 2026
Thank you. I'm Manuel Corpas, Senior Lecturer at Westminster and creator of
ClawBio, an open-source library of agentic AI skills for bioinformatics.
Today I want to make a narrower claim than "AI changes genomics". My claim
is that for some genomic analyses, orchestration has become cheap enough
that validation is now the binding constraint. I'm going to show you
what this looks like on real data from the genomes of my own family, and propose a
falsifiable validation framework I think this department could help build.
TIMING: 1 min
The problem
I Had a Folder. I Didn't Remember Which Was Who.
10 WGS genomes sitting untouched for 2 years.
- I knew my own genome and those of my father, my mother, and my sister were in there
- I had another reference sample of my own (30x Dante Labs)
- I asked the agent: "can you tell me which is which?"
- Within a constrained bioinformatics environment. No scripts. No flags.
The agent downloaded KING,
inspected each folder,
ran a kinship analysis,
and gave me the answer with confidence scores.
No genome assembly wrangling. No install debugging. No format conversions. No hand-holding.
This is where this talk started. I had a folder on my external drive with 10
whole genomes. I hadn't touched it in two years. I knew four of those samples
were my family: me, my dad, my mum, and my sister. But I didn't remember
which PT ID was which. The other six came from other pilot-study participants
at Cambridge Precision Medicine, a company I was involved in.
I asked the agent to resolve identity within a constrained bioinformatics
environment. It downloaded KING, inspected the contents of each folder,
figured out the genome assembly, ran a kinship analysis, and came back with the
correct family structure with confidence scores. I didn't have to worry about
installation, configuration, reference genome version, or any of the usual
bioinformatics overhead. That is agentic genomics. That is the shift I want
to talk about today.
TIMING: 1.5 min
Definition
What I Mean by "Agentic Genomics"
Autonomous agents making runtime decisions
inside domain-constrained skill libraries.
Pipeline
Fixed DAG, pre-specified.
Every step known before execution.
Agent
Runtime decision points under constraints.
Steps chosen during execution.
Example: the agent chose KING over PLINK for kinship because it inspected the VCF headers, found no PLINK-compatible .bed files, and selected the tool that could work directly on VCF input. A static pipeline cannot make that decision.
Corpas, Fatumo & Guio. Cell Genomics (in revision, 2026)
Precise definition. Agentic genomics is autonomous agents making runtime
decisions inside domain-constrained skill libraries. The key distinction
is operational: a pipeline is a fixed DAG where every step is pre-specified.
An agent has runtime decision points under constraints. Here is a concrete
example: when I asked the agent to resolve sample identity, it inspected
the VCF headers, found no PLINK-compatible .bed files, and chose KING
because KING can work directly on VCF input. A static pipeline cannot make
that decision. That is what "runtime" means in practice.
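That runtime decision can be sketched as a small format-inspection rule. The function below is purely illustrative (hypothetical names, not ClawBio's actual skill code), assuming only that tool choice follows from which genotype formats are actually present:

```python
def pick_kinship_tool(filenames: list[str]) -> str:
    """Choose a kinship tool from the file formats actually present.
    Illustrative runtime decision, not ClawBio's real implementation."""
    exts = {name.split(".", 1)[-1].lower() for name in filenames}
    if {"bed", "bim", "fam"} <= exts:
        return "plink"   # full PLINK binary fileset available
    if any(ext.startswith("vcf") for ext in exts):
        return "king"    # per the talk, KING was run directly on VCF input
    raise ValueError("no recognised genotype files found")
```

A fixed DAG encodes exactly one branch of this function in advance; the agent evaluates it at execution time against whatever data it finds.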
TIMING: 1 min
The Library
🦖 ClawBio
Open-source bioinformatics infrastructure for agentic analysis
🔒
Constrained
skill library
📜
Full execution
provenance log
Genetic data never leaves your machine. 48 skills, 16+ contributors, fully open source.
The key properties for this talk are the provenance log and validation hooks. Without those, none of the results I show would be auditable.
Scan to open the repo
github.com/ClawBio/ClawBio
docs.clawbio.ai
Corpas M. Bioinformatics (in revision, 2026)
ClawBio is the library I'll demo from. It is designed as infrastructure
for science, not a product. Five properties matter: a constrained skill
library so the agent cannot do arbitrary things; runtime planning so it
chooses tools at execution time; local-first execution so genetic data
never leaves your machine; a full provenance log so every step is
auditable; and validation hooks so outputs can be checked automatically.
48 skills, 16+ contributors, fully open source. The key properties for
this talk are the provenance log and validation hooks. Without those, none
of the results I show would be auditable.
TIMING: 0.5 min
Stat Gen Toolkit
Skills Relevant to This Room
How skills compose in a single conversation:
👪
Kinship check
KING-robust on chr22. Verify family structure before analysis.
🌍
Ancestry check
PCA against SGDP/1KG. Flag cross-ancestry samples.
🎯
PRS scoring
PGS Catalog scores against 1KG EUR reference distribution.
✅
Concordance check
Compare against prior published results or orthogonal methods.
48 skills on shelf today. The chain above is what this talk demonstrates.
Built by 16+ open-source contributors across 4 continents.
Rather than listing all 48 skills, let me show you the chain this talk
demonstrates. Kinship check first: verify sample identity before doing
anything else. Then ancestry check: flag cross-ancestry samples. Then PRS
scoring against a reference distribution. Then concordance: compare against
prior published results. That chain, from raw VCFs to validated percentiles,
is what you are about to see.
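The validation-first ordering of that chain can be sketched as a fail-fast composition. The skill interface below is hypothetical (ClawBio's real contract lives in its SKILL.md spec); it assumes only that each step returns a pass/fail flag plus a logged summary:

```python
from typing import Callable

# Hypothetical skill signature: samples -> (passed, summary)
Skill = Callable[[list[str]], tuple[bool, str]]

def run_chain(samples: list[str], skills: list[tuple[str, Skill]]):
    """Run skills in order, accumulating a provenance log; stop at the
    first failed check so no analysis runs on unverified samples."""
    provenance = []
    for name, skill in skills:
        passed, summary = skill(samples)
        provenance.append((name, passed, summary))
        if not passed:
            break
    return provenance
```

The design choice is the early `break`: PRS scoring never executes if the kinship or ancestry check fails.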
TIMING: 0.5 min
Case Study
Let Me Show You on Real Data
My family has been part of an open-access pilot WGS cohort for 15 years.
The raw data: 10 whole genomes on an external drive. Sentieon Haplotyper, GRCh37, ~170MB VCF each.
But which sample is which person? The VCFs are labelled PT00001A through PT00010A.
The point of using my family data here is transparency, not generalisability.
This is not yet a general result. It is a controlled demonstration that suggests where systematic evaluation should go next.
Ethics: UNIR PI:029/2020 (2021), open-access CC0 ·
prior publications:
Front Genet 2021,
BMC Med Genomics 2022
I've got a family who has been part of an open-access pilot WGS cohort for over a decade,
the Corpasome project, published under CC0. I have 10 whole genomes on an
external drive right now, but they are labelled PT00001A through PT00010A with
no metadata. The point of using my family data here is transparency, not
generalisability. This is not yet a general result. It is a controlled
demonstration that suggests where systematic evaluation should go next.
The point is not that this analysis is impossible without agents. It is
that the cost of doing it correctly, with full validation, drops enough
that we can afford to do it routinely. Before I run any polygenic scoring,
I want the agent to verify the family structure. Validation first, analysis
second.
TIMING: 0.5 min
Step 1: Validation
Family Verified by KING-Robust Kinship
[Kinship diagram] Wife: PT00004A (British), φ = −0.014.
Parents unrelated to each other (φ = −0.020). PT00003A flagged as likely different ancestry (φ consistently negative).
The agent computed KING-robust kinship coefficients for every pair of samples
in my cohort on chr22 alone. Chr22 is adequate for kinship because KING
needs only a few thousand common variants to estimate relatedness
coefficients; chr22 provides roughly 10,000, well above the minimum for
first-degree resolution. It correctly identified my father, mother, sister,
and confirmed my wife is unrelated. The parents are unrelated to each other,
as expected. And it flagged one sample, PT00003A, as having consistently
negative kinship with everyone, a hallmark of different ancestry, because
KING returns negative values across population boundaries. The result here
is correct relational reconstruction under a logged execution trace.
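The confidence behind those calls rests on fixed thresholds on φ. The classifier below uses the published KING defaults (degree bands at powers of 2^(-k/2), rounded as in KING's documentation); it is a sketch of the convention, not KING's code:

```python
def classify_kinship(phi: float) -> str:
    """Map a KING-robust kinship coefficient to a relationship degree
    using the published KING threshold convention. Negative phi is a
    hallmark of cross-ancestry pairs, as with the PT00003A flag."""
    if phi > 0.354:
        return "duplicate/MZ twin"
    if phi > 0.177:
        return "first-degree"   # parent-offspring or full sib, E[phi] = 0.25
    if phi > 0.0884:
        return "second-degree"
    if phi > 0.0442:
        return "third-degree"
    return "unrelated"          # includes negative phi (cross-ancestry)
```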
TIMING: 1.5 min
Family PRS — Khera 2018
5 Genome-Wide PGS · Family vs 1KG EUR
Khera 2018 genome-wide PGS (CAD: 6.6M variants) · 1KG EUR (n=498, CEU/GBR/IBS/FIN/TSI) · 7.5 min local compute
Five genome-wide PGS from Khera 2018, scored on my whole family WGS plus a
498-sample 1000 Genomes EUR reference. I should note that 1KG EUR is a
convenience reference, not the definitive one. It is adequate for a
methodological demonstration, but the percentiles are reference-relative
and ancestry-sensitive. A larger, ancestry-matched reference would tighten
these estimates. Each panel is one trait; vertical lines show family member
positions. The most notable pattern is in the top-left panel, which is what
the next slide drills into.
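"Reference-relative" has a precise operational meaning: a percentile here is just one individual's score ranked within the 498 reference scores. A minimal sketch (hypothetical function name, empirical-CDF convention):

```python
from bisect import bisect_left

def prs_percentile(score: float, reference: list[float]) -> float:
    """Percentile of one individual's PRS within a reference
    distribution (e.g. 498 1KG EUR samples), via the empirical CDF."""
    ref = sorted(reference)
    rank = bisect_left(ref, score)   # count of reference scores below
    return 100.0 * rank / len(ref)
```

Swap in a different reference distribution and the percentile moves, which is exactly the ancestry-sensitivity caveat.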
TIMING: 1 min
Drill-down
Coronary Artery Disease: Maternal Line, Top Decile
Mother 99th · Sister 97th · Manuel 94th · Father 52nd · Wife 13th
Three family members cluster at the upper end of the reference distribution. Father at the median. Wife (unrelated, British) at 13th.
100% rank concordance (20/20 family-score comparisons vs 2022 benchmark).
Methodological demonstration, not clinical interpretation. Percentiles are reference-relative and ancestry-sensitive.
Drill-down on CAD.
Mother is at the 99th percentile. Sister 97th. I am 94th. Father at the
median, 52nd. Wife, unrelated and British, 13th.
Three family members cluster at the upper end of the reference distribution
for the same trait, all on the maternal side.
This is consistent with familial aggregation under this scoring framework,
but the important point here is not the trait story. It is that the agent
recovered a stable, interpretable pattern that we can then try to falsify.
This is a methodological demonstration, not a clinical interpretation.
These percentiles are reference-relative and ancestry-sensitive.
TIMING: 1 min
Reliability
Same Family, 18 Different CAD PRS
Mother in family top-2 across 17/18 PRS · Manuel in top-3 across 16/18 · Sister in top-3 across 17/18
18 CAD PRS from PGS Catalog spanning LDpred, LDpred2, PRS-CSx, PRSmix, AnnoPred, P+T, GWS-only · 64 to 6.6M variants · 2018 to 2026
Reliability check. The previous slide showed Mother + both children at the
top decile for CAD using one PRS. But which PRS did I pick? PGS000013, the
Khera 2018 LDpred score. There are 84 CAD PRS in the PGS Catalog as of today.
I scored my whole family against 18 of them, spanning the methodological
landscape: LDpred, LDpred2, PRS-CSx, PRSmix, AnnoPred, P+T, GWS-only —
from 64 variants to 6.6 million, and from 2018 to 2026. The headline:
Mother is in the top 2 of the family across 17 of 18
scores. The relative within-family ordering is largely preserved across
18 CAD scores. This is Tier 2 cohort-grade validation: the conclusion
does not depend on which PRS you pick.
TIMING: 1.5 min
Tier 3 Setup
The Benchmark: Same Family, 2022
Implementation of individualised polygenic risk score analysis: a test case of a family of four
Corpas, Megy, Metastasio, Lehmann
BMC Medical Genomics (2022) 15:207 ·
doi: 10.1186/s12920-022-01331-8
- Same four family members. Same Sentieon VCFs.
- 15 phenotypes, ~37M SNPs, 1000 Genomes IBS/EUR reference.
- Manual pipeline: 95% SNP-overlap filter, no REF-aware imputation.
- Open access, CC-BY 4.0. Ground truth for slide 12.
Four years later, the agent re-scores this cohort from scratch. Next slide: what matched, what shifted, and why.
Before I show the agent-vs-paper comparison, I want to name the paper.
This is Corpas, Megy, Metastasio, Lehmann. BMC Medical Genomics, 2022.
Same four family members, same Sentieon VCFs, fifteen phenotypes, scored
manually against a 1000 Genomes Iberian and European reference.
Two things matter about this paper for the next slide.
One: the methodology is deliberately different from what the agent does
today. The 2022 pipeline used a 95% SNP-overlap filter, which discards
informative positions. The agent uses REF-aware scoring via samtools
faidx. Different missing-variant handling, different reference
construction.
Two: it is open access, CC-BY, and the tables are public. So the
benchmark is verifiable. Anyone in this room can re-run it.
That is the ground truth. Next slide is the comparison.
TIMING: 0.75 min
Validation Tier 3
Agent vs Peer-Reviewed Benchmark
Same 4 saliva WGS samples · same Sentieon VCFs (2021-02-18) · same PGS000013–17 weights · only missing-variant handling + reference differ
| Score | Member | 2022 paper | Agent 2026 | Δ |
|-------|--------|------------|------------|------|
| CAD | Father | 29.6 | 52 | +22 |
| CAD | Mother | 96.6 | 99 | +2 |
| CAD | Sister | 89.9 | 97 | +7 |
| CAD | Manuel | 81.9 | 94 | +12 |
| IBD | Father | 43.5 | 43 | −0.5 |
| IBD | Mother | 70.8 | 70 | −0.8 |
| IBD | Sister | 46.7 | 47 | +0.3 |
| IBD | Manuel | 46.7 | 46 | −0.7 |
20/20 rank orderings preserved · IBD within 1 point · CAD maternal-line top-decile replicates
Absolute percentile shifts are explainable from reference-distribution effects, not reversal of biological signal.
SAY THIS: Same samples, same VCFs, same scoring weights. Only
missing-variant handling and reference distribution differ.
Despite those methodological differences, rank order is preserved across
all 20 family-score comparisons. Absolute differences arise from reference
distributions, not from reversal of biological signal.
STOP THERE.
---
The route to that 20 of 20 was not clean. The agent's first attempt at
scoring produced CAD percentiles that were off by roughly threefold,
because it treated missing-from-VCF positions as dosage zero rather than
homozygous reference.
Pre-fix, rank concordance with the 2022 benchmark was below 50%. Post-fix,
20 of 20 family-score comparisons preserve rank order.
That bug was caught by comparing against the published benchmark. The fix,
REF-aware scoring via samtools faidx, is what you see now. That failure is
the reason I am arguing for tiered validation, not despite it.
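The bug and the fix differ in a single branch. The sketch below (hypothetical helper, assuming biallelic sites with the genotype given as an ALT-allele count) shows the REF-aware rule: a position absent from the VCF means homozygous reference, so its effect-allele dosage depends on whether the effect allele is the reference allele:

```python
def effect_dosage(genotype, effect_is_ref: bool) -> int:
    """Effect-allele dosage for PRS scoring at a biallelic site.
    `genotype` is the ALT-allele count from the VCF, or None when the
    position is absent. REF-aware handling: absent means homozygous
    reference, so dosage is 2 if the effect allele IS the reference
    allele, else 0. The pre-fix bug returned 0 unconditionally,
    silently dropping every monomorphic-reference contribution."""
    if genotype is None:                       # position not in the VCF
        return 2 if effect_is_ref else 0
    return 2 - genotype if effect_is_ref else genotype
```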
---
ONLY IF ASKED about the larger CAD shifts for Father and Manuel: the
z-score shift is roughly 0.5 SD for everyone. Father sits where the CDF
slope is steep, so the same shift moves more percentile points. IBD
barely moves because its effect weights cancel the monomorphic-position
contribution. That is the cleanest calibration diagnostic.
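That diagnostic can be checked with the standard normal CDF alone: the same +0.5 SD shift is worth roughly 19 percentile points at the median but only about 4 points between z = 1.5 and z = 2.0.

```python
from math import erf, sqrt

def percentile(z: float) -> float:
    """Standard normal CDF expressed as a percentile (0-100)."""
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

# Same +0.5 SD shift, very different percentile movement:
mid_shift = percentile(0.5) - percentile(0.0)   # steep CDF near median
tail_shift = percentile(2.0) - percentile(1.5)  # flat CDF in the tail
```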
TIMING: 1.5 min
Framework
Tiered Validation for Agentic Genomics
Tier 1
Research-grade
Analytically derived ground truth.
e.g. KING kinship.
Tier 2
Cohort-grade
Population distributions + QC.
e.g. 1KG EUR percentiles.
Tier 3
Clinical-grade (aspirational)
Orthogonal gold standards.
e.g. NIST GIAB, CAP/CLIA.
The contribution here is not the agent. It is a framework for evaluating agent-generated genomic results.
Corpas, Fatumo & Guio. Cell Genomics (in revision, 2026)
Tiered validation framework. Tier 1: analytically derived ground truth, like
KING kinship which has a closed-form expectation of 0.25 for first-degree
relatives. Tier 2: cohort-grade, comparing against population distributions
like the 1000 Genomes EUR reference. Tier 3 is the level we would need for
clinical-grade trust, using orthogonal standards like NIST Genome in a Bottle
or CAP/CLIA proficiency panels. That tier is aspirational infrastructure,
not current clinical equivalence. The contribution here is not the agent.
It is a framework for evaluating agent-generated genomic results. Agentic
genomics ships Tier 1 for free and makes Tier 2 routine. Tier 3 is what
we should be building towards.
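Tier 1 is the only tier that can be automated with no external data at all: a validation hook needs nothing but the closed-form expectation. A sketch, with an illustrative tolerance (a hypothetical choice, roughly the half-width of KING's first-degree band, not a published default):

```python
def tier1_kinship_check(phi_hat: float, expected: float = 0.25,
                        tol: float = 0.088) -> bool:
    """Tier 1 hook: compare an estimated kinship coefficient against
    its analytic expectation, e.g. E[phi] = 0.25 for first-degree
    pairs. Tolerance is illustrative, not a KING default."""
    return abs(phi_hat - expected) <= tol
```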
TIMING: 1 min
Recap
For Many Analyses, the Bottleneck Is Moving
Yesterday: Pipeline construction
- Conda hell
- Format wrangling
- Dependency conflicts
- Two weeks of plumbing for one analysis
Today: Validation
- Reproducibility: is every run logged and rerunnable?
- Clinical safety: can we trust the answer?
- Equity: does it transfer across populations?
- Domain expertise: what must a human still verify?
Agents do not remove the need for domain expertise. They change where it is applied.
This is a single controlled case. The open question is how broadly this holds.
Current failure modes that require human intervention (not edge cases; routine in real-world data):
Ambiguous metadata
Ancestry mismatch
Low-quality VCFs
Incompatible builds
Corpas, Fatumo & Guio. Cell Genomics (in revision, 2026)
For a growing class of routine genomic analyses, particularly those
involving standardised pipelines on well-characterised data, orchestration
is becoming cheaper. That makes validation, calibration, and provenance more
important. Agents do not remove the need for domain expertise. They change
where domain expertise is applied. The validation challenge has four
dimensions: reproducibility, clinical safety, equity across populations,
and the role of domain expertise. Each dimension needs infrastructure that
does not yet exist at scale. That is increasingly where this department and
the broader stat gen community come in. This is a single controlled case.
The open question is how broadly this holds. I should also be explicit about
where this currently breaks. Ambiguous metadata, ancestry mismatch between
sample and reference, low-quality VCFs, and incompatible genome builds are
all current failure modes. These are not edge cases. These are routine in
real-world data. They require human intervention.
TIMING: 1 min
Where I'd Love SCGG's Help
👪
Primary: TwinsUK validation
Test whether this framework is genuinely useful on a cohort KCL trusts.
🧬
Wrap a method as a skill
If any SCGG method owners want to wrap a method, I would welcome that.
🏆
Co-build a benchmark
If there is appetite, we can discuss a benchmark effort.
Any collaboration has to be scientifically fair and visibly reciprocal.
github.com/ClawBio/ClawBio
docs.clawbio.ai
linkedin.com/in/manuelcorpas
Test it on data you trust.
TwinsUK would be an obvious place to start. Thank you.
Test it on data you trust. [PAUSE] TwinsUK would be an obvious place to
start. The cards on screen show the other options if people are interested,
but the core invitation is that simple. Any collaboration has to be
scientifically fair and visibly reciprocal. Thank you.
TIMING: 1 min
Supplementary
Learn More
docs.clawbio.ai
- Tutorials — build your first skill in 10 min
- Skill reference for all 48 skills
- Past presentations & talks
- Contributing guide & SKILL.md spec
- Hackathon materials
Supplementary slide. Everything you need to get started is at
docs.clawbio.ai. Tutorials walk you through building your first skill in
10 minutes. The skill reference covers all 48 skills. Past presentations
and talks are archived. The contributing guide and SKILL.md specification
explain how to ship your own skill.
Supplementary
Next Hackathon: 23 April, London
AI Agents for Health
ClawBio Hackathon
- Thursday 23 April 2026
- University of Westminster, 115 New Cavendish St
- Build new ClawBio skills, extend existing ones
- Genomics, pharmacogenomics, digital health
- Beginners welcome — ClawBio & CS students
luma.com/8qtu0xaz
Supplementary slide. The next ClawBio hackathon is in 8 days, on
Thursday 23 April, at the University of Westminster on New Cavendish
Street. Build new skills, extend existing ones, work in teams.
Topics range from genomics to pharmacogenomics to digital health.
Beginners are explicitly welcome. Scan the QR or visit luma.com/8qtu0xaz.
Q&A Backup
Q: So what do you tell Mother?
- 99th percentile CAD PRS → ~3x baseline lifetime risk (Khera 2018 odds ratios)
- This is a risk modifier, not a diagnosis
- Appropriate action: enhanced cardiovascular screening, lipid panel, discussion with GP, standard primary prevention assessment
- NOT appropriate: statin prescription or lifestyle change based on PRS alone
- PRS integrates with family history, blood pressure, lipids, smoking, lifestyle — it's one variable in a multivariable risk model
- Requires Tier 3 clinical-grade validation before any individual-level clinical action — this is exactly the boundary the agent must not cross