@manuelcorpas
Agentic Genomics:
From Pipeline Automation to
Autonomous Validation
Manuel Corpas
Senior Lecturer, Research Centre for Optimal Health, University of Westminster
MSc AI + Digital Health Course Director · Creator of ClawBio
SCGG Away Day · King's College London · 15 April 2026
Thank you. I'm Manuel Corpas, Senior Lecturer at Westminster and creator of
ClawBio, an open-source library of agentic AI skills for bioinformatics.
Today I want to make a narrower claim than "AI changes genomics". My claim
is that for some genomic analyses, orchestration has become cheap enough
that validation is now the binding constraint. I'm going to show you
what this looks like on real data from the genomes of my own family, and propose a
falsifiable validation framework I think this department could help build.
TIMING: 1 min
The problem
I Had a Folder. I Didn't Remember Which Was Who.
10 WGS genomes sitting untouched for 2 years.
- I knew my own genome and those of my father, my mother, and my sister were in there
- I had another reference sample of my own (30x Dante Labs)
- I asked the agent: "can you tell me which is which?"
- Within a constrained bioinformatics environment. No scripts. No flags.
The agent downloaded KING,
inspected each folder,
ran a kinship analysis,
and gave me the answer with confidence scores.
No genome assembly wrangling. No install debugging. No format conversions. No hand-holding.
This is where this talk started. I had a folder on my external drive with 10
whole genomes. I hadn't touched it in two years. I knew four of those samples
were my family: me, my dad, my mum, and my sister. But I didn't remember
which PT ID was which. The other six came from other pilot-study participants
at Cambridge Precision Medicine, a company I was involved in.
I asked the agent to resolve identity within a constrained bioinformatics
environment. It downloaded KING, inspected the contents of each folder,
figured out the genome assembly, ran a kinship analysis, and came back with the
correct family structure with confidence scores. I didn't have to worry about
installation, configuration, reference genome version, or any of the usual
bioinformatics overhead. That is agentic genomics. That is the shift I want
to talk about today.
TIMING: 1.5 min
Definition
What I Mean by "Agentic Genomics"
Autonomous agents making runtime decisions
inside domain-constrained skill libraries.
Pipeline
Fixed DAG, pre-specified.
Every step known before execution.
Agent
Runtime decision points under constraints.
Steps chosen during execution.
Example: the agent chose KING over PLINK for kinship because it inspected the VCF headers, found no PLINK-compatible .bed files, and selected the tool that could work directly on VCF input. A static pipeline cannot make that decision.
Corpas, Fatumo & Guio. Cell Genomics (in revision, 2026)
Precise definition. Agentic genomics is autonomous agents making runtime
decisions inside domain-constrained skill libraries. The key distinction
is operational: a pipeline is a fixed DAG where every step is pre-specified.
An agent has runtime decision points under constraints. Here is a concrete
example: when I asked the agent to resolve sample identity, it inspected
the VCF headers, found no PLINK-compatible .bed files, and chose KING
because KING can work directly on VCF input. A static pipeline cannot make
that decision. That is what "runtime" means in practice.
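That runtime decision can be sketched as a small format-inspection rule. The function below is purely illustrative (hypothetical names, not ClawBio's actual skill code), assuming only that tool choice follows from which genotype formats are actually present:

```python
def pick_kinship_tool(filenames: list[str]) -> str:
    """Choose a kinship tool from the file formats actually present.
    Illustrative runtime decision, not ClawBio's real implementation."""
    exts = {name.split(".", 1)[-1].lower() for name in filenames}
    if {"bed", "bim", "fam"} <= exts:
        return "plink"   # full PLINK binary fileset available
    if any(ext.startswith("vcf") for ext in exts):
        return "king"    # per the talk, KING was run directly on VCF input
    raise ValueError("no recognised genotype files found")
```

A fixed DAG encodes exactly one branch of this function in advance; the agent evaluates it at execution time against whatever data it finds.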
TIMING: 1 min
The Library
🦖 ClawBio
Open-source bioinformatics infrastructure for agentic analysis
🔒
Constrained
skill library
📜
Full execution
provenance log
Genetic data never leaves your machine. 48 skills, 16+ contributors, fully open source.
The key properties for this talk are the provenance log and validation hooks. Without those, none of the results I show would be auditable.
Scan to open the repo
github.com/ClawBio/ClawBio
docs.clawbio.ai
Corpas M. Bioinformatics (in revision, 2026)
ClawBio is the library I'll demo from. It is designed as infrastructure
for science, not a product. Five properties matter: a constrained skill
library so the agent cannot do arbitrary things; runtime planning so it
chooses tools at execution time; local-first execution so genetic data
never leaves your machine; a full provenance log so every step is
auditable; and validation hooks so outputs can be checked automatically.
48 skills, 16+ contributors, fully open source. The key properties for
this talk are the provenance log and validation hooks. Without those, none
of the results I show would be auditable.
TIMING: 0.5 min
Stat Gen Toolkit
Skills Relevant to This Room
How skills compose in a single conversation:
👪
Kinship check
KING-robust on chr22. Verify family structure before analysis.
🌍
Ancestry check
PCA against SGDP/1KG. Flag cross-ancestry samples.
🎯
PRS scoring
PGS Catalog scores against 1KG EUR reference distribution.
✅
Concordance check
Compare against prior published results or orthogonal methods.
48 skills on shelf today. The chain above is what this talk demonstrates.
Built by 16+ open-source contributors across 4 continents.
Rather than listing all 48 skills, let me show you the chain this talk
demonstrates. Kinship check first: verify sample identity before doing
anything else. Then ancestry check: flag cross-ancestry samples. Then PRS
scoring against a reference distribution. Then concordance: compare against
prior published results. That chain, from raw VCFs to validated percentiles,
is what you are about to see.
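The validation-first ordering of that chain can be sketched as a fail-fast composition. The skill interface below is hypothetical (ClawBio's real contract lives in its SKILL.md spec); it assumes only that each step returns a pass/fail flag plus a logged summary:

```python
from typing import Callable

# Hypothetical skill signature: samples -> (passed, summary)
Skill = Callable[[list[str]], tuple[bool, str]]

def run_chain(samples: list[str], skills: list[tuple[str, Skill]]):
    """Run skills in order, accumulating a provenance log; stop at the
    first failed check so no analysis runs on unverified samples."""
    provenance = []
    for name, skill in skills:
        passed, summary = skill(samples)
        provenance.append((name, passed, summary))
        if not passed:
            break
    return provenance
```

The design choice is the early `break`: PRS scoring never executes if the kinship or ancestry check fails.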
TIMING: 0.5 min
Case Study
Let Me Show You on Real Data
My family has been part of an open-access pilot WGS cohort for 15 years.
The raw data: 10 whole genomes on an external drive. Sentieon Haplotyper, GRCh37, ~170MB VCF each.
But which sample is which person? The VCFs are labelled PT00001A through PT00010A.
The point of using my family data here is transparency, not generalisability.
This is not yet a general result. It is a controlled demonstration that suggests where systematic evaluation should go next.
Ethics: UNIR PI:029/2020 (2021), open-access CC0 ·
prior publications:
Front Genet 2021,
BMC Med Genomics 2022
I've got a family who has been part of an open-access pilot WGS cohort for over a decade,
the Corpasome project, published under CC0. I have 10 whole genomes on an
external drive right now, but they are labelled PT00001A through PT00010A with
no metadata. The point of using my family data here is transparency, not
generalisability. This is not yet a general result. It is a controlled
demonstration that suggests where systematic evaluation should go next.
The point is not that this analysis is impossible without agents. It is
that the cost of doing it correctly, with full validation, drops enough
that we can afford to do it routinely. Before I run any polygenic scoring,
I want the agent to verify the family structure. Validation first, analysis
second.
TIMING: 0.5 min
Step 1: Validation
Family Verified by KING-Robust Kinship
[Kinship diagram] Wife: PT00004A (British), φ = −0.014.
Parents unrelated to each other (φ = −0.020). PT00003A flagged as likely different ancestry (φ consistently negative).
The agent computed KING-robust kinship coefficients for every pair of samples
in my cohort on chr22 alone. Chr22 is adequate for kinship because KING
needs only a few thousand common variants to estimate relatedness
coefficients; chr22 provides roughly 10,000, well above the minimum for
first-degree resolution. It correctly identified my father, mother, sister,
and confirmed my wife is unrelated. The parents are unrelated to each other,
as expected. And it flagged one sample, PT00003A, as having consistently
negative kinship with everyone, a hallmark of different ancestry, because
KING returns negative values across population boundaries. The result here
is correct relational reconstruction under a logged execution trace.
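The confidence behind those calls rests on fixed thresholds on φ. The classifier below uses the published KING defaults (degree bands at powers of 2^(-k/2), rounded as in KING's documentation); it is a sketch of the convention, not KING's code:

```python
def classify_kinship(phi: float) -> str:
    """Map a KING-robust kinship coefficient to a relationship degree
    using the published KING threshold convention. Negative phi is a
    hallmark of cross-ancestry pairs, as with the PT00003A flag."""
    if phi > 0.354:
        return "duplicate/MZ twin"
    if phi > 0.177:
        return "first-degree"   # parent-offspring or full sib, E[phi] = 0.25
    if phi > 0.0884:
        return "second-degree"
    if phi > 0.0442:
        return "third-degree"
    return "unrelated"          # includes negative phi (cross-ancestry)
```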
TIMING: 1.5 min
Family PRS — Khera 2018
5 Genome-Wide PGS · Family vs 1KG EUR
Khera 2018 genome-wide PGS (CAD: 6.6M variants) · 1KG EUR (n=498, CEU/GBR/IBS/FIN/TSI) · 7.5 min local compute
Five genome-wide PGS from Khera 2018, scored on my whole family WGS plus a
498-sample 1000 Genomes EUR reference. I should note that 1KG EUR is a
convenience reference, not the definitive one. It is adequate for a
methodological demonstration, but the percentiles are reference-relative
and ancestry-sensitive. A larger, ancestry-matched reference would tighten
these estimates. Each panel is one trait; vertical lines show family member
positions. The most notable pattern is in the top-left panel, which is what
the next slide drills into.
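"Reference-relative" has a precise operational meaning: a percentile here is just one individual's score ranked within the 498 reference scores. A minimal sketch (hypothetical function name, empirical-CDF convention):

```python
from bisect import bisect_left

def prs_percentile(score: float, reference: list[float]) -> float:
    """Percentile of one individual's PRS within a reference
    distribution (e.g. 498 1KG EUR samples), via the empirical CDF."""
    ref = sorted(reference)
    rank = bisect_left(ref, score)   # count of reference scores below
    return 100.0 * rank / len(ref)
```

Swap in a different reference distribution and the percentile moves, which is exactly the ancestry-sensitivity caveat.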
TIMING: 1 min
Drill-down
Coronary Artery Disease: Maternal Line, Top Decile
Mother 99th · Sister 97th · Manuel 94th · Father 52nd · Wife 13th
Three family members cluster at the upper end of the reference distribution. Father at the median. Wife (unrelated, British) at 13th.
100% rank concordance (20/20 family-score comparisons vs 2022 benchmark).
Methodological demonstration, not clinical interpretation. Percentiles are reference-relative and ancestry-sensitive.
Drill-down on CAD.
Mother is at the 99th percentile. Sister 97th. I am 94th. Father at the
median, 52nd. Wife, unrelated and British, 13th.
Three family members cluster at the upper end of the reference distribution
for the same trait, all on the maternal side.
This is consistent with familial aggregation under this scoring framework,
but the important point here is not the trait story. It is that the agent
recovered a stable, interpretable pattern that we can then try to falsify.
This is a methodological demonstration, not a clinical interpretation.
These percentiles are reference-relative and ancestry-sensitive.
TIMING: 1 min
Reliability
Same Family, 18 Different CAD PRS
Mother in family top-2 across 17/18 PRS · Manuel in top-3 across 16/18 · Sister in top-3 across 17/18
18 CAD PRS from PGS Catalog spanning LDpred, LDpred2, PRS-CSx, PRSmix, AnnoPred, P+T, GWS-only · 64 to 6.6M variants · 2018 to 2026
Reliability check. The previous slide showed Mother + both children at the
top decile for CAD using one PRS. But which PRS did I pick? PGS000013, the
Khera 2018 LDpred score. There are 84 CAD PRS in the PGS Catalog as of today.
I scored my whole family against 18 of them, spanning the methodological
landscape: LDpred, LDpred2, PRS-CSx, PRSmix, AnnoPred, P+T, GWS-only —
from 64 variants to 6.6 million, and from 2018 to 2026. The headline:
Mother is in the top 2 of the family across 17 of 18
scores. The relative within-family ordering is largely preserved across
18 CAD scores. This is Tier 2 cohort-grade validation: the conclusion
does not depend on which PRS you pick.
TIMING: 1.5 min
Tier 3 Setup
The Benchmark: Same Family, 2022
Implementation of individualised polygenic risk score analysis: a test case of a family of four
Corpas, Megy, Metastasio, Lehmann
BMC Medical Genomics (2022) 15:207 ·
doi: 10.1186/s12920-022-01331-8
- Same four family members. Same Sentieon VCFs.
- 15 phenotypes, ~37M SNPs, 1000 Genomes IBS/EUR reference.
- Manual pipeline: 95% SNP-overlap filter, no REF-aware imputation.
- Open access, CC-BY 4.0. Ground truth for slide 12.
Four years later, the agent re-scores this cohort from scratch. Next slide: what matched, what shifted, and why.
Before I show the agent-vs-paper comparison, I want to name the paper.
This is Corpas, Megy, Metastasio, Lehmann. BMC Medical Genomics, 2022.
Same four family members, same Sentieon VCFs, fifteen phenotypes, scored
manually against a 1000 Genomes Iberian and European reference.
Two things matter about this paper for the next slide.
One: the methodology is deliberately different from what the agent does
today. The 2022 pipeline used a 95% SNP-overlap filter, which discards
informative positions. The agent uses REF-aware scoring via samtools
faidx. Different missing-variant handling, different reference
construction.
Two: it is open access, CC-BY, and the tables are public. So the
benchmark is verifiable. Anyone in this room can re-run it.
That is the ground truth. Next slide is the comparison.
TIMING: 0.75 min
Validation Tier 3
Agent vs Peer-Reviewed Benchmark
Same 4 saliva WGS samples · same Sentieon VCFs (2021-02-18) · same PGS000013–17 weights · only missing-variant handling + reference differ
| Score | Member | 2022 paper | Agent 2026 | Δ |
|-------|--------|------------|------------|------|
| CAD | Father | 29.6 | 52 | +22 |
| CAD | Mother | 96.6 | 99 | +2 |
| CAD | Sister | 89.9 | 97 | +7 |
| CAD | Manuel | 81.9 | 94 | +12 |
| IBD | Father | 43.5 | 43 | −0.5 |
| IBD | Mother | 70.8 | 70 | −0.8 |
| IBD | Sister | 46.7 | 47 | +0.3 |
| IBD | Manuel | 46.7 | 46 | −0.7 |
20/20 rank orderings preserved · IBD within 1 point · CAD maternal-line top-decile replicates
Absolute percentile shifts are explainable from reference-distribution effects, not reversal of biological signal.
SAY THIS: Same samples, same VCFs, same scoring weights. Only
missing-variant handling and reference distribution differ.
Despite those methodological differences, rank order is preserved across
all 20 family-score comparisons. Absolute differences arise from reference
distributions, not from reversal of biological signal.
STOP THERE.
---
The route to that 20 of 20 was not clean. The agent's first attempt at
scoring produced CAD percentiles that were off by roughly threefold,
because it treated missing-from-VCF positions as dosage zero rather than
homozygous reference.
Pre-fix, rank concordance with the 2022 benchmark was below 50%. Post-fix,
20 of 20 family-score comparisons preserve rank order.
That bug was caught by comparing against the published benchmark. The fix,
REF-aware scoring via samtools faidx, is what you see now. That failure is
the reason I am arguing for tiered validation, not despite it.
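The bug and the fix differ in a single branch. The sketch below (hypothetical helper, assuming biallelic sites with the genotype given as an ALT-allele count) shows the REF-aware rule: a position absent from the VCF means homozygous reference, so its effect-allele dosage depends on whether the effect allele is the reference allele:

```python
def effect_dosage(genotype, effect_is_ref: bool) -> int:
    """Effect-allele dosage for PRS scoring at a biallelic site.
    `genotype` is the ALT-allele count from the VCF, or None when the
    position is absent. REF-aware handling: absent means homozygous
    reference, so dosage is 2 if the effect allele IS the reference
    allele, else 0. The pre-fix bug returned 0 unconditionally,
    silently dropping every monomorphic-reference contribution."""
    if genotype is None:                       # position not in the VCF
        return 2 if effect_is_ref else 0
    return 2 - genotype if effect_is_ref else genotype
```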
---
ONLY IF ASKED about the larger CAD shifts for Father and Manuel: the
z-score shift is roughly 0.5 SD for everyone. Father sits where the CDF
slope is steep, so the same shift moves more percentile points. IBD
barely moves because its effect weights cancel the monomorphic-position
contribution. That is the cleanest calibration diagnostic.
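That diagnostic can be checked with the standard normal CDF alone: the same +0.5 SD shift is worth roughly 19 percentile points at the median but only about 4 points between z = 1.5 and z = 2.0.

```python
from math import erf, sqrt

def percentile(z: float) -> float:
    """Standard normal CDF expressed as a percentile (0-100)."""
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

# Same +0.5 SD shift, very different percentile movement:
mid_shift = percentile(0.5) - percentile(0.0)   # steep CDF near median
tail_shift = percentile(2.0) - percentile(1.5)  # flat CDF in the tail
```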
TIMING: 1.5 min
Framework
Tiered Validation for Agentic Genomics
Tier 1
Research-grade
Analytically derived ground truth.
e.g. KING kinship.
Tier 2
Cohort-grade
Population distributions + QC.
e.g. 1KG EUR percentiles.
Tier 3
Clinical-grade (aspirational)
Orthogonal gold standards.
e.g. NIST GIAB, CAP/CLIA.
The contribution here is not the agent. It is a framework for evaluating agent-generated genomic results.
Corpas, Fatumo & Guio. Cell Genomics (in revision, 2026)
Tiered validation framework. Tier 1: analytically derived ground truth, like
KING kinship which has a closed-form expectation of 0.25 for first-degree
relatives. Tier 2: cohort-grade, comparing against population distributions
like the 1000 Genomes EUR reference. Tier 3 is the level we would need for
clinical-grade trust, using orthogonal standards like NIST Genome in a Bottle
or CAP/CLIA proficiency panels. That tier is aspirational infrastructure,
not current clinical equivalence. The contribution here is not the agent.
It is a framework for evaluating agent-generated genomic results. Agentic
genomics ships Tier 1 for free and makes Tier 2 routine. Tier 3 is what
we should be building towards.
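Tier 1 is the only tier that can be automated with no external data at all: a validation hook needs nothing but the closed-form expectation. A sketch, with an illustrative tolerance (a hypothetical choice, roughly the half-width of KING's first-degree band, not a published default):

```python
def tier1_kinship_check(phi_hat: float, expected: float = 0.25,
                        tol: float = 0.088) -> bool:
    """Tier 1 hook: compare an estimated kinship coefficient against
    its analytic expectation, e.g. E[phi] = 0.25 for first-degree
    pairs. Tolerance is illustrative, not a KING default."""
    return abs(phi_hat - expected) <= tol
```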
TIMING: 1 min
Recap
For Many Analyses, the Bottleneck Is Moving
Yesterday: Pipeline construction
- Conda hell
- Format wrangling
- Dependency conflicts
- Two weeks of plumbing for one analysis
Today: Validation
- Reproducibility: is every run logged and rerunnable?
- Clinical safety: can we trust the answer?
- Equity: does it transfer across populations?
- Domain expertise: what must a human still verify?
Agents do not remove the need for domain expertise. They change where it is applied.
This is a single controlled case. The open question is how broadly this holds.
Current failure modes that require human intervention (not edge cases; routine in real-world data):
Ambiguous metadata
Ancestry mismatch
Low-quality VCFs
Incompatible builds
Corpas, Fatumo & Guio. Cell Genomics (in revision, 2026)
For a growing class of routine genomic analyses, particularly those
involving standardised pipelines on well-characterised data, orchestration
is becoming cheaper. That makes validation, calibration, and provenance more
important. Agents do not remove the need for domain expertise. They change
where domain expertise is applied. The validation challenge has four
dimensions: reproducibility, clinical safety, equity across populations,
and the role of domain expertise. Each dimension needs infrastructure that
does not yet exist at scale. That is increasingly where this department and
the broader stat gen community come in. This is a single controlled case.
The open question is how broadly this holds. I should also be explicit about
where this currently breaks. Ambiguous metadata, ancestry mismatch between
sample and reference, low-quality VCFs, and incompatible genome builds are
all current failure modes. These are not edge cases. These are routine in
real-world data. They require human intervention.
TIMING: 1 min
Where I'd Love SCGG's Help
👪
Primary: TwinsUK validation
Test whether this framework is genuinely useful on a cohort KCL trusts.
🧬
Wrap a method as a skill
If any SCGG method owners want to wrap a method, I would welcome that.
🏆
Co-build a benchmark
If there is appetite, we can discuss a benchmark effort.
Any collaboration has to be scientifically fair and visibly reciprocal.
github.com/ClawBio/ClawBio
docs.clawbio.ai
linkedin.com/in/manuelcorpas
Test it on data you trust.
TwinsUK would be an obvious place to start. Thank you.
Test it on data you trust. [PAUSE] TwinsUK would be an obvious place to
start. The cards on screen show the other options if people are interested,
but the core invitation is that simple. Any collaboration has to be
scientifically fair and visibly reciprocal. Thank you.
TIMING: 1 min
Supplementary
Learn More
docs.clawbio.ai
- Tutorials — build your first skill in 10 min
- Skill reference for all 48 skills
- Past presentations & talks
- Contributing guide & SKILL.md spec
- Hackathon materials
Supplementary slide. Everything you need to get started is at
docs.clawbio.ai. Tutorials walk you through building your first skill in
10 minutes. The skill reference covers all 48 skills. Past presentations
and talks are archived. The contributing guide and SKILL.md specification
explain how to ship your own skill.
Supplementary
Next Hackathon: 23 April, London
AI Agents for Health
ClawBio Hackathon
- Thursday 23 April 2026
- University of Westminster, 115 New Cavendish St
- Build new ClawBio skills, extend existing ones
- Genomics, pharmacogenomics, digital health
- Beginners welcome — ClawBio & CS students
luma.com/8qtu0xaz
Supplementary slide. The next ClawBio hackathon is in 8 days, on
Thursday 23 April, at the University of Westminster on New Cavendish
Street. Build new skills, extend existing ones, work in teams.
Topics range from genomics to pharmacogenomics to digital health.
Beginners are explicitly welcome. Scan the QR or visit luma.com/8qtu0xaz.
Q&A Backup
Q: So what do you tell Mother?
- 99th percentile CAD PRS → ~3x baseline lifetime risk (Khera 2018 odds ratios)
- This is a risk modifier, not a diagnosis
- Appropriate action: enhanced cardiovascular screening, lipid panel, discussion with GP, standard primary prevention assessment
- NOT appropriate: statin prescription or lifestyle change based on PRS alone
- PRS integrates with family history, blood pressure, lipids, smoking, lifestyle — it's one variable in a multivariable risk model
- Requires Tier 3 clinical-grade validation before any individual-level clinical action — this is exactly the boundary the agent must not cross